Architecture¶
Overview¶
GAZE is organized around a multi-turn agentic loop where a VLM reasons over images using tools and structured JSON output.
+------------------------+
| AgenticProcessorBase |
| (your subclass) |
+-----------+------------+
|
+-----------v------------+
| Agentic loop |
| (multi-turn, JSON) |
+-----------+------------+
|
+----------------------+----------------------+
| | |
+--------v---------+ +--------v---------+ +--------v---------+
| Model adapter | | Tool registry | | Prompts |
| (OpenAI, LM | | (visual, | | (Jinja2) |
| Studio, HF) | | search) | | |
+------------------+ +--------+---------+ +------------------+
|
+-----------+-----------+
| |
+--------v---------+ +---------v--------+
| Image manager | | Retrieval |
| (PIL ops) | | (PubMed, |
| | | Open-i) |
+------------------+ +------------------+
Core components¶
AgenticProcessorBase (base.py)¶
The abstract base class that drives the agentic loop. Subclasses implement:
get_system_prompt()-- system message for the modelget_user_message()-- user message with task contextget_response_schema()-- JSON schema for structured outputvalidate_response()-- validates the parsed response
The loop runs until the model returns "continue": false or max_turns is reached.
Types (types.py)¶
All data types are frozen (immutable):
- ToolCall -- a tool invocation with name and arguments
- ToolResult -- tool execution output
- Turn -- one round of model response + tool calls
- AgenticResult -- the complete multi-turn result
Configuration (config.py)¶
Frozen dataclasses for all configuration:
- GazeConfig -- top-level config
- SearchConfig -- PubMed/Open-i search parameters
- ImageProcessingConfig -- image manipulation settings
Use config_context() for task-scoped overrides via ContextVar.
Model adapters (models/)¶
All adapters implement AdapterProtocol:
- OpenAIAdapter -- OpenAI API and OpenRouter
- LMStudioAdapter -- local models via LM Studio (subclasses OpenAIAdapter, no retries, HTTP allowed)
- HuggingFaceAdapter -- direct Transformers inference (lazy-imported to avoid torch dependency)
Tool system (tools/)¶
- ToolRegistry -- manages tool lifecycle and execution
- Visual tools (23) -- image manipulation via PIL
- Search tools (2) -- PubMed and Open-i literature search
Retrieval (retrieval/)¶
- PubMed search -- NCBI E-utilities API
- Open-i image search -- medical image retrieval
Verifiers integration (verifiers/)¶
For RL training with verifiable rewards:
- BaseMultiTurnEnv -- multi-turn RL environment
- Reward functions -- ExactMatchReward, TokenF1Reward, IoUReward, CombinedReward
- GazeAdapter -- bridges processors and verifiers
Design principles¶
- Frozen data -- all types and configs are immutable. Use
deep_freeze()/deep_thaw()for nested structures. - beartype validation -- runtime type checking on all public APIs.
- Specific exceptions --
GazeErrorhierarchy, never bareexcept:. - Async-first -- all I/O operations are async.
- Config isolation --
config_context()uses ContextVar for thread/task-safe overrides.
How the agentic loop works¶
The loop lives in AgenticProcessorBase._run_analysis() (base.py). The
description below tracks that implementation; the constants named in
parentheses are module-level defaults in base.py.
Turn lifecycle¶
Each analyze() call normalizes and loads the input images, optionally
downscales them (max_encode_dimension), builds a tool registry (multi-turn
only), then iterates up to max_turns times. On every turn GAZE:
- Sends the running message list to the adapter via
generate_chat(). - Parses the response into either tool calls or a JSON object.
- Executes any tool calls and appends their results, or accepts the JSON as
the final answer when
continueis false.
The default turn budget is 10 (_DEFAULT_MAX_TURNS) and the hard ceiling is
30 (_MAX_TURNS_LIMIT); larger values are clamped with a warning.
The continue flag¶
In multi-turn mode the system prompt instructs the model to return a boolean
continue field every turn: true to request more tools or analysis,
false to finalize. Local models often omit the field when they mean
"false", so GAZE injects continue: false whenever the key is
absent. It also coerces non-boolean values: null becomes false, 0/1
become booleans, and the strings "true"/"false"/"yes"/"no" are
normalized. A value that cannot be coerced raises AgenticProcessingError.
Single-turn vs multi-turn¶
When max_turns == 1 the behaviour differs in several ways:
- No
POLICYblock is appended to the system prompt (the multi-turn turn budget andcontinueinstructions are skipped). - No tool registry is created and no tools are offered, because the single turn is by definition the final turn and the final turn always withholds tools.
- The response schema is injected into the system prompt as a JSON skeleton
with field descriptions, so models that ignore
response_format(notably local models) still see the expected output shape.
In multi-turn mode the schema skeleton is instead used for recovery nudges and the force-finalize message.
Tool-result handling¶
Tool calls and response_format are not requested together: many providers
return empty content when both are present, so response_format is only
enforced on turns where tools are withheld (including the final turn). Tool
results are sanitized before being fed back to the model: control characters
are stripped, content is truncated to 8000 characters (_MAX_TOOL_CONTENT_CHARS),
and each result is wrapped in a randomized untrusted-content boundary to limit
prompt injection from external data such as PubMed abstracts.
Image-mutating tools (requires_image=True) run sequentially to preserve the
shared ImageManager state, while independent tools (such as search) run
concurrently alongside them. Adapters that cannot accept image payloads in
tool messages (supports_multipart_tool_content == False, e.g.
LMStudioAdapter) receive a text description plus a note that the image
could not be displayed.
GAZE tracks whether coordinate-modifying or intensity-modifying tools
have run (see the _COORD_MODIFYING_TOOLS and _INTENSITY_MODIFYING_TOOLS
sets, documented in Tools). On the final turn it re-attaches the
original image and warns the model that bounding boxes from transformed views
are invalid and that intensity measurements from modified images do not
reflect original tissue. A successful reset clears both flags.
Error recovery¶
The loop is built to keep going when a model misbehaves rather than crashing:
- Nudges. Empty responses, non-JSON text, non-object JSON, truncated
output, and responses that fail
validate_response()each trigger a corrective user message that restates the required structure. After_MAX_CONSECUTIVE_NUDGES(2) consecutive failures GAZE escalates to a stricter force-finalize message. If nudges still do not recover after_MAX_RECOVERY_NUDGEStotal attempts, it raisesAgenticProcessingError. A turn with valid JSON or successful tool calls resets the nudge counter. - Idle-tool force-finalize. In multi-turn mode, if tools are available
but the model has made zero tool calls by
_IDLE_TOOL_TURNS_LIMIT(3) turns, GAZE force-finalizes to avoid wasting tokens. It nudges once; if the model still does not use tools, it accepts the current response. - Final-turn tool stripping. The last turn never offers tools. If the
model still emits tool calls on that turn, GAZE first tries to
salvage a valid JSON answer from the accompanying text; failing that, it
raises
AgenticProcessingError. - Truncation salvage. When the final turn is cut off (
finish_reason == "length"), GAZE attempts to extract partial JSON from the text and, if the salvaged keys match a sub-schema rather than the top level, wraps them under the correct parent key before validating. - Stale image stripping. On turns after the first, base64 image data from earlier rounds is replaced with text placeholders to reduce payload size on subsequent API calls.