Architecture¶

Overview¶

GAZE is organized around a multi-turn agentic loop where a VLM reasons over images using tools and structured JSON output.

                       +------------------------+
                       |  AgenticProcessorBase  |
                       |     (your subclass)    |
                       +-----------+------------+
                                   |
                       +-----------v------------+
                       |      Agentic loop      |
                       |   (multi-turn, JSON)   |
                       +-----------+------------+
                                   |
            +----------------------+----------------------+
            |                      |                      |
   +--------v---------+   +--------v---------+   +--------v---------+
   |  Model adapter   |   |  Tool registry   |   |     Prompts      |
   | (OpenAI, LM      |   |     (visual,     |   |    (Jinja2)      |
   |  Studio, HF)     |   |      search)     |   |                  |
   +------------------+   +--------+---------+   +------------------+
                                   |
                       +-----------+-----------+
                       |                       |
              +--------v---------+   +---------v--------+
              |  Image manager   |   |    Retrieval     |
              |    (PIL ops)     |   |    (PubMed,      |
              |                  |   |     Open-i)      |
              +------------------+   +------------------+

Core components¶

AgenticProcessorBase (`base.py`)¶

The abstract base class that drives the agentic loop. Subclasses implement:

get_system_prompt() -- system message for the model
get_user_message() -- user message with task context
get_response_schema() -- JSON schema for structured output
validate_response() -- validates the parsed response

The loop runs until the model returns "continue": false or max_turns is reached.

Types (`types.py`)¶

All data types are frozen (immutable):

ToolCall -- a tool invocation with name and arguments
ToolResult -- tool execution output
Turn -- one round of model response + tool calls
AgenticResult -- the complete multi-turn result

Configuration (`config.py`)¶

Frozen dataclasses for all configuration:

GazeConfig -- top-level config
SearchConfig -- PubMed/Open-i search parameters
ImageProcessingConfig -- image manipulation settings

Use config_context() for task-scoped overrides via ContextVar.

Model adapters (`models/`)¶

All adapters implement AdapterProtocol:

OpenAIAdapter -- OpenAI API and OpenRouter
LMStudioAdapter -- local models via LM Studio (subclasses OpenAIAdapter, no retries, HTTP allowed)
HuggingFaceAdapter -- direct Transformers inference (lazy-imported to avoid torch dependency)

Tool system (`tools/`)¶

ToolRegistry -- manages tool lifecycle and execution
Visual tools (23) -- image manipulation via PIL
Search tools (2) -- PubMed and Open-i literature search

Retrieval (`retrieval/`)¶

PubMed search -- NCBI E-utilities API
Open-i image search -- medical image retrieval

Verifiers integration (`verifiers/`)¶

For RL training with verifiable rewards:

BaseMultiTurnEnv -- multi-turn RL environment
Reward functions -- ExactMatchReward, TokenF1Reward, IoUReward, CombinedReward
GazeAdapter -- bridges processors and verifiers

Design principles¶

Frozen data -- all types and configs are immutable. Use deep_freeze()/deep_thaw() for nested structures.
beartype validation -- runtime type checking on all public APIs.
Specific exceptions -- GazeError hierarchy, never bare except:.
Async-first -- all I/O operations are async.
Config isolation -- config_context() uses ContextVar for thread/task-safe overrides.

How the agentic loop works¶

The loop lives in AgenticProcessorBase._run_analysis() (base.py). The description below tracks that implementation; the constants named in parentheses are module-level defaults in base.py.

Turn lifecycle¶

Each analyze() call normalizes and loads the input images, optionally downscales them (max_encode_dimension), builds a tool registry (multi-turn only), then iterates up to max_turns times. On every turn GAZE:

Sends the running message list to the adapter via generate_chat().
Parses the response into either tool calls or a JSON object.
Executes any tool calls and appends their results, or accepts the JSON as the final answer when continue is false.

The default turn budget is 10 (_DEFAULT_MAX_TURNS) and the hard ceiling is 30 (_MAX_TURNS_LIMIT); larger values are clamped with a warning.

The `continue` flag¶

In multi-turn mode the system prompt instructs the model to return a boolean continue field every turn: true to request more tools or analysis, false to finalize. Local models often omit the field when they mean "false", so GAZE injects continue: false whenever the key is absent. It also coerces non-boolean values: null becomes false, 0/1 become booleans, and the strings "true"/"false"/"yes"/"no" are normalized. A value that cannot be coerced raises AgenticProcessingError.

Single-turn vs multi-turn¶

When max_turns == 1 the behaviour differs in several ways:

No POLICY block is appended to the system prompt (the multi-turn turn budget and continue instructions are skipped).
No tool registry is created and no tools are offered, because the single turn is by definition the final turn and the final turn always withholds tools.
The response schema is injected into the system prompt as a JSON skeleton with field descriptions, so models that ignore response_format (notably local models) still see the expected output shape.

In multi-turn mode the schema skeleton is instead used for recovery nudges and the force-finalize message.

Tool-result handling¶

Tool calls and response_format are not requested together: many providers return empty content when both are present, so response_format is only enforced on turns where tools are withheld (including the final turn). Tool results are sanitized before being fed back to the model: control characters are stripped, content is truncated to 8000 characters (_MAX_TOOL_CONTENT_CHARS), and each result is wrapped in a randomized untrusted-content boundary to limit prompt injection from external data such as PubMed abstracts.

Image-mutating tools (requires_image=True) run sequentially to preserve the shared ImageManager state, while independent tools (such as search) run concurrently alongside them. Adapters that cannot accept image payloads in tool messages (supports_multipart_tool_content == False, e.g. LMStudioAdapter) receive a text description plus a note that the image could not be displayed.

GAZE tracks whether coordinate-modifying or intensity-modifying tools have run (see the _COORD_MODIFYING_TOOLS and _INTENSITY_MODIFYING_TOOLS sets, documented in Tools). On the final turn it re-attaches the original image and warns the model that bounding boxes from transformed views are invalid and that intensity measurements from modified images do not reflect original tissue. A successful reset clears both flags.

Error recovery¶

The loop is built to keep going when a model misbehaves rather than crashing:

Nudges. Empty responses, non-JSON text, non-object JSON, truncated output, and responses that fail validate_response() each trigger a corrective user message that restates the required structure. After _MAX_CONSECUTIVE_NUDGES (2) consecutive failures GAZE escalates to a stricter force-finalize message. If nudges still do not recover after _MAX_RECOVERY_NUDGES total attempts, it raises AgenticProcessingError. A turn with valid JSON or successful tool calls resets the nudge counter.
Idle-tool force-finalize. In multi-turn mode, if tools are available but the model has made zero tool calls by _IDLE_TOOL_TURNS_LIMIT (3) turns, GAZE force-finalizes to avoid wasting tokens. It nudges once; if the model still does not use tools, it accepts the current response.
Final-turn tool stripping. The last turn never offers tools. If the model still emits tool calls on that turn, GAZE first tries to salvage a valid JSON answer from the accompanying text; failing that, it raises AgenticProcessingError.
Truncation salvage. When the final turn is cut off (finish_reason == "length"), GAZE attempts to extract partial JSON from the text and, if the salvaged keys match a sub-schema rather than the top level, wraps them under the correct parent key before validating.
Stale image stripping. On turns after the first, base64 image data from earlier rounds is replaced with text placeholders to reduce payload size on subsequent API calls.

Architecture¶

Overview¶

Core components¶

AgenticProcessorBase (base.py)¶

Types (types.py)¶

Configuration (config.py)¶

Model adapters (models/)¶

Tool system (tools/)¶

Retrieval (retrieval/)¶

Verifiers integration (verifiers/)¶