Verifiers integration

RL reward functions and a multi-turn environment that plug GAZE processors into the verifiers package for training with verifiable rewards. For the end-to-end setup, see Verifiers integration.

verifiers

Verifiers integration utilities for GAZE.

First-class integration with the verifiers package for multi-turn RL training.

Key components:

- VerifiableProcessorMixin: Add verifiers support to any processor
- BaseMultiTurnEnv: Standard multi-turn environment template
- Reward functions: ExactMatch, TokenF1, IoU, Combined

GazeAdapter, BaseMultiTurnEnv, and VerifiableProcessorMixin require the verifiers and datasets packages at runtime. They are lazily imported so that from gaze.verifiers import ExactMatchReward (and the other reward utilities) works without those heavy optional dependencies.
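
The page does not show how the lazy import is implemented. One common way to get this behavior is a module-level __getattr__ hook (PEP 562); the sketch below illustrates that technique only, under the assumption that something similar is used. The demo_pkg module and the _LAZY table are hypothetical, and a standard-library class stands in for the heavy optional dependency.

```python
import importlib
import types

# Hypothetical table: public name -> (module to import on demand, attribute).
# collections.OrderedDict stands in for a class backed by a heavy dependency.
_LAZY = {"OrderedDict": ("collections", "OrderedDict")}


def _lazy_getattr(name: str):
    """Resolve a lazily exported name on first access."""
    try:
        module_name, attr = _LAZY[name]
    except KeyError:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_name), attr)


# Build a throwaway module so the sketch is runnable in a single file.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.eager_value = 1.0             # available immediately, no extra import
demo_pkg.__getattr__ = _lazy_getattr   # called only for names not found normally

lazy_cls = demo_pkg.OrderedDict        # the import happens here, on first access
```

In a real package the hook would be a function named __getattr__ defined in the package's __init__.py, so importing the lightweight names never touches the optional dependencies.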

BaseRewardFunction

Bases: ABC

Base class for reward functions.

Provides a common interface for reward functions that can be used with the verifiers package.
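
As an illustration of this interface, here is a minimal sketch of a custom reward. To stay runnable without GAZE installed, it defines a local stand-in with the same __call__ signature as BaseRewardFunction. The KeywordReward class and the "keywords" key in info are hypothetical and not part of the package.

```python
from abc import ABC, abstractmethod
from typing import Any


class RewardInterface(ABC):
    """Local stand-in mirroring the BaseRewardFunction call signature."""

    @abstractmethod
    def __call__(self, prompt: str, completion: Any, info: dict[str, Any]) -> float:
        raise NotImplementedError


class KeywordReward(RewardInterface):
    """Hypothetical reward: fraction of required keywords found in the completion."""

    def __call__(self, prompt: str, completion: Any, info: dict[str, Any]) -> float:
        keywords = [k.lower() for k in info.get("keywords", [])]
        if not keywords:
            return 0.0  # nothing to check against
        text = str(completion).lower()
        hits = sum(1 for k in keywords if k in text)
        return hits / len(keywords)


reward = KeywordReward()
score = reward(
    "Describe the scan.",
    "Nodule in the left lung",
    {"keywords": ["nodule", "lung", "pleura"]},
)
```

With GAZE installed, such a class would inherit from BaseRewardFunction itself rather than the stand-in.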

Source code in src/gaze/verifiers/rewards.py
class BaseRewardFunction(ABC):
    """Base class for reward functions.

    Provides a common interface for reward functions that can be
    used with the verifiers package.
    """

    @abstractmethod
    def __call__(
        self,
        prompt: str,
        completion: Any,
        info: dict[str, Any],
    ) -> float:
        """Compute reward for a completion.

        Args:
            prompt: The input prompt
            completion: Model completion
            info: Additional information (e.g., ground truth)

        Returns:
            Reward value (typically 0.0 to 1.0)
        """
        raise NotImplementedError

__call__ abstractmethod

__call__(
    prompt: str, completion: Any, info: dict[str, Any]
) -> float

Compute reward for a completion.

Parameters:

- prompt (str, required): The input prompt
- completion (Any, required): Model completion
- info (dict[str, Any], required): Additional information (e.g., ground truth)

Returns:

- float: Reward value (typically 0.0 to 1.0)

Source code in src/gaze/verifiers/rewards.py
@abstractmethod
def __call__(
    self,
    prompt: str,
    completion: Any,
    info: dict[str, Any],
) -> float:
    """Compute reward for a completion.

    Args:
        prompt: The input prompt
        completion: Model completion
        info: Additional information (e.g., ground truth)

    Returns:
        Reward value (typically 0.0 to 1.0)
    """
    raise NotImplementedError

CombinedReward

Bases: BaseRewardFunction

Combine multiple reward functions with weights.

Source code in src/gaze/verifiers/rewards.py
class CombinedReward(BaseRewardFunction):
    """Combine multiple reward functions with weights."""

    def __init__(
        self,
        rewards: list[BaseRewardFunction],
        weights: list[float] | None = None,
        names: list[str] | None = None,
    ) -> None:
        """Initialize combined reward.

        Args:
            rewards: List of reward functions
            weights: List of weights (must sum to 1.0)
            names: Optional names for each reward
        """
        if not rewards:
            raise ValueError("At least one reward function required")

        self.rewards = rewards
        self.weights = weights or [1.0 / len(rewards)] * len(rewards)
        self.names = names or [f"reward_{i}" for i in range(len(rewards))]

        if len(self.weights) != len(self.rewards):
            raise ValueError("Number of weights must match number of rewards")

        if any(w < 0.0 for w in self.weights):
            raise ValueError("CombinedReward: weights must be non-negative")

        total = sum(self.weights)
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"CombinedReward: weights must sum to 1.0, got {total:.4f}")

    def __call__(
        self,
        prompt: str,
        completion: Any,
        info: dict[str, Any],
    ) -> float:
        """Compute combined reward."""
        total_reward = 0.0
        details: dict[str, float] = {}

        for reward, weight, name in zip(self.rewards, self.weights, self.names, strict=True):
            r = reward(prompt, completion, info)
            total_reward += weight * r
            details[name] = r

        logger.debug(f"CombinedReward details: {details}, total={total_reward:.4f}")

        return total_reward

__init__

__init__(
    rewards: list[BaseRewardFunction],
    weights: list[float] | None = None,
    names: list[str] | None = None,
) -> None

Initialize combined reward.

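Parameters:

- rewards (list[BaseRewardFunction], required): List of reward functions; must be non-empty
- weights (list[float] | None, default None): Non-negative weights, one per reward, that must sum to 1.0 (within 1e-6). If omitted, each reward gets the uniform weight 1 / len(rewards)
- names (list[str] | None, default None): Optional names, one per reward, used in the debug log. Defaults to reward_0, reward_1, and so on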
Source code in src/gaze/verifiers/rewards.py
def __init__(
    self,
    rewards: list[BaseRewardFunction],
    weights: list[float] | None = None,
    names: list[str] | None = None,
) -> None:
    """Initialize combined reward.

    Args:
        rewards: List of reward functions
        weights: List of weights (must sum to 1.0)
        names: Optional names for each reward
    """
    if not rewards:
        raise ValueError("At least one reward function required")

    self.rewards = rewards
    self.weights = weights or [1.0 / len(rewards)] * len(rewards)
    self.names = names or [f"reward_{i}" for i in range(len(rewards))]

    if len(self.weights) != len(self.rewards):
        raise ValueError("Number of weights must match number of rewards")

    if any(w < 0.0 for w in self.weights):
        raise ValueError("CombinedReward: weights must be non-negative")

    total = sum(self.weights)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"CombinedReward: weights must sum to 1.0, got {total:.4f}")

__call__

__call__(
    prompt: str, completion: Any, info: dict[str, Any]
) -> float

Compute combined reward.
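
The combined value is a plain weighted sum of the individual rewards. The sketch below reproduces that arithmetic and the weight checks with ordinary functions so it runs without GAZE; the combine helper and the two component rewards (always_one, half) are hypothetical stand-ins.

```python
from typing import Any, Callable

RewardFn = Callable[[str, Any, dict[str, Any]], float]


def combine(rewards: list[RewardFn], weights: list[float]) -> RewardFn:
    """Return a reward that computes sum(weight_i * reward_i)."""
    if len(rewards) != len(weights):
        raise ValueError("Number of weights must match number of rewards")
    if any(w < 0.0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0")

    def combined(prompt: str, completion: Any, info: dict[str, Any]) -> float:
        return sum(w * r(prompt, completion, info) for r, w in zip(rewards, weights))

    return combined


def always_one(prompt: str, completion: Any, info: dict[str, Any]) -> float:
    return 1.0


def half(prompt: str, completion: Any, info: dict[str, Any]) -> float:
    return 0.5


combined = combine([always_one, half], [0.7, 0.3])
total = combined("prompt", "completion", {})  # 0.7 * 1.0 + 0.3 * 0.5
```

Because the weights are non-negative and sum to 1.0, the combined value stays within the range spanned by the component rewards.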

Source code in src/gaze/verifiers/rewards.py
def __call__(
    self,
    prompt: str,
    completion: Any,
    info: dict[str, Any],
) -> float:
    """Compute combined reward."""
    total_reward = 0.0
    details: dict[str, float] = {}

    for reward, weight, name in zip(self.rewards, self.weights, self.names, strict=True):
        r = reward(prompt, completion, info)
        total_reward += weight * r
        details[name] = r

    logger.debug(f"CombinedReward details: {details}, total={total_reward:.4f}")

    return total_reward

ExactMatchReward

Bases: BaseRewardFunction

Exact match reward function.

Rewards 1.0 for exact match, 0.0 otherwise. Supports normalization to handle common variations.
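
To make the normalization concrete, the following sketch mirrors the _normalize and comparison logic from the class source as standalone functions (the function names normalize and exact_match are illustrative). It assumes both inputs are already plain strings, whereas the class first extracts text from the completion.

```python
import re


def normalize(text: str) -> str:
    """Strip wrapping braces/punctuation from both ends, then collapse whitespace."""
    if not text:
        return ""
    text = text.strip("{}().[],;")
    return re.sub(r"\s+", " ", text).strip()


def exact_match(pred: str, ref: str, case_sensitive: bool = False) -> float:
    """Return 1.0 when the normalized strings are equal, else 0.0."""
    pred, ref = normalize(pred), normalize(ref)
    match = pred == ref if case_sensitive else pred.lower() == ref.lower()
    return 1.0 if match else 0.0


wrapped = exact_match("{ Left  lung }.", "left lung")              # wrappers removed
cased = exact_match("Left lung", "left lung", case_sensitive=True)  # case differs
interior = exact_match("left (lung)", "left lung")                  # inner "(" stays
```

Only characters at the two ends of the string are stripped, so punctuation inside the answer still has to match.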

Source code in src/gaze/verifiers/rewards.py
class ExactMatchReward(BaseRewardFunction):
    """Exact match reward function.

    Rewards 1.0 for exact match, 0.0 otherwise.
    Supports normalization to handle common variations.
    """

    def __init__(
        self,
        normalize: bool = True,
        case_sensitive: bool = False,
        strip_braces: bool = True,
    ) -> None:
        """Initialize exact match reward.

        Args:
            normalize: Whether to apply _normalize to both strings before comparing
                (only has an effect when strip_braces is True)
            case_sensitive: If False, compare case-insensitively
            strip_braces: Whether to strip "{}().[],;" from both ends and
                collapse whitespace
        """
        self.normalize = normalize
        self.case_sensitive = case_sensitive
        self.strip_braces = strip_braces

    def __call__(
        self,
        prompt: str,  # noqa: ARG002 - Required by interface
        completion: Any,
        info: dict[str, Any],
    ) -> float:
        """Compute exact match reward."""
        pred = extract_completion_text(completion)
        ref = info.get("gold", info.get("reference", info.get("answer", "")))

        if self.normalize:
            pred = self._normalize(pred)
            ref = self._normalize(ref)

        match = pred == ref if self.case_sensitive else pred.lower() == ref.lower()
        return 1.0 if match else 0.0

    def _normalize(self, text: str) -> str:
        """Normalize text for comparison."""
        if not text:
            return ""

        if self.strip_braces:
            text = text.strip("{}().[],;")
            text = re.sub(r"\s+", " ", text).strip()

        return text

__init__

__init__(
    normalize: bool = True,
    case_sensitive: bool = False,
    strip_braces: bool = True,
) -> None

Initialize exact match reward.

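Parameters:

- normalize (bool, default True): Whether to run _normalize on both strings before comparing. Normalization only has an effect when strip_braces is True; lowercasing is controlled by case_sensitive
- case_sensitive (bool, default False): If False, compare case-insensitively
- strip_braces (bool, default True): Whether to strip the characters {}().[],; from both ends of the text and collapse runs of whitespace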
Source code in src/gaze/verifiers/rewards.py
def __init__(
    self,
    normalize: bool = True,
    case_sensitive: bool = False,
    strip_braces: bool = True,
) -> None:
    """Initialize exact match reward.

    Args:
        normalize: Whether to apply _normalize to both strings before comparing
            (only has an effect when strip_braces is True)
        case_sensitive: If False, compare case-insensitively
        strip_braces: Whether to strip "{}().[],;" from both ends and
            collapse whitespace
    """
    self.normalize = normalize
    self.case_sensitive = case_sensitive
    self.strip_braces = strip_braces

__call__

__call__(
    prompt: str, completion: Any, info: dict[str, Any]
) -> float

Compute exact match reward.

Source code in src/gaze/verifiers/rewards.py
def __call__(
    self,
    prompt: str,  # noqa: ARG002 - Required by interface
    completion: Any,
    info: dict[str, Any],
) -> float:
    """Compute exact match reward."""
    pred = extract_completion_text(completion)
    ref = info.get("gold", info.get("reference", info.get("answer", "")))

    if self.normalize:
        pred = self._normalize(pred)
        ref = self._normalize(ref)

    match = pred == ref if self.case_sensitive else pred.lower() == ref.lower()
    return 1.0 if match else 0.0

IoUReward

Bases: BaseRewardFunction

Intersection over Union (IoU) reward for bounding boxes.

Rewards based on spatial overlap between predicted and reference boxes. Uses continuous IoU values by default to provide smooth gradient signal for RL training. A step-function mode is available for binary rewards.

Includes an optional area penalty to discourage degenerate full-image predictions that trivially overlap any ground-truth box.
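
A small worked sketch of the penalty follows. The penalty formula matches the class source; the iou_xyxy helper is a hypothetical stand-in for compute_iou, which is not shown on this page, and assumes boxes in [x1, y1, x2, y2] corner format with normalized coordinates.

```python
def iou_xyxy(a: list[float], b: list[float]) -> float:
    """IoU of two boxes in [x1, y1, x2, y2] format (hypothetical helper)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def area_penalty(area_ratio: float, start: float = 0.5) -> float:
    """Multiplier: 1.0 up to `start`, then falling linearly to 0.0 at ratio 1.0."""
    if start >= 1.0 or area_ratio <= start:
        return 1.0
    return max(0.0, (1.0 - area_ratio) / (1.0 - start))


ref = [0.25, 0.25, 0.75, 0.75]
full_image = [0.0, 0.0, 1.0, 1.0]

raw = iou_xyxy(full_image, ref)       # overlap alone still earns some reward
penalized = raw * area_penalty(1.0)   # a full-image box is scaled to zero
three_quarters = area_penalty(0.75)   # halfway between start and full image
```

Without the penalty, predicting the whole image would collect a nonzero reward on every sample; with the default start of 0.5 that reward is removed entirely.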

Source code in src/gaze/verifiers/rewards.py
class IoUReward(BaseRewardFunction):
    """Intersection over Union (IoU) reward for bounding boxes.

    Rewards based on spatial overlap between predicted and reference boxes.
    Uses continuous IoU values by default to provide smooth gradient signal
    for RL training. A step-function mode is available for binary rewards.

    Includes an optional area penalty to discourage degenerate full-image
    predictions that trivially overlap any ground-truth box.
    """

    def __init__(
        self,
        iou_threshold: float = 0.5,
        normalized: bool = True,
        continuous: bool = True,
        area_penalty_start: float = 0.5,
    ) -> None:
        """Initialize IoU reward.

        Args:
            iou_threshold: IoU threshold used only in step mode
            normalized: Whether coordinates are normalized [0,1]
            continuous: If True (default), return raw IoU for smooth gradients.
                If False, return 1.0 when IoU >= threshold, else 0.0.
            area_penalty_start: Area ratio above which penalty begins.
                When normalized=True, image area is 1.0. A box covering >50%
                of the image starts getting penalized. Set to 1.0 to disable.
        """
        self.iou_threshold = iou_threshold
        self.normalized = normalized
        self.continuous = continuous
        self.area_penalty_start = area_penalty_start

    def __call__(
        self,
        prompt: str,  # noqa: ARG002 - Required by interface
        completion: Any,
        info: dict[str, Any],
    ) -> float:
        """Compute IoU reward."""
        pred_box = self._extract_bbox(completion)
        ref_box = info.get("bbox", info.get("reference_bbox", []))

        if not pred_box or not ref_box or len(pred_box) < 4 or len(ref_box) < 4:
            return 0.0

        # Convert to float for compute_iou (handles int coords from JSON)
        pred_floats = [float(x) for x in pred_box[:4]]
        ref_floats = [float(x) for x in ref_box[:4]]
        iou = compute_iou(pred_floats, ref_floats)

        reward = iou if self.continuous else (1.0 if iou >= self.iou_threshold else 0.0)

        # Apply area penalty for degenerate full-image boxes.
        # For normalized coords (in [0,1]), image_area = 1.0.
        # For pixel coords, image_area must be supplied via info dict.
        if self.area_penalty_start < 1.0:
            if self.normalized:
                coords_in_range = all(0.0 <= c <= 1.0 for c in pred_floats)
                if coords_in_range:
                    image_area = 1.0
                else:
                    # Coords are pixel-scale despite normalized=True.
                    # We cannot infer image area from the predicted box itself
                    # (that estimate is always wrong for origin-anchored boxes).
                    # Require image_area in info; fail closed (reward=0) if absent
                    # to prevent gaming via coordinate-space mismatch.
                    raw_area = info.get("image_area")
                    if raw_area is not None:
                        image_area = float(raw_area)
                    else:
                        logger.warning(
                            "IoUReward: pixel-space coords detected but no "
                            "image_area in info dict; returning 0.0 (fail closed)"
                        )
                        return 0.0
            else:
                image_area = float(info.get("image_area", 0.0))
            if image_area > 0:
                pred_area = abs(pred_floats[2] - pred_floats[0]) * abs(
                    pred_floats[3] - pred_floats[1]
                )
                area_ratio = pred_area / image_area
                if area_ratio > self.area_penalty_start:
                    penalty = max(0.0, (1.0 - area_ratio) / (1.0 - self.area_penalty_start))
                    reward *= penalty

        return reward

    def _extract_bbox(self, completion: Any) -> list[float]:
        """Extract bounding box from completion.

        Searches all top-level JSON objects in the text and returns the bbox
        from the LAST matching object.  Models often emit reasoning JSON before
        the final structured response, so the last bbox is most likely correct.
        This matches the regex fallback which also takes the last match.
        """
        text = extract_completion_text(completion)

        # Search all JSON objects and keep the LAST bbox found
        last_bbox: list[float] | None = None
        pos = 0
        while pos < len(text):
            start = text.find("{", pos)
            if start == -1:
                break
            depth = 0
            for i, c in enumerate(text[start:], start):
                if c == "{":
                    depth += 1
                elif c == "}":
                    depth -= 1
                    if depth == 0:
                        json_candidate = text[start : i + 1]
                        try:
                            data = json.loads(json_candidate)
                            if "bbox" in data:
                                last_bbox = data["bbox"]
                            elif "location" in data and "bbox" in data["location"]:
                                last_bbox = data["location"]["bbox"]
                            elif "localization" in data:
                                loc = data["localization"]
                                if isinstance(loc, dict):
                                    locs = loc.get("localizations", [])
                                    if isinstance(locs, list) and locs:
                                        first = locs[0]
                                        if isinstance(first, dict):
                                            bbox = first.get("bounding_box", first.get("bbox"))
                                            if isinstance(bbox, list) and len(bbox) >= 4:
                                                last_bbox = [float(x) for x in bbox[:4]]
                        except json.JSONDecodeError as e:
                            logger.debug(
                                f"IoUReward: JSON parse failed for bbox extraction: {e}. "
                                f"Snippet: {json_candidate[:100]}..."
                            )
                        # Continue searching from after this JSON object
                        pos = i + 1
                        break
            else:
                # Unclosed brace — no more valid JSON possible
                break

        if last_bbox is not None:
            if not isinstance(last_bbox, list) or len(last_bbox) < 4:
                logger.debug(
                    f"IoUReward: bbox from JSON is not a valid list of ≥4 elements: {last_bbox!r}"
                )
                last_bbox = None
            else:
                return [float(x) for x in last_bbox[:4]]

        # Fallback: regex for [x1, y1, x2, y2] pattern.
        # Use findall and take the LAST match — models often emit reasoning
        # arrays before the final bbox, so the last one is most likely correct.
        # Support optional negative sign for coordinates that may be slightly OOB.
        pattern = (
            r"\[(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\]"
        )
        matches = re.findall(pattern, text)
        if matches:
            return [float(x) for x in matches[-1]]

        logger.debug(f"IoUReward: No bbox found in completion. Text length: {len(text)}")
        return []

__init__

__init__(
    iou_threshold: float = 0.5,
    normalized: bool = True,
    continuous: bool = True,
    area_penalty_start: float = 0.5,
) -> None

Initialize IoU reward.

Parameters:

- iou_threshold (float, default 0.5): IoU threshold used only in step mode
- normalized (bool, default True): Whether coordinates are normalized to [0, 1]
- continuous (bool, default True): If True, return raw IoU for smooth gradients. If False, return 1.0 when IoU >= threshold, else 0.0
- area_penalty_start (float, default 0.5): Area ratio (predicted box area / image area) above which the penalty begins. When normalized=True, image area is 1.0, so with the default a box covering more than 50% of the image starts getting penalized, and the reward is scaled linearly to 0.0 for a full-image box. For pixel coordinates, image_area must be supplied in info. Set to 1.0 to disable
Source code in src/gaze/verifiers/rewards.py
def __init__(
    self,
    iou_threshold: float = 0.5,
    normalized: bool = True,
    continuous: bool = True,
    area_penalty_start: float = 0.5,
) -> None:
    """Initialize IoU reward.

    Args:
        iou_threshold: IoU threshold used only in step mode
        normalized: Whether coordinates are normalized [0,1]
        continuous: If True (default), return raw IoU for smooth gradients.
            If False, return 1.0 when IoU >= threshold, else 0.0.
        area_penalty_start: Area ratio above which penalty begins.
            When normalized=True, image area is 1.0. A box covering >50%
            of the image starts getting penalized. Set to 1.0 to disable.
    """
    self.iou_threshold = iou_threshold
    self.normalized = normalized
    self.continuous = continuous
    self.area_penalty_start = area_penalty_start

__call__

__call__(
    prompt: str, completion: Any, info: dict[str, Any]
) -> float

Compute IoU reward.
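
Before scoring, the predicted box has to be pulled out of the completion text. According to the class source, JSON objects are searched first and a regex for [x1, y1, x2, y2] is the fallback, with the last match winning in both cases. The sketch below covers only the regex fallback, reusing the pattern from the source; the last_bbox function name is illustrative.

```python
import re

# Four comma-separated numbers in square brackets; a leading minus sign is allowed.
BBOX_PATTERN = (
    r"\[(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\]"
)


def last_bbox(text: str) -> list[float]:
    """Return the last [x1, y1, x2, y2] found in the text, or [] if none."""
    matches = re.findall(BBOX_PATTERN, text)
    return [float(x) for x in matches[-1]] if matches else []


text = "Candidate [0.1, 0.1, 0.9, 0.9], refined to [0.2, 0.3, 0.6, 0.7]."
box = last_bbox(text)
```

Taking the last match means an intermediate box mentioned while reasoning does not override the final answer.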

Source code in src/gaze/verifiers/rewards.py
def __call__(
    self,
    prompt: str,  # noqa: ARG002 - Required by interface
    completion: Any,
    info: dict[str, Any],
) -> float:
    """Compute IoU reward."""
    pred_box = self._extract_bbox(completion)
    ref_box = info.get("bbox", info.get("reference_bbox", []))

    if not pred_box or not ref_box or len(pred_box) < 4 or len(ref_box) < 4:
        return 0.0

    # Convert to float for compute_iou (handles int coords from JSON)
    pred_floats = [float(x) for x in pred_box[:4]]
    ref_floats = [float(x) for x in ref_box[:4]]
    iou = compute_iou(pred_floats, ref_floats)

    reward = iou if self.continuous else (1.0 if iou >= self.iou_threshold else 0.0)

    # Apply area penalty for degenerate full-image boxes.
    # For normalized coords (in [0,1]), image_area = 1.0.
    # For pixel coords, image_area must be supplied via info dict.
    if self.area_penalty_start < 1.0:
        if self.normalized:
            coords_in_range = all(0.0 <= c <= 1.0 for c in pred_floats)
            if coords_in_range:
                image_area = 1.0
            else:
                # Coords are pixel-scale despite normalized=True.
                # We cannot infer image area from the predicted box itself
                # (that estimate is always wrong for origin-anchored boxes).
                # Require image_area in info; fail closed (reward=0) if absent
                # to prevent gaming via coordinate-space mismatch.
                raw_area = info.get("image_area")
                if raw_area is not None:
                    image_area = float(raw_area)
                else:
                    logger.warning(
                        "IoUReward: pixel-space coords detected but no "
                        "image_area in info dict; returning 0.0 (fail closed)"
                    )
                    return 0.0
        else:
            image_area = float(info.get("image_area", 0.0))
        if image_area > 0:
            pred_area = abs(pred_floats[2] - pred_floats[0]) * abs(
                pred_floats[3] - pred_floats[1]
            )
            area_ratio = pred_area / image_area
            if area_ratio > self.area_penalty_start:
                penalty = max(0.0, (1.0 - area_ratio) / (1.0 - self.area_penalty_start))
                reward *= penalty

    return reward

TokenF1Reward

Bases: BaseRewardFunction

Token-level F1 reward function.

Computes token overlap between prediction and reference. Useful for evaluating text generation where exact match is too strict.
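
The score is the harmonic mean of token precision and recall over a multiset intersection. Here is a minimal sketch of that calculation for the default "simple" tokenizer. It uses collections.Counter instead of the hand-rolled count dictionaries in the source, and a deliberately small STOPWORDS subset; both are choices made for this illustration.

```python
import re
from collections import Counter

# Illustrative subset only; the class defines a longer STOPWORDS set.
STOPWORDS = frozenset({"the", "is", "in", "a", "of"})


def tokens(text: str) -> list[str]:
    """Lowercase, split into word tokens, and drop stopwords."""
    words = re.findall(r"\b\w+\b", text.lower())
    return [w for w in words if w not in STOPWORDS]


def token_f1(pred: str, ref: str) -> float:
    p, r = tokens(pred), tokens(ref)
    if not p and not r:
        return 1.0  # both empty counts as a match
    if not p or not r:
        return 0.0
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


# pred tokens: lesion, left, lung; ref tokens: lesion, left, upper, lung
score = token_f1("The lesion is in the left lung", "lesion in left upper lung")
```

In this example all three predicted tokens appear in the reference (precision 1.0) while one of four reference tokens is missed (recall 0.75), giving an F1 of 6/7.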

Source code in src/gaze/verifiers/rewards.py
class TokenF1Reward(BaseRewardFunction):
    """Token-level F1 reward function.

    Computes token overlap between prediction and reference.
    Useful for evaluating text generation where exact match is too strict.
    """

    # Stopwords that inflate token overlap without carrying semantic content.
    # Kept small and domain-neutral to avoid accidentally filtering medical terms.
    STOPWORDS: frozenset[str] = frozenset(
        {
            "a",
            "an",
            "the",
            "is",
            "are",
            "was",
            "were",
            "be",
            "been",
            "being",
            "have",
            "has",
            "had",
            "do",
            "does",
            "did",
            "will",
            "would",
            "shall",
            "should",
            "may",
            "might",
            "must",
            "can",
            "could",
            "i",
            "me",
            "my",
            "we",
            "our",
            "you",
            "your",
            "he",
            "she",
            "it",
            "they",
            "them",
            "his",
            "her",
            "its",
            "their",
            "this",
            "that",
            "these",
            "those",
            "in",
            "on",
            "at",
            "to",
            "for",
            "of",
            "with",
            "by",
            "from",
            "as",
            "into",
            "about",
            "between",
            "through",
            "during",
            "before",
            "after",
            "and",
            "or",
            "but",
            "nor",
            "not",
            "no",
            "so",
            "if",
            "then",
        }
    )

    def __init__(
        self,
        normalize: bool = True,
        case_sensitive: bool = False,
        tokenize: str = "simple",  # "simple", "word", or "character"
        filter_stopwords: bool = True,
    ) -> None:
        """Initialize token F1 reward.

        Args:
            normalize: Whether to normalize strings
            case_sensitive: Whether comparison is case-sensitive
            tokenize: Tokenization method
            filter_stopwords: Whether to remove stopwords before scoring
        """
        self.normalize = normalize
        self.case_sensitive = case_sensitive
        self.tokenize = tokenize
        self.filter_stopwords = filter_stopwords

    def __call__(
        self,
        prompt: str,  # noqa: ARG002 - Required by interface
        completion: Any,
        info: dict[str, Any],
    ) -> float:
        """Compute token F1 reward."""
        pred = extract_completion_text(completion)
        ref = info.get("gold", info.get("reference", info.get("answer", "")))

        pred_tokens = self._tokenize_text(pred)
        ref_tokens = self._tokenize_text(ref)

        if not pred_tokens and not ref_tokens:
            return 1.0
        if not pred_tokens or not ref_tokens:
            return 0.0

        # Count token occurrences
        pred_counts: dict[str, int] = {}
        ref_counts: dict[str, int] = {}

        for token in pred_tokens:
            pred_counts[token] = pred_counts.get(token, 0) + 1
        for token in ref_tokens:
            ref_counts[token] = ref_counts.get(token, 0) + 1

        # Compute intersection
        intersection = sum(
            min(count, ref_counts.get(token, 0)) for token, count in pred_counts.items()
        )

        precision = intersection / len(pred_tokens)
        recall = intersection / len(ref_tokens)

        if precision + recall == 0:
            return 0.0

        return 2 * precision * recall / (precision + recall)

    def _tokenize_text(self, text: str) -> list[str]:
        """Tokenize text."""
        if not text:
            return []

        # Normalize
        if self.normalize:
            text = text.lower() if not self.case_sensitive else text
            text = re.sub(r"\s+", " ", text.strip())

        # Tokenize
        if self.tokenize == "simple":
            # Split on whitespace and punctuation
            tokens = re.findall(r"\b\w+\b", text)
        elif self.tokenize == "word":
            tokens = text.split()
        elif self.tokenize == "character":
            tokens = list(text)
        else:
            raise ValueError(f"Unknown tokenize method: {self.tokenize}")

        # Remove stopwords — prevents gaming via generic filler text
        if self.filter_stopwords and self.tokenize != "character":
            tokens = [t for t in tokens if t.lower() not in self.STOPWORDS]

        return tokens

__init__

__init__(
    normalize: bool = True,
    case_sensitive: bool = False,
    tokenize: str = "simple",
    filter_stopwords: bool = True,
) -> None

Initialize token F1 reward.

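Parameters:

- normalize (bool, default True): Whether to normalize strings: lowercase (unless case_sensitive is True) and collapse whitespace
- case_sensitive (bool, default False): Whether comparison is case-sensitive. Only consulted when normalize is True
- tokenize (str, default 'simple'): Tokenization method: "simple" (word tokens via the regex \b\w+\b), "word" (split on whitespace), or "character" (one token per character). Any other value raises ValueError
- filter_stopwords (bool, default True): Whether to remove stopwords before scoring. Not applied in "character" mode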
Source code in src/gaze/verifiers/rewards.py
def __init__(
    self,
    normalize: bool = True,
    case_sensitive: bool = False,
    tokenize: str = "simple",  # "simple", "word", or "character"
    filter_stopwords: bool = True,
) -> None:
    """Initialize token F1 reward.

    Args:
        normalize: Whether to normalize strings
        case_sensitive: Whether comparison is case-sensitive
        tokenize: Tokenization method
        filter_stopwords: Whether to remove stopwords before scoring
    """
    self.normalize = normalize
    self.case_sensitive = case_sensitive
    self.tokenize = tokenize
    self.filter_stopwords = filter_stopwords

__call__

__call__(
    prompt: str, completion: Any, info: dict[str, Any]
) -> float

Compute token F1 reward.

Source code in src/gaze/verifiers/rewards.py
def __call__(
    self,
    prompt: str,  # noqa: ARG002 - Required by interface
    completion: Any,
    info: dict[str, Any],
) -> float:
    """Compute token F1 reward."""
    pred = extract_completion_text(completion)
    ref = info.get("gold", info.get("reference", info.get("answer", "")))

    pred_tokens = self._tokenize_text(pred)
    ref_tokens = self._tokenize_text(ref)

    if not pred_tokens and not ref_tokens:
        return 1.0
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Count token occurrences
    pred_counts: dict[str, int] = {}
    ref_counts: dict[str, int] = {}

    for token in pred_tokens:
        pred_counts[token] = pred_counts.get(token, 0) + 1
    for token in ref_tokens:
        ref_counts[token] = ref_counts.get(token, 0) + 1

    # Compute intersection
    intersection = sum(
        min(count, ref_counts.get(token, 0)) for token, count in pred_counts.items()
    )

    precision = intersection / len(pred_tokens)
    recall = intersection / len(ref_tokens)

    if precision + recall == 0:
        return 0.0

    return 2 * precision * recall / (precision + recall)

GazeAdapter

Adapter for using GAZE processors with verifiers.

Bridges the gap between the two packages:

- Converts messages between formats
- Collects tool calls and results from processor runs
- Manages image metadata
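
To illustrate the message conversion, the sketch below mirrors the logic of _convert_response_to_messages from the class source as a standalone function. Plain dictionaries stand in for the verifiers Messages type, and the tool names ("zoom", "crop") and payloads are hypothetical.

```python
import json
from typing import Any


def to_messages(
    response_text: str,
    tool_calls: list[dict[str, Any]],
    tool_results: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Build one assistant message followed by one tool message per tool result."""
    messages: list[dict[str, Any]] = []
    if response_text:
        messages.append({"role": "assistant", "content": response_text})
    for idx, result in enumerate(tool_results):
        # Pair each result with the tool call at the same position; if there is
        # no such call, fall back to the index as the ID.
        tool_call_id = tool_calls[idx]["id"] if idx < len(tool_calls) else str(idx)
        messages.append(
            {"role": "tool", "content": json.dumps(result), "tool_call_id": tool_call_id}
        )
    return messages


messages = to_messages(
    json.dumps({"answer": "left lung"}),
    [{"id": "call_1", "name": "zoom", "arguments": {"factor": 2}}],
    [{"tool_name": "zoom", "error": None}, {"tool_name": "crop", "error": None}],
)
```

Results are matched to calls by position, so the second result here, which has no corresponding call, receives the ID "1".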

Source code in src/gaze/verifiers/adapter.py
class GazeAdapter:
    """Adapter for using GAZE processors with verifiers.

    Bridges the gap between the two packages:
    - Converts messages between formats
    - Collects tool calls and results from processor runs
    - Manages image metadata
    """

    @beartype
    def __init__(
        self,
        processor: AgenticProcessorBase,
    ) -> None:
        """Initialize adapter.

        Args:
            processor: GAZE processor
        """
        self.processor = processor

    @beartype
    async def process_verifiers_messages(
        self,
        messages: Messages,
        info: dict[str, Any],
    ) -> dict[str, Any]:
        """Process messages using GAZE.

        Args:
            messages: verifiers format messages
            info: Additional information (may include 'image_path')

        Returns:
            Processed result with response and metadata
        """
        user_prompt = self._extract_user_prompt(messages)
        metadata = dict(info)
        if user_prompt:
            metadata.setdefault("user_prompt", user_prompt)

        # Extract image path from info if provided
        image_path = info.get("image_path")
        images: Path | None = None
        if image_path:
            images = Path(image_path) if isinstance(image_path, str) else image_path

        agentic_result = await self.processor.analyze(images=images, metadata=metadata)
        tool_calls = self._collect_tool_calls(agentic_result)
        tool_results = self._collect_tool_results(agentic_result)
        response_payload = deep_thaw(agentic_result.final_response)
        response_text = json.dumps(response_payload)
        should_continue = bool(agentic_result.final_response.get("continue"))

        return {
            "response": response_payload,
            "messages": self._convert_response_to_messages(response_text, tool_calls, tool_results),
            "tool_calls": tool_calls,
            "turns": agentic_result.num_turns,
            "is_complete": not should_continue,
        }

    @beartype
    def _collect_tool_calls(self, result: AgenticResult) -> list[dict[str, Any]]:
        """Flatten tool calls from agentic turns."""
        return [
            {
                "id": tool_call.id,
                "name": tool_call.name,
                "arguments": tool_call.arguments
                if isinstance(tool_call.arguments, str)
                else deep_thaw(tool_call.arguments),
            }
            for turn in result.turns
            for tool_call in turn.tool_calls
        ]

    @beartype
    def _collect_tool_results(self, result: AgenticResult) -> list[dict[str, Any]]:
        """Convert tool results into serializable dictionaries."""
        return [
            {
                "tool_name": tool_result.tool_name,
                "description": tool_result.description,
                "error": tool_result.error,
                "metadata": deep_thaw(tool_result.metadata),
            }
            for turn in result.turns
            for tool_result in turn.tool_results
        ]

    @beartype
    def _convert_response_to_messages(
        self,
        response_text: str,
        tool_calls: list[dict[str, Any]],
        tool_results: list[dict[str, Any]],
    ) -> Messages:
        """Convert a GAZE response to verifiers messages format."""
        messages: Messages = []

        if response_text:
            messages.append(
                {
                    "role": "assistant",
                    "content": response_text,
                }
            )

        if tool_results:
            for idx, tool_result in enumerate(tool_results):
                # Use actual tool call ID when available, fall back to index
                tool_call_id = tool_calls[idx]["id"] if idx < len(tool_calls) else str(idx)
                messages.append(
                    {
                        "role": "tool",
                        "content": json.dumps(tool_result),
                        "tool_call_id": tool_call_id,
                    }
                )

        return messages

    @beartype
    def _extract_user_prompt(self, messages: Messages) -> str:
        """Get the most recent user text from verifiers messages."""
        for msg in reversed(messages):
            if msg.get("role") != "user":
                continue
            content = msg.get("content", "")
            if isinstance(content, str):
                return content
            if isinstance(content, list):
                text_parts = [
                    part.get("text", "")
                    for part in content
                    if isinstance(part, dict) and part.get("type") == "text"
                ]
                combined = "\n".join(part for part in text_parts if part)
                if combined:
                    return combined
        return ""

    @beartype
    def create_environment_class(
        self,
        base_class: type[vf.MultiTurnEnv] | None = None,
        **env_kwargs: Any,
    ) -> type[vf.MultiTurnEnv]:
        """Create a verifiers MultiTurnEnv class that uses this adapter.

        Args:
            base_class: Base environment class to inherit from
            **env_kwargs: Additional arguments for environment (currently unused)

        Returns:
            Environment class
        """
        _ = env_kwargs
        if base_class is None:
            from .base import BaseMultiTurnEnv

            base_class = BaseMultiTurnEnv

        # Capture adapter config in closure
        captured_processor = self.processor

        class AdapterEnv(base_class):
            def __init__(self, *args: Any, **kwargs: Any):
                super().__init__(*args, **kwargs)
                self._adapter = GazeAdapter(
                    processor=captured_processor,
                )

            async def env_response(
                self,
                messages: Messages,
                state: State,
                **kwargs: Any,  # noqa: ARG002 - Required by vf.MultiTurnEnv interface
            ) -> vf.Messages:
                """Generate response using GAZE."""
                info = state.get("info") or {}
                result = await self._adapter.process_verifiers_messages(
                    messages,
                    info,
                )

                state["turn"] = state.get("turn", 0) + 1
                state["tool_uses"] = state.get("tool_uses", 0) + len(result["tool_calls"])
                state["is_complete"] = result["is_complete"]

                return result["messages"]

            @vf.stop
            async def _adapter_complete(self, state: State) -> bool:
                """Stop when adapter signals completion."""
                return state.get("is_complete", False)

        return AdapterEnv
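
To illustrate how `_extract_user_prompt` treats plain-string and multimodal content, here is a small standalone sketch that mirrors its logic on plain dicts. The function name `extract_user_prompt` and the sample messages are illustrative only and are not part of GAZE.

```python
from __future__ import annotations

from typing import Any


def extract_user_prompt(messages: list[dict[str, Any]]) -> str:
    """Return the text of the most recent user message, or "" if there is none."""
    for msg in reversed(messages):
        if msg.get("role") != "user":
            continue
        content = msg.get("content", "")
        if isinstance(content, str):
            return content
        if isinstance(content, list):
            # Multimodal content: keep only the text parts and skip images.
            texts = [
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            ]
            combined = "\n".join(t for t in texts if t)
            if combined:
                return combined
    return ""


# Hypothetical conversation used only for this sketch.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "An answer"},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
            {"type": "text", "text": "Be brief."},
        ],
    },
]

prompt = extract_user_prompt(messages)
```

Note that a user message whose content list has no text parts is skipped, so the search continues with earlier user messages.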

__init__

__init__(processor: AgenticProcessorBase) -> None

Initialize adapter.

Parameters:

Name Type Description Default
processor AgenticProcessorBase

GAZE processor

required
Source code in src/gaze/verifiers/adapter.py
@beartype
def __init__(
    self,
    processor: AgenticProcessorBase,
) -> None:
    """Initialize adapter.

    Args:
        processor: GAZE processor
    """
    self.processor = processor

process_verifiers_messages async

process_verifiers_messages(
    messages: Messages, info: dict[str, Any]
) -> dict[str, Any]

Process messages using GAZE.

Parameters:

Name Type Description Default
messages Messages

Messages in verifiers format

required
info dict[str, Any]

Extra per-example data, forwarded to the processor as metadata (may include 'image_path')

required

Returns:

Type Description
dict[str, Any]

Dict with keys "response", "messages", "tool_calls", "turns", and "is_complete"

Source code in src/gaze/verifiers/adapter.py
@beartype
async def process_verifiers_messages(
    self,
    messages: Messages,
    info: dict[str, Any],
) -> dict[str, Any]:
    """Process messages using GAZE.

    Args:
        messages: Messages in verifiers format
        info: Extra per-example data, forwarded to the processor as
            metadata (may include 'image_path')

    Returns:
        Dict with keys "response", "messages", "tool_calls", "turns",
        and "is_complete"
    """
    user_prompt = self._extract_user_prompt(messages)
    metadata = dict(info)
    if user_prompt:
        metadata.setdefault("user_prompt", user_prompt)

    # Extract image path from info if provided
    image_path = info.get("image_path")
    images: Path | None = None
    if image_path:
        images = Path(image_path) if isinstance(image_path, str) else image_path

    agentic_result = await self.processor.analyze(images=images, metadata=metadata)
    tool_calls = self._collect_tool_calls(agentic_result)
    tool_results = self._collect_tool_results(agentic_result)
    response_payload = deep_thaw(agentic_result.final_response)
    response_text = json.dumps(response_payload)
    should_continue = bool(agentic_result.final_response.get("continue"))

    return {
        "response": response_payload,
        "messages": self._convert_response_to_messages(response_text, tool_calls, tool_results),
        "tool_calls": tool_calls,
        "turns": agentic_result.num_turns,
        "is_complete": not should_continue,
    }
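
The shape of the returned dict can be sketched without GAZE installed. The example below is a simplified stand-in, not the adapter itself: `to_messages` mirrors `_convert_response_to_messages`, and the response, tool calls, and tool results are invented sample data.

```python
from __future__ import annotations

import json
from typing import Any


def to_messages(
    response_text: str,
    tool_calls: list[dict[str, Any]],
    tool_results: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Build one assistant message followed by one tool message per result."""
    messages: list[dict[str, Any]] = []
    if response_text:
        messages.append({"role": "assistant", "content": response_text})
    for idx, tool_result in enumerate(tool_results):
        # Results are paired with calls by position; when there are more
        # results than calls, the index itself is used as the ID.
        call_id = tool_calls[idx]["id"] if idx < len(tool_calls) else str(idx)
        messages.append(
            {"role": "tool", "content": json.dumps(tool_result), "tool_call_id": call_id}
        )
    return messages


# Invented sample data standing in for an AgenticResult.
final_response = {"answer": "cat", "continue": False}
tool_calls = [{"id": "call_1", "name": "zoom", "arguments": {"factor": 2}}]
tool_results = [
    {"tool_name": "zoom", "description": "zoomed 2x", "error": None, "metadata": {}},
    {"tool_name": "crop", "description": "cropped", "error": None, "metadata": {}},
]

result = {
    "response": final_response,
    "messages": to_messages(json.dumps(final_response), tool_calls, tool_results),
    "tool_calls": tool_calls,
    "turns": 2,
    # A truthy "continue" field in the final response keeps the episode open.
    "is_complete": not bool(final_response.get("continue")),
}
```

Pairing by position assumes tool results arrive in the same order as tool calls; the second result above has no matching call and therefore gets the ID `"1"`.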

create_environment_class

create_environment_class(
    base_class: type[MultiTurnEnv] | None = None,
    **env_kwargs: Any,
) -> type[vf.MultiTurnEnv]

Create a verifiers MultiTurnEnv class that uses this adapter.

Parameters:

Name Type Description Default
base_class type[MultiTurnEnv] | None

Base environment class to inherit from

None
**env_kwargs Any

Additional arguments for environment (currently unused)

{}

Returns:

Type Description
type[MultiTurnEnv]

Environment class

Source code in src/gaze/verifiers/adapter.py
@beartype
def create_environment_class(
    self,
    base_class: type[vf.MultiTurnEnv] | None = None,
    **env_kwargs: Any,
) -> type[vf.MultiTurnEnv]:
    """Create a verifiers MultiTurnEnv class that uses this adapter.

    Args:
        base_class: Base environment class to inherit from
        **env_kwargs: Additional arguments for environment (currently unused)

    Returns:
        Environment class
    """
    _ = env_kwargs
    if base_class is None:
        from .base import BaseMultiTurnEnv

        base_class = BaseMultiTurnEnv

    # Capture adapter config in closure
    captured_processor = self.processor

    class AdapterEnv(base_class):
        def __init__(self, *args: Any, **kwargs: Any):
            super().__init__(*args, **kwargs)
            self._adapter = GazeAdapter(
                processor=captured_processor,
            )

        async def env_response(
            self,
            messages: Messages,
            state: State,
            **kwargs: Any,  # noqa: ARG002 - Required by vf.MultiTurnEnv interface
        ) -> vf.Messages:
            """Generate response using GAZE."""
            info = state.get("info") or {}
            result = await self._adapter.process_verifiers_messages(
                messages,
                info,
            )

            state["turn"] = state.get("turn", 0) + 1
            state["tool_uses"] = state.get("tool_uses", 0) + len(result["tool_calls"])
            state["is_complete"] = result["is_complete"]

            return result["messages"]

        @vf.stop
        async def _adapter_complete(self, state: State) -> bool:
            """Stop when adapter signals completion."""
            return state.get("is_complete", False)

    return AdapterEnv
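
The closure pattern used here can be tried in isolation. The sketch below replaces the real dependencies with stubs: `StubAdapter` and `StubBaseEnv` are hypothetical stand-ins for `GazeAdapter` and a `MultiTurnEnv` base class, and the canned result is invented. It only shows how the generated class captures the processor and updates `state` in place.

```python
from __future__ import annotations

import asyncio
from typing import Any


class StubAdapter:
    """Hypothetical stand-in for GazeAdapter that returns a canned result."""

    def __init__(self, processor: Any) -> None:
        self.processor = processor

    async def process_verifiers_messages(
        self, messages: list[dict[str, Any]], info: dict[str, Any]
    ) -> dict[str, Any]:
        return {
            "messages": [{"role": "assistant", "content": "done"}],
            "tool_calls": [{"id": "call_1", "name": "zoom", "arguments": {}}],
            "is_complete": True,
        }


class StubBaseEnv:
    """Hypothetical stand-in for a MultiTurnEnv base class."""


def create_environment_class(processor: Any, base_class: type = StubBaseEnv) -> type:
    captured_processor = processor  # captured by the class body below

    class AdapterEnv(base_class):
        def __init__(self, *args: Any, **kwargs: Any) -> None:
            super().__init__(*args, **kwargs)
            self._adapter = StubAdapter(processor=captured_processor)

        async def env_response(
            self, messages: list[dict[str, Any]], state: dict[str, Any]
        ) -> list[dict[str, Any]]:
            info = state.get("info") or {}
            result = await self._adapter.process_verifiers_messages(messages, info)
            # Track progress in the shared state dict.
            state["turn"] = state.get("turn", 0) + 1
            state["tool_uses"] = state.get("tool_uses", 0) + len(result["tool_calls"])
            state["is_complete"] = result["is_complete"]
            return result["messages"]

    return AdapterEnv


EnvClass = create_environment_class(processor="my-processor")
env = EnvClass()
state: dict[str, Any] = {"info": {}}
reply = asyncio.run(env.env_response([{"role": "user", "content": "hi"}], state))
```

Each environment instance builds its own adapter, but every adapter shares the single processor captured when the class was created.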

BaseMultiTurnEnv

Bases: MultiTurnEnv

Base multi-turn environment for GAZE integration.

Provides common functionality for multi-turn environments:

- Dataset loading from JSONL files
- Turn tracking and limits
- Standard message processing
- Logging utilities
- Tool request parsing

Subclasses should implement:

- setup_state: Initialize episode state
- env_response: Generate environment responses

Source code in src/gaze/verifiers/base.py
class BaseMultiTurnEnv(vf.MultiTurnEnv):
    """Base multi-turn environment for GAZE integration.

    Provides common functionality for multi-turn environments:
    - Dataset loading from JSONL files
    - Turn tracking and limits
    - Standard message processing
    - Logging utilities
    - Tool request parsing

    Subclasses should implement:
    - setup_state: Initialize episode state
    - env_response: Generate environment responses
    """

    def __init__(
        self,
        cases: list[dict[str, Any]] | None = None,
        *,
        dataset_path: str | None = None,
        max_turns: int = 10,
        name: str = "BaseGazeEnv",
        log_dir: Path | str | None = None,
    ) -> None:
        """Initialize environment.

        Args:
            cases: Pre-loaded cases (optional)
            dataset_path: Path to JSONL dataset file
            max_turns: Maximum conversation turns
            name: Environment name
            log_dir: Directory for debug logs
        """
        self._max_turns = max_turns
        self._log_dir = Path(log_dir) if log_dir else Path(__file__).parent.parent / "logs"
        self._log_path = self._log_dir / f"{name.lower()}_debug.log"

        # Load cases
        if cases is None and dataset_path:
            cases = self._load_jsonl(dataset_path)
        elif cases is None:
            cases = []

        # Process cases into prompts and info
        prompts, infos = self._prepare_cases(cases)

        dataset = Dataset.from_dict(
            {
                "id": list(range(len(cases))),
                "prompt": prompts,
                "info": infos,
            }
        )

        super().__init__(max_turns=max_turns, dataset=dataset)
        self._cases = cases

    @staticmethod
    def _load_jsonl(path: str) -> list[dict[str, Any]]:
        """Load cases from JSONL file."""
        rows: list[dict[str, Any]] = []
        with open(path, encoding="utf-8") as fh:
            for raw_line in fh:
                stripped = raw_line.strip()
                if stripped:
                    rows.append(json.loads(stripped))
        return rows

    def _prepare_cases(
        self,
        cases: list[dict[str, Any]],
    ) -> tuple[list[list[dict[str, Any]]], list[dict[str, Any]]]:
        """Process cases into prompts and info structures.

        Override this method to customize case processing.

        Args:
            cases: Raw case dictionaries

        Returns:
            Tuple of (prompts, infos) for dataset construction
        """
        prompts: list[list[dict[str, Any]]] = []
        infos: list[dict[str, Any]] = []

        for idx, case in enumerate(cases):
            # Default prompt structure
            prompt = self._build_prompt(case)
            prompts.append(prompt)

            # Default info structure
            info = {
                "case_index": idx,
                **case,  # Include all case data
            }
            infos.append(info)

        return prompts, infos

    def _build_prompt(self, case: dict[str, Any]) -> list[dict[str, Any]]:
        """Build initial prompt for a case.

        Override this method to customize prompt construction.

        Args:
            case: Case dictionary

        Returns:
            List of message dicts
        """
        system_prompt = self.get_system_prompt()
        user_content = self._build_user_message(case)

        messages = [{"role": "system", "content": system_prompt}]
        messages.append({"role": "user", "content": user_content})

        return messages

    def _build_user_message(self, case: dict[str, Any]) -> str | list[dict[str, Any]]:
        """Build user message content.

        Override this method to customize user messages.

        Args:
            case: Case dictionary

        Returns:
            String or list of content dicts (for multimodal)
        """
        # Default: just return the question or case text
        return case.get("question", str(case))

    def get_system_prompt(self) -> str:
        """Get system prompt for the environment.

        Override this method to provide a custom system prompt.

        Returns:
            System prompt string
        """
        return "You are a helpful assistant. Respond accurately and concisely."

    async def setup_state(self, state: vf.State) -> vf.State:
        """Initialize episode state.

        Override this method to customize initial state.

        Args:
            state: State dict pre-populated by verifiers

        Returns:
            State with custom fields added
        """
        state["turn"] = 0
        state["tool_uses"] = 0
        return state

    @vf.stop
    async def _turn_limit_reached(self, state: vf.State) -> bool:
        """Stop when turn limit is reached."""
        return state.get("turn", 0) >= self._max_turns

    async def env_response(
        self,
        messages: vf.Messages,  # noqa: ARG002 - Required by interface
        state: vf.State,
        **kwargs: Any,  # noqa: ARG002 - Required by interface
    ) -> vf.Messages:
        """Generate environment response to assistant message.

        Override this method to implement custom environment behavior.
        Mutate state in-place to track turn progress.

        Args:
            messages: Conversation messages
            state: Current episode state (mutate in-place)
            **kwargs: Additional arguments from verifiers

        Returns:
            Response messages
        """
        # Default: increment turn counter, no response
        state["turn"] = state.get("turn", 0) + 1
        return []

    def _log_debug(self, line: str) -> None:
        """Write debug log entry."""
        self._log_dir.mkdir(parents=True, exist_ok=True)
        with self._log_path.open("a", encoding="utf-8") as f:
            f.write(line + "\n")

    def _last_assistant_text(self, messages: vf.Messages) -> str:
        """Get text from last assistant message."""
        for m in reversed(messages):
            if isinstance(m, dict) and m.get("role") == "assistant":
                content = m.get("content", "")
                if isinstance(content, str):
                    return content
                elif isinstance(content, list):
                    # Handle multimodal content
                    for item in content:
                        if isinstance(item, dict) and item.get("type") == "text":
                            return item.get("text", "")
        return ""

    def _extract_tool_request(
        self,
        text: str,
        tools: list[str],
    ) -> tuple[str, list[Any]] | None:
        """Extract tool request from text.

        Override this method to support custom tool parsing.

        Args:
            text: Assistant message text
            tools: List of valid tool names

        Returns:
            Tuple of (tool_name, args) or None
        """
        text_upper = text.upper()

        for tool in tools:
            # Simple pattern: TOOL [args]; escape the tool name so that any
            # regex metacharacters in it are matched literally
            pattern = f"{re.escape(tool.upper())}\\s*\\[([^\\]]+)\\]"
            match = re.search(pattern, text_upper)
            if match:
                # Parse arguments (comma-separated)
                args_str = match.group(1).strip()
                try:
                    # Try parsing as numbers
                    args = [float(x.strip()) for x in args_str.split(",")]
                except ValueError:
                    # Keep as strings
                    args = [x.strip() for x in args_str.split(",")]
                return tool.lower(), args

        return None
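
The `TOOL [args]` convention parsed by `_extract_tool_request` is easiest to see on a few inputs. The following standalone sketch follows the same steps (upper-case the text, look for each tool name followed by bracketed arguments, try numeric parsing first). The function name and the tool names `zoom` and `crop` are made up for illustration.

```python
from __future__ import annotations

import re
from typing import Any


def extract_tool_request(text: str, tools: list[str]) -> tuple[str, list[Any]] | None:
    """Find the first `TOOL [a, b, ...]` request for any known tool."""
    text_upper = text.upper()
    for tool in tools:
        pattern = re.escape(tool.upper()) + r"\s*\[([^\]]+)\]"
        match = re.search(pattern, text_upper)
        if match:
            parts = [p.strip() for p in match.group(1).split(",")]
            try:
                # All-numeric argument lists are returned as floats.
                return tool.lower(), [float(p) for p in parts]
            except ValueError:
                # Otherwise every argument is kept as a string.
                return tool.lower(), parts
    return None


numeric = extract_tool_request("Please zoom [10, 20, 2.5] now", ["zoom", "crop"])
textual = extract_tool_request("crop [left, top]", ["zoom", "crop"])
missing = extract_tool_request("no tool here", ["zoom", "crop"])
```

Here `numeric` is `("zoom", [10.0, 20.0, 2.5])` and `missing` is `None`. Because matching runs on the upper-cased text, string arguments come back upper-cased, so `textual` is `("crop", ["LEFT", "TOP"])`.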

__init__

__init__(
    cases: list[dict[str, Any]] | None = None,
    *,
    dataset_path: str | None = None,
    max_turns: int = 10,
    name: str = "BaseGazeEnv",
    log_dir: Path | str | None = None,
) -> None

Initialize environment.

Parameters:

Name Type Description Default
cases list[dict[str, Any]] | None

Pre-loaded cases (optional)

None
dataset_path str | None

Path to JSONL dataset file

None
max_turns int

Maximum conversation turns

10
name str

Environment name

'BaseGazeEnv'
log_dir Path | str | None

Directory for debug logs

None
Source code in src/gaze/verifiers/base.py
def __init__(
    self,
    cases: list[dict[str, Any]] | None = None,
    *,
    dataset_path: str | None = None,
    max_turns: int = 10,
    name: str = "BaseGazeEnv",
    log_dir: Path | str | None = None,
) -> None:
    """Initialize environment.

    Args:
        cases: Pre-loaded cases (optional)
        dataset_path: Path to JSONL dataset file
        max_turns: Maximum conversation turns
        name: Environment name
        log_dir: Directory for debug logs
    """
    self._max_turns = max_turns
    self._log_dir = Path(log_dir) if log_dir else Path(__file__).parent.parent / "logs"
    self._log_path = self._log_dir / f"{name.lower()}_debug.log"

    # Load cases
    if cases is None and dataset_path:
        cases = self._load_jsonl(dataset_path)
    elif cases is None:
        cases = []

    # Process cases into prompts and info
    prompts, infos = self._prepare_cases(cases)

    dataset = Dataset.from_dict(
        {
            "id": list(range(len(cases))),
            "prompt": prompts,
            "info": infos,
        }
    )

    super().__init__(max_turns=max_turns, dataset=dataset)
    self._cases = cases
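
As a rough picture of what the constructor does with a JSONL file, the sketch below re-implements the loading and case-preparation steps as plain functions and stops just before the `Dataset.from_dict` call. The file contents and the `answer` field are invented; only the `question` key comes from the default `_build_user_message`.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any

SYSTEM_PROMPT = "You are a helpful assistant. Respond accurately and concisely."


def load_jsonl(path: str) -> list[dict[str, Any]]:
    """Read one JSON object per non-blank line."""
    rows: list[dict[str, Any]] = []
    with open(path, encoding="utf-8") as fh:
        for raw_line in fh:
            stripped = raw_line.strip()
            if stripped:
                rows.append(json.loads(stripped))
    return rows


def prepare_cases(
    cases: list[dict[str, Any]],
) -> tuple[list[list[dict[str, Any]]], list[dict[str, Any]]]:
    """Turn raw cases into (prompts, infos), following the default behaviour."""
    prompts: list[list[dict[str, Any]]] = []
    infos: list[dict[str, Any]] = []
    for idx, case in enumerate(cases):
        prompts.append(
            [
                {"role": "system", "content": SYSTEM_PROMPT},
                # Falls back to the stringified case when "question" is absent.
                {"role": "user", "content": case.get("question", str(case))},
            ]
        )
        infos.append({"case_index": idx, **case})
    return prompts, infos


with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "train.jsonl"
    dataset_path.write_text(
        '{"question": "What is shown?", "answer": "a cat"}\n'
        "\n"
        '{"question": "How many dogs?", "answer": "2"}\n',
        encoding="utf-8",
    )
    cases = load_jsonl(str(dataset_path))

prompts, infos = prepare_cases(cases)
# These are the columns that would be handed to Dataset.from_dict.
columns = {"id": list(range(len(cases))), "prompt": prompts, "info": infos}
```

Since each `info` entry spreads the whole case after `case_index`, every field in the JSONL row is available to reward functions and to `env_response` through `state["info"]`.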

get_system_prompt

get_system_prompt() -> str

Get system prompt for the environment.

Override this method to provide a custom system prompt.

Returns:

Type Description
str

System prompt string

Source code in src/gaze/verifiers/base.py
def get_system_prompt(self) -> str:
    """Get system prompt for the environment.

    Override this method to provide a custom system prompt.

    Returns:
        System prompt string
    """
    return "You are a helpful assistant. Respond accurately and concisely."

setup_state async

setup_state(state: State) -> vf.State

Initialize episode state.

Override this method to customize initial state.

Parameters:

Name Type Description Default
state State

State dict pre-populated by verifiers

required

Returns:

Type Description
State

State with custom fields added

Source code in src/gaze/verifiers/base.py
async def setup_state(self, state: vf.State) -> vf.State:
    """Initialize episode state.

    Override this method to customize initial state.

    Args:
        state: State dict pre-populated by verifiers

    Returns:
        State with custom fields added
    """
    state["turn"] = 0
    state["tool_uses"] = 0
    return state

env_response async

env_response(
    messages: Messages, state: State, **kwargs: Any
) -> vf.Messages

Generate environment response to assistant message.

Override this method to implement custom environment behavior. Mutate state in-place to track turn progress.

Parameters:

Name Type Description Default
messages Messages

Conversation messages

required
state State

Current episode state (mutate in-place)

required
**kwargs Any

Additional arguments from verifiers

{}

Returns:

Type Description
Messages

Response messages

Source code in src/gaze/verifiers/base.py
async def env_response(
    self,
    messages: vf.Messages,  # noqa: ARG002 - Required by interface
    state: vf.State,
    **kwargs: Any,  # noqa: ARG002 - Required by interface
) -> vf.Messages:
    """Generate environment response to assistant message.

    Override this method to implement custom environment behavior.
    Mutate state in-place to track turn progress.

    Args:
        messages: Conversation messages
        state: Current episode state (mutate in-place)
        **kwargs: Additional arguments from verifiers

    Returns:
        Response messages
    """
    # Default: increment turn counter, no response
    state["turn"] = state.get("turn", 0) + 1
    return []
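
The interplay between `setup_state`, `env_response`, and the turn-limit stop condition can be sketched with plain coroutines. This is a deliberately simplified loop, not how verifiers schedules rollouts: the `rollout` function, the faked assistant reply, and `MAX_TURNS = 3` are all assumptions made for the example.

```python
from __future__ import annotations

import asyncio
from typing import Any

MAX_TURNS = 3  # illustrative limit; the constructor above defaults to 10


async def setup_state(state: dict[str, Any]) -> dict[str, Any]:
    state["turn"] = 0
    state["tool_uses"] = 0
    return state


async def env_response(
    messages: list[dict[str, Any]], state: dict[str, Any]
) -> list[dict[str, Any]]:
    """Custom behaviour: acknowledge each turn instead of returning nothing."""
    state["turn"] = state.get("turn", 0) + 1
    return [{"role": "user", "content": f"Turn {state['turn']} acknowledged."}]


async def turn_limit_reached(state: dict[str, Any]) -> bool:
    return state.get("turn", 0) >= MAX_TURNS


async def rollout() -> tuple[dict[str, Any], list[dict[str, Any]]]:
    state = await setup_state({"info": {}})
    messages: list[dict[str, Any]] = [{"role": "user", "content": "Start"}]
    while not await turn_limit_reached(state):
        # A real rollout would query the model here; the reply is faked.
        messages.append({"role": "assistant", "content": "thinking"})
        messages.extend(await env_response(messages, state))
    return state, messages


state, messages = asyncio.run(rollout())
```

The key point is that `env_response` must advance `state["turn"]` itself; an override that forgets to do so would never trip the turn-limit check in a loop like this one.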

VerifiableProcessorMixin

Mixin that adds verifiers integration to AgenticProcessorBase subclasses.

Provides methods for:

- Creating verifiers environments from processors
- Defining task-specific reward functions
- Converting between GAZE and verifiers formats

Usage

    class MyProcessor(VerifiableProcessorMixin, AgenticProcessorBase):
        def get_reward_function(self) -> BaseRewardFunction:
            return ExactMatchReward()

        # ... implement other abstract methods ...

    # Create verifiers environment
    env_class = MyProcessor.as_verifiers_env()

Source code in src/gaze/verifiers/mixin.py
class VerifiableProcessorMixin:
    """Mixin that adds verifiers integration to AgenticProcessorBase subclasses.

    Provides methods for:
    - Creating verifiers environments from processors
    - Defining task-specific reward functions
    - Converting between GAZE and verifiers formats

    Usage:
        class MyProcessor(VerifiableProcessorMixin, AgenticProcessorBase):
            def get_reward_function(self) -> BaseRewardFunction:
                return ExactMatchReward()

            # ... implement other abstract methods ...

        # Create verifiers environment
        env_class = MyProcessor.as_verifiers_env()
    """

    @abstractmethod
    def get_reward_function(self) -> BaseRewardFunction:
        """Return the reward function for this task.

        Must be implemented by subclasses to provide task-specific rewards.

        Returns:
            Reward function instance compatible with verifiers
        """
        ...

    @classmethod
    @beartype
    def as_verifiers_env(
        cls,
        *,
        max_turns: int = 10,
        cases: list[dict[str, Any]] | None = None,
        dataset_path: str | None = None,
        image_base_path: Path | None = None,
        **processor_kwargs: Any,
    ) -> type:
        """Create a verifiers MultiTurnEnv class from this processor.

        The returned environment class can be used directly with verifiers
        for training or evaluation.

        Args:
            max_turns: Maximum conversation turns per episode
            cases: Pre-loaded cases (optional, alternative to dataset_path)
            dataset_path: Path to JSONL dataset file
            image_base_path: Base path for resolving relative image paths
            **processor_kwargs: Arguments passed to processor __init__

        Returns:
            MultiTurnEnv subclass configured with this processor

        Example:
            EnvClass = NOVAProcessor.as_verifiers_env(
                max_turns=10,
                dataset_path="data/train.jsonl",
                model_name="openai/gpt-4o",
            )
            env = EnvClass()
        """

        from gaze.verifiers.adapter import GazeAdapter
        from gaze.verifiers.base import BaseMultiTurnEnv

        processor_cls = cls  # Capture for closure

        class _VerifiableEnv(BaseMultiTurnEnv):
            """Dynamically generated verifiers environment."""

            def __init__(
                self,
                env_cases: list[dict[str, Any]] | None = None,
                env_dataset_path: str | None = None,
                **env_kwargs: Any,
            ) -> None:
                # Create processor and adapter *before* super().__init__
                # because BaseMultiTurnEnv.__init__ calls _prepare_cases →
                # _build_prompt → get_system_prompt which needs _processor.
                self._processor = processor_cls(**processor_kwargs)
                self._adapter = GazeAdapter(
                    processor=self._processor,
                )
                self._image_base_path = image_base_path

                # Use provided cases/path or fall back to class-level defaults
                actual_cases = env_cases or cases
                actual_path = env_dataset_path or dataset_path

                super().__init__(
                    cases=actual_cases,
                    dataset_path=actual_path,
                    max_turns=max_turns,
                    name=f"{processor_cls.__name__}Env",
                    **env_kwargs,
                )

            def get_system_prompt(self) -> str:
                """Get system prompt from processor."""
                # Get a minimal system prompt for verifiers
                # Full prompt is built during processing
                return self._processor.get_system_prompt(images=[], metadata={})

            def _build_user_message(
                self,
                case: dict[str, Any],
            ) -> str | list[dict[str, Any]]:
                """Build user message with image support."""
                # Extract image path from case
                image_path = case.get("image_path") or case.get("image")

                if image_path:
                    # Resolve relative paths safely (prevent traversal)
                    if self._image_base_path and not Path(image_path).is_absolute():
                        image_path = _safe_resolve_image_path(self._image_base_path, image_path)

                    # Build multimodal message
                    text_content = self._processor.get_user_message(
                        images=[],  # Images handled separately
                        metadata=case,
                    )

                    return [
                        {"type": "text", "text": text_content},
                        {
                            "type": "image_url",
                            "image_url": {"url": _image_file_to_data_url(image_path)},
                        },
                    ]

                # Text-only case
                return self._processor.get_user_message(images=[], metadata=case)

            async def setup_state(self, state: vf.State) -> vf.State:
                """Initialize state with image info."""
                state = await super().setup_state(state)

                # Store image path for tool execution
                info = state.get("info") or {}
                image_path = info.get("image_path") or info.get("image")
                if image_path:
                    if self._image_base_path and not Path(image_path).is_absolute():
                        image_path = _safe_resolve_image_path(self._image_base_path, image_path)
                    state["image_path"] = image_path

                return state

            async def env_response(
                self,
                messages: vf.Messages,
                state: vf.State,
                **kwargs: Any,  # noqa: ARG002 - Required by vf.MultiTurnEnv interface
            ) -> vf.Messages:
                """Generate environment response using processor."""
                info = state.get("info") or {}

                # Get image path from state if available
                image_path = state.get("image_path")

                # Process using adapter with image context
                result = await self._adapter.process_verifiers_messages(
                    messages=messages,
                    info={**info, "image_path": image_path},
                )

                # Update state in-place
                state["turn"] = state.get("turn", 0) + 1
                state["tool_uses"] = state.get("tool_uses", 0) + len(result["tool_calls"])
                state["is_complete"] = result["is_complete"]

                return result["messages"]

            @vf.stop
            async def _processor_complete(self, state: vf.State) -> bool:
                """Stop when processor signals completion."""
                return state.get("is_complete", False)

            def get_reward_function(self) -> BaseRewardFunction:
                """Get reward function from processor."""
                return self._processor.get_reward_function()

        return _VerifiableEnv
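
The helpers `_safe_resolve_image_path` and `_image_file_to_data_url` are referenced above but their source is not shown on this page. The sketch below is one plausible way to get the behaviour their names and call sites suggest (reject paths that escape the base directory, then embed the file as a base64 data URL). It is an assumption for illustration, not the GAZE implementation, and both function names here are hypothetical.

```python
from __future__ import annotations

import base64
import mimetypes
import tempfile
from pathlib import Path


def safe_resolve_image_path(base: Path, relative: str) -> Path:
    """Join `relative` onto `base`, rejecting results outside `base`."""
    base_resolved = base.resolve()
    candidate = (base_resolved / relative).resolve()
    if not candidate.is_relative_to(base_resolved):  # Python 3.9+
        raise ValueError(f"Image path escapes base directory: {relative}")
    return candidate


def image_file_to_data_url(path: Path) -> str:
    """Encode a file as a data URL, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "img.png").write_bytes(b"\x89PNG")  # placeholder bytes, not a real image
    resolved = safe_resolve_image_path(base, "img.png")
    data_url = image_file_to_data_url(resolved)
    try:
        safe_resolve_image_path(base, "../secret.png")
        escaped = False
    except ValueError:
        escaped = True

# Multimodal user content in the shape built by _build_user_message.
content = [
    {"type": "text", "text": "Describe the image."},
    {"type": "image_url", "image_url": {"url": data_url}},
]
```

Resolving both paths before comparing them means `..` segments and symlinks are normalized first, which is what makes the containment check meaningful.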

get_reward_function abstractmethod

get_reward_function() -> BaseRewardFunction

Return the reward function for this task.

Must be implemented by subclasses to provide task-specific rewards.

Returns:

Type Description
BaseRewardFunction

Reward function instance compatible with verifiers

Source code in src/gaze/verifiers/mixin.py
@abstractmethod
def get_reward_function(self) -> BaseRewardFunction:
    """Return the reward function for this task.

    Must be implemented by subclasses to provide task-specific rewards.

    Returns:
        Reward function instance compatible with verifiers
    """
    ...

as_verifiers_env classmethod

as_verifiers_env(
    *,
    max_turns: int = 10,
    cases: list[dict[str, Any]] | None = None,
    dataset_path: str | None = None,
    image_base_path: Path | None = None,
    **processor_kwargs: Any,
) -> type

Create a verifiers MultiTurnEnv class from this processor.

The returned environment class can be used directly with verifiers for training or evaluation.

Parameters:

Name Type Description Default
max_turns int

Maximum conversation turns per episode

10
cases list[dict[str, Any]] | None

Pre-loaded cases (optional, alternative to dataset_path)

None
dataset_path str | None

Path to JSONL dataset file

None
image_base_path Path | None

Base path for resolving relative image paths

None
**processor_kwargs Any

Arguments passed to processor __init__

{}

Returns:

Type Description
type

MultiTurnEnv subclass configured with this processor

Example

    EnvClass = NOVAProcessor.as_verifiers_env(
        max_turns=10,
        dataset_path="data/train.jsonl",
        model_name="openai/gpt-4o",
    )
    env = EnvClass()

Source code in src/gaze/verifiers/mixin.py
@classmethod
@beartype
def as_verifiers_env(
    cls,
    *,
    max_turns: int = 10,
    cases: list[dict[str, Any]] | None = None,
    dataset_path: str | None = None,
    image_base_path: Path | None = None,
    **processor_kwargs: Any,
) -> type:
    """Create a verifiers MultiTurnEnv class from this processor.

    The returned environment class can be used directly with verifiers
    for training or evaluation.

    Args:
        max_turns: Maximum conversation turns per episode
        cases: Pre-loaded cases (optional, alternative to dataset_path)
        dataset_path: Path to JSONL dataset file
        image_base_path: Base path for resolving relative image paths
        **processor_kwargs: Arguments passed to processor __init__

    Returns:
        MultiTurnEnv subclass configured with this processor

    Example:
        EnvClass = NOVAProcessor.as_verifiers_env(
            max_turns=10,
            dataset_path="data/train.jsonl",
            model_name="openai/gpt-4o",
        )
        env = EnvClass()
    """

    from gaze.verifiers.adapter import GazeAdapter
    from gaze.verifiers.base import BaseMultiTurnEnv

    processor_cls = cls  # Capture for closure

    class _VerifiableEnv(BaseMultiTurnEnv):
        """Dynamically generated verifiers environment."""

        def __init__(
            self,
            env_cases: list[dict[str, Any]] | None = None,
            env_dataset_path: str | None = None,
            **env_kwargs: Any,
        ) -> None:
            # Create processor and adapter *before* super().__init__
            # because BaseMultiTurnEnv.__init__ calls _prepare_cases →
            # _build_prompt → get_system_prompt which needs _processor.
            self._processor = processor_cls(**processor_kwargs)
            self._adapter = GazeAdapter(
                processor=self._processor,
            )
            self._image_base_path = image_base_path

            # Use provided cases/path or fall back to class-level defaults
            actual_cases = env_cases or cases
            actual_path = env_dataset_path or dataset_path

            super().__init__(
                cases=actual_cases,
                dataset_path=actual_path,
                max_turns=max_turns,
                name=f"{processor_cls.__name__}Env",
                **env_kwargs,
            )

        def get_system_prompt(self) -> str:
            """Get system prompt from processor."""
            # Get a minimal system prompt for verifiers
            # Full prompt is built during processing
            return self._processor.get_system_prompt(images=[], metadata={})

        def _build_user_message(
            self,
            case: dict[str, Any],
        ) -> str | list[dict[str, Any]]:
            """Build user message with image support."""
            # Extract image path from case
            image_path = case.get("image_path") or case.get("image")

            if image_path:
                # Resolve relative paths safely (prevent traversal)
                if self._image_base_path and not Path(image_path).is_absolute():
                    image_path = _safe_resolve_image_path(self._image_base_path, image_path)

                # Build multimodal message
                text_content = self._processor.get_user_message(
                    images=[],  # Images handled separately
                    metadata=case,
                )

                return [
                    {"type": "text", "text": text_content},
                    {
                        "type": "image_url",
                        "image_url": {"url": _image_file_to_data_url(image_path)},
                    },
                ]

            # Text-only case
            return self._processor.get_user_message(images=[], metadata=case)

        async def setup_state(self, state: vf.State) -> vf.State:
            """Initialize state with image info."""
            state = await super().setup_state(state)

            # Store image path for tool execution
            info = state.get("info") or {}
            image_path = info.get("image_path") or info.get("image")
            if image_path:
                if self._image_base_path and not Path(image_path).is_absolute():
                    image_path = _safe_resolve_image_path(self._image_base_path, image_path)
                state["image_path"] = image_path

            return state

        async def env_response(
            self,
            messages: vf.Messages,
            state: vf.State,
            **kwargs: Any,  # noqa: ARG002 - Required by vf.MultiTurnEnv interface
        ) -> vf.Messages:
            """Generate environment response using processor."""
            info = state.get("info") or {}

            # Get image path from state if available
            image_path = state.get("image_path")

            # Process using adapter with image context
            result = await self._adapter.process_verifiers_messages(
                messages=messages,
                info={**info, "image_path": image_path},
            )

            # Update state in-place
            state["turn"] = state.get("turn", 0) + 1
            state["tool_uses"] = state.get("tool_uses", 0) + len(result["tool_calls"])
            state["is_complete"] = result["is_complete"]

            return result["messages"]

        @vf.stop
        async def _processor_complete(self, state: vf.State) -> bool:
            """Stop when processor signals completion."""
            return state.get("is_complete", False)

        def get_reward_function(self) -> BaseRewardFunction:
            """Get reward function from processor."""
            return self._processor.get_reward_function()

    return _VerifiableEnv
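
The listing above relies on two helpers, `_safe_resolve_image_path` and `_image_file_to_data_url`, whose implementations are not shown in this section. The sketch below is one plausible way such helpers could work, written under that assumption: `safe_resolve`, `file_to_data_url`, and `build_user_message` are hypothetical names, and the real helpers may differ in behavior and error types. It illustrates the shape of the multimodal message that `_build_user_message` returns.

```python
from __future__ import annotations

import base64
import mimetypes
import tempfile
from pathlib import Path
from typing import Any


def safe_resolve(base: Path, relative: str) -> Path:
    """Resolve `relative` under `base`, rejecting paths that escape it (assumed behavior)."""
    base_resolved = base.resolve()
    candidate = (base_resolved / relative).resolve()
    if not candidate.is_relative_to(base_resolved):
        raise ValueError(f"path escapes base directory: {relative!r}")
    return candidate


def file_to_data_url(path: Path) -> str:
    """Encode a file as a base64 data URL, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def build_user_message(text: str, base: Path, image: str | None) -> str | list[dict[str, Any]]:
    """Return plain text, or a text + image_url content list when an image is given."""
    if not image:
        return text
    resolved = safe_resolve(base, image)
    return [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": file_to_data_url(resolved)}},
    ]


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "scan.png").write_bytes(b"\x89PNG")  # 4 placeholder bytes, not a valid image
    message = build_user_message("Describe the image.", base, "scan.png")
    text_only = build_user_message("No image here.", base, None)
    try:
        build_user_message("x", base, "../outside.png")
        traversal_blocked = False
    except ValueError:
        traversal_blocked = True
```

Resolving both the base and the candidate before comparing them means symlinks and `..` segments are normalized first, which is what makes the containment check meaningful.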

extract_completion_text

extract_completion_text(completion: Any) -> str

Extract text content from a verifiers completion.

Handles multiple formats:

- Plain string
- Message list with assistant role
- Multimodal content lists

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `completion` | `Any` | Model completion in any supported format | required |

Returns:

| Type | Description |
| --- | --- |
| `str` | Extracted text content |

Source code in src/gaze/verifiers/rewards.py
@beartype
def extract_completion_text(completion: Any) -> str:
    """Extract text content from a verifiers completion.

    Handles multiple formats:
    - Plain string
    - Message list with assistant role
    - Multimodal content lists

    Args:
        completion: Model completion in any supported format

    Returns:
        Extracted text content
    """
    if isinstance(completion, str):
        return completion

    if isinstance(completion, list):
        # Find last assistant message
        for msg in reversed(completion):
            if not isinstance(msg, dict) or msg.get("role") != "assistant":
                continue

            content = msg.get("content", "")
            if isinstance(content, str):
                return content

            # Handle multimodal content — concatenate ALL text items,
            # not just the first, so reasoning + answer are both captured.
            if isinstance(content, list):
                texts = [
                    item.get("text", "")
                    for item in content
                    if isinstance(item, dict) and item.get("type") == "text"
                ]
                # Return joined texts, or "" if the assistant message had no
                # text items (empty multimodal content).  This prevents the
                # final fallback from stringifying the whole message list.
                return "\n".join(texts) if texts else ""

    return str(completion or "")
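
To make the supported formats concrete, the sketch below runs a compact stand-in through each of them. `extract_text` is a local copy of the logic listed above, without `@beartype`, so that the snippet runs without GAZE installed. With the package available, `gaze.verifiers.rewards.extract_completion_text` should presumably behave the same way.

```python
from typing import Any


def extract_text(completion: Any) -> str:
    """Local stand-in mirroring extract_completion_text as listed above."""
    if isinstance(completion, str):
        return completion
    if isinstance(completion, list):
        # Walk backwards so the most recent assistant message wins.
        for msg in reversed(completion):
            if not isinstance(msg, dict) or msg.get("role") != "assistant":
                continue
            content = msg.get("content", "")
            if isinstance(content, str):
                return content
            if isinstance(content, list):
                # Join every text item; non-text items (e.g. images) are skipped.
                return "\n".join(
                    item.get("text", "")
                    for item in content
                    if isinstance(item, dict) and item.get("type") == "text"
                )
    return str(completion or "")


plain = extract_text("42")

chat = extract_text(
    [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "3"},
        {"role": "user", "content": "Try again."},
        {"role": "assistant", "content": "4"},
    ]
)

multimodal = extract_text(
    [
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "Reasoning"},
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
                {"type": "text", "text": "Answer: cat"},
            ],
        }
    ]
)

image_only = extract_text(
    [{"role": "assistant", "content": [{"type": "image_url", "image_url": {"url": "x"}}]}]
)

nothing = extract_text(None)
```

Here `chat` is the content of the last assistant message, `multimodal` is the two text items joined with a newline, and both `image_only` and `nothing` are empty strings.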

__getattr__

__getattr__(name: str)

Lazy import for verifiers-dependent symbols.

GazeAdapter, BaseMultiTurnEnv, and VerifiableProcessorMixin pull in verifiers and datasets at import time. We defer those imports so that lightweight consumers (e.g. reward functions only) never pay the cost.

Source code in src/gaze/verifiers/__init__.py
def __getattr__(name: str):
    """Lazy import for verifiers-dependent symbols.

    ``GazeAdapter``, ``BaseMultiTurnEnv``, and
    ``VerifiableProcessorMixin`` pull in ``verifiers`` and ``datasets``
    at import time.  We defer those imports so that lightweight consumers
    (e.g. reward functions only) never pay the cost.
    """
    if name == "GazeAdapter":
        from .adapter import GazeAdapter

        return GazeAdapter
    if name == "BaseMultiTurnEnv":
        from .base import BaseMultiTurnEnv

        return BaseMultiTurnEnv
    if name == "VerifiableProcessorMixin":
        from .mixin import VerifiableProcessorMixin

        return VerifiableProcessorMixin
    msg = f"module {__name__!r} has no attribute {name!r}"
    raise AttributeError(msg)
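
The mechanism used here is the module-level `__getattr__` hook from PEP 562: Python calls it only when normal attribute lookup on the module fails. The sketch below demonstrates the same mechanism in a single file by attaching the hook to a module object created at runtime. `make_lazy_module` and `lazy_demo` are hypothetical names for illustration, and a standard-library target stands in for the heavy optional dependencies.

```python
import importlib
import types


def make_lazy_module(name: str, lazy_targets: dict[str, tuple[str, str]]) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        # Called only when `attr` is not found in the module's namespace.
        if attr in lazy_targets:
            target_module, target_attr = lazy_targets[attr]
            return getattr(importlib.import_module(target_module), target_attr)
        msg = f"module {name!r} has no attribute {attr!r}"
        raise AttributeError(msg)

    # Storing the function in the module namespace activates the PEP 562 hook.
    module.__getattr__ = __getattr__
    return module


demo = make_lazy_module("lazy_demo", {"JSONDecoder": ("json", "JSONDecoder")})

resolved = demo.JSONDecoder  # triggers the deferred import of `json`

try:
    demo.DoesNotExist
    missing_raises = False
except AttributeError:
    missing_raises = True
```

Raising `AttributeError` for unknown names, as the original does, keeps `hasattr` checks and error messages consistent with ordinary modules.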