
MedMarks integration

This page covers the GAZE integration with MedMarks, an evaluation suite for medical LLMs developed by Sophont, MedARC, and Prime Intellect.

Architecture

MedMarks Platform (medmarks.ai, medarc-eval CLI)
        |
Verifiers Framework (vf.MultiTurnEnv, vf.Rubric)
        |
NOVA Brain MRI Environment (environments/nova_brain_mri/)
        |
GAZE (AgenticProcessorBase, VerifiableProcessorMixin)

The environments/nova_brain_mri/ package provides a MedMarks-compatible environment for the NOVA brain-MRI benchmark. It wraps the verifiers framework and uses GAZE reward utilities.

Installation

# Install gaze-vlm with MedMarks dependencies (from source).
# The quotes keep shells such as zsh from expanding the brackets.
pip install -e ".[medmarks]"

# Install the NOVA brain-MRI environment
cd environments/nova_brain_mri
pip install -e .

Usage

Via medarc-eval CLI

medarc-eval nova-brain-mri -m gpt-4o -n 100
medarc-eval nova-brain-mri -m gpt-4o --task diagnosis -n 50
medarc-eval nova-brain-mri -m gpt-4o --use-tools --max-turns 10

Via Python API

See environments/nova_brain_mri/README.md for the Python API and configuration reference.

Via GAZE processor

from pathlib import Path

from examples.nova.src.processor import NOVAAgenticProcessor

processor = NOVAAgenticProcessor(
    model_name="openai/gpt-4o",
    use_tools=True,
    use_web_search=True,
    max_turns=10,
    task="all",
)

EnvClass = processor.as_verifiers_env(
    dataset_path="data/nova_test.jsonl",
    image_base_path=Path("data/images"),
)
env = EnvClass()

Configuration

Parameter       Type   Default  Description
split           str    "test"   Dataset split
task            str    "all"    Task: caption, diagnosis, localization, all
max_turns       int    10       Maximum conversation turns
use_tools       bool   True     Enable visual tools
use_web_search  bool   False    Enable PubMed search
iou_threshold   float  0.5      IoU threshold for localization
data_dir        str    None     Custom dataset directory
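
The parameters and defaults in the table can be mirrored in a small config object, which is handy for validating values before building an environment. This is a minimal sketch only: NovaEnvConfig and its validation rules are hypothetical and are not part of the environment's API (see environments/nova_brain_mri/README.md for the real configuration reference).

```python
from dataclasses import dataclass
from typing import Optional

# Task names as listed in the configuration table.
VALID_TASKS = ("caption", "diagnosis", "localization", "all")


@dataclass
class NovaEnvConfig:
    """Hypothetical container mirroring the documented defaults."""

    split: str = "test"
    task: str = "all"
    max_turns: int = 10
    use_tools: bool = True
    use_web_search: bool = False
    iou_threshold: float = 0.5
    data_dir: Optional[str] = None

    def __post_init__(self) -> None:
        # These checks are assumptions about sensible ranges,
        # not documented behavior of the environment.
        if self.task not in VALID_TASKS:
            raise ValueError(f"task must be one of {VALID_TASKS}, got {self.task!r}")
        if self.max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        if not 0.0 < self.iou_threshold <= 1.0:
            raise ValueError("iou_threshold must be in (0, 1]")


config = NovaEnvConfig(task="diagnosis", max_turns=5)
```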

Reward functions

The environment provides three task-specific rewards:

  • Caption: token-level F1 between predicted and reference captions
  • Diagnosis: 60% top-1 accuracy + 40% coverage of reference diagnoses (with normalization)
  • Localization: detection F1 using greedy best-IoU matching at the configured threshold
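
The three rewards above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the GAZE implementation: the function names, whitespace tokenization, the normalization step, the (x1, y1, x2, y2) box format, and the handling of empty inputs are all assumed here.

```python
from collections import Counter

Box = tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (assumed tokenization)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only when both are empty
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def normalize(label: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(label.lower().split())


def diagnosis_reward(predicted: list[str], reference: list[str]) -> float:
    """0.6 * top-1 accuracy + 0.4 * coverage of the reference diagnoses."""
    preds = [normalize(p) for p in predicted]
    refs = {normalize(r) for r in reference}
    if not refs:
        return 0.0
    top1 = 1.0 if preds and preds[0] in refs else 0.0
    coverage = len(refs & set(preds)) / len(refs)
    return 0.6 * top1 + 0.4 * coverage


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred_boxes: list[Box], ref_boxes: list[Box],
                 iou_threshold: float = 0.5) -> float:
    """Detection F1 with greedy best-IoU matching at the given threshold."""
    if not pred_boxes and not ref_boxes:
        return 1.0  # assumption: nothing to find and nothing predicted
    if not pred_boxes or not ref_boxes:
        return 0.0
    # Score every (prediction, reference) pair and visit the best IoU first.
    pairs = sorted(
        ((iou(p, r), i, j)
         for i, p in enumerate(pred_boxes)
         for j, r in enumerate(ref_boxes)),
        reverse=True,
    )
    matched_pred, matched_ref = set(), set()
    for score, i, j in pairs:
        if score < iou_threshold:
            break  # all remaining pairs are below the threshold
        if i in matched_pred or j in matched_ref:
            continue  # each box can be matched at most once
        matched_pred.add(i)
        matched_ref.add(j)
    true_pos = len(matched_pred)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(pred_boxes), true_pos / len(ref_boxes)
    return 2 * precision * recall / (precision + recall)
```

Greedy matching takes the highest-IoU pair first and never revisits it, so it is simple and fast but may not find the matching that maximizes the number of true positives when boxes overlap heavily.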

Combined rubric weights: 33% caption, 34% diagnosis, 33% localization.
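
As a rough illustration of how these weights combine per-task scores, the sketch below computes a weighted sum. The combined_reward helper and the choice to count a missing task as 0 are assumptions made for this example; in the environment itself the weighting is presumably handled by the verifiers rubric (vf.Rubric).

```python
# Weights as stated above; they sum to 1.0.
WEIGHTS = {"caption": 0.33, "diagnosis": 0.34, "localization": 0.33}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-task scores in [0, 1].

    Assumption: a task missing from `scores` contributes 0.
    """
    return sum(weight * scores.get(task, 0.0) for task, weight in WEIGHTS.items())


example = combined_reward({"caption": 0.5, "diagnosis": 1.0, "localization": 0.0})
```

With these inputs, the caption score contributes 0.165 and the diagnosis score 0.34, for a combined reward of about 0.505.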

Creating custom environments

Any AgenticProcessorBase subclass with VerifiableProcessorMixin can become a MedMarks-compatible environment:

from gaze import AgenticProcessorBase
from gaze.verifiers import VerifiableProcessorMixin, BaseRewardFunction

class MyMedicalProcessor(VerifiableProcessorMixin, AgenticProcessorBase):
    def get_system_prompt(self, images, metadata):
        return "You are a medical imaging expert."

    def get_user_message(self, images, metadata):
        return "Analyze this medical image."

    def get_response_schema(self):
        return None

    def validate_response(self, response):
        return "diagnosis" in response

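    def get_reward_function(self) -> BaseRewardFunction:
        # MyCustomReward is a placeholder: define your own
        # BaseRewardFunction subclass before using this processor.
        return MyCustomReward()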

EnvClass = MyMedicalProcessor.as_verifiers_env(
    max_turns=10,
    dataset_path="my_dataset.jsonl",
)

Resources