MedMarks integration
This page describes how GAZE integrates with MedMarks, an evaluation suite for medical LLMs developed by Sophont, MedARC, and Prime Intellect.
Architecture
MedMarks Platform (medmarks.ai, medarc-eval CLI)
|
Verifiers Framework (vf.MultiTurnEnv, vf.Rubric)
|
NOVA Brain MRI Environment (environments/nova_brain_mri/)
|
GAZE (AgenticProcessorBase, VerifiableProcessorMixin)
The environments/nova_brain_mri/ package provides a MedMarks-compatible environment for the NOVA brain-MRI benchmark. It wraps the verifiers framework and uses GAZE reward utilities.
Installation
# Install gaze-vlm with MedMarks dependencies (from source)
pip install -e ".[medmarks]"
# Install the NOVA brain-MRI environment
cd environments/nova_brain_mri
pip install -e .
Usage
Via medarc-eval CLI
medarc-eval nova-brain-mri -m gpt-4o -n 100
medarc-eval nova-brain-mri -m gpt-4o --task diagnosis -n 50
medarc-eval nova-brain-mri -m gpt-4o --use-tools --max-turns 10
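When evaluations are launched from a script, the same invocations can be assembled programmatically. The sketch below is a hypothetical helper (`build_medarc_command` is not part of MedMarks or GAZE); it only uses the flags shown in the commands above and makes no assumption about any other options the CLI may accept.

```python
import shlex
from typing import Optional


def build_medarc_command(
    model: str,
    n: Optional[int] = None,
    task: Optional[str] = None,
    use_tools: bool = False,
    max_turns: Optional[int] = None,
) -> list[str]:
    """Build an argv list for `medarc-eval nova-brain-mri` (hypothetical helper)."""
    cmd = ["medarc-eval", "nova-brain-mri", "-m", model]
    if task is not None:
        cmd += ["--task", task]
    if n is not None:
        cmd += ["-n", str(n)]
    if use_tools:
        cmd.append("--use-tools")
    if max_turns is not None:
        cmd += ["--max-turns", str(max_turns)]
    return cmd


# Reproduces the second command shown above.
print(shlex.join(build_medarc_command("gpt-4o", n=50, task="diagnosis")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.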
Via Python API
See environments/nova_brain_mri/README.md for the Python API and configuration reference.
Via GAZE processor
from pathlib import Path

from examples.nova.src.processor import NOVAAgenticProcessor

# Configure the NOVA processor with tools and web search enabled.
processor = NOVAAgenticProcessor(
    model_name="openai/gpt-4o",
    use_tools=True,
    use_web_search=True,
    max_turns=10,
    task="all",
)

# Build a verifiers environment class from the processor, then instantiate it.
EnvClass = processor.as_verifiers_env(
    dataset_path="data/nova_test.jsonl",
    image_base_path=Path("data/images"),
)
env = EnvClass()
Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `split` | str | `"test"` | Dataset split |
| `task` | str | `"all"` | Task: caption, diagnosis, localization, all |
| `max_turns` | int | `10` | Maximum conversation turns |
| `use_tools` | bool | `True` | Enable visual tools |
| `use_web_search` | bool | `False` | Enable PubMed search |
| `iou_threshold` | float | `0.5` | IoU threshold for localization |
| `data_dir` | str | `None` | Custom dataset directory |
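To catch typos before a run starts, the parameters in the table above can be mirrored in a small validated container. This is only a sketch: `NovaEnvConfig` is a hypothetical name, not a class exported by the environment package, and the range checks are assumptions about what counts as a sensible value. The field names and defaults are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Task names listed in the configuration table.
VALID_TASKS = ("caption", "diagnosis", "localization", "all")


@dataclass(frozen=True)
class NovaEnvConfig:
    """Hypothetical container mirroring the documented parameters and defaults."""

    split: str = "test"
    task: str = "all"
    max_turns: int = 10
    use_tools: bool = True
    use_web_search: bool = False
    iou_threshold: float = 0.5
    data_dir: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real environment may validate differently.
        if self.task not in VALID_TASKS:
            raise ValueError(f"task must be one of {VALID_TASKS}, got {self.task!r}")
        if self.max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        if not 0.0 <= self.iou_threshold <= 1.0:
            raise ValueError("iou_threshold must lie in [0, 1]")


config = NovaEnvConfig(task="localization", iou_threshold=0.75)
print(config)
```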
Reward functions
The environment provides three task-specific rewards:
- Caption: token-level F1 between predicted and reference captions
- Diagnosis: 60% top-1 accuracy + 40% coverage of reference diagnoses (with normalization)
- Localization: detection F1 using greedy best-IoU matching at the configured threshold
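The three rewards above can be sketched in plain Python. This is a minimal illustration, not the environment's actual implementation: the word-level tokenizer, the `normalize` helper, the `(x1, y1, x2, y2)` box format, the `>=` threshold comparison, the handling of empty inputs, and all function names are assumptions. Greedy matching is interpreted here as accepting candidate pairs in order of decreasing IoU.

```python
import re
from collections import Counter

Box = tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 on lowercased word tokens (assumed tokenizer)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0  # assumption: two empty captions match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return f1(overlap / len(pred), overlap / len(ref))


def normalize(label: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.findall(r"\w+", label.lower()))


def diagnosis_score(predicted: list[str], references: list[str]) -> float:
    """0.6 * top-1 accuracy + 0.4 * coverage of the reference diagnoses."""
    preds = [normalize(p) for p in predicted]
    refs = {normalize(r) for r in references}
    if not preds or not refs:
        return 0.0
    top1 = 1.0 if preds[0] in refs else 0.0
    coverage = len(refs & set(preds)) / len(refs)
    return 0.6 * top1 + 0.4 * coverage


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(predicted: list[Box], references: list[Box], iou_threshold: float = 0.5) -> float:
    """Detection F1 with greedy one-to-one matching, best IoU first."""
    if not predicted and not references:
        return 1.0  # assumption: nothing to find and nothing predicted
    if not predicted or not references:
        return 0.0
    pairs = sorted(
        ((iou(p, r), i, j) for i, p in enumerate(predicted) for j, r in enumerate(references)),
        reverse=True,
    )
    used_pred: set[int] = set()
    used_ref: set[int] = set()
    for score, i, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs are all below the threshold
        if i not in used_pred and j not in used_ref:
            used_pred.add(i)
            used_ref.add(j)
    matches = len(used_pred)
    return f1(matches / len(predicted), matches / len(references))
```

Because each predicted box can match at most one reference box, duplicate predictions of the same lesion lower precision rather than inflating the score.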
Combined rubric weights: 33% caption, 34% diagnosis, 33% localization.
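As a rough illustration of how the combined score follows from these weights, the sketch below computes a weighted sum of per-task scores. The dictionary keys and the `combined_reward` function are illustrative names only; the actual rubric is built with the verifiers framework and may aggregate differently.

```python
# Weights as stated above; the keys are illustrative names.
RUBRIC_WEIGHTS = {"caption": 0.33, "diagnosis": 0.34, "localization": 0.33}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-task scores, each assumed to lie in [0, 1]."""
    missing = RUBRIC_WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing task scores: {sorted(missing)}")
    return sum(weight * scores[task] for task, weight in RUBRIC_WEIGHTS.items())


example = {"caption": 0.5, "diagnosis": 1.0, "localization": 0.0}
print(round(combined_reward(example), 3))
```

Since the weights sum to 1, a model that scores perfectly on all three tasks receives a combined reward of 1.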
Creating custom environments
Any AgenticProcessorBase subclass that also inherits from VerifiableProcessorMixin can be exposed as a MedMarks-compatible environment:
from gaze import AgenticProcessorBase
from gaze.verifiers import VerifiableProcessorMixin, BaseRewardFunction


class MyMedicalProcessor(VerifiableProcessorMixin, AgenticProcessorBase):
    def get_system_prompt(self, images, metadata):
        return "You are a medical imaging expert."

    def get_user_message(self, images, metadata):
        return "Analyze this medical image."

    def get_response_schema(self):
        # No structured output schema is enforced.
        return None

    def validate_response(self, response):
        # Accept only responses that mention a diagnosis.
        return "diagnosis" in response

    def get_reward_function(self) -> BaseRewardFunction:
        # MyCustomReward is a placeholder for your own BaseRewardFunction
        # subclass; it is not defined in this snippet.
        return MyCustomReward()


EnvClass = MyMedicalProcessor.as_verifiers_env(
    max_turns=10,
    dataset_path="my_dataset.jsonl",
)