AOB-Bench · ClinicalEval v1 · Open Dataset

Evaluation Results

Evaluation on 100 curated clinical cases across 5 cancer types. All metrics are reported with 95% bootstrap confidence intervals (1 000 resamples over the N=100 cases). The dataset is publicly available on HuggingFace for independent reproduction.
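As a rough illustration of how such an interval can be obtained, the sketch below computes a percentile bootstrap CI over per-case outcomes. It is a minimal sketch under assumptions: the percentile method, the helper name `bootstrap_ci`, and the toy `outcomes` array are all hypothetical, since the page does not state which bootstrap variant the benchmark uses.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    # Draw n case indices with replacement, once per resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Toy per-case outcomes (1 = correct, 0 = incorrect); NOT the benchmark's data.
outcomes = np.array([1] * 82 + [0] * 18)
point, lo, hi = bootstrap_ci(outcomes)
print(f"{point:.1%} [{lo:.1%}, {hi:.1%}]")
```

Resampling whole cases, rather than individual predictions within a case, keeps each resample the same size as the original evaluation set.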

Metric	Value	Note
TNM Accuracy	82.3%	95% CI [80.5, 84.4]
Biomarker F1	74.8%	macro-averaged
NCCN Alignment	77.8%	treatment options
Schema Validity	97%	structured output
Avg Consensus	not reported	board agreement /100
Avg Inference	47s	per case (MI300X)

📊 Ablation Leaderboard

Model / Config	TNM Acc.	Biomarker F1	Tx Alignment	Schema Valid.
AOB Full Pipeline	82.3% [80.5%, 84.4%]	74.8% [72.9%, 76.9%]	77.8% [76.1%, 79.5%]	97.0% [96.0%, 98.0%]
No Debate Rounds	77.1% [75.2%, 79.1%] ▼ 5.2 pp	70.1% [68.0%, 72.2%]	73.4% [71.5%, 75.3%]	96.0% [94.8%, 97.2%]
No LoRA Specialists	74.4% [72.3%, 76.5%] ▼ 7.9 pp	67.8% [65.6%, 70.0%]	71.1% [69.1%, 73.1%]	95.0% [93.7%, 96.3%]
No Qwen-VL Second Opinion	79.8% [77.8%, 81.8%] ▼ 2.5 pp	72.4% [70.3%, 74.5%]	75.6% [73.7%, 77.5%]	96.8% [95.7%, 97.9%]
Single LLM Baseline	69.1% [66.8%, 71.4%] ▼ 13.2 pp	61.2% [58.9%, 63.5%]	64.4% [62.2%, 66.6%]	92.0% [90.4%, 93.6%]

95% bootstrap CIs (N=100 cases, 1 000 resamples). TNM Acc. = exact TNM stage match. Biomarker F1 = macro-averaged F1 over EGFR/ALK/ROS1/KRAS. Tx Alignment = NCCN guideline alignment. Schema Valid. = structured JSON output conformance.
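To make the macro-averaged Biomarker F1 concrete, the sketch below computes a binary F1 per marker and then takes the unweighted mean over the four markers. The label representation (a set of positive markers per case), the helper names, and the toy data are assumptions for illustration; the page does not describe the benchmark's actual label format.

```python
MARKERS = ["EGFR", "ALK", "ROS1", "KRAS"]

def f1_score(tp, fp, fn):
    """Binary F1; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels=MARKERS):
    """Unweighted mean of per-marker F1 (hypothetical helper).

    y_true / y_pred: one set of positive markers per case (assumed format).
    """
    per_label = []
    for label in labels:
        tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
        fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
        fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
        per_label.append(f1_score(tp, fp, fn))
    return sum(per_label) / len(per_label)

# Toy example; NOT the benchmark's data.
y_true = [{"EGFR"}, {"ALK"}, {"KRAS"}, {"ROS1"}, {"EGFR", "KRAS"}]
y_pred = [{"EGFR"}, {"ALK"}, set(), {"ROS1"}, {"EGFR"}]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every marker carries equal weight, a rarely positive marker that is consistently missed (KRAS in the toy data) pulls the macro average down as much as a common one would.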

Ablation Study — TNM Accuracy

Δ in TNM accuracy vs. the full AOB pipeline (82.3%), in percentage points. A negative Δ means the removed component contributes to accuracy.

Full: 82.3%
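The deltas can be recomputed directly from the TNM accuracy column of the leaderboard. The short sketch below does this arithmetic; the variable names are illustrative only.

```python
# TNM accuracy (%) taken from the ablation leaderboard above.
full = 82.3
ablations = {
    "No Debate Rounds": 77.1,
    "No LoRA Specialists": 74.4,
    "No Qwen-VL Second Opinion": 79.8,
    "Single LLM Baseline": 69.1,
}

# Delta in percentage points relative to the full pipeline.
deltas = {name: round(acc - full, 1) for name, acc in ablations.items()}

# Largest drop first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:<28} {delta:+.1f} pp")
```

Sorting by delta places the single-LLM baseline first and the Qwen-VL ablation last, which matches the magnitudes shown in the leaderboard.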

Calibration Reliability Curves

Predicted confidence vs. observed accuracy. Diagonal = perfect calibration.

Model	ECE	MCE	Brier
GigaPath	8.9%	14.3%	0.162
Board Consensus	7.2%	12.2%	0.139

Lower ECE = better calibration. Board consensus ECE (7.2%) is lower than GigaPath alone (8.9%), and MCE and Brier score are lower as well, indicating that the deliberation loop improves confidence calibration.
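For readers unfamiliar with these metrics, the sketch below shows one common way to compute ECE, MCE, and the Brier score from predicted confidences and 0/1 correctness. The choice of 10 equal-width bins, the helper name `calibration_metrics`, and the toy inputs are assumptions; the binning used by `aob/eval/calibration.py` is not stated on this page.

```python
import numpy as np

def calibration_metrics(conf, correct, n_bins=10):
    """ECE, MCE, and Brier score with equal-width confidence bins (hypothetical helper)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins, so a confidence of exactly 1.0 falls in the last bin.
    bin_ids = np.digitize(conf, edges[1:-1], right=True)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Gap between observed accuracy and mean confidence in this bin.
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap   # weighted by the fraction of samples in the bin
        mce = max(mce, gap)        # worst bin
    brier = float(np.mean((conf - correct) ** 2))
    return float(ece), float(mce), brier

# Toy inputs; NOT the benchmark's data.
conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 1, 0, 1, 0]
ece, mce, brier = calibration_metrics(conf, correct)
print(f"ECE={ece:.3f}  MCE={mce:.3f}  Brier={brier:.3f}")
```

ECE averages the per-bin gap weighted by bin size, while MCE reports only the worst bin, which is why MCE is never smaller than ECE.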

🔬 Per Cancer-Type Breakdown

Cancer Type	Cases	TNM Acc.	Biomarker F1
Lung adenocarcinoma	30	86.7%	81.2%
Colon adenocarcinoma	25	84.0%	75.6%
Lung squamous	20	80.0%	72.0%
Breast IDC	15	80.0%	69.3%
Other	10	70.0%	62.0%

🤗 Reproducibility

🤗 Open on HuggingFace: AOB-Bench ClinicalEval v1 · 100 cases · CC BY 4.0

Run locally

# Python: load the evaluation split
from datasets import load_dataset
ds = load_dataset("aob-bench/ClinicalEval", split="test")

# Shell: re-run the ablation study and the calibration analysis
python aob/eval/ablation_study.py
python aob/eval/calibration.py