AOB-Bench · ClinicalEval v1 · Open Dataset

Evaluation Results

Rigorous evaluation on 100 curated clinical cases across 5 cancer types. All metrics are reported with 95% bootstrap confidence intervals (1 000 resamples). The dataset is publicly available on HuggingFace so that results can be reproduced independently.

TNM Accuracy: 82.3% (95% CI [80.5, 84.4])
Biomarker F1: 74.8% (macro-averaged)
NCCN Alignment: 77.8% (treatment options)
Schema Validity: 97% (structured output)
Avg Consensus: 82/100 (board agreement)
Avg Inference: 47s per case (MI300X)
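For readers unfamiliar with bootstrap intervals, the following is a minimal sketch of a percentile bootstrap over per-case correctness. It assumes each case yields a 0/1 outcome and that cases are resampled with replacement; the function name `bootstrap_ci` and the toy outcomes are illustrative only and are not taken from the AOB evaluation code.

```python
import random

def bootstrap_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample n cases with replacement and record the accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Approximate percentile indices into the sorted resample means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi

# Toy outcomes (hypothetical, not benchmark data): 16 of 20 cases correct.
outcomes = [1] * 16 + [0] * 4
point, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy={point:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

The fixed seed only makes the sketch repeatable; the actual evaluation scripts may use a different resampling scheme.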

📊 Ablation Leaderboard

Model / Config             TNM Acc.                        Biomarker F1          Tx Alignment          Schema Valid.
AOB Full Pipeline          82.3% [80.5%, 84.4%]            74.8% [72.9%, 76.9%]  77.8% [76.1%, 79.5%]  97.0% [96.0%, 98.0%]
No Debate Rounds           77.1% [75.2%, 79.1%] ▼ 5.2 pp   70.1% [68.0%, 72.2%]  73.4% [71.5%, 75.3%]  96.0% [94.8%, 97.2%]
No LoRA Specialists        74.4% [72.3%, 76.5%] ▼ 7.9 pp   67.8% [65.6%, 70.0%]  71.1% [69.1%, 73.1%]  95.0% [93.7%, 96.3%]
No Qwen-VL Second Opinion  79.8% [77.8%, 81.8%] ▼ 2.5 pp   72.4% [70.3%, 74.5%]  75.6% [73.7%, 77.5%]  96.8% [95.7%, 97.9%]
Single LLM Baseline        69.1% [66.8%, 71.4%] ▼ 13.2 pp  61.2% [58.9%, 63.5%]  64.4% [62.2%, 66.6%]  92.0% [90.4%, 93.6%]
95% bootstrap CIs (N=100 cases, 1 000 resamples). TNM Acc. = exact TNM stage match. Biomarker F1 = macro-averaged F1 over EGFR/ALK/ROS1/KRAS. Tx Alignment = NCCN guideline alignment. Schema Valid. = structured JSON output conformance.
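As an illustration of the macro-averaged Biomarker F1 described above, here is a minimal sketch that computes a binary F1 per biomarker and then takes the unweighted mean. It assumes each biomarker is scored as a per-case positive/negative call, which the page does not confirm; the helper names and the toy labels are hypothetical.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(truth, pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_score(truth[l], pred[l]) for l in labels) / len(labels)

BIOMARKERS = ["EGFR", "ALK", "ROS1", "KRAS"]

# Toy labels for four cases (hypothetical, not benchmark data).
truth = {"EGFR": [1, 1, 0, 0], "ALK": [0, 1, 0, 0],
         "ROS1": [0, 0, 1, 0], "KRAS": [1, 0, 0, 1]}
pred = {"EGFR": [1, 0, 0, 0], "ALK": [0, 1, 0, 0],
        "ROS1": [0, 0, 1, 1], "KRAS": [1, 0, 0, 1]}

score = macro_f1(truth, pred, BIOMARKERS)
print(f"macro F1 = {score:.3f}")
```

Macro averaging gives every biomarker equal weight regardless of how often it is positive, so a rare marker affects the score as much as a common one.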
Ablation Study: TNM Accuracy
Δ vs. full AOB pipeline (82.3%). A negative Δ means the removed component contributes to accuracy.
Full: 82.3%
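The deltas can be recomputed directly from the TNM accuracies reported in the leaderboard. The short sketch below does only that arithmetic; the variable names are illustrative, and the differences are expressed in percentage points.

```python
FULL_TNM_ACC = 82.3  # full AOB pipeline, from the leaderboard

# TNM accuracy (%) of each ablated configuration, from the leaderboard.
ablations = {
    "No Debate Rounds": 77.1,
    "No LoRA Specialists": 74.4,
    "No Qwen-VL Second Opinion": 79.8,
    "Single LLM Baseline": 69.1,
}

# Delta = ablated accuracy minus full-pipeline accuracy (percentage points).
deltas = {name: round(acc - FULL_TNM_ACC, 1) for name, acc in ablations.items()}

# List the largest drops first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {delta:+.1f} pp")
```

Sorted this way, the single-LLM baseline shows the largest drop, followed by removing the LoRA specialists, the debate rounds, and the Qwen-VL second opinion.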
Calibration Reliability Curves
Predicted confidence vs. observed accuracy. Diagonal = perfect calibration.
GigaPath: ECE 8.9%, MCE 14.3%, Brier 0.162
Board Consensus: ECE 7.2%, MCE 12.2%, Brier 0.139
Lower ECE = better calibration. Board consensus ECE (7.2%) is lower than that of GigaPath alone (8.9%), suggesting that the deliberation loop improves confidence calibration.
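For reference, the three calibration metrics can be sketched as follows. This assumes equal-width confidence bins and a Brier score taken between the predicted confidence and a 0/1 correctness indicator; the bin count, the function name, and the toy inputs are assumptions and may differ from what `aob/eval/calibration.py` does.

```python
def calibration_metrics(confidences, correct, n_bins=10):
    """Return (ECE, MCE, Brier) using equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        gap = abs(avg_conf - accuracy)
        ece += (len(b) / n) * gap  # gap weighted by bin population
        mce = max(mce, gap)        # worst single-bin gap

    brier = sum((c - y) ** 2 for c, y in zip(confidences, correct)) / n
    return ece, mce, brier

# Toy predictions (hypothetical, not benchmark data).
conf = [0.95, 0.95, 0.65, 0.65]
hit = [1, 1, 1, 0]
ece, mce, brier = calibration_metrics(conf, hit)
print(f"ECE={ece:.3f}  MCE={mce:.3f}  Brier={brier:.4f}")
```

ECE averages the confidence-accuracy gap across bins, while MCE reports only the worst bin, which is why MCE is never smaller than ECE.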

🔬 Per Cancer-Type Breakdown

Cancer Type           Cases  TNM Acc.  Biomarker F1
lung adenocarcinoma   30     86.7%     81.2%
colon adenocarcinoma  25     84.0%     75.6%
lung squamous         20     80.0%     72.0%
breast idc            15     80.0%     69.3%
other                 10     70.0%     62.0%

🤗 Reproducibility

🤗 Open on HuggingFace: AOB-Bench ClinicalEval v1 · 100 cases · CC BY 4.0
Run locally
# Python: load the evaluation split
from datasets import load_dataset
ds = load_dataset("aob-bench/ClinicalEval", split="test")

# Shell: re-run the ablation study and the calibration analysis
python aob/eval/ablation_study.py
python aob/eval/calibration.py