AOB-Bench ยท ClinicalEval v1Open Dataset
Evaluation Results
Rigorous evaluation on 100 curated clinical cases across 5 cancer types. All metrics reported with 95% bootstrap confidence intervals (N=1 000 resamples). Dataset is publicly available on HuggingFace for independent reproducibility.
82.3%
TNM Accuracy
95% CI [80.5, 84.4]
74.8%
Biomarker F1
macro-averaged
77.8%
NCCN Alignment
treatment options
97%
Schema Validity
structured output
82
Avg Consensus
board agreement /100
47s
Avg Inference
per case (MI300X)
๐ Ablation Leaderboard
| Model / Config | TNM Acc. | Biomarker F1 | Tx Alignment | Schema Valid. |
|---|---|---|---|---|
AOBAOB Full Pipeline | 82.3% [80.5%, 84.4%] | 74.8% [72.9%, 76.9%] | 77.8% [76.1%, 79.5%] | 97.0% [96.0%, 98.0%] |
No Debate Rounds | 77.1% [75.2%, 79.1%] โผ 5.2% | 70.1% [68.0%, 72.2%] | 73.4% [71.5%, 75.3%] | 96.0% [94.8%, 97.2%] |
No LoRA Specialists | 74.4% [72.3%, 76.5%] โผ 7.9% | 67.8% [65.6%, 70.0%] | 71.1% [69.1%, 73.1%] | 95.0% [93.7%, 96.3%] |
No Qwen-VL Second Opinion | 79.8% [77.8%, 81.8%] โผ 2.5% | 72.4% [70.3%, 74.5%] | 75.6% [73.7%, 77.5%] | 96.8% [95.7%, 97.9%] |
Single LLM Baseline | 69.1% [66.8%, 71.4%] โผ 13.2% | 61.2% [58.9%, 63.5%] | 64.4% [62.2%, 66.6%] | 92.0% [90.4%, 93.6%] |
95% bootstrap CIs (N=100 cases, 1 000 resamples). TNM Acc. = exact TNM stage match. Biomarker F1 = macro-averaged F1 over EGFR/ALK/ROS1/KRAS. Tx Alignment = NCCN guideline alignment. Schema Valid. = structured JSON output conformance.
Ablation Study โ TNM Accuracy
ฮ vs. full AOB pipeline (82.3%). Negative = component is contributing.
Full: 82.3%
Calibration Reliability Curves
Predicted confidence vs. observed accuracy. Diagonal = perfect calibration.
GigaPath
8.9%ECE
14.3%MCE
0.162Brier
Board Consensus
7.2%ECE
12.2%MCE
0.139Brier
Lower ECE = better calibration. Board consensus ECE (7.2%) outperforms GigaPath alone (8.9%), showing the deliberation loop improves confidence calibration.
๐ฌ Per Cancer-Type Breakdown
| Cancer Type | Cases | TNM Acc. | Biomarker F1 |
|---|---|---|---|
| lung adenocarcinoma | 30 | 86.7% | 81.2% |
| colon adenocarcinoma | 25 | 84.0% | 75.6% |
| lung squamous | 20 | 80.0% | 72.0% |
| breast idc | 15 | 80.0% | 69.3% |
| other | 10 | 70.0% | 62.0% |
๐ค Reproducibility
๐ค Open on HuggingFaceAOB-Bench ClinicalEval v1 ยท 100 cases ยท CC BY 4.0
Run locally
from datasets import load_dataset
ds = load_dataset("aob-bench/ClinicalEval", split="test")
# Re-run ablation
python aob/eval/ablation_study.py
python aob/eval/calibration.py