Per-task metrics
| Task | Macro-F1 A | Macro-F1 B | Macro Δ (A-B) | Accuracy A | Accuracy B | Acc Δ (A-B) | Top-3 A | Top-3 B |
|---|---|---|---|---|---|---|---|---|
Paired tests
| Task | Acc Diff (A-B) | 95% CI | McNemar p | Significant? | Direction | Discordant |
|---|---|---|---|---|---|---|
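
The columns of the paired-tests table can be derived from the two discordant counts alone. The following is a minimal sketch, not the report's confirmed method: it assumes an exact (binomial) McNemar test and a Wald interval for the paired accuracy difference, and the function names `mcnemar_exact` and `paired_accuracy_diff` are hypothetical. Here `b` counts examples that A gets right and B gets wrong, and `c` counts the reverse.

```python
from math import comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Assumption: an exact binomial test is used; the report may instead
    use the chi-square approximation.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the two systems are indistinguishable.
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_accuracy_diff(b: int, c: int, total: int, z: float = 1.96):
    """Accuracy difference (A-B) with a Wald CI for paired proportions.

    Assumption: a normal-approximation interval; z=1.96 gives roughly 95%.
    """
    diff = (b - c) / total
    se = sqrt((b + c) - (b - c) ** 2 / total) / total
    return diff, (diff - z * se, diff + z * se)


if __name__ == "__main__":
    # Illustrative counts only, not results from this report.
    b, c, total = 10, 0, 100
    p = mcnemar_exact(b, c)
    diff, (lo, hi) = paired_accuracy_diff(b, c, total)
    direction = "A > B" if diff > 0 else "B > A" if diff < 0 else "tie"
    print(f"diff={diff:.3f} CI=({lo:.3f}, {hi:.3f}) p={p:.4f} "
          f"significant={p < 0.05} direction={direction} discordant={b + c}")
```

Only the discordant pairs carry information about which system is better, which is why the table reports their count alongside the p-value.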
Charts
Macro-F1 per task
Mean latency per task (ms)
Top confusions
Article
Artifacts