Final Score = Quality × Efficiency × Model Bonus × 100
- Validation
- All required patterns must match. A failure is recorded as DNF, with no score.
- Quality (30–100%)
- Line-by-line similarity to the reference answer. Any submission that passes validation earns at least 30%, but a concise, targeted answer that matches the expected output line for line scores much higher.
- Efficiency
- Par ÷ tokens used. Finishing under par gives a multiplier above 1.0, so fewer tokens means a better multiplier.
- Model Bonus
- The smaller the model, the larger the multiplier: SmolLM2 135M = 2.5×, Phi 3.5 Mini = 0.5×.
- Golf Rating
- Based on tokens only, measured as tokens used relative to par: Ace (≤50%), Eagle (≤75%), Birdie (under par), Par (at par), Bogey (≤125%), Double Bogey (over 125%).
A verbose essay can pass validation yet score close to the 30% floor on quality. Aim for the specific insight the challenge asks for, not a comprehensive audit.
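
As a rough illustration of how the pieces combine, here is a minimal Python sketch of the formula and the golf rating. It is not the actual scoring code. The function names are hypothetical, quality is assumed to be a fraction between 0.30 and 1.0, the rating percentages are assumed to mean tokens used relative to par, and no rounding or caps are applied because none are described here.

```python
def final_score(quality: float, par_tokens: int, tokens_used: int,
                model_bonus: float) -> float:
    """Quality x Efficiency x Model Bonus x 100 for a validated submission.

    Assumes quality is a fraction in [0.30, 1.0], i.e. 30-100%.
    """
    if tokens_used <= 0 or par_tokens <= 0:
        raise ValueError("token counts must be positive")
    if not 0.30 <= quality <= 1.0:
        raise ValueError("quality must be between 0.30 and 1.0")
    efficiency = par_tokens / tokens_used  # above 1.0 when under par
    return quality * efficiency * model_bonus * 100


def golf_rating(par_tokens: int, tokens_used: int) -> str:
    """Token-only rating; thresholds are read as tokens used relative to par."""
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    if 2 * tokens_used <= par_tokens:        # at most 50% of par
        return "Ace"
    if 4 * tokens_used <= 3 * par_tokens:    # at most 75% of par
        return "Eagle"
    if tokens_used < par_tokens:             # under par
        return "Birdie"
    if tokens_used == par_tokens:            # exactly at par
        return "Par"
    if 4 * tokens_used <= 5 * par_tokens:    # at most 125% of par
        return "Bogey"
    return "Double Bogey"


# Example: 90% quality, 80 tokens against a par of 100, SmolLM2 135M (2.5x).
score = final_score(quality=0.90, par_tokens=100, tokens_used=80, model_bonus=2.5)
rating = golf_rating(par_tokens=100, tokens_used=80)
print(f"{score:.2f} {rating}")
```

With these inputs the efficiency multiplier is 100 ÷ 80 = 1.25, so the score works out to 0.90 × 1.25 × 2.5 × 100, and 80 tokens against a par of 100 rates as a Birdie.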