March 5, 2026 · 5 min read

Build with evals from day one

Most AI projects don't fail because the model is bad. They fail because nobody knew how to measure "good" before they shipped.

The most common mistake I see in AI projects isn't a bad architecture or a wrong model choice. It's that nobody wrote down what "good" meant before they started building.

When the system goes wrong in production — and it will — there's no agreed-upon definition of "right" to compare it against. Engineering bickers with product. Product bickers with the customer. Everyone has a story about what should have happened, and none of them are testable.

The fix is unglamorous: write the eval set first. A few dozen examples that represent the real distribution of inputs, with expected behaviors documented in plain language. Run it on day one. Run it on every change. Watch what regresses when you swap models, tighten prompts, or add retrieval.
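As a rough sketch of what that loop can look like in Python: the names here (`EvalCase`, `run_evals`, `find_regressions`) and the toy support-bot cases are all hypothetical, invented for illustration, and the `system_v1`/`system_v2` functions stand in for whatever model-plus-prompt pipeline you actually run. The string checks are deliberately crude; real checks might be rubric-based or human-reviewed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    expected: str                 # expected behavior, documented in plain language
    check: Callable[[str], bool]  # hypothetical automated check for that behavior


def run_evals(system: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through the system and record pass/fail by case name."""
    return {case.name: case.check(system(case.prompt)) for case in cases}


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return the cases that passed in the baseline run but fail in the current run."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


# Toy cases (assumed for illustration; a real set would mirror production inputs).
CASES = [
    EvalCase(
        name="refund_policy",
        prompt="Can I get a refund after 30 days?",
        expected="States that refunds are not available after 30 days.",
        check=lambda out: "not" in out.lower() and "refund" in out.lower(),
    ),
    EvalCase(
        name="unknown_order",
        prompt="Where is order 0000?",
        expected="Asks the user to verify the order number instead of guessing.",
        check=lambda out: "order number" in out.lower(),
    ),
]


def system_v1(prompt: str) -> str:
    # Stand-in for the current system.
    if "refund" in prompt.lower():
        return "Refunds are not available after 30 days."
    return "Please verify your order number."


def system_v2(prompt: str) -> str:
    # Stand-in for the system after a model swap or prompt change.
    if "refund" in prompt.lower():
        return "Sure, we can process that for you."
    return "Please verify your order number."


baseline = run_evals(system_v1, CASES)
current = run_evals(system_v2, CASES)
print(find_regressions(baseline, current))  # → ['refund_policy']
```

The point of keeping `expected` as plain language next to the check is that product and engineering can argue about the sentence, not the code.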

An eval set run on every change is the closest thing AI engineering has to type-checking. The teams that adopt it ship faster, change models without panic, and have shorter post-incident meetings.

Evals are not a phase. They are the thing the system is.

— Sunitha Giduturi
