A production agent runtime with evals, tool wiring and observability, built to replace a LangChain prototype that broke every week.
Tool-call success rate: 99.4%
Eval run cadence: 0 → daily
Kick-off to production: 5 weeks
Cut in token spend: ~60%
“We went from being scared to upgrade the model to upgrading the day it ships. That is the whole point.”
Head of Engineering, Series B fintech
Before
Where the team was when we picked this up.
- A LangChain prototype was running in production and breaking weekly when the model or a tool changed.
- There was no way to tell why an agent run failed. Logs were a wall of text.
- Every model update was a coin flip. Nobody wanted to ship the upgrade.
What we built
Custom harness
Replaced the framework with a thin, typed runtime. Tool calls, retries and errors are first-class. The team can read a transcript and see exactly what happened.
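To make "tool calls, retries and errors are first-class" concrete, here is a minimal Python sketch of the idea. It is not the team's actual runtime: `ToolResult`, `call_tool`, the retry-everything policy and the `lookup_balance` tool are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class ToolResult:
    """Outcome of one tool call, kept as plain data so it can go in a transcript."""
    tool: str
    ok: bool
    attempts: int
    output: Any = None
    error: Optional[str] = None


def call_tool(name: str, fn: Callable[..., Any], args: Dict[str, Any],
              max_attempts: int = 3) -> ToolResult:
    """Run a tool, retry on any exception, and return a result instead of raising."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ToolResult(name, True, attempt, output=fn(**args))
        except Exception as exc:  # assumed policy: every error is retryable
            last_error = f"{type(exc).__name__}: {exc}"
    return ToolResult(name, False, max_attempts, error=last_error)


# Usage: a made-up tool that times out once, then succeeds.
_calls = {"n": 0}


def lookup_balance(account_id: str) -> int:
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise TimeoutError("upstream timed out")
    return 1250


result = call_tool("lookup_balance", lookup_balance, {"account_id": "acct_1"})
print(result)
```

Because a failed call comes back as a `ToolResult` rather than an exception, the number of attempts and the final error can be recorded alongside successful calls, which is one way a transcript could show exactly what happened.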
Eval suite
Two hundred scenarios drawn from real production runs. Pass/fail signals, latency budgets and cost ceilings. Runs on every commit.
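One way such a scenario check could look is sketched below, assuming each scenario carries a latency budget and a cost ceiling next to its pass/fail signal. The `Scenario`, `RunOutcome` and `grade` names, the failure labels and the numbers are illustrative assumptions, not taken from the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    name: str
    max_latency_s: float  # latency budget for one run
    max_cost_usd: float   # cost ceiling for one run


@dataclass(frozen=True)
class RunOutcome:
    passed: bool      # did the agent produce the expected result?
    latency_s: float
    cost_usd: float


def grade(scenario: Scenario, outcome: RunOutcome) -> List[str]:
    """Return the list of failed checks; an empty list means the scenario passes."""
    failures: List[str] = []
    if not outcome.passed:
        failures.append("wrong_result")
    if outcome.latency_s > scenario.max_latency_s:
        failures.append("over_latency_budget")
    if outcome.cost_usd > scenario.max_cost_usd:
        failures.append("over_cost_ceiling")
    return failures


# Usage: a correct but slow run fails only the latency check.
refund = Scenario("refund_lookup", max_latency_s=5.0, max_cost_usd=0.02)
print(grade(refund, RunOutcome(passed=True, latency_s=7.3, cost_usd=0.01)))
```

Returning named failures rather than a single boolean lets a commit be blocked for being too slow or too expensive, not only for being wrong.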
Observability
Every run produces a structured trace. We added dashboards for failure modes, tool latency, prompt drift and unit costs.
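As a rough sketch of what a structured trace makes possible, the Python below writes each step as one JSON line and then counts failures by category. The event fields (`run_id`, `step`, `kind`, `category`) and both function names are assumptions chosen for the example; the real trace schema is not described here.

```python
import json
from collections import Counter
from typing import Any, Iterable


def trace_event(run_id: str, step: int, kind: str, **fields: Any) -> str:
    """Serialize one step of an agent run as a single JSON line."""
    return json.dumps({"run_id": run_id, "step": step, "kind": kind, **fields},
                      sort_keys=True)


def failures_by_category(lines: Iterable[str]) -> Counter:
    """Count tool errors per category across any number of traced runs."""
    counts: Counter = Counter()
    for line in lines:
        event = json.loads(line)
        if event["kind"] == "tool_error":
            counts[event.get("category", "uncategorized")] += 1
    return counts


# Usage: two runs, three errors, two of them timeouts.
trace = [
    trace_event("run_1", 1, "tool_call", tool="lookup_balance"),
    trace_event("run_1", 2, "tool_error", tool="lookup_balance", category="timeout"),
    trace_event("run_2", 1, "tool_error", tool="send_email", category="timeout"),
    trace_event("run_2", 2, "tool_error", tool="send_email"),
]
print(failures_by_category(trace))
```

A dashboard for failure modes is then an aggregation over these lines rather than a search through free-text logs.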
What changed
- Model upgrades now ship the same week they release.
- Failures are tracked by category, not anecdote.
- Token spend went down even as usage grew.
After
Same team. Same week. Different shape of work.
Timeline & team
5 weeks · 1 PM + 2 engineers