Agent harness & evals
FINTECH · SERIES B · Financial services

A production agent runtime with evals, tool wiring and observability. Replaced a prototype that broke every week.

Tool-call success rate

99.4%

5 weeks · 1 PM + 2 engineers

Eval run cadence

0 → daily

Kick-off to production

5 weeks

Cut in token spend

~60%

“We went from being scared to upgrade the model to upgrading the day it ships. That is the whole point.”

Head of Engineering, Series B fintech

Before

Where the team was when we picked this up.

  • A LangChain prototype was running in production and breaking weekly when the model or a tool changed.
  • There was no way to tell why an agent run failed. Logs were a wall of text.
  • Every model update was a coin flip. Nobody wanted to ship the upgrade.

What we built

Custom harness

Replaced the framework with a thin, typed runtime. Tool calls, retries and errors are first-class. The team can read a transcript and see exactly what happened.

Eval suite

Two hundred scenarios drawn from real production runs. Pass/fail signals, latency budgets and cost ceilings. Runs on every commit.

Observability

Every run produces a structured trace. We added dashboards for failure modes, tool latency, prompt drift and unit costs.

What changed

  • Model upgrades now ship the same week they release.
  • Failures are tracked by category, not anecdote.
  • Token spend went down even as usage grew.

After

Same team. Same week. Different shape of work.

Stack

Anthropic ClaudeOpenAITypeScriptOpenTelemetryPostgresCustom evals

Timeline & team

5 weeks · 1 PM + 2 engineers

Got a workflow like this one?

Book a working session. We will tell you whether this is a four-week build or something bigger, and what it would take to ship it.

Book a working session

Cookies. Sadly not chocolate chip.

We use cookies to keep the site working, understand what is useful, and avoid shouting ads into the void. You can accept all, reject non-essential, or choose your own settings.

More detail lives in our Privacy Policy and Terms.