Agent harness & evals
FINTECH · SERIES B · Financial services

A production agent runtime with evals, tool wiring and observability. Replaced a prototype that broke every week.

Tool-call success rate

99.4%

5 weeks · 1 PM + 2 engineers

Eval run cadence

0 → daily

Kick-off to production

5 weeks

Cut in token spend

~60%

“We went from being scared to upgrade the model to upgrading the day it ships. That is the whole point.”

Head of Engineering, Series B fintech

Before

Where the team was when we picked this up.

  • A LangChain prototype was running in production and breaking weekly when the model or a tool changed.
  • There was no way to tell why an agent run failed. Logs were a wall of text.
  • Every model update was a coin flip. Nobody wanted to ship the upgrade.

What we built

Custom harness

Replaced the framework with a thin, typed runtime. Tool calls, retries and errors are first-class. The team can read a transcript and see exactly what happened.
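
A minimal sketch of what "first-class" tool calls, retries and errors could look like in a thin typed runtime. Every name here (`Tool`, `ToolResult`, `TranscriptEntry`, `callTool`) is hypothetical and chosen for illustration; the case study does not publish the team's actual interfaces. The idea is that each attempt is recorded as data, so a transcript can be read back step by step.

```typescript
// Hypothetical types: illustrative only, not the team's real runtime.

// A tool call either succeeds with a value or fails with an error message.
type ToolResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: string; attempts: number };

interface Tool<In, Out> {
  name: string;
  run(input: In): Promise<Out>;
}

// One line of the readable transcript: which tool, which attempt, what happened.
interface TranscriptEntry {
  tool: string;
  attempt: number;
  outcome: "success" | "error";
  detail: string;
}

// Calls a tool up to maxAttempts times and records every attempt.
// Errors are returned as values instead of being thrown past the caller.
async function callTool<In, Out>(
  tool: Tool<In, Out>,
  input: In,
  maxAttempts: number,
  transcript: TranscriptEntry[],
): Promise<ToolResult<Out>> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = await tool.run(input);
      transcript.push({
        tool: tool.name,
        attempt,
        outcome: "success",
        detail: JSON.stringify(value),
      });
      return { ok: true, value, attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      transcript.push({
        tool: tool.name,
        attempt,
        outcome: "error",
        detail: lastError,
      });
    }
  }
  return { ok: false, error: lastError, attempts: maxAttempts };
}
```

Returning a discriminated union rather than throwing forces the caller to handle the failure branch, and the transcript array gives one entry per attempt regardless of outcome. A real runtime would likely add backoff between attempts and distinguish retryable from non-retryable errors; both are left out here to keep the sketch short.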

Eval suite

Two hundred scenarios drawn from real production runs. Pass/fail signals, latency budgets and cost ceilings. Runs on every commit.
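
As a rough illustration, a scenario that carries a pass/fail signal, a latency budget and a cost ceiling might be judged like the sketch below. The field names, the exact-match output check and the failure labels are all assumptions made for this example; the real suite's scoring rules are not described in the source.

```typescript
// Hypothetical shapes: illustrative only, not the team's real eval schema.

interface Scenario {
  id: string;
  expectedOutput: string;  // assumption: pass/fail is an exact match
  latencyBudgetMs: number;
  costCeilingUsd: number;
}

interface RunResult {
  output: string;
  latencyMs: number;
  costUsd: number;
}

interface Verdict {
  scenarioId: string;
  passed: boolean;
  failures: string[]; // one label per violated check
}

// A run passes only if the output is right AND it stays within both limits.
function judge(scenario: Scenario, run: RunResult): Verdict {
  const failures: string[] = [];
  if (run.output !== scenario.expectedOutput) failures.push("wrong_output");
  if (run.latencyMs > scenario.latencyBudgetMs) failures.push("over_latency_budget");
  if (run.costUsd > scenario.costCeilingUsd) failures.push("over_cost_ceiling");
  return { scenarioId: scenario.id, passed: failures.length === 0, failures };
}

// Aggregates verdicts into a pass rate and a count per failure label.
function summarize(verdicts: Verdict[]): {
  passRate: number;
  byFailure: Record<string, number>;
} {
  const byFailure: Record<string, number> = {};
  for (const verdict of verdicts) {
    for (const failure of verdict.failures) {
      byFailure[failure] = (byFailure[failure] ?? 0) + 1;
    }
  }
  const passed = verdicts.filter((v) => v.passed).length;
  const passRate = verdicts.length === 0 ? 0 : passed / verdicts.length;
  return { passRate, byFailure };
}
```

Treating latency and cost as failure conditions alongside correctness means a model upgrade that answers correctly but runs slower or costs more still shows up as a regression in the same report.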

Observability

Every run produces a structured trace. We added dashboards for failure modes, tool latency, prompt drift and unit costs.
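
To make "structured trace" concrete, here is a small sketch of a trace as a list of spans, plus a summary of the kind a dashboard could be fed from. The span shape is loosely modelled on the general idea of tracing spans but is plain TypeScript with hypothetical field names; it does not use the OpenTelemetry SDK, and the attribute keys (`tokens`, `error_category`) are assumptions.

```typescript
// Hypothetical trace shape: illustrative only, not the team's real schema.

interface Span {
  name: string;
  kind: "model_call" | "tool_call";
  startMs: number;
  endMs: number;
  status: "ok" | "error";
  attributes: Record<string, string | number>;
}

interface RunTrace {
  runId: string;
  spans: Span[];
}

interface TraceSummary {
  failedSpans: string[];                 // names of spans that errored
  toolLatencyMs: Record<string, number>; // total time spent per tool
  totalTokens: number;                   // sum of the "tokens" attribute
}

function summarizeTrace(trace: RunTrace): TraceSummary {
  const summary: TraceSummary = { failedSpans: [], toolLatencyMs: {}, totalTokens: 0 };
  for (const span of trace.spans) {
    if (span.status === "error") summary.failedSpans.push(span.name);
    if (span.kind === "tool_call") {
      const soFar = summary.toolLatencyMs[span.name] ?? 0;
      summary.toolLatencyMs[span.name] = soFar + (span.endMs - span.startMs);
    }
    const tokens = span.attributes["tokens"];
    if (typeof tokens === "number") summary.totalTokens += tokens;
  }
  return summary;
}

// Example run: one tool call times out, the retry succeeds.
const exampleTrace: RunTrace = {
  runId: "run-001",
  spans: [
    { name: "plan", kind: "model_call", startMs: 0, endMs: 400, status: "ok",
      attributes: { tokens: 1200 } },
    { name: "lookupBalance", kind: "tool_call", startMs: 400, endMs: 650, status: "error",
      attributes: { error_category: "timeout" } },
    { name: "lookupBalance", kind: "tool_call", startMs: 650, endMs: 800, status: "ok",
      attributes: {} },
    { name: "respond", kind: "model_call", startMs: 800, endMs: 1100, status: "ok",
      attributes: { tokens: 300 } },
  ],
};
```

Because every span carries a status and typed attributes, questions such as "which tool fails most" or "what does a run cost" become queries over data instead of searches through log text.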

What changed

  • Model upgrades now ship the same week they are released.
  • Failures are tracked by category, not anecdote.
  • Token spend went down even as usage grew.

After

Same team. Same week. Different shape of work.

Stack

Anthropic Claude · OpenAI · TypeScript · OpenTelemetry · Postgres · Custom evals

Timeline & team

5 weeks · 1 PM + 2 engineers

Got a workflow like this one?

Book a working session. We will tell you whether this is a four-week build or something bigger, and what it would take to ship it.

