Agent harness & evals

FINTECH · SERIES B · Financial services

A production agent runtime with evals, tool wiring and observability. Replaced a prototype that broke every week.

Tool-call success rate

99.4%

5 weeks · 1 PM + 2 engineers

Eval run cadence

0 → daily

Kick-off to production

5 weeks

Cut in token spend

~60%

“We went from being scared to upgrade the model to upgrading the day it ships. That is the whole point.”

Head of Engineering, Series B fintech

Before

Where the team was when we picked this up.

A LangChain prototype was running in production and breaking weekly when the model or a tool changed.
There was no way to tell why an agent run failed. Logs were a wall of text.
Every model update was a coin flip. Nobody wanted to ship the upgrade.

What we built

Custom harness

Replaced the framework with a thin, typed runtime. Tool calls, retries and errors are first-class. The team can read a transcript and see exactly what happened.
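To make "first-class" concrete, here is a minimal sketch of a typed tool call with retries, where a failure comes back as a value instead of an unhandled exception. Every name in it (ToolResult, Tool, callTool, makeFlakyLookup) is hypothetical and chosen for illustration; it is not the team's actual runtime.

```typescript
// Hypothetical names throughout; an illustration, not the production harness.
type ToolResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: string; attempts: number };

interface Tool<In, Out> {
  name: string;
  run(input: In): Promise<Out>;
}

// Calls a tool, retrying on thrown errors up to maxAttempts times.
// The outcome is always a ToolResult, so callers must handle the failure branch.
async function callTool<In, Out>(
  tool: Tool<In, Out>,
  input: In,
  maxAttempts = 3,
): Promise<ToolResult<Out>> {
  let lastError = "unknown error";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = await tool.run(input);
      return { ok: true, value, attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: `${tool.name}: ${lastError}`, attempts: maxAttempts };
}

// Example tool (made up) that fails on its first call and succeeds afterwards.
function makeFlakyLookup(): Tool<{ accountId: string }, { balanceCents: number }> {
  let calls = 0;
  return {
    name: "balance_lookup",
    async run(_input) {
      calls += 1;
      if (calls === 1) throw new Error("upstream timeout");
      return { balanceCents: 12500 };
    },
  };
}

callTool(makeFlakyLookup(), { accountId: "acct_1" }).then((result) => {
  console.log(result);
});
```

In this sketch the lookup succeeds on its second attempt, and the attempt count travels with the result, so a transcript can show the retry rather than hide it.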

Eval suite

Two hundred scenarios drawn from real production runs. Pass/fail signals, latency budgets and cost ceilings. Runs on every commit.
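One way a single scenario check could combine the three signals is sketched below. The field names, the scenario and the thresholds are assumptions made up for illustration, not values from the real suite.

```typescript
// Hypothetical shapes; field names and thresholds are illustrative only.
interface Scenario {
  id: string;
  maxLatencyMs: number; // latency budget
  maxCostUsd: number; // cost ceiling
  passes(output: string): boolean; // pass/fail signal
}

interface RunRecord {
  output: string;
  latencyMs: number;
  costUsd: number;
}

interface Verdict {
  id: string;
  passed: boolean;
  failures: string[];
}

// A run passes only if the output is right AND it stays within both budgets.
function evaluate(scenario: Scenario, run: RunRecord): Verdict {
  const failures: string[] = [];
  if (!scenario.passes(run.output)) failures.push("wrong_output");
  if (run.latencyMs > scenario.maxLatencyMs) failures.push("latency_budget");
  if (run.costUsd > scenario.maxCostUsd) failures.push("cost_ceiling");
  return { id: scenario.id, passed: failures.length === 0, failures };
}

const refundScenario: Scenario = {
  id: "refund-status",
  maxLatencyMs: 4000,
  maxCostUsd: 0.02,
  passes: (output) => output.includes("refund"),
};

// Correct answer, within cost, but slower than the budget allows.
const verdict = evaluate(refundScenario, {
  output: "Your refund was issued on Monday.",
  latencyMs: 5200,
  costUsd: 0.011,
});
console.log(verdict.passed, verdict.failures);
```

Treating latency and cost as failure conditions alongside correctness means a model upgrade that answers correctly but runs slower or costs more still shows up as a failed scenario.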

Observability

Every run produces a structured trace. We added dashboards for failure modes, tool latency, prompt drift and unit costs.
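As a rough sketch of what a structured trace makes possible, the example below models trace events as plain objects and counts failures by category. The schema and category names are hypothetical; it uses no OpenTelemetry API and does not reproduce the team's actual trace format.

```typescript
// Hypothetical trace schema, for illustration only.
type FailureCategory = "tool_error" | "timeout" | "bad_output";

interface TraceEvent {
  runId: string;
  step: number;
  kind: "model_call" | "tool_call";
  name: string;
  latencyMs: number;
  tokens: number;
  failure?: FailureCategory; // absent when the step succeeded
}

// Counts failed steps per category across any number of runs.
function failuresByCategory(events: TraceEvent[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of events) {
    if (e.failure) counts[e.failure] = (counts[e.failure] ?? 0) + 1;
  }
  return counts;
}

// Sums token usage per run, a building block for unit-cost dashboards.
function tokensByRun(events: TraceEvent[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const e of events) {
    totals[e.runId] = (totals[e.runId] ?? 0) + e.tokens;
  }
  return totals;
}

const events: TraceEvent[] = [
  { runId: "r1", step: 1, kind: "model_call", name: "plan", latencyMs: 820, tokens: 1200 },
  { runId: "r1", step: 2, kind: "tool_call", name: "balance_lookup", latencyMs: 3100, tokens: 0, failure: "timeout" },
  { runId: "r2", step: 1, kind: "model_call", name: "plan", latencyMs: 760, tokens: 900 },
  { runId: "r2", step: 2, kind: "tool_call", name: "ledger_write", latencyMs: 140, tokens: 0, failure: "tool_error" },
  { runId: "r2", step: 3, kind: "tool_call", name: "balance_lookup", latencyMs: 2900, tokens: 0, failure: "timeout" },
];

console.log(failuresByCategory(events));
console.log(tokensByRun(events));
```

Because every step carries the same fields, questions like "which failure mode is most common" or "what does one run cost" become simple aggregations rather than log archaeology.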

What changed

Model upgrades now ship the same week they release.
Failures are tracked by category, not anecdote.
Token spend went down even as usage grew.

After

Same team. Same week. Different shape of work.

Stack

Anthropic Claude · OpenAI · TypeScript · OpenTelemetry · Postgres · Custom evals

Timeline & team

5 weeks · 1 PM + 2 engineers

More projects

AI-integrated mobile app

A mobile app with on-device and cloud AI features, shipped to the App Store in six weeks.

6 weeks · Kick-off to App Store

Internal ops platform

A multi-agent ops platform that took ~35 hours of partner busywork out of the week.

~35h/wk · Manual ops removed

RAG copilot

A RAG copilot over 60,000 compliance documents that answers in seconds, with citations.

~94% · Answer accuracy on eval set

Got a workflow like this one?

Book a working session. We will tell you whether this is a four-week build or something bigger, and what it would take to ship it.