A long-term memory layer that cut an agent's hallucinations and made users feel remembered from one conversation to the next.
Drop in hallucination rate: ~70%
4 weeks · 1 PM + 1 engineer
Memory lookup latency: < 200 ms
Memory layers: 3 (episodic, semantic, procedural)
Kick-off to production: 4 weeks
“Memory used to be the thing that broke first when we scaled. Now it is the thing users compliment.”
CTO, AI-native SaaS
Before
Where the team was when we picked this up.
- The agent forgot users between sessions. New conversations felt like cold starts.
- Pulling the entire history into context worked at first, then broke as the user base grew.
- No way to tell whether a bad answer was caused by memory or by the model itself.
What we built
Three-layer store
Episodic (what happened, and when), semantic (who the user is and what they care about) and procedural (how this user likes to be handled). Each layer is written and retrieved differently.
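A minimal sketch of how the three layers could differ in write and read behaviour. The class and method names (`MemoryStore`, `log_event`, `set_fact`, `add_rule`, `recent_events`) are illustrative assumptions, not the team's actual schema, and a production store would sit on a database rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemoryStore:
    """Hypothetical per-user store with one container per memory layer."""
    episodic: list = field(default_factory=list)    # (timestamp, event) pairs, append-only
    semantic: dict = field(default_factory=dict)    # fact name -> latest known value
    procedural: list = field(default_factory=list)  # standing rules for handling this user

    def log_event(self, event, when=None):
        # Episodic write: record what happened and when; nothing is overwritten.
        when = when or datetime.now(timezone.utc)
        self.episodic.append((when, event))

    def set_fact(self, key, value):
        # Semantic write: a newer value replaces the older one.
        self.semantic[key] = value

    def add_rule(self, rule):
        # Procedural write: keep each handling preference only once.
        if rule not in self.procedural:
            self.procedural.append(rule)

    def recent_events(self, n=3):
        # Episodic read: order by timestamp, newest first, and return the events.
        ordered = sorted(self.episodic, key=lambda pair: pair[0], reverse=True)
        return [event for _, event in ordered[:n]]
```

In this sketch episodic memory only ever grows, semantic memory keeps one current value per fact, and procedural memory is a short list applied on every turn. That is one plausible reading of "written and retrieved differently".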
Selective recall
A small retrieval model decides what to pull into context for each turn. Cheap, fast, and trained on the team’s real conversations.
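The selection step can be sketched as "score every stored memory against the current turn, keep the top few above a threshold". The scorer below is a plain word-overlap (Jaccard) stand-in for the trained retrieval model, and `select_memories`, `k` and `min_score` are hypothetical names and defaults chosen for illustration.

```python
import re


def tokens(text):
    # Lowercase alphanumeric tokens as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, memory):
    # Stand-in scorer: Jaccard overlap between the two token sets.
    # A trained retrieval model would replace this function.
    q, m = tokens(query), tokens(memory)
    if not q or not m:
        return 0.0
    return len(q & m) / len(q | m)


def select_memories(query, memories, k=2, min_score=0.1, score=overlap_score):
    # Score each memory, drop weak matches, return the k best.
    scored = [(score(query, memory), memory) for memory in memories]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in kept[:k]]


memories = [
    "prefers invoices sent by email",
    "works in the finance team",
    "asked about dark mode last week",
]
selected = select_memories("can you email me the invoices again", memories)
```

Because the scorer is passed in as a parameter, swapping the overlap heuristic for a learned model would not change the calling code. The threshold is what keeps unrelated memories out of the prompt.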
Memory evals
A test set of multi-session conversations with expected recall behaviour. Catches regressions before the next release.
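One way such a test set could be structured: each case lists memories written in earlier sessions, a probe message from a later session, and which memories should and should not be recalled. `RecallCase`, `run_evals` and the toy `keyword_recall` function are assumptions made for this sketch, not the team's harness.

```python
from dataclasses import dataclass, field


@dataclass
class RecallCase:
    name: str
    stored: list           # memories written during earlier sessions
    probe: str             # user message in a later session
    must_recall: list      # memories that should reach the context
    must_not_recall: list = field(default_factory=list)


def run_evals(cases, recall_fn):
    # Return one (name, missing, leaked) tuple per failing case.
    failures = []
    for case in cases:
        recalled = set(recall_fn(case.probe, case.stored))
        missing = [m for m in case.must_recall if m not in recalled]
        leaked = [m for m in case.must_not_recall if m in recalled]
        if missing or leaked:
            failures.append((case.name, missing, leaked))
    return failures


def keyword_recall(probe, stored):
    # Toy recall function under test: match on shared words longer than 3 letters.
    words = [w for w in probe.lower().split() if len(w) > 3]
    return [m for m in stored if any(w in m.lower().split() for w in words)]


cases = [
    RecallCase(
        name="peanut allergy",
        stored=["allergic to peanuts", "lives in Berlin"],
        probe="suggest a snack without peanuts",
        must_recall=["allergic to peanuts"],
        must_not_recall=["lives in Berlin"],
    )
]
failures = run_evals(cases, keyword_recall)
```

Run against each release candidate, an empty `failures` list means recall behaviour held. A non-empty list names the case and shows what was missed or wrongly pulled in.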
What changed
- Users describe the agent as feeling like it knows them.
- Costs went down because the agent no longer drags entire transcripts into every prompt.
- The team can ship model upgrades without holding their breath.
After
Same team. Same week. Different shape of work.
Timeline & team
4 weeks · 1 PM + 1 engineer
Got a workflow like this one?
Book a working session. We will tell you whether this is a four-week build or something bigger, and what it would take to ship it.