
Agent demos tend to fall apart once you ask a real operations question. They hallucinate inventory levels, invent suppliers, and contradict themselves between turns — with no log of what happened. I wanted to build something closer to what a platform team would actually ship: three specialist agents with clear scope, a typed state machine they share, real LLM calls in CI, and observability that isn't an afterthought.

The system doesn't need to serve real users, but it should be built the way you'd build it if it did. Typed Pydantic state, deterministic routing, human intervention gates, evals on every pull request, and Langfuse traces for every run. A platform engineer reading the repo should recognise their own job.

[Architecture diagram — LangGraph state machine: User → Router (classify) → Forecasting (demand signals), Inventory (stock levels), Supplier Risk (risk scoring) → Synthesizer (merge state) → Output Guard (fast regex tier, slow LLM-judge tier) → END, with interrupt() gates. Shared TypedDict state carries citations, tool_calls, guardrail_results.]
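In the real graph the router is an LLM classifier, but the deterministic fan-out shape is easy to sketch. This is a stdlib-only illustration — the agent names come from the diagram, the keyword table is invented for the example:

```python
from typing import Dict, List

# Hypothetical keyword table -- the real router is an LLM classify step.
ROUTES: Dict[str, List[str]] = {
    "forecasting": ["forecast", "demand", "trend"],
    "inventory": ["stock", "inventory", "warehouse"],
    "supplier_risk": ["supplier", "vendor", "risk"],
}

def route(question: str) -> List[str]:
    """Return the specialist agents this question fans out to."""
    q = question.lower()
    matched = [agent for agent, kws in ROUTES.items() if any(k in q for k in kws)]
    # Deterministic fallback: an unmatched question goes to every specialist.
    return matched or list(ROUTES)

print(route("What stock do we hold for supplier Acme?"))
# → ['inventory', 'supplier_risk']
```

The point is the contract, not the matching: routing is a pure function of the question, so the same input always reaches the same agents.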
01

Typed graph state, not free-form scratchpads

The shared state is a Pydantic model where every field has a name, a type, and an owner. The forecasting agent can't silently overwrite the inventory plan; the supplier-risk agent can't mutate the demand signal. On a three-agent graph this feels like overkill. On anything bigger, it's the only reason you can refactor without breaking everything.
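The ownership layout looks roughly like this. I've sketched it with stdlib dataclasses so it runs anywhere; the project itself uses Pydantic models, and the field and agent names here are illustrative:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class DemandSignal:          # owned by the forecasting agent
    sku: str
    weekly_units: float

@dataclass(frozen=True)
class InventoryPlan:         # owned by the inventory agent
    sku: str
    reorder_point: int

@dataclass
class GraphState:
    citations: List[str] = field(default_factory=list)
    tool_calls: List[str] = field(default_factory=list)
    demand: Optional[DemandSignal] = None
    inventory: Optional[InventoryPlan] = None

def forecasting_node(state: GraphState) -> GraphState:
    # The forecasting agent writes only the field it owns.
    state.demand = DemandSignal(sku="SKU-1", weekly_units=120.0)
    return state

state = forecasting_node(GraphState())
print(state.demand.weekly_units)  # → 120.0
```

Because each sub-state is frozen, the supplier-risk agent can't quietly mutate the demand signal in place — it would have to replace the whole object, which shows up in review and in the trace.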

02

Two-tier output guardrails

Every response passes a fast deterministic check first — regex, schema validation, range bounds, forbidden patterns — then a slower LLM-as-judge for softer things like whether the answer actually addresses the question and whether cited numbers come from real context. Both tiers log which check fired and why, so when something gets rejected you can see what happened instead of guessing.
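The two-tier shape, with fail-fast on the cheap checks, looks like this. The specific checks here (invented-supplier detection, quantity bounds) are examples I made up; the fail-fast ordering is the real point:

```python
import re
from typing import Callable, List, Tuple

# Tier 1: fast deterministic checks. Each returns (passed, reason).
def no_invented_suppliers(answer: str, known: List[str]) -> Tuple[bool, str]:
    cited = re.findall(r"supplier:(\w+)", answer)
    bad = [s for s in cited if s not in known]
    return (not bad, f"unknown suppliers: {bad}" if bad else "ok")

def within_range(answer: str) -> Tuple[bool, str]:
    qty = re.search(r"order (\d+) units", answer)
    if qty and int(qty.group(1)) > 10_000:
        return False, "order quantity out of range"
    return True, "ok"

def guard(answer: str,
          llm_judge: Callable[[str], Tuple[bool, str]]) -> Tuple[bool, List[str]]:
    log: List[str] = []
    for passed, reason in (no_invented_suppliers(answer, ["Acme"]),
                           within_range(answer)):
        log.append(reason)
        if not passed:
            return False, log          # fail fast; never pay for the judge
    passed, reason = llm_judge(answer)  # tier 2: slow LLM-as-judge
    log.append(reason)
    return passed, log

ok, log = guard("order 200 units from supplier:Acme",
                llm_judge=lambda a: (True, "grounded"))
print(ok, log)  # → True ['ok', 'ok', 'grounded']
```

Every check contributes a reason string to the log whether it passes or fails, which is what makes rejections diagnosable after the fact.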

03

Real-LLM integration tests in CI

Every PR runs a test suite against a real Ollama model, end-to-end, with the full graph executing. Assertions check structure — did the agent route correctly, did the guardrails fire — not specific token sequences, so the tests don't flake on model updates. I stopped trusting mocks for anything beyond unit tests after getting burned by suites that passed while the real model broke.
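A structural assertion looks like this. `run_graph` is a stand-in for the real entry point (which executes the full graph against a local Ollama model); the result shape is illustrative:

```python
def run_graph(question: str) -> dict:
    # Stub for illustration; in CI this runs the actual graph.
    return {
        "route": ["inventory"],
        "guardrails": {"regex": "pass", "judge": "pass"},
        "answer": "Stock for SKU-1 is 40 units [chunk-7].",
    }

def test_inventory_question_routes_and_passes_guards():
    result = run_graph("How many units of SKU-1 are in stock?")
    assert "inventory" in result["route"]           # routed correctly
    assert result["guardrails"]["regex"] == "pass"  # tier 1 ran and passed
    assert "[chunk-" in result["answer"]            # a citation is present
    # Deliberately no assertion on the exact wording of `answer`.

test_inventory_question_routes_and_passes_guards()
```

The last comment is the whole philosophy: assert on the envelope, never on the tokens.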

04

Human-in-the-loop at the points that matter

Pulling a forecast doesn't need a human. Approving a purchase order does. The graph has explicit interrupt points where a human reviews, edits, or rejects before the state machine continues. Interrupts are typed and logged, so you can replay any approval decision after the fact.
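The gate is a typed request/decision pair with an audit trail. LangGraph's `interrupt()` plays this role in the real graph; this stdlib sketch (all names mine) just shows the contract:

```python
from dataclasses import dataclass
from typing import Callable, List, Literal, Tuple

@dataclass(frozen=True)
class ApprovalRequest:
    action: str
    payload: dict

@dataclass(frozen=True)
class ApprovalDecision:
    verdict: Literal["approve", "edit", "reject"]
    reviewer: str
    note: str = ""

audit_log: List[Tuple[ApprovalRequest, ApprovalDecision]] = []

def purchase_order_gate(
    req: ApprovalRequest,
    review: Callable[[ApprovalRequest], ApprovalDecision],
) -> ApprovalDecision:
    decision = review(req)            # the graph blocks here on a human
    audit_log.append((req, decision))  # every decision is replayable later
    return decision

decision = purchase_order_gate(
    ApprovalRequest("create_po", {"sku": "SKU-1", "qty": 500}),
    review=lambda r: ApprovalDecision("approve", reviewer="ops@example.com"),
)
print(decision.verdict)  # → approve
```

Because both sides of the gate are frozen dataclasses, the audit log is a faithful record: nothing downstream can retroactively edit what was approved.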

05

Self-hosted Langfuse, not vendor lock-in

Langfuse v3, ClickHouse, Postgres, Redis, and MinIO — nine services in one Docker Compose file, no cloud account, no per-trace billing. Every agent run logs prompts, tool calls, latencies, token counts, and guardrail outcomes. You own the data, and for anyone with data-residency constraints, self-hosted is the only option that passes review.
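Shape-wise, a single run's trace record carries roughly this (field names are mine for illustration, not Langfuse's actual schema):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class ToolCall:
    name: str
    latency_ms: float

@dataclass
class RunTrace:
    prompt: str
    model: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    guardrail_outcome: str = "pending"

trace = RunTrace(prompt="Stock for SKU-1?", model="llama3")
trace.tool_calls.append(ToolCall("inventory_lookup", latency_ms=12.5))
trace.guardrail_outcome = "pass"
print(json.dumps(asdict(trace), indent=2))
```

Everything in that record lands in your own ClickHouse, which is what makes the data-residency argument work.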

06

RAG that cites, not RAG that paraphrases

LlamaIndex over FAISS handles retrieval. Every claim the agents make about the corpus is tagged with its source chunk and surfaced in the trace. If a citation can't be produced, the agent says so instead of fabricating. Simple rule, and it solves most of the trust problem.
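The cite-or-refuse rule fits in a few lines. Here the retrieved context is a hard-coded dict standing in for what LlamaIndex returns from FAISS; chunk ids and text are invented:

```python
from typing import Dict, Optional, Tuple

# Stand-in for retrieved context: chunk id -> text.
CONTEXT: Dict[str, str] = {
    "chunk-7": "SKU-1 on-hand inventory: 40 units as of 2024-06-01.",
}

def answer_with_citation(claim: str, evidence: str) -> Tuple[str, Optional[str]]:
    """Attach the source chunk backing a claim, or refuse outright."""
    for chunk_id, text in CONTEXT.items():
        if evidence in text:
            return f"{claim} [{chunk_id}]", chunk_id
    # No supporting chunk: say so instead of fabricating one.
    return "I can't cite a source for that, so I won't state it.", None

print(answer_with_citation("SKU-1 has 40 units on hand.", "40 units"))
# → ('SKU-1 has 40 units on hand. [chunk-7]', 'chunk-7')
```

The `None` branch is the important one: an uncitable claim never reaches the user as fact.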

// reusable scaffolding

Reusable scaffolding

The supply-chain domain is incidental. Swap in a different set of agents and 90% of the scaffolding — typed state, guardrails, traces, evals — carries over unchanged.

// typed state as force multiplier

Typed state as force multiplier

Once the graph state is a Pydantic model, static analysis catches boundary bugs, refactoring an agent's role becomes a type-checked operation, and the LLM gets a cleaner interface to write into. One structural decision that pays off in every direction.

// distance to production

Distance to production

Swap the in-process graph for a durable workflow engine (Temporal or LangGraph's persistence layer). Front the LLM with a gateway for retries and rate limits. Push traces to a long-term store. Add per-tenant isolation. The architecture is already shaped for these, so none of them require rewriting what's here.

LangGraph · LlamaIndex · FAISS · FastAPI · Streamlit · Langfuse · Ollama
// the code

Read the source, run it locally, open issues.