Supply Chain Multi-Agent System
Three specialist agents coordinating through a typed LangGraph state machine
Agent demos tend to fall apart once you ask a real operations question. They hallucinate inventory levels, invent suppliers, and contradict themselves between turns — with no log of what happened. I wanted to build something closer to what a platform team would actually ship: three specialist agents with clear scope, a typed state machine they share, real LLM calls in CI, and observability that isn't an afterthought.
The system doesn't need to serve real users, but it should be built the way you'd build it if it did. Typed Pydantic state, deterministic routing, human intervention gates, evals on every pull request, and Langfuse traces for every run. A platform engineer reading the repo should recognise their own job.
Typed graph state, not free-form scratchpads
The shared state is a Pydantic model where every field has a name, a type, and an owner. The forecasting agent can't silently overwrite the inventory plan; the supplier-risk agent can't mutate the demand signal. On a three-agent graph this feels like overkill. On anything bigger, it's the only reason you can refactor without breaking everything.
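The ownership rule can be sketched in a few lines. This is a hedged illustration, not the project's schema: stdlib dataclasses stand in for the Pydantic model, and the field names (`demand_forecast`, `inventory_plan`, `supplier_risk`) and agent names are invented for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GraphState:
    demand_forecast: dict  # owned by the forecasting agent
    inventory_plan: dict   # owned by the inventory agent
    supplier_risk: dict    # owned by the supplier-risk agent

# Map each field to the only agent allowed to write it.
FIELD_OWNERS = {
    "demand_forecast": "forecasting",
    "inventory_plan": "inventory",
    "supplier_risk": "supplier_risk",
}

def apply_update(state: GraphState, agent: str, **updates) -> GraphState:
    """Return a new state, rejecting writes to fields the agent doesn't own."""
    for name in updates:
        if FIELD_OWNERS.get(name) != agent:
            raise PermissionError(f"{agent} may not write {name}")
    return replace(state, **updates)
```

The frozen dataclass plus an explicit update function is what makes the guarantee enforceable rather than a convention: a forecasting node literally cannot clobber the inventory plan.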
Two-tier output guardrails
Every response passes a fast deterministic check first — regex, schema validation, range bounds, forbidden patterns — then a slower LLM-as-judge for softer things like whether the answer actually addresses the question and whether cited numbers come from real context. Both tiers log which check fired and why, so when something gets rejected you can see what happened instead of guessing.
Real-LLM integration tests in CI
Every PR runs a test suite against a real Ollama model, end-to-end, with the full graph executing. Assertions check structure — did the agent route correctly, did the guardrails fire — not specific token sequences, so the tests don't flake on model updates. I stopped trusting mocks for anything beyond unit tests after getting burned by suites that passed while the real model broke.
Human-in-the-loop at the points that matter
Pulling a forecast doesn't need a human. Approving a purchase order does. The graph has explicit interrupt points where a human reviews, edits, or rejects before the state machine continues. Interrupts are typed and logged, so you can replay any approval decision after the fact.
Self-hosted Langfuse, not vendor lock-in
Langfuse v3, ClickHouse, Postgres, Redis, and MinIO — nine services in one Docker Compose file, no cloud account, no per-trace billing. Every agent run logs prompts, tool calls, latencies, token counts, and guardrail outcomes. You own the data, and for anyone with data-residency constraints, self-hosted is the only option that passes review.
RAG that cites, not RAG that paraphrases
LlamaIndex over FAISS handles retrieval. Every claim the agents make about the corpus is tagged with its source chunk and surfaced in the trace. If a citation can't be produced, the agent says so instead of fabricating. Simple rule, and it solves most of the trust problem.
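The cite-or-refuse rule reduces to a small invariant. Hedged sketch: the toy keyword lookup stands in for LlamaIndex retrieval over FAISS, and the corpus, chunk ids, and facts are made up for illustration.

```python
CORPUS = {
    "chunk-01": "Supplier Alpha's average lead time is 12 days.",
    "chunk-02": "Warehouse B holds 4,000 units of SKU-9.",
}

def retrieve(query: str) -> list[tuple[str, str]]:
    """Toy retrieval: return (chunk_id, text) pairs sharing a keyword
    with the query. Stands in for the real vector search."""
    words = set(query.lower().split())
    return [(cid, txt) for cid, txt in CORPUS.items()
            if words & set(txt.lower().split())]

def answer_with_citation(query: str) -> dict:
    hits = retrieve(query)
    if not hits:
        # No citable chunk: say so instead of fabricating an answer.
        return {"answer": "I can't find a source for that in the corpus.",
                "citations": []}
    cid, text = hits[0]
    # Every claim carries the chunk id it came from, surfaced in the trace.
    return {"answer": text, "citations": [cid]}
```

The invariant is that `citations` is empty exactly when the agent declined to answer; there is no path that produces a claim without a source chunk attached.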
Reusable scaffolding
The supply-chain domain is incidental. Swap in a different set of agents and 90% of the scaffolding — typed state, guardrails, traces, evals — carries over unchanged.
Typed state as force multiplier
Once the graph state is a Pydantic model, static analysis catches boundary bugs, refactoring an agent's role becomes a type-checked operation, and the LLM gets a cleaner interface to write into. One structural decision that pays off in every direction.
Distance to production
Swap the in-process graph for a durable workflow engine (Temporal or LangGraph's persistence layer). Front the LLM with a gateway for retries and rate limits. Push traces to a long-term store. Add per-tenant isolation. The architecture is already shaped for these, so none of them require rewriting what's here.
Read the source, run it locally, open issues.