<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://shanmuga-sundaram-n.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://shanmuga-sundaram-n.github.io/" rel="alternate" type="text/html" /><updated>2026-04-08T01:07:27+00:00</updated><id>https://shanmuga-sundaram-n.github.io/feed.xml</id><title type="html">Shanmuga Sundaram Natarajan’s Personal Website</title><subtitle>Hands-on Architect &amp; Technical Leader | 18+ Years Experience | AI-First Software Delivery</subtitle><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><entry><title type="html">Why Most Teams Fail at Collaboration — And How Domain-Driven Design Fixes It</title><link href="https://shanmuga-sundaram-n.github.io/blog/2026/04/01/why-teams-fail-at-collaboration-ddd/" rel="alternate" type="text/html" title="Why Most Teams Fail at Collaboration — And How Domain-Driven Design Fixes It" /><published>2026-04-01T12:00:00+00:00</published><updated>2026-04-01T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2026/04/01/why-teams-fail-at-collaboration-ddd</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2026/04/01/why-teams-fail-at-collaboration-ddd/"><![CDATA[<p>Let me tell you about a bug that took three weeks to track down.</p>

<p>The backend team had added a <code class="language-plaintext highlighter-rouge">status</code> column to the <code class="language-plaintext highlighter-rouge">users</code> table. They meant it to track onboarding progress — steps 0 through 4. The frontend team read the same field and treated it as a boolean: account active or not. The data team exported it weekly and bucketed users into cold, warm, and hot engagement tiers.</p>

<p>No one documented any of this. No one thought they needed to. The column was called <code class="language-plaintext highlighter-rouge">status</code>. What could be clearer?</p>

<p>Three teams. One column. Three completely different mental models quietly coexisting in production until the day they collided — a badly timed migration that broke the frontend, corrupted the analytics pipeline, and sent the on-call engineer on a three-week archaeological dig through git history.</p>

<p>That’s the thing about collaboration failures. They don’t announce themselves. They hide in the gap between what you think a word means and what your colleague thinks it means. And by the time you find them, you’re usually staring at a production incident at 2am.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TD
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    FIELD["users.status&lt;br/&gt;one column,&lt;br/&gt;zero documentation"]:::yellow

    FIELD --&gt; BE["Backend&lt;br/&gt;status = onboarding step&lt;br/&gt;0 · 1 · 2 · 3 · 4"]:::blue
    FIELD --&gt; FE["Frontend&lt;br/&gt;status = account live?&lt;br/&gt;true · false"]:::blue
    FIELD --&gt; DA["Analytics&lt;br/&gt;status = engagement tier&lt;br/&gt;cold · warm · hot"]:::blue

    BE --&gt; OOF["💥 3-week incident&lt;br/&gt;Nobody was lying.&lt;br/&gt;Nobody was wrong.&lt;br/&gt;Nobody talked."]:::red

    FE --&gt; OOF
    DA --&gt; OOF

    OOF --&gt; LESSON["The problem wasn't technical.&lt;br/&gt;It was a missing shared model."]:::dim
</code></pre>
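<p>The collision is easy to reproduce in miniature. A hedged sketch — none of these functions existed in the actual incident, and the step names and tier thresholds are invented — but it shows how one integer can quietly carry three incompatible meanings:</p>

```python
# Three teams, one column: the same raw value read three different ways.
# All names and thresholds below are illustrative, not from the real system.

def backend_view(status: int) -> str:
    """Backend: status is an onboarding step, 0 through 4."""
    steps = ["signed_up", "verified_email", "added_profile", "invited_team", "done"]
    return steps[status]

def frontend_view(status: int) -> bool:
    """Frontend: any non-zero status means 'account active'."""
    return status != 0

def analytics_view(status: int) -> str:
    """Analytics: bucket the same number into engagement tiers."""
    if status >= 3:
        return "hot"
    if status >= 1:
        return "warm"
    return "cold"

status = 2  # one row, one value, three mental models
print(backend_view(status))    # "added_profile"
print(frontend_view(status))   # True
print(analytics_view(status))  # "warm"
```

<p>Every function is internally consistent. Nothing fails until a migration changes what the number means for one reader and not the others.</p>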

<hr />

<h2 id="its-not-a-people-problem-its-a-structure-problem">It’s not a people problem. It’s a structure problem.</h2>

<p>Most engineering managers diagnose this as a communication failure and reach for process. More standups. Mandatory documentation. A new Confluence page that everyone writes once and nobody ever reads again.</p>

<p>I’ve watched this play out at startups, at mid-sized companies, at enterprises. The Agile ceremonies don’t fix it. The retrospectives surface it but don’t solve it. The reason is that this isn’t a people problem or a process problem. It’s a <strong>structural</strong> problem.</p>

<p>When teams don’t share a coherent model of the domain they’re working in, every handoff becomes a translation exercise. And unlike foreign-language translation, nobody knows a translation is even happening. People assume they’re speaking the same language because they’re using the same words. They’re not.</p>

<p>This is what Domain-Driven Design (DDD) addresses. Eric Evans introduced the term in 2003 and the core idea is deceptively simple: the structure of your software should reflect the structure of the business. Not the structure of your database. Not the structure of your org chart from three reorgs ago. The actual domain — the problem space your business exists to solve.</p>

<p>Where DDD gets interesting is in how deeply it treats language as a first-class design concern.</p>

<hr />

<h2 id="shared-language-isnt-a-soft-skill--its-an-architecture-decision">Shared Language Isn’t a Soft Skill — It’s an Architecture Decision</h2>

<p>DDD calls this <em>Ubiquitous Language</em>. The idea is that every team — engineers, product managers, designers, analysts — uses the exact same vocabulary to describe the domain. No synonyms. No “well, we call it an order but finance calls it a transaction.” Just one term, one definition, used everywhere consistently.</p>

<p>That includes the code.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph BEFORE["Before — everyone improvises"]
        direction TB
        P1["PM writes: 'Purchase'"]:::red
        P2["Eng codes: 'Transaction'"]:::red
        P3["Design mocks: 'Booking'"]:::red
        P4["Analyst queries: 'Order'"]:::red
        P1 &amp; P2 &amp; P3 &amp; P4 --&gt; CHAOS["4 mental models.&lt;br/&gt;Nobody catches it&lt;br/&gt;until it breaks."]:::red
    end

    subgraph AFTER["After — one shared glossary"]
        direction TB
        G["Order&lt;br/&gt;─────────────────────&lt;br/&gt;A confirmed purchase with&lt;br/&gt;payment intent, owned by&lt;br/&gt;a Customer, containing&lt;br/&gt;one or more LineItems.&lt;br/&gt;Not a cart. Not a quote."]:::mauve
        Q1["PM"]:::yellow
        Q2["Eng"]:::yellow
        Q3["Design"]:::yellow
        Q4["Analyst"]:::yellow
        Q1 &amp; Q2 &amp; Q3 &amp; Q4 --&gt; G --&gt; GOOD["One model.&lt;br/&gt;Code matches the spec.&lt;br/&gt;Spec matches the meeting."]:::green
    end
</code></pre>

<p>When engineers name their classes and methods using the same language the business uses, something subtle but powerful happens: a product manager can read the code and recognise the concepts. An engineer can read a product spec without mentally translating it. Bugs that come from misunderstanding requirements — a whole category of bugs — start to disappear.</p>

<p>The discipline this requires is harder than it sounds. You have to resist the urge to rename things to what <em>you</em> think is cleaner. You have to resist the abstraction instinct. If the business calls it an “Order”, the class is <code class="language-plaintext highlighter-rouge">Order</code>. Not <code class="language-plaintext highlighter-rouge">PurchaseRecord</code>, not <code class="language-plaintext highlighter-rouge">TxnEntity</code>, not <code class="language-plaintext highlighter-rouge">SaleModel</code>.</p>

<p>The payoff is enormous. Teams that build on a shared language move noticeably faster because the gap between “what we decided” and “what we built” closes.</p>
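<p>What this looks like in code can be sketched in a few lines. This is a minimal, hypothetical example — the class shape and method names are mine, not from any real codebase — but the point is that the names match the glossary verbatim:</p>

```python
# Hedged sketch of Ubiquitous Language in code: the class and its methods
# use the business's exact words. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class Order:
    """An Order in the glossary's sense: a confirmed purchase with payment
    intent, containing one or more line items. Not a cart, not a quote."""
    order_id: str
    customer_id: str
    line_items: list = field(default_factory=list)
    status: str = "pending"

    def place(self) -> None:
        # "Place an order" is the phrase the PM uses in the spec;
        # the method name matches it word for word.
        if not self.line_items:
            raise ValueError("an Order must contain at least one LineItem")
        self.status = "confirmed"

    def cancel(self) -> None:
        self.status = "cancelled"
```

<p>A product manager reading <code class="language-plaintext highlighter-rouge">order.place()</code> in a code review recognises the verb from their own spec. That recognition is the whole point.</p>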

<hr />

<h2 id="every-big-model-eventually-collapses-under-its-own-weight">Every Big Model Eventually Collapses Under Its Own Weight</h2>

<p>Here’s a trap almost every growing company falls into.</p>

<p>Things start simple. You have a <code class="language-plaintext highlighter-rouge">Customer</code> object. It has a name, an email, maybe a billing address. Everyone uses it. Fine.</p>

<p>Then Sales needs to add a preferred rep. Support needs to track open ticket counts. Finance needs invoice history and credit limits. Marketing wants engagement scores. Twelve months later, <code class="language-plaintext highlighter-rouge">Customer</code> has 60 fields, a confusing network of relationships, and a comment at the top of the file that says <code class="language-plaintext highlighter-rouge">// DO NOT TOUCH - ask @someone before changing anything</code>.</p>

<p>Nobody owns it, so everybody owns it. Which means nobody does.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph LR
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph SALES["Sales Context  ·  Team: Commerce"]
        C1["Customer&lt;br/&gt;─────────────&lt;br/&gt;Shopping cart&lt;br/&gt;Purchase history&lt;br/&gt;Wishlist&lt;br/&gt;Assigned rep"]:::yellow
    end

    subgraph SUPPORT["Support Context  ·  Team: CX"]
        C2["Customer&lt;br/&gt;─────────────&lt;br/&gt;Open tickets&lt;br/&gt;Case history&lt;br/&gt;CSAT score&lt;br/&gt;SLA tier"]:::teal
    end

    subgraph FINANCE["Finance Context  ·  Team: Finance"]
        C3["Customer&lt;br/&gt;─────────────&lt;br/&gt;Invoice history&lt;br/&gt;Payment methods&lt;br/&gt;Credit limit&lt;br/&gt;Tax status"]:::blue
    end

    NOTE["Same word. Three lean models.&lt;br/&gt;Each team owns theirs.&lt;br/&gt;No one drowns in a god object."]:::dim

    C1 -.-&gt;|"CustomerID only"| NOTE
    C2 -.-&gt;|"CustomerID only"| NOTE
    C3 -.-&gt;|"CustomerID only"| NOTE
</code></pre>

<p>DDD gives you the concept of a <strong>Bounded Context</strong> to deal with this. A Bounded Context is just a boundary — explicit, named, intentional — within which your model applies. Inside Sales, <code class="language-plaintext highlighter-rouge">Customer</code> means one thing. Inside Finance, <code class="language-plaintext highlighter-rouge">Customer</code> means something else. Both are valid. Neither one bleeds into the other.</p>

<p>The only thing they share is a stable identifier (a <code class="language-plaintext highlighter-rouge">CustomerID</code>) that lets them talk about the same real-world entity without needing to agree on every attribute.</p>

<p>This isn’t just a modelling technique. It’s a team ownership technique. Bounded Contexts map directly to team responsibilities. The Sales context is the Commerce team’s problem. The Finance context is the Finance team’s problem. When something breaks in Finance’s <code class="language-plaintext highlighter-rouge">Customer</code> model, you know exactly whose phone to call. And crucially, the Commerce team doesn’t need to be in that conversation at all.</p>
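<p>In code, the two-models-one-identifier idea is small enough to sketch directly. The field names below are illustrative, not prescriptive — the structural point is that the only shared type is the id:</p>

```python
# Sketch: two bounded contexts, each with its own Customer model.
# They share exactly one concept -- the identifier. All field names invented.

from dataclasses import dataclass

@dataclass(frozen=True)
class CustomerId:
    """The shared identifier: the one thing both contexts agree on."""
    value: str

# --- Sales context: its Customer is about buying ---
@dataclass
class SalesCustomer:
    id: CustomerId
    assigned_rep: str
    wishlist: list

# --- Finance context: its Customer is about money ---
@dataclass
class FinanceCustomer:
    id: CustomerId
    credit_limit_cents: int
    tax_status: str

# Both models describe the same real-world person without agreeing
# on a single attribute beyond the id.
cid = CustomerId("cust-42")
sales = SalesCustomer(id=cid, assigned_rep="Priya", wishlist=[])
finance = FinanceCustomer(id=cid, credit_limit_cents=500_000, tax_status="standard")
assert sales.id == finance.id  # same entity, different models
```

<p>Neither class imports the other. That absence of an import is the boundary, made literal.</p>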

<hr />

<h2 id="drawing-the-map-nobody-draws">Drawing the Map Nobody Draws</h2>

<p>Most teams have multiple contexts whether they know it or not. The problem is they’re invisible. Dependencies between teams exist but nobody’s written them down. You discover them when something changes upstream and breaks three things downstream that nobody knew were connected.</p>

<p>A <strong>Context Map</strong> makes the invisible visible. It’s a diagram — doesn’t have to be fancy, hand-drawn is fine — showing all your contexts and how they relate to each other.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TD
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8
    

    INVENTORY["📦 Inventory&lt;br/&gt;Team: Fulfillment"]:::yellow
    PAYMENTS["💳 Payments&lt;br/&gt;Team: Finance"]:::yellow
    IDENTITY["🪪 Identity&lt;br/&gt;Team: Platform"]:::yellow

    ACL["Anti-Corruption Layer&lt;br/&gt;we translate their mess&lt;br/&gt;so it doesn't leak in"]:::red

    ORDERS["🛒 Orders&lt;br/&gt;Team: Commerce&lt;br/&gt;— this is the core —"]:::mauve

    BUS["Event Bus&lt;br/&gt;OrderPlaced · OrderShipped&lt;br/&gt;PaymentSettled · OrderCancelled"]:::teal

    NOTIF["🔔 Notifications&lt;br/&gt;Team: Engagement"]:::green
    REPORT["📊 Reporting&lt;br/&gt;Team: Analytics"]:::green
    SUPPORT["🎧 Support&lt;br/&gt;Team: CX"]:::green

    L1["Open Host Service&lt;br/&gt;they expose a stable API"]:::dim
    L2["Shared Kernel&lt;br/&gt;just the user identity bits"]:::dim
    L3["Customer / Supplier&lt;br/&gt;we told them what we need"]:::dim

    INVENTORY --&gt; ACL --&gt; ORDERS
    PAYMENTS --&gt; L1 --&gt; ORDERS
    IDENTITY --&gt; L2 --&gt; ORDERS

    ORDERS --&gt; BUS
    BUS --&gt; NOTIF
    BUS --&gt; REPORT
    ORDERS --&gt; L3 --&gt; SUPPORT
</code></pre>

<p>What makes this useful isn’t just the picture — it’s the relationship labels. DDD has names for the different ways contexts can relate, and those names carry a lot of weight:</p>

<p>An <strong>Anti-Corruption Layer</strong> is what you build when you have to integrate with a system that has a messy model you don’t control. You write a translation layer that converts their concepts into yours, so their chaos doesn’t leak into your clean domain. If you’ve ever written an adapter for a third-party API with bizarre field names and six levels of nesting, you’ve built an ACL without knowing what to call it.</p>
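<p>A minimal ACL sketch makes the idea concrete. The vendor payload shape and field names here are entirely invented — the pattern is what matters: one function owns all knowledge of the external mess, and nothing past it ever sees vendor vocabulary:</p>

```python
# Hypothetical Anti-Corruption Layer: translate a messy third-party payload
# into a clean internal model. The vendor's field names below are invented.

from dataclasses import dataclass

@dataclass(frozen=True)
class StockLevel:
    """Our domain's concept: how many units of a product we can sell."""
    product_id: str
    available: int

def from_vendor_payload(payload: dict) -> StockLevel:
    """The ACL: all knowledge of the vendor's shape lives here, nowhere else."""
    # The vendor nests quantities several levels deep and uses its own ids.
    item = payload["rsp"]["itm"][0]
    return StockLevel(
        product_id=item["xRef"]["ourSku"],
        available=int(item["qtyData"]["onHand"]) - int(item["qtyData"]["held"]),
    )

vendor_response = {
    "rsp": {"itm": [{"xRef": {"ourSku": "SKU-1"},
                     "qtyData": {"onHand": "12", "held": "2"}}]}
}
print(from_vendor_payload(vendor_response))  # StockLevel(product_id='SKU-1', available=10)
```

<p>When the vendor changes their format, exactly one function changes. The domain model never notices.</p>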

<p>A <strong>Shared Kernel</strong> means two teams own a small, explicitly agreed piece of the model together. Changes require coordination. Use this sparingly — shared ownership is shared risk.</p>

<p>A <strong>Customer/Supplier</strong> relationship is refreshingly honest. The downstream team (customer) tells the upstream team (supplier) what they need, and the upstream team tries to deliver it. Not always with perfect success, but at least it’s named.</p>

<p>Having names for these things matters because it lets you have more precise conversations. Instead of “we have a dependency,” you can say “we’re downstream from Payments in a Customer/Supplier relationship, and they keep breaking our integration.” That’s a different conversation.</p>

<hr />

<h2 id="your-architecture-is-just-your-team-structure-reflected-back-at-you">Your Architecture Is Just Your Team Structure, Reflected Back at You</h2>

<p>Conway’s Law is one of those observations that sounds cynical but is actually just true: any organisation that designs a system will produce a design whose structure mirrors the organisation’s communication structure.</p>

<p>Teams that don’t talk produce systems with unclear interfaces between them. Teams organised by technology layer — a frontend team, a backend team, a database team — produce a layered monolith. Not because anyone planned it that way, but because that’s how the Conway attractor works.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TB
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph BAD["❌ Org by tech layer → distributed monolith"]
        direction LR
        FET["Frontend Team"]:::red --&gt; FEA["React App"]:::red
        BET["Backend Team"]:::red --&gt; BEA["One giant API"]:::red
        DBT["DB Team"]:::red --&gt; DBA["Shared DB everyone writes to"]:::red
        FEA --&gt; BEA --&gt; DBA
    end

    subgraph GOOD["✅ Org by domain → real autonomy"]
        direction LR
        CT["Commerce Team"]:::yellow --&gt; CS["Order Service + its own DB"]:::green
        IT["Identity Team"]:::yellow --&gt; IS["Auth Service + its own DB"]:::green
        AT["Analytics Team"]:::yellow --&gt; AS["Reporting Service + its own DB"]:::green
    end

    CONWAY["Conway's Law in action.&lt;br/&gt;Flip your org structure&lt;br/&gt;and the architecture follows."]:::dim

    BAD -.-&gt; CONWAY
    GOOD -.-&gt; CONWAY
</code></pre>

<p>The insight DDD gives you here is to use Conway’s Law deliberately. If you want an architecture that’s organised by business domain, organise your teams by business domain first. The architecture will naturally follow. This is sometimes called the “Inverse Conway Maneuver” and it’s more effective than any amount of architectural governance.</p>

<hr />

<h2 id="a-vocabulary-for-the-code-itself">A Vocabulary for the Code Itself</h2>

<p>The strategic stuff — contexts, maps, language — gets the most attention, but DDD also has a set of tactical patterns that give engineers a shared vocabulary for implementation. These are the building blocks you use once you’ve drawn your boundaries.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
classDiagram
    class Order {
        &lt;&lt;Aggregate Root&gt;&gt;
        +OrderId id
        +CustomerId customerId
        +OrderStatus status
        +List~LineItem~ items
        +Money total
        +place() OrderPlaced
        +cancel() OrderCancelled
        +addItem(productId, qty)
    }

    class LineItem {
        &lt;&lt;Entity&gt;&gt;
        +LineItemId id
        +ProductId productId
        +Quantity qty
        +Money unitPrice
        +subtotal() Money
    }

    class Money {
        &lt;&lt;Value Object&gt;&gt;
        +Decimal amount
        +Currency currency
        +add(Money) Money
        +equals(Money) bool
    }

    class OrderStatus {
        &lt;&lt;Value Object&gt;&gt;
        PENDING
        CONFIRMED
        SHIPPED
        CANCELLED
    }

    class OrderPlaced {
        &lt;&lt;Domain Event&gt;&gt;
        +OrderId orderId
        +CustomerId customerId
        +Money total
        +DateTime occurredAt
    }

    class OrderRepository {
        &lt;&lt;Repository&gt;&gt;
        +findById(OrderId) Order
        +save(Order) void
        +findByCustomer(CustomerId) List~Order~
    }

    Order "1" *-- "1..*" LineItem : contains
    Order *-- Money : total
    Order *-- OrderStatus : status
    Order ..&gt; OrderPlaced : emits
    OrderRepository ..&gt; Order : persists
    LineItem *-- Money : unitPrice
</code></pre>

<p>An <strong>Aggregate</strong> is a cluster of objects that change together, with a single root that controls access. In the diagram above, <code class="language-plaintext highlighter-rouge">Order</code> is the root. You never reach into a <code class="language-plaintext highlighter-rouge">LineItem</code> directly from outside — you always go through <code class="language-plaintext highlighter-rouge">Order</code>. This enforces consistency boundaries in a way that’s explicit and understandable.</p>
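<p>A stripped-down sketch of that access rule (simplified deliberately — plain floats instead of a <code class="language-plaintext highlighter-rouge">Money</code> type, dicts instead of a <code class="language-plaintext highlighter-rouge">LineItem</code> class):</p>

```python
# Minimal aggregate sketch. The item list is private; every change
# goes through the Order root, so the root can enforce invariants.

class Order:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self._items = []  # leading underscore: outside code never touches this

    def add_item(self, product_id: str, qty: int, unit_price: float) -> None:
        # The root validates, so the invariant holds everywhere.
        if qty <= 0:
            raise ValueError("quantity must be positive")
        self._items.append({"product_id": product_id, "qty": qty,
                            "unit_price": unit_price})

    def total(self) -> float:
        return sum(i["qty"] * i["unit_price"] for i in self._items)

order = Order("ord-1")
order.add_item("SKU-1", qty=2, unit_price=9.50)
order.add_item("SKU-2", qty=1, unit_price=5.00)
print(order.total())  # 24.0
```

<p>Because nothing outside the aggregate can append to the list directly, the "quantity must be positive" rule can never be bypassed.</p>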

<p><strong>Value Objects</strong> are immutable. <code class="language-plaintext highlighter-rouge">Money</code> has no identity — a $10 value object is interchangeable with any other $10 value object of the same currency. This is a meaningful design decision that prevents a whole class of bugs (mutating a price on one object and accidentally affecting another).</p>

<p><strong>Domain Events</strong> record facts. <code class="language-plaintext highlighter-rouge">OrderPlaced</code> means something happened — past tense, immutable, true. It’s not a command. It’s not a request. Something happened and we’re recording it. This distinction matters for how teams interact with each other.</p>
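<p>Both ideas — values with no identity, and facts that never change — fall out naturally from immutable types. A sketch in Python, where a frozen dataclass gives immutability and value equality for free (the names mirror the class diagram above; the cents-based representation is my own simplification):</p>

```python
# Sketch: a Value Object and a Domain Event as frozen dataclasses.
# Frozen means any attempt to mutate raises FrozenInstanceError.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Money:
    """Value Object: no identity. Two $10 values are interchangeable."""
    amount_cents: int
    currency: str

    def add(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError("cannot add different currencies")
        # No mutation: adding produces a new value.
        return Money(self.amount_cents + other.amount_cents, self.currency)

@dataclass(frozen=True)
class OrderPlaced:
    """Domain Event: a past-tense, immutable fact. Not a command or request."""
    order_id: str
    total: Money
    occurred_at: datetime

assert Money(1000, "USD") == Money(1000, "USD")  # equal by value, not identity
subtotal = Money(1000, "USD").add(Money(250, "USD"))
event = OrderPlaced("ord-1", subtotal, datetime.now(timezone.utc))
# event.total = ...  would raise FrozenInstanceError: facts don't change
```

<p>The immutability isn't ceremony — it's the bug-prevention the paragraph above describes, enforced by the type system rather than by convention.</p>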

<hr />

<h2 id="domain-events-are-how-teams-stop-stepping-on-each-other">Domain Events Are How Teams Stop Stepping on Each Other</h2>

<p>The thing that changed how I think about team autonomy was understanding Domain Events properly. When a context publishes an event, it’s announcing a fact to the rest of the world without knowing or caring who’s listening. Other contexts react. No direct coupling. No “we need to call the Notifications team’s endpoint before we can ship.”</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
sequenceDiagram
    participant C as Customer
    participant O as Orders&lt;br/&gt;(Commerce Team)
    participant B as Event Bus
    participant P as Payments&lt;br/&gt;(Finance Team)
    participant N as Notifications&lt;br/&gt;(Engagement Team)
    participant R as Reporting&lt;br/&gt;(Analytics Team)

    C-&gt;&gt;O: place order
    O-&gt;&gt;O: validate, create Order aggregate
    O--&gt;&gt;B: OrderPlaced { orderId, total, customerId }
    Note over O,B: Commerce ships this.&lt;br/&gt;Nobody else was consulted.

    B--&gt;&gt;P: OrderPlaced
    P-&gt;&gt;P: charge card, record transaction
    P--&gt;&gt;B: PaymentSucceeded { orderId, txnId }

    B--&gt;&gt;N: PaymentSucceeded
    N-&gt;&gt;C: confirmation email

    B--&gt;&gt;R: OrderPlaced + PaymentSucceeded
    R-&gt;&gt;R: update revenue dashboard

    Note over P,R: All three teams deploy&lt;br/&gt;on their own schedule.&lt;br/&gt;No cross-team standups to ship.
</code></pre>

<p>Notice what’s missing: the Commerce team never calls the Notifications team directly. Engagement never has to wait on Finance to expose an API. Analytics doesn’t need to ask Commerce for access to order data. Everyone reacts to shared facts. Everyone can ship without scheduling around everyone else.</p>
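<p>The publish/react shape can be sketched with a toy in-memory bus — this is an illustration of the pattern, not a production broker, and the handler and event names are invented:</p>

```python
# Toy in-memory event bus: publishers announce facts; subscribers react.
# A sketch of the shape only -- real systems use a durable broker.

from collections import defaultdict

class EventBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The publisher neither knows nor cares who is listening.
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
log = []

# Each team registers its own reaction, on its own schedule.
bus.subscribe("OrderPlaced", lambda e: log.append(f"payments: charge {e['order_id']}"))
bus.subscribe("OrderPlaced", lambda e: log.append(f"reporting: count {e['order_id']}"))

# Commerce publishes a fact. That's the whole contract.
bus.publish("OrderPlaced", {"order_id": "ord-1", "total_cents": 2400})
print(log)  # ['payments: charge ord-1', 'reporting: count ord-1']
```

<p>Adding a fourth consumer requires zero changes to the publisher — which is exactly the autonomy the sequence diagram above is describing.</p>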

<p>This is the real payoff of Domain Events for collaboration. It’s not a technical trick — it’s a social contract encoded in architecture. “We commit to publishing accurate events. Do what you want with them.”</p>

<hr />

<h2 id="what-this-actually-looked-like-at-a-real-company">What This Actually Looked Like at a Real Company</h2>

<p>A fintech startup I know had three teams sharing a Rails monolith. Payments, Accounts, and Reporting all worked in the same codebase, all writing to the same <code class="language-plaintext highlighter-rouge">Account</code> model. It had accumulated 60-something fields over 18 months. Nobody wanted to touch it.</p>

<p>Releasing anything required all three teams to coordinate. When Payments changed an account state transition, a Reporting query somewhere would fail silently, and nobody noticed. The Accounts team had a standing rule that any migration needed sign-off from two other teams before it went out. Releases happened every three weeks. Everyone was exhausted.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TB
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph BEFORE["Before — one model, three teams, constant friction"]
        direction TB
        SHARED["Account&lt;br/&gt;60+ fields · 20+ associations&lt;br/&gt;// DO NOT TOUCH&lt;br/&gt;// ask someone first"]:::red
        PT["Payments Team"]:::yellow --&gt; SHARED
        AT["Accounts Team"]:::yellow --&gt; SHARED
        RT["Reporting Team"]:::yellow --&gt; SHARED
        SHARED --&gt; PAIN["3-week release cycles.&lt;br/&gt;Every migration needs&lt;br/&gt;sign-off from 2 other teams.&lt;br/&gt;Everyone is tired."]:::red
    end

    subgraph AFTER["After — three contexts, events, autonomy"]
        direction LR
        PC["Payment Processing&lt;br/&gt;owns: charge lifecycle&lt;br/&gt;refunds · disputes"]:::mauve
        AC["Account Management&lt;br/&gt;owns: balance · ledger&lt;br/&gt;account state"]:::mauve
        BUS2["Event Bus&lt;br/&gt;PaymentSettled&lt;br/&gt;AccountCredited&lt;br/&gt;AccountDebited"]:::teal
        RC["Financial Reporting&lt;br/&gt;owns: its own read model&lt;br/&gt;built from events"]:::green
        PC --&gt; BUS2
        AC --&gt; BUS2
        BUS2 --&gt; RC
    end

    RESULT["Each team ships when ready.&lt;br/&gt;Cycle time: days not weeks.&lt;br/&gt;Cross-team meetings dropped by half."]:::dim
    AFTER --&gt; RESULT
</code></pre>

<p>They ran an Event Storming workshop — three hours, lots of sticky notes, some heated arguments about what “account” actually meant. They came out with three clearly named contexts, a rough event schema, and the realisation that Reporting didn’t actually need access to the <code class="language-plaintext highlighter-rouge">Account</code> model at all. It just needed events.</p>

<p>Six months later they were shipping daily. The <code class="language-plaintext highlighter-rouge">Account</code> model still existed in its original bloated form in the Accounts context, but it was that team’s problem to clean up on their own timeline. Payments had a lean model. Reporting had a read model built from events. Everyone owned their own thing.</p>

<hr />

<h2 id="you-dont-have-to-do-all-of-this-at-once">You Don’t Have to Do All of This at Once</h2>

<p>The reason DDD gets a reputation for being heavyweight is that people treat it as an all-or-nothing proposition. They read Evans’ book, see 560 pages, and either implement everything or none of it. Neither is the right call.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    S1["1. Event Storming&lt;br/&gt;──────────&lt;br/&gt;3 hours, sticky notes,&lt;br/&gt;domain experts in the room.&lt;br/&gt;Surfaces what you don't know."]:::yellow

    S2["2. Shared Glossary&lt;br/&gt;──────────&lt;br/&gt;Pick your 5 most&lt;br/&gt;misused terms.&lt;br/&gt;Write definitions. Get sign-off."]:::blue

    S3["3. Context Map&lt;br/&gt;──────────&lt;br/&gt;Draw the boundaries&lt;br/&gt;that already exist&lt;br/&gt;but aren't documented."]:::mauve

    S4["4. Tactical Patterns&lt;br/&gt;──────────&lt;br/&gt;Value Objects for money.&lt;br/&gt;Domain Events for&lt;br/&gt;cross-team handoffs."]:::teal

    S5["5. Protect Boundaries&lt;br/&gt;──────────&lt;br/&gt;Anti-Corruption Layers&lt;br/&gt;for legacy systems and&lt;br/&gt;third-party APIs."]:::green

    S1 --&gt; S2 --&gt; S3 --&gt; S4 --&gt; S5
</code></pre>

<p>Start with the thing that gives you the most value with the least disruption. For most teams that’s a combination of steps 1 and 2. Run an Event Storming session — just a few hours, domain experts and engineers in the same room, mapping out what actually happens in the business. Then build a glossary for the five terms that cause the most confusion in your standups.</p>

<p>You can do both of those things without touching your architecture. You’ll still get an immediate improvement in how clearly your team communicates. The rest — the context boundaries, the tactical patterns, the event-driven integration — you layer in when it makes sense.</p>

<p>The trap to avoid is introducing DDD vocabulary without the discipline. If half the team calls it an Order and the other half still says Transaction, you haven’t adopted Ubiquitous Language — you’ve just added jargon. Full commitment to the shared vocabulary is the one thing that shouldn’t be done halfway.</p>

<hr />

<h2 id="the-real-unlock">The Real Unlock</h2>

<p>Here’s the thing that doesn’t get said enough about DDD: the primary benefit isn’t better architecture. It’s <strong>faster trust</strong>.</p>

<p>When teams have explicit boundaries, they can trust each other to stay inside them. When events are the handoff mechanism, no team can break another team’s internals. When the language is shared, you stop wasting half a meeting realising you’ve been talking past each other.</p>

<p>That’s what makes collaboration actually work at scale. Not more process. Not more documentation. Structures that make the right thing easy and the wrong thing obvious — so teams can move fast without constantly stepping on each other.</p>

<p>DDD gives you those structures. It took the software world a while to really absorb what Evans was getting at, but the core insight holds up: the way you model your domain determines how well your teams can work together. Get the model right, and a lot of the collaboration friction disappears.</p>

<p>Get it wrong, and no amount of Agile ceremony will save you.</p>

<hr />

<p><em>If you made it this far, your next move is simple: schedule a two-hour Event Storming session with your team. Bring in someone from product, someone from the business side, and your senior engineers. You’ll be surprised how quickly the domain’s real shape reveals itself — and how much everyone disagrees on terms you all thought you shared.</em></p>

<hr />

<p><strong>Tags:</strong> Domain-Driven Design · Software Architecture · Team Collaboration · Engineering Culture · Microservices</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="architecture" /><category term="ddd" /><category term="domain-driven-design" /><category term="software-architecture" /><category term="team-collaboration" /><category term="engineering-culture" /><category term="microservices" /><category term="bounded-context" /><summary type="html"><![CDATA[Three teams. One database column. Three different mental models — until it all broke at 2am. Here's how Domain-Driven Design gives you the structures that make collaboration actually work at scale.]]></summary></entry><entry><title type="html">Building RAG That Actually Works: Lessons from the Trenches</title><link href="https://shanmuga-sundaram-n.github.io/blog/2026/03/21/building-production-rag-pipeline/" rel="alternate" type="text/html" title="Building RAG That Actually Works: Lessons from the Trenches" /><published>2026-03-21T12:00:00+00:00</published><updated>2026-03-21T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2026/03/21/building-production-rag-pipeline</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2026/03/21/building-production-rag-pipeline/"><![CDATA[<p>I’ve read probably a dozen RAG tutorials. They all do the same thing: show you how to embed a handful of PDFs into a vector store, run a similarity search, stuff the results into a prompt, and call it a production pipeline. Then you try to use the same approach on real data — thousands of documents, mixed formats, users with messy natural language queries — and the whole thing falls apart. The answers are vague, wrong, or confidently referencing documents that have nothing to do with the question.</p>

<p>That gap between “works in the tutorial” and “works in production” is what this post is about. I’ve built a few of these pipelines now, and I’ve made almost every mistake there is to make — wrong chunk sizes, no overlap, skipping the re-ranker, shipping without any evaluation. I’m going to walk through the full pipeline — from chunking to evaluation — not as a sanitized tutorial, but as the thing I wish I’d had when I started building this stuff for real.</p>

<p>We’ll use LangChain, ChromaDB, and OpenAI throughout. If you use different tools, the concepts all transfer.</p>

<hr />

<h2 id="the-two-phases-you-need-to-separate-in-your-head">The Two Phases You Need to Separate in Your Head</h2>

<p>Before any code, the most important mental model is that RAG is two completely separate systems that happen to share a vector database.</p>

<p>The <strong>indexing pipeline</strong> is offline. It runs on a schedule or when documents change. It loads your source files, chunks them, converts them to embeddings, and writes them to a vector store. Speed isn’t critical here. Correctness is.</p>

<p>The <strong>query pipeline</strong> is online. It runs on every user request, and it needs to be fast. It embeds the user’s question, retrieves the most relevant chunks, builds a prompt, and calls the LLM.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#89B4FA', 'lineColor': '#A6ADC8', 'secondaryColor': '#24243E', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'clusterBorder': '#45475A', 'titleColor': '#CDD6F4', 'edgeLabelBackground': '#1E1E2E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow  fill:#313244,stroke:#F9E2AF,color:#F9E2AF,rx:4
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA,rx:4
    classDef mauve   fill:#313244,stroke:#CBA6F7,color:#CBA6F7,rx:4
    classDef green   fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1,rx:4
    classDef red     fill:#313244,stroke:#F38BA8,color:#F38BA8,rx:4

    subgraph OFFLINE["① OFFLINE · INDEXING"]
        A["Documents&lt;br/&gt;PDF / HTML / TXT"]:::yellow
        B["Document Loader"]:::blue
        C["Text Splitter&lt;br/&gt;chunk_size=1000, overlap=200"]:::blue
        D["Embedding Model&lt;br/&gt;text-embedding-3-large"]:::mauve
        A --&gt; B --&gt; C --&gt; D
    end

    VS[("Vector Store&lt;br/&gt;ChromaDB")]:::green

    D --&gt; VS

    subgraph ONLINE["② ONLINE · QUERY"]
        F["User Question"]:::yellow
        G["Embed Question"]:::blue
        H["Similarity Search&lt;br/&gt;MMR  k=5, fetch_k=20"]:::blue
        I["Re-ranker&lt;br/&gt;cross-encoder  top-5"]:::red
        J["Prompt Builder → LLM&lt;br/&gt;temperature=0"]:::green
        K(["Answer"]):::green
        F --&gt; G --&gt; H --&gt; I --&gt; J --&gt; K
    end

    VS --&gt;|top-k chunks| H
</code></pre>

<p>Keeping these two phases decoupled is the first thing most tutorials get wrong. If your indexing and querying code are tangled together, you’ll end up in situations where you can’t re-index without restarting your query service, or where a slow re-embed job blocks user requests. Treat them as separate processes from day one.</p>
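<p>One lightweight way to keep the processes separate without letting them drift apart is to put the few facts they must agree on (store location, collection name, embedding model) into a single frozen config that both import. This is a sketch of a convention, not a LangChain API; all the names here are mine:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    """The only coupling between the indexer and the query service.

    If the two processes disagree on the embedding model or dimensions,
    retrieval degrades silently, so freeze the contract in one place.
    """
    persist_directory: str = "./chroma_db"
    collection_name: str = "knowledge_base"
    embedding_model: str = "text-embedding-3-large"
    embedding_dimensions: int = 1536

# index.py and query.py both do: from rag_config import CONFIG
CONFIG = RagConfig()
```

<p>With this in place, switching embedding models becomes an explicit config change that both pipelines pick up together, rather than a mismatch you discover in production.</p>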

<p>Here’s how to get your environment set up:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># requirements.txt equivalent — install these first
# pip install langchain langchain-community langchain-openai langchain-chroma chromadb openai tiktoken pypdf
</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="c1"># Set your API key (in real deployments, export it in your shell or use a secret manager instead of hardcoding)
</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"OPENAI_API_KEY"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"your-openai-api-key"</span>

<span class="c1"># Verify environment
</span><span class="kn">import</span> <span class="nn">openai</span>
<span class="kn">import</span> <span class="nn">chromadb</span>
<span class="kn">import</span> <span class="nn">langchain</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"LangChain version: </span><span class="si">{</span><span class="n">langchain</span><span class="p">.</span><span class="n">__version__</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ChromaDB version: </span><span class="si">{</span><span class="n">chromadb</span><span class="p">.</span><span class="n">__version__</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Environment ready"</span><span class="p">)</span>
</code></pre></div></div>

<p>Run this and confirm your package versions before going further. Version mismatches between LangChain and ChromaDB have caused me more pain than any bug I’ve written myself.</p>

<hr />

<h2 id="chunking-the-part-i-got-wrong-for-two-weeks">Chunking: The Part I Got Wrong for Two Weeks</h2>

<p>I’ll be direct: chunk size is the single most important decision in this entire pipeline. It took me two weeks of debugging poor retrieval quality before I realized my chunks were the problem, not my embedding model or retrieval code. I had set <code class="language-plaintext highlighter-rouge">chunk_size=2000</code> thinking “more context = better” and ended up with bloated, unfocused chunks that pulled in too much noise along with the relevant content.</p>

<p>The intuition is simple. A chunk is the atomic unit of retrieval. When a user asks a question, the system fetches the N most relevant chunks and hands them to the LLM. If your chunks are too large, each one contains multiple topics and the similarity score gets diluted. Too small, and you end up with fragments that don’t make sense without their surrounding context.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart TD
    classDef peach   fill:#313244,stroke:#FAB387,color:#FAB387
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green   fill:#313244,stroke:#A6E3A1,color:#A6E3A1
    classDef teal    fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim     fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    ROOT["Document&lt;br/&gt;Chunking Strategy?"]:::dim

    ROOT --&gt;|"Fixed size"| FC["Fixed Character&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Fast&lt;br/&gt;Quality : Basic&lt;br/&gt;Use : Prototypes only"]:::peach
    ROOT --&gt;|"Recursive split"| RC["Recursive Character  ★ recommended&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Fast&lt;br/&gt;Quality : Very Good&lt;br/&gt;Use : General production"]:::blue
    ROOT --&gt;|"Embedding-based"| SC["Semantic Splitter&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Slow&lt;br/&gt;Quality : Excellent&lt;br/&gt;Use : High-stakes retrieval"]:::green
    ROOT --&gt;|"Structure-aware"| MC["HTML / Markdown&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Fast&lt;br/&gt;Quality : Very Good&lt;br/&gt;Use : Structured docs"]:::teal
</code></pre>
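<p>The dilution problem is easy to see with toy numbers. Here is a pure-Python sketch using made-up three-term count vectors (real embeddings have 1,536 dimensions, but the geometry is the same): a chunk about one topic matches a one-topic query far better than a chunk that crams three topics together.</p>

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy term-count vectors over the vocabulary [attention, database, pricing]
query         = [4, 0, 0]  # the user asks only about attention
focused_chunk = [5, 0, 0]  # small chunk, one topic
bloated_chunk = [5, 4, 4]  # big chunk, three topics mixed together

print(round(cosine(query, focused_chunk), 2))  # 1.0
print(round(cosine(query, bloated_chunk), 2))  # 0.66 (same relevant content, diluted score)
```

<p>The bloated chunk contains exactly the same relevant content as the focused one; the two extra topics alone drag its score down.</p>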

<p>For most use cases, <code class="language-plaintext highlighter-rouge">RecursiveCharacterTextSplitter</code> with <code class="language-plaintext highlighter-rouge">chunk_size=1000</code> and <code class="language-plaintext highlighter-rouge">chunk_overlap=200</code> is a solid starting point. The recursive part matters: it tries to split on paragraph breaks first, then sentence boundaries, then spaces, only falling back to raw character splits as a last resort. This means your chunks are much more likely to contain complete thoughts rather than sentences cut in half.</p>

<p>The overlap is non-negotiable. Without it, a concept that straddles a chunk boundary gets split in two, and whichever half gets retrieved will be missing critical context.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef blue   fill:#1A1A2E,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A2E1A,stroke:#A6E3A1,color:#A6E3A1
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph NO["# without overlap"]
        C1["chunk_1&lt;br/&gt;…concept begins"]:::red
        C2["chunk_2&lt;br/&gt;continues…"]:::red
        C3["chunk_3&lt;br/&gt;…conclusion"]:::red
        C1 --&gt; C2 --&gt; C3
        LOST["⚠ context from chunk_1&lt;br/&gt;   is LOST at boundary"]:::dim
        C2 -.-&gt;|boundary gap| LOST
    end

    subgraph YES["# overlap=200"]
        A1["chunk_1&lt;br/&gt;…concept begins"]:::blue
        A2["chunk_1 tail (200 chars)&lt;br/&gt;+ chunk_2 new content"]:::green
        A1 --&gt;|"overlapping tail carried forward"| A2
        OK["✓ context preserved&lt;br/&gt;  across boundary"]:::dim
        A2 -.-&gt; OK
    end
</code></pre>
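<p>The mechanics are easier to see in code than in prose. This naive fixed-window chunker is not what <code class="language-plaintext highlighter-rouge">RecursiveCharacterTextSplitter</code> does (the real splitter also respects paragraph and sentence boundaries), but the overlap arithmetic is identical: each chunk starts <code class="language-plaintext highlighter-rouge">size - overlap</code> characters after the previous one, so the tail of every chunk is repeated at the head of the next.</p>

```python
def fixed_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Naive character-window chunker; illustrates overlap mechanics only."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far each new chunk advances
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The attention mechanism weighs every token against every other token."
chunks = fixed_window_chunks(text, size=40, overlap=10)

# the last 10 characters of the first chunk reappear at the start of the second
print(chunks[0][-10:] == chunks[1][:10])  # True
```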

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_community.document_loaders</span> <span class="kn">import</span> <span class="n">PyPDFLoader</span><span class="p">,</span> <span class="n">DirectoryLoader</span>
<span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">RecursiveCharacterTextSplitter</span>
<span class="kn">from</span> <span class="nn">langchain.schema</span> <span class="kn">import</span> <span class="n">Document</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>

<span class="k">def</span> <span class="nf">load_documents</span><span class="p">(</span><span class="n">source_dir</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]:</span>
    <span class="s">"""Load all PDF documents from a directory."""</span>
    <span class="n">loader</span> <span class="o">=</span> <span class="n">DirectoryLoader</span><span class="p">(</span>
        <span class="n">source_dir</span><span class="p">,</span>
        <span class="n">glob</span><span class="o">=</span><span class="s">"**/*.pdf"</span><span class="p">,</span>
        <span class="n">loader_cls</span><span class="o">=</span><span class="n">PyPDFLoader</span><span class="p">,</span>
        <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>
    <span class="n">documents</span> <span class="o">=</span> <span class="n">loader</span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loaded </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span><span class="si">}</span><span class="s"> pages from </span><span class="si">{</span><span class="n">source_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">documents</span>


<span class="k">def</span> <span class="nf">chunk_documents</span><span class="p">(</span>
    <span class="n">documents</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span>
    <span class="n">chunk_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">,</span>
    <span class="n">chunk_overlap</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">200</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]:</span>
    <span class="s">"""
    Split documents into overlapping chunks.

    chunk_overlap=200 ensures continuity — if a concept spans a chunk
    boundary, both chunks will contain enough context to be meaningful.
    """</span>
    <span class="n">splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="p">(</span>
        <span class="n">chunk_size</span><span class="o">=</span><span class="n">chunk_size</span><span class="p">,</span>
        <span class="n">chunk_overlap</span><span class="o">=</span><span class="n">chunk_overlap</span><span class="p">,</span>
        <span class="c1"># Try these separators in order — fall back to the next if needed
</span>        <span class="n">separators</span><span class="o">=</span><span class="p">[</span><span class="s">"</span><span class="se">\n\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">". "</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="s">""</span><span class="p">],</span>
        <span class="n">length_function</span><span class="o">=</span><span class="nb">len</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">chunks</span> <span class="o">=</span> <span class="n">splitter</span><span class="p">.</span><span class="n">split_documents</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Split </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span><span class="si">}</span><span class="s"> pages into </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s"> chunks"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Average chunk size: </span><span class="si">{</span><span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">page_content</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">)</span> <span class="o">//</span> <span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s"> chars"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">chunks</span>


<span class="c1"># --- Run it ---
</span><span class="n">docs</span> <span class="o">=</span> <span class="n">load_documents</span><span class="p">(</span><span class="s">"./docs"</span><span class="p">)</span>
<span class="n">chunks</span> <span class="o">=</span> <span class="n">chunk_documents</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">chunk_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>

<span class="c1"># Inspect a sample chunk
</span><span class="n">sample</span> <span class="o">=</span> <span class="n">chunks</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">--- Sample Chunk ---"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Content: </span><span class="si">{</span><span class="n">sample</span><span class="p">.</span><span class="n">page_content</span><span class="p">[</span><span class="si">:</span><span class="mi">300</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Metadata: </span><span class="si">{</span><span class="n">sample</span><span class="p">.</span><span class="n">metadata</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>You should see chunk counts and average sizes printed out. I strongly recommend inspecting a handful of sample chunks manually before proceeding. If your chunks are constantly cutting off mid-sentence, reduce the <code class="language-plaintext highlighter-rouge">chunk_size</code> or double-check that your separator list is actually matching your document structure.</p>

<p>One more thing I wish someone had told me: always print the average chunk size after splitting. If it’s dramatically smaller than your target (say, you set 1000 but average is 400), your documents are full of very short paragraphs and you probably need to reconsider your separator strategy.</p>
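<p>A small helper makes that a habit instead of a one-off print. It works on plain strings, so it applies to any splitter's output (pass in <code class="language-plaintext highlighter-rouge">[c.page_content for c in chunks]</code>); the half-target threshold is just my rule of thumb for flagging undersized chunks:</p>

```python
def chunk_stats(texts: list[str], target: int = 1000) -> dict:
    """Summarize chunk lengths so undersized outliers are visible at a glance."""
    lengths = sorted(len(t) for t in texts)
    n = len(lengths)
    return {
        "count": n,
        "min": lengths[0],
        "median": lengths[n // 2],
        "avg": sum(lengths) // n,
        "max": lengths[-1],
        "under_half_target": sum(1 for length in lengths if length < target // 2),
    }

# three fake chunks: two near the 1000-char target, one suspiciously short
print(chunk_stats(["x" * 980, "y" * 400, "z" * 1010]))
# {'count': 3, 'min': 400, 'median': 980, 'avg': 796, 'max': 1010, 'under_half_target': 1}
```

<p>If <code class="language-plaintext highlighter-rouge">under_half_target</code> is a large fraction of the total, that's the short-paragraph symptom described above, and the separator strategy needs another look.</p>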

<hr />

<h2 id="embeddings-and-the-vector-store-boring-but-critical">Embeddings and the Vector Store: Boring but Critical</h2>

<p>This part feels like plumbing, and it kind of is. But bad plumbing causes leaks.</p>

<p>The premise is that semantically similar text produces vectors that are geometrically close in high-dimensional space. “How does attention work?” and “Explain the self-attention mechanism” don’t share many words, but their embeddings will be very close because they mean the same thing. That’s the whole trick.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow  fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef mauve   fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef green   fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef dim     fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    Q["'What is self-attention?'&lt;br/&gt;query text"]:::yellow
    QV["[0.12, -0.87, 0.34, …]&lt;br/&gt;dim=1536"]:::yellow

    C["'Self-attention allows tokens&lt;br/&gt;to attend to all other tokens'&lt;br/&gt;chunk text"]:::blue
    CV["[0.11, -0.89, 0.31, …]&lt;br/&gt;dim=1536"]:::blue

    SIM["cosine_sim() = 0.97"]:::mauve
    MATCH["HIGH MATCH&lt;br/&gt;semantically equivalent"]:::green
    NOTE["# different words&lt;br/&gt;# same meaning&lt;br/&gt;# geometrically close"]:::dim

    Q  --&gt;|"embed()"| QV
    C  --&gt;|"embed()"| CV
    QV --&gt;  SIM
    CV --&gt;  SIM
    SIM --&gt; MATCH
    MATCH -.-&gt; NOTE
</code></pre>
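<p>To demystify what the vector store does with those vectors: conceptually it just scores every stored vector against the query and keeps the best k (real stores use approximate indexes such as HNSW so they never scan everything). A brute-force sketch with toy 3-dimensional vectors and invented chunk names:</p>

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# pretend these came out of the indexing pipeline (real ones have 1,536 dims)
store = {
    "chunk_attention": [0.11, -0.89, 0.31],
    "chunk_pricing":   [0.72,  0.10, -0.55],
    "chunk_database":  [-0.40, 0.33,  0.84],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Brute-force nearest neighbours: score everything, sort, keep the best k."""
    ranked = sorted(store, key=lambda name: cosine(query_vec, store[name]), reverse=True)
    return ranked[:k]

print(top_k([0.12, -0.87, 0.34]))  # ['chunk_attention', 'chunk_database']
```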

<p>I use <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code> for anything where quality matters. It’s more expensive than <code class="language-plaintext highlighter-rouge">text-embedding-3-small</code>, but for a knowledge base application the quality difference is real and the cost per query is still small. For high-volume applications where you’re embedding millions of documents and running thousands of queries a day, <code class="language-plaintext highlighter-rouge">text-embedding-3-small</code> is worth benchmarking.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="kn">from</span> <span class="nn">langchain_chroma</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="kn">import</span> <span class="nn">chromadb</span>

<span class="k">def</span> <span class="nf">build_vector_store</span><span class="p">(</span>
    <span class="n">chunks</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span>
    <span class="n">persist_directory</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./chroma_db"</span><span class="p">,</span>
    <span class="n">collection_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"knowledge_base"</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Chroma</span><span class="p">:</span>
    <span class="s">"""
    Embed all document chunks and store them in ChromaDB.

    Uses text-embedding-3-large for best retrieval quality.
    Loads the existing store instead of re-embedding if one is already present.
    """</span>

    <span class="c1"># Initialize the embedding model
</span>    <span class="c1"># text-embedding-3-large: 3072 native dims (reduced to 1536 below), excellent quality
</span>    <span class="c1"># text-embedding-3-small: cheaper, good for high-volume apps
</span>    <span class="n">embeddings</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"text-embedding-3-large"</span><span class="p">,</span>
        <span class="n">dimensions</span><span class="o">=</span><span class="mi">1536</span>
    <span class="p">)</span>

    <span class="c1"># Check if vector store already exists to avoid re-embedding
</span>    <span class="k">if</span> <span class="n">Path</span><span class="p">(</span><span class="n">persist_directory</span><span class="p">).</span><span class="n">exists</span><span class="p">():</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading existing vector store from </span><span class="si">{</span><span class="n">persist_directory</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">vector_store</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span>
            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection_name</span><span class="p">,</span>
            <span class="n">embedding_function</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
            <span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span>
        <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Building new vector store with </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s"> chunks..."</span><span class="p">)</span>

        <span class="c1"># Chroma.from_documents handles embedding + storing in one call
</span>        <span class="n">vector_store</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">.</span><span class="n">from_documents</span><span class="p">(</span>
            <span class="n">documents</span><span class="o">=</span><span class="n">chunks</span><span class="p">,</span>
            <span class="n">embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection_name</span><span class="p">,</span>
            <span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Vector store built and persisted to </span><span class="si">{</span><span class="n">persist_directory</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Verify the store (_collection is a Chroma-internal handle; fine for a sanity check)
</span>    <span class="n">count</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">_collection</span><span class="p">.</span><span class="n">count</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Vector store contains </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> vectors"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">vector_store</span>


<span class="c1"># --- Build it ---
</span><span class="n">vector_store</span> <span class="o">=</span> <span class="n">build_vector_store</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span>
</code></pre></div></div>

<p>In production, documents change. People update wikis, replace PDFs, add new reports. You don’t want to re-embed your entire corpus every time a single file changes. The incremental indexing pattern below handles this with a simple content hash:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="k">def</span> <span class="nf">get_document_hash</span><span class="p">(</span><span class="n">doc</span><span class="p">:</span> <span class="n">Document</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Generate a stable hash for a document chunk based on its content."""</span>
    <span class="k">return</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">md5</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">upsert_documents</span><span class="p">(</span>
    <span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span>
    <span class="n">new_chunks</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""
    Add only new/changed documents to an existing vector store.
    Avoids re-embedding documents that haven't changed.
    """</span>
    <span class="c1"># Fetch all existing document IDs from the store
</span>    <span class="n">existing</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">_collection</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">include</span><span class="o">=</span><span class="p">[</span><span class="s">"metadatas"</span><span class="p">])</span>
    <span class="n">existing_hashes</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"content_hash"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">meta</span> <span class="ow">in</span> <span class="n">existing</span><span class="p">[</span><span class="s">"metadatas"</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"content_hash"</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="c1"># Filter to only chunks we haven't seen before
</span>    <span class="n">new_docs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">new_chunks</span><span class="p">:</span>
        <span class="n">content_hash</span> <span class="o">=</span> <span class="n">get_document_hash</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">content_hash</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">existing_hashes</span><span class="p">:</span>
            <span class="c1"># Stamp the chunk with its hash for future deduplication
</span>            <span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"content_hash"</span><span class="p">]</span> <span class="o">=</span> <span class="n">content_hash</span>
            <span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"indexed_at"</span><span class="p">]</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
            <span class="n">new_docs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">new_docs</span><span class="p">:</span>
        <span class="n">vector_store</span><span class="p">.</span><span class="n">add_documents</span><span class="p">(</span><span class="n">new_docs</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Added </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">new_docs</span><span class="p">)</span><span class="si">}</span><span class="s"> new chunks to the vector store"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"No new documents to index — everything is up to date"</span><span class="p">)</span>

    <span class="k">return</span> <span class="p">{</span><span class="s">"added"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">new_docs</span><span class="p">),</span> <span class="s">"skipped"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">new_chunks</span><span class="p">)</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">new_docs</span><span class="p">)}</span>
</code></pre></div></div>

<p>The first time you run this on a large corpus, embedding takes a while. Budget accordingly. On 10,000 chunks with <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code>, expect roughly 2-3 minutes and a few dollars in API costs.</p>
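<p>A back-of-the-envelope estimate before kicking off a bulk run can save a surprise bill. This sketch uses the rough 4-characters-per-token heuristic (tokenize with <code class="language-plaintext highlighter-rouge">tiktoken</code> if you need an exact count), and the default price is an assumption; verify against your provider's current pricing:</p>

```python
def estimate_embedding_cost(texts, price_per_million_tokens=0.13, chars_per_token=4):
    """Rough pre-flight estimate for a bulk embedding run.

    Uses the ~4-characters-per-token heuristic; tokenize with tiktoken
    if you need an exact count. The default price is an assumption --
    verify against your provider's current pricing page.
    """
    total_chars = sum(len(t) for t in texts)
    est_tokens = total_chars // chars_per_token
    cost = est_tokens / 1_000_000 * price_per_million_tokens
    return {"estimated_tokens": est_tokens, "estimated_cost_usd": round(cost, 4)}


# 10,000 chunks of ~1,000 characters each
print(estimate_embedding_cost(["x" * 1_000] * 10_000))
# → {'estimated_tokens': 2500000, 'estimated_cost_usd': 0.325}
```

<p>The estimate is deliberately coarse; its job is to catch order-of-magnitude mistakes (embedding a million chunks when you meant ten thousand) before the API bill does.</p>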

<hr />

<h2 id="retrieval-where-most-pipelines-silently-die">Retrieval: Where Most Pipelines Silently Die</h2>

<p>Here’s what I mean by “silently.” A naive similarity search will almost always return <em>something</em>. The chunks it returns will usually be topically related to the query. The LLM will usually produce a fluent, confident answer. The problem is that the answer might be incomplete, subtly wrong, or built on the third-best chunk rather than the most relevant one. You’ll never know unless you’re logging and measuring.</p>

<p>The two main failure modes I’ve seen:</p>

<p><strong>Redundant retrieval.</strong> You ask for <code class="language-plaintext highlighter-rouge">k=5</code> chunks and get back 5 chunks that all say essentially the same thing. You’ve used your entire context window on one perspective of the topic and left out everything else.</p>

<p><strong>Unfocused retrieval in multi-domain knowledge bases.</strong> If your vector store has documents from HR, engineering, finance, and legal all mixed together, a query about “approval process” might retrieve chunks from three different departments when the user only cared about one.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow  fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef teal    fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef red     fill:#313244,stroke:#F38BA8,color:#F38BA8
    classDef green   fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef mauve   fill:#313244,stroke:#CBA6F7,color:#CBA6F7

    UQ["User&lt;br/&gt;Query"]:::yellow
    EMB["Embedder&lt;br/&gt;embed()"]:::blue
    VS["Vector Store&lt;br/&gt;similarity_search()"]:::teal
    RR["Re-ranker&lt;br/&gt;CrossEncoder&lt;br/&gt;top-20 → top-5"]:::red
    PB["Prompt&lt;br/&gt;Builder"]:::green
    LLM["LLM&lt;br/&gt;generate()"]:::mauve
    ANS(["Answer"]):::green

    UQ --&gt;|"raw question"| EMB
    EMB --&gt;|"query vector"| VS
    VS --&gt;|"top-20 candidates"| RR
    RR --&gt;|"re-ranked top-5"| PB
    PB --&gt;|"augmented prompt"| LLM
    LLM --&gt; ANS
</code></pre>

<p>MMR (Maximal Marginal Relevance) solves the redundancy problem. Instead of returning the top-K most similar chunks, it returns the top-K that are both relevant to the query <em>and</em> maximally different from each other. I should have been using this from the start.</p>
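<p>Under the hood, MMR is a greedy loop: at each step it picks the candidate that maximizes <code class="language-plaintext highlighter-rouge">lambda_mult * sim(query, d) - (1 - lambda_mult) * max_sim(d, already_selected)</code>. A minimal toy sketch (not LangChain's implementation) to make the trade-off concrete:</p>

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_select(query_vec, candidate_vecs, k=5, lambda_mult=0.7):
    """Greedy MMR: reward similarity to the query, penalize similarity
    to chunks already picked. Returns indices into candidate_vecs."""
    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(selected) != k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # How close is this candidate to anything we already chose?
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

<p>With <code class="language-plaintext highlighter-rouge">lambda_mult=1.0</code> this degenerates to plain top-k similarity; lowering it pushes near-duplicate chunks out of the result set in favor of chunks that add new information.</p>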

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">build_retriever</span><span class="p">(</span><span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span> <span class="n">strategy</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"mmr"</span><span class="p">):</span>
    <span class="s">"""
    Build a retriever with different strategies:
    - 'similarity': pure cosine similarity (fast, simple)
    - 'mmr': Maximal Marginal Relevance (diverse results, reduces redundancy)
    - 'filtered': similarity with metadata filtering
    """</span>

    <span class="k">if</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"similarity"</span><span class="p">:</span>
        <span class="c1"># Basic similarity search — good for small, focused knowledge bases
</span>        <span class="n">retriever</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(</span>
            <span class="n">search_type</span><span class="o">=</span><span class="s">"similarity"</span><span class="p">,</span>
            <span class="n">search_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"k"</span><span class="p">:</span> <span class="mi">5</span><span class="p">}</span>
        <span class="p">)</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"mmr"</span><span class="p">:</span>
        <span class="c1"># MMR balances relevance AND diversity
</span>        <span class="c1"># fetch_k=20: fetch 20 candidates, then select 5 maximally diverse ones
</span>        <span class="n">retriever</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(</span>
            <span class="n">search_type</span><span class="o">=</span><span class="s">"mmr"</span><span class="p">,</span>
            <span class="n">search_kwargs</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"k"</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>           <span class="c1"># final number of results
</span>                <span class="s">"fetch_k"</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span>    <span class="c1"># candidate pool size
</span>                <span class="s">"lambda_mult"</span><span class="p">:</span> <span class="mf">0.7</span>  <span class="c1"># 1.0 = pure similarity, 0.0 = pure diversity
</span>            <span class="p">}</span>
        <span class="p">)</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"filtered"</span><span class="p">:</span>
        <span class="c1"># Filter by metadata before similarity search
</span>        <span class="c1"># Useful when documents have tags, dates, categories, etc.
</span>        <span class="n">retriever</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(</span>
            <span class="n">search_type</span><span class="o">=</span><span class="s">"similarity"</span><span class="p">,</span>
            <span class="n">search_kwargs</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"k"</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
                <span class="s">"filter"</span><span class="p">:</span> <span class="p">{</span><span class="s">"source"</span><span class="p">:</span> <span class="s">"annual_report_2025.pdf"</span><span class="p">}</span>
            <span class="p">}</span>
        <span class="p">)</span>

    <span class="k">return</span> <span class="n">retriever</span>


<span class="k">def</span> <span class="nf">retrieve_with_scores</span><span class="p">(</span><span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
    <span class="s">"""
    Retrieve chunks with their similarity scores for debugging/logging.
    """</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">similarity_search_with_score</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Query: '</span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'─'</span> <span class="o">*</span> <span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">doc</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
        <span class="c1"># ChromaDB returns L2 distance (lower = more similar)
</span>        <span class="c1"># Convert to 0-1 similarity for readability
</span>        <span class="n">similarity</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">score</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Result </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s"> | Similarity: </span><span class="si">{</span><span class="n">similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Source: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'source'</span><span class="p">,</span> <span class="s">'unknown'</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Content: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">[</span><span class="si">:</span><span class="mi">200</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">results</span>


<span class="c1"># --- Test retrieval ---
</span><span class="n">retriever</span> <span class="o">=</span> <span class="n">build_retriever</span><span class="p">(</span><span class="n">vector_store</span><span class="p">,</span> <span class="n">strategy</span><span class="o">=</span><span class="s">"mmr"</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">retrieve_with_scores</span><span class="p">(</span>
    <span class="n">vector_store</span><span class="p">,</span>
    <span class="n">query</span><span class="o">=</span><span class="s">"How does self-attention work in transformers?"</span><span class="p">,</span>
    <span class="n">k</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Run this and look at the similarity scores. If your top result is below 0.75, something is off. Either your chunks are too large, your embedding model is mismatched for your domain, or your documents genuinely don’t contain a good answer to the query.</p>
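<p>That check is worth automating. Here is a small guard (a sketch, using the same <code class="language-plaintext highlighter-rouge">1 / (1 + distance)</code> conversion as above) that flags weak retrievals instead of letting them flow silently into the prompt:</p>

```python
def check_retrieval_quality(results, min_top_similarity=0.75):
    """Flag weak retrievals instead of passing them silently to the LLM.

    `results` is a list of (doc, l2_distance) pairs, as returned by
    similarity_search_with_score; distances are converted with the
    same 1 / (1 + distance) mapping used above.
    """
    if not results:
        return {"ok": False, "reason": "nothing retrieved"}
    top_similarity = 1 / (1 + results[0][1])
    ok = top_similarity >= min_top_similarity
    if not ok:
        # Log it -- this is exactly the signal that "silent" failures hide
        print(
            f"WARNING: top similarity {top_similarity:.3f} is under "
            f"{min_top_similarity}. Check chunk size, embedding model fit, "
            "or whether the corpus actually covers this query."
        )
    return {"ok": ok, "top_similarity": round(top_similarity, 3)}
```

<p>Wire this into your request path and log the results. A week of traffic will tell you which queries your knowledge base genuinely can't answer, which is information no amount of prompt tuning will surface.</p>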

<h3 id="the-re-ranker-i-resisted-for-too-long">The re-ranker I resisted for too long</h3>

<p>Honestly, I avoided adding a cross-encoder re-ranker for months because of the added latency. That was a mistake. The difference in retrieval quality was significant enough that I ended up adding it anyway after watching users get mediocre answers on questions that should have been easy.</p>

<p>The core issue is that embedding models (bi-encoders) work by embedding the query and each document <em>independently</em> and then comparing them. They’re fast but coarse. A cross-encoder, by contrast, takes the query and a document together as a pair and scores them jointly. It’s much more accurate at judging whether a specific document actually answers a specific question.</p>

<p>The trade-off: cross-encoders are slow. You wouldn’t use one to search a million documents. But used as a re-ranker on the top 20 candidates your vector store already retrieved, the latency is acceptable (usually 50-150ms for a batch of 20).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">CrossEncoder</span>

<span class="k">class</span> <span class="nc">CrossEncoderReranker</span><span class="p">:</span>
    <span class="s">"""Re-rank retrieved documents using a cross-encoder model."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"cross-encoder/ms-marco-MiniLM-L-6-v2"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">CrossEncoder</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">rerank</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">documents</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span>
        <span class="n">top_n</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]:</span>
        <span class="s">"""
        Score each (query, document) pair and return top_n by score.
        """</span>
        <span class="c1"># Build pairs for the cross-encoder
</span>        <span class="n">pairs</span> <span class="o">=</span> <span class="p">[(</span><span class="n">query</span><span class="p">,</span> <span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">)</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">documents</span><span class="p">]</span>

        <span class="c1"># Cross-encoder scores each pair (query is considered with each doc)
</span>        <span class="n">scores</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">pairs</span><span class="p">)</span>

        <span class="c1"># Sort by score descending, keep top_n
</span>        <span class="n">ranked</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
            <span class="nb">zip</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">documents</span><span class="p">),</span>
            <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
            <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span>
        <span class="p">)</span>

        <span class="n">top_docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">ranked</span><span class="p">[:</span><span class="n">top_n</span><span class="p">]]</span>

        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Re-ranked </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span><span class="si">}</span><span class="s"> → </span><span class="si">{</span><span class="n">top_n</span><span class="si">}</span><span class="s"> documents"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="n">doc</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">ranked</span><span class="p">[:</span><span class="n">top_n</span><span class="p">]):</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Rank </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: score=</span><span class="si">{</span><span class="n">score</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">[</span><span class="si">:</span><span class="mi">80</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">top_docs</span>


<span class="c1"># --- Use the re-ranker ---
</span><span class="n">reranker</span> <span class="o">=</span> <span class="n">CrossEncoderReranker</span><span class="p">()</span>
<span class="n">candidate_docs</span> <span class="o">=</span> <span class="n">retriever</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="s">"How does self-attention work?"</span><span class="p">)</span>
<span class="n">reranked_docs</span> <span class="o">=</span> <span class="n">reranker</span><span class="p">.</span><span class="n">rerank</span><span class="p">(</span>
    <span class="n">query</span><span class="o">=</span><span class="s">"How does self-attention work?"</span><span class="p">,</span>
    <span class="n">documents</span><span class="o">=</span><span class="n">candidate_docs</span><span class="p">,</span>
    <span class="n">top_n</span><span class="o">=</span><span class="mi">3</span>
<span class="p">)</span>
</code></pre></div></div>

<p>With the re-ranker in place, your retrieval pipeline now works in two stages: the vector store does a fast, broad sweep to surface 20 candidates, and the cross-encoder does a precise, slow pass to select the best 3-5 from that pool. That’s the combination I’d use as a default for any knowledge base application.</p>
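<p>One way to keep those two stages swappable is to inject each as a callable. This is an illustrative sketch, not code from the pipeline above: the point is that stage 1 could become BM25, or stage 2 a different cross-encoder, without touching anything else:</p>

```python
def two_stage_retrieve(fetch_candidates, rerank, query, top_n=5):
    """Stage 1: cheap, broad recall; stage 2: precise re-ranking.

    Both stages are injected as callables so either can be replaced
    independently of the pipeline itself.
    """
    candidates = fetch_candidates(query)           # fast, recall-oriented
    return rerank(query, candidates, top_n=top_n)  # slow, precision-oriented


# Hypothetical wiring with the retriever and re-ranker defined earlier:
# answer_docs = two_stage_retrieve(
#     retriever.invoke, reranker.rerank,
#     "How does self-attention work?", top_n=3,
# )
```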

<hr />

<h2 id="prompt-construction-dont-waste-good-retrieval-on-a-bad-prompt">Prompt Construction: Don’t Waste Good Retrieval on a Bad Prompt</h2>

<p>You’ve worked hard to get the right chunks. Now you need to actually use them well. This part is simpler than the retrieval work but still matters more than most people give it credit for.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart TD
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8

    RC["retrieved_chunks&lt;br/&gt;from re-ranker"]:::teal
    UQ["user_question"]:::yellow
    SI["system_instructions&lt;br/&gt;role + constraints"]:::dim

    CF["format_context()&lt;br/&gt;add labels · tiktoken budget&lt;br/&gt;max_tokens=6000"]:::teal

    PT["ChatPromptTemplate&lt;br/&gt;system | context | question"]:::yellow

    LLM["ChatOpenAI&lt;br/&gt;temperature=0"]:::mauve

    RES["response&lt;br/&gt;grounded answer"]:::green
    FLAG["# flag if not grounded&lt;br/&gt;# in retrieved context"]:::red

    RC --&gt;|"chunks"| CF
    CF --&gt;|"formatted context"| PT
    UQ --&gt; PT
    SI --&gt; PT
    PT --&gt;|"assembled prompt"| LLM
    LLM --&gt; RES
    LLM -.-&gt;|"hallucination check"| FLAG
</code></pre>

<p>Two things consistently trip people up here.</p>

<p>First, token budgets. If you’re not explicitly counting tokens before building your prompt, you are relying on luck. LangChain won’t error out when you exceed the context window. It’ll silently truncate, and you’ll get answers based on incomplete context. Always count with tiktoken.</p>

<p>Second, the system prompt. Tell the model to cite its sources, tell it to say “I don’t know” when the context doesn’t contain the answer, and tell it explicitly not to draw on outside knowledge. Without these constraints, GPT-4 in particular will happily synthesize an answer from its training data when the retrieved context falls short, and you’ll have no idea it’s doing it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
<span class="kn">from</span> <span class="nn">langchain.prompts</span> <span class="kn">import</span> <span class="n">ChatPromptTemplate</span>
<span class="kn">from</span> <span class="nn">langchain.schema.output_parser</span> <span class="kn">import</span> <span class="n">StrOutputParser</span>
<span class="kn">import</span> <span class="nn">tiktoken</span>

<span class="c1"># ── Token budget management ──────────────────────────────────────────
</span><span class="k">def</span> <span class="nf">count_tokens</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"gpt-4o"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">"""Count tokens using tiktoken to avoid exceeding context window."""</span>
    <span class="n">enc</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="p">.</span><span class="n">encoding_for_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">enc</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">format_context</span><span class="p">(</span><span class="n">docs</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span> <span class="n">max_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">6000</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Format retrieved docs into a context string, respecting a token budget.
    Prioritizes earlier (higher-ranked) chunks when truncating.
    """</span>
    <span class="n">context_parts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">total_tokens</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">doc</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">docs</span><span class="p">):</span>
        <span class="n">chunk_text</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"[Source </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'source'</span><span class="p">,</span> <span class="s">'unknown'</span><span class="p">)</span><span class="si">}</span><span class="s">]</span><span class="se">\n</span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="si">}</span><span class="s">"</span>
        <span class="n">chunk_tokens</span> <span class="o">=</span> <span class="n">count_tokens</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">total_tokens</span> <span class="o">+</span> <span class="n">chunk_tokens</span> <span class="o">&gt;</span> <span class="n">max_tokens</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Token budget reached at chunk </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">. Truncating context."</span><span class="p">)</span>
            <span class="k">break</span>

        <span class="n">context_parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span>
        <span class="n">total_tokens</span> <span class="o">+=</span> <span class="n">chunk_tokens</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Context: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">context_parts</span><span class="p">)</span><span class="si">}</span><span class="s"> chunks, </span><span class="si">{</span><span class="n">total_tokens</span><span class="si">}</span><span class="s"> tokens"</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">"</span><span class="se">\n\n</span><span class="s">---</span><span class="se">\n\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context_parts</span><span class="p">)</span>


<span class="c1"># ── Prompt Template ───────────────────────────────────────────────────
</span><span class="n">RAG_PROMPT</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="p">.</span><span class="n">from_messages</span><span class="p">([</span>
    <span class="p">(</span><span class="s">"system"</span><span class="p">,</span> <span class="s">"""You are a precise, helpful assistant. Answer the user's question
using ONLY the information provided in the context below.

Rules:
- If the context doesn't contain enough information to answer confidently, say so explicitly.
- Do not make up facts not present in the context.
- Cite the source number (e.g. [Source 1]) when referencing specific information.
- Be concise but complete.

Context:
{context}
"""</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"human"</span><span class="p">,</span> <span class="s">"{question}"</span><span class="p">)</span>
<span class="p">])</span>


<span class="c1"># ── Full RAG Pipeline ─────────────────────────────────────────────────
</span><span class="k">class</span> <span class="nc">RAGPipeline</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span>
        <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"gpt-4o"</span><span class="p">,</span>
        <span class="n">retrieval_strategy</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"mmr"</span><span class="p">,</span>
        <span class="n">use_reranker</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">retriever</span> <span class="o">=</span> <span class="n">build_retriever</span><span class="p">(</span><span class="n">vector_store</span><span class="p">,</span> <span class="n">strategy</span><span class="o">=</span><span class="n">retrieval_strategy</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">reranker</span> <span class="o">=</span> <span class="n">CrossEncoderReranker</span><span class="p">()</span> <span class="k">if</span> <span class="n">use_reranker</span> <span class="k">else</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model_name</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">prompt</span> <span class="o">=</span> <span class="n">RAG_PROMPT</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Execute the full RAG pipeline and return answer + sources."""</span>

        <span class="c1"># Step 1: Retrieve candidate chunks
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[1/4] Retrieving candidates for: '</span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
        <span class="n">candidates</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">retriever</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"      Retrieved </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">candidates</span><span class="p">)</span><span class="si">}</span><span class="s"> candidates"</span><span class="p">)</span>

        <span class="c1"># Step 2: Re-rank (optional)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">reranker</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[2/4] Re-ranking candidates..."</span><span class="p">)</span>
            <span class="n">docs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">reranker</span><span class="p">.</span><span class="n">rerank</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">candidates</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">docs</span> <span class="o">=</span> <span class="n">candidates</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span>

        <span class="c1"># Step 3: Format context with token budget
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[3/4] Formatting context..."</span><span class="p">)</span>
        <span class="n">context</span> <span class="o">=</span> <span class="n">format_context</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">6000</span><span class="p">)</span>

        <span class="c1"># Step 4: Generate response
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[4/4] Generating response..."</span><span class="p">)</span>
        <span class="n">chain</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">prompt</span> <span class="o">|</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">|</span> <span class="n">StrOutputParser</span><span class="p">()</span>
        <span class="n">answer</span> <span class="o">=</span> <span class="n">chain</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"context"</span><span class="p">:</span> <span class="n">context</span><span class="p">,</span> <span class="s">"question"</span><span class="p">:</span> <span class="n">question</span><span class="p">})</span>

        <span class="c1"># Collect source metadata
</span>        <span class="n">sources</span> <span class="o">=</span> <span class="nb">list</span><span class="p">({</span><span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"source"</span><span class="p">,</span> <span class="s">"unknown"</span><span class="p">)</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">})</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"question"</span><span class="p">:</span> <span class="n">question</span><span class="p">,</span>
            <span class="s">"answer"</span><span class="p">:</span> <span class="n">answer</span><span class="p">,</span>
            <span class="s">"sources"</span><span class="p">:</span> <span class="n">sources</span><span class="p">,</span>
            <span class="s">"chunks_used"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>
        <span class="p">}</span>


<span class="c1"># ── Run it end-to-end ─────────────────────────────────────────────────
</span><span class="n">pipeline</span> <span class="o">=</span> <span class="n">RAGPipeline</span><span class="p">(</span>
    <span class="n">vector_store</span><span class="o">=</span><span class="n">vector_store</span><span class="p">,</span>
    <span class="n">model_name</span><span class="o">=</span><span class="s">"gpt-4o"</span><span class="p">,</span>
    <span class="n">retrieval_strategy</span><span class="o">=</span><span class="s">"mmr"</span><span class="p">,</span>
    <span class="n">use_reranker</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="s">"What is the difference between self-attention and cross-attention?"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"═"</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Question: </span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'question'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Answer:</span><span class="se">\n</span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'answer'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Sources: </span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'sources'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Chunks used: </span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'chunks_used'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>When you run this, you’ll see the pipeline logging each stage (retrieval, re-ranking, context formatting, generation) along with the final answer and the source documents it drew from. That logging isn’t cosmetic; it’s how you debug when something goes wrong.</p>

<p>Also: <code class="language-plaintext highlighter-rouge">temperature=0</code>. I cannot stress this enough. RAG applications are not creative writing tasks. You want the model to be deterministic and faithful to its context. Set it to zero and leave it there.</p>

<hr />

<h2 id="you-cant-improve-what-you-dont-measure">You Can’t Improve What You Don’t Measure</h2>

<p>Here’s a question most RAG tutorials skip entirely: how do you know your pipeline is actually good?</p>

<p>“It seems to answer things correctly” is not an answer. I’ve seen pipelines that produce fluent, confident, well-structured responses that are subtly wrong 30% of the time. You won’t catch that without systematic evaluation.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart TD
    classDef root   fill:#45475A,stroke:#CDD6F4,color:#CDD6F4
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#313244,stroke:#A6E3A1,color:#A6E3A1
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    ROOT["evaluate(dataset)&lt;br/&gt;RAGAS"]:::root

    subgraph RQ["Retrieval Quality"]
        CR["context_recall&lt;br/&gt;# all relevant docs retrieved?"]:::blue
        CP["context_precision&lt;br/&gt;# retrieved docs are relevant?"]:::blue
        MR["MRR / NDCG&lt;br/&gt;# best docs ranked highest?"]:::blue
    end

    subgraph GQ["Generation Quality"]
        FA["faithfulness&lt;br/&gt;# answer grounded in context?&lt;br/&gt;target &gt; 0.85"]:::green
        AR["answer_relevancy&lt;br/&gt;# answer addresses question?"]:::green
        HR["hallucination_rate&lt;br/&gt;# facts not in context?"]:::green
    end

    NOTE["# scores: 0.0 → 1.0&lt;br/&gt;# run after every pipeline change"]:::dim

    ROOT --&gt; CR &amp; CP &amp; MR
    ROOT --&gt; FA &amp; AR &amp; HR
    FA &amp; AR &amp; HR -.-&gt; NOTE
</code></pre>

<p>RAGAS is the library I use for this. It evaluates four things that matter:</p>

<p><strong>Faithfulness</strong> measures whether the answer is actually grounded in the retrieved context. This is your hallucination detector. A low score here means your LLM is drawing on its parametric memory instead of your documents.</p>

<p><strong>Answer relevancy</strong> measures whether the answer actually addresses the question. You can be faithful to the context while still giving a technically correct but non-responsive answer.</p>

<p><strong>Context recall</strong> and <strong>context precision</strong> measure the retrieval layer specifically: did you get the right documents, and only the right documents?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pip install ragas datasets
</span><span class="kn">from</span> <span class="nn">ragas</span> <span class="kn">import</span> <span class="n">evaluate</span>
<span class="kn">from</span> <span class="nn">ragas.metrics</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">faithfulness</span><span class="p">,</span>        <span class="c1"># Is the answer grounded in the retrieved context?
</span>    <span class="n">answer_relevancy</span><span class="p">,</span>   <span class="c1"># Does the answer actually address the question?
</span>    <span class="n">context_recall</span><span class="p">,</span>     <span class="c1"># Were the relevant documents retrieved?
</span>    <span class="n">context_precision</span><span class="p">,</span>  <span class="c1"># Are the retrieved documents actually relevant?
</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">Dataset</span>

<span class="k">def</span> <span class="nf">evaluate_rag_pipeline</span><span class="p">(</span><span class="n">pipeline</span><span class="p">:</span> <span class="n">RAGPipeline</span><span class="p">,</span> <span class="n">test_cases</span><span class="p">:</span> <span class="nb">list</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""
    Evaluate the RAG pipeline on a set of test cases using RAGAS metrics.

    test_cases format:
    [
        {
            "question": "...",
            "ground_truth": "...",  # Expected answer
        },
        ...
    ]
    """</span>

    <span class="n">questions</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">answers</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">contexts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">ground_truths</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Running evaluation on </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test_cases</span><span class="p">)</span><span class="si">}</span><span class="s"> test cases..."</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">test</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">test_cases</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Test </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test_cases</span><span class="p">)</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">test</span><span class="p">[</span><span class="s">'question'</span><span class="p">][</span><span class="si">:</span><span class="mi">60</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

        <span class="n">result</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>

        <span class="c1"># Also retrieve raw context for RAGAS
</span>        <span class="n">raw_docs</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">retriever</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>
        <span class="n">context_texts</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">raw_docs</span><span class="p">[:</span><span class="mi">4</span><span class="p">]]</span>

        <span class="n">questions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>
        <span class="n">answers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">"answer"</span><span class="p">])</span>
        <span class="n">contexts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">context_texts</span><span class="p">)</span>
        <span class="n">ground_truths</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"ground_truth"</span><span class="p">])</span>

    <span class="c1"># Build dataset for RAGAS
</span>    <span class="n">eval_dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_dict</span><span class="p">({</span>
        <span class="s">"question"</span><span class="p">:</span> <span class="n">questions</span><span class="p">,</span>
        <span class="s">"answer"</span><span class="p">:</span> <span class="n">answers</span><span class="p">,</span>
        <span class="s">"contexts"</span><span class="p">:</span> <span class="n">contexts</span><span class="p">,</span>
        <span class="s">"ground_truth"</span><span class="p">:</span> <span class="n">ground_truths</span>
    <span class="p">})</span>

    <span class="c1"># Run RAGAS evaluation
</span>    <span class="n">scores</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span>
        <span class="n">eval_dataset</span><span class="p">,</span>
        <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="n">faithfulness</span><span class="p">,</span> <span class="n">answer_relevancy</span><span class="p">,</span> <span class="n">context_recall</span><span class="p">,</span> <span class="n">context_precision</span><span class="p">]</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">scores</span>


<span class="c1"># --- Define test cases ---
</span><span class="n">test_cases</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"question"</span><span class="p">:</span> <span class="s">"What is the role of positional encoding in transformers?"</span><span class="p">,</span>
        <span class="s">"ground_truth"</span><span class="p">:</span> <span class="s">"Positional encoding adds information about the position of tokens in a sequence, since the transformer architecture itself has no inherent notion of order."</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"question"</span><span class="p">:</span> <span class="s">"How does multi-head attention differ from single-head attention?"</span><span class="p">,</span>
        <span class="s">"ground_truth"</span><span class="p">:</span> <span class="s">"Multi-head attention runs self-attention multiple times in parallel with different learned projections, allowing the model to attend to information from different representation subspaces."</span>
    <span class="p">}</span>
<span class="p">]</span>

<span class="n">scores</span> <span class="o">=</span> <span class="n">evaluate_rag_pipeline</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">test_cases</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Evaluation Results:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Faithfulness:       </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'faithfulness'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Answer Relevancy:   </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'answer_relevancy'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Context Recall:     </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'context_recall'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Context Precision:  </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'context_precision'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>You’ll get four scores between 0 and 1. Faithfulness below 0.85 tells you your LLM is going off-script. Context precision below 0.80 tells you your retrieval is pulling in irrelevant chunks. Start with a test set of 20-30 questions and treat these numbers as your baseline before you change anything else.</p>
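<p>If you want that baseline to be enforceable rather than aspirational, the scores can feed a simple pass/fail gate. A minimal sketch (the function, thresholds, and dict shape here are my own illustration, not part of the RAGAS API):</p>

```python
# Hypothetical quality gate: names and thresholds are illustrative,
# not part of the RAGAS API.
THRESHOLDS = {
    "faithfulness": 0.85,       # below this, the LLM is going off-script
    "context_precision": 0.80,  # below this, retrieval pulls irrelevant chunks
    "answer_relevancy": 0.80,
    "context_recall": 0.80,
}

def check_baseline(scores: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return human-readable failures; an empty list means all metrics pass."""
    failures = []
    for metric, floor in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < floor:
            failures.append(f"{metric}: {value:.3f} < {floor:.2f}")
    return failures

failures = check_baseline({"faithfulness": 0.91, "context_precision": 0.72,
                           "answer_relevancy": 0.88, "context_recall": 0.83})
print(failures)  # ['context_precision: 0.720 < 0.80']
```

<p>Run this in CI against your fixed test set and a regression in any metric fails the build instead of slipping into production.</p>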

<hr />

<h2 id="what-id-tell-myself-before-starting">What I’d Tell Myself Before Starting</h2>

<p>A few things I’ve distilled from all of this:</p>

<p><strong>Chunk size is where you should spend your debugging time first.</strong> Before you blame your embedding model or experiment with exotic retrieval strategies, print out some sample chunks and ask yourself whether they contain coherent, complete thoughts. This is unglamorous work but it pays off faster than anything else.</p>
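<p>That inspection step doesn't need tooling. A throwaway helper like this (entirely my own sketch, with crude heuristics you'd tune per corpus) surfaces the most common problems:</p>

```python
def inspect_chunks(chunks: list[str], sample: int = 5) -> list[str]:
    """Print a few chunks with basic stats and return warnings for suspect ones."""
    warnings = []
    for i, chunk in enumerate(chunks[:sample]):
        text = chunk.strip()
        print(f"--- chunk {i} ({len(text)} chars) ---")
        print(text[:200])
        # Heuristic: a chunk that doesn't end in sentence punctuation usually
        # means the splitter cut through the middle of a coherent thought.
        if text and text[-1] not in ".!?\"'":
            warnings.append(f"chunk {i} ends mid-sentence: ...{text[-40:]!r}")
        if len(text) < 50:
            warnings.append(f"chunk {i} is only {len(text)} chars")
    return warnings

warnings = inspect_chunks(["Attention weighs token pairs.", "and then the encoder"])
```

<p>Five minutes of reading these printouts tells you more about retrieval quality than an hour of staring at similarity scores.</p>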

<p><strong>The indexing and query pipelines are separate systems.</strong> Build them that way from the start. Your future self will thank you when you need to re-index 50,000 documents at 2am without touching the query service.</p>

<p><strong>Use MMR.</strong> Pure cosine similarity is fine for demos. In production with a real knowledge base, you’ll get redundant results constantly. MMR takes an extra parameter and fixes this. There’s no good reason not to use it.</p>
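<p>If MMR feels like a black box, the core idea fits in a few lines of plain Python. This is an illustrative reimplementation, not LangChain's internals: each pick trades relevance to the query against similarity to documents already selected.</p>

```python
def mmr_select(query_sims, doc_sims, k=3, lam=0.5):
    """Pick k docs maximizing: lam * sim(query, d) - (1 - lam) * max sim(d, selected).

    query_sims: list where query_sims[i] = similarity(query, doc_i)
    doc_sims:   matrix where doc_sims[i][j] = similarity(doc_i, doc_j)
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates similar to anything we've already picked.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; both are highly relevant to the query.
query_sims = [0.9, 0.88, 0.7]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
print(mmr_select(query_sims, doc_sims, k=2))  # [0, 2]: skips the near-duplicate
```

<p>Pure similarity would have returned docs 0 and 1, wasting a context slot on redundant information. That's the whole argument for MMR in one toy example.</p>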

<p>On re-ranking: I resisted it for too long. The latency cost (50-150ms) is real, but for a knowledge base application where users expect accurate answers, it’s worth it. If you’re building something latency-sensitive, benchmark both and decide with data rather than intuition.</p>
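<p>"Benchmark both and decide with data" takes only a small harness. A rough sketch (the helper name is mine; a real comparison should also score answer quality, not just latency):</p>

```python
import time

def time_pipeline(run_fn, questions, label):
    """Time run_fn over a list of questions and report median latency in ms."""
    latencies = []
    for q in questions:
        start = time.perf_counter()
        run_fn(q)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    median = latencies[len(latencies) // 2]
    print(f"{label}: median {median:.1f} ms over {len(latencies)} queries")
    return median

# Usage sketch: compare the same pipeline with and without the re-ranker.
#   with_rr    = RAGPipeline(vector_store, use_reranker=True)
#   without_rr = RAGPipeline(vector_store, use_reranker=False)
#   time_pipeline(lambda q: with_rr.run(q), test_questions, "with re-ranker")
#   time_pipeline(lambda q: without_rr.run(q), test_questions, "without re-ranker")
```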

<p>The most underrated practice here is writing test cases before you optimize anything. It’s easy to spend a day tuning chunk sizes or trying different embedding models and convince yourself things are getting better based on a handful of manual tests. RAGAS scores on a fixed test set give you a real signal. Without them, you’re just guessing.</p>

<p>I’m still not 100% sure semantic chunking is worth the overhead for most applications. It produces better-quality chunks in theory, and in isolated tests I’ve seen it improve context precision by a few points. But it’s significantly slower at indexing time and adds a dependency on another embedding pass. For now, I default to recursive character splitting and only reach for semantic chunking when I have evidence that chunk quality is the bottleneck.</p>
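<p>For reference, the recursive strategy is conceptually simple. Here is a toy version in plain Python; LangChain's <code class="language-plaintext highlighter-rouge">RecursiveCharacterTextSplitter</code> adds chunk overlap and small-piece merging that this sketch deliberately omits:</p>

```python
def recursive_split(text, max_chars=500, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing to finer separators
    only for pieces that are still over the size limit."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_chars:
            if piece.strip():
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_chars, rest))
    return chunks

paras = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 600
print([len(c) for c in recursive_split(paras, max_chars=500)])  # [16, 17, 500, 100]
```

<p>The point of the separator hierarchy is that paragraph boundaries are respected whenever possible, and the ugly character-level cut only happens when nothing else fits.</p>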

<hr />

<h2 id="where-this-goes-next">Where This Goes Next</h2>

<p>The honest answer is that RAG is still an unsolved problem. The pipeline in this post works well, but I’ve been watching a few developments closely.</p>

<p>Query rewriting is probably the next thing I add to this stack — having the LLM rephrase the user’s raw question before retrieval catches a lot of cases where users ask things in a way that’s natural for a human but terrible for semantic search. Hybrid search (combining BM25 keyword matching with vector similarity) is also on my list, especially for domains where exact terminology matters.</p>
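<p>Query rewriting slots in as one extra step before retrieval. A sketch of the shape (the wiring and dummy functions here are mine; in practice <code class="language-plaintext highlighter-rouge">rewrite_fn</code> would be a small LLM chain prompted to turn the raw question into a concise search query):</p>

```python
def retrieve_with_rewrite(question, rewrite_fn, retrieve_fn):
    """Rewrite the raw user question into a search-friendly query, then retrieve.

    rewrite_fn:  callable(str) -> str, e.g. an LLM chain prompted to
                 rephrase conversational questions as search queries
    retrieve_fn: callable(str) -> list of documents
    """
    query = rewrite_fn(question).strip() or question  # fall back to the original
    return query, retrieve_fn(query)

# With dummy callables, just to show the flow:
query, docs = retrieve_with_rewrite(
    "hey so how does that attention thing know word order??",
    rewrite_fn=lambda q: "role of positional encoding in transformer attention",
    retrieve_fn=lambda q: [f"doc matching: {q}"],
)
print(query)
```

<p>Keeping the rewriter behind a plain callable also means you can A/B it against the identity function on your RAGAS test set before committing to the extra LLM call.</p>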

<p>What I keep coming back to, though, is something less exciting: better document preparation. The more time I’ve spent on these systems, the more I believe that the quality of your source documents matters more than almost any retrieval optimization you can apply downstream. How they’re structured, how consistently they’re formatted, how much noise is stripped out before they hit the chunker: all of that compounds through every stage of the pipeline. Garbage in, garbage out, regardless of how clever your pipeline is.</p>
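<p>Concretely, "stripping noise before the chunker" can start very small. The heuristics below are examples you'd tune per corpus, not a universal cleaner:</p>

```python
import re

# Example boilerplate patterns; extend per corpus.
BOILERPLATE = re.compile(
    r"^(copyright|all rights reserved|cookie policy|subscribe to our)", re.IGNORECASE
)

def clean_document(text: str) -> str:
    """Drop obvious boilerplate lines and collapse runaway whitespace
    before the text ever reaches the chunker."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or BOILERPLATE.match(line):
            continue
        kept.append(line)
    # Collapse internal whitespace runs left behind by HTML extraction.
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in kept)

raw = ("Attention   is all you need.\n\nCookie Policy\n"
       "Copyright 2026 Example Corp\nSelf-attention relates tokens.")
print(clean_document(raw))
```

<p>Every line of nav residue or legal footer you remove here is a line that can't pollute a chunk, an embedding, or a context window downstream.</p>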

<p>If you build this and hit walls, the most useful thing you can do is instrument your pipeline end to end and look at where quality is leaking. Usually it’s obvious once you’re looking at actual data rather than gut-checking demo queries.</p>

<hr />]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="engineering" /><category term="rag" /><category term="langchain" /><category term="chromadb" /><category term="openai" /><category term="vector-search" /><category term="llm" /><summary type="html"><![CDATA[A practitioner's guide to building production-grade RAG pipelines — covering chunking strategy, retrieval tuning, re-ranking, and evaluation. Lessons from real failures, not sanitized tutorials.]]></summary></entry><entry><title type="html">Zero Vibe Coding: How I Engineered a Full-Stack Finance App with Claude Code’s Multi-Agent Pipeline</title><link href="https://shanmuga-sundaram-n.github.io/blog/2026/03/20/zero-vibe-coding-finance-tracker/" rel="alternate" type="text/html" title="Zero Vibe Coding: How I Engineered a Full-Stack Finance App with Claude Code’s Multi-Agent Pipeline" /><published>2026-03-20T12:00:00+00:00</published><updated>2026-03-20T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2026/03/20/zero-vibe-coding-finance-tracker</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2026/03/20/zero-vibe-coding-finance-tracker/"><![CDATA[<p>When we talk about AI-assisted coding, the conversation usually revolves around “vibe coding” — throwing vague prompts at an LLM, connecting a UI layout tool, and crossing our fingers that the underlying architecture holds together.</p>

<p>I wanted to see what happens when you treat AI not as a magic wand, but as an embedded enterprise engineering team. To test this, I built a production-ready Personal Finance Tracker entirely from the terminal. I strictly avoided UI Model Context Protocol (MCP) integrations like Figma. Instead, I forced the <strong>Claude Code CLI</strong> into a rigid, Spec-Driven, Multi-Agent Development Environment.</p>

<p>You can check out the final codebase here: <a href="https://github.com/shanmuga-sundaram-n/personal-finance-tracker">Personal Finance Tracker on GitHub</a></p>

<h2 id="the-final-product">The Final Product</h2>

<p>Before diving into <em>how</em> the AI built this, here’s a look at the final application running locally — clean, modern interface, responsive sidebar navigation, and distinct feature pages (Dashboard, Transactions, Budgets) generated without ever touching a visual design tool:</p>

<p><img src="/assets/images/projects/personal-finance-tracker.webp" alt="Personal Finance Tracker UI Demo" /></p>

<hr />

<h2 id="1-unleashing-claude-code-cli-features">1. Unleashing Claude Code CLI Features</h2>

<p>The Claude Code CLI provides native capabilities that go well beyond a standard web chat interface. By leveraging the local filesystem, I turned the CLI into an automated powerhouse:</p>

<ul>
  <li>
    <p><strong>Custom Prompt Hooks (<code class="language-plaintext highlighter-rouge">.claude/hooks/</code>):</strong> Building a modern React app requires strict linting, and an AI can easily violate formatting rules. I created a <code class="language-plaintext highlighter-rouge">.claude/hooks/post-edit-lint.sh</code> script. Every time Claude Code edited a <code class="language-plaintext highlighter-rouge">.tsx</code> or <code class="language-plaintext highlighter-rouge">.ts</code> file, the CLI automatically fired off <code class="language-plaintext highlighter-rouge">$ESLINT_BIN --fix</code>. This completely eliminated broken CI builds due to styling errors.</p>
  </li>
  <li>
    <p><strong>Persistent CLI Memory (<code class="language-plaintext highlighter-rouge">.claude/agent-memory/</code>):</strong> I stored detailed markdown files like <code class="language-plaintext highlighter-rouge">DOMAIN-OWNERSHIP.md</code> and <code class="language-plaintext highlighter-rouge">hexagonal-architecture.md</code>. Whenever I prompted Claude Code, it automatically fetched this local context to write code compliant with my architectural decisions.</p>
  </li>
  <li>
    <p><strong>Context Compaction (<code class="language-plaintext highlighter-rouge">/compact</code>):</strong> Building a complex backend generates massive CLI context. Once Claude finished planning a feature, I simply typed <code class="language-plaintext highlighter-rouge">/compact</code>. This reset the active context window while keeping core memories intact — saving tokens and keeping the CLI fast.</p>
  </li>
  <li>
    <p><strong>Intelligent Commits (<code class="language-plaintext highlighter-rouge">/commit</code>):</strong> After <code class="language-plaintext highlighter-rouge">./gradlew test</code> passed, rather than typing out git messages, I ran <code class="language-plaintext highlighter-rouge">/commit</code>. Claude Code read the actual <code class="language-plaintext highlighter-rouge">git diff</code> and generated a perfect, semantic commit message right in the terminal.</p>
  </li>
</ul>

<h3 id="simulated-terminal-run-claude-code-in-action">Simulated Terminal Run: Claude Code in Action</h3>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>claude
╭────────────────────────────────────────────────────────────────────────╮
│ Claude Code <span class="o">(</span>v0.x.x<span class="o">)</span> - Project: personal-finance-tracker               │
╰────────────────────────────────────────────────────────────────────────╯
▶ I want to add double-entry transfers between accounts. Follow the
  <span class="sb">`</span>Transaction.md<span class="sb">`</span> PRD and use the solution-planner workflow.

<span class="o">(</span>Claude is thinking...<span class="o">)</span>
<span class="o">[</span>Reading files: docs/product/Transaction.md,
 .claude/agent-memory/tech-lead/architecture-decisions.md]

I will act as the solution-planner. Let<span class="s1">'s start the pipeline.
1. Running Domain Analysis against FR-1 (Double-Entry Transfers).
   - The transaction must atomically generate a TRANSFER_OUT and TRANSFER_IN.
2. Running Tech-Lead Architecture Review...
   - I will utilize a Spring `@Transactional` block.

Would you like me to spawn `full-stack-dev` to implement this now? (Y/n) &gt; y
</span></code></pre></div></div>

<hr />

<h2 id="2-coordinated-execution-the-7-agent-pipeline">2. Coordinated Execution: The 7-Agent Pipeline</h2>

<p>Instead of hooking up an external Figma MCP and letting a single AI guess at the UI and database simultaneously, I enforced spec-driven development using a pipeline of 7 distinct Markdown-based personas inside <code class="language-plaintext highlighter-rouge">.claude/agents/</code>.</p>

<p>The most impressive part wasn’t the AI’s coding ability — it was <strong>the strict handoff mechanism</strong> between agents. They never “spoke” in a chaotic group chat. They communicated by writing strict Markdown files (Briefs) to each other, orchestrated entirely by the <code class="language-plaintext highlighter-rouge">solution-planner</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Request
    │
    ▼
solution-planner (Orchestrator)
    │
    ├── 1. personal-finance-analyst  → Domain Brief
    ├── 2. tech-lead                 → Architecture Brief
    ├── 3. full-stack-dev            → Implementation
    ├── 4. ux-ui-designer            → Accessibility Audit
    ├── 5. qa-automation-tester      → Test Gatekeeper
    └── 6. devops-engineer           → CI/CD &amp; Containers
                                            │
                                            ▼
                                    Feature Complete
</code></pre></div></div>

<p>Here’s exactly how a feature flowed through this virtual team:</p>

<p><strong>Step 1 — <code class="language-plaintext highlighter-rouge">solution-planner</code> (The Orchestrator):</strong> The central brain. This agent was explicitly instructed to <em>never</em> write software code. Its only job was to take my feature request, kickstart the pipeline, and gather outputs from the analysts.</p>

<p><strong>Step 2 — <code class="language-plaintext highlighter-rouge">personal-finance-analyst</code> (Domain Brief):</strong> Before pushing code, the planner invoked the analyst. This agent evaluated feature requests against strict PRDs, caught financial edge cases like rounding errors, and generated a <strong>Domain Brief</strong>.</p>

<p><strong>Step 3 — <code class="language-plaintext highlighter-rouge">tech-lead</code> (Architecture Brief):</strong> The architect analyzed business rules and translated them into Hexagonal Architecture and Liquibase schema designs. It generated an <strong>Architecture Brief</strong> outlining exact Java classes, <code class="language-plaintext highlighter-rouge">@Transactional</code> boundaries, and interfaces — catching cross-context import violations before they happened.</p>

<p><strong>Step 4 — <code class="language-plaintext highlighter-rouge">full-stack-dev</code> (Execution):</strong> The developer received the merged Feature Implementation Brief and focused on end-to-end data flow from the PostgreSQL backend to the React Vite frontend — avoiding N+1 queries in JPA and using proper React state management.</p>

<p><strong>Step 5 — <code class="language-plaintext highlighter-rouge">ux-ui-designer</code> (Anti-Vibe Coding Audit):</strong> By explicitly avoiding UI MCPs, I forced Claude to rely on strict UX heuristics. This agent reviewed React code strictly against WCAG 2.1 AA standards and Fitts’s Law spacing (touch targets ≥ 44px).</p>

<p><strong>Step 6 — <code class="language-plaintext highlighter-rouge">qa-automation-tester</code> (The Gatekeeper):</strong> This agent wrote integration tests via JUnit and MockMvc, utilized <code class="language-plaintext highlighter-rouge">axe-core</code> for accessibility enforcement, and refused to complete until <code class="language-plaintext highlighter-rouge">./gradlew test</code> passed in the terminal.</p>

<p><strong>Step 7 — <code class="language-plaintext highlighter-rouge">devops-engineer</code> (Infrastructure):</strong> Configured Testcontainers (<code class="language-plaintext highlighter-rouge">postgres:15.2</code>), wrote multi-stage Dockerfiles with Alpine JRE layers, and managed GitHub Actions caching to slash build times.</p>

<hr />

<h2 id="3-engineering-strict-guardrails">3. Engineering Strict Guardrails</h2>

<p>To ensure the CLI didn’t drift as the conversation grew, I enforced architectural purity with <strong>hard automated constraints</strong> rather than hoping Claude would follow instructions.</p>

<h3 id="guardrail-a-pure-hexagonal-boundaries-via-archunit">Guardrail A: Pure Hexagonal Boundaries via ArchUnit</h3>

<p>The domain layer must be completely unaware of Spring or databases. If Claude generated a <code class="language-plaintext highlighter-rouge">Lombok</code> or <code class="language-plaintext highlighter-rouge">@Autowired</code> import inside the domain logic, the build failed immediately in the terminal:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ArchitectureTest.java</span>
<span class="nd">@ArchTest</span>
<span class="kd">static</span> <span class="kd">final</span> <span class="nc">ArchRule</span> <span class="n">domain_must_not_import_spring</span> <span class="o">=</span>
        <span class="n">noClasses</span><span class="o">()</span>
                <span class="o">.</span><span class="na">that</span><span class="o">().</span><span class="na">resideInAPackage</span><span class="o">(</span><span class="s">"..domain.."</span><span class="o">)</span>
                <span class="o">.</span><span class="na">should</span><span class="o">().</span><span class="na">dependOnClassesThat</span><span class="o">()</span>
                <span class="o">.</span><span class="na">resideInAnyPackage</span><span class="o">(</span><span class="s">"org.springframework.."</span><span class="o">,</span> <span class="s">"jakarta.persistence.."</span><span class="o">,</span> <span class="s">"lombok.."</span><span class="o">)</span>
                <span class="o">.</span><span class="na">as</span><span class="o">(</span><span class="s">"Domain classes must not depend on Spring, JPA, or Lombok"</span><span class="o">);</span>
</code></pre></div></div>

<h3 id="guardrail-b-clean-wiring-without-framework-pollution">Guardrail B: Clean Wiring Without Framework Pollution</h3>

<p>Claude was forbidden from using <code class="language-plaintext highlighter-rouge">@Service</code> or <code class="language-plaintext highlighter-rouge">@Component</code> on core business logic. Instead, it learned to construct a dedicated configuration class to wire dependencies manually:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// account/config/AccountConfig.java</span>
<span class="nd">@Configuration</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">AccountConfig</span> <span class="o">{</span>
    <span class="nd">@Bean</span>
    <span class="kd">public</span> <span class="nc">AccountCommandService</span> <span class="nf">accountCommandService</span><span class="o">(</span>
            <span class="nc">AccountPersistencePort</span> <span class="n">accountPersistencePort</span><span class="o">,</span>
            <span class="nc">AccountEventPublisherPort</span> <span class="n">accountEventPublisherPort</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">AccountCommandService</span><span class="o">(</span><span class="n">accountPersistencePort</span><span class="o">,</span> <span class="n">accountEventPublisherPort</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h3 id="guardrail-c-persistent-database-models-vs-domain-models">Guardrail C: Persistent Database Models vs Domain Models</h3>

<p>To avoid attaching <code class="language-plaintext highlighter-rouge">@Entity</code> to a domain object (which allows bad states to bypass business invariants), Claude maintained two separate class structures:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// account/adapter/outbound/persistence/AccountJpaEntity.java</span>
<span class="nd">@Entity</span>
<span class="nd">@Table</span><span class="o">(</span><span class="n">schema</span> <span class="o">=</span> <span class="s">"finance_tracker"</span><span class="o">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"accounts"</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">AccountJpaEntity</span> <span class="kd">extends</span> <span class="nc">AuditableJpaEntity</span> <span class="o">{</span>

    <span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"current_balance"</span><span class="o">,</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">false</span><span class="o">,</span> <span class="n">precision</span> <span class="o">=</span> <span class="mi">19</span><span class="o">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="mi">4</span><span class="o">)</span>
    <span class="kd">private</span> <span class="nc">BigDecimal</span> <span class="n">currentBalance</span><span class="o">;</span> <span class="c1">// Rule: Must be exactly NUMERIC(19,4)</span>

    <span class="nd">@Version</span>
    <span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"version"</span><span class="o">,</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">false</span><span class="o">)</span>
    <span class="kd">private</span> <span class="nc">Long</span> <span class="n">version</span><span class="o">;</span> <span class="c1">// Rule: Required for optimistic locking</span>
<span class="o">}</span>
</code></pre></div></div>

<h3 id="guardrail-d-anti-corruption-layers-bounded-context-isolation">Guardrail D: Anti-Corruption Layers (Bounded Context Isolation)</h3>

<p>If the <code class="language-plaintext highlighter-rouge">budget</code> module needed <code class="language-plaintext highlighter-rouge">transaction</code> data, Claude had to build an explicit cross-context outbound port adapter — it could not directly import across bounded contexts:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// budget/domain/port/outbound/TransactionSummaryPort.java</span>
<span class="kd">public</span> <span class="kd">interface</span> <span class="nc">TransactionSummaryPort</span> <span class="o">{</span>
    <span class="nc">Money</span> <span class="nf">sumExpensesForCategory</span><span class="o">(</span><span class="nc">CategoryId</span> <span class="n">categoryId</span><span class="o">,</span> <span class="nc">LocalDate</span> <span class="n">start</span><span class="o">,</span> <span class="nc">LocalDate</span> <span class="n">end</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
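<p>A minimal sketch of the adapter side of that boundary (the <code class="language-plaintext highlighter-rouge">TransactionQueryApi</code> and <code class="language-plaintext highlighter-rouge">TransactionSummaryAdapter</code> names, and the value-object shapes, are assumptions for illustration, not taken from the repo). The adapter implements the budget context’s port and translates the foreign result into the budget context’s own <code class="language-plaintext highlighter-rouge">Money</code> type, so no transaction classes leak across the boundary:</p>

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Value objects owned by the budget context (names assumed for illustration)
record CategoryId(long value) {}
record Money(BigDecimal amount) {}

// The outbound port from the article
interface TransactionSummaryPort {
    Money sumExpensesForCategory(CategoryId categoryId, LocalDate start, LocalDate end);
}

// Hypothetical public query API exposed by the transaction context
interface TransactionQueryApi {
    BigDecimal totalExpenses(long categoryId, LocalDate start, LocalDate end);
}

// The anti-corruption layer: implements the budget context's port by delegating
// to the foreign API and translating its result into the budget context's model,
// so budget domain code never imports transaction classes directly.
class TransactionSummaryAdapter implements TransactionSummaryPort {
    private final TransactionQueryApi transactionQueryApi;

    TransactionSummaryAdapter(TransactionQueryApi transactionQueryApi) {
        this.transactionQueryApi = transactionQueryApi;
    }

    @Override
    public Money sumExpensesForCategory(CategoryId categoryId, LocalDate start, LocalDate end) {
        BigDecimal raw = transactionQueryApi.totalExpenses(categoryId.value(), start, end);
        return new Money(raw); // translation step: foreign type -> owned value object
    }
}
```

<p>Swapping the transaction context’s internals now only ever touches this adapter, never the budget domain.</p>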

<hr />

<h2 id="4-feature-spotlight-dynamic-budget-aggregation">4. Feature Spotlight: Dynamic Budget Aggregation</h2>

<p><strong>The Challenge:</strong> Storing a physical <code class="language-plaintext highlighter-rouge">spentAmount</code> column on a Budget table is a classic beginner mistake — if you delete a transaction from 3 weeks ago, the budget’s <code class="language-plaintext highlighter-rouge">spentAmount</code> goes out of sync.</p>

<p><strong>The Solution:</strong> Because Claude Code was forced to follow the PRD, it leveraged Guardrail D to build a <strong>Runtime Synthesis Engine</strong> — dynamically computing usage on the fly rather than storing stale state:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Inside BudgetQueryService (Domain logic)</span>
<span class="nc">Money</span> <span class="n">spentThisPeriod</span> <span class="o">=</span> <span class="n">transactionSummaryPort</span><span class="o">.</span><span class="na">sumExpensesForCategory</span><span class="o">(</span>
    <span class="n">budget</span><span class="o">.</span><span class="na">getCategoryId</span><span class="o">(),</span>
    <span class="n">budget</span><span class="o">.</span><span class="na">getStartDate</span><span class="o">(),</span>
    <span class="n">budget</span><span class="o">.</span><span class="na">getEndDate</span><span class="o">()</span>
<span class="o">);</span>
<span class="n">budget</span><span class="o">.</span><span class="na">evaluateStatus</span><span class="o">(</span><span class="n">spentThisPeriod</span><span class="o">);</span> <span class="c1">// Flags true if spent &gt;= 85% alert threshold</span>
</code></pre></div></div>
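<p>The post doesn’t show <code class="language-plaintext highlighter-rouge">evaluateStatus</code> itself, so here is one plausible shape for the threshold check. Only the 85% alert figure comes from the comment above; the <code class="language-plaintext highlighter-rouge">Money</code> and <code class="language-plaintext highlighter-rouge">Budget</code> shapes are assumptions:</p>

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Assumed shapes for illustration; only the 85% threshold comes from the article.
record Money(BigDecimal amount) {}

class Budget {
    private static final BigDecimal ALERT_THRESHOLD = new BigDecimal("0.85");

    private final Money limit;
    private boolean alertTriggered;

    Budget(Money limit) { this.limit = limit; }

    void evaluateStatus(Money spentThisPeriod) {
        // usage ratio = spent / limit, computed at a fixed scale so divide() cannot throw
        BigDecimal ratio = spentThisPeriod.amount()
                .divide(limit.amount(), 4, RoundingMode.HALF_UP);
        alertTriggered = ratio.compareTo(ALERT_THRESHOLD) >= 0;
    }

    boolean isAlertTriggered() { return alertTriggered; }
}
```

<p>Because the spent amount is synthesised at query time, the alert flag is always derived from live transaction data rather than a stored counter that can drift.</p>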

<h3 id="qa-execution">QA Execution</h3>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>▶ Automatically running QA sequence <span class="k">for </span>new budget logic...

<span class="nv">$ </span>./gradlew <span class="nb">test</span> <span class="nt">--tests</span> <span class="s2">"com.shan.cyber.tech.financetracker.budget.*"</span>

<span class="o">&gt;</span> Task :application:test
com.shan.cyber.tech.financetracker.budget.domain.BudgetQueryServiceTest
  ✔ dynamic_synthesis_calculates_correct_usage_percent<span class="o">()</span>
  ✔ threshold_alert_triggered_when_over_85_percent<span class="o">()</span>

BUILD SUCCESSFUL <span class="k">in </span>3s

<span class="o">[</span>Claude]: All tests passed. ArchUnit hex boundaries are clean.
          Should I <span class="sb">`</span>/commit<span class="sb">`</span> these changes? <span class="o">(</span>Y/n<span class="o">)</span> <span class="o">&gt;</span> y
</code></pre></div></div>

<hr />

<h2 id="final-thoughts-from-chatbot-to-engineering-partner">Final Thoughts: From Chatbot to Engineering Partner</h2>

<p>Building a production-ready application with Claude Code CLI reveals a massive shift in how AI should be utilized for software engineering. By rejecting “vibe coding” and external MCP crutches, and instead relying on Claude Code’s native hooks, persistent memory, and <code class="language-plaintext highlighter-rouge">/compact</code> commands, I was able to enforce a rigid, 7-agent spec-driven environment.</p>

<p>This process didn’t just build an app for me — it taught me how to properly design scalable databases, orchestrate Hexagonal solutions, and respect WCAG UX heuristics. By forcing Claude Code to honour pipeline personas and automated tests, you don’t just get an AI assistant that autocompletes code — you get a senior engineering team embedded directly in your terminal.</p>

<p>Check out the full source: <a href="https://github.com/shanmuga-sundaram-n/personal-finance-tracker">github.com/shanmuga-sundaram-n/personal-finance-tracker</a></p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="engineering" /><category term="claude-code" /><category term="claude-code" /><category term="multi-agent" /><category term="hexagonal-architecture" /><category term="spring-boot" /><category term="react" /><summary type="html"><![CDATA[How I built a production-ready Personal Finance Tracker entirely from the terminal using Claude Code CLI with a strict 7-agent spec-driven pipeline — no UI MCP tools, no vibe coding.]]></summary></entry><entry><title type="html">LLM Tokenizers: The Hidden Engine Behind AI Language Models</title><link href="https://shanmuga-sundaram-n.github.io/blog/2025/03/08/llm-tokenizers-hidden-engine/" rel="alternate" type="text/html" title="LLM Tokenizers: The Hidden Engine Behind AI Language Models" /><published>2025-03-08T12:00:00+00:00</published><updated>2025-03-08T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2025/03/08/llm-tokenizers-hidden-engine</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2025/03/08/llm-tokenizers-hidden-engine/"><![CDATA[<p><img src="/assets/images/blog/llm-tokenizers.jpg" alt="LLM Tokenizers" /></p>

<p>Every interaction with a large language model begins long before any neural network computation. It starts with <strong>tokenization</strong> — the process of converting human language into the numerical representations that models actually process. Understanding tokenizers unlocks a deeper understanding of why LLMs behave the way they do.</p>

<hr />

<h2 id="what-is-tokenization">What is Tokenization?</h2>

<p>Tokenization transforms text into discrete units called <strong>tokens</strong> — numerical representations the model can process. Rather than understanding words directly, models convert input into token IDs corresponding to entries in a vocabulary.</p>

<p>“Hello, world!” might become <code class="language-plaintext highlighter-rouge">[15496, 11, 995, 0]</code> — four numbers that the model processes as a sequence. The model never sees the original text; it only sees tokens.</p>

<hr />

<h2 id="why-tokenization-matters">Why Tokenization Matters</h2>

<p>The choice of tokenizer has profound downstream effects:</p>

<ol>
  <li><strong>Context window capacity</strong> — the maximum text a model can process is measured in tokens, not words or characters</li>
  <li><strong>API costs</strong> — most LLM APIs charge per token; tokenization efficiency directly affects cost</li>
  <li><strong>Cross-lingual performance</strong> — languages with inefficient tokenization get fewer “thoughts” per context window</li>
  <li><strong>Specialised content</strong> — code, mathematical notation, and domain-specific text tokenize very differently than prose</li>
</ol>

<hr />

<h2 id="three-main-tokenization-approaches">Three Main Tokenization Approaches</h2>

<h3 id="1-word-based-tokenization">1. Word-Based Tokenization</h3>

<p>The system scans text character-by-character, treating delimiters (spaces, punctuation) as token boundaries. “Hello world!” becomes individual word units.</p>
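<p>A crude sketch of this delimiter-based approach (illustrative only; real word tokenizers also handle contractions, hyphenation, and Unicode segmentation):</p>

```java
import java.util.Arrays;
import java.util.List;

// Naive word-level tokenizer: whitespace separates tokens, and punctuation
// is peeled off into its own token via zero-width lookarounds.
class WordTokenizer {
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+|(?<=\\p{Punct})|(?=\\p{Punct})"));
    }
}
```

<p>With this rule, <code class="language-plaintext highlighter-rouge">"Hello world!"</code> yields three tokens: the two words plus the exclamation mark as its own unit.</p>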

<p><strong>Advantages:</strong></p>
<ul>
  <li>Intuitive and semantically meaningful</li>
  <li>Simple to implement</li>
  <li>Efficient for common vocabulary</li>
</ul>

<p><strong>Limitations:</strong></p>
<ul>
  <li>Massive vocabularies for morphologically rich languages</li>
  <li>Out-of-vocabulary (OOV) problems for rare or new words</li>
  <li>Compound word challenges (German, Finnish, Turkish suffer most)</li>
</ul>

<hr />

<h3 id="2-character-based-tokenization">2. Character-Based Tokenization</h3>

<p>Each character becomes a token. “Hello world” breaks into 11 tokens: <code class="language-plaintext highlighter-rouge">H-e-l-l-o-[space]-w-o-r-l-d</code>.</p>

<p><strong>Advantages:</strong></p>
<ul>
  <li>Tiny vocabulary (100–200 tokens covers most languages)</li>
  <li>Eliminates unknown word problems entirely</li>
</ul>

<p><strong>Trade-offs:</strong></p>
<ul>
  <li>Sequences 5–10× longer than word-based</li>
  <li>The model must reconstruct semantic meaning from individual characters</li>
  <li>Higher computational cost for the same amount of text</li>
</ul>

<hr />

<h3 id="3-subword-tokenization-the-modern-standard">3. Subword Tokenization (The Modern Standard)</h3>

<p>Modern LLMs — GPT, Claude, Llama, Gemini — all use subword tokenization. Words are split into meaningful units: “Unlikeliest” becomes <code class="language-plaintext highlighter-rouge">["Un", "likeli", "est"]</code>.</p>

<p>The dominant algorithm is <strong>Byte-Pair Encoding (BPE)</strong>:</p>

<ol>
  <li>Start with a vocabulary of individual characters</li>
  <li>Count the frequency of all adjacent token pairs</li>
  <li>Merge the most common pair into a new token</li>
  <li>Repeat until reaching the target vocabulary size (typically 10,000–100,000 tokens)</li>
</ol>

<p>The result: common words become single tokens, rare words get split into recognisable subword units, and the vocabulary is compact enough to be manageable.</p>
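<p>The merge loop is simple enough to sketch end-to-end. This toy trainer (illustrative only; production BPE operates on bytes, records merge ranks for reuse at inference time, and runs over enormous corpora) applies steps 2–4 to a list of single-character tokens:</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy BPE trainer: repeatedly merges the most frequent adjacent pair of tokens.
class BpeDemo {
    static List<String> merge(List<String> tokens, int merges) {
        List<String> current = new ArrayList<>(tokens);
        for (int i = 0; i < merges; i++) {
            // Step 2: count every adjacent token pair
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (int j = 0; j + 1 < current.size(); j++) {
                counts.merge(current.get(j) + "\u0000" + current.get(j + 1), 1, Integer::sum);
            }
            if (counts.isEmpty()) break;
            // Step 3: pick the most frequent pair (ties go to the earliest seen)
            String best = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
            int sep = best.indexOf('\u0000');
            String left = best.substring(0, sep), right = best.substring(sep + 1);
            // Merge every non-overlapping occurrence into a single new token
            List<String> next = new ArrayList<>();
            for (int j = 0; j < current.size(); j++) {
                if (j + 1 < current.size()
                        && current.get(j).equals(left) && current.get(j + 1).equals(right)) {
                    next.add(left + right); // fuse the pair into one token
                    j++;                    // skip the consumed right-hand token
                } else {
                    next.add(current.get(j));
                }
            }
            current = next; // Step 4: repeat with the updated token stream
        }
        return current;
    }
}
```

<p>Running one merge over the character sequence of <code class="language-plaintext highlighter-rouge">aaabdaaabac</code> fuses every non-overlapping <code class="language-plaintext highlighter-rouge">aa</code> pair into a single token — the greedy step BPE repeats until the vocabulary budget is reached.</p>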

<hr />

<h2 id="impact-on-model-performance">Impact on Model Performance</h2>

<h3 id="economic-implications">Economic Implications</h3>

<p>Tokenization efficiency has real cost consequences:</p>

<ul>
  <li><strong>Training efficiency</strong> improves with optimised tokenization</li>
  <li><strong>Fine-tuning costs</strong> scale directly with tokens per training example</li>
  <li><strong>Inference latency</strong> varies based on how a given input tokenizes</li>
</ul>

<p>A well-structured prompt in English might use significantly fewer tokens than an equivalent prompt in another language — directly affecting both cost and available reasoning capacity.</p>

<h3 id="cross-lingual-disparities">Cross-Lingual Disparities</h3>

<p>This is one of the most significant fairness issues in modern LLMs. Languages with less efficient tokenization:</p>

<ul>
  <li>Get fewer reasoning steps within a fixed context window</li>
  <li>Cost more per equivalent amount of text</li>
  <li>May have lower model quality due to less training data representation</li>
</ul>

<p>English typically tokenizes very efficiently in models trained primarily on English text. Languages like Thai, Arabic, or many African languages often require significantly more tokens for equivalent content.</p>

<h3 id="technical-effects-on-model-reasoning">Technical Effects on Model Reasoning</h3>

<ul>
  <li><strong>Attention dilution</strong> — when semantic units are spread across many tokens, the model’s attention mechanism must work harder to connect related concepts</li>
  <li><strong>Boundary artifacts</strong> — token boundaries that misalign with semantic units can affect how the model processes meaning</li>
  <li><strong>Embedding geometry</strong> — the choice of tokens shapes the internal representation of concepts in the model’s embedding space</li>
</ul>

<hr />

<h2 id="practical-optimisation-strategies">Practical Optimisation Strategies</h2>

<p><strong>For prompt engineering:</strong></p>
<ul>
  <li>Structure prompts with awareness of how your target language tokenizes</li>
  <li>Use tokenizer visualisation tools (like OpenAI’s tokenizer playground) to inspect token boundaries</li>
  <li>Prefer structured formats (JSON, markdown) that often tokenize efficiently</li>
</ul>

<p><strong>For data preparation:</strong></p>
<ul>
  <li>Select formats based on tokenization performance for your use case</li>
  <li>Apply semantic compression — preserve meaning with fewer tokens where possible</li>
  <li>Be aware that code tokenizes very differently than prose</li>
</ul>

<hr />

<h2 id="the-future-of-tokenization">The Future of Tokenization</h2>

<p>Emerging approaches are addressing current limitations:</p>

<ul>
  <li><strong>Character-level fallbacks</strong> for rare words and multilingual content</li>
  <li><strong>Learned tokenizers</strong> that adapt during pre-training</li>
  <li><strong>Semantic tokenization</strong> that incorporates meaning-based rather than purely frequency-based boundaries</li>
  <li><strong>Byte-level models</strong> that operate directly on raw bytes, eliminating the tokenization step entirely</li>
</ul>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Tokenization forms the crucial bridge between human language and machine understanding. It shapes everything from API costs to model reasoning depth to cross-lingual fairness — yet it’s almost entirely invisible to most users.</p>

<p>Understanding how your text becomes tokens isn’t just an academic exercise. It directly informs better prompt design, more accurate cost estimation, and a clearer mental model of why LLMs behave the way they do.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="ml" /><category term="llm" /><category term="tokenization" /><category term="nlp" /><category term="ai" /><category term="machine-learning" /><category term="transformers" /><summary type="html"><![CDATA[Tokenizers are the invisible foundation of every large language model — shaping context limits, API costs, cross-lingual fairness, and model reasoning. Here's how they work and why they matter.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llm-tokenizers.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llm-tokenizers.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Mastering Backpressure in Reactive Programming: A Deep Dive</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/12/28/mastering-backpressure-reactive-programming/" rel="alternate" type="text/html" title="Mastering Backpressure in Reactive Programming: A Deep Dive" /><published>2024-12-28T12:00:00+00:00</published><updated>2024-12-28T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/12/28/mastering-backpressure-reactive-programming</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/12/28/mastering-backpressure-reactive-programming/"><![CDATA[<p><img src="/assets/images/blog/backpressure-reactive.jpg" alt="Mastering Backpressure in Reactive Programming" /></p>

<p>Reactive programming allows developers to build highly responsive and scalable systems that handle asynchronous data flows. But a fundamental challenge emerges when <strong>producers emit data faster than consumers can process it</strong>.</p>

<p>Without a solution, this imbalance leads to memory overload, CPU exhaustion, and cascading failures. The solution is <strong>backpressure</strong>.</p>

<hr />

<h2 id="what-is-backpressure">What is Backpressure?</h2>

<p>Backpressure is the mechanism that allows a consumer to communicate to its producer that it cannot keep up with the current data rate. Rather than silently dropping messages or crashing under load, a well-designed reactive system uses backpressure to apply flow control.</p>

<p>Think of it like a water pipe: if you push water in faster than it can flow out, something bursts. Backpressure is the pressure relief valve.</p>

<hr />

<h2 id="why-it-matters">Why It Matters</h2>

<p>Without proper backpressure handling:</p>

<ul>
  <li><strong>Memory overload</strong> — unbounded buffers fill up and cause OOM errors</li>
  <li><strong>CPU exhaustion</strong> — the consumer thrashes trying to process an overwhelming queue</li>
  <li><strong>Cascading failures</strong> — one slow consumer can destabilise an entire pipeline</li>
  <li><strong>Silent data loss</strong> — messages get dropped without any indication</li>
</ul>

<hr />

<h2 id="four-core-backpressure-strategies">Four Core Backpressure Strategies</h2>

<h3 id="1-buffering">1. Buffering</h3>

<p>Temporarily store excess data in a bounded buffer. When the buffer fills, apply additional pressure upstream.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Project Reactor</span>
<span class="nc">Flux</span><span class="o">.</span><span class="na">range</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">1000</span><span class="o">)</span>
    <span class="o">.</span><span class="na">onBackpressureBuffer</span><span class="o">(</span><span class="mi">100</span><span class="o">)</span> <span class="c1">// buffer up to 100 items</span>
    <span class="o">.</span><span class="na">subscribe</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span> <span class="n">process</span><span class="o">(</span><span class="n">item</span><span class="o">));</span>
</code></pre></div></div>

<p><strong>Use when:</strong> Data loss is unacceptable and consumers will eventually catch up. Be careful with buffer size — an unbounded buffer is no protection at all.</p>

<hr />

<h3 id="2-dropping">2. Dropping</h3>

<p>Discard items when the consumer falls behind. Simpler than buffering, but only appropriate when losing some messages is acceptable.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RxJava</span>
<span class="nc">Flowable</span><span class="o">.</span><span class="na">interval</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">MILLISECONDS</span><span class="o">)</span>
    <span class="o">.</span><span class="na">onBackpressureDrop</span><span class="o">()</span>
    <span class="o">.</span><span class="na">observeOn</span><span class="o">(</span><span class="nc">Schedulers</span><span class="o">.</span><span class="na">computation</span><span class="o">())</span>
    <span class="o">.</span><span class="na">subscribe</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span> <span class="n">slowProcess</span><span class="o">(</span><span class="n">item</span><span class="o">));</span>
</code></pre></div></div>

<p><strong>Use when:</strong> You’re processing real-time streams (metrics, sensor data) where stale data has no value.</p>

<hr />

<h3 id="3-throttling">3. Throttling</h3>

<p>Control the emission speed from the producer to match consumer capacity. Instead of buffering or dropping, you slow the source down.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Akka Streams</span>
<span class="nc">Source</span><span class="o">.</span><span class="na">repeat</span><span class="o">(</span><span class="s">"event"</span><span class="o">)</span>
    <span class="o">.</span><span class="na">throttle</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="nc">Duration</span><span class="o">.</span><span class="na">ofSeconds</span><span class="o">(</span><span class="mi">1</span><span class="o">))</span> <span class="c1">// max 10 elements per second</span>
    <span class="o">.</span><span class="na">runWith</span><span class="o">(</span><span class="nc">Sink</span><span class="o">.</span><span class="na">foreach</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">::</span><span class="n">println</span><span class="o">),</span> <span class="n">system</span><span class="o">);</span>
</code></pre></div></div>

<p><strong>Use when:</strong> The producer is controllable and you want to maintain a steady, sustainable flow rather than bursts.</p>

<hr />

<h3 id="4-requesting-pull-based">4. Requesting (Pull-based)</h3>

<p>Consumers explicitly request a specific number of items from the producer. This is the most precise form of backpressure — the foundation of the Reactive Streams specification.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Project Reactor — limit rate to 10 items at a time</span>
<span class="nc">Flux</span><span class="o">.</span><span class="na">range</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">1000</span><span class="o">)</span>
    <span class="o">.</span><span class="na">limitRate</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="c1">// consumer pulls 10 items at a time</span>
    <span class="o">.</span><span class="na">subscribe</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span> <span class="n">process</span><span class="o">(</span><span class="n">item</span><span class="o">));</span>
</code></pre></div></div>

<p><strong>Use when:</strong> You want fine-grained control over throughput and can predict consumer processing capacity.</p>

<hr />

<h2 id="framework-comparison">Framework Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Framework</th>
      <th>Default Strategy</th>
      <th>Key API</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Project Reactor</td>
      <td>Error on overflow</td>
      <td><code class="language-plaintext highlighter-rouge">onBackpressureBuffer()</code>, <code class="language-plaintext highlighter-rouge">limitRate()</code></td>
    </tr>
    <tr>
      <td>RxJava</td>
      <td>Configurable</td>
      <td><code class="language-plaintext highlighter-rouge">onBackpressureDrop()</code>, <code class="language-plaintext highlighter-rouge">onBackpressureLatest()</code></td>
    </tr>
    <tr>
      <td>Akka Streams</td>
      <td>Built-in propagation</td>
      <td><code class="language-plaintext highlighter-rouge">throttle()</code>, <code class="language-plaintext highlighter-rouge">buffer()</code></td>
    </tr>
  </tbody>
</table>

<p>All three follow the <strong>Reactive Streams specification</strong>, which standardises backpressure handling through the <code class="language-plaintext highlighter-rouge">Publisher</code>/<code class="language-plaintext highlighter-rouge">Subscriber</code> contract.</p>
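<p>The same contract is visible in the JDK itself: the <code>java.util.concurrent.Flow</code> interfaces mirror the Reactive Streams types one-to-one. Below is a minimal, self-contained sketch of a pull-based subscriber that signals demand one element at a time (class and method names are illustrative, not a framework API):</p>

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class PullSubscriberDemo {

    // Subscribe to a publisher, pulling exactly one element at a time via request(1)
    static List<Integer> consume(int count) throws InterruptedException {
        List<Integer> received = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);

        Flow.Subscriber<Integer> subscriber = new Flow.Subscriber<>() {
            private Flow.Subscription subscription;

            @Override public void onSubscribe(Flow.Subscription s) {
                subscription = s;
                s.request(1);             // initial demand: one element
            }
            @Override public void onNext(Integer item) {
                received.add(item);       // process the element...
                subscription.request(1);  // ...then signal demand for the next
            }
            @Override public void onError(Throwable t) { done.countDown(); }
            @Override public void onComplete() { done.countDown(); }
        };

        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) publisher.submit(i);
        } // close() delivers onComplete after the submitted items

        done.await();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(consume(5)); // [1, 2, 3, 4, 5]
    }
}
```

<p>The publisher never pushes faster than the subscriber asks — that request-driven handshake is what all three frameworks implement under the hood.</p>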

<hr />

<h2 id="best-practices">Best Practices</h2>

<ol>
  <li><strong>Match strategy to data criticality</strong> — financial transactions need buffering; live metrics can afford dropping</li>
  <li><strong>Always bound your buffers</strong> — unbounded buffers defer the problem rather than solving it</li>
  <li><strong>Test under realistic load</strong> — backpressure issues often surface only at production traffic levels</li>
  <li><strong>Understand your consumer’s processing capacity</strong> — profile before tuning</li>
  <li><strong>Monitor queue depths</strong> — expose buffer utilisation as a metric to catch pressure building up</li>
</ol>
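<p>Practices 2 and 5 need no framework at all. Here is a hand-rolled sketch using a JDK bounded queue, in which a full buffer drops the element and records the drop as a metric (the class and field names are illustrative, not a library API):</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedBufferDemo {

    final BlockingQueue<Integer> buffer;         // bounded: capacity is fixed up front
    final AtomicLong dropped = new AtomicLong(); // exposed as a metric (practice 5)

    BoundedBufferDemo(int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: offer() fails fast when the buffer is full,
    // instead of blocking the producer or growing without limit
    void produce(int item) {
        if (!buffer.offer(item)) {
            dropped.incrementAndGet(); // drop and count, so pressure build-up is visible
        }
    }

    public static void main(String[] args) {
        BoundedBufferDemo demo = new BoundedBufferDemo(100);
        for (int i = 0; i < 250; i++) demo.produce(i); // nothing drains: 150 items overflow
        System.out.println("buffered=" + demo.buffer.size()
                + " dropped=" + demo.dropped.get()); // buffered=100 dropped=150
    }
}
```

<p>Watching that <code>dropped</code> counter climb in a dashboard is exactly the early warning an unbounded buffer hides from you.</p>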

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Backpressure isn’t an edge case — it’s a core concern in any reactive system that handles real-world load. Choosing the right strategy (buffer, drop, throttle, or request) depends on your data’s criticality and your consumer’s characteristics. The frameworks provide the tools; understanding the trade-offs is the engineering judgment that makes them work.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="java" /><category term="reactive-programming" /><category term="backpressure" /><category term="project-reactor" /><category term="rxjava" /><category term="akka" /><category term="java" /><category term="streaming" /><summary type="html"><![CDATA[Backpressure is the fundamental mechanism that keeps reactive systems from collapsing under load. This deep dive covers what it is, why it matters, and how to implement it correctly across Project Reactor, RxJava, and Akka Streams.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/backpressure-reactive.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/backpressure-reactive.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Traditional Coding vs AI-Assisted Coding vs Vibe Coding: A New Spectrum</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/12/01/traditional-vs-ai-vs-vibe-coding/" rel="alternate" type="text/html" title="Traditional Coding vs AI-Assisted Coding vs Vibe Coding: A New Spectrum" /><published>2024-12-01T12:00:00+00:00</published><updated>2024-12-01T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/12/01/traditional-vs-ai-vs-vibe-coding</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/12/01/traditional-vs-ai-vs-vibe-coding/"><![CDATA[<p><img 
src="/assets/images/blog/traditional-vs-ai-vs-vibe-coding.jpg" alt="Traditional vs AI-Assisted vs Vibe Coding" /></p>

<p>Something has shifted in how software gets written. As an engineer, I’ve noticed that I no longer work in a single mode throughout the day. Depending on the task, the stakes, and the context, I switch between three fundamentally different headspaces — and recognising which one to be in has become as important as any technical skill.</p>

<hr />

<h2 id="mode-1-traditional-coding">Mode 1: Traditional Coding</h2>

<p>Traditional coding is manual, deliberate, and total-control. You write every line, reason through every decision, and understand the full implications of the code you produce.</p>

<p>There is a deep satisfaction in knowing you understand every single line. When something breaks, you know where to look. When a colleague asks why something works a certain way, you can explain it fully.</p>

<p>This mode is slower — but it builds the deepest understanding. It’s where you learn. It’s where complex, business-critical, or security-sensitive code should live.</p>

<p><strong>Best for:</strong></p>
<ul>
  <li>Core domain logic</li>
  <li>Security-critical code paths</li>
  <li>Novel problems with no clear prior art</li>
  <li>Situations where you need to deeply understand what you’re building</li>
</ul>

<hr />

<h2 id="mode-2-ai-assisted-coding">Mode 2: AI-Assisted Coding</h2>

<p>AI-assisted coding is my default mode today. It’s a constant collaboration — I handle the logic, intent, and architectural decisions, while AI tools manage boilerplate, suggest implementations, and reduce cognitive overhead.</p>

<p>The workflow feels like pair programming with a very fast, very well-read partner who sometimes hallucinates. You stay in control of the wheel, but the journey is significantly faster.</p>

<p>This mode reduces friction without reducing understanding. I still review every line. I still reason about every decision. But I spend less time on the mechanical aspects of writing code and more time on what matters.</p>

<p><strong>Best for:</strong></p>
<ul>
  <li>Most day-to-day development tasks</li>
  <li>Boilerplate and scaffolding</li>
  <li>Translating well-understood patterns into code</li>
  <li>Code reviews and refactoring suggestions</li>
</ul>

<hr />

<h2 id="mode-3-vibe-coding">Mode 3: Vibe Coding</h2>

<p>Vibe coding is the newest mode — and the most misunderstood. It’s about describing ideas in plain language and iterating on results rather than syntax. You’re directing, not writing.</p>

<p>The speed advantage is real. For prototyping, exploration, and proof-of-concept work, vibe coding collapses the feedback loop dramatically. You can test ten ideas in the time it would traditionally take to implement one.</p>

<p>But vibe coding doesn’t eliminate the need for strong engineering fundamentals — it <em>requires</em> them. You need to evaluate what gets generated, spot the subtle bugs, recognise the architectural shortcuts that will cause problems at scale. Without a strong foundation, vibe coding produces fast-moving, difficult-to-maintain code.</p>

<p><strong>Best for:</strong></p>
<ul>
  <li>Rapid prototyping and exploration</li>
  <li>Proof-of-concept development</li>
  <li>Low-stakes internal tools</li>
  <li>Learning new domains quickly</li>
</ul>

<hr />

<h2 id="the-real-skill-knowing-which-mode-to-use">The Real Skill: Knowing Which Mode to Use</h2>

<p>The most important insight isn’t about any single mode — it’s about switching between them deliberately.</p>

<p>A day of engineering now might look like: vibe coding to explore a new API (Mode 3), shifting to AI-assisted coding to build the actual integration (Mode 2), and dropping into traditional coding for the authentication and error-handling logic (Mode 1).</p>

<p>Conflating these modes is where things go wrong. Vibe coding your security layer is dangerous. Traditionally coding your boilerplate is inefficient. AI-assisted coding without review is irresponsible.</p>

<hr />

<h2 id="what-this-means-for-engineers">What This Means for Engineers</h2>

<p>The spectrum of coding modes is expanding — and that’s a good thing. More tools, more leverage, more ways to deliver value.</p>

<p>But the fundamentals don’t go away. Understanding algorithms, systems, trade-offs, and failure modes matters more as the code generation layer becomes more automated. The engineer who can work across all three modes — and knows when to use each — is more valuable than ever.</p>

<p>The headspace is different. The skill is knowing which one you’re in.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="engineering" /><category term="ai" /><category term="coding" /><category term="vibe-coding" /><category term="productivity" /><category term="software-engineering" /><category term="claude-code" /><summary type="html"><![CDATA[Modern software development has evolved into three distinct modes — traditional, AI-assisted, and vibe coding. Understanding when to use each is becoming a core engineering skill.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/traditional-vs-ai-vs-vibe-coding.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/traditional-vs-ai-vs-vibe-coding.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">LlamaCoder: Turn Your Idea into an App in Minutes</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/11/25/llamacoder-ai-app-builder/" rel="alternate" type="text/html" title="LlamaCoder: Turn Your Idea into an App in Minutes" /><published>2024-11-25T12:00:00+00:00</published><updated>2024-11-25T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/11/25/llamacoder-ai-app-builder</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/11/25/llamacoder-ai-app-builder/"><![CDATA[<p><img src="/assets/images/blog/llamacoder.jpg" alt="LlamaCoder" /></p>

<p>What if you could describe an app idea in plain English and have working code ready in minutes? That’s exactly what <strong>LlamaCoder</strong> delivers — and it signals a significant shift in who gets to build software.</p>

<hr />

<h2 id="what-is-llamacoder">What is LlamaCoder?</h2>

<p><a href="https://llamacoder.together.ai/">LlamaCoder</a> is an AI-powered code generation platform built on large language models. It converts natural language descriptions into functional application code, enabling people without deep programming expertise to build real apps.</p>

<p>Powered by <a href="https://www.together.ai/">Together AI</a>, it demonstrates how LLMs are rapidly lowering the barrier to software creation.</p>

<hr />

<h2 id="how-it-works">How It Works</h2>

<p>The workflow is straightforward:</p>

<ol>
  <li><strong>Describe your idea</strong> in plain language — no technical jargon required</li>
  <li><strong>AI generates clean, efficient code</strong> in Python, JavaScript, Swift, and more</li>
  <li><strong>Customise and refine</strong> the generated code to fit your exact needs</li>
  <li><strong>Deploy</strong> across web, iOS, or Android platforms</li>
</ol>

<p>The model interprets intent, not just keywords — it understands the <em>goal</em> of what you’re building and generates code structured accordingly.</p>

<hr />

<h2 id="key-features">Key Features</h2>

<ul>
  <li><strong>No prior coding knowledge required</strong> — describe the app, get the code</li>
  <li><strong>Rapid prototyping</strong> — go from idea to working prototype in minutes</li>
  <li><strong>Multi-language support</strong> — Python, JavaScript, Swift, and more</li>
  <li><strong>Cross-platform output</strong> — web, iOS, Android</li>
  <li><strong>Iterative refinement</strong> — describe changes in natural language to evolve the code</li>
  <li><strong>Cost-effective</strong> — dramatically reduces development time and resources</li>
</ul>

<hr />

<h2 id="who-is-it-for">Who Is It For?</h2>

<p>LlamaCoder opens software development to a much broader audience:</p>

<ul>
  <li><strong>Entrepreneurs</strong> validating product ideas without hiring developers</li>
  <li><strong>Business owners</strong> building internal tools without IT dependency</li>
  <li><strong>Designers</strong> prototyping interactive concepts directly</li>
  <li><strong>Students</strong> learning by seeing working code generated from their ideas</li>
  <li><strong>Developers</strong> accelerating boilerplate and scaffolding</li>
</ul>

<hr />

<h2 id="what-can-you-build">What Can You Build?</h2>

<p>The platform handles a surprisingly wide range of use cases:</p>

<ul>
  <li>E-commerce platforms with product listings and checkout flows</li>
  <li>Social networking applications with feeds and user profiles</li>
  <li>Productivity tools like task managers and dashboards</li>
  <li>Educational software with quizzes and progress tracking</li>
  <li>Gaming and entertainment apps</li>
</ul>

<hr />

<h2 id="the-bigger-picture">The Bigger Picture</h2>

<p>LlamaCoder is part of a broader shift in software development. As AI code generation matures, the bottleneck moves from <em>writing</em> code to <em>knowing what to build</em>. Domain expertise, product thinking, and clear problem definition become the primary skills — the technical implementation becomes increasingly automated.</p>

<p>This doesn’t eliminate the need for engineers — but it fundamentally changes what they spend their time on. The best engineers will focus on architecture, quality, edge cases, and the hard problems that AI still can’t solve reliably.</p>

<p>For everyone else, tools like LlamaCoder are a genuine step toward democratising the ability to build software.</p>

<hr />

<p><strong>Try it yourself:</strong> <a href="https://llamacoder.together.ai/">llamacoder.together.ai</a></p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="tools" /><category term="ai" /><category term="llm" /><category term="code-generation" /><category term="llamacoder" /><category term="no-code" /><category term="productivity" /><summary type="html"><![CDATA[An introduction to LlamaCoder — the AI-powered platform that converts plain language descriptions into functional application code, democratising app development for everyone.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llamacoder.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llamacoder.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Architectural Trade-off: Serverless APIs vs. Kubernetes APIs</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/11/17/serverless-vs-kubernetes-apis/" rel="alternate" type="text/html" title="Architectural Trade-off: Serverless APIs vs. Kubernetes APIs" /><published>2024-11-17T12:00:00+00:00</published><updated>2024-11-17T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/11/17/serverless-vs-kubernetes-apis</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/11/17/serverless-vs-kubernetes-apis/"><![CDATA[<p><img src="/assets/images/blog/serverless-vs-kubernetes.jpg" alt="Serverless APIs vs Kubernetes" /></p>

<p>Choosing between serverless and Kubernetes for API deployment is one of the most consequential architectural decisions in modern cloud engineering. Both are powerful paradigms — but they make fundamentally different trade-offs.</p>

<hr />

<h2 id="the-core-distinction">The Core Distinction</h2>

<p><strong>Serverless</strong> abstracts away all infrastructure. You deploy functions or containers, define triggers, and the cloud provider handles scaling, patching, and availability.</p>

<p><strong>Kubernetes</strong> gives you a programmable infrastructure platform. You define workloads, networking, and scaling policies — with full control over how your APIs run.</p>

<hr />

<h2 id="trade-off-analysis">Trade-off Analysis</h2>

<h3 id="1-operational-overhead">1. Operational Overhead</h3>

<p><strong>Serverless</strong> minimises operational burden through abstraction. There are no servers to patch, no clusters to manage, no capacity to plan. Teams can focus entirely on business logic.</p>

<p><strong>Kubernetes</strong> demands DevOps expertise. Cluster management, networking (CNI plugins, ingress controllers), storage, upgrades, and observability all require dedicated attention — often a platform engineering team.</p>

<p><strong>Winner for small teams or rapid iteration:</strong> Serverless.</p>

<hr />

<h3 id="2-scalability">2. Scalability</h3>

<p><strong>Serverless</strong> offers automatic, near-instant scaling from zero to peak — no configuration required.</p>

<p><strong>Kubernetes</strong> supports scaling through Horizontal Pod Autoscalers (HPA) and KEDA, but requires explicit configuration and has a minimum baseline cost (you can’t truly scale to zero without additional tooling like Knative).</p>

<p><strong>Winner for unpredictable traffic spikes:</strong> Serverless.</p>

<hr />

<h3 id="3-cost">3. Cost</h3>

<p><strong>Serverless</strong> uses pay-per-execution pricing — ideal for sporadic, bursty workloads. Cost can spike unexpectedly at high volumes.</p>

<p><strong>Kubernetes</strong> costs are tied to cluster utilisation. With well-optimised bin-packing and steady traffic, Kubernetes can be significantly cheaper at scale.</p>

<p><strong>Winner for high, steady-state throughput:</strong> Kubernetes.</p>

<hr />

<h3 id="4-control-and-customisation">4. Control and Customisation</h3>

<p><strong>Serverless</strong> limits infrastructure control. Runtime versions, execution environments, and networking are largely managed by the provider.</p>

<p><strong>Kubernetes</strong> provides extensive customisation: custom runtimes, sidecar patterns, service meshes, custom schedulers, and full network policy control.</p>

<p><strong>Winner for complex, specialised requirements:</strong> Kubernetes.</p>

<hr />

<h3 id="5-vendor-lock-in">5. Vendor Lock-in</h3>

<p><strong>Serverless</strong> platforms (AWS Lambda, Google Cloud Functions, Azure Functions) create provider dependency. Migrating functions across clouds is non-trivial.</p>

<p><strong>Kubernetes</strong> is open-source and runs on any cloud or on-premises. Managed distributions (EKS, GKE, AKS) add some lock-in at the control plane level, but workloads remain portable.</p>

<p><strong>Winner for portability:</strong> Kubernetes.</p>

<hr />

<h3 id="6-deployment-speed">6. Deployment Speed</h3>

<p><strong>Serverless</strong> accelerates time-to-market. Deployments are simple — upload code, configure a trigger, done.</p>

<p><strong>Kubernetes</strong> adds pipeline complexity: container builds, image registries, Helm charts or manifests, rolling deployments. A well-engineered pipeline abstracts this, but the investment is real.</p>

<p><strong>Winner for speed of initial delivery:</strong> Serverless.</p>

<hr />

<h3 id="7-debugging-and-observability">7. Debugging and Observability</h3>

<p><strong>Serverless</strong> offers limited visibility. Cold starts, ephemeral execution environments, and distributed traces across functions can be difficult to reason about.</p>

<p><strong>Kubernetes</strong> provides robust observability tooling: Prometheus, Grafana, Jaeger, and deep integration with service meshes like Istio. Full control over logging, metrics, and tracing.</p>

<p><strong>Winner for production debuggability:</strong> Kubernetes.</p>

<hr />

<h3 id="8-state-management">8. State Management</h3>

<p><strong>Serverless</strong> emphasises statelessness by design. Persistent state requires external services (DynamoDB, S3, Redis).</p>

<p><strong>Kubernetes</strong> supports stateful applications natively through StatefulSets and Persistent Volumes — suitable for databases, queues, and long-running workloads.</p>

<p><strong>Winner for stateful workloads:</strong> Kubernetes.</p>

<hr />

<h2 id="decision-framework">Decision Framework</h2>

<table>
  <thead>
    <tr>
      <th>Consider</th>
      <th>Choose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Small team, rapid iteration</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>Bursty, unpredictable traffic</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>High steady-state throughput</td>
      <td>Kubernetes</td>
    </tr>
    <tr>
      <td>Complex infrastructure requirements</td>
      <td>Kubernetes</td>
    </tr>
    <tr>
      <td>Strong portability requirements</td>
      <td>Kubernetes</td>
    </tr>
    <tr>
      <td>Minimal operational investment</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>Stateful workloads</td>
      <td>Kubernetes</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>There is no universally correct answer. Serverless excels at reducing operational overhead and handling variable load elegantly. Kubernetes excels at control, portability, and cost efficiency at scale.</p>

<p>Many mature organisations run both — serverless for event-driven, low-traffic APIs and Kubernetes for core platform services. Evaluate your API complexity, team expertise, traffic patterns, and long-term architectural goals before committing to either paradigm.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="architecture" /><category term="serverless" /><category term="kubernetes" /><category term="cloud" /><category term="architecture" /><category term="devops" /><category term="apis" /><summary type="html"><![CDATA[A structured comparison of serverless and Kubernetes as API deployment paradigms — trade-offs across operational overhead, scalability, cost, control, and vendor lock-in to help you choose the right architecture for your context.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/serverless-vs-kubernetes.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/serverless-vs-kubernetes.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Exploring Threads and Virtual Threads in Java: A Comprehensive Guide</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/11/13/java-threads-and-virtual-threads/" rel="alternate" type="text/html" title="Exploring Threads and Virtual Threads in Java: A Comprehensive Guide" /><published>2024-11-13T12:00:00+00:00</published><updated>2024-11-13T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/11/13/java-threads-and-virtual-threads</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/11/13/java-threads-and-virtual-threads/"><![CDATA[<p><img src="/assets/images/blog/java-virtual-threads.jpg" alt="Java Threads and Virtual Threads" /></p>

<p>Java’s concurrency model has undergone a significant evolution with the introduction of <strong>virtual threads</strong> (Project Loom), previewed in Java 19 and finalised in Java 21. Understanding the difference between traditional threads and virtual threads is key to building scalable, efficient applications.</p>

<hr />

<h2 id="traditional-java-threads-platform-threads">Traditional Java Threads (Platform Threads)</h2>

<p>Traditional threads — also called <strong>platform threads</strong> — are managed by the operating system and map directly to native OS threads.</p>

<p><strong>Key characteristics:</strong></p>
<ul>
  <li>Heavyweight: significant creation and management cost</li>
  <li>Consume OS resources even when blocked on I/O</li>
  <li>Limited scalability due to memory overhead per thread</li>
  <li>Scheduled by the OS, incurring context-switching overhead</li>
</ul>

<p><strong>Internal mechanics:</strong></p>
<ul>
  <li>Each Java thread maps 1:1 to a kernel-level thread with its own memory stack</li>
  <li>The OS scheduler manages thread switching through context-switching</li>
  <li>Synchronization occurs at the OS level</li>
</ul>

<p>Thousands of threads quickly become inefficient — memory constraints and context-switching overheads degrade performance significantly.</p>

<hr />

<h2 id="virtual-threads-project-loom">Virtual Threads (Project Loom)</h2>

<p>Introduced in Java 19 and made production-ready in Java 21, <strong>virtual threads</strong> are user-mode threads managed by the JVM rather than the OS.</p>

<p><strong>Key characteristics:</strong></p>
<ul>
  <li>Lightweight and inexpensive to create (millions can run simultaneously)</li>
  <li>Yield control to the JVM when blocking, freeing carrier OS threads for other work</li>
  <li>Managed by the JVM scheduler using a small pool of OS threads</li>
  <li>Cooperative multitasking through yielding</li>
</ul>

<p><strong>Internal mechanics:</strong></p>
<ul>
  <li>No direct 1:1 mapping to OS threads</li>
  <li>Use continuation-based scheduling backed by <strong>carrier threads</strong></li>
  <li>Stack frames stored in the Java heap — minimal per-thread overhead</li>
  <li>When a virtual thread blocks (e.g., on I/O), it unmounts from the carrier thread, which immediately picks up another virtual thread</li>
</ul>
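<p>The scalability claim is easy to verify on a Java 21+ runtime. This sketch (names illustrative) starts ten thousand virtual threads that all block briefly — an experiment that would exhaust memory with platform threads on many machines:</p>

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadDemo {

    // Start n virtual threads that each block briefly, then wait for all of them
    static int runAll(int n) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            threads.add(Thread.ofVirtual().start(() -> {
                try {
                    Thread.sleep(Duration.ofMillis(10)); // blocking: the virtual thread unmounts here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            }));
        }
        for (Thread t : threads) t.join();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed: " + runAll(10_000)); // completed: 10000
    }
}
```

<p>During each <code>sleep</code>, the virtual thread unmounts and its carrier immediately runs another one, so a handful of OS threads services all ten thousand tasks.</p>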

<hr />

<h2 id="side-by-side-comparison">Side-by-Side Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Platform Threads</th>
      <th>Virtual Threads</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Managed by</td>
      <td>OS</td>
      <td>JVM</td>
    </tr>
    <tr>
      <td>Cost to create</td>
      <td>High</td>
      <td>Very low</td>
    </tr>
    <tr>
      <td>Scalability</td>
      <td>Thousands</td>
      <td>Millions</td>
    </tr>
    <tr>
      <td>Blocking behaviour</td>
      <td>Blocks OS thread</td>
      <td>Unmounts from carrier</td>
    </tr>
    <tr>
      <td>Memory overhead</td>
      <td>Large (OS stack)</td>
      <td>Small (heap)</td>
    </tr>
    <tr>
      <td>Best for</td>
      <td>CPU-bound tasks</td>
      <td>I/O-bound tasks</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="practical-example-web-server">Practical Example: Web Server</h2>

<p>Consider a web server handling thousands of concurrent requests.</p>

<p><strong>With platform threads:</strong> Each request ties up an OS thread while waiting for database queries or network calls. With 10,000 concurrent requests, you need 10,000 OS threads — at a typical default stack size of around 1 MB each, that is roughly 10 GB of reserved stack memory, and exhaustion becomes a real risk.</p>

<p><strong>With virtual threads:</strong> Each request gets its own virtual thread. When blocked on I/O, the virtual thread unmounts and the carrier thread immediately handles another request. You can handle 10,000 concurrent requests with just a handful of OS threads.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Creating a virtual thread (Java 21)</span>
<span class="nc">Thread</span><span class="o">.</span><span class="na">ofVirtual</span><span class="o">().</span><span class="na">start</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
    <span class="c1">// handle request — blocking I/O is fine here</span>
    <span class="kt">var</span> <span class="n">result</span> <span class="o">=</span> <span class="n">database</span><span class="o">.</span><span class="na">query</span><span class="o">(</span><span class="s">"SELECT ..."</span><span class="o">);</span>
    <span class="n">response</span><span class="o">.</span><span class="na">send</span><span class="o">(</span><span class="n">result</span><span class="o">);</span>
<span class="o">});</span>

<span class="c1">// Or with ExecutorService</span>
<span class="k">try</span> <span class="o">(</span><span class="kt">var</span> <span class="n">executor</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newVirtualThreadPerTaskExecutor</span><span class="o">())</span> <span class="o">{</span>
    <span class="n">executor</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">handleRequest</span><span class="o">(</span><span class="n">request</span><span class="o">));</span>
<span class="o">}</span>
</code></pre></div></div>

<hr />

<h2 id="when-to-use-each">When to Use Each</h2>

<p><strong>Use platform threads when:</strong></p>
<ul>
  <li>Your workload is CPU-bound (heavy computation, no blocking)</li>
  <li>You have a small, fixed number of concurrent tasks</li>
  <li>You need precise OS-level thread control</li>
</ul>

<p><strong>Use virtual threads when:</strong></p>
<ul>
  <li>Your workload is I/O-bound (database, network, file access)</li>
  <li>You need high concurrency (web servers, microservices, messaging)</li>
  <li>You want to simplify code by avoiding reactive/callback patterns</li>
</ul>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Virtual threads don’t replace platform threads — they complement them. For the vast majority of server-side Java applications that spend most time waiting on I/O, virtual threads offer a dramatic scalability improvement with minimal code changes. Project Loom effectively brings the simplicity of synchronous code to the scale of asynchronous systems.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="java" /><category term="java" /><category term="concurrency" /><category term="virtual-threads" /><category term="project-loom" /><category term="performance" /><summary type="html"><![CDATA[A deep dive into traditional Java threads vs virtual threads (Project Loom) — how they work internally, when to use each, and why virtual threads are a game changer for I/O-bound applications.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/java-virtual-threads.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/java-virtual-threads.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why is Kafka So Fast? Unveiling the Secrets Behind Kafka’s Speed</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/10/17/why-kafka-is-so-fast/" rel="alternate" type="text/html" title="Why is Kafka So Fast? Unveiling the Secrets Behind Kafka’s Speed" /><published>2024-10-17T12:00:00+00:00</published><updated>2024-10-17T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/10/17/why-kafka-is-so-fast</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/10/17/why-kafka-is-so-fast/"><![CDATA[<p><img src="/assets/images/blog/kafka-speed.jpg" alt="Why is Kafka So Fast?" /></p>

<p>Apache Kafka is renowned for its extraordinary throughput and low latency. But what actually makes it so fast? The answer lies in a combination of deliberate engineering decisions that work together to minimize overhead at every layer.</p>

<hr />

<h2 id="1-sequential-io-optimizing-disk-access">1. Sequential I/O: Optimizing Disk Access</h2>

<p>Kafka employs an <strong>append-only log</strong> architecture that leverages sequential rather than random disk access. Messages are written in the order they arrive and stored sequentially — data is continuously appended to the end of the log file.</p>

<p>This approach minimizes seek time on mechanical drives. When handling thousands of messages per second from IoT sensors, each new entry simply gets added to the log’s end, avoiding the expensive physical movement of disk read/write heads.</p>

<p>Sequential reads and writes are orders of magnitude faster than random access — Kafka exploits this at the core of its storage design.</p>
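<p>The pattern is easy to sketch in Java: opening a channel in <code>APPEND</code> mode forces every write to the current end of the file, so access stays purely sequential. This is an illustrative stand-in for a log segment, not Kafka's actual implementation:</p>

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog {

    // Append n records to a fresh segment file; APPEND mode guarantees each
    // write lands at the end of the file, so the disk never has to seek
    static Path writeRecords(int n) throws IOException {
        Path segment = Files.createTempFile("segment", ".log"); // stand-in for a Kafka log segment
        try (FileChannel channel = FileChannel.open(segment,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int offset = 0; offset < n; offset++) {
                byte[] record = ("offset=" + offset + " payload=sensor-reading\n")
                        .getBytes(StandardCharsets.UTF_8);
                channel.write(ByteBuffer.wrap(record)); // no seek: just append
            }
        }
        return segment;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Files.readAllLines(writeRecords(3)).size() + " records appended");
    }
}
```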

<hr />

<h2 id="2-zero-copy-principle-efficient-data-transfer">2. Zero-Copy Principle: Efficient Data Transfer</h2>

<p>Traditionally, transferring data from disk to network involves multiple copies through kernel and user-space buffers. Kafka bypasses this with <strong>zero-copy</strong> using system calls like <code class="language-plaintext highlighter-rouge">sendfile()</code> on Linux.</p>

<p>This technique instructs the kernel to move data directly from the disk buffer to the network socket buffer — eliminating unnecessary intermediate copies, reducing CPU overhead, and maximizing delivery throughput.</p>

<p>The practical benefit: transferring a 1 GB batch of logs bypasses the costly round-trip through user space entirely.</p>
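<p>The same mechanism is exposed in Java as <code class="language-plaintext highlighter-rouge">FileChannel.transferTo()</code>, which delegates to <code class="language-plaintext highlighter-rouge">sendfile()</code> where the OS supports it. A minimal sketch, using a second file as a stand-in for the socket channel Kafka would actually write to:</p>

```java
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

Path src = Files.createTempFile("segment", ".log");
Files.write(src, "log bytes to ship to a consumer".getBytes());
Path dst = Files.createTempFile("socket-stand-in", ".bin");

try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
     FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
    // transferTo() lets the kernel move bytes from the page cache
    // straight to the target channel, skipping user-space buffers.
    in.transferTo(0, in.size(), out);
}
```

<p>Note that the application never holds the data in a byte array of its own; the kernel does the whole transfer.</p>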

<hr />

<h2 id="3-message-compression-reducing-transmission-size">3. Message Compression: Reducing Transmission Size</h2>

<p>Kafka supports compression algorithms including <strong>GZIP, Snappy, and LZ4</strong> at the producer level. Compression is applied to message batches before transmission, then decompressed by consumers.</p>

<p>This is particularly valuable when handling repetitive data like application logs, where compression ratios can be significant — reducing both network bandwidth and storage requirements.</p>
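<p>The effect is easy to see with the JDK's built-in GZIP support. This is a standalone sketch of the principle; in Kafka itself, compression is enabled via the producer's <code class="language-plaintext highlighter-rouge">compression.type</code> setting rather than by hand:</p>

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

// Repetitive data, like application logs, compresses extremely well.
String line = "2024-10-17T12:00:00Z INFO request handled in 3ms\n";
byte[] raw = line.repeat(1000).getBytes();

ByteArrayOutputStream buf = new ByteArrayOutputStream();
try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
    gz.write(raw);
}
byte[] compressed = buf.toByteArray();
// compressed is a small fraction of the raw size
```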

<hr />

<h2 id="4-message-batching-efficient-processing">4. Message Batching: Efficient Processing</h2>

<p>Rather than handling messages one at a time, Kafka groups multiple messages into <strong>batches</strong> before disk writes or network transmission. Grouping 100 metrics into a single batch, for example:</p>

<ul>
  <li>Reduces the number of I/O operations</li>
  <li>Decreases network round-trips</li>
  <li>Lowers broker CPU load</li>
</ul>

<p>This amortises the fixed overhead of each operation across many messages, dramatically improving throughput.</p>
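<p>Batching is controlled by standard producer settings. A minimal configuration sketch (the broker address is a placeholder):</p>

```java
import java.util.Properties;

// Standard Kafka producer configuration keys; the broker address
// below is a placeholder for illustration.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("batch.size", "65536");     // accumulate up to 64 KB per batch
props.put("linger.ms", "10");         // wait up to 10 ms to fill a batch
props.put("compression.type", "lz4"); // compress each batch before sending
```

<p>A larger <code class="language-plaintext highlighter-rouge">batch.size</code> combined with a small <code class="language-plaintext highlighter-rouge">linger.ms</code> trades a few milliseconds of latency for markedly higher throughput.</p>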

<hr />

<h2 id="5-efficient-memory-management-and-caching">5. Efficient Memory Management and Caching</h2>

<p>Kafka maintains <strong>in-memory indexes</strong> and leverages the OS page cache for recently accessed log segments. This enables rapid message retrieval without frequent disk reads — particularly when consumers request recently produced messages, which are almost always already in the page cache.</p>

<p>Kafka intentionally relies on the OS page cache rather than implementing its own heap-based caching, keeping JVM garbage collection pressure low.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Kafka’s performance is not the result of a single trick — it’s a comprehensive engineering approach:</p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Benefit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Sequential I/O</td>
      <td>Eliminates random disk seek time</td>
    </tr>
    <tr>
      <td>Zero-copy</td>
      <td>Removes redundant data copies</td>
    </tr>
    <tr>
      <td>Compression</td>
      <td>Reduces bandwidth and storage</td>
    </tr>
    <tr>
      <td>Batching</td>
      <td>Amortises I/O overhead across messages</td>
    </tr>
    <tr>
      <td>Page cache</td>
      <td>Fast reads without disk access</td>
    </tr>
  </tbody>
</table>

<p>Together, these decisions make Kafka capable of handling millions of messages per second with consistently low latency — a system designed from the ground up for high-throughput streaming.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="distributed-systems" /><category term="kafka" /><category term="performance" /><category term="streaming" /><category term="backend" /><summary type="html"><![CDATA[A deep dive into the engineering decisions that make Apache Kafka one of the fastest messaging systems — sequential I/O, zero-copy, batching, compression, and intelligent caching.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/kafka-speed.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/kafka-speed.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>