<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://shanmuga-sundaram-n.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://shanmuga-sundaram-n.github.io/" rel="alternate" type="text/html" /><updated>2026-04-08T01:07:27+00:00</updated><id>https://shanmuga-sundaram-n.github.io/feed.xml</id><title type="html">Shanmuga Sundaram Natarajan’s Personal Website</title><subtitle>Hands-on Architect &amp; Technical Leader | 18+ Years Experience | AI-First Software Delivery</subtitle><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><entry><title type="html">Why Most Teams Fail at Collaboration — And How Domain-Driven Design Fixes It</title><link href="https://shanmuga-sundaram-n.github.io/blog/2026/04/01/why-teams-fail-at-collaboration-ddd/" rel="alternate" type="text/html" title="Why Most Teams Fail at Collaboration — And How Domain-Driven Design Fixes It" /><published>2026-04-01T12:00:00+00:00</published><updated>2026-04-01T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2026/04/01/why-teams-fail-at-collaboration-ddd</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2026/04/01/why-teams-fail-at-collaboration-ddd/"><![CDATA[<p>Let me tell you about a bug that took three weeks to track down.</p>

<p>The backend team had added a <code class="language-plaintext highlighter-rouge">status</code> column to the <code class="language-plaintext highlighter-rouge">users</code> table. They meant it to track onboarding progress — steps 0 through 4. The frontend team read the same field and treated it as a boolean: account active or not. The data team exported it weekly and bucketed users into cold, warm, and hot engagement tiers.</p>

<p>No one documented any of this. No one thought they needed to. The column was called <code class="language-plaintext highlighter-rouge">status</code>. What could be clearer?</p>

<p>Three teams. One column. Three completely different mental models quietly coexisting in production until the day they collided — a badly timed migration that broke the frontend, corrupted the analytics pipeline, and sent the on-call engineer on a three-week archaeological dig through git history.</p>

<p>That’s the thing about collaboration failures. They don’t announce themselves. They hide in the gap between what you think a word means and what your colleague thinks it means. And by the time you find them, you’re usually staring at a production incident at 2am.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TD
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    FIELD["users.status&lt;br/&gt;one column,&lt;br/&gt;zero documentation"]:::yellow

    FIELD --&gt; BE["Backend&lt;br/&gt;status = onboarding step&lt;br/&gt;0 · 1 · 2 · 3 · 4"]:::blue
    FIELD --&gt; FE["Frontend&lt;br/&gt;status = account live?&lt;br/&gt;true · false"]:::blue
    FIELD --&gt; DA["Analytics&lt;br/&gt;status = engagement tier&lt;br/&gt;cold · warm · hot"]:::blue

    BE --&gt; OOF["💥 3-week incident&lt;br/&gt;Nobody was lying.&lt;br/&gt;Nobody was wrong.&lt;br/&gt;Nobody talked."]:::red

    FE --&gt; OOF
    DA --&gt; OOF

    OOF --&gt; LESSON["The problem wasn't technical.&lt;br/&gt;It was a missing shared model."]:::dim
</code></pre>
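<p>The collision is easy to reproduce in miniature. A hedged sketch — none of these functions existed in the actual incident, and the step names and tier thresholds are invented — but it shows how one integer can quietly carry three incompatible meanings:</p>

```python
# Three teams, one column: the same raw value read three different ways.
# All names and thresholds below are illustrative, not from the real system.

def backend_view(status: int) -> str:
    """Backend: status is an onboarding step, 0 through 4."""
    steps = ["signed_up", "verified_email", "added_profile", "invited_team", "done"]
    return steps[status]

def frontend_view(status: int) -> bool:
    """Frontend: any non-zero status means 'account active'."""
    return status != 0

def analytics_view(status: int) -> str:
    """Analytics: bucket the same number into engagement tiers."""
    if status >= 3:
        return "hot"
    if status >= 1:
        return "warm"
    return "cold"

status = 2  # one row, one value, three mental models
print(backend_view(status))    # "added_profile"
print(frontend_view(status))   # True
print(analytics_view(status))  # "warm"
```

<p>Every function is internally consistent. Nothing fails until a migration changes what the number means for one reader and not the others.</p>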

<hr />

<h2 id="its-not-a-people-problem-its-a-structure-problem">It’s not a people problem. It’s a structure problem.</h2>

<p>Most engineering managers diagnose this as a communication failure and reach for process. More standups. Mandatory documentation. A new Confluence page that everyone writes once and nobody ever reads again.</p>

<p>I’ve watched this play out at startups, at mid-sized companies, at enterprises. The Agile ceremonies don’t fix it. The retrospectives surface it but don’t solve it. The reason is that this isn’t a people problem or a process problem. It’s a <strong>structural</strong> problem.</p>

<p>When teams don’t share a coherent model of the domain they’re working in, every handoff becomes a translation exercise. And unlike foreign-language translation, nobody knows a translation is even happening. People assume they’re speaking the same language because they’re using the same words. They’re not.</p>

<p>This is what Domain-Driven Design (DDD) addresses. Eric Evans introduced the term in 2003 and the core idea is deceptively simple: the structure of your software should reflect the structure of the business. Not the structure of your database. Not the structure of your org chart from three reorgs ago. The actual domain — the problem space your business exists to solve.</p>

<p>Where DDD gets interesting is in how deeply it treats language as a first-class design concern.</p>

<hr />

<h2 id="shared-language-isnt-a-soft-skill--its-an-architecture-decision">Shared Language Isn’t a Soft Skill — It’s an Architecture Decision</h2>

<p>DDD calls this <em>Ubiquitous Language</em>. The idea is that every team — engineers, product managers, designers, analysts — uses the exact same vocabulary to describe the domain. No synonyms. No “well, we call it an order but finance calls it a transaction.” Just one term, one definition, used everywhere consistently.</p>

<p>That includes the code.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph BEFORE["Before — everyone improvises"]
        direction TB
        P1["PM writes: 'Purchase'"]:::red
        P2["Eng codes: 'Transaction'"]:::red
        P3["Design mocks: 'Booking'"]:::red
        P4["Analyst queries: 'Order'"]:::red
        P1 &amp; P2 &amp; P3 &amp; P4 --&gt; CHAOS["4 mental models.&lt;br/&gt;Nobody catches it&lt;br/&gt;until it breaks."]:::red
    end

    subgraph AFTER["After — one shared glossary"]
        direction TB
        G["Order&lt;br/&gt;─────────────────────&lt;br/&gt;A confirmed purchase with&lt;br/&gt;payment intent, owned by&lt;br/&gt;a Customer, containing&lt;br/&gt;one or more LineItems.&lt;br/&gt;Not a cart. Not a quote."]:::mauve
        Q1["PM"]:::yellow
        Q2["Eng"]:::yellow
        Q3["Design"]:::yellow
        Q4["Analyst"]:::yellow
        Q1 &amp; Q2 &amp; Q3 &amp; Q4 --&gt; G --&gt; GOOD["One model.&lt;br/&gt;Code matches the spec.&lt;br/&gt;Spec matches the meeting."]:::green
    end
</code></pre>

<p>When engineers name their classes and methods using the same language the business uses, something subtle but powerful happens: a product manager can read the code and recognise the concepts. An engineer can read a product spec without mentally translating it. Bugs that come from misunderstanding requirements — a whole category of bugs — start to disappear.</p>

<p>The discipline this requires is harder than it sounds. You have to resist the urge to rename things to what <em>you</em> think is cleaner. You have to resist the abstraction instinct. If the business calls it an “Order”, the class is <code class="language-plaintext highlighter-rouge">Order</code>. Not <code class="language-plaintext highlighter-rouge">PurchaseRecord</code>, not <code class="language-plaintext highlighter-rouge">TxnEntity</code>, not <code class="language-plaintext highlighter-rouge">SaleModel</code>.</p>

<p>The payoff is enormous. Teams that build on a shared language move noticeably faster because the gap between “what we decided” and “what we built” closes.</p>
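<p>What this looks like in code can be sketched in a few lines. This is a minimal, hypothetical example — the class shape and method names are mine, not from any real codebase — but the point is that the names match the glossary verbatim:</p>

```python
# Hedged sketch of Ubiquitous Language in code: the class and its methods
# use the business's exact words. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class Order:
    """An Order in the glossary's sense: a confirmed purchase with payment
    intent, containing one or more line items. Not a cart, not a quote."""
    order_id: str
    customer_id: str
    line_items: list = field(default_factory=list)
    status: str = "pending"

    def place(self) -> None:
        # "Place an order" is the phrase the PM uses in the spec;
        # the method name matches it word for word.
        if not self.line_items:
            raise ValueError("an Order must contain at least one LineItem")
        self.status = "confirmed"

    def cancel(self) -> None:
        self.status = "cancelled"
```

<p>A product manager reading <code class="language-plaintext highlighter-rouge">order.place()</code> in a code review recognises the verb from their own spec. That recognition is the whole point.</p>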

<hr />

<h2 id="every-big-model-eventually-collapses-under-its-own-weight">Every Big Model Eventually Collapses Under Its Own Weight</h2>

<p>Here’s a trap almost every growing company falls into.</p>

<p>Things start simple. You have a <code class="language-plaintext highlighter-rouge">Customer</code> object. It has a name, an email, maybe a billing address. Everyone uses it. Fine.</p>

<p>Then Sales needs to add a preferred rep. Support needs to track open ticket counts. Finance needs invoice history and credit limits. Marketing wants engagement scores. Twelve months later, <code class="language-plaintext highlighter-rouge">Customer</code> has 60 fields, a confusing network of relationships, and a comment at the top of the file that says <code class="language-plaintext highlighter-rouge">// DO NOT TOUCH - ask @someone before changing anything</code>.</p>

<p>Nobody owns it, so everybody owns it. Which means nobody does.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph LR
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph SALES["Sales Context  ·  Team: Commerce"]
        C1["Customer&lt;br/&gt;─────────────&lt;br/&gt;Shopping cart&lt;br/&gt;Purchase history&lt;br/&gt;Wishlist&lt;br/&gt;Assigned rep"]:::yellow
    end

    subgraph SUPPORT["Support Context  ·  Team: CX"]
        C2["Customer&lt;br/&gt;─────────────&lt;br/&gt;Open tickets&lt;br/&gt;Case history&lt;br/&gt;CSAT score&lt;br/&gt;SLA tier"]:::teal
    end

    subgraph FINANCE["Finance Context  ·  Team: Finance"]
        C3["Customer&lt;br/&gt;─────────────&lt;br/&gt;Invoice history&lt;br/&gt;Payment methods&lt;br/&gt;Credit limit&lt;br/&gt;Tax status"]:::blue
    end

    NOTE["Same word. Three lean models.&lt;br/&gt;Each team owns theirs.&lt;br/&gt;No one drowns in a god object."]:::dim

    C1 -.-&gt;|"CustomerID only"| NOTE
    C2 -.-&gt;|"CustomerID only"| NOTE
    C3 -.-&gt;|"CustomerID only"| NOTE
</code></pre>

<p>DDD gives you the concept of a <strong>Bounded Context</strong> to deal with this. A Bounded Context is just a boundary — explicit, named, intentional — within which your model applies. Inside Sales, <code class="language-plaintext highlighter-rouge">Customer</code> means one thing. Inside Finance, <code class="language-plaintext highlighter-rouge">Customer</code> means something else. Both are valid. Neither one bleeds into the other.</p>

<p>The only thing they share is a stable identifier (a <code class="language-plaintext highlighter-rouge">CustomerID</code>) that lets them talk about the same real-world entity without needing to agree on every attribute.</p>

<p>This isn’t just a modelling technique. It’s a team ownership technique. Bounded Contexts map directly to team responsibilities. The Sales context is the Commerce team’s problem. The Finance context is the Finance team’s problem. When something breaks in Finance’s <code class="language-plaintext highlighter-rouge">Customer</code> model, you know exactly whose phone to call. And crucially, the Commerce team doesn’t need to be in that conversation at all.</p>
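<p>In code, the two-models-one-identifier idea is small enough to sketch directly. The field names below are illustrative, not prescriptive — the structural point is that the only shared type is the id:</p>

```python
# Sketch: two bounded contexts, each with its own Customer model.
# They share exactly one concept -- the identifier. All field names invented.

from dataclasses import dataclass

@dataclass(frozen=True)
class CustomerId:
    """The shared identifier: the one thing both contexts agree on."""
    value: str

# --- Sales context: its Customer is about buying ---
@dataclass
class SalesCustomer:
    id: CustomerId
    assigned_rep: str
    wishlist: list

# --- Finance context: its Customer is about money ---
@dataclass
class FinanceCustomer:
    id: CustomerId
    credit_limit_cents: int
    tax_status: str

# Both models describe the same real-world person without agreeing
# on a single attribute beyond the id.
cid = CustomerId("cust-42")
sales = SalesCustomer(id=cid, assigned_rep="Priya", wishlist=[])
finance = FinanceCustomer(id=cid, credit_limit_cents=500_000, tax_status="standard")
assert sales.id == finance.id  # same entity, different models
```

<p>Neither class imports the other. That absence of an import is the boundary, made literal.</p>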

<hr />

<h2 id="drawing-the-map-nobody-draws">Drawing the Map Nobody Draws</h2>

<p>Most teams have multiple contexts whether they know it or not. The problem is they’re invisible. Dependencies between teams exist but nobody’s written them down. You discover them when something changes upstream and breaks three things downstream that nobody knew were connected.</p>

<p>A <strong>Context Map</strong> makes the invisible visible. It’s a diagram — doesn’t have to be fancy, hand-drawn is fine — showing all your contexts and how they relate to each other.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TD
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8
    

    INVENTORY["📦 Inventory&lt;br/&gt;Team: Fulfillment"]:::yellow
    PAYMENTS["💳 Payments&lt;br/&gt;Team: Finance"]:::yellow
    IDENTITY["🪪 Identity&lt;br/&gt;Team: Platform"]:::yellow

    ACL["Anti-Corruption Layer&lt;br/&gt;we translate their mess&lt;br/&gt;so it doesn't leak in"]:::red

    ORDERS["🛒 Orders&lt;br/&gt;Team: Commerce&lt;br/&gt;— this is the core —"]:::mauve

    BUS["Event Bus&lt;br/&gt;OrderPlaced · OrderShipped&lt;br/&gt;PaymentSettled · OrderCancelled"]:::teal

    NOTIF["🔔 Notifications&lt;br/&gt;Team: Engagement"]:::green
    REPORT["📊 Reporting&lt;br/&gt;Team: Analytics"]:::green
    SUPPORT["🎧 Support&lt;br/&gt;Team: CX"]:::green

    L1["Open Host Service&lt;br/&gt;they expose a stable API"]:::dim
    L2["Shared Kernel&lt;br/&gt;just the user identity bits"]:::dim
    L3["Customer / Supplier&lt;br/&gt;we told them what we need"]:::dim

    INVENTORY --&gt; ACL --&gt; ORDERS
    PAYMENTS --&gt; L1 --&gt; ORDERS
    IDENTITY --&gt; L2 --&gt; ORDERS

    ORDERS --&gt; BUS
    BUS --&gt; NOTIF
    BUS --&gt; REPORT
    ORDERS --&gt; L3 --&gt; SUPPORT
</code></pre>

<p>What makes this useful isn’t just the picture — it’s the relationship labels. DDD has names for the different ways contexts can relate, and those names carry a lot of weight:</p>

<p>An <strong>Anti-Corruption Layer</strong> is what you build when you have to integrate with a system that has a messy model you don’t control. You write a translation layer that converts their concepts into yours, so their chaos doesn’t leak into your clean domain. If you’ve ever written an adapter for a third-party API with bizarre field names and six levels of nesting, you’ve built an ACL without knowing what to call it.</p>
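<p>A minimal ACL sketch makes the idea concrete. The vendor payload shape and field names here are entirely invented — the pattern is what matters: one function owns all knowledge of the external mess, and nothing past it ever sees vendor vocabulary:</p>

```python
# Hypothetical Anti-Corruption Layer: translate a messy third-party payload
# into a clean internal model. The vendor's field names below are invented.

from dataclasses import dataclass

@dataclass(frozen=True)
class StockLevel:
    """Our domain's concept: how many units of a product we can sell."""
    product_id: str
    available: int

def from_vendor_payload(payload: dict) -> StockLevel:
    """The ACL: all knowledge of the vendor's shape lives here, nowhere else."""
    # The vendor nests quantities several levels deep and uses its own ids.
    item = payload["rsp"]["itm"][0]
    return StockLevel(
        product_id=item["xRef"]["ourSku"],
        available=int(item["qtyData"]["onHand"]) - int(item["qtyData"]["held"]),
    )

vendor_response = {
    "rsp": {"itm": [{"xRef": {"ourSku": "SKU-1"},
                     "qtyData": {"onHand": "12", "held": "2"}}]}
}
print(from_vendor_payload(vendor_response))  # StockLevel(product_id='SKU-1', available=10)
```

<p>When the vendor changes their format, exactly one function changes. The domain model never notices.</p>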

<p>A <strong>Shared Kernel</strong> means two teams own a small, explicitly agreed piece of the model together. Changes require coordination. Use this sparingly — shared ownership is shared risk.</p>

<p>A <strong>Customer/Supplier</strong> relationship is refreshingly honest. The downstream team (customer) tells the upstream team (supplier) what they need, and the upstream team tries to deliver it. Not always with perfect success, but at least it’s named.</p>

<p>Having names for these things matters because it lets you have more precise conversations. Instead of “we have a dependency,” you can say “we’re downstream from Payments in a Customer/Supplier relationship, and they keep breaking our integration.” That’s a different conversation.</p>

<hr />

<h2 id="your-architecture-is-just-your-team-structure-reflected-back-at-you">Your Architecture Is Just Your Team Structure, Reflected Back at You</h2>

<p>Conway’s Law is one of those observations that sounds cynical but is actually just true: any organisation that designs a system will produce a design whose structure mirrors the organisation’s communication structure.</p>

<p>Teams that don’t talk produce systems with unclear interfaces between them. Teams organised by technology layer — a frontend team, a backend team, a database team — produce a layered monolith. Not because anyone planned it that way, but because that’s how the Conway attractor works.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TB
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph BAD["❌ Org by tech layer → distributed monolith"]
        direction LR
        FET["Frontend Team"]:::red --&gt; FEA["React App"]:::red
        BET["Backend Team"]:::red --&gt; BEA["One giant API"]:::red
        DBT["DB Team"]:::red --&gt; DBA["Shared DB everyone writes to"]:::red
        FEA --&gt; BEA --&gt; DBA
    end

    subgraph GOOD["✅ Org by domain → real autonomy"]
        direction LR
        CT["Commerce Team"]:::yellow --&gt; CS["Order Service + its own DB"]:::green
        IT["Identity Team"]:::yellow --&gt; IS["Auth Service + its own DB"]:::green
        AT["Analytics Team"]:::yellow --&gt; AS["Reporting Service + its own DB"]:::green
    end

    CONWAY["Conway's Law in action.&lt;br/&gt;Flip your org structure&lt;br/&gt;and the architecture follows."]:::dim

    BAD -.-&gt; CONWAY
    GOOD -.-&gt; CONWAY
</code></pre>

<p>The insight DDD gives you here is to use Conway’s Law deliberately. If you want an architecture that’s organised by business domain, organise your teams by business domain first. The architecture will naturally follow. This is sometimes called the “Inverse Conway Maneuver” and it’s more effective than any amount of architectural governance.</p>

<hr />

<h2 id="a-vocabulary-for-the-code-itself">A Vocabulary for the Code Itself</h2>

<p>The strategic stuff — contexts, maps, language — gets the most attention, but DDD also has a set of tactical patterns that give engineers a shared vocabulary for implementation. These are the building blocks you use once you’ve drawn your boundaries.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
classDiagram
    class Order {
        &lt;&lt;Aggregate Root&gt;&gt;
        +OrderId id
        +CustomerId customerId
        +OrderStatus status
        +List~LineItem~ items
        +Money total
        +place() OrderPlaced
        +cancel() OrderCancelled
        +addItem(productId, qty)
    }

    class LineItem {
        &lt;&lt;Entity&gt;&gt;
        +LineItemId id
        +ProductId productId
        +Quantity qty
        +Money unitPrice
        +subtotal() Money
    }

    class Money {
        &lt;&lt;Value Object&gt;&gt;
        +Decimal amount
        +Currency currency
        +add(Money) Money
        +equals(Money) bool
    }

    class OrderStatus {
        &lt;&lt;Value Object&gt;&gt;
        PENDING
        CONFIRMED
        SHIPPED
        CANCELLED
    }

    class OrderPlaced {
        &lt;&lt;Domain Event&gt;&gt;
        +OrderId orderId
        +CustomerId customerId
        +Money total
        +DateTime occurredAt
    }

    class OrderRepository {
        &lt;&lt;Repository&gt;&gt;
        +findById(OrderId) Order
        +save(Order) void
        +findByCustomer(CustomerId) List~Order~
    }

    Order "1" *-- "1..*" LineItem : contains
    Order *-- Money : total
    Order *-- OrderStatus : status
    Order ..&gt; OrderPlaced : emits
    OrderRepository ..&gt; Order : persists
    LineItem *-- Money : unitPrice
</code></pre>

<p>An <strong>Aggregate</strong> is a cluster of objects that change together, with a single root that controls access. In the diagram above, <code class="language-plaintext highlighter-rouge">Order</code> is the root. You never reach into a <code class="language-plaintext highlighter-rouge">LineItem</code> directly from outside — you always go through <code class="language-plaintext highlighter-rouge">Order</code>. This enforces consistency boundaries in a way that’s explicit and understandable.</p>
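<p>A stripped-down sketch of that access rule (simplified deliberately — plain floats instead of a <code class="language-plaintext highlighter-rouge">Money</code> type, dicts instead of a <code class="language-plaintext highlighter-rouge">LineItem</code> class):</p>

```python
# Minimal aggregate sketch. The item list is private; every change
# goes through the Order root, so the root can enforce invariants.

class Order:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self._items = []  # leading underscore: outside code never touches this

    def add_item(self, product_id: str, qty: int, unit_price: float) -> None:
        # The root validates, so the invariant holds everywhere.
        if qty <= 0:
            raise ValueError("quantity must be positive")
        self._items.append({"product_id": product_id, "qty": qty,
                            "unit_price": unit_price})

    def total(self) -> float:
        return sum(i["qty"] * i["unit_price"] for i in self._items)

order = Order("ord-1")
order.add_item("SKU-1", qty=2, unit_price=9.50)
order.add_item("SKU-2", qty=1, unit_price=5.00)
print(order.total())  # 24.0
```

<p>Because nothing outside the aggregate can append to the list directly, the "quantity must be positive" rule can never be bypassed.</p>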

<p><strong>Value Objects</strong> are immutable. <code class="language-plaintext highlighter-rouge">Money</code> has no identity — a $10 value object is interchangeable with any other $10 value object of the same currency. This is a meaningful design decision that prevents a whole class of bugs (mutating a price on one object and accidentally affecting another).</p>

<p><strong>Domain Events</strong> record facts. <code class="language-plaintext highlighter-rouge">OrderPlaced</code> means something happened — past tense, immutable, true. It’s not a command. It’s not a request. Something happened and we’re recording it. This distinction matters for how teams interact with each other.</p>
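<p>Both ideas — values with no identity, and facts that never change — fall out naturally from immutable types. A sketch in Python, where a frozen dataclass gives immutability and value equality for free (the names mirror the class diagram above; the cents-based representation is my own simplification):</p>

```python
# Sketch: a Value Object and a Domain Event as frozen dataclasses.
# Frozen means any attempt to mutate raises FrozenInstanceError.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Money:
    """Value Object: no identity. Two $10 values are interchangeable."""
    amount_cents: int
    currency: str

    def add(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError("cannot add different currencies")
        # No mutation: adding produces a new value.
        return Money(self.amount_cents + other.amount_cents, self.currency)

@dataclass(frozen=True)
class OrderPlaced:
    """Domain Event: a past-tense, immutable fact. Not a command or request."""
    order_id: str
    total: Money
    occurred_at: datetime

assert Money(1000, "USD") == Money(1000, "USD")  # equal by value, not identity
subtotal = Money(1000, "USD").add(Money(250, "USD"))
event = OrderPlaced("ord-1", subtotal, datetime.now(timezone.utc))
# event.total = ...  would raise FrozenInstanceError: facts don't change
```

<p>The immutability isn't ceremony — it's the bug-prevention the paragraph above describes, enforced by the type system rather than by convention.</p>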

<hr />

<h2 id="domain-events-are-how-teams-stop-stepping-on-each-other">Domain Events Are How Teams Stop Stepping on Each Other</h2>

<p>The thing that changed how I think about team autonomy was understanding Domain Events properly. When a context publishes an event, it’s announcing a fact to the rest of the world without knowing or caring who’s listening. Other contexts react. No direct coupling. No “we need to call the Notifications team’s endpoint before we can ship.”</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
sequenceDiagram
    participant C as Customer
    participant O as Orders&lt;br/&gt;(Commerce Team)
    participant B as Event Bus
    participant P as Payments&lt;br/&gt;(Finance Team)
    participant N as Notifications&lt;br/&gt;(Engagement Team)
    participant R as Reporting&lt;br/&gt;(Analytics Team)

    C-&gt;&gt;O: place order
    O-&gt;&gt;O: validate, create Order aggregate
    O--&gt;&gt;B: OrderPlaced { orderId, total, customerId }
    Note over O,B: Commerce ships this.&lt;br/&gt;Nobody else was consulted.

    B--&gt;&gt;P: OrderPlaced
    P-&gt;&gt;P: charge card, record transaction
    P--&gt;&gt;B: PaymentSucceeded { orderId, txnId }

    B--&gt;&gt;N: PaymentSucceeded
    N-&gt;&gt;C: confirmation email

    B--&gt;&gt;R: OrderPlaced + PaymentSucceeded
    R-&gt;&gt;R: update revenue dashboard

    Note over P,R: All three teams deploy&lt;br/&gt;on their own schedule.&lt;br/&gt;No cross-team standups to ship.
</code></pre>

<p>Notice what’s missing: the Commerce team never calls the Notifications team directly. Engagement never has to wait on Finance to expose an API. Analytics doesn’t need to ask Commerce for access to order data. Everyone reacts to shared facts. Everyone can ship without scheduling around everyone else.</p>
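<p>The publish/react shape can be sketched with a toy in-memory bus — this is an illustration of the pattern, not a production broker, and the handler and event names are invented:</p>

```python
# Toy in-memory event bus: publishers announce facts; subscribers react.
# A sketch of the shape only -- real systems use a durable broker.

from collections import defaultdict

class EventBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The publisher neither knows nor cares who is listening.
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
log = []

# Each team registers its own reaction, on its own schedule.
bus.subscribe("OrderPlaced", lambda e: log.append(f"payments: charge {e['order_id']}"))
bus.subscribe("OrderPlaced", lambda e: log.append(f"reporting: count {e['order_id']}"))

# Commerce publishes a fact. That's the whole contract.
bus.publish("OrderPlaced", {"order_id": "ord-1", "total_cents": 2400})
print(log)  # ['payments: charge ord-1', 'reporting: count ord-1']
```

<p>Adding a fourth consumer requires zero changes to the publisher — which is exactly the autonomy the sequence diagram above is describing.</p>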

<p>This is the real payoff of Domain Events for collaboration. It’s not a technical trick — it’s a social contract encoded in architecture. “We commit to publishing accurate events. Do what you want with them.”</p>

<hr />

<h2 id="what-this-actually-looked-like-at-a-real-company">What This Actually Looked Like at a Real Company</h2>

<p>A fintech startup I know had three teams sharing a Rails monolith. Payments, Accounts, and Reporting all worked in the same codebase, all writing to the same <code class="language-plaintext highlighter-rouge">Account</code> model. It had accumulated 60-something fields over 18 months. Nobody wanted to touch it.</p>

<p>Releasing anything required all three teams to coordinate. When Payments changed an account state transition, a Reporting query somewhere would fail silently, and nobody noticed. The Accounts team had a standing rule that any migration needed sign-off from two other teams before it went out. Releases happened every three weeks. Everyone was exhausted.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
graph TB
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph BEFORE["Before — one model, three teams, constant friction"]
        direction TB
        SHARED["Account&lt;br/&gt;60+ fields · 20+ associations&lt;br/&gt;// DO NOT TOUCH&lt;br/&gt;// ask someone first"]:::red
        PT["Payments Team"]:::yellow --&gt; SHARED
        AT["Accounts Team"]:::yellow --&gt; SHARED
        RT["Reporting Team"]:::yellow --&gt; SHARED
        SHARED --&gt; PAIN["3-week release cycles.&lt;br/&gt;Every migration needs&lt;br/&gt;sign-off from 2 other teams.&lt;br/&gt;Everyone is tired."]:::red
    end

    subgraph AFTER["After — three contexts, events, autonomy"]
        direction LR
        PC["Payment Processing&lt;br/&gt;owns: charge lifecycle&lt;br/&gt;refunds · disputes"]:::mauve
        AC["Account Management&lt;br/&gt;owns: balance · ledger&lt;br/&gt;account state"]:::mauve
        BUS2["Event Bus&lt;br/&gt;PaymentSettled&lt;br/&gt;AccountCredited&lt;br/&gt;AccountDebited"]:::teal
        RC["Financial Reporting&lt;br/&gt;owns: its own read model&lt;br/&gt;built from events"]:::green
        PC --&gt; BUS2
        AC --&gt; BUS2
        BUS2 --&gt; RC
    end

    RESULT["Each team ships when ready.&lt;br/&gt;Cycle time: days not weeks.&lt;br/&gt;Cross-team meetings dropped by half."]:::dim
    AFTER --&gt; RESULT
</code></pre>

<p>They ran an Event Storming workshop — three hours, lots of sticky notes, some heated arguments about what “account” actually meant. They came out with three clearly named contexts, a rough event schema, and the realisation that Reporting didn’t actually need access to the <code class="language-plaintext highlighter-rouge">Account</code> model at all. It just needed events.</p>

<p>Six months later they were shipping daily. The <code class="language-plaintext highlighter-rouge">Account</code> model still existed in its original bloated form in the Accounts context, but it was that team’s problem to clean up on their own timeline. Payments had a lean model. Reporting had a read model built from events. Everyone owned their own thing.</p>

<hr />

<h2 id="you-dont-have-to-do-all-of-this-at-once">You Don’t Have to Do All of This at Once</h2>

<p>The reason DDD gets a reputation for being heavyweight is that people treat it as an all-or-nothing proposition. They read Evans’ book, see 560 pages, and either implement everything or none of it. Neither is the right call.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    S1["1. Event Storming&lt;br/&gt;──────────&lt;br/&gt;3 hours, sticky notes,&lt;br/&gt;domain experts in the room.&lt;br/&gt;Surfaces what you don't know."]:::yellow

    S2["2. Shared Glossary&lt;br/&gt;──────────&lt;br/&gt;Pick your 5 most&lt;br/&gt;misused terms.&lt;br/&gt;Write definitions. Get sign-off."]:::blue

    S3["3. Context Map&lt;br/&gt;──────────&lt;br/&gt;Draw the boundaries&lt;br/&gt;that already exist&lt;br/&gt;but aren't documented."]:::mauve

    S4["4. Tactical Patterns&lt;br/&gt;──────────&lt;br/&gt;Value Objects for money.&lt;br/&gt;Domain Events for&lt;br/&gt;cross-team handoffs."]:::teal

    S5["5. Protect Boundaries&lt;br/&gt;──────────&lt;br/&gt;Anti-Corruption Layers&lt;br/&gt;for legacy systems and&lt;br/&gt;third-party APIs."]:::green

    S1 --&gt; S2 --&gt; S3 --&gt; S4 --&gt; S5
</code></pre>

<p>Start with the thing that gives you the most value with the least disruption. For most teams that’s a combination of steps 1 and 2. Run an Event Storming session — just a few hours, domain experts and engineers in the same room, mapping out what actually happens in the business. Then build a glossary for the five terms that cause the most confusion in your standups.</p>

<p>You can do both of those things without touching your architecture. You’ll still get an immediate improvement in how clearly your team communicates. The rest — the context boundaries, the tactical patterns, the event-driven integration — you layer in when it makes sense.</p>

<p>The trap to avoid is introducing DDD vocabulary without the discipline. If half the team calls it an Order and the other half still says Transaction, you haven’t adopted Ubiquitous Language — you’ve just added jargon. Full commitment to the shared vocabulary is the one thing that shouldn’t be done halfway.</p>

<hr />

<h2 id="the-real-unlock">The Real Unlock</h2>

<p>Here’s the thing that doesn’t get said enough about DDD: the primary benefit isn’t better architecture. It’s <strong>faster trust</strong>.</p>

<p>When teams have explicit boundaries, they can trust each other to stay inside them. When events are the handoff mechanism, no team can break another team’s internals. When the language is shared, you stop wasting half a meeting realising you’ve been talking past each other.</p>

<p>That’s what makes collaboration actually work at scale. Not more process. Not more documentation. Structures that make the right thing easy and the wrong thing obvious — so teams can move fast without constantly stepping on each other.</p>

<p>DDD gives you those structures. It took the software world a while to really absorb what Evans was getting at, but the core insight holds up: the way you model your domain determines how well your teams can work together. Get the model right, and a lot of the collaboration friction disappears.</p>

<p>Get it wrong, and no amount of Agile ceremony will save you.</p>

<hr />

<p><em>If you made it this far, your next move is simple: schedule a two-hour Event Storming session with your team. Bring in someone from product, someone from the business side, and your senior engineers. You’ll be surprised how quickly the domain’s real shape reveals itself — and how much everyone disagrees on terms you all thought you shared.</em></p>

<hr />

<p><strong>Tags:</strong> Domain-Driven Design · Software Architecture · Team Collaboration · Engineering Culture · Microservices</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="architecture" /><category term="ddd" /><category term="domain-driven-design" /><category term="software-architecture" /><category term="team-collaboration" /><category term="engineering-culture" /><category term="microservices" /><category term="bounded-context" /><summary type="html"><![CDATA[Three teams. One database column. Three different mental models — until it all broke at 2am. Here's how Domain-Driven Design gives you the structures that make collaboration actually work at scale.]]></summary></entry><entry><title type="html">Building RAG That Actually Works: Lessons from the Trenches</title><link href="https://shanmuga-sundaram-n.github.io/blog/2026/03/21/building-production-rag-pipeline/" rel="alternate" type="text/html" title="Building RAG That Actually Works: Lessons from the Trenches" /><published>2026-03-21T12:00:00+00:00</published><updated>2026-03-21T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2026/03/21/building-production-rag-pipeline</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2026/03/21/building-production-rag-pipeline/"><![CDATA[<p>I’ve read probably a dozen RAG tutorials. They all do the same thing: show you how to embed a handful of PDFs into a vector store, run a similarity search, stuff the results into a prompt, and call it a production pipeline. Then you try to use the same approach on real data — thousands of documents, mixed formats, users with messy natural language queries — and the whole thing falls apart. The answers are vague, wrong, or confidently referencing documents that have nothing to do with the question.</p>

<p>That gap between “works in the tutorial” and “works in production” is what this post is about. I’ve built a few of these pipelines now, and I’ve made almost every mistake there is to make — wrong chunk sizes, no overlap, skipping the re-ranker, shipping without any evaluation. I’m going to walk through the full pipeline — from chunking to evaluation — not as a sanitized tutorial, but as the thing I wish I’d had when I started building this stuff for real.</p>

<p>We’ll use LangChain, ChromaDB, and OpenAI throughout. If you use different tools, the concepts all transfer.</p>

<hr />

<h2 id="the-two-phases-you-need-to-separate-in-your-head">The Two Phases You Need to Separate in Your Head</h2>

<p>Before any code, the most important mental model is that RAG is two completely separate systems that happen to share a vector database.</p>

<p>The <strong>indexing pipeline</strong> is offline. It runs on a schedule or when documents change. It loads your source files, chunks them, converts them to embeddings, and writes them to a vector store. Speed isn’t critical here. Correctness is.</p>

<p>The <strong>query pipeline</strong> is online. It runs on every user request, and it needs to be fast. It embeds the user’s question, retrieves the most relevant chunks, builds a prompt, and calls the LLM.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#89B4FA', 'lineColor': '#A6ADC8', 'secondaryColor': '#24243E', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'clusterBorder': '#45475A', 'titleColor': '#CDD6F4', 'edgeLabelBackground': '#1E1E2E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow  fill:#313244,stroke:#F9E2AF,color:#F9E2AF,rx:4
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA,rx:4
    classDef mauve   fill:#313244,stroke:#CBA6F7,color:#CBA6F7,rx:4
    classDef green   fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1,rx:4
    classDef red     fill:#313244,stroke:#F38BA8,color:#F38BA8,rx:4

    subgraph OFFLINE["① OFFLINE · INDEXING"]
        A["Documents&lt;br/&gt;PDF / HTML / TXT"]:::yellow
        B["Document Loader"]:::blue
        C["Text Splitter&lt;br/&gt;chunk_size=1000, overlap=200"]:::blue
        D["Embedding Model&lt;br/&gt;text-embedding-3-large"]:::mauve
        A --&gt; B --&gt; C --&gt; D
    end

    VS[("Vector Store&lt;br/&gt;ChromaDB")]:::green

    D --&gt; VS

    subgraph ONLINE["② ONLINE · QUERY"]
        F["User Question"]:::yellow
        G["Embed Question"]:::blue
        H["Similarity Search&lt;br/&gt;MMR  k=5, fetch_k=20"]:::blue
        I["Re-ranker&lt;br/&gt;cross-encoder  top-5"]:::red
        J["Prompt Builder → LLM&lt;br/&gt;temperature=0"]:::green
        K(["Answer"]):::green
        F --&gt; G --&gt; H --&gt; I --&gt; J --&gt; K
    end

    VS --&gt;|top-k chunks| H
</code></pre>

<p>Keeping these two phases decoupled is the first thing most tutorials get wrong. If your indexing and querying code are tangled together, you’ll end up in situations where you can’t re-index without restarting your query service, or where a slow re-embed job blocks user requests. Treat them as separate processes from day one.</p>
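<p>One lightweight way to keep the processes separate without letting them drift apart is to put the few facts they must agree on (store location, collection name, embedding model) into a single frozen config that both import. This is a sketch of a convention, not a LangChain API; all the names here are mine:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    """The only coupling between the indexer and the query service.

    If the two processes disagree on the embedding model or dimensions,
    retrieval degrades silently, so freeze the contract in one place.
    """
    persist_directory: str = "./chroma_db"
    collection_name: str = "knowledge_base"
    embedding_model: str = "text-embedding-3-large"
    embedding_dimensions: int = 1536

# index.py and query.py both do: from rag_config import CONFIG
CONFIG = RagConfig()
```

<p>With this in place, switching embedding models becomes an explicit config change that both pipelines pick up together, rather than a mismatch you discover in production.</p>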

<p>Here’s how to get your environment set up:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># requirements.txt equivalent — install these first
# pip install langchain langchain-community langchain-openai langchain-chroma chromadb openai tiktoken pypdf
</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="c1"># Set your API key (in real deployments, export it in your shell or use a secret manager instead of hardcoding)
</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"OPENAI_API_KEY"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"your-openai-api-key"</span>

<span class="c1"># Verify environment
</span><span class="kn">import</span> <span class="nn">openai</span>
<span class="kn">import</span> <span class="nn">chromadb</span>
<span class="kn">import</span> <span class="nn">langchain</span>

<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"LangChain version: </span><span class="si">{</span><span class="n">langchain</span><span class="p">.</span><span class="n">__version__</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ChromaDB version: </span><span class="si">{</span><span class="n">chromadb</span><span class="p">.</span><span class="n">__version__</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Environment ready"</span><span class="p">)</span>
</code></pre></div></div>

<p>Run this and confirm your package versions before going further. Version mismatches between LangChain and ChromaDB have caused me more pain than any bug I’ve written myself.</p>

<hr />

<h2 id="chunking-the-part-i-got-wrong-for-two-weeks">Chunking: The Part I Got Wrong for Two Weeks</h2>

<p>I’ll be direct: chunk size is the single most important decision in this entire pipeline. It took me two weeks of debugging poor retrieval quality before I realized my chunks were the problem, not my embedding model or retrieval code. I had set <code class="language-plaintext highlighter-rouge">chunk_size=2000</code> thinking “more context = better” and ended up with bloated, unfocused chunks that pulled in too much noise along with the relevant content.</p>

<p>The intuition is simple. A chunk is the atomic unit of retrieval. When a user asks a question, the system fetches the N most relevant chunks and hands them to the LLM. If your chunks are too large, each one contains multiple topics and the similarity score gets diluted. Too small, and you end up with fragments that don’t make sense without their surrounding context.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart TD
    classDef peach   fill:#313244,stroke:#FAB387,color:#FAB387
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green   fill:#313244,stroke:#A6E3A1,color:#A6E3A1
    classDef teal    fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef dim     fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    ROOT["Document&lt;br/&gt;Chunking Strategy?"]:::dim

    ROOT --&gt;|"Fixed size"| FC["Fixed Character&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Fast&lt;br/&gt;Quality : Basic&lt;br/&gt;Use : Prototypes only"]:::peach
    ROOT --&gt;|"Recursive split"| RC["Recursive Character  ★ recommended&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Fast&lt;br/&gt;Quality : Very Good&lt;br/&gt;Use : General production"]:::blue
    ROOT --&gt;|"Embedding-based"| SC["Semantic Splitter&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Slow&lt;br/&gt;Quality : Excellent&lt;br/&gt;Use : High-stakes retrieval"]:::green
    ROOT --&gt;|"Structure-aware"| MC["HTML / Markdown&lt;br/&gt;─────────────────&lt;br/&gt;Speed : Fast&lt;br/&gt;Quality : Very Good&lt;br/&gt;Use : Structured docs"]:::teal
</code></pre>
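<p>The dilution problem is easy to see with toy numbers. Here is a pure-Python sketch using made-up three-term count vectors (real embeddings have 1,536 dimensions, but the geometry is the same): a chunk about one topic matches a one-topic query far better than a chunk that crams three topics together.</p>

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy term-count vectors over the vocabulary [attention, database, pricing]
query         = [4, 0, 0]  # the user asks only about attention
focused_chunk = [5, 0, 0]  # small chunk, one topic
bloated_chunk = [5, 4, 4]  # big chunk, three topics mixed together

print(round(cosine(query, focused_chunk), 2))  # 1.0
print(round(cosine(query, bloated_chunk), 2))  # 0.66 (same relevant content, diluted score)
```

<p>The bloated chunk contains exactly the same relevant content as the focused one; the two extra topics alone drag its score down.</p>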

<p>For most use cases, <code class="language-plaintext highlighter-rouge">RecursiveCharacterTextSplitter</code> with <code class="language-plaintext highlighter-rouge">chunk_size=1000</code> and <code class="language-plaintext highlighter-rouge">chunk_overlap=200</code> is a solid starting point. The recursive part matters: it tries to split on paragraph breaks first, then sentence boundaries, then spaces, only falling back to raw character splits as a last resort. This means your chunks are much more likely to contain complete thoughts rather than sentences cut in half.</p>

<p>The overlap is non-negotiable. Without it, a concept that straddles a chunk boundary gets split in two, and whichever half gets retrieved will be missing critical context.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8
    classDef blue   fill:#1A1A2E,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#1A2E1A,stroke:#A6E3A1,color:#A6E3A1
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    subgraph NO["# without overlap"]
        C1["chunk_1&lt;br/&gt;…concept begins"]:::red
        C2["chunk_2&lt;br/&gt;continues…"]:::red
        C3["chunk_3&lt;br/&gt;…conclusion"]:::red
        C1 --&gt; C2 --&gt; C3
        LOST["⚠ context from chunk_1&lt;br/&gt;   is LOST at boundary"]:::dim
        C2 -.-&gt;|boundary gap| LOST
    end

    subgraph YES["# overlap=200"]
        A1["chunk_1&lt;br/&gt;…concept begins"]:::blue
        A2["chunk_1 tail (200 chars)&lt;br/&gt;+ chunk_2 new content"]:::green
        A1 --&gt;|"overlapping tail carried forward"| A2
        OK["✓ context preserved&lt;br/&gt;  across boundary"]:::dim
        A2 -.-&gt; OK
    end
</code></pre>
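<p>The mechanics are easier to see in code than in prose. This naive fixed-window chunker is not what <code class="language-plaintext highlighter-rouge">RecursiveCharacterTextSplitter</code> does (the real splitter also respects paragraph and sentence boundaries), but the overlap arithmetic is identical: each chunk starts <code class="language-plaintext highlighter-rouge">size - overlap</code> characters after the previous one, so the tail of every chunk is repeated at the head of the next.</p>

```python
def fixed_window_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Naive character-window chunker; illustrates overlap mechanics only."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far each new chunk advances
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The attention mechanism weighs every token against every other token."
chunks = fixed_window_chunks(text, size=40, overlap=10)

# the last 10 characters of the first chunk reappear at the start of the second
print(chunks[0][-10:] == chunks[1][:10])  # True
```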

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_community.document_loaders</span> <span class="kn">import</span> <span class="n">PyPDFLoader</span><span class="p">,</span> <span class="n">DirectoryLoader</span>
<span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">RecursiveCharacterTextSplitter</span>
<span class="kn">from</span> <span class="nn">langchain.schema</span> <span class="kn">import</span> <span class="n">Document</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>

<span class="k">def</span> <span class="nf">load_documents</span><span class="p">(</span><span class="n">source_dir</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]:</span>
    <span class="s">"""Load all PDF documents from a directory."""</span>
    <span class="n">loader</span> <span class="o">=</span> <span class="n">DirectoryLoader</span><span class="p">(</span>
        <span class="n">source_dir</span><span class="p">,</span>
        <span class="n">glob</span><span class="o">=</span><span class="s">"**/*.pdf"</span><span class="p">,</span>
        <span class="n">loader_cls</span><span class="o">=</span><span class="n">PyPDFLoader</span><span class="p">,</span>
        <span class="n">show_progress</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>
    <span class="n">documents</span> <span class="o">=</span> <span class="n">loader</span><span class="p">.</span><span class="n">load</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loaded </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span><span class="si">}</span><span class="s"> pages from </span><span class="si">{</span><span class="n">source_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">documents</span>


<span class="k">def</span> <span class="nf">chunk_documents</span><span class="p">(</span>
    <span class="n">documents</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span>
    <span class="n">chunk_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">,</span>
    <span class="n">chunk_overlap</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">200</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]:</span>
    <span class="s">"""
    Split documents into overlapping chunks.

    chunk_overlap=200 ensures continuity — if a concept spans a chunk
    boundary, both chunks will contain enough context to be meaningful.
    """</span>
    <span class="n">splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="p">(</span>
        <span class="n">chunk_size</span><span class="o">=</span><span class="n">chunk_size</span><span class="p">,</span>
        <span class="n">chunk_overlap</span><span class="o">=</span><span class="n">chunk_overlap</span><span class="p">,</span>
        <span class="c1"># Try these separators in order — fall back to the next if needed
</span>        <span class="n">separators</span><span class="o">=</span><span class="p">[</span><span class="s">"</span><span class="se">\n\n</span><span class="s">"</span><span class="p">,</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">". "</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="s">""</span><span class="p">],</span>
        <span class="n">length_function</span><span class="o">=</span><span class="nb">len</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">chunks</span> <span class="o">=</span> <span class="n">splitter</span><span class="p">.</span><span class="n">split_documents</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Split </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span><span class="si">}</span><span class="s"> pages into </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s"> chunks"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Average chunk size: </span><span class="si">{</span><span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">page_content</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">chunks</span><span class="p">)</span> <span class="o">//</span> <span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s"> chars"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">chunks</span>


<span class="c1"># --- Run it ---
</span><span class="n">docs</span> <span class="o">=</span> <span class="n">load_documents</span><span class="p">(</span><span class="s">"./docs"</span><span class="p">)</span>
<span class="n">chunks</span> <span class="o">=</span> <span class="n">chunk_documents</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">chunk_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>

<span class="c1"># Inspect a sample chunk
</span><span class="n">sample</span> <span class="o">=</span> <span class="n">chunks</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">--- Sample Chunk ---"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Content: </span><span class="si">{</span><span class="n">sample</span><span class="p">.</span><span class="n">page_content</span><span class="p">[</span><span class="si">:</span><span class="mi">300</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Metadata: </span><span class="si">{</span><span class="n">sample</span><span class="p">.</span><span class="n">metadata</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>You should see chunk counts and average sizes printed out. I strongly recommend inspecting a handful of sample chunks manually before proceeding. If your chunks are constantly cutting off mid-sentence, reduce the <code class="language-plaintext highlighter-rouge">chunk_size</code> or double-check that your separator list is actually matching your document structure.</p>

<p>One more thing I wish someone had told me: always print the average chunk size after splitting. If it’s dramatically smaller than your target (say, you set 1000 but average is 400), your documents are full of very short paragraphs and you probably need to reconsider your separator strategy.</p>
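<p>A small helper makes that a habit instead of a one-off print. It works on plain strings, so it applies to any splitter's output (pass in <code class="language-plaintext highlighter-rouge">[c.page_content for c in chunks]</code>); the half-target threshold is just my rule of thumb for flagging undersized chunks:</p>

```python
def chunk_stats(texts: list[str], target: int = 1000) -> dict:
    """Summarize chunk lengths so undersized outliers are visible at a glance."""
    lengths = sorted(len(t) for t in texts)
    n = len(lengths)
    return {
        "count": n,
        "min": lengths[0],
        "median": lengths[n // 2],
        "avg": sum(lengths) // n,
        "max": lengths[-1],
        "under_half_target": sum(1 for length in lengths if length < target // 2),
    }

# three fake chunks: two near the 1000-char target, one suspiciously short
print(chunk_stats(["x" * 980, "y" * 400, "z" * 1010]))
# {'count': 3, 'min': 400, 'median': 980, 'avg': 796, 'max': 1010, 'under_half_target': 1}
```

<p>If <code class="language-plaintext highlighter-rouge">under_half_target</code> is a large fraction of the total, that's the short-paragraph symptom described above, and the separator strategy needs another look.</p>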

<hr />

<h2 id="embeddings-and-the-vector-store-boring-but-critical">Embeddings and the Vector Store: Boring but Critical</h2>

<p>This part feels like plumbing, and it kind of is. But bad plumbing causes leaks.</p>

<p>The premise is that semantically similar text produces vectors that are geometrically close in high-dimensional space. “How does attention work?” and “Explain the self-attention mechanism” don’t share many words, but their embeddings will be very close because they mean the same thing. That’s the whole trick.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow  fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef mauve   fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef green   fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef dim     fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    Q["'What is self-attention?'&lt;br/&gt;query text"]:::yellow
    QV["[0.12, -0.87, 0.34, …]&lt;br/&gt;dim=1536"]:::yellow

    C["'Self-attention allows tokens&lt;br/&gt;to attend to all other tokens'&lt;br/&gt;chunk text"]:::blue
    CV["[0.11, -0.89, 0.31, …]&lt;br/&gt;dim=1536"]:::blue

    SIM["cosine_sim() = 0.97"]:::mauve
    MATCH["HIGH MATCH&lt;br/&gt;semantically equivalent"]:::green
    NOTE["# different words&lt;br/&gt;# same meaning&lt;br/&gt;# geometrically close"]:::dim

    Q  --&gt;|"embed()"| QV
    C  --&gt;|"embed()"| CV
    QV --&gt;  SIM
    CV --&gt;  SIM
    SIM --&gt; MATCH
    MATCH -.-&gt; NOTE
</code></pre>
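<p>To demystify what the vector store does with those vectors: conceptually it just scores every stored vector against the query and keeps the best k (real stores use approximate indexes such as HNSW so they never scan everything). A brute-force sketch with toy 3-dimensional vectors and invented chunk names:</p>

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# pretend these came out of the indexing pipeline (real ones have 1,536 dims)
store = {
    "chunk_attention": [0.11, -0.89, 0.31],
    "chunk_pricing":   [0.72,  0.10, -0.55],
    "chunk_database":  [-0.40, 0.33,  0.84],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Brute-force nearest neighbours: score everything, sort, keep the best k."""
    ranked = sorted(store, key=lambda name: cosine(query_vec, store[name]), reverse=True)
    return ranked[:k]

print(top_k([0.12, -0.87, 0.34]))  # ['chunk_attention', 'chunk_database']
```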

<p>I use <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code> for anything where quality matters. It’s more expensive than <code class="language-plaintext highlighter-rouge">text-embedding-3-small</code>, but for a knowledge base application the quality difference is real and the cost per query is still small. For high-volume applications where you’re embedding millions of documents and running thousands of queries a day, <code class="language-plaintext highlighter-rouge">text-embedding-3-small</code> is worth benchmarking.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="kn">from</span> <span class="nn">langchain_chroma</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="kn">import</span> <span class="nn">chromadb</span>

<span class="k">def</span> <span class="nf">build_vector_store</span><span class="p">(</span>
    <span class="n">chunks</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span>
    <span class="n">persist_directory</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./chroma_db"</span><span class="p">,</span>
    <span class="n">collection_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"knowledge_base"</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Chroma</span><span class="p">:</span>
    <span class="s">"""
    Embed all document chunks and store them in ChromaDB.

    Uses text-embedding-3-large for best retrieval quality.
    Loads the existing store instead of re-embedding if one is already present.
    """</span>

    <span class="c1"># Initialize the embedding model
</span>    <span class="c1"># text-embedding-3-large: 3072 native dims (reduced to 1536 below), excellent quality
</span>    <span class="c1"># text-embedding-3-small: cheaper, good for high-volume apps
</span>    <span class="n">embeddings</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="s">"text-embedding-3-large"</span><span class="p">,</span>
        <span class="n">dimensions</span><span class="o">=</span><span class="mi">1536</span>
    <span class="p">)</span>

    <span class="c1"># Check if vector store already exists to avoid re-embedding
</span>    <span class="k">if</span> <span class="n">Path</span><span class="p">(</span><span class="n">persist_directory</span><span class="p">).</span><span class="n">exists</span><span class="p">():</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading existing vector store from </span><span class="si">{</span><span class="n">persist_directory</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">vector_store</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span>
            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection_name</span><span class="p">,</span>
            <span class="n">embedding_function</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
            <span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span>
        <span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Building new vector store with </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span><span class="si">}</span><span class="s"> chunks..."</span><span class="p">)</span>

        <span class="c1"># Chroma.from_documents handles embedding + storing in one call
</span>        <span class="n">vector_store</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">.</span><span class="n">from_documents</span><span class="p">(</span>
            <span class="n">documents</span><span class="o">=</span><span class="n">chunks</span><span class="p">,</span>
            <span class="n">embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
            <span class="n">collection_name</span><span class="o">=</span><span class="n">collection_name</span><span class="p">,</span>
            <span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Vector store built and persisted to </span><span class="si">{</span><span class="n">persist_directory</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Verify the store (_collection is a Chroma-internal handle; fine for a sanity check)
</span>    <span class="n">count</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">_collection</span><span class="p">.</span><span class="n">count</span><span class="p">()</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Vector store contains </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> vectors"</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">vector_store</span>


<span class="c1"># --- Build it ---
</span><span class="n">vector_store</span> <span class="o">=</span> <span class="n">build_vector_store</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span>
</code></pre></div></div>

<p>In production, documents change. People update wikis, replace PDFs, add new reports. You don’t want to re-embed your entire corpus every time a single file changes. The incremental indexing pattern below handles this with a simple content hash:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="k">def</span> <span class="nf">get_document_hash</span><span class="p">(</span><span class="n">doc</span><span class="p">:</span> <span class="n">Document</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Generate a stable hash for a document chunk based on its content."""</span>
    <span class="k">return</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">md5</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">upsert_documents</span><span class="p">(</span>
    <span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span>
    <span class="n">new_chunks</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""
    Add only new/changed documents to an existing vector store.
    Avoids re-embedding documents that haven't changed.
    """</span>
    <span class="c1"># Fetch all existing document IDs from the store
</span>    <span class="n">existing</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">_collection</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">include</span><span class="o">=</span><span class="p">[</span><span class="s">"metadatas"</span><span class="p">])</span>
    <span class="n">existing_hashes</span> <span class="o">=</span> <span class="p">{</span>
        <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"content_hash"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">meta</span> <span class="ow">in</span> <span class="n">existing</span><span class="p">[</span><span class="s">"metadatas"</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">meta</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"content_hash"</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="c1"># Filter to only chunks we haven't seen before
</span>    <span class="n">new_docs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">new_chunks</span><span class="p">:</span>
        <span class="n">content_hash</span> <span class="o">=</span> <span class="n">get_document_hash</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">content_hash</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">existing_hashes</span><span class="p">:</span>
            <span class="c1"># Stamp the chunk with its hash for future deduplication
</span>            <span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"content_hash"</span><span class="p">]</span> <span class="o">=</span> <span class="n">content_hash</span>
            <span class="n">chunk</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"indexed_at"</span><span class="p">]</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
            <span class="n">new_docs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">new_docs</span><span class="p">:</span>
        <span class="n">vector_store</span><span class="p">.</span><span class="n">add_documents</span><span class="p">(</span><span class="n">new_docs</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Added </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">new_docs</span><span class="p">)</span><span class="si">}</span><span class="s"> new chunks to the vector store"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"No new documents to index — everything is up to date"</span><span class="p">)</span>

    <span class="k">return</span> <span class="p">{</span><span class="s">"added"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">new_docs</span><span class="p">),</span> <span class="s">"skipped"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">new_chunks</span><span class="p">)</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">new_docs</span><span class="p">)}</span>
</code></pre></div></div>

<p>The first time you run this on a large corpus, embedding takes a while. Budget accordingly. On 10,000 chunks with <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code>, expect roughly 2-3 minutes and a few dollars in API costs.</p>
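<p>A back-of-the-envelope estimate before kicking off a bulk run can save a surprise bill. This sketch uses the rough 4-characters-per-token heuristic (tokenize with <code class="language-plaintext highlighter-rouge">tiktoken</code> if you need an exact count), and the default price is an assumption; verify against your provider's current pricing:</p>

```python
def estimate_embedding_cost(texts, price_per_million_tokens=0.13, chars_per_token=4):
    """Rough pre-flight estimate for a bulk embedding run.

    Uses the ~4-characters-per-token heuristic; tokenize with tiktoken
    if you need an exact count. The default price is an assumption --
    verify against your provider's current pricing page.
    """
    total_chars = sum(len(t) for t in texts)
    est_tokens = total_chars // chars_per_token
    cost = est_tokens / 1_000_000 * price_per_million_tokens
    return {"estimated_tokens": est_tokens, "estimated_cost_usd": round(cost, 4)}


# 10,000 chunks of ~1,000 characters each
print(estimate_embedding_cost(["x" * 1_000] * 10_000))
# → {'estimated_tokens': 2500000, 'estimated_cost_usd': 0.325}
```

<p>The estimate is deliberately coarse; its job is to catch order-of-magnitude mistakes (embedding a million chunks when you meant ten thousand) before the API bill does.</p>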

<hr />

<h2 id="retrieval-where-most-pipelines-silently-die">Retrieval: Where Most Pipelines Silently Die</h2>

<p>Here’s what I mean by “silently.” A naive similarity search will almost always return <em>something</em>. The chunks it returns will usually be topically related to the query. The LLM will usually produce a fluent, confident answer. The problem is that the answer might be incomplete, subtly wrong, or built on the third-best chunk rather than the most relevant one. You’ll never know unless you’re logging and measuring.</p>

<p>The two main failure modes I’ve seen:</p>

<p><strong>Redundant retrieval.</strong> You ask for <code class="language-plaintext highlighter-rouge">k=5</code> chunks and get back 5 chunks that all say essentially the same thing. You’ve used your entire context window on one perspective of the topic and left out everything else.</p>

<p><strong>Unfocused retrieval in multi-domain knowledge bases.</strong> If your vector store has documents from HR, engineering, finance, and legal all mixed together, a query about “approval process” might retrieve chunks from three different departments when the user only cared about one.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart LR
    classDef yellow  fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef blue    fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef teal    fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef red     fill:#313244,stroke:#F38BA8,color:#F38BA8
    classDef green   fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef mauve   fill:#313244,stroke:#CBA6F7,color:#CBA6F7

    UQ["User&lt;br/&gt;Query"]:::yellow
    EMB["Embedder&lt;br/&gt;embed()"]:::blue
    VS["Vector Store&lt;br/&gt;similarity_search()"]:::teal
    RR["Re-ranker&lt;br/&gt;CrossEncoder&lt;br/&gt;top-20 → top-5"]:::red
    PB["Prompt&lt;br/&gt;Builder"]:::green
    LLM["LLM&lt;br/&gt;generate()"]:::mauve
    ANS(["Answer"]):::green

    UQ --&gt;|"raw question"| EMB
    EMB --&gt;|"query vector"| VS
    VS --&gt;|"top-20 candidates"| RR
    RR --&gt;|"re-ranked top-5"| PB
    PB --&gt;|"augmented prompt"| LLM
    LLM --&gt; ANS
</code></pre>

<p>MMR (Maximal Marginal Relevance) solves the redundancy problem. Instead of returning the top-K most similar chunks, it returns the top-K that are both relevant to the query <em>and</em> maximally different from each other. I should have been using this from the start.</p>
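<p>Under the hood, MMR is a greedy loop: at each step it picks the candidate that maximizes <code class="language-plaintext highlighter-rouge">lambda_mult * sim(query, d) - (1 - lambda_mult) * max_sim(d, already_selected)</code>. A minimal toy sketch (not LangChain's implementation) to make the trade-off concrete:</p>

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_select(query_vec, candidate_vecs, k=5, lambda_mult=0.7):
    """Greedy MMR: reward similarity to the query, penalize similarity
    to chunks already picked. Returns indices into candidate_vecs."""
    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(selected) != k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # How close is this candidate to anything we already chose?
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

<p>With <code class="language-plaintext highlighter-rouge">lambda_mult=1.0</code> this degenerates to plain top-k similarity; lowering it pushes near-duplicate chunks out of the result set in favor of chunks that add new information.</p>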

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">build_retriever</span><span class="p">(</span><span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span> <span class="n">strategy</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"mmr"</span><span class="p">):</span>
    <span class="s">"""
    Build a retriever with different strategies:
    - 'similarity': pure cosine similarity (fast, simple)
    - 'mmr': Maximal Marginal Relevance (diverse results, reduces redundancy)
    - 'filtered': similarity with metadata filtering
    """</span>

    <span class="k">if</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"similarity"</span><span class="p">:</span>
        <span class="c1"># Basic similarity search — good for small, focused knowledge bases
</span>        <span class="n">retriever</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(</span>
            <span class="n">search_type</span><span class="o">=</span><span class="s">"similarity"</span><span class="p">,</span>
            <span class="n">search_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"k"</span><span class="p">:</span> <span class="mi">5</span><span class="p">}</span>
        <span class="p">)</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"mmr"</span><span class="p">:</span>
        <span class="c1"># MMR balances relevance AND diversity
</span>        <span class="c1"># fetch_k=20: fetch 20 candidates, then select 5 maximally diverse ones
</span>        <span class="n">retriever</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(</span>
            <span class="n">search_type</span><span class="o">=</span><span class="s">"mmr"</span><span class="p">,</span>
            <span class="n">search_kwargs</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"k"</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>           <span class="c1"># final number of results
</span>                <span class="s">"fetch_k"</span><span class="p">:</span> <span class="mi">20</span><span class="p">,</span>    <span class="c1"># candidate pool size
</span>                <span class="s">"lambda_mult"</span><span class="p">:</span> <span class="mf">0.7</span>  <span class="c1"># 1.0 = pure similarity, 0.0 = pure diversity
</span>            <span class="p">}</span>
        <span class="p">)</span>

    <span class="k">elif</span> <span class="n">strategy</span> <span class="o">==</span> <span class="s">"filtered"</span><span class="p">:</span>
        <span class="c1"># Filter by metadata before similarity search
</span>        <span class="c1"># Useful when documents have tags, dates, categories, etc.
</span>        <span class="n">retriever</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">as_retriever</span><span class="p">(</span>
            <span class="n">search_type</span><span class="o">=</span><span class="s">"similarity"</span><span class="p">,</span>
            <span class="n">search_kwargs</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"k"</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span>
                <span class="s">"filter"</span><span class="p">:</span> <span class="p">{</span><span class="s">"source"</span><span class="p">:</span> <span class="s">"annual_report_2025.pdf"</span><span class="p">}</span>
            <span class="p">}</span>
        <span class="p">)</span>

    <span class="k">return</span> <span class="n">retriever</span>


<span class="k">def</span> <span class="nf">retrieve_with_scores</span><span class="p">(</span><span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span><span class="p">):</span>
    <span class="s">"""
    Retrieve chunks with their similarity scores for debugging/logging.
    """</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">vector_store</span><span class="p">.</span><span class="n">similarity_search_with_score</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Query: '</span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'─'</span> <span class="o">*</span> <span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">doc</span><span class="p">,</span> <span class="n">score</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
        <span class="c1"># ChromaDB returns L2 distance (lower = more similar)
</span>        <span class="c1"># Convert to 0-1 similarity for readability
</span>        <span class="n">similarity</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">score</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Result </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s"> | Similarity: </span><span class="si">{</span><span class="n">similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Source: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'source'</span><span class="p">,</span> <span class="s">'unknown'</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Content: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">[</span><span class="si">:</span><span class="mi">200</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">results</span>


<span class="c1"># --- Test retrieval ---
</span><span class="n">retriever</span> <span class="o">=</span> <span class="n">build_retriever</span><span class="p">(</span><span class="n">vector_store</span><span class="p">,</span> <span class="n">strategy</span><span class="o">=</span><span class="s">"mmr"</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">retrieve_with_scores</span><span class="p">(</span>
    <span class="n">vector_store</span><span class="p">,</span>
    <span class="n">query</span><span class="o">=</span><span class="s">"How does self-attention work in transformers?"</span><span class="p">,</span>
    <span class="n">k</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Run this and look at the similarity scores. If your top result is below 0.75, something is off. Either your chunks are too large, your embedding model is mismatched for your domain, or your documents genuinely don’t contain a good answer to the query.</p>
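<p>That check is worth automating. Here is a small guard (a sketch, using the same <code class="language-plaintext highlighter-rouge">1 / (1 + distance)</code> conversion as above) that flags weak retrievals instead of letting them flow silently into the prompt:</p>

```python
def check_retrieval_quality(results, min_top_similarity=0.75):
    """Flag weak retrievals instead of passing them silently to the LLM.

    `results` is a list of (doc, l2_distance) pairs, as returned by
    similarity_search_with_score; distances are converted with the
    same 1 / (1 + distance) mapping used above.
    """
    if not results:
        return {"ok": False, "reason": "nothing retrieved"}
    top_similarity = 1 / (1 + results[0][1])
    ok = top_similarity >= min_top_similarity
    if not ok:
        # Log it -- this is exactly the signal that "silent" failures hide
        print(
            f"WARNING: top similarity {top_similarity:.3f} is under "
            f"{min_top_similarity}. Check chunk size, embedding model fit, "
            "or whether the corpus actually covers this query."
        )
    return {"ok": ok, "top_similarity": round(top_similarity, 3)}
```

<p>Wire this into your request path and log the results. A week of traffic will tell you which queries your knowledge base genuinely can't answer, which is information no amount of prompt tuning will surface.</p>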

<h3 id="the-re-ranker-i-resisted-for-too-long">The re-ranker I resisted for too long</h3>

<p>Honestly, I avoided adding a cross-encoder re-ranker for months because of the added latency. That was a mistake. The difference in retrieval quality was significant enough that I ended up adding it anyway after watching users get mediocre answers on questions that should have been easy.</p>

<p>The core issue is that embedding models (bi-encoders) work by embedding the query and each document <em>independently</em> and then comparing them. They’re fast but coarse. A cross-encoder, by contrast, takes the query and a document together as a pair and scores them jointly. It’s much more accurate at judging whether a specific document actually answers a specific question.</p>

<p>The trade-off: cross-encoders are slow. You wouldn’t use one to search a million documents. But used as a re-ranker on the top 20 candidates your vector store already retrieved, the latency is acceptable (usually 50-150ms for a batch of 20).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">CrossEncoder</span>

<span class="k">class</span> <span class="nc">CrossEncoderReranker</span><span class="p">:</span>
    <span class="s">"""Re-rank retrieved documents using a cross-encoder model."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"cross-encoder/ms-marco-MiniLM-L-6-v2"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">CrossEncoder</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">rerank</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">documents</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span>
        <span class="n">top_n</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">5</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">]:</span>
        <span class="s">"""
        Score each (query, document) pair and return top_n by score.
        """</span>
        <span class="c1"># Build pairs for the cross-encoder
</span>        <span class="n">pairs</span> <span class="o">=</span> <span class="p">[(</span><span class="n">query</span><span class="p">,</span> <span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">)</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">documents</span><span class="p">]</span>

        <span class="c1"># Cross-encoder scores each pair (query is considered with each doc)
</span>        <span class="n">scores</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">pairs</span><span class="p">)</span>

        <span class="c1"># Sort by score descending, keep top_n
</span>        <span class="n">ranked</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span>
            <span class="nb">zip</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">documents</span><span class="p">),</span>
            <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
            <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span>
        <span class="p">)</span>

        <span class="n">top_docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">ranked</span><span class="p">[:</span><span class="n">top_n</span><span class="p">]]</span>

        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Re-ranked </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">documents</span><span class="p">)</span><span class="si">}</span><span class="s"> → </span><span class="si">{</span><span class="n">top_n</span><span class="si">}</span><span class="s"> documents"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="n">doc</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">ranked</span><span class="p">[:</span><span class="n">top_n</span><span class="p">]):</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Rank </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: score=</span><span class="si">{</span><span class="n">score</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="p">[</span><span class="si">:</span><span class="mi">80</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">top_docs</span>


<span class="c1"># --- Use the re-ranker ---
</span><span class="n">reranker</span> <span class="o">=</span> <span class="n">CrossEncoderReranker</span><span class="p">()</span>
<span class="n">candidate_docs</span> <span class="o">=</span> <span class="n">retriever</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="s">"How does self-attention work?"</span><span class="p">)</span>
<span class="n">reranked_docs</span> <span class="o">=</span> <span class="n">reranker</span><span class="p">.</span><span class="n">rerank</span><span class="p">(</span>
    <span class="n">query</span><span class="o">=</span><span class="s">"How does self-attention work?"</span><span class="p">,</span>
    <span class="n">documents</span><span class="o">=</span><span class="n">candidate_docs</span><span class="p">,</span>
    <span class="n">top_n</span><span class="o">=</span><span class="mi">3</span>
<span class="p">)</span>
</code></pre></div></div>

<p>With the re-ranker in place, your retrieval pipeline now works in two stages: the vector store does a fast, broad sweep to surface 20 candidates, and the cross-encoder does a precise, slow pass to select the best 3-5 from that pool. That’s the combination I’d use as a default for any knowledge base application.</p>
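<p>One way to keep those two stages swappable is to inject each as a callable. This is an illustrative sketch, not code from the pipeline above: the point is that stage 1 could become BM25, or stage 2 a different cross-encoder, without touching anything else:</p>

```python
def two_stage_retrieve(fetch_candidates, rerank, query, top_n=5):
    """Stage 1: cheap, broad recall; stage 2: precise re-ranking.

    Both stages are injected as callables so either can be replaced
    independently of the pipeline itself.
    """
    candidates = fetch_candidates(query)           # fast, recall-oriented
    return rerank(query, candidates, top_n=top_n)  # slow, precision-oriented


# Hypothetical wiring with the retriever and re-ranker defined earlier:
# answer_docs = two_stage_retrieve(
#     retriever.invoke, reranker.rerank,
#     "How does self-attention work?", top_n=3,
# )
```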

<hr />

<h2 id="prompt-construction-dont-waste-good-retrieval-on-a-bad-prompt">Prompt Construction: Don’t Waste Good Retrieval on a Bad Prompt</h2>

<p>You’ve worked hard to get the right chunks. Now you need to actually use them well. This part is simpler than the retrieval work but still matters more than most people give it credit for.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart TD
    classDef teal   fill:#313244,stroke:#94E2D5,color:#94E2D5
    classDef yellow fill:#313244,stroke:#F9E2AF,color:#F9E2AF
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8
    classDef mauve  fill:#313244,stroke:#CBA6F7,color:#CBA6F7
    classDef green  fill:#1A3A1A,stroke:#A6E3A1,color:#A6E3A1
    classDef red    fill:#2E1A1A,stroke:#F38BA8,color:#F38BA8

    RC["retrieved_chunks&lt;br/&gt;from re-ranker"]:::teal
    UQ["user_question"]:::yellow
    SI["system_instructions&lt;br/&gt;role + constraints"]:::dim

    CF["format_context()&lt;br/&gt;add labels · tiktoken budget&lt;br/&gt;max_tokens=6000"]:::teal

    PT["ChatPromptTemplate&lt;br/&gt;system | context | question"]:::yellow

    LLM["ChatOpenAI&lt;br/&gt;temperature=0"]:::mauve

    RES["response&lt;br/&gt;grounded answer"]:::green
    FLAG["# flag if not grounded&lt;br/&gt;# in retrieved context"]:::red

    RC --&gt;|"chunks"| CF
    CF --&gt;|"formatted context"| PT
    UQ --&gt; PT
    SI --&gt; PT
    PT --&gt;|"assembled prompt"| LLM
    LLM --&gt; RES
    LLM -.-&gt;|"hallucination check"| FLAG
</code></pre>

<p>Two things consistently trip people up here.</p>

<p>First, token budgets. If you’re not explicitly counting tokens before building your prompt, you are relying on luck. LangChain won’t error out when you exceed the context window. It’ll silently truncate, and you’ll get answers based on incomplete context. Always count with tiktoken.</p>

<p>Second, the system prompt. Tell the model to cite its sources, tell it to say “I don’t know” when the context doesn’t contain the answer, and tell it explicitly not to draw on outside knowledge. Without these constraints, GPT-4 in particular will happily synthesize an answer from its training data when the retrieved context falls short, and you’ll have no idea it’s doing it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain_openai</span> <span class="kn">import</span> <span class="n">ChatOpenAI</span>
<span class="kn">from</span> <span class="nn">langchain.prompts</span> <span class="kn">import</span> <span class="n">ChatPromptTemplate</span>
<span class="kn">from</span> <span class="nn">langchain.schema.output_parser</span> <span class="kn">import</span> <span class="n">StrOutputParser</span>
<span class="kn">import</span> <span class="nn">tiktoken</span>

<span class="c1"># ── Token budget management ──────────────────────────────────────────
</span><span class="k">def</span> <span class="nf">count_tokens</span><span class="p">(</span><span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"gpt-4o"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">"""Count tokens using tiktoken to avoid exceeding context window."""</span>
    <span class="n">enc</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="p">.</span><span class="n">encoding_for_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">enc</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">format_context</span><span class="p">(</span><span class="n">docs</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Document</span><span class="p">],</span> <span class="n">max_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">6000</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""
    Format retrieved docs into a context string, respecting a token budget.
    Prioritizes earlier (higher-ranked) chunks when truncating.
    """</span>
    <span class="n">context_parts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">total_tokens</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">doc</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">docs</span><span class="p">):</span>
        <span class="n">chunk_text</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"[Source </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'source'</span><span class="p">,</span> <span class="s">'unknown'</span><span class="p">)</span><span class="si">}</span><span class="s">]</span><span class="se">\n</span><span class="si">{</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span><span class="si">}</span><span class="s">"</span>
        <span class="n">chunk_tokens</span> <span class="o">=</span> <span class="n">count_tokens</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">total_tokens</span> <span class="o">+</span> <span class="n">chunk_tokens</span> <span class="o">&gt;</span> <span class="n">max_tokens</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Token budget reached at chunk </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">. Truncating context."</span><span class="p">)</span>
            <span class="k">break</span>

        <span class="n">context_parts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">chunk_text</span><span class="p">)</span>
        <span class="n">total_tokens</span> <span class="o">+=</span> <span class="n">chunk_tokens</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Context: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">context_parts</span><span class="p">)</span><span class="si">}</span><span class="s"> chunks, </span><span class="si">{</span><span class="n">total_tokens</span><span class="si">}</span><span class="s"> tokens"</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">"</span><span class="se">\n\n</span><span class="s">---</span><span class="se">\n\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">context_parts</span><span class="p">)</span>


<span class="c1"># ── Prompt Template ───────────────────────────────────────────────────
</span><span class="n">RAG_PROMPT</span> <span class="o">=</span> <span class="n">ChatPromptTemplate</span><span class="p">.</span><span class="n">from_messages</span><span class="p">([</span>
    <span class="p">(</span><span class="s">"system"</span><span class="p">,</span> <span class="s">"""You are a precise, helpful assistant. Answer the user's question
using ONLY the information provided in the context below.

Rules:
- If the context doesn't contain enough information to answer confidently, say so explicitly.
- Do not make up facts not present in the context.
- Cite the source number (e.g. [Source 1]) when referencing specific information.
- Be concise but complete.

Context:
{context}
"""</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"human"</span><span class="p">,</span> <span class="s">"{question}"</span><span class="p">)</span>
<span class="p">])</span>


<span class="c1"># ── Full RAG Pipeline ─────────────────────────────────────────────────
</span><span class="k">class</span> <span class="nc">RAGPipeline</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">vector_store</span><span class="p">:</span> <span class="n">Chroma</span><span class="p">,</span>
        <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"gpt-4o"</span><span class="p">,</span>
        <span class="n">retrieval_strategy</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"mmr"</span><span class="p">,</span>
        <span class="n">use_reranker</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">retriever</span> <span class="o">=</span> <span class="n">build_retriever</span><span class="p">(</span><span class="n">vector_store</span><span class="p">,</span> <span class="n">strategy</span><span class="o">=</span><span class="n">retrieval_strategy</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">reranker</span> <span class="o">=</span> <span class="n">CrossEncoderReranker</span><span class="p">()</span> <span class="k">if</span> <span class="n">use_reranker</span> <span class="k">else</span> <span class="bp">None</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">=</span> <span class="n">ChatOpenAI</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model_name</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">prompt</span> <span class="o">=</span> <span class="n">RAG_PROMPT</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">question</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Execute the full RAG pipeline and return answer + sources."""</span>

        <span class="c1"># Step 1: Retrieve candidate chunks
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[1/4] Retrieving candidates for: '</span><span class="si">{</span><span class="n">question</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
        <span class="n">candidates</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">retriever</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">question</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"      Retrieved </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">candidates</span><span class="p">)</span><span class="si">}</span><span class="s"> candidates"</span><span class="p">)</span>

        <span class="c1"># Step 2: Re-rank (optional)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">reranker</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[2/4] Re-ranking candidates..."</span><span class="p">)</span>
            <span class="n">docs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">reranker</span><span class="p">.</span><span class="n">rerank</span><span class="p">(</span><span class="n">question</span><span class="p">,</span> <span class="n">candidates</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">docs</span> <span class="o">=</span> <span class="n">candidates</span><span class="p">[:</span><span class="mi">4</span><span class="p">]</span>

        <span class="c1"># Step 3: Format context with token budget
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[3/4] Formatting context..."</span><span class="p">)</span>
        <span class="n">context</span> <span class="o">=</span> <span class="n">format_context</span><span class="p">(</span><span class="n">docs</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">6000</span><span class="p">)</span>

        <span class="c1"># Step 4: Generate response
</span>        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[4/4] Generating response..."</span><span class="p">)</span>
        <span class="n">chain</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">prompt</span> <span class="o">|</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">|</span> <span class="n">StrOutputParser</span><span class="p">()</span>
        <span class="n">answer</span> <span class="o">=</span> <span class="n">chain</span><span class="p">.</span><span class="n">invoke</span><span class="p">({</span><span class="s">"context"</span><span class="p">:</span> <span class="n">context</span><span class="p">,</span> <span class="s">"question"</span><span class="p">:</span> <span class="n">question</span><span class="p">})</span>

        <span class="c1"># Collect source metadata
</span>        <span class="n">sources</span> <span class="o">=</span> <span class="nb">list</span><span class="p">({</span><span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"source"</span><span class="p">,</span> <span class="s">"unknown"</span><span class="p">)</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">})</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"question"</span><span class="p">:</span> <span class="n">question</span><span class="p">,</span>
            <span class="s">"answer"</span><span class="p">:</span> <span class="n">answer</span><span class="p">,</span>
            <span class="s">"sources"</span><span class="p">:</span> <span class="n">sources</span><span class="p">,</span>
            <span class="s">"chunks_used"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">docs</span><span class="p">)</span>
        <span class="p">}</span>


<span class="c1"># ── Run it end-to-end ─────────────────────────────────────────────────
</span><span class="n">pipeline</span> <span class="o">=</span> <span class="n">RAGPipeline</span><span class="p">(</span>
    <span class="n">vector_store</span><span class="o">=</span><span class="n">vector_store</span><span class="p">,</span>
    <span class="n">model_name</span><span class="o">=</span><span class="s">"gpt-4o"</span><span class="p">,</span>
    <span class="n">retrieval_strategy</span><span class="o">=</span><span class="s">"mmr"</span><span class="p">,</span>
    <span class="n">use_reranker</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="s">"What is the difference between self-attention and cross-attention?"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"═"</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Question: </span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'question'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Answer:</span><span class="se">\n</span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'answer'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Sources: </span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'sources'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Chunks used: </span><span class="si">{</span><span class="n">result</span><span class="p">[</span><span class="s">'chunks_used'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>When you run this, you’ll see the pipeline logging each stage (retrieval, re-ranking, context formatting, generation) along with the final answer and the source documents it drew from. That logging isn’t cosmetic; it’s how you debug when something goes wrong.</p>

<p>Also: <code class="language-plaintext highlighter-rouge">temperature=0</code>. I cannot stress this enough. RAG applications are not creative writing tasks. You want the model to be deterministic and faithful to its context. Set it to zero and leave it there.</p>

<hr />

<h2 id="you-cant-improve-what-you-dont-measure">You Can’t Improve What You Don’t Measure</h2>

<p>Here’s a question most RAG tutorials skip entirely: how do you know your pipeline is actually good?</p>

<p>“It seems to answer things correctly” is not an answer. I’ve seen pipelines that produce fluent, confident, well-structured responses that are subtly wrong 30% of the time. You won’t catch that without systematic evaluation.</p>

<pre><code class="language-mermaid">%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#313244', 'primaryTextColor': '#CDD6F4', 'primaryBorderColor': '#45475A', 'lineColor': '#A6ADC8', 'background': '#1E1E2E', 'mainBkg': '#313244', 'clusterBkg': '#24243E', 'fontFamily': 'JetBrains Mono, monospace', 'fontSize': '13px'}}}%%
flowchart TD
    classDef root   fill:#45475A,stroke:#CDD6F4,color:#CDD6F4
    classDef blue   fill:#313244,stroke:#89B4FA,color:#89B4FA
    classDef green  fill:#313244,stroke:#A6E3A1,color:#A6E3A1
    classDef dim    fill:#1E1E2E,stroke:#45475A,color:#A6ADC8

    ROOT["evaluate(dataset)&lt;br/&gt;RAGAS"]:::root

    subgraph RQ["Retrieval Quality"]
        CR["context_recall&lt;br/&gt;# all relevant docs retrieved?"]:::blue
        CP["context_precision&lt;br/&gt;# retrieved docs are relevant?"]:::blue
        MR["MRR / NDCG&lt;br/&gt;# best docs ranked highest?"]:::blue
    end

    subgraph GQ["Generation Quality"]
        FA["faithfulness&lt;br/&gt;# answer grounded in context?&lt;br/&gt;target &gt; 0.85"]:::green
        AR["answer_relevancy&lt;br/&gt;# answer addresses question?"]:::green
        HR["hallucination_rate&lt;br/&gt;# facts not in context?"]:::green
    end

    NOTE["# scores: 0.0 → 1.0&lt;br/&gt;# run after every pipeline change"]:::dim

    ROOT --&gt; CR &amp; CP &amp; MR
    ROOT --&gt; FA &amp; AR &amp; HR
    FA &amp; AR &amp; HR -.-&gt; NOTE
</code></pre>

<p>RAGAS is the library I use for this. It evaluates four things that matter:</p>

<p><strong>Faithfulness</strong> measures whether the answer is actually grounded in the retrieved context. This is your hallucination detector. A low score here means your LLM is drawing on its parametric memory instead of your documents.</p>

<p><strong>Answer relevancy</strong> measures whether the answer actually addresses the question. You can be faithful to the context while still giving a technically correct but non-responsive answer.</p>

<p><strong>Context recall</strong> and <strong>context precision</strong> measure the retrieval layer specifically: did you get the right documents, and only the right documents?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># pip install ragas datasets
</span><span class="kn">from</span> <span class="nn">ragas</span> <span class="kn">import</span> <span class="n">evaluate</span>
<span class="kn">from</span> <span class="nn">ragas.metrics</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">faithfulness</span><span class="p">,</span>        <span class="c1"># Is the answer grounded in the retrieved context?
</span>    <span class="n">answer_relevancy</span><span class="p">,</span>   <span class="c1"># Does the answer actually address the question?
</span>    <span class="n">context_recall</span><span class="p">,</span>     <span class="c1"># Were the relevant documents retrieved?
</span>    <span class="n">context_precision</span><span class="p">,</span>  <span class="c1"># Are the retrieved documents actually relevant?
</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">Dataset</span>

<span class="k">def</span> <span class="nf">evaluate_rag_pipeline</span><span class="p">(</span><span class="n">pipeline</span><span class="p">:</span> <span class="n">RAGPipeline</span><span class="p">,</span> <span class="n">test_cases</span><span class="p">:</span> <span class="nb">list</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""
    Evaluate the RAG pipeline on a set of test cases using RAGAS metrics.

    test_cases format:
    [
        {
            "question": "...",
            "ground_truth": "...",  # Expected answer
        },
        ...
    ]
    """</span>

    <span class="n">questions</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">answers</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">contexts</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">ground_truths</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Running evaluation on </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test_cases</span><span class="p">)</span><span class="si">}</span><span class="s"> test cases..."</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">test</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">test_cases</span><span class="p">):</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Test </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test_cases</span><span class="p">)</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">test</span><span class="p">[</span><span class="s">'question'</span><span class="p">][</span><span class="si">:</span><span class="mi">60</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>

        <span class="n">result</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>

        <span class="c1"># Also retrieve raw context for RAGAS
</span>        <span class="n">raw_docs</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">retriever</span><span class="p">.</span><span class="n">invoke</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>
        <span class="n">context_texts</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="n">page_content</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">raw_docs</span><span class="p">[:</span><span class="mi">4</span><span class="p">]]</span>

        <span class="n">questions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"question"</span><span class="p">])</span>
        <span class="n">answers</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s">"answer"</span><span class="p">])</span>
        <span class="n">contexts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">context_texts</span><span class="p">)</span>
        <span class="n">ground_truths</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">test</span><span class="p">[</span><span class="s">"ground_truth"</span><span class="p">])</span>

    <span class="c1"># Build dataset for RAGAS
</span>    <span class="n">eval_dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_dict</span><span class="p">({</span>
        <span class="s">"question"</span><span class="p">:</span> <span class="n">questions</span><span class="p">,</span>
        <span class="s">"answer"</span><span class="p">:</span> <span class="n">answers</span><span class="p">,</span>
        <span class="s">"contexts"</span><span class="p">:</span> <span class="n">contexts</span><span class="p">,</span>
        <span class="s">"ground_truth"</span><span class="p">:</span> <span class="n">ground_truths</span>
    <span class="p">})</span>

    <span class="c1"># Run RAGAS evaluation
</span>    <span class="n">scores</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span>
        <span class="n">eval_dataset</span><span class="p">,</span>
        <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="n">faithfulness</span><span class="p">,</span> <span class="n">answer_relevancy</span><span class="p">,</span> <span class="n">context_recall</span><span class="p">,</span> <span class="n">context_precision</span><span class="p">]</span>
    <span class="p">)</span>

    <span class="k">return</span> <span class="n">scores</span>


<span class="c1"># --- Define test cases ---
</span><span class="n">test_cases</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"question"</span><span class="p">:</span> <span class="s">"What is the role of positional encoding in transformers?"</span><span class="p">,</span>
        <span class="s">"ground_truth"</span><span class="p">:</span> <span class="s">"Positional encoding adds information about the position of tokens in a sequence, since the transformer architecture itself has no inherent notion of order."</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"question"</span><span class="p">:</span> <span class="s">"How does multi-head attention differ from single-head attention?"</span><span class="p">,</span>
        <span class="s">"ground_truth"</span><span class="p">:</span> <span class="s">"Multi-head attention runs self-attention multiple times in parallel with different learned projections, allowing the model to attend to information from different representation subspaces."</span>
    <span class="p">}</span>
<span class="p">]</span>

<span class="n">scores</span> <span class="o">=</span> <span class="n">evaluate_rag_pipeline</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">test_cases</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Evaluation Results:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Faithfulness:       </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'faithfulness'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Answer Relevancy:   </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'answer_relevancy'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Context Recall:     </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'context_recall'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Context Precision:  </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'context_precision'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<p>You’ll get four scores between 0 and 1. Faithfulness below 0.85 tells you your LLM is going off-script. Context precision below 0.80 tells you your retrieval is pulling in irrelevant chunks. Start with a test set of 20-30 questions and treat these numbers as your baseline before you change anything else.</p>
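<p>If you want that baseline to be enforceable rather than aspirational, the scores can feed a simple pass/fail gate. A minimal sketch (the function, thresholds, and dict shape here are my own illustration, not part of the RAGAS API):</p>

```python
# Hypothetical quality gate: names and thresholds are illustrative,
# not part of the RAGAS API.
THRESHOLDS = {
    "faithfulness": 0.85,       # below this, the LLM is going off-script
    "context_precision": 0.80,  # below this, retrieval pulls irrelevant chunks
    "answer_relevancy": 0.80,
    "context_recall": 0.80,
}

def check_baseline(scores: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return human-readable failures; an empty list means all metrics pass."""
    failures = []
    for metric, floor in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < floor:
            failures.append(f"{metric}: {value:.3f} < {floor:.2f}")
    return failures

failures = check_baseline({"faithfulness": 0.91, "context_precision": 0.72,
                           "answer_relevancy": 0.88, "context_recall": 0.83})
print(failures)  # ['context_precision: 0.720 < 0.80']
```

<p>Run this in CI against your fixed test set and a regression in any metric fails the build instead of slipping into production.</p>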

<hr />

<h2 id="what-id-tell-myself-before-starting">What I’d Tell Myself Before Starting</h2>

<p>A few things I’ve distilled from all of this:</p>

<p><strong>Chunk size is where you should spend your debugging time first.</strong> Before you blame your embedding model or experiment with exotic retrieval strategies, print out some sample chunks and ask yourself whether they contain coherent, complete thoughts. This is unglamorous work but it pays off faster than anything else.</p>
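<p>That inspection step doesn't need tooling. A throwaway helper like this (entirely my own sketch, with crude heuristics you'd tune per corpus) surfaces the most common problems:</p>

```python
def inspect_chunks(chunks: list[str], sample: int = 5) -> list[str]:
    """Print a few chunks with basic stats and return warnings for suspect ones."""
    warnings = []
    for i, chunk in enumerate(chunks[:sample]):
        text = chunk.strip()
        print(f"--- chunk {i} ({len(text)} chars) ---")
        print(text[:200])
        # Heuristic: a chunk that doesn't end in sentence punctuation usually
        # means the splitter cut through the middle of a coherent thought.
        if text and text[-1] not in ".!?\"'":
            warnings.append(f"chunk {i} ends mid-sentence: ...{text[-40:]!r}")
        if len(text) < 50:
            warnings.append(f"chunk {i} is only {len(text)} chars")
    return warnings

warnings = inspect_chunks(["Attention weighs token pairs.", "and then the encoder"])
```

<p>Five minutes of reading these printouts tells you more about retrieval quality than an hour of staring at similarity scores.</p>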

<p><strong>The indexing and query pipelines are separate systems.</strong> Build them that way from the start. Your future self will thank you when you need to re-index 50,000 documents at 2am without touching the query service.</p>

<p><strong>Use MMR.</strong> Pure cosine similarity is fine for demos. In production with a real knowledge base, you’ll get redundant results constantly. MMR takes an extra parameter and fixes this. There’s no good reason not to use it.</p>
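<p>If MMR feels like a black box, the core idea fits in a few lines of plain Python. This is an illustrative reimplementation, not LangChain's internals: each pick trades relevance to the query against similarity to documents already selected.</p>

```python
def mmr_select(query_sims, doc_sims, k=3, lam=0.5):
    """Pick k docs maximizing: lam * sim(query, d) - (1 - lam) * max sim(d, selected).

    query_sims: list where query_sims[i] = similarity(query, doc_i)
    doc_sims:   matrix where doc_sims[i][j] = similarity(doc_i, doc_j)
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates similar to anything we've already picked.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; both are highly relevant to the query.
query_sims = [0.9, 0.88, 0.7]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
print(mmr_select(query_sims, doc_sims, k=2))  # [0, 2]: skips the near-duplicate
```

<p>Pure similarity would have returned docs 0 and 1, wasting a context slot on redundant information. That's the whole argument for MMR in one toy example.</p>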

<p>On re-ranking: I resisted it for too long. The latency cost (50-150ms) is real, but for a knowledge base application where users expect accurate answers, it’s worth it. If you’re building something latency-sensitive, benchmark both and decide with data rather than intuition.</p>
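<p>"Benchmark both and decide with data" takes only a small harness. A rough sketch (the helper name is mine; a real comparison should also score answer quality, not just latency):</p>

```python
import time

def time_pipeline(run_fn, questions, label):
    """Time run_fn over a list of questions and report median latency in ms."""
    latencies = []
    for q in questions:
        start = time.perf_counter()
        run_fn(q)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    median = latencies[len(latencies) // 2]
    print(f"{label}: median {median:.1f} ms over {len(latencies)} queries")
    return median

# Usage sketch: compare the same pipeline with and without the re-ranker.
#   with_rr    = RAGPipeline(vector_store, use_reranker=True)
#   without_rr = RAGPipeline(vector_store, use_reranker=False)
#   time_pipeline(lambda q: with_rr.run(q), test_questions, "with re-ranker")
#   time_pipeline(lambda q: without_rr.run(q), test_questions, "without re-ranker")
```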

<p>The most underrated practice here is writing test cases before you optimize anything. It’s easy to spend a day tuning chunk sizes or trying different embedding models and convince yourself things are getting better based on a handful of manual tests. RAGAS scores on a fixed test set give you a real signal. Without them, you’re just guessing.</p>

<p>I’m still not 100% sure semantic chunking is worth the overhead for most applications. It produces better-quality chunks in theory, and in isolated tests I’ve seen it improve context precision by a few points. But it’s significantly slower at indexing time and adds a dependency on another embedding pass. For now, I default to recursive character splitting and only reach for semantic chunking when I have evidence that chunk quality is the bottleneck.</p>
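<p>For reference, the recursive strategy is conceptually simple. Here is a toy version in plain Python; LangChain's <code class="language-plaintext highlighter-rouge">RecursiveCharacterTextSplitter</code> adds chunk overlap and small-piece merging that this sketch deliberately omits:</p>

```python
def recursive_split(text, max_chars=500, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing to finer separators
    only for pieces that are still over the size limit."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_chars:
            if piece.strip():
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_chars, rest))
    return chunks

paras = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 600
print([len(c) for c in recursive_split(paras, max_chars=500)])  # [16, 17, 500, 100]
```

<p>The point of the separator hierarchy is that paragraph boundaries are respected whenever possible, and the ugly character-level cut only happens when nothing else fits.</p>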

<hr />

<h2 id="where-this-goes-next">Where This Goes Next</h2>

<p>The honest answer is that RAG is still an unsolved problem. The pipeline in this post works well, but I’ve been watching a few developments closely.</p>

<p>Query rewriting is probably the next thing I add to this stack — having the LLM rephrase the user’s raw question before retrieval catches a lot of cases where users ask things in a way that’s natural for a human but terrible for semantic search. Hybrid search (combining BM25 keyword matching with vector similarity) is also on my list, especially for domains where exact terminology matters.</p>
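<p>Query rewriting slots in as one extra step before retrieval. A sketch of the shape (the wiring and dummy functions here are mine; in practice <code class="language-plaintext highlighter-rouge">rewrite_fn</code> would be a small LLM chain prompted to turn the raw question into a concise search query):</p>

```python
def retrieve_with_rewrite(question, rewrite_fn, retrieve_fn):
    """Rewrite the raw user question into a search-friendly query, then retrieve.

    rewrite_fn:  callable(str) -> str, e.g. an LLM chain prompted to
                 rephrase conversational questions as search queries
    retrieve_fn: callable(str) -> list of documents
    """
    query = rewrite_fn(question).strip() or question  # fall back to the original
    return query, retrieve_fn(query)

# With dummy callables, just to show the flow:
query, docs = retrieve_with_rewrite(
    "hey so how does that attention thing know word order??",
    rewrite_fn=lambda q: "role of positional encoding in transformer attention",
    retrieve_fn=lambda q: [f"doc matching: {q}"],
)
print(query)
```

<p>Keeping the rewriter behind a plain callable also means you can A/B it against the identity function on your RAGAS test set before committing to the extra LLM call.</p>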

<p>What I keep coming back to, though, is something less exciting: better document preparation. The more time I’ve spent on these systems, the more I believe that the quality of your source documents matters more than almost any retrieval optimization you can apply downstream. How they’re structured, how consistently they’re formatted, how much noise is stripped out before they hit the chunker: all of that compounds through every stage of the pipeline. Garbage in, garbage out, regardless of how clever your pipeline is.</p>
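<p>Concretely, "stripping noise before the chunker" can start very small. The heuristics below are examples you'd tune per corpus, not a universal cleaner:</p>

```python
import re

# Example boilerplate patterns; extend per corpus.
BOILERPLATE = re.compile(
    r"^(copyright|all rights reserved|cookie policy|subscribe to our)", re.IGNORECASE
)

def clean_document(text: str) -> str:
    """Drop obvious boilerplate lines and collapse runaway whitespace
    before the text ever reaches the chunker."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or BOILERPLATE.match(line):
            continue
        kept.append(line)
    # Collapse internal whitespace runs left behind by HTML extraction.
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in kept)

raw = ("Attention   is all you need.\n\nCookie Policy\n"
       "Copyright 2026 Example Corp\nSelf-attention relates tokens.")
print(clean_document(raw))
```

<p>Every line of nav residue or legal footer you remove here is a line that can't pollute a chunk, an embedding, or a context window downstream.</p>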

<p>If you build this and hit walls, the most useful thing you can do is instrument your pipeline end to end and look at where quality is leaking. Usually it’s obvious once you’re looking at actual data rather than gut-checking demo queries.</p>

<hr />]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="engineering" /><category term="rag" /><category term="langchain" /><category term="chromadb" /><category term="openai" /><category term="vector-search" /><category term="llm" /><summary type="html"><![CDATA[A practitioner's guide to building production-grade RAG pipelines — covering chunking strategy, retrieval tuning, re-ranking, and evaluation. Lessons from real failures, not sanitized tutorials.]]></summary></entry><entry><title type="html">Zero Vibe Coding: How I Engineered a Full-Stack Finance App with Claude Code’s Multi-Agent Pipeline</title><link href="https://shanmuga-sundaram-n.github.io/blog/2026/03/20/zero-vibe-coding-finance-tracker/" rel="alternate" type="text/html" title="Zero Vibe Coding: How I Engineered a Full-Stack Finance App with Claude Code’s Multi-Agent Pipeline" /><published>2026-03-20T12:00:00+00:00</published><updated>2026-03-20T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2026/03/20/zero-vibe-coding-finance-tracker</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2026/03/20/zero-vibe-coding-finance-tracker/"><![CDATA[<p>When we talk about AI-assisted coding, the conversation usually revolves around “vibe coding” — throwing vague prompts at an LLM, connecting a UI layout tool, and crossing our fingers that the underlying architecture holds together.</p>

<p>I wanted to see what happens when you treat AI not as a magic wand, but as an embedded enterprise engineering team. To test this, I built a production-ready Personal Finance Tracker entirely from the terminal. I strictly avoided UI Model Context Protocol (MCP) integrations like Figma. Instead, I forced the <strong>Claude Code CLI</strong> into a rigid, Spec-Driven, Multi-Agent Development Environment.</p>

<p>You can check out the final codebase here: <a href="https://github.com/shanmuga-sundaram-n/personal-finance-tracker">Personal Finance Tracker on GitHub</a></p>

<h2 id="the-final-product">The Final Product</h2>

<p>Before diving into <em>how</em> the AI built this, here’s a look at the final application running locally — clean, modern interface, responsive sidebar navigation, and distinct feature pages (Dashboard, Transactions, Budgets) generated without ever touching a visual design tool:</p>

<p><img src="/assets/images/projects/personal-finance-tracker.webp" alt="Personal Finance Tracker UI Demo" /></p>

<hr />

<h2 id="1-unleashing-claude-code-cli-features">1. Unleashing Claude Code CLI Features</h2>

<p>The Claude Code CLI provides native capabilities that go well beyond a standard web chat interface. By leveraging the local filesystem, I turned the CLI into an automated powerhouse:</p>

<ul>
  <li>
    <p><strong>Custom Prompt Hooks (<code class="language-plaintext highlighter-rouge">.claude/hooks/</code>):</strong> Building a modern React app requires strict linting, and an AI can easily violate formatting rules. I created a <code class="language-plaintext highlighter-rouge">.claude/hooks/post-edit-lint.sh</code> script. Every time Claude Code edited a <code class="language-plaintext highlighter-rouge">.tsx</code> or <code class="language-plaintext highlighter-rouge">.ts</code> file, the CLI automatically fired off <code class="language-plaintext highlighter-rouge">$ESLINT_BIN --fix</code>. This completely eliminated broken CI builds due to styling errors.</p>
  </li>
  <li>
    <p><strong>Persistent CLI Memory (<code class="language-plaintext highlighter-rouge">.claude/agent-memory/</code>):</strong> I stored detailed markdown files like <code class="language-plaintext highlighter-rouge">DOMAIN-OWNERSHIP.md</code> and <code class="language-plaintext highlighter-rouge">hexagonal-architecture.md</code>. Whenever I prompted Claude Code, it automatically fetched this local context to write code compliant with my architectural decisions.</p>
  </li>
  <li>
    <p><strong>Context Compaction (<code class="language-plaintext highlighter-rouge">/compact</code>):</strong> Building a complex backend generates massive CLI context. Once Claude finished planning a feature, I simply typed <code class="language-plaintext highlighter-rouge">/compact</code>. This reset the active context window while keeping core memories intact — saving tokens and keeping the CLI fast.</p>
  </li>
  <li>
    <p><strong>Intelligent Commits (<code class="language-plaintext highlighter-rouge">/commit</code>):</strong> After <code class="language-plaintext highlighter-rouge">./gradlew test</code> passed, rather than typing out git messages, I ran <code class="language-plaintext highlighter-rouge">/commit</code>. Claude Code read the actual <code class="language-plaintext highlighter-rouge">git diff</code> and generated a perfect, semantic commit message right in the terminal.</p>
  </li>
</ul>

<h3 id="simulated-terminal-run-claude-code-in-action">Simulated Terminal Run: Claude Code in Action</h3>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>claude
╭────────────────────────────────────────────────────────────────────────╮
│ Claude Code <span class="o">(</span>v0.x.x<span class="o">)</span> - Project: personal-finance-tracker               │
╰────────────────────────────────────────────────────────────────────────╯
▶ I want to add double-entry transfers between accounts. Follow the
  <span class="sb">`</span>Transaction.md<span class="sb">`</span> PRD and use the solution-planner workflow.

<span class="o">(</span>Claude is thinking...<span class="o">)</span>
<span class="o">[</span>Reading files: docs/product/Transaction.md,
 .claude/agent-memory/tech-lead/architecture-decisions.md]

I will act as the solution-planner. Let<span class="s1">'s start the pipeline.
1. Running Domain Analysis against FR-1 (Double-Entry Transfers).
   - The transaction must atomically generate a TRANSFER_OUT and TRANSFER_IN.
2. Running Tech-Lead Architecture Review...
   - I will utilize a Spring `@Transactional` block.

Would you like me to spawn `full-stack-dev` to implement this now? (Y/n) &gt; y
</span></code></pre></div></div>

<hr />

<h2 id="2-coordinated-execution-the-7-agent-pipeline">2. Coordinated Execution: The 7-Agent Pipeline</h2>

<p>Instead of hooking up an external Figma MCP and letting a single AI guess at the UI and database simultaneously, I enforced spec-driven development using a pipeline of 7 distinct Markdown-based personas inside <code class="language-plaintext highlighter-rouge">.claude/agents/</code>.</p>

<p>The most impressive part wasn’t the AI’s coding ability — it was <strong>the strict handoff mechanism</strong> between agents. They never “spoke” in a chaotic group chat. They communicated by writing strict Markdown files (Briefs) to each other, orchestrated entirely by the <code class="language-plaintext highlighter-rouge">solution-planner</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Request
    │
    ▼
solution-planner (Orchestrator)
    │
    ├── 1. personal-finance-analyst  → Domain Brief
    ├── 2. tech-lead                 → Architecture Brief
    ├── 3. full-stack-dev            → Implementation
    ├── 4. ux-ui-designer            → Accessibility Audit
    ├── 5. qa-automation-tester      → Test Gatekeeper
    └── 6. devops-engineer           → CI/CD &amp; Containers
                                            │
                                            ▼
                                    Feature Complete
</code></pre></div></div>

<p>Here’s exactly how a feature flowed through this virtual team:</p>

<p><strong>Step 1 — <code class="language-plaintext highlighter-rouge">solution-planner</code> (The Orchestrator):</strong> The central brain. This agent was explicitly instructed to <em>never</em> write software code. Its only job was to take my feature request, kickstart the pipeline, and gather outputs from the analysts.</p>

<p><strong>Step 2 — <code class="language-plaintext highlighter-rouge">personal-finance-analyst</code> (Domain Brief):</strong> Before pushing code, the planner invoked the analyst. This agent evaluated feature requests against strict PRDs, caught financial edge cases like rounding errors, and generated a <strong>Domain Brief</strong>.</p>

<p><strong>Step 3 — <code class="language-plaintext highlighter-rouge">tech-lead</code> (Architecture Brief):</strong> The architect analyzed business rules and translated them into Hexagonal Architecture and Liquibase schema designs. It generated an <strong>Architecture Brief</strong> outlining exact Java classes, <code class="language-plaintext highlighter-rouge">@Transactional</code> boundaries, and interfaces — catching cross-context import violations before they happened.</p>

<p><strong>Step 4 — <code class="language-plaintext highlighter-rouge">full-stack-dev</code> (Execution):</strong> The developer received the merged Feature Implementation Brief and focused on end-to-end data flow from the PostgreSQL backend to the React Vite frontend — avoiding N+1 queries in JPA and using proper React state management.</p>

<p><strong>Step 5 — <code class="language-plaintext highlighter-rouge">ux-ui-designer</code> (Anti-Vibe Coding Audit):</strong> By explicitly avoiding UI MCPs, I forced Claude to rely on strict UX heuristics. This agent reviewed React code strictly against WCAG 2.1 AA standards and Fitts’s Law spacing (touch targets ≥ 44px).</p>

<p><strong>Step 6 — <code class="language-plaintext highlighter-rouge">qa-automation-tester</code> (The Gatekeeper):</strong> This agent wrote integration tests via JUnit and MockMvc, utilized <code class="language-plaintext highlighter-rouge">axe-core</code> for accessibility enforcement, and refused to complete until <code class="language-plaintext highlighter-rouge">./gradlew test</code> passed in the terminal.</p>

<p><strong>Step 7 — <code class="language-plaintext highlighter-rouge">devops-engineer</code> (Infrastructure):</strong> Configured Testcontainers (<code class="language-plaintext highlighter-rouge">postgres:15.2</code>), wrote multi-stage Dockerfiles with Alpine JRE layers, and managed GitHub Actions caching to slash build times.</p>

<hr />

<h2 id="3-engineering-strict-guardrails">3. Engineering Strict Guardrails</h2>

<p>To ensure the CLI didn’t drift as the conversation grew, I enforced architectural purity with <strong>hard automated constraints</strong> rather than hoping Claude would follow instructions.</p>

<h3 id="guardrail-a-pure-hexagonal-boundaries-via-archunit">Guardrail A: Pure Hexagonal Boundaries via ArchUnit</h3>

<p>The domain layer must be completely unaware of Spring or databases. If Claude generated a <code class="language-plaintext highlighter-rouge">Lombok</code> or <code class="language-plaintext highlighter-rouge">@Autowired</code> import inside the domain logic, the build failed immediately in the terminal:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// ArchitectureTest.java</span>
<span class="nd">@ArchTest</span>
<span class="kd">static</span> <span class="kd">final</span> <span class="nc">ArchRule</span> <span class="n">domain_must_not_import_spring</span> <span class="o">=</span>
        <span class="n">noClasses</span><span class="o">()</span>
                <span class="o">.</span><span class="na">that</span><span class="o">().</span><span class="na">resideInAPackage</span><span class="o">(</span><span class="s">"..domain.."</span><span class="o">)</span>
                <span class="o">.</span><span class="na">should</span><span class="o">().</span><span class="na">dependOnClassesThat</span><span class="o">()</span>
                <span class="o">.</span><span class="na">resideInAnyPackage</span><span class="o">(</span><span class="s">"org.springframework.."</span><span class="o">,</span> <span class="s">"jakarta.persistence.."</span><span class="o">,</span> <span class="s">"lombok.."</span><span class="o">)</span>
                <span class="o">.</span><span class="na">as</span><span class="o">(</span><span class="s">"Domain classes must not depend on Spring, JPA, or Lombok"</span><span class="o">);</span>
</code></pre></div></div>

<h3 id="guardrail-b-clean-wiring-without-framework-pollution">Guardrail B: Clean Wiring Without Framework Pollution</h3>

<p>Claude was forbidden from using <code class="language-plaintext highlighter-rouge">@Service</code> or <code class="language-plaintext highlighter-rouge">@Component</code> on core business logic. Instead, it learned to construct a dedicated configuration class to wire dependencies manually:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// account/config/AccountConfig.java</span>
<span class="nd">@Configuration</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">AccountConfig</span> <span class="o">{</span>
    <span class="nd">@Bean</span>
    <span class="kd">public</span> <span class="nc">AccountCommandService</span> <span class="nf">accountCommandService</span><span class="o">(</span>
            <span class="nc">AccountPersistencePort</span> <span class="n">accountPersistencePort</span><span class="o">,</span>
            <span class="nc">AccountEventPublisherPort</span> <span class="n">accountEventPublisherPort</span><span class="o">)</span> <span class="o">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">AccountCommandService</span><span class="o">(</span><span class="n">accountPersistencePort</span><span class="o">,</span> <span class="n">accountEventPublisherPort</span><span class="o">);</span>
    <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h3 id="guardrail-c-persistent-database-models-vs-domain-models">Guardrail C: Persistent Database Models vs Domain Models</h3>

<p>To avoid attaching <code class="language-plaintext highlighter-rouge">@Entity</code> to a domain object (which allows bad states to bypass business invariants), Claude maintained two separate class structures:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// account/adapter/outbound/persistence/AccountJpaEntity.java</span>
<span class="nd">@Entity</span>
<span class="nd">@Table</span><span class="o">(</span><span class="n">schema</span> <span class="o">=</span> <span class="s">"finance_tracker"</span><span class="o">,</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"accounts"</span><span class="o">)</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">AccountJpaEntity</span> <span class="kd">extends</span> <span class="nc">AuditableJpaEntity</span> <span class="o">{</span>

    <span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"current_balance"</span><span class="o">,</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">false</span><span class="o">,</span> <span class="n">precision</span> <span class="o">=</span> <span class="mi">19</span><span class="o">,</span> <span class="n">scale</span> <span class="o">=</span> <span class="mi">4</span><span class="o">)</span>
    <span class="kd">private</span> <span class="nc">BigDecimal</span> <span class="n">currentBalance</span><span class="o">;</span> <span class="c1">// Rule: Must be exactly NUMERIC(19,4)</span>

    <span class="nd">@Version</span>
    <span class="nd">@Column</span><span class="o">(</span><span class="n">name</span> <span class="o">=</span> <span class="s">"version"</span><span class="o">,</span> <span class="n">nullable</span> <span class="o">=</span> <span class="kc">false</span><span class="o">)</span>
    <span class="kd">private</span> <span class="nc">Long</span> <span class="n">version</span><span class="o">;</span> <span class="c1">// Rule: Required for optimistic locking</span>
<span class="o">}</span>
</code></pre></div></div>

<h3 id="guardrail-d-anti-corruption-layers-bounded-context-isolation">Guardrail D: Anti-Corruption Layers (Bounded Context Isolation)</h3>

<p>If the <code class="language-plaintext highlighter-rouge">budget</code> module needed <code class="language-plaintext highlighter-rouge">transaction</code> data, Claude had to build an explicit cross-context outbound port adapter — it could not directly import across bounded contexts:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// budget/domain/port/outbound/TransactionSummaryPort.java</span>
<span class="kd">public</span> <span class="kd">interface</span> <span class="nc">TransactionSummaryPort</span> <span class="o">{</span>
    <span class="nc">Money</span> <span class="nf">sumExpensesForCategory</span><span class="o">(</span><span class="nc">CategoryId</span> <span class="n">categoryId</span><span class="o">,</span> <span class="nc">LocalDate</span> <span class="n">start</span><span class="o">,</span> <span class="nc">LocalDate</span> <span class="n">end</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>
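<p>A minimal sketch of the adapter side of that boundary (the <code class="language-plaintext highlighter-rouge">TransactionQueryApi</code> and <code class="language-plaintext highlighter-rouge">TransactionSummaryAdapter</code> names, and the value-object shapes, are assumptions for illustration, not taken from the repo). The adapter implements the budget context’s port and translates the foreign result into the budget context’s own <code class="language-plaintext highlighter-rouge">Money</code> type, so no transaction classes leak across the boundary:</p>

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Value objects owned by the budget context (names assumed for illustration)
record CategoryId(long value) {}
record Money(BigDecimal amount) {}

// The outbound port from the article
interface TransactionSummaryPort {
    Money sumExpensesForCategory(CategoryId categoryId, LocalDate start, LocalDate end);
}

// Hypothetical public query API exposed by the transaction context
interface TransactionQueryApi {
    BigDecimal totalExpenses(long categoryId, LocalDate start, LocalDate end);
}

// The anti-corruption layer: implements the budget context's port by delegating
// to the foreign API and translating its result into the budget context's model,
// so budget domain code never imports transaction classes directly.
class TransactionSummaryAdapter implements TransactionSummaryPort {
    private final TransactionQueryApi transactionQueryApi;

    TransactionSummaryAdapter(TransactionQueryApi transactionQueryApi) {
        this.transactionQueryApi = transactionQueryApi;
    }

    @Override
    public Money sumExpensesForCategory(CategoryId categoryId, LocalDate start, LocalDate end) {
        BigDecimal raw = transactionQueryApi.totalExpenses(categoryId.value(), start, end);
        return new Money(raw); // translation step: foreign type -> owned value object
    }
}
```

<p>Swapping the transaction context’s internals now only ever touches this adapter, never the budget domain.</p>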

<hr />

<h2 id="4-feature-spotlight-dynamic-budget-aggregation">4. Feature Spotlight: Dynamic Budget Aggregation</h2>

<p><strong>The Challenge:</strong> Storing a physical <code class="language-plaintext highlighter-rouge">spentAmount</code> column on a Budget table is a classic beginner mistake — if you delete a transaction from 3 weeks ago, the budget’s <code class="language-plaintext highlighter-rouge">spentAmount</code> goes out of sync.</p>

<p><strong>The Solution:</strong> Because Claude Code was forced to follow the PRD, it leveraged Guardrail D to build a <strong>Runtime Synthesis Engine</strong> — dynamically computing usage on the fly rather than storing stale state:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Inside BudgetQueryService (Domain logic)</span>
<span class="nc">Money</span> <span class="n">spentThisPeriod</span> <span class="o">=</span> <span class="n">transactionSummaryPort</span><span class="o">.</span><span class="na">sumExpensesForCategory</span><span class="o">(</span>
    <span class="n">budget</span><span class="o">.</span><span class="na">getCategoryId</span><span class="o">(),</span>
    <span class="n">budget</span><span class="o">.</span><span class="na">getStartDate</span><span class="o">(),</span>
    <span class="n">budget</span><span class="o">.</span><span class="na">getEndDate</span><span class="o">()</span>
<span class="o">);</span>
<span class="n">budget</span><span class="o">.</span><span class="na">evaluateStatus</span><span class="o">(</span><span class="n">spentThisPeriod</span><span class="o">);</span> <span class="c1">// Flags true if spent &gt;= 85% alert threshold</span>
</code></pre></div></div>
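<p>The post doesn’t show <code class="language-plaintext highlighter-rouge">evaluateStatus</code> itself, so here is one plausible shape for the threshold check. Only the 85% alert figure comes from the comment above; the <code class="language-plaintext highlighter-rouge">Money</code> and <code class="language-plaintext highlighter-rouge">Budget</code> shapes are assumptions:</p>

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Assumed shapes for illustration; only the 85% threshold comes from the article.
record Money(BigDecimal amount) {}

class Budget {
    private static final BigDecimal ALERT_THRESHOLD = new BigDecimal("0.85");

    private final Money limit;
    private boolean alertTriggered;

    Budget(Money limit) { this.limit = limit; }

    void evaluateStatus(Money spentThisPeriod) {
        // usage ratio = spent / limit, computed at a fixed scale so divide() cannot throw
        BigDecimal ratio = spentThisPeriod.amount()
                .divide(limit.amount(), 4, RoundingMode.HALF_UP);
        alertTriggered = ratio.compareTo(ALERT_THRESHOLD) >= 0;
    }

    boolean isAlertTriggered() { return alertTriggered; }
}
```

<p>Because the spent amount is synthesised at query time, the alert flag is always derived from live transaction data rather than a stored counter that can drift.</p>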

<h3 id="qa-execution">QA Execution</h3>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>▶ Automatically running QA sequence <span class="k">for </span>new budget logic...

<span class="nv">$ </span>./gradlew <span class="nb">test</span> <span class="nt">--tests</span> <span class="s2">"com.shan.cyber.tech.financetracker.budget.*"</span>

<span class="o">&gt;</span> Task :application:test
com.shan.cyber.tech.financetracker.budget.domain.BudgetQueryServiceTest
  ✔ dynamic_synthesis_calculates_correct_usage_percent<span class="o">()</span>
  ✔ threshold_alert_triggered_when_over_85_percent<span class="o">()</span>

BUILD SUCCESSFUL <span class="k">in </span>3s

<span class="o">[</span>Claude]: All tests passed. ArchUnit hex boundaries are clean.
          Should I <span class="sb">`</span>/commit<span class="sb">`</span> these changes? <span class="o">(</span>Y/n<span class="o">)</span> <span class="o">&gt;</span> y
</code></pre></div></div>

<hr />

<h2 id="final-thoughts-from-chatbot-to-engineering-partner">Final Thoughts: From Chatbot to Engineering Partner</h2>

<p>Building a production-ready application with Claude Code CLI reveals a massive shift in how AI should be utilized for software engineering. By rejecting “vibe coding” and external MCP crutches, and instead relying on Claude Code’s native hooks, persistent memory, and <code class="language-plaintext highlighter-rouge">/compact</code> commands, I was able to enforce a rigid, 7-agent spec-driven environment.</p>

<p>This process didn’t just build an app for me — it taught me how to properly design scalable databases, orchestrate Hexagonal solutions, and respect WCAG UX heuristics. By forcing Claude Code to honour pipeline personas and automated tests, you don’t just get an AI assistant that autocompletes code — you get a senior engineering team embedded directly in your terminal.</p>

<p>Check out the full source: <a href="https://github.com/shanmuga-sundaram-n/personal-finance-tracker">github.com/shanmuga-sundaram-n/personal-finance-tracker</a></p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="engineering" /><category term="claude-code" /><category term="claude-code" /><category term="multi-agent" /><category term="hexagonal-architecture" /><category term="spring-boot" /><category term="react" /><summary type="html"><![CDATA[How I built a production-ready Personal Finance Tracker entirely from the terminal using Claude Code CLI with a strict 7-agent spec-driven pipeline — no UI MCP tools, no vibe coding.]]></summary></entry><entry><title type="html">LLM Tokenizers: The Hidden Engine Behind AI Language Models</title><link href="https://shanmuga-sundaram-n.github.io/blog/2025/03/08/llm-tokenizers-hidden-engine/" rel="alternate" type="text/html" title="LLM Tokenizers: The Hidden Engine Behind AI Language Models" /><published>2025-03-08T12:00:00+00:00</published><updated>2025-03-08T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2025/03/08/llm-tokenizers-hidden-engine</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2025/03/08/llm-tokenizers-hidden-engine/"><![CDATA[<p><img src="/assets/images/blog/llm-tokenizers.jpg" alt="LLM Tokenizers" /></p>

<p>Every interaction with a large language model begins long before any neural network computation. It starts with <strong>tokenization</strong> — the process of converting human language into the numerical representations that models actually process. Understanding tokenizers unlocks a deeper understanding of why LLMs behave the way they do.</p>

<hr />

<h2 id="what-is-tokenization">What is Tokenization?</h2>

<p>Tokenization transforms text into discrete units called <strong>tokens</strong> — numerical representations the model can process. Rather than understanding words directly, models convert input into token IDs corresponding to entries in a vocabulary.</p>

<p>“Hello, world!” might become <code class="language-plaintext highlighter-rouge">[15496, 11, 995, 0]</code> — four numbers that the model processes as a sequence. The model never sees the original text; it only sees tokens.</p>

<hr />

<h2 id="why-tokenization-matters">Why Tokenization Matters</h2>

<p>The choice of tokenizer has profound downstream effects:</p>

<ol>
  <li><strong>Context window capacity</strong> — the maximum text a model can process is measured in tokens, not words or characters</li>
  <li><strong>API costs</strong> — most LLM APIs charge per token; tokenization efficiency directly affects cost</li>
  <li><strong>Cross-lingual performance</strong> — languages with inefficient tokenization get fewer “thoughts” per context window</li>
  <li><strong>Specialised content</strong> — code, mathematical notation, and domain-specific text tokenize very differently than prose</li>
</ol>

<hr />

<h2 id="three-main-tokenization-approaches">Three Main Tokenization Approaches</h2>

<h3 id="1-word-based-tokenization">1. Word-Based Tokenization</h3>

<p>The system scans text character-by-character, treating delimiters (spaces, punctuation) as token boundaries. “Hello world!” becomes individual word units.</p>
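<p>A crude sketch of this delimiter-based approach (illustrative only; real word tokenizers also handle contractions, hyphenation, and Unicode segmentation):</p>

```java
import java.util.Arrays;
import java.util.List;

// Naive word-level tokenizer: whitespace separates tokens, and punctuation
// is peeled off into its own token via zero-width lookarounds.
class WordTokenizer {
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+|(?<=\\p{Punct})|(?=\\p{Punct})"));
    }
}
```

<p>With this rule, <code class="language-plaintext highlighter-rouge">"Hello world!"</code> yields three tokens: the two words plus the exclamation mark as its own unit.</p>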

<p><strong>Advantages:</strong></p>
<ul>
  <li>Intuitive and semantically meaningful</li>
  <li>Simple to implement</li>
  <li>Efficient for common vocabulary</li>
</ul>

<p><strong>Limitations:</strong></p>
<ul>
  <li>Massive vocabularies for morphologically rich languages</li>
  <li>Out-of-vocabulary (OOV) problems for rare or new words</li>
  <li>Compound word challenges (German, Finnish, Turkish suffer most)</li>
</ul>

<hr />

<h3 id="2-character-based-tokenization">2. Character-Based Tokenization</h3>

<p>Each character becomes a token. “Hello world” breaks into 11 tokens: <code class="language-plaintext highlighter-rouge">H-e-l-l-o-[space]-w-o-r-l-d</code>.</p>

<p><strong>Advantages:</strong></p>
<ul>
  <li>Tiny vocabulary (100–200 tokens covers most languages)</li>
  <li>Eliminates unknown word problems entirely</li>
</ul>

<p><strong>Trade-offs:</strong></p>
<ul>
  <li>Sequences 5–10× longer than word-based</li>
  <li>The model must reconstruct semantic meaning from individual characters</li>
  <li>Higher computational cost for the same amount of text</li>
</ul>

<hr />

<h3 id="3-subword-tokenization-the-modern-standard">3. Subword Tokenization (The Modern Standard)</h3>

<p>Modern LLMs — GPT, Claude, Llama, Gemini — all use subword tokenization. Words are split into meaningful units: “Unlikeliest” becomes <code class="language-plaintext highlighter-rouge">["Un", "likeli", "est"]</code>.</p>

<p>The dominant algorithm is <strong>Byte-Pair Encoding (BPE)</strong>:</p>

<ol>
  <li>Start with a vocabulary of individual characters</li>
  <li>Count the frequency of all adjacent token pairs</li>
  <li>Merge the most common pair into a new token</li>
  <li>Repeat until reaching the target vocabulary size (typically 10,000–100,000 tokens)</li>
</ol>

<p>The result: common words become single tokens, rare words get split into recognisable subword units, and the vocabulary is compact enough to be manageable.</p>
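<p>The merge loop is simple enough to sketch end-to-end. This toy trainer (illustrative only; production BPE operates on bytes, records merge ranks for reuse at inference time, and runs over enormous corpora) applies steps 2–4 to a list of single-character tokens:</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy BPE trainer: repeatedly merges the most frequent adjacent pair of tokens.
class BpeDemo {
    static List<String> merge(List<String> tokens, int merges) {
        List<String> current = new ArrayList<>(tokens);
        for (int i = 0; i < merges; i++) {
            // Step 2: count every adjacent token pair
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (int j = 0; j + 1 < current.size(); j++) {
                counts.merge(current.get(j) + "\u0000" + current.get(j + 1), 1, Integer::sum);
            }
            if (counts.isEmpty()) break;
            // Step 3: pick the most frequent pair (ties go to the earliest seen)
            String best = Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
            int sep = best.indexOf('\u0000');
            String left = best.substring(0, sep), right = best.substring(sep + 1);
            // Merge every non-overlapping occurrence into a single new token
            List<String> next = new ArrayList<>();
            for (int j = 0; j < current.size(); j++) {
                if (j + 1 < current.size()
                        && current.get(j).equals(left) && current.get(j + 1).equals(right)) {
                    next.add(left + right); // fuse the pair into one token
                    j++;                    // skip the consumed right-hand token
                } else {
                    next.add(current.get(j));
                }
            }
            current = next; // Step 4: repeat with the updated token stream
        }
        return current;
    }
}
```

<p>Running one merge over the character sequence of <code class="language-plaintext highlighter-rouge">aaabdaaabac</code> fuses every non-overlapping <code class="language-plaintext highlighter-rouge">aa</code> pair into a single token — the greedy step BPE repeats until the vocabulary budget is reached.</p>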

<hr />

<h2 id="impact-on-model-performance">Impact on Model Performance</h2>

<h3 id="economic-implications">Economic Implications</h3>

<p>Tokenization efficiency has real cost consequences:</p>

<ul>
  <li><strong>Training efficiency</strong> improves with optimised tokenization</li>
  <li><strong>Fine-tuning costs</strong> scale directly with tokens per training example</li>
  <li><strong>Inference latency</strong> varies based on how a given input tokenizes</li>
</ul>

<p>A well-structured prompt in English might use significantly fewer tokens than an equivalent prompt in another language — directly affecting both cost and available reasoning capacity.</p>

<h3 id="cross-lingual-disparities">Cross-Lingual Disparities</h3>

<p>This is one of the most significant fairness issues in modern LLMs. Languages with less efficient tokenization:</p>

<ul>
  <li>Get fewer reasoning steps within a fixed context window</li>
  <li>Cost more per equivalent amount of text</li>
  <li>May have lower model quality due to less training data representation</li>
</ul>

<p>English typically tokenizes very efficiently in models trained primarily on English text. Languages like Thai, Arabic, or many African languages often require significantly more tokens for equivalent content.</p>

<h3 id="technical-effects-on-model-reasoning">Technical Effects on Model Reasoning</h3>

<ul>
  <li><strong>Attention dilution</strong> — when semantic units are spread across many tokens, the model’s attention mechanism must work harder to connect related concepts</li>
  <li><strong>Boundary artifacts</strong> — token boundaries that misalign with semantic units can affect how the model processes meaning</li>
  <li><strong>Embedding geometry</strong> — the choice of tokens shapes the internal representation of concepts in the model’s embedding space</li>
</ul>

<hr />

<h2 id="practical-optimisation-strategies">Practical Optimisation Strategies</h2>

<p><strong>For prompt engineering:</strong></p>
<ul>
  <li>Structure prompts with awareness of how your target language tokenizes</li>
  <li>Use tokenizer visualisation tools (like OpenAI’s tokenizer playground) to inspect token boundaries</li>
  <li>Prefer structured formats (JSON, markdown) that often tokenize efficiently</li>
</ul>

<p><strong>For data preparation:</strong></p>
<ul>
  <li>Select formats based on tokenization performance for your use case</li>
  <li>Apply semantic compression — preserve meaning with fewer tokens where possible</li>
  <li>Be aware that code tokenizes very differently than prose</li>
</ul>

<hr />

<h2 id="the-future-of-tokenization">The Future of Tokenization</h2>

<p>Emerging approaches are addressing current limitations:</p>

<ul>
  <li><strong>Character-level fallbacks</strong> for rare words and multilingual content</li>
  <li><strong>Learned tokenizers</strong> that adapt during pre-training</li>
  <li><strong>Semantic tokenization</strong> that incorporates meaning-based rather than purely frequency-based boundaries</li>
  <li><strong>Byte-level models</strong> that operate directly on raw bytes, eliminating the tokenization step entirely</li>
</ul>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Tokenization forms the crucial bridge between human language and machine understanding. It shapes everything from API costs to model reasoning depth to cross-lingual fairness — yet it’s almost entirely invisible to most users.</p>

<p>Understanding how your text becomes tokens isn’t just an academic exercise. It directly informs better prompt design, more accurate cost estimation, and a clearer mental model of why LLMs behave the way they do.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="ml" /><category term="llm" /><category term="tokenization" /><category term="nlp" /><category term="ai" /><category term="machine-learning" /><category term="transformers" /><summary type="html"><![CDATA[Tokenizers are the invisible foundation of every large language model — shaping context limits, API costs, cross-lingual fairness, and model reasoning. Here's how they work and why they matter.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llm-tokenizers.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llm-tokenizers.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Mastering Backpressure in Reactive Programming: A Deep Dive</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/12/28/mastering-backpressure-reactive-programming/" rel="alternate" type="text/html" title="Mastering Backpressure in Reactive Programming: A Deep Dive" /><published>2024-12-28T12:00:00+00:00</published><updated>2024-12-28T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/12/28/mastering-backpressure-reactive-programming</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/12/28/mastering-backpressure-reactive-programming/"><![CDATA[<p><img src="/assets/images/blog/backpressure-reactive.jpg" alt="Mastering Backpressure in Reactive Programming" /></p>

<p>Reactive programming allows developers to build highly responsive and scalable systems that handle asynchronous data flows. But a fundamental challenge emerges when <strong>producers emit data faster than consumers can process it</strong>.</p>

<p>Without a solution, this imbalance leads to memory overload, CPU exhaustion, and cascading failures. The solution is <strong>backpressure</strong>.</p>

<hr />

<h2 id="what-is-backpressure">What is Backpressure?</h2>

<p>Backpressure is the mechanism that allows a consumer to communicate to its producer that it cannot keep up with the current data rate. Rather than silently dropping messages or crashing under load, a well-designed reactive system uses backpressure to apply flow control.</p>

<p>Think of it like a water pipe: if you push water in faster than it can flow out, something bursts. Backpressure is the pressure relief valve.</p>

<hr />

<h2 id="why-it-matters">Why It Matters</h2>

<p>Without proper backpressure handling:</p>

<ul>
  <li><strong>Memory overload</strong> — unbounded buffers fill up and cause OOM errors</li>
  <li><strong>CPU exhaustion</strong> — the consumer thrashes trying to process an overwhelming queue</li>
  <li><strong>Cascading failures</strong> — one slow consumer can destabilise an entire pipeline</li>
  <li><strong>Silent data loss</strong> — messages get dropped without any indication</li>
</ul>

<hr />

<h2 id="four-core-backpressure-strategies">Four Core Backpressure Strategies</h2>

<h3 id="1-buffering">1. Buffering</h3>

<p>Temporarily store excess data in a bounded buffer. When the buffer fills, apply additional pressure upstream.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Project Reactor</span>
<span class="nc">Flux</span><span class="o">.</span><span class="na">range</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">1000</span><span class="o">)</span>
    <span class="o">.</span><span class="na">onBackpressureBuffer</span><span class="o">(</span><span class="mi">100</span><span class="o">)</span> <span class="c1">// buffer up to 100 items</span>
    <span class="o">.</span><span class="na">subscribe</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span> <span class="n">process</span><span class="o">(</span><span class="n">item</span><span class="o">));</span>
</code></pre></div></div>

<p><strong>Use when:</strong> Data loss is unacceptable and consumers will eventually catch up. Be careful with buffer size — an unbounded buffer is no protection at all.</p>

<hr />

<h3 id="2-dropping">2. Dropping</h3>

<p>Discard items when the consumer falls behind. Simpler than buffering, but only appropriate when losing some messages is acceptable.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RxJava</span>
<span class="nc">Flowable</span><span class="o">.</span><span class="na">interval</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">MILLISECONDS</span><span class="o">)</span>
    <span class="o">.</span><span class="na">onBackpressureDrop</span><span class="o">()</span>
    <span class="o">.</span><span class="na">observeOn</span><span class="o">(</span><span class="nc">Schedulers</span><span class="o">.</span><span class="na">computation</span><span class="o">())</span>
    <span class="o">.</span><span class="na">subscribe</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span> <span class="n">slowProcess</span><span class="o">(</span><span class="n">item</span><span class="o">));</span>
</code></pre></div></div>

<p><strong>Use when:</strong> You’re processing real-time streams (metrics, sensor data) where stale data has no value.</p>

<hr />

<h3 id="3-throttling">3. Throttling</h3>

<p>Control the emission speed from the producer to match consumer capacity. Instead of buffering or dropping, you slow the source down.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Akka Streams</span>
<span class="nc">Source</span><span class="o">.</span><span class="na">repeat</span><span class="o">(</span><span class="s">"event"</span><span class="o">)</span>
    <span class="o">.</span><span class="na">throttle</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="nc">Duration</span><span class="o">.</span><span class="na">ofSeconds</span><span class="o">(</span><span class="mi">1</span><span class="o">))</span> <span class="c1">// max 10 elements per second</span>
    <span class="o">.</span><span class="na">runWith</span><span class="o">(</span><span class="nc">Sink</span><span class="o">.</span><span class="na">foreach</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">::</span><span class="n">println</span><span class="o">),</span> <span class="n">system</span><span class="o">);</span>
</code></pre></div></div>

<p><strong>Use when:</strong> The producer is controllable and you want to maintain a steady, sustainable flow rather than bursts.</p>

<hr />

<h3 id="4-requesting-pull-based">4. Requesting (Pull-based)</h3>

<p>Consumers explicitly request a specific number of items from the producer. This is the most precise form of backpressure — the foundation of the Reactive Streams specification.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Project Reactor — limit rate to 10 items at a time</span>
<span class="nc">Flux</span><span class="o">.</span><span class="na">range</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">1000</span><span class="o">)</span>
    <span class="o">.</span><span class="na">limitRate</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="c1">// consumer pulls 10 items at a time</span>
    <span class="o">.</span><span class="na">subscribe</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span> <span class="n">process</span><span class="o">(</span><span class="n">item</span><span class="o">));</span>
</code></pre></div></div>

<p><strong>Use when:</strong> You want fine-grained control over throughput and can predict consumer processing capacity.</p>

<hr />

<h2 id="framework-comparison">Framework Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Framework</th>
      <th>Default Strategy</th>
      <th>Key API</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Project Reactor</td>
      <td>Error on overflow</td>
      <td><code class="language-plaintext highlighter-rouge">onBackpressureBuffer()</code>, <code class="language-plaintext highlighter-rouge">limitRate()</code></td>
    </tr>
    <tr>
      <td>RxJava</td>
      <td>Configurable</td>
      <td><code class="language-plaintext highlighter-rouge">onBackpressureDrop()</code>, <code class="language-plaintext highlighter-rouge">onBackpressureLatest()</code></td>
    </tr>
    <tr>
      <td>Akka Streams</td>
      <td>Built-in propagation</td>
      <td><code class="language-plaintext highlighter-rouge">throttle()</code>, <code class="language-plaintext highlighter-rouge">buffer()</code></td>
    </tr>
  </tbody>
</table>

<p>All three follow the <strong>Reactive Streams specification</strong>, which standardises backpressure handling through the <code class="language-plaintext highlighter-rouge">Publisher</code>/<code class="language-plaintext highlighter-rouge">Subscriber</code> contract.</p>
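<p>The same contract is visible in the JDK itself: the <code>java.util.concurrent.Flow</code> interfaces mirror the Reactive Streams types one-to-one. Below is a minimal, self-contained sketch of a pull-based subscriber that signals demand one element at a time (class and method names are illustrative, not a framework API):</p>

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

public class PullSubscriberDemo {

    // Subscribe to a publisher, pulling exactly one element at a time via request(1)
    static List<Integer> consume(int count) throws InterruptedException {
        List<Integer> received = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);

        Flow.Subscriber<Integer> subscriber = new Flow.Subscriber<>() {
            private Flow.Subscription subscription;

            @Override public void onSubscribe(Flow.Subscription s) {
                subscription = s;
                s.request(1);             // initial demand: one element
            }
            @Override public void onNext(Integer item) {
                received.add(item);       // process the element...
                subscription.request(1);  // ...then signal demand for the next
            }
            @Override public void onError(Throwable t) { done.countDown(); }
            @Override public void onComplete() { done.countDown(); }
        };

        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) publisher.submit(i);
        } // close() delivers onComplete after the submitted items

        done.await();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(consume(5)); // [1, 2, 3, 4, 5]
    }
}
```

<p>The publisher never pushes faster than the subscriber asks — that request-driven handshake is what all three frameworks implement under the hood.</p>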

<hr />

<h2 id="best-practices">Best Practices</h2>

<ol>
  <li><strong>Match strategy to data criticality</strong> — financial transactions need buffering; live metrics can afford dropping</li>
  <li><strong>Always bound your buffers</strong> — unbounded buffers defer the problem rather than solving it</li>
  <li><strong>Test under realistic load</strong> — backpressure issues often surface only at production traffic levels</li>
  <li><strong>Understand your consumer’s processing capacity</strong> — profile before tuning</li>
  <li><strong>Monitor queue depths</strong> — expose buffer utilisation as a metric to catch pressure building up</li>
</ol>
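<p>Practices 2 and 5 need no framework at all. Here is a hand-rolled sketch using a JDK bounded queue, in which a full buffer drops the element and records the drop as a metric (the class and field names are illustrative, not a library API):</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedBufferDemo {

    final BlockingQueue<Integer> buffer;         // bounded: capacity is fixed up front
    final AtomicLong dropped = new AtomicLong(); // exposed as a metric (practice 5)

    BoundedBufferDemo(int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: offer() fails fast when the buffer is full,
    // instead of blocking the producer or growing without limit
    void produce(int item) {
        if (!buffer.offer(item)) {
            dropped.incrementAndGet(); // drop and count, so pressure build-up is visible
        }
    }

    public static void main(String[] args) {
        BoundedBufferDemo demo = new BoundedBufferDemo(100);
        for (int i = 0; i < 250; i++) demo.produce(i); // nothing drains: 150 items overflow
        System.out.println("buffered=" + demo.buffer.size()
                + " dropped=" + demo.dropped.get()); // buffered=100 dropped=150
    }
}
```

<p>Watching that <code>dropped</code> counter climb in a dashboard is exactly the early warning an unbounded buffer hides from you.</p>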

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Backpressure isn’t an edge case — it’s a core concern in any reactive system that handles real-world load. Choosing the right strategy (buffer, drop, throttle, or request) depends on your data’s criticality and your consumer’s characteristics. The frameworks provide the tools; understanding the trade-offs is the engineering judgment that makes them work.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="java" /><category term="reactive-programming" /><category term="backpressure" /><category term="project-reactor" /><category term="rxjava" /><category term="akka" /><category term="java" /><category term="streaming" /><summary type="html"><![CDATA[Backpressure is the fundamental mechanism that keeps reactive systems from collapsing under load. This deep dive covers what it is, why it matters, and how to implement it correctly across Project Reactor, RxJava, and Akka Streams.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/backpressure-reactive.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/backpressure-reactive.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Traditional Coding vs AI-Assisted Coding vs Vibe Coding: A New Spectrum</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/12/01/traditional-vs-ai-vs-vibe-coding/" rel="alternate" type="text/html" title="Traditional Coding vs AI-Assisted Coding vs Vibe Coding: A New Spectrum" /><published>2024-12-01T12:00:00+00:00</published><updated>2024-12-01T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/12/01/traditional-vs-ai-vs-vibe-coding</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/12/01/traditional-vs-ai-vs-vibe-coding/"><![CDATA[<p><img 
src="/assets/images/blog/traditional-vs-ai-vs-vibe-coding.jpg" alt="Traditional vs AI-Assisted vs Vibe Coding" /></p>

<p>Something has shifted in how software gets written. As an engineer, I’ve noticed that I no longer work in a single mode throughout the day. Depending on the task, the stakes, and the context, I switch between three fundamentally different headspaces — and recognising which one to be in has become as important as any technical skill.</p>

<hr />

<h2 id="mode-1-traditional-coding">Mode 1: Traditional Coding</h2>

<p>Traditional coding is manual, deliberate, and total-control. You write every line, reason through every decision, and understand the full implications of the code you produce.</p>

<p>There is a deep satisfaction in knowing you understand every single line. When something breaks, you know where to look. When a colleague asks why something works a certain way, you can explain it fully.</p>

<p>This mode is slower — but it builds the deepest understanding. It’s where you learn. It’s where complex, business-critical, or security-sensitive code should live.</p>

<p><strong>Best for:</strong></p>
<ul>
  <li>Core domain logic</li>
  <li>Security-critical code paths</li>
  <li>Novel problems with no clear prior art</li>
  <li>Situations where you need to deeply understand what you’re building</li>
</ul>

<hr />

<h2 id="mode-2-ai-assisted-coding">Mode 2: AI-Assisted Coding</h2>

<p>AI-assisted coding is my default mode today. It’s a constant collaboration — I handle the logic, intent, and architectural decisions, while AI tools manage boilerplate, suggest implementations, and reduce cognitive overhead.</p>

<p>The workflow feels like pair programming with a very fast, very well-read partner who sometimes hallucinates. You stay in control of the wheel, but the journey is significantly faster.</p>

<p>This mode reduces friction without reducing understanding. I still review every line. I still reason about every decision. But I spend less time on the mechanical aspects of writing code and more time on what matters.</p>

<p><strong>Best for:</strong></p>
<ul>
  <li>Most day-to-day development tasks</li>
  <li>Boilerplate and scaffolding</li>
  <li>Translating well-understood patterns into code</li>
  <li>Code reviews and refactoring suggestions</li>
</ul>

<hr />

<h2 id="mode-3-vibe-coding">Mode 3: Vibe Coding</h2>

<p>Vibe coding is the newest mode — and the most misunderstood. It’s about describing ideas in plain language and iterating on results rather than syntax. You’re directing, not writing.</p>

<p>The speed advantage is real. For prototyping, exploration, and proof-of-concept work, vibe coding collapses the feedback loop dramatically. You can test ten ideas in the time it would traditionally take to implement one.</p>

<p>But vibe coding doesn’t eliminate the need for strong engineering fundamentals — it <em>requires</em> them. You need to evaluate what gets generated, spot the subtle bugs, recognise the architectural shortcuts that will cause problems at scale. Without a strong foundation, vibe coding produces fast-moving, difficult-to-maintain code.</p>

<p><strong>Best for:</strong></p>
<ul>
  <li>Rapid prototyping and exploration</li>
  <li>Proof-of-concept development</li>
  <li>Low-stakes internal tools</li>
  <li>Learning new domains quickly</li>
</ul>

<hr />

<h2 id="the-real-skill-knowing-which-mode-to-use">The Real Skill: Knowing Which Mode to Use</h2>

<p>The most important insight isn’t about any single mode — it’s about switching between them deliberately.</p>

<p>A day of engineering now might look like: vibe coding to explore a new API (Mode 3), shifting to AI-assisted coding to build the actual integration (Mode 2), and dropping into traditional coding for the authentication and error-handling logic (Mode 1).</p>

<p>Conflating these modes is where things go wrong. Vibe coding your security layer is dangerous. Traditionally coding your boilerplate is inefficient. AI-assisted coding without review is irresponsible.</p>

<hr />

<h2 id="what-this-means-for-engineers">What This Means for Engineers</h2>

<p>The spectrum of coding modes is expanding — and that’s a good thing. More tools, more leverage, more ways to deliver value.</p>

<p>But the fundamentals don’t go away. Understanding algorithms, systems, trade-offs, and failure modes matters more as the code generation layer becomes more automated. The engineer who can work across all three modes — and knows when to use each — is more valuable than ever.</p>

<p>The headspace is different. The skill is knowing which one you’re in.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="engineering" /><category term="ai" /><category term="coding" /><category term="vibe-coding" /><category term="productivity" /><category term="software-engineering" /><category term="claude-code" /><summary type="html"><![CDATA[Modern software development has evolved into three distinct modes — traditional, AI-assisted, and vibe coding. Understanding when to use each is becoming a core engineering skill.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/traditional-vs-ai-vs-vibe-coding.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/traditional-vs-ai-vs-vibe-coding.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">LlamaCoder: Turn Your Idea into an App in Minutes</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/11/25/llamacoder-ai-app-builder/" rel="alternate" type="text/html" title="LlamaCoder: Turn Your Idea into an App in Minutes" /><published>2024-11-25T12:00:00+00:00</published><updated>2024-11-25T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/11/25/llamacoder-ai-app-builder</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/11/25/llamacoder-ai-app-builder/"><![CDATA[<p><img src="/assets/images/blog/llamacoder.jpg" alt="LlamaCoder" /></p>

<p>What if you could describe an app idea in plain English and have working code ready in minutes? That’s exactly what <strong>LlamaCoder</strong> delivers — and it signals a significant shift in who gets to build software.</p>

<hr />

<h2 id="what-is-llamacoder">What is LlamaCoder?</h2>

<p><a href="https://llamacoder.together.ai/">LlamaCoder</a> is an AI-powered code generation platform built on large language models. It converts natural language descriptions into functional application code, enabling people without deep programming expertise to build real apps.</p>

<p>Powered by <a href="https://www.together.ai/">Together AI</a>, it demonstrates how LLMs are rapidly lowering the barrier to software creation.</p>

<hr />

<h2 id="how-it-works">How It Works</h2>

<p>The workflow is straightforward:</p>

<ol>
  <li><strong>Describe your idea</strong> in plain language — no technical jargon required</li>
  <li><strong>AI generates clean, efficient code</strong> in Python, JavaScript, Swift, and more</li>
  <li><strong>Customise and refine</strong> the generated code to fit your exact needs</li>
  <li><strong>Deploy</strong> across web, iOS, or Android platforms</li>
</ol>

<p>The model interprets intent, not just keywords — it understands the <em>goal</em> of what you’re building and generates code structured accordingly.</p>

<hr />

<h2 id="key-features">Key Features</h2>

<ul>
  <li><strong>No prior coding knowledge required</strong> — describe the app, get the code</li>
  <li><strong>Rapid prototyping</strong> — go from idea to working prototype in minutes</li>
  <li><strong>Multi-language support</strong> — Python, JavaScript, Swift, and more</li>
  <li><strong>Cross-platform output</strong> — web, iOS, Android</li>
  <li><strong>Iterative refinement</strong> — describe changes in natural language to evolve the code</li>
  <li><strong>Cost-effective</strong> — dramatically reduces development time and resources</li>
</ul>

<hr />

<h2 id="who-is-it-for">Who Is It For?</h2>

<p>LlamaCoder opens software development to a much broader audience:</p>

<ul>
  <li><strong>Entrepreneurs</strong> validating product ideas without hiring developers</li>
  <li><strong>Business owners</strong> building internal tools without IT dependency</li>
  <li><strong>Designers</strong> prototyping interactive concepts directly</li>
  <li><strong>Students</strong> learning by seeing working code generated from their ideas</li>
  <li><strong>Developers</strong> accelerating boilerplate and scaffolding</li>
</ul>

<hr />

<h2 id="what-can-you-build">What Can You Build?</h2>

<p>The platform handles a surprisingly wide range of use cases:</p>

<ul>
  <li>E-commerce platforms with product listings and checkout flows</li>
  <li>Social networking applications with feeds and user profiles</li>
  <li>Productivity tools like task managers and dashboards</li>
  <li>Educational software with quizzes and progress tracking</li>
  <li>Gaming and entertainment apps</li>
</ul>

<hr />

<h2 id="the-bigger-picture">The Bigger Picture</h2>

<p>LlamaCoder is part of a broader shift in software development. As AI code generation matures, the bottleneck moves from <em>writing</em> code to <em>knowing what to build</em>. Domain expertise, product thinking, and clear problem definition become the primary skills — the technical implementation becomes increasingly automated.</p>

<p>This doesn’t eliminate the need for engineers — but it fundamentally changes what they spend their time on. The best engineers will focus on architecture, quality, edge cases, and the hard problems that AI still can’t solve reliably.</p>

<p>For everyone else, tools like LlamaCoder are a genuine step toward democratising the ability to build software.</p>

<hr />

<p><strong>Try it yourself:</strong> <a href="https://llamacoder.together.ai/">llamacoder.together.ai</a></p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="ai" /><category term="tools" /><category term="ai" /><category term="llm" /><category term="code-generation" /><category term="llamacoder" /><category term="no-code" /><category term="productivity" /><summary type="html"><![CDATA[An introduction to LlamaCoder — the AI-powered platform that converts plain language descriptions into functional application code, democratising app development for everyone.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llamacoder.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/llamacoder.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Architectural Trade-off: Serverless APIs vs. Kubernetes APIs</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/11/17/serverless-vs-kubernetes-apis/" rel="alternate" type="text/html" title="Architectural Trade-off: Serverless APIs vs. Kubernetes APIs" /><published>2024-11-17T12:00:00+00:00</published><updated>2024-11-17T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/11/17/serverless-vs-kubernetes-apis</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/11/17/serverless-vs-kubernetes-apis/"><![CDATA[<p><img src="/assets/images/blog/serverless-vs-kubernetes.jpg" alt="Serverless APIs vs Kubernetes" /></p>

<p>Choosing between serverless and Kubernetes for API deployment is one of the most consequential architectural decisions in modern cloud engineering. Both are powerful paradigms — but they make fundamentally different trade-offs.</p>

<hr />

<h2 id="the-core-distinction">The Core Distinction</h2>

<p><strong>Serverless</strong> abstracts away all infrastructure. You deploy functions or containers, define triggers, and the cloud provider handles scaling, patching, and availability.</p>

<p><strong>Kubernetes</strong> gives you a programmable infrastructure platform. You define workloads, networking, and scaling policies — with full control over how your APIs run.</p>

<hr />

<h2 id="trade-off-analysis">Trade-off Analysis</h2>

<h3 id="1-operational-overhead">1. Operational Overhead</h3>

<p><strong>Serverless</strong> minimises operational burden through abstraction. There are no servers to patch, no clusters to manage, no capacity to plan. Teams can focus entirely on business logic.</p>

<p><strong>Kubernetes</strong> demands DevOps expertise. Cluster management, networking (CNI plugins, ingress controllers), storage, upgrades, and observability all require dedicated attention — often a platform engineering team.</p>

<p><strong>Winner for small teams or rapid iteration:</strong> Serverless.</p>

<hr />

<h3 id="2-scalability">2. Scalability</h3>

<p><strong>Serverless</strong> offers automatic, near-instant scaling from zero to peak — no configuration required.</p>

<p><strong>Kubernetes</strong> supports scaling through Horizontal Pod Autoscalers (HPA) and KEDA, but requires explicit configuration and has a minimum baseline cost (you can’t truly scale to zero without additional tooling like Knative).</p>

<p><strong>Winner for unpredictable traffic spikes:</strong> Serverless.</p>

<hr />

<h3 id="3-cost">3. Cost</h3>

<p><strong>Serverless</strong> uses pay-per-execution pricing — ideal for sporadic, bursty workloads. Cost can spike unexpectedly at high volumes.</p>

<p><strong>Kubernetes</strong> costs are tied to cluster utilisation. With well-optimised bin-packing and steady traffic, Kubernetes can be significantly cheaper at scale.</p>

<p><strong>Winner for high, steady-state throughput:</strong> Kubernetes.</p>

<hr />

<h3 id="4-control-and-customisation">4. Control and Customisation</h3>

<p><strong>Serverless</strong> limits infrastructure control. Runtime versions, execution environments, and networking are largely managed by the provider.</p>

<p><strong>Kubernetes</strong> provides extensive customisation: custom runtimes, sidecar patterns, service meshes, custom schedulers, and full network policy control.</p>

<p><strong>Winner for complex, specialised requirements:</strong> Kubernetes.</p>

<hr />

<h3 id="5-vendor-lock-in">5. Vendor Lock-in</h3>

<p><strong>Serverless</strong> platforms (AWS Lambda, Google Cloud Functions, Azure Functions) create provider dependency. Migrating functions across clouds is non-trivial.</p>

<p><strong>Kubernetes</strong> is open-source and runs on any cloud or on-premises. Managed distributions (EKS, GKE, AKS) add some lock-in at the control plane level, but workloads remain portable.</p>

<p><strong>Winner for portability:</strong> Kubernetes.</p>

<hr />

<h3 id="6-deployment-speed">6. Deployment Speed</h3>

<p><strong>Serverless</strong> accelerates time-to-market. Deployments are simple — upload code, configure a trigger, done.</p>

<p><strong>Kubernetes</strong> adds pipeline complexity: container builds, image registries, Helm charts or manifests, rolling deployments. A well-engineered pipeline abstracts this, but the investment is real.</p>

<p><strong>Winner for speed of initial delivery:</strong> Serverless.</p>

<hr />

<h3 id="7-debugging-and-observability">7. Debugging and Observability</h3>

<p><strong>Serverless</strong> offers limited visibility. Cold starts, ephemeral execution environments, and distributed traces across functions can be difficult to reason about.</p>

<p><strong>Kubernetes</strong> provides robust observability tooling: Prometheus, Grafana, Jaeger, and deep integration with service meshes like Istio. Full control over logging, metrics, and tracing.</p>

<p><strong>Winner for production debuggability:</strong> Kubernetes.</p>

<hr />

<h3 id="8-state-management">8. State Management</h3>

<p><strong>Serverless</strong> emphasises statelessness by design. Persistent state requires external services (DynamoDB, S3, Redis).</p>

<p><strong>Kubernetes</strong> supports stateful applications natively through StatefulSets and Persistent Volumes — suitable for databases, queues, and long-running workloads.</p>

<p><strong>Winner for stateful workloads:</strong> Kubernetes.</p>

<hr />

<h2 id="decision-framework">Decision Framework</h2>

<table>
  <thead>
    <tr>
      <th>Consider</th>
      <th>Choose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Small team, rapid iteration</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>Bursty, unpredictable traffic</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>High steady-state throughput</td>
      <td>Kubernetes</td>
    </tr>
    <tr>
      <td>Complex infrastructure requirements</td>
      <td>Kubernetes</td>
    </tr>
    <tr>
      <td>Strong portability requirements</td>
      <td>Kubernetes</td>
    </tr>
    <tr>
      <td>Minimal operational investment</td>
      <td>Serverless</td>
    </tr>
    <tr>
      <td>Stateful workloads</td>
      <td>Kubernetes</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>There is no universally correct answer. Serverless excels at reducing operational overhead and handling variable load elegantly. Kubernetes excels at control, portability, and cost efficiency at scale.</p>

<p>Many mature organisations run both — serverless for event-driven, low-traffic APIs and Kubernetes for core platform services. Evaluate your API complexity, team expertise, traffic patterns, and long-term architectural goals before committing to either paradigm.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="architecture" /><category term="serverless" /><category term="kubernetes" /><category term="cloud" /><category term="architecture" /><category term="devops" /><category term="apis" /><summary type="html"><![CDATA[A structured comparison of serverless and Kubernetes as API deployment paradigms — trade-offs across operational overhead, scalability, cost, control, and vendor lock-in to help you choose the right architecture for your context.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/serverless-vs-kubernetes.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/serverless-vs-kubernetes.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Exploring Threads and Virtual Threads in Java: A Comprehensive Guide</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/11/13/java-threads-and-virtual-threads/" rel="alternate" type="text/html" title="Exploring Threads and Virtual Threads in Java: A Comprehensive Guide" /><published>2024-11-13T12:00:00+00:00</published><updated>2024-11-13T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/11/13/java-threads-and-virtual-threads</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/11/13/java-threads-and-virtual-threads/"><![CDATA[<p><img src="/assets/images/blog/java-virtual-threads.jpg" alt="Java Threads and Virtual Threads" /></p>

<p>Java’s concurrency model has undergone a significant evolution with the introduction of <strong>virtual threads</strong> (Project Loom), previewed in Java 19 and finalised in Java 21. Understanding the difference between traditional threads and virtual threads is key to building scalable, efficient applications.</p>

<hr />

<h2 id="traditional-java-threads-platform-threads">Traditional Java Threads (Platform Threads)</h2>

<p>Traditional threads — also called <strong>platform threads</strong> — are managed by the operating system and map directly to native OS threads.</p>

<p><strong>Key characteristics:</strong></p>
<ul>
  <li>Heavyweight: significant creation and management cost</li>
  <li>Consume OS resources even when blocked on I/O</li>
  <li>Limited scalability due to memory overhead per thread</li>
  <li>Scheduled by the OS, incurring context-switching overhead</li>
</ul>

<p><strong>Internal mechanics:</strong></p>
<ul>
  <li>Each Java thread maps 1:1 to a kernel-level thread with its own memory stack</li>
  <li>The OS scheduler manages thread switching through context-switching</li>
  <li>Synchronization occurs at the OS level</li>
</ul>

<p>Thousands of threads quickly become inefficient — memory constraints and context-switching overheads degrade performance significantly.</p>

<hr />

<h2 id="virtual-threads-project-loom">Virtual Threads (Project Loom)</h2>

<p>Introduced in Java 19 and made production-ready in Java 21, <strong>virtual threads</strong> are user-mode threads managed by the JVM rather than the OS.</p>

<p><strong>Key characteristics:</strong></p>
<ul>
  <li>Lightweight and inexpensive to create (millions can run simultaneously)</li>
  <li>Yield control to the JVM when blocking, freeing carrier OS threads for other work</li>
  <li>Managed by the JVM scheduler using a small pool of OS threads</li>
  <li>Cooperative multitasking through yielding</li>
</ul>

<p><strong>Internal mechanics:</strong></p>
<ul>
  <li>No direct 1:1 mapping to OS threads</li>
  <li>Use continuation-based scheduling backed by <strong>carrier threads</strong></li>
  <li>Stack frames stored in the Java heap — minimal per-thread overhead</li>
  <li>When a virtual thread blocks (e.g., on I/O), it unmounts from the carrier thread, which immediately picks up another virtual thread</li>
</ul>
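<p>The scalability claim is easy to verify on a Java 21+ runtime. This sketch (names illustrative) starts ten thousand virtual threads that all block briefly — an experiment that would exhaust memory with platform threads on many machines:</p>

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadDemo {

    // Start n virtual threads that each block briefly, then wait for all of them
    static int runAll(int n) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            threads.add(Thread.ofVirtual().start(() -> {
                try {
                    Thread.sleep(Duration.ofMillis(10)); // blocking: the virtual thread unmounts here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            }));
        }
        for (Thread t : threads) t.join();
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("completed: " + runAll(10_000)); // completed: 10000
    }
}
```

<p>During each <code>sleep</code>, the virtual thread unmounts and its carrier immediately runs another one, so a handful of OS threads services all ten thousand tasks.</p>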

<hr />

<h2 id="side-by-side-comparison">Side-by-Side Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Platform Threads</th>
      <th>Virtual Threads</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Managed by</td>
      <td>OS</td>
      <td>JVM</td>
    </tr>
    <tr>
      <td>Cost to create</td>
      <td>High</td>
      <td>Very low</td>
    </tr>
    <tr>
      <td>Scalability</td>
      <td>Thousands</td>
      <td>Millions</td>
    </tr>
    <tr>
      <td>Blocking behaviour</td>
      <td>Blocks OS thread</td>
      <td>Unmounts from carrier</td>
    </tr>
    <tr>
      <td>Memory overhead</td>
      <td>Large (OS stack)</td>
      <td>Small (heap)</td>
    </tr>
    <tr>
      <td>Best for</td>
      <td>CPU-bound tasks</td>
      <td>I/O-bound tasks</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="practical-example-web-server">Practical Example: Web Server</h2>

<p>Consider a web server handling thousands of concurrent requests.</p>

<p><strong>With platform threads:</strong> Each request ties up an OS thread while waiting for database queries or network calls. With 10,000 concurrent requests, you need 10,000 OS threads — at a typical default stack size of around 1 MB each, that is roughly 10 GB of reserved stack memory, and exhaustion becomes a real risk.</p>

<p><strong>With virtual threads:</strong> Each request gets its own virtual thread. When blocked on I/O, the virtual thread unmounts and the carrier thread immediately handles another request. You can handle 10,000 concurrent requests with just a handful of OS threads.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Creating a virtual thread (Java 21)</span>
<span class="nc">Thread</span><span class="o">.</span><span class="na">ofVirtual</span><span class="o">().</span><span class="na">start</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
    <span class="c1">// handle request — blocking I/O is fine here</span>
    <span class="kt">var</span> <span class="n">result</span> <span class="o">=</span> <span class="n">database</span><span class="o">.</span><span class="na">query</span><span class="o">(</span><span class="s">"SELECT ..."</span><span class="o">);</span>
    <span class="n">response</span><span class="o">.</span><span class="na">send</span><span class="o">(</span><span class="n">result</span><span class="o">);</span>
<span class="o">});</span>

<span class="c1">// Or with ExecutorService</span>
<span class="k">try</span> <span class="o">(</span><span class="kt">var</span> <span class="n">executor</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newVirtualThreadPerTaskExecutor</span><span class="o">())</span> <span class="o">{</span>
    <span class="n">executor</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">handleRequest</span><span class="o">(</span><span class="n">request</span><span class="o">));</span>
<span class="o">}</span>
</code></pre></div></div>

<hr />

<h2 id="when-to-use-each">When to Use Each</h2>

<p><strong>Use platform threads when:</strong></p>
<ul>
  <li>Your workload is CPU-bound (heavy computation, no blocking)</li>
  <li>You have a small, fixed number of concurrent tasks</li>
  <li>You need precise OS-level thread control</li>
</ul>

<p><strong>Use virtual threads when:</strong></p>
<ul>
  <li>Your workload is I/O-bound (database, network, file access)</li>
  <li>You need high concurrency (web servers, microservices, messaging)</li>
  <li>You want to simplify code by avoiding reactive/callback patterns</li>
</ul>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Virtual threads don’t replace platform threads — they complement them. For the vast majority of server-side Java applications that spend most time waiting on I/O, virtual threads offer a dramatic scalability improvement with minimal code changes. Project Loom effectively brings the simplicity of synchronous code to the scale of asynchronous systems.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="java" /><category term="java" /><category term="concurrency" /><category term="virtual-threads" /><category term="project-loom" /><category term="performance" /><summary type="html"><![CDATA[A deep dive into traditional Java threads vs virtual threads (Project Loom) — how they work internally, when to use each, and why virtual threads are a game changer for I/O-bound applications.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/java-virtual-threads.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/java-virtual-threads.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Why is Kafka So Fast? Unveiling the Secrets Behind Kafka’s Speed</title><link href="https://shanmuga-sundaram-n.github.io/blog/2024/10/17/why-kafka-is-so-fast/" rel="alternate" type="text/html" title="Why is Kafka So Fast? Unveiling the Secrets Behind Kafka’s Speed" /><published>2024-10-17T12:00:00+00:00</published><updated>2024-10-17T12:00:00+00:00</updated><id>https://shanmuga-sundaram-n.github.io/blog/2024/10/17/why-kafka-is-so-fast</id><content type="html" xml:base="https://shanmuga-sundaram-n.github.io/blog/2024/10/17/why-kafka-is-so-fast/"><![CDATA[<p><img src="/assets/images/blog/kafka-speed.jpg" alt="Why is Kafka So Fast?" /></p>

<p>Apache Kafka is renowned for its extraordinary throughput and low latency. But what actually makes it so fast? The answer lies in a combination of deliberate engineering decisions that work together to minimize overhead at every layer.</p>

<hr />

<h2 id="1-sequential-io-optimizing-disk-access">1. Sequential I/O: Optimizing Disk Access</h2>

<p>Kafka employs an <strong>append-only log</strong> architecture that leverages sequential rather than random disk access. Messages are written in the order they arrive and stored sequentially — data is continuously appended to the end of the log file.</p>

<p>This approach minimizes seek time on mechanical drives. When handling thousands of messages per second from IoT sensors, each new entry simply gets added to the log’s end, avoiding the expensive physical movement of disk read/write heads.</p>

<p>Sequential reads and writes are orders of magnitude faster than random access — Kafka exploits this at the core of its storage design.</p>
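<p>The pattern is easy to sketch in Java: opening a channel in <code>APPEND</code> mode forces every write to the current end of the file, so access stays purely sequential. This is an illustrative stand-in for a log segment, not Kafka's actual implementation:</p>

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog {

    // Append n records to a fresh segment file; APPEND mode guarantees each
    // write lands at the end of the file, so the disk never has to seek
    static Path writeRecords(int n) throws IOException {
        Path segment = Files.createTempFile("segment", ".log"); // stand-in for a Kafka log segment
        try (FileChannel channel = FileChannel.open(segment,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int offset = 0; offset < n; offset++) {
                byte[] record = ("offset=" + offset + " payload=sensor-reading\n")
                        .getBytes(StandardCharsets.UTF_8);
                channel.write(ByteBuffer.wrap(record)); // no seek: just append
            }
        }
        return segment;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Files.readAllLines(writeRecords(3)).size() + " records appended");
    }
}
```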

<hr />

<h2 id="2-zero-copy-principle-efficient-data-transfer">2. Zero-Copy Principle: Efficient Data Transfer</h2>

<p>Traditionally, transferring data from disk to network involves multiple copies through kernel and user-space buffers. Kafka bypasses this with <strong>zero-copy</strong> using system calls like <code class="language-plaintext highlighter-rouge">sendfile()</code> on Linux.</p>

<p>This technique instructs the kernel to move data directly from the disk buffer to the network socket buffer — eliminating unnecessary intermediate copies, reducing CPU overhead, and maximizing delivery throughput.</p>

<p>The practical benefit: transferring a 1 GB batch of logs bypasses the costly round-trip through user space entirely.</p>
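<p>The same mechanism is exposed in Java as <code class="language-plaintext highlighter-rouge">FileChannel.transferTo()</code>, which delegates to <code class="language-plaintext highlighter-rouge">sendfile()</code> where the OS supports it. A minimal sketch, using a second file as a stand-in for the socket channel Kafka would actually write to:</p>

```java
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

Path src = Files.createTempFile("segment", ".log");
Files.write(src, "log bytes to ship to a consumer".getBytes());
Path dst = Files.createTempFile("socket-stand-in", ".bin");

try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
     FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
    // transferTo() lets the kernel move bytes from the page cache
    // straight to the target channel, skipping user-space buffers.
    in.transferTo(0, in.size(), out);
}
```

<p>Note that the application never holds the data in a byte array of its own; the kernel does the whole transfer.</p>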

<hr />

<h2 id="3-message-compression-reducing-transmission-size">3. Message Compression: Reducing Transmission Size</h2>

<p>Kafka supports compression algorithms including <strong>GZIP, Snappy, and LZ4</strong> at the producer level. Compression is applied to message batches before transmission, then decompressed by consumers.</p>

<p>This is particularly valuable when handling repetitive data like application logs, where compression ratios can be significant — reducing both network bandwidth and storage requirements.</p>
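<p>The effect is easy to see with the JDK's built-in GZIP support. This is a standalone sketch of the principle; in Kafka itself, compression is enabled via the producer's <code class="language-plaintext highlighter-rouge">compression.type</code> setting rather than by hand:</p>

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

// Repetitive data, like application logs, compresses extremely well.
String line = "2024-10-17T12:00:00Z INFO request handled in 3ms\n";
byte[] raw = line.repeat(1000).getBytes();

ByteArrayOutputStream buf = new ByteArrayOutputStream();
try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
    gz.write(raw);
}
byte[] compressed = buf.toByteArray();
// compressed is a small fraction of the raw size
```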

<hr />

<h2 id="4-message-batching-efficient-processing">4. Message Batching: Efficient Processing</h2>

<p>Rather than handling messages one at a time, Kafka groups multiple messages into <strong>batches</strong> before disk writes or network transmission. Grouping 100 metrics into a single batch, for example:</p>

<ul>
  <li>Reduces the number of I/O operations</li>
  <li>Decreases network round-trips</li>
  <li>Lowers broker CPU load</li>
</ul>

<p>This amortises the fixed overhead of each operation across many messages, dramatically improving throughput.</p>
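<p>Batching is controlled by standard producer settings. A minimal configuration sketch (the broker address is a placeholder):</p>

```java
import java.util.Properties;

// Standard Kafka producer configuration keys; the broker address
// below is a placeholder for illustration.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("batch.size", "65536");     // accumulate up to 64 KB per batch
props.put("linger.ms", "10");         // wait up to 10 ms to fill a batch
props.put("compression.type", "lz4"); // compress each batch before sending
```

<p>A larger <code class="language-plaintext highlighter-rouge">batch.size</code> combined with a small <code class="language-plaintext highlighter-rouge">linger.ms</code> trades a few milliseconds of latency for markedly higher throughput.</p>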

<hr />

<h2 id="5-efficient-memory-management-and-caching">5. Efficient Memory Management and Caching</h2>

<p>Kafka maintains <strong>in-memory indexes</strong> and leverages the OS page cache for recently accessed log segments. This enables rapid message retrieval without frequent disk reads — particularly when consumers request recently produced messages, which are almost always already in the page cache.</p>

<p>Kafka intentionally relies on the OS page cache rather than implementing its own heap-based caching, keeping JVM garbage collection pressure low.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>Kafka’s performance is not the result of a single trick — it’s a comprehensive engineering approach:</p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Benefit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Sequential I/O</td>
      <td>Eliminates random disk seek time</td>
    </tr>
    <tr>
      <td>Zero-copy</td>
      <td>Removes redundant data copies</td>
    </tr>
    <tr>
      <td>Compression</td>
      <td>Reduces bandwidth and storage</td>
    </tr>
    <tr>
      <td>Batching</td>
      <td>Amortises I/O overhead across messages</td>
    </tr>
    <tr>
      <td>Page cache</td>
      <td>Fast reads without disk access</td>
    </tr>
  </tbody>
</table>

<p>Together, these decisions make Kafka capable of handling millions of messages per second with consistently low latency — a system designed from the ground up for high-throughput streaming.</p>]]></content><author><name>Shanmuga Sundaram Natarajan (Shan)</name></author><category term="engineering" /><category term="distributed-systems" /><category term="kafka" /><category term="performance" /><category term="streaming" /><category term="backend" /><summary type="html"><![CDATA[A deep dive into the engineering decisions that make Apache Kafka one of the fastest messaging systems — sequential I/O, zero-copy, batching, compression, and intelligent caching.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/kafka-speed.jpg" /><media:content medium="image" url="https://shanmuga-sundaram-n.github.io/assets/images/blog/kafka-speed.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>