# Vikas Mishra — Full Content for AI Ingestion

> This file is a single-document compilation of the public content on vikasmishra.ai, intended for ingestion by AI search engines, agents, and language models. The canonical home is https://vikasmishra.ai. Each section below is dated and links back to its canonical URL.

> Author: Vikas Mishra. Role: Platform Architect — AI and Cloud at Google. Location: India. Contact: vikas@vikasmishra.ai. LinkedIn: https://www.linkedin.com/in/vikaskmishra/. GitHub: https://github.com/vmishra.

## About

Vikas Mishra is a Platform Architect — AI and Cloud at Google. He designs the cloud and AI platforms that run at production scale for India's largest digital-native companies, and writes about the engineering behind systems built to endure. He has held senior engineering roles at Google, Myntra (Flipkart Group), Razorpay, and IBM Software Labs.

**Current role:** Platform Architect — AI and Cloud, Google (2020–present). Architected one of the largest Kubernetes deployments in APAC, serving millions of concurrent users at 99.99% uptime. Leads AI infrastructure, model training and serving, and hardware-accelerator optimization for enterprise customers. 2x Google Cloud Club Award recipient (2023, 2026).

**Areas of expertise:** AI infrastructure architecture, model training and serving systems, hardware-accelerator optimization (TPUs and GPUs), Google Cloud Platform, Kubernetes at scale, distributed systems design, platform engineering, site reliability engineering, technical leadership.

**Certifications:** Google Cloud Certified Professional Cloud Architect, Professional Data Engineer, Professional Cloud Database Engineer (2026), Professional Cloud DevOps Engineer, Professional Machine Learning Engineer (2026).

**Recognition:** 2x Google Cloud Club Award recipient (2023, 2026); Google Milestone Award for a Defensive Publication (2024); keynote speaker at 30+ global technology conferences; mentored 60+ engineers in AI and cloud.

**Education:** B.Tech in Computer Science and Engineering — SASTRA University, Thanjavur (2007–2011).

---

## Blog Posts


### Google's ADK Is a Runtime, Not a Graph: Notes From Eleven Agents

- **Published:** 2026-04-25 (updated 2026-04-25)
- **URL:** https://vikasmishra.ai/blog/google-adk-runtime-not-graph/
- **Markdown source:** https://vikasmishra.ai/blog/google-adk-runtime-not-graph/index.md
- **Tags:** AI Agents, Google ADK, Gemini, MCP, Architecture, Platform Engineering

> Eleven agents in, the framework choice that mattered wasn't ergonomics or graph syntax. It was whether the runtime had opinions about events, state, and transport. ADK does. Here is what that buys you in production.


*Disclaimer: All opinions expressed in this post are my own and do not represent the views or positions of my employer. I work at Google; this is written from implementation, not advocacy. Where ADK is awkward I will say so.*

---

I built eleven agents on Google's Agent Development Kit over the last month — a hotel concierge, a deep-research travel planner, a voice support desk on Gemini Live, a computer-use food-delivery rep driving a real Chromium browser, a beauty advisor with persistent memory, a fintech HITL payout desk that suspends a turn for human approval, two federated agents talking to each other over HTTP, an eval harness, an MCP-only knowledge desk, and a live video card scanner. They share a single browser portal so I could watch them side by side. The whole bundle is open source.

What I expected to learn was how ADK compares as a *framework* — Python ergonomics, decorator quality, the usual. What I actually learned was that ADK is not really competing in the framework category. The right comparison for ADK is to a server runtime: it has opinions about events, state, and transport, and those opinions are what your code is shaped around. LangChain gives you composable pieces. LangGraph gives you a state machine. CrewAI gives you a metaphor. ADK gives you a *runtime contract* — and once you have built two or three agents that share that contract, the value is hard to give back.

This post is about that contract: what it looks like in practice, where it pays off, where it does not, and the heuristics I would use the next time I have to choose.

---

## The contract: events in, events out, state in the middle

The shape of every ADK agent is the same. You declare a tree — an `LlmAgent` with tools, or a `SequentialAgent` containing a `ParallelAgent`, or a `LoopAgent` wrapping a critic — and you hand it to a `Runner`. The runner exposes one primary surface, an async generator of events:

```python
async for event in runner.run_async(
    user_id=USER_ID, session_id=session.id, new_message=msg
):
    ...
```

For live, voice, and video, you swap `run_async` for `run_live` and feed a `LiveRequestQueue` instead of a `new_message`. The events keep coming.

Every event is a `google.adk.events.Event` carrying parts (text, function calls, function responses, inline audio, inline image), plus side-channel signals: `partial`, `turn_complete`, `interrupted`, `usage_metadata`, `actions`. The runner is what you write your server around. Tools mutate `tool_context.state`; sub-agents write their results into it via `output_key`; sessions persist it; the `Runner` decides what to emit and when.

That is the entire model. There is no graph DSL, no chain object, no "executor". Composition happens by *containing* one agent inside another:

```python
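from google.adk.agents import ParallelAgent, SequentialAgent

# planner, composer, and the three researchers are LlmAgents defined elsewhere.
# The parallel stage has to exist before the root agent references it.
parallel_researchers = ParallelAgent(
    name="parallel_researchers",
    sub_agents=[flight_researcher, hotel_researcher, activity_researcher],
)

root_agent = SequentialAgent(
    name="travel_planner",
    sub_agents=[planner, parallel_researchers, composer],
)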
```

Three sub-agents fan out concurrently, each writes its brief to state under its own `output_key`, the composer reads all three and emits an itinerary. No edges, no transitions, no conditional routers. The contract is *"state is the message bus, and the parent agent decides who gets to write next."*
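To make the "state is the message bus" idea concrete without depending on ADK, here is a toy sketch in plain `asyncio`. It is not how ADK implements `ParallelAgent` or `output_key`; the `researcher` and `composer` functions, the brief strings, and the `hotels_brief` and `activities_brief` keys are all hypothetical stand-ins. The only point is the shape: writers never return to the reader, they write under their own key and stop.

```python
import asyncio


async def researcher(state: dict, output_key: str, topic: str) -> None:
    # Stand-in for an LLM-backed sub-agent: do some work, then write the
    # result into shared state under this agent's own key. Nothing is returned.
    await asyncio.sleep(0)
    state[output_key] = f"brief on {topic}"


async def composer(state: dict) -> str:
    # The composer never calls the researchers; it only reads what they wrote.
    keys = ("flights_brief", "hotels_brief", "activities_brief")
    return " | ".join(state[k] for k in keys)


async def run_pipeline() -> str:
    state: dict = {}
    # "Parallel" container: fan out and wait for every writer to finish.
    await asyncio.gather(
        researcher(state, "flights_brief", "flights"),
        researcher(state, "hotels_brief", "hotels"),
        researcher(state, "activities_brief", "activities"),
    )
    # "Sequential" container: the composer runs only after the fan-out.
    return await composer(state)


itinerary = asyncio.run(run_pipeline())
```

Adding a fourth researcher in this sketch means one more line in the `gather` call and one more key for the composer to read, which is roughly the property the container model is meant to preserve.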

When you first see it, this looks anaemic compared to a LangGraph diagram. The longer I spent in it, the more I think the absence is the point.

## The reframe: state is the contract, not the call graph

Every agent framework I used before ADK eventually forced me to write the same thing: a typed dictionary of partial results, smuggled between nodes via the framework's preferred argument-passing convention. LangChain made me build it manually. LangGraph turned it into a first-class state object but kept the graph as the orchestration unit. CrewAI hid it inside crew context.

ADK takes the opposite bet. It treats `tool_context.state` (and its longer-lived sibling, the `Session`) as the *only* contract between agents, and it demotes the orchestration shape to a thin tree. There are no direct calls between sub-agents. The flight researcher does not "return" to the composer. It writes `flights_brief` to state and stops. The composer reads state on its turn.

This sounds like a stylistic choice. It is actually a transport choice. The moment you have a sub-agent in another process — A2A federation, an MCP server, a long-running tool waiting on a human — you are no longer making function calls. You are sending and receiving events, with state as the durable record between them. ADK's contract is the same shape as the wire. Frameworks that orchestrate by call graph have to translate themselves into that shape under load. ADK does not.

The clearest example in the cookbook is the HITL payout desk. The agent drafts a payout, hits the ₹50,000 threshold, and calls a tool wrapped in `LongRunningFunctionTool`:

```python
tools=[
    lookup_vendor,
    draft_payout,
    LongRunningFunctionTool(func=request_approval),
    check_approval,
    post_payout,
    generate_voucher,
    ...
]
```

`request_approval` returns a pending handle and the runner *stops the turn*. The session sits idle. A human clicks Approve in a separate browser, the portal hits `POST /approve/{session}` on the server, the server writes the decision into session state via `append_event` with a `state_delta`, and on the next user turn the agent's `check_approval` tool reads it back. The agent then calls `post_payout` and `generate_voucher`. From the agent's perspective, the human approver was just slow.

A graph framework can model this — every framework can — but the cost is that the graph leaks across the suspend boundary. You end up writing a "resume" node and a polling node and a state-shaped retry. ADK absorbs the suspend into the same contract everything else uses. The agent emits a function call. Some time later, a function response shows up. The runner resumes. There is no second machinery.
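The suspend-and-resume flow can be sketched with a plain dictionary standing in for session state. This is a simplified illustration, not the cookbook's code: the real tools receive a `tool_context`, the real decision arrives through `append_event` with a `state_delta`, and `apply_human_decision`, the `approval` key, and the ticket fields here are hypothetical names chosen for the example.

```python
import uuid


def request_approval(state: dict, amount_inr: float) -> dict:
    # Long-running tool: record a pending ticket and return a handle at once.
    ticket_id = str(uuid.uuid4())
    state["approval"] = {
        "ticket_id": ticket_id,
        "amount_inr": amount_inr,
        "status": "pending",
    }
    return {"status": "pending", "ticket_id": ticket_id}


def apply_human_decision(state: dict, ticket_id: str, approved: bool) -> None:
    # Server side of the approve endpoint (hypothetical helper): write the
    # human's decision into the same state the agent will read later.
    approval = state.get("approval")
    if approval is None or approval["ticket_id"] != ticket_id:
        raise KeyError(f"unknown ticket {ticket_id}")
    approval["status"] = "approved" if approved else "rejected"


def check_approval(state: dict) -> str:
    # Tool the agent calls on its next turn to read the decision back.
    return state.get("approval", {}).get("status", "none")


# Usage: the turn stops after request_approval; the human acts out of band.
session_state: dict = {}
handle = request_approval(session_state, 75_000.0)
status_before = check_approval(session_state)
apply_human_decision(session_state, handle["ticket_id"], approved=True)
status_after = check_approval(session_state)
```

Nothing in the sketch polls or resumes a graph node. The agent-facing surface is two ordinary functions, and the only thing that changes between turns is state.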

## Live is the test case

The clearest place ADK's runtime bias pays off is `run_live`. Bidirectional streaming over Gemini Live is unforgiving — twin coroutines pumping in and out of a `LiveRequestQueue`, audio chunked at 20ms, interruption events fired by the model when the user barges in over the agent's reply, audio streams that have to be drained when that happens, sessions that resume across socket reconnects, context windows that compress when a long support call spans a hundred thousand tokens.

Here is the entire wiring on the server side of the payments voice agent:

```python
queue = LiveRequestQueue()
await asyncio.gather(
    _forward_browser_to_model(ws, queue),
    _forward_model_to_browser(ws, session.id, queue),
    return_exceptions=True,
)
```

Two coroutines. One reads JSON from the browser WebSocket, decodes PCM16-at-16kHz, and pushes blobs into the queue. The other consumes `runner.run_live(...)` and forwards parts to the browser:

```python
for part in (event.content.parts if event.content else []):
    if part.inline_data and part.inline_data.data:
        await ws.send_json({
            "kind": "audio",
            "data": base64.b64encode(part.inline_data.data).decode(),
        })
    if part.text:
        await ws.send_json({"kind": "transcript", "data": part.text})
    if part.function_call:
        await ws.send_json({"kind": "tool_call", "name": part.function_call.name, ...})
if getattr(event, "interrupted", False):
    await ws.send_json({"kind": "interrupted"})
```

Every signal you need to drive a real voice UI is on the event — the audio bytes, the transcript, the tool call mid-turn, the interruption flag, the turn-complete marker, the usage metadata. The runtime does not lecture you about what to do with them. It hands them over and gets out of the way.

The hard parts of voice are still hard. I had to fix four of them on the browser side, and the bugs were instructive. An audio worklet downsampling routine that dropped sample zero on every tick. A scheduled audio buffer that kept playing under the next reply because nothing was draining the `AudioBufferSourceNode` queue when the model fired `interrupted`. A `playheadRef` that went stale across reconnects because the new `AudioContext` started its clock at zero and the old playhead did not. A mic worklet posting at 375 messages per second because I had not coalesced the 128-sample render quantum into 20ms chunks before sending. Every one of those was a client-side bug, and every fix was small. The runtime side needed almost nothing.
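The 375-messages-per-second figure falls out of simple arithmetic, and the coalescing fix is a small buffer. The sketch below is Python rather than the worklet's JavaScript, and it assumes a 48 kHz capture rate, which the post does not state but which is consistent with 375 quanta per second at 128 samples each. The `Coalescer` class is a hypothetical illustration of the logic, not the cookbook's code.

```python
SAMPLE_RATE_HZ = 48_000  # assumption: not stated in the post
RENDER_QUANTUM = 128     # samples per worklet callback
CHUNK_MS = 20

# One message per render quantum versus one message per 20 ms chunk.
messages_per_second_uncoalesced = SAMPLE_RATE_HZ / RENDER_QUANTUM
chunk_samples = SAMPLE_RATE_HZ * CHUNK_MS // 1000
messages_per_second_coalesced = 1000 / CHUNK_MS


class Coalescer:
    """Buffer small render quanta and release fixed-size chunks."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size
        self._buffer: list = []

    def push(self, quantum: list) -> list:
        # Append the new samples, then emit every full chunk that is ready.
        self._buffer.extend(quantum)
        chunks = []
        while len(self._buffer) >= self.chunk_size:
            chunks.append(self._buffer[: self.chunk_size])
            del self._buffer[: self.chunk_size]
        return chunks


# Simulate one second of audio: 375 callbacks of 128 samples each.
coalescer = Coalescer(chunk_samples)
chunks_sent = sum(
    len(coalescer.push([0.0] * RENDER_QUANTUM)) for _ in range(375)
)
```

Because 960 is not a multiple of 128, chunk boundaries do not line up with callbacks, which is why the buffer carries a remainder between pushes instead of sending every Nth quantum.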

The one server-side fix I will call out, because it is the kind of bug a graph framework hides from you: when the browser-to-model coroutine raises on a closed socket, the model-to-browser coroutine can hang forever inside `run_live`'s generator if the queue is not closed. The fix is one line in a `finally` block, idempotent on both sides:

```python
finally:
    queue.close()
```

You can only write that line if the runtime exposes the queue as a first-class object. ADK does. LangGraph does not (it hides different machinery), and that is fine for many workloads, but voice is not one of them.

## The primitives that earned their keep

A short list, in the order they showed up in the cookbook, of the primitives I would not give back:

**1. `output_key` and state as the message bus.** The travel planner is a `SequentialAgent` containing a `ParallelAgent`. Each researcher writes a named brief into state. The composer reads all three. There is no plumbing. This is the part that scaled past one agent without effort.

**2. `LongRunningFunctionTool`.** The runner suspends on the function call and resumes on the function response — the same contract a normal tool uses, just stretched across human time. This is what makes HITL feel like a slow tool, not a state machine.

**3. `ParallelAgent` for fan-out.** The deep-research pipeline has three researchers running concurrently. Concurrency is declared by container, not by `asyncio.gather` in your tool code. That separation matters when you want to add a fourth researcher.

**4. `MCPToolset`.** The knowledge desk has zero hand-written tools. It points `MCPToolset` at the official `@modelcontextprotocol/server-filesystem` binary, scopes the root to the cookbook's `docs/` directory, allows read-only tools, and that is the agent. Swap the binary for `@playwright/mcp` or a Slack MCP server and you have a different agent. The constructor is identical.

**5. Artifacts.** The payout desk saves a rendered PDF voucher with `tool_context.save_artifact`, the runner emits an `artifact_delta` on the event, and the server picks it up and serves it from `/artifact/{session}/{filename}`. There is no separate file-handling contract. Artifacts are a kind of event.

**6. The introspect surface.** Every agent in the cookbook ships `/introspect` — a JSON dump of the agent tree, the tools, the model, and the planner. The portal renders it as a live diagram. This was a one-page helper, not a feature, and it is the reason the agents debug themselves.

**7. Session resumption + context compression.** A flaky network does not restart a Live call. A long support narration does not blow out the window. The two `RunConfig` knobs that turn this on are five lines combined.

## A2A is anticlimactic, in the right way

The two-process loan desk in the cookbook is the example I expected to learn the most from. The loan officer runs on port 8007, the credit bureau on port 8017, and the officer's `request_credit_report` tool calls the bureau over HTTP via `httpx`. Each side has its own `/health`, `/metrics`, `/introspect`, `/session`, and `/chat/{session_id}`.

What I learned is that A2A federation is not a primitive you reach for. It is what you get for free when both sides happen to be ADK agents. There is no wire protocol you need to read up on, no handshake to debug. The bureau is a FastAPI server with one extra endpoint (`POST /score`) that the officer's tool posts to. The "federation" is two HTTP servers and an agreed JSON shape.

This is the right outcome. Multi-agent federation, in the wild, has to work across vendors and frameworks. It cannot be a special protocol that only works when both sides bought into the same framework. Treating A2A as "you have already shipped two agents — now make one call the other" is correct, even if it makes for a less impressive diagram.

## Where ADK is awkward

Three places, in decreasing order of severity.

**Metrics on streaming.** ADK's events are clean, but the `usage_metadata` they carry is *cumulative* over the turn, not delta — and on long-running streaming turns it shows up on partial events too. If you naively sum `prompt_token_count` across events, you will overcount by an order of magnitude. The fix is to gate on `event.partial == False` before recording usage, and to differentiate input/cached/tool-use (which Gemini sends as running totals) from output/thinking (which arrive in deltas). This is one of those documentation-level facts that you only learn by writing a metrics ribbon and watching the numbers go absurd.
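A minimal sketch of the overcount and the gate, using stand-in dataclasses rather than real ADK events so it runs on its own. The `Usage` and `Event` classes and the token numbers are hypothetical. The gate here is written as `not event.partial`, which also treats a missing flag as final; whether that matches every event the runner emits is an assumption worth checking against your own traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:  # stand-in for usage_metadata
    prompt_token_count: int


@dataclass
class Event:  # stand-in for a runner event
    partial: Optional[bool]
    usage_metadata: Optional[Usage]


def naive_prompt_tokens(events) -> int:
    # Overcounts: adds up a running total on every event that carries usage.
    return sum(
        e.usage_metadata.prompt_token_count for e in events if e.usage_metadata
    )


def gated_prompt_tokens(events) -> int:
    # Records usage only from non-partial events.
    return sum(
        e.usage_metadata.prompt_token_count
        for e in events
        if e.usage_metadata and not e.partial
    )


# One streaming turn: three partial events, then the final one.
turn = [
    Event(partial=True, usage_metadata=Usage(1200)),
    Event(partial=True, usage_metadata=Usage(1200)),
    Event(partial=True, usage_metadata=Usage(1200)),
    Event(partial=False, usage_metadata=Usage(1200)),
]
naive_total = naive_prompt_tokens(turn)
gated_total = gated_prompt_tokens(turn)
```

With only four events the naive sum is already four times too high; a long streaming turn with dozens of partials is where the order-of-magnitude error comes from.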

**Tokens-per-second is non-trivial.** On a streaming turn, TPS is `output_tokens / (turn_complete_at - first_token_at)`. On a non-streaming turn — like a `SequentialAgent` whose final composer event arrives in one shot — `first_token_at ≈ turn_complete_at`, the denominator is microseconds, and you get nonsense rates. Mine was hitting impossibly high TPS in a live demo before I added a 50ms threshold and a fallback to total turn duration. Trivial to fix once you see it.
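The threshold-and-fallback rule can be written in a few lines. This is a sketch of the rule as described, not the cookbook's code; the function name and the `turn_started_at` parameter are assumptions, and timestamps are taken to be seconds.

```python
MIN_STREAM_WINDOW_S = 0.050  # the 50 ms threshold described above


def tokens_per_second(
    output_tokens: int,
    turn_started_at: float,
    first_token_at: float,
    turn_complete_at: float,
) -> float:
    # Streaming case: measure over the window in which tokens were arriving.
    stream_window = turn_complete_at - first_token_at
    if stream_window >= MIN_STREAM_WINDOW_S:
        return output_tokens / stream_window
    # One-shot case: the window is too small to trust, so fall back to the
    # total turn duration instead of dividing by microseconds.
    total = turn_complete_at - turn_started_at
    return output_tokens / total if total > 0 else 0.0


streaming_tps = tokens_per_second(200, 0.0, 0.5, 2.5)
one_shot_tps = tokens_per_second(200, 0.0, 3.9999, 4.0)
```

The fallback number is a different metric (it includes time to first token), so a dashboard that mixes both should probably label which one it is showing.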

**Tool docstrings are the tool description.** This is correct API design and also a footgun. The model reads the docstring as the tool spec. A lazily written docstring becomes a lazily described tool. A tool that takes `amount_inr: float` with no docstring will be selected for "send money" queries with cheerfully wrong unit assumptions. ADK does not tell you this because Python conventions imply it. The model has no way to know your docstring is a placeholder. Lint accordingly.
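One way to "lint accordingly" is a check that flags tool functions whose docstring is missing or too short to act as a spec. The sketch below is an assumption about what such a lint could look like: `undocumented_tools`, the 40-character floor, and both example signatures are hypothetical, and a length check says nothing about whether the docstring is accurate.

```python
import inspect


def send_money(amount_inr: float) -> dict:  # deliberately undocumented
    return {"status": "queued", "amount_inr": amount_inr}


def draft_payout(vendor_id: str, amount_inr: float) -> dict:
    """Draft a vendor payout for review; does not move any money.

    Args:
        vendor_id: Internal vendor identifier.
        amount_inr: Amount in Indian rupees (not paise), e.g. 50000.0.
    """
    return {"vendor_id": vendor_id, "amount_inr": amount_inr, "status": "draft"}


def undocumented_tools(tools, min_chars: int = 40) -> list:
    # Flag callables whose docstring is missing or too short to describe
    # purpose, units, and side effects to the model.
    flagged = []
    for tool in tools:
        doc = inspect.getdoc(tool) or ""
        if len(doc) < min_chars:
            flagged.append(tool.__name__)
    return flagged


flagged = undocumented_tools([send_money, draft_payout])
```

Run as a test in CI, a check like this fails the build before a placeholder docstring reaches the model as a tool description.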

The first two are runtime artifacts. The third is a discipline problem you inherit when the framework respects Python idioms. I will take all three.

## Where ADK fits, and where I would still reach for something else

The shape of the project matters more than the team's framework preferences.

**Reach for ADK when:**

- You expect to ship more than one agent. The contract is what compounds across them. A single-agent prototype does not exercise it.
- You have a Live workload — voice, video, mid-turn tool calls, barge-in. Nothing else I have used handles this with as little ceremony.
- You want HITL or long-running operations as first-class. The `LongRunningFunctionTool` + session state pattern is genuinely small.
- You want to wire up MCP servers without writing custom tool adapters.
- Your team is already on Vertex / Gemini for other reasons. The alignment is real — `gemini-3.1-flash-lite-preview` for tool-heavy hops and `gemini-3-flash-preview` for orchestrators is a tier you can actually reason about cost-wise.

**Reach for something else when:**

- The agent is fundamentally a pipeline of deterministic transformations with one LLM step in the middle. A graph framework, or even plain Python, is a better fit. Do not over-frame.
- You need a UI-rendering DSL more than a runtime — that is what Vercel's AI SDK is for, and ADK has no opinion on the front end.
- You are committed to OpenAI or Anthropic and unwilling to use Gemini for the orchestration tier. ADK is model-pluggable in principle, but the Live story and the cost story both lean Gemini.

These are not deep claims. They are the rules I have applied twice in the last month and not regretted.

## The heuristics I would use again

A short list, derived from getting eleven of these out the door.

**Compose by container, not by callback.** Sub-agents talk through state. If you find yourself wiring up a callback between two `LlmAgent`s, you have invented a new contract. Use `output_key` and let the parent `SequentialAgent` decide who runs next.

**Make every agent ship `/introspect` and `/metrics` from day one.** Not because you need them, but because they are the surface a debugger and a demo both need. The thirty lines you write once will save you days.

**Keep the session boundary clean.** The `tool_context.state` is the wire. Anything you put there is what your sub-agents can see. Anything you do not put there does not exist.

**For Live, treat the queue as the load-bearing object.** Open it, close it in `finally`, and remember that the model-to-browser side dying first is just as common as the browser-to-model side. Gather both, return exceptions, idempotent close.

**Gate metrics on `event.partial == False`.** Cumulative counters on streaming events are the single most common metrics bug you will write. Save yourself.

**Pick the lightest model tier for the leaves and reserve the stronger tier for the composer.** The travel planner runs three concurrent researchers on `gemini-3.1-flash-lite-preview` and one composer on `gemini-3-flash-preview`. The latency and cost shape is dramatically better than running everything at the same tier, and the quality is indistinguishable on the leaves.

---

The part of ADK I had not anticipated, and that has stayed with me, is how much of agent engineering is *not* about the model. It is about events, state, transport, and where the suspend boundaries fall. The frameworks I used previously made me responsible for those boundaries while pretending I was responsible for a graph. ADK reverses that. It hands you the runtime and lets the graph emerge from how you compose.

Whether that bet ages well will depend on how Live, A2A, and MCP evolve. Those three are the bets ADK is built around, and they are the ones I would watch. The shape of the runtime, though, looks durable. It is the same shape the wire has, and that is rarely the wrong shape to write your code in.

---

*The eleven agents are open source at [github.com/vmishra/Google-ADK-Cookbook](https://github.com/vmishra/Google-ADK-Cookbook). The browser portal renders all of them side by side, with live metrics, a trace-a-request animation, and an introspect-driven architecture diagram. Pull requests welcome — keep the editorial voice.*


---


### Built to Be Cited: An Engineer's Guide to AEO and GEO in 2026

- **Published:** 2026-04-19 (updated 2026-04-19)
- **URL:** https://vikasmishra.ai/blog/built-to-be-cited-aeo-geo-engineering-guide-2026/
- **Markdown source:** https://vikasmishra.ai/blog/built-to-be-cited-aeo-geo-engineering-guide-2026/index.md
- **Tags:** AEO, GEO, SEO, Google AI Mode, Gemini, LLM Optimization

> An implementation-grade walkthrough of AI-search optimization in 2026, centered on Google AI Mode and Gemini, with the schema graph, llms.txt work, IndexNow wiring, and the debugging sessions that shipped the result — drawn from rebuilding my own site.


*Disclaimer: All opinions expressed in this post are my own and do not represent the views or positions of my employer.*

I rebuilt vikasmishra.ai over the last two weeks, not for a redesign, but for a category of search that didn't exist the last time I touched the SEO. Ranking on the blue links stopped being the metric. Inclusion in an AI-generated answer became the metric. Google AI Mode and AI Overviews, ChatGPT search, Perplexity, Claude, Bing Copilot — the six surfaces that matter today all do a variation of the same thing: retrieve candidate passages, re-rank them on a platform-specific quality model, generate an answer that quotes or paraphrases the highest-ranked ones, and attach citations to the sources the generation actually used. What the surfaces reward and how they cite differs; the architecture converges.

I work on cloud and AI platforms at Google. The customers I advise are mostly large digital-native businesses — marketplaces and e-tailers with tens of millions of SKUs, food-delivery platforms doing millions of orders a day, online travel and hospitality platforms with real-time inventory and pricing, and fintechs operating in the YMYL ("Your Money or Your Life") category where AI answer accuracy is a regulatory concern, not just a marketing one. The question every one of them is asking is the same: *how do we show up when a consumer asks AI Mode "best running shoes under $100," or Gemini "what's the cheapest flight from New York to Tokyo next weekend," or ChatGPT "which payment app is safest for recurring subscriptions"?* Much of the public writing on Answer Engine Optimization (AEO) and Generative Engine Optimization (GEO) is surface-level — trend surveys, long lists of "factors," limited implementation depth. The engineering underneath is concrete and testable. The way to get it right is to build it on a site you control, read the Search Console signals that come back, and iterate against what the surfaces actually index.

This post is the engineering write-up. Where my site changed, I'll show the diff. Where the conventional advice is wrong or too shallow, I'll say so.

A note on shelf life. The specific numbers and product names in this post (AI Overview citation rates, crawler names, dashboard features, exact schema properties) will drift; they change on quarterly cycles. The **architecture** underneath doesn't. Retrieval-plus-generation systems reward the same properties they've rewarded since the first RAG systems shipped: structured, verifiable, fresh, entity-resolvable, cleanly fetchable content. Read the sections below as "here is how a retrieval-plus-generation system evaluates my site right now," not "here is a list of tactics that will look the same in 2028." The tactics evolve. The substrate principles don't.

## The shift, priced in three numbers

**Citation is now worth more than rank.** Content cited in a Google AI Overview earns roughly **35% more organic clicks** than the page that ranks #1 for the same query without a citation, and **91% more paid clicks** in adjacent ad slots ([wellows.com](https://wellows.com/blog/google-ai-overviews-ranking-factors/)). Nearly half of AI Overview citations come from pages ranking below position #5, so the citation surface is genuinely independent of classical rank — a page ranking #8 can show up in the AI Overview while the #1 doesn't.

**Bing is the ChatGPT pipeline.** Approximately **92% of ChatGPT search queries route through the Bing Search API**, and **87% of SearchGPT citations match Bing's top organic results** ([cbwebsitedesign.co.uk](https://www.cbwebsitedesign.co.uk/geo-ai/how-to-rank-on-chatgpt-and-bing-copilot-in-2026-full-guide/)). Optimizing for ChatGPT search is, mechanically, optimizing for Bing first.

**LinkedIn is now the #2 LLM citation source.** As of early 2026, LinkedIn has overtaken Wikipedia in LLM citation frequency, ahead of every major news publisher ([wireinnovation.com](https://wireinnovation.com/mastering-seo-entities/)). Surface area on LinkedIn now feeds the LLM training and retrieval pipelines that produce your AI citations elsewhere.

Every optimization below is downstream of one of those three numbers. The magnitudes will shift as the surfaces evolve; the *direction* — citation over rank, push-indexing over wait-for-crawl, verified-entity authorship over anonymous content — is durable.

## Google AI Mode and the Gemini pipeline (the surface that matters most)

Because Google AI Mode is the primary consumer-facing AI search surface and the one most of my customers' organic traffic depends on, it deserves more than a paragraph.

AI Mode is powered by Gemini's **query fan-out** architecture: the user's query is decomposed into roughly a dozen parallel sub-queries, each retrieved independently against the Google index (plus topical sub-indexes), then the retrieved passages from all fan-out branches are merged, re-ranked, and compressed into the answer ([upgrowth.in](https://upgrowth.in/google-ai-mode-optimization-the-complete-guide-for-2026/)). A page doesn't compete for "one query" anymore — it competes across the entire decomposition surface.

The practical implications are specific:

**Passage-level structure wins over page-level optimization.** Because fan-out retrieves passages, not pages, a single sharp paragraph that answers a sub-query can be cited even if the rest of the page is mediocre. Google's AI Overview generation favors self-contained passages in the **134–167 word range** — the "semantic unit" size the synthesizer prefers for a standalone answer ([wellows.com](https://wellows.com/blog/google-ai-overviews-ranking-factors/)). Write each H2 section as a passage that, read in isolation, still answers the sub-question its heading implies.

**E-E-A-T flows through the Knowledge Graph.** Roughly **96% of AI Overview content comes from sources Google considers verified entities** ([clickrank.ai](https://www.clickrank.ai/e-e-a-t-and-ai/)). "Verified entity" here means Google's Knowledge Graph can cross-reference the author across multiple authoritative sources — LinkedIn, GitHub, Google Scholar, Wikidata, Credly, Crunchbase, academic repositories, organizational profiles. If the graph can't confidently resolve who wrote your page, the page gets filtered before generation, not at the ranking stage. This is the single largest gap between "ranks well" and "gets cited." (More on the schema graph below.)

**Entity density beats keyword density.** Gemini's grounding layer evaluates content by how many *connected* entities a page references. Pages with 15 or more entities that Google's Knowledge Graph can resolve show roughly **4.8× higher selection probability** for AI Overview citation compared to pages built around keywords ([hypesuite.ai](https://www.hypesuite.ai/post/google-knowledge-graph-essentials-what-every-seo-pro-should-know)). The practical move: when you mention a technology, product, person, or concept, use the Wikipedia-canonical name and, where natural, link to the authoritative source. You are not writing for a lexical matcher; you are writing for an entity resolver.

**The April 2026 algorithm update tightened two things.** First, first-hand experience signals (original research, primary data, named-author commentary) were weighted up. Second, the site-level Core Web Vitals aggregation introduced in March means one slow section can suppress the entire domain's AI Overview eligibility ([seovendor.co](https://seovendor.co/google-april-2026-algorithm-updates)). Substrate quality is now a gating factor, not a tiebreaker.

**Google's crawlers for the AI pipeline are separate from classic Googlebot.** You want all of these allowed in robots.txt: `Googlebot` (classical index), `Google-Extended` (Gemini training; a robots.txt control token rather than a crawler with its own user-agent string), `Google-CloudVertexBot` (Vertex AI grounding), and `Gemini-Deep-Research` (the research agent inside AI Mode that performs multi-step retrieval for complex queries). Blocking `Google-Extended` removes you from Gemini training without affecting classical Google rank. A few large publications have made that choice, and it costs them AEO visibility.
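A quick way to confirm that a robots.txt does not accidentally block any of the four user agents listed above is to parse it with Python's standard-library `urllib.robotparser`. The robots.txt content, the `/admin/` rule, and the `example.com` URL below are hypothetical. The standard-library matcher is simpler than Google's own, so treat this as a sanity check rather than a faithful emulation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that explicitly allows the four agents.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Google-CloudVertexBot
Allow: /

User-agent: Gemini-Deep-Research
Allow: /

User-agent: *
Disallow: /admin/
"""

AI_PIPELINE_AGENTS = [
    "Googlebot",
    "Google-Extended",
    "Google-CloudVertexBot",
    "Gemini-Deep-Research",
]


def blocked_agents(robots_txt: str, url: str, agents=AI_PIPELINE_AGENTS) -> list:
    # Parse the file from text and list every agent that may not fetch the URL.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


blocked = blocked_agents(ROBOTS_TXT, "https://example.com/blog/post/")
opted_out = blocked_agents(
    "User-agent: Google-Extended\nDisallow: /\n",
    "https://example.com/blog/post/",
)
```

A check like this fits naturally in a deploy pipeline, since robots.txt regressions tend to come from template or CDN changes rather than deliberate edits.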

### The operational dashboard: Search Console for AI Mode

Google Search Console remains the primary dashboard for AI Mode performance, with two reports worth watching weekly:

1. **Search performance → Search Appearance filter.** AI Overview appearances are now a filter value. Query-level impressions for "Search Appearance: AI Overview" tell you which queries you're being cited on and which you're not.
2. **Enhancements → Breadcrumbs / FAQ / Article.** Schema validation errors here directly affect AI Mode eligibility. A critical-severity breadcrumb error removes the breadcrumb from AI Overview cards; a malformed `BlogPosting` can remove the author attribution entirely.

When I first validated this site in Search Console I had three invalid breadcrumb items — the error was *"Either 'name' or 'item.name' should be specified in 'itemListElement'."* The cause was mundane and the kind of bug that's easy to miss: on root-level single pages (`/about/`, `/contact/`, `/privacy/`) my `.Section` variable was the empty string, and the template emitted a ListItem with `"name": ""` and `"item": "https://vikasmishra.ai//"` (double slash). The fix was to detect the root-level case and skip the middle breadcrumb entirely, emitting `[Home, Title]` instead of trying to manufacture a section level that doesn't exist. Search Console picked the fix up within 24 hours and flipped the items from Invalid to pending-revalidation. I mention the debug trace because it's representative — almost every real AEO problem is a template edge case that the validator finds once you send it the right URL.
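The edge case is easy to express as code. The site's actual fix lives in its page template, which is not shown here; the sketch below restates the same logic in Python, with `breadcrumb_jsonld` and its parameters as hypothetical names. The key branch is skipping the middle crumb when the section is empty, which avoids both the empty `name` and the double-slash URL.

```python
def breadcrumb_jsonld(base_url: str, section: str, title: str, page_path: str) -> dict:
    base = base_url.rstrip("/")
    crumbs = [("Home", f"{base}/")]
    # Root-level pages have no section: emit [Home, Title] and skip the
    # middle crumb instead of manufacturing an empty one.
    if section:
        crumbs.append((section.title(), f"{base}/{section}/"))
    crumbs.append((title, f"{base}/{page_path.strip('/')}/"))
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": pos, "name": name, "item": url}
            for pos, (name, url) in enumerate(crumbs, start=1)
        ],
    }


about = breadcrumb_jsonld("https://vikasmishra.ai", "", "About", "/about/")
post = breadcrumb_jsonld(
    "https://vikasmishra.ai", "blog", "Some Post", "/blog/some-post/"
)
```

Running the builder over one URL of each page type (root-level, section index, leaf) before deploying tends to surface the same class of bug the validator would otherwise report days later.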

## What this looks like at digital-native scale

A personal blog is a clean pedagogical example but the real stakes are at catalog scale. The substrate patterns are the same; the execution changes in ways that are worth walking through, because "add schema" is a one-liner for a blog and a multi-quarter platform project for a marketplace.

**E-commerce and marketplaces (millions of SKUs).** The citation queries that move revenue are category-level and comparison-level: *"best noise-cancelling headphones under $200,"* *"Adidas vs Nike running shoes for marathon training,"* *"cheapest 55-inch 4K TV with HDMI 2.1."* Gemini answers these by retrieving passages from product detail pages (PDPs), category pages, and editorial buyer's-guide content, then grounding them against Product schema and current Offer schema. The work to be citable:

- **`Product` + `Offer` + `AggregateRating` schema on every PDP**, with price, availability, SKU, GTIN, brand, and review aggregate. Gemini's grounding layer validates prices against the schema before citing — stale schema makes the page ineligible, not lower-ranked. The `priceValidUntil` field matters more than most implementations realize.
- **`ItemList` schema on category pages** with ordered `Product` references. This is how category pages show up in AI Mode's comparison answers.
- **Fresh pricing pushed via IndexNow on every price or stock change.** At catalog scale this means wiring IndexNow into the PIM (product information management) layer, not the CMS. A price that changes every 10 minutes in your inventory system but updates in Google's index 48 hours later is a citation loss every time someone asks Gemini about the item in that window.
- **Category-scoped `llms.txt` rather than a monolithic one.** A single `llms-full.txt` for a million-SKU catalog is useless to an AI agent. Split by category: `/electronics/llms.txt`, `/fashion/llms.txt`, each listing the top N canonical pages in that category with descriptions. Category-level editorial content (buyer's guides, brand pages) belongs in these files; individual PDPs do not.
- **Editorial buyer's-guide content is the highest-leverage content investment.** Category "best X" guides with real reviewer credentials, passage-level structure, and `Review` + `AggregateRating` schema get cited by AI Mode at rates that PDPs never do. Vertical-specialist marketplaces that leaned into this early pick up citation share that generalist marketplaces miss; Nykaa's beauty category content is a strong example, with reviewer-attributed comparisons and ingredient-level schema on each guide. The rule of thumb for most marketplaces: one strong editorial guide per category outperforms ten thousand incremental PDP metadata optimizations.
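As a rough illustration of the PDP bullet above, here is a minimal sketch that maps one system-of-record row to `Product` + `Offer` + `AggregateRating` JSON-LD. The schema.org property names are real; the `record` field names, the one-day price TTL, and the placeholder GTIN are assumptions made for the example.

```python
import json
from datetime import date, timedelta


def product_jsonld(rec: dict, today: date, price_ttl_days: int = 1) -> dict:
    """Map one system-of-record row to Product + Offer + AggregateRating."""
    availability = "InStock" if rec["stock"] > 0 else "OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "@id": f"{rec['url']}#product",
        "name": rec["name"],
        "sku": rec["sku"],
        "gtin13": rec["gtin13"],
        "brand": {"@type": "Brand", "name": rec["brand"]},
        "offers": {
            "@type": "Offer",
            "url": rec["url"],
            "price": f"{rec['price']:.2f}",
            "priceCurrency": rec["currency"],
            "availability": f"https://schema.org/{availability}",
            # Short validity window: the price is only asserted until the
            # next expected refresh (the TTL is an assumption of this sketch).
            "priceValidUntil": (today + timedelta(days=price_ttl_days)).isoformat(),
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rec["rating"],
            "reviewCount": rec["review_count"],
        },
    }


record = {  # hypothetical PIM row with placeholder identifiers
    "url": "https://shop.example/p/anc-100", "name": "ANC-100 Headphones",
    "sku": "ANC-100", "gtin13": "0000000000000", "brand": "ExampleAudio",
    "price": 149.0, "currency": "USD", "stock": 12,
    "rating": 4.4, "review_count": 318,
}
doc = product_jsonld(record, today=date(2026, 1, 15))
print(json.dumps(doc, indent=2))
```

Generating the block from the inventory row, rather than from CMS fields, is what keeps `price`, `availability`, and `priceValidUntil` in step with the operational state.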

**Online food delivery.** Three-way local intent dominates: user location, cuisine, delivery-time tolerance. Queries look like *"best ramen near me,"* *"DoorDash vs Uber Eats for vegan delivery in Brooklyn,"* *"which late-night restaurants deliver after midnight in Shoreditch."* Across markets the pattern is identical — what changes is the platform name (Zomato, DoorDash, Deliveroo, Uber Eats, Just Eat, Swiggy) and the city. The most mature implementations I've looked at tend to ship richer `Restaurant` and `Menu` schema coverage than the average — Zomato's restaurant entity graph, with cuisines, menus, reviews, and neighborhood `LocalBusiness` signals all tied together, is a useful reference point for what "done well" looks like in this vertical. The work:

- **`Restaurant` + `LocalBusiness` + `Menu` + `MenuItem` schema** with hours, cuisines, price range, and accurate geo coordinates. The geo coordinates feed AI Mode's local grounding and Google Maps retrieval simultaneously.
- **`AggregateOffer` schema for delivery zones and current availability.** Platforms that ship this get cited; platforms that don't get omitted when the user's intent includes "deliver now."
- **Freshness of availability is a hard gate.** If your `availability: InStock` or `availability: OutOfStock` is stale, Gemini's answer will either be wrong (citation liability) or Gemini will skip you (citation loss). Real-time schema updates via IndexNow or structured-data sitemaps are table stakes here, not a nice-to-have.
- **Reviews are the single largest local-citation lever.** `Review` schema tied to `Restaurant` entities, with verified reviewer signals (name, date, platform `@id`). Don't fake it — `Review` spam is the highest-detection-rate category of structured-data fraud and Google's penalty pipeline routes through manual actions, not just rank suppression.

**Travel (OTAs, airlines, hotel aggregators).** Real-time inventory, dynamic pricing, heavy local intent, and a citation economy where being the attributed source for *"best time to visit Kyoto,"* *"cheapest flight from London to Lisbon next month,"* *"hotels near the Colosseum with airport transfer"* is worth significant revenue. The travel category is unusually global in its schema practices — the larger OTAs and aggregators (MakeMyTrip, Booking.com, Expedia, Airbnb, Trip.com, Kayak) all ship structured data at a comparable level of maturity, and the competitive differentiation is no longer in the basics but in destination-entity linking, fare freshness, and editorial content quality. MakeMyTrip in particular has invested heavily in the destination-editorial side (travel guides with structured `TouristDestination` linking and named-author bylines), which is the surface that Gemini's destination queries reach for. The work:

- **`Flight`, `LodgingReservation`, `TouristDestination`, `Trip` schema** with accurate schedules, prices, and availability.
- **Rich editorial destination content with `Place` + `TouristDestination` entity linking** to Wikidata or Google Places. AI Mode's destination answers lean heavily on entity-resolved places; descriptions that name places by their Wikipedia-canonical names and link to authoritative sources win citation weight.
- **Fare freshness.** Travel is the vertical where `dateModified` matters most. A fare page updated 48 hours ago will not be cited when Gemini can find the same fare updated 48 minutes ago on a competitor. The IndexNow pipeline needs to be wired into the fare cache, not the content CMS.
- **Review schema on properties and itineraries, with verified traveler signals.** Same cautions as food delivery — real reviews win, manufactured ones eventually trigger manual action.

**Fintech (payments, lending, insurance, investing).** The YMYL category — "Your Money or Your Life." Google's E-E-A-T evaluation is strictest here and AI citation eligibility is gated on verified authorship and regulatory disclosures. Fintech customers who ignore this show up in AI Overviews at a fraction of the rate of their YMYL-disciplined competitors. The work:

- **Author bylines are not optional**; they are the primary eligibility signal. Every article, blog post, help doc, and product page that explains a financial concept needs a named author with a `Person` schema that includes verified credentials: `hasCredential` entries whose `credentialCategory` names the credential type (license, certification, degree) and whose `recognizedBy` points at the relevant regulator (SEC or FINRA in the US, FCA in the UK, BaFin in Germany, ESMA at the EU level, MAS in Singapore, RBI or SEBI in India) or at a recognized professional body in the jurisdiction.
- **`FinancialProduct` + `FinancialService` + `BankAccount` + `LoanOrCredit` schema** with full disclosure fields: interest rates, fees, terms, and regulator registration IDs.
- **`Dataset` + `Article` citation chains** for any statistic your content relies on. Unsourced financial claims are filtered before generation in Gemini's YMYL quality pipeline. Sourced ones with `citation` schema pointing at the primary data provider (central bank data, regulator disclosures, government statistics agencies, consumer protection reports) are both eligible and preferred.
- **Help Center and FAQ pages are the citation workhorses**, not blog posts. The queries that matter — *"is a mandate-based recurring payment safe,"* *"how do I dispute a credit card charge,"* *"what happens if I miss a tax filing deadline"* — resolve to help-center content in Gemini's answers. FAQ schema on those pages is the one place FAQ schema is still unambiguously worth shipping.
- **`Organization` with regulator registrations in `sameAs`** and accurate `address`, `vatID`, `taxID`, `legalName` fields. For EU brands, the `vatID` and the registered-office `address` are primary entity-resolution signals; for US brands, the SEC EDGAR CIK and the state-of-incorporation registration are the equivalents. Make the cross-reference easy for Google's entity resolver to perform.
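The author-credential bullet above can be sketched as a small JSON-LD builder. This is an illustration under assumptions: `author_jsonld`, the author, and the credential are hypothetical, and `credentialCategory` is used here to name the credential type. `EducationalOccupationalCredential` and `recognizedBy` are the schema.org terms.

```python
import json


def author_jsonld(person_id: str, name: str, credentials: list) -> dict:
    """Person node whose credentials each name the body that recognizes them."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "@id": person_id,
        "name": name,
        "hasCredential": [
            {
                "@type": "EducationalOccupationalCredential",
                "name": c["name"],
                "credentialCategory": c["category"],
                "recognizedBy": {
                    "@type": "Organization",
                    "name": c["regulator"],
                    "url": c["regulator_url"],
                },
            }
            for c in credentials
        ],
    }


author = author_jsonld(
    "https://fintech.example/authors/jane/#person",  # hypothetical author @id
    "Jane Example",
    [{
        "name": "Example adviser registration",  # hypothetical credential
        "category": "license",
        "regulator": "Securities and Exchange Board of India",
        "regulator_url": "https://www.sebi.gov.in/",
    }],
)
# Articles then reference the author by @id instead of restating the fields.
article_author_ref = {"@id": author["@id"]}
print(json.dumps(author, indent=2))
```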

Across all four verticals, one pattern: the **schema data plane has to be wired into the operational system of record**, not into the CMS. Prices, inventory, availability, fares, menus, interest rates, property photos — all of these change outside the CMS. If your schema updates lag the operational state by more than a few minutes, your AI-citation eligibility lags accordingly. This surfaces as an SEO problem but the root cause is platform engineering. Teams that scope the work only as a content-marketing program tend to underinvest in the substrate that actually determines whether the content can be cited at all.
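One way to wire the schema data plane to the system of record is to fingerprint only the fields that surface in structured data and push a URL when that fingerprint changes. The sketch below assumes a hypothetical record shape and an in-memory `seen` store; the 10,000-URL batch size follows IndexNow's documented per-request limit.

```python
import hashlib
import json
from typing import Iterator

SCHEMA_FIELDS = ("price", "currency", "stock")  # fields mirrored into JSON-LD (assumed)
MAX_URLS_PER_POST = 10_000  # IndexNow's documented per-request limit


def fingerprint(record: dict) -> str:
    """Stable hash over only the fields that appear in the page's schema."""
    subset = {k: record[k] for k in SCHEMA_FIELDS}
    return hashlib.sha256(json.dumps(subset, sort_keys=True).encode()).hexdigest()


def changed_urls(records: list, seen: dict) -> list:
    """Return URLs whose schema-relevant state differs from the last push."""
    out = []
    for rec in records:
        fp = fingerprint(rec)
        if seen.get(rec["url"]) != fp:
            seen[rec["url"]] = fp
            out.append(rec["url"])
    return out


def batches(urls: list, size: int = MAX_URLS_PER_POST) -> Iterator[list]:
    """Split a URL list into request-sized chunks."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]


seen = {}
rows = [{"url": "https://shop.example/p/1", "price": 10.0,
         "currency": "USD", "stock": 3, "description": "a"}]
first = changed_urls(rows, seen)    # first sighting: pushed
rows[0]["description"] = "b"        # not a schema field
second = changed_urls(rows, seen)   # nothing to push
rows[0]["price"] = 9.5              # schema field changed
third = changed_urls(rows, seen)    # pushed again
```

Keying the push on schema-relevant fields avoids flooding the endpoint with URLs whose structured data did not change.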

## What the other surfaces do differently

**ChatGPT search and Bing Copilot** share a backend. The path is: your page → Bing index → Bing's AI re-ranker → OpenAI's or Microsoft's generation layer → citation. The February 2026 *AI Performance* report inside [Bing Webmaster Tools](https://www.bing.com/webmasters) is the only citation-attribution dashboard any search platform has shipped so far — it tells you, per URL, how often your page was cited by Microsoft Copilot and Copilot-powered partner surfaces ([blogs.bing.com](https://blogs.bing.com/webmaster/February-2026/Introducing-AI-Performance-in-Bing-Webmaster-Tools-Public-Preview)). Set up Bing Webmaster Tools if you haven't. The IndexNow key (below) verifies ownership automatically.

**Perplexity** runs its own crawler (PerplexityBot) plus partner feeds, and its re-ranking model is the most freshness-biased of the four. Perplexity will prefer a page that was updated last week over a page with higher authority that was updated last year when the factual content is comparable. Stale `dateModified` drops you from Perplexity more aggressively than from Google.

**Claude** is the least transparent. ClaudeBot crawls aggressively but Anthropic doesn't publish source-selection behavior. The defensive play is to make your content maximally fetchable for any agentic system — see the markdown-alternate section below.

You don't choose one surface to optimize for. Most of the substrate work counts across all of them.

## The schema graph (and the @id mistake that breaks it in production)

Almost every "AEO best practices" post recommends adding `Article` schema and `Person` schema. That advice is incomplete in a way that breaks the schema in production.

Structured data on a modern site is a **graph**. Each entity gets a stable `@id` (typically a URL fragment such as `https://vikasmishra.ai/#person`). Other entities reference that `@id` instead of restating the underlying fields. A `BlogPosting`'s `author` field points at the `Person`'s `@id`; the `Person`'s `mainEntityOfPage` points at the `ProfilePage`'s `@id`; the `ProfilePage`'s `mainEntity` points back at the `Person`. The `WebSite` is declared once and every `BlogPosting` is `isPartOf` it.

What goes wrong: most sites declare the `Person` schema only on the home page or only on `/about/`, but reference `#person` from the `Article` schema on every blog post. **The reference dangles.** When Google's structured-data extractor fetches a blog post, it sees `"author": {"@id": "https://example.com/#person"}` and no `#person` entity on that page. The author claim doesn't resolve to a named Person; the Person's credentials and `sameAs` graph don't attribute to the article; the authorship signal Google's E-E-A-T evaluator explicitly weighs is absent for every post.
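A dangling reference of this kind can be detected mechanically. The following is a minimal sketch, not a full JSON-LD processor: it treats an object containing only `@id` as a reference and any richer object carrying an `@id` as a declaration, then reports references with no declaration on the same page. The function names and the sample page are hypothetical.

```python
import json
import re

LD_JSON = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def walk(node, declared: set, referenced: set) -> None:
    """Collect @ids that are declared (node has other keys) or only referenced."""
    if isinstance(node, dict):
        node_id = node.get("@id")
        if node_id is not None:
            # {"@id": "..."} alone is a reference; anything more is a declaration.
            (referenced if set(node) == {"@id"} else declared).add(node_id)
        for value in node.values():
            walk(value, declared, referenced)
    elif isinstance(node, list):
        for item in node:
            walk(item, declared, referenced)


def dangling_ids(html: str) -> set:
    """Return @id references that no JSON-LD block on this page declares."""
    declared, referenced = set(), set()
    for block in LD_JSON.findall(html):
        walk(json.loads(block), declared, referenced)
    return referenced - declared


page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting",
 "@id": "https://example.com/blog/post/#article",
 "author": {"@id": "https://example.com/#person"},
 "isPartOf": {"@id": "https://example.com/#website"}}
</script>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite",
 "@id": "https://example.com/#website", "name": "Example"}
</script>
</head></html>"""
missing = dangling_ids(page)
print(missing)
```

Run against each rendered URL in a build, a non-empty result flags exactly the situation described above: `#person` is referenced but never declared on the page.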

I caught this on my own site when I audited the JSON-LD graph across URLs. The fix is to emit `Person` and `WebSite` on every page — they're cheap, and they make every `@id` reference resolve on every URL. The head template now looks like this:

```go-html-template
{{/* Emit on every page so @id references resolve site-wide.
     Before this change, #person was only declared on the home page;
     the Article.author.@id reference on blog posts dangled. */}}
{{ partial "schema/website.html" . }}
{{ partial "schema/person.html" . }}
{{ if .IsHome }}
  {{ partial "schema/faq.html" . }}
{{ else }}
  {{ partial "schema/breadcrumb.html" . }}
  {{ if .IsPage }}
    {{ if eq .Section "blog" }}
      {{ partial "schema/article.html" . }}
    {{ end }}
    {{ if eq .RelPermalink "/about/" }}
      {{ partial "schema/profile-page.html" . }}
    {{ end }}
    {{ if .Params.faq }}
      {{ partial "schema/faq-page.html" . }}
    {{ end }}
  {{ end }}
{{ end }}
```

That emits a fully-connected graph on every URL. On a blog post the graph resolves: `BlogPosting → author → Person → mainEntityOfPage → ProfilePage → mainEntity → Person` (closes the cycle), plus `publisher → Person`, `isPartOf → WebSite`, and a separate `BreadcrumbList` with its own `@id` referenced from `ProfilePage.breadcrumb` when applicable.

Validate every schema change in both [Google's Rich Results Test](https://search.google.com/test/rich-results) and the [Schema.org validator](https://validator.schema.org/). They disagree on edge cases; Google's validator tells you what Google will actually use, the Schema.org validator catches spec violations Google doesn't flag.

### A debugging story: the double-escape

I ran into a second schema bug that's worth recounting because it's easy to miss and the fix is counterintuitive.

I refactored the `BreadcrumbList` template to use Hugo's `jsonify` function on the `name` field, expecting it to produce safe JSON. It didn't. The rendered output was:

```json
{"@type":"ListItem","position":2,"name":"\"About Vikas Mishra\"","item":"..."}
```

The name value was the string `"About Vikas Mishra"` *including the quote characters*, rather than `About Vikas Mishra`. Hugo's context-aware auto-escaping inside `<script>` blocks was treating `jsonify`'s output as HTML and re-escaping the quotes, turning the JSON string literal into a string that contained quotes. The fix was to drop `jsonify` on simple title fields and use plain interpolation (`"name": "{{ .Title }}"`), reserving `jsonify` for values that might contain actual quote characters. Subtle, and the kind of thing that only surfaces when a human reads the rendered HTML or Search Console flags the item as invalid.
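The effect is easy to reproduce outside Hugo. The sketch below is plain Python rather than Hugo's escaper: it JSON-encodes a title twice, on the assumption that this is what the template pipeline effectively did, and adds a hypothetical lint check (`looks_double_encoded`) that could run over rendered JSON-LD.

```python
import json

title = "About Vikas Mishra"

encoded_once = json.dumps(title)          # a JSON string literal
encoded_twice = json.dumps(encoded_once)  # the literal encoded again as a string

print(encoded_once)
print(encoded_twice)

# Decoding the double-encoded value yields a string that still carries quotes.
decoded = json.loads(encoded_twice)


def looks_double_encoded(value: str) -> bool:
    """Heuristic: a decoded name that is itself wrapped in double quotes."""
    return len(value) >= 2 and value.startswith('"') and value.endswith('"')
```

A check like this only catches the symptom; reading the rendered HTML is still the way to find which template expression caused it.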

### The schema types that actually earn citations

Stop spraying every type onto every page. In priority order for AEO:

1. **`BlogPosting`** (not `Article` — `BlogPosting` is the more specific subtype Google uses for blog content in AI Overviews) with inline `author` and `publisher`, `image`, `datePublished`, `dateModified`, `wordCount`, `articleSection`, `keywords`, `inLanguage`, `timeRequired`, and `isAccessibleForFree`.
2. **`Person`** with `sameAs` linking to every authoritative profile, plus `hasCredential` for verified certifications, `worksFor`, `alumniOf`, `knowsAbout`, and `mainEntityOfPage` pointing at the ProfilePage.
3. **`ProfilePage`** on the canonical About URL. Per Google's [February 2026 Authors documentation update](https://developers.google.com/search/docs/appearance/structured-data/profile-page), this is now an explicit quality signal for authorship attribution.
4. **`BreadcrumbList`** on every non-home page.
5. **`WebSite`** declared once, referenced from every page.
6. **`FAQPage`** only on pages that genuinely answer recurring questions. Google has been cracking down on spammy FAQ schema since 2023 and over-application is a quality signal in the wrong direction.
7. **`speakable`** as a property inside `BlogPosting` — the AEO-specific property most sites still don't ship.

`HowTo` schema lost most of its rich-result eligibility in the March 2026 update; skip it unless your content is genuinely a numbered procedure.

## Speakable: the AEO property that nobody ships

`speakable` is a property inside `BlogPosting` or `Article` that uses CSS selectors to identify which parts of the page are best for text-to-speech. Google Assistant and the AI Overview audio experience use it to decide what to read aloud. The AEO-relevant implication is stronger: the synthesizer treats the `speakable` selector output as the canonical short-form version of your content. If it has to compress your 2000-word post into a two-sentence answer, the `speakable` output is where it starts.

The selector pattern that works for most blog layouts:

```json
"speakable": {
  "@type": "SpeakableSpecification",
  "cssSelector": [
    "h1",
    ".prose-custom > p:first-of-type",
    ".prose-custom h2"
  ]
}
```

That picks the title, the lead paragraph (usually the TL;DR), and the H2 headings (usually short claims). Roughly **62% of 2026 search queries involve voice in some form** ([fysalyaqoob.com](https://www.fysalyaqoob.com/guides/schema-markup-2026-ai-voice-search)) — omitting `speakable` forfeits that surface entirely.

Writing consequence: compose your lead paragraph as a self-contained answer, not a hook. The first paragraph should work as a standalone summary when read aloud. Both of my recent posts now open this way, deliberately.

## llms.txt and llms-full.txt (the AI ingestion files)

The [llms.txt proposal](https://llmstxt.org/) defines a markdown file at `/llms.txt` that gives an AI agent a structured directory of your site. None of the major LLMs has officially committed to using it as a primary input yet, and reasonable skepticism exists in the community about whether it changes anything. The implementation cost is low, though, and the companies betting on it — Anthropic, Cloudflare, Stripe, Zapier, Mintlify — are the same companies that typically set tooling norms in this space. **Over 844,000 sites had implemented it by late 2025** ([searchengineland.com](https://searchengineland.com/llms-txt-proposed-standard-453676)). The expected value is positive and the marginal cost is close to zero.

The spec is strict and most implementations get it wrong:

- **H1** with the project or site name (the only required section).
- **Blockquote** with a one-paragraph authoritative summary.
- **H2 sections** containing markdown link lists, one link per line, each followed by a colon and a concise description of *what that page answers*. Recency-first for blog sections.
- An optional **"Optional"** section at the end pointing to deeper resources.

The common failure mode: treating `llms.txt` as a bio blob, dumping CV paragraphs into it. The spec wants a smart table of contents — canonical pages, descriptions that tell the agent what each page *answers*.
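A structural check along these lines can run in CI. This is a rough sketch, not a spec-complete validator: `check_llms_txt` is a hypothetical helper that verifies the H1 comes first and that H2 sections contain only link lines, which is enough to catch the bio-blob failure mode.

```python
import re

# "- [title](url)" with an optional ": description" suffix
LINK_LINE = re.compile(r"^- \[[^\]]+\]\([^)]+\)(: .+)?$")


def check_llms_txt(text: str) -> list:
    """Return a list of structural problems; an empty list means none found."""
    lines = [ln.rstrip() for ln in text.strip().splitlines()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("first line must be an H1 with the site name")
    in_section = False
    for ln in lines[1:]:
        if ln.startswith("## "):
            in_section = True
        elif in_section and ln.startswith("- ") and not LINK_LINE.match(ln):
            problems.append(f"not a '[title](url): description' line: {ln}")
        elif in_section and ln and not ln.startswith("- "):
            problems.append(f"free prose inside a link section: {ln[:40]}")
    return problems


good = (
    "# Site\n\n> Summary.\n\n## Recent writing\n"
    "- [Post](https://example.com/post/): What it answers\n"
)
bad = (
    "# Site\n\n## About\n"
    "I am an engineer with many years of experience.\n"
    "- see https://example.com\n"
)
print(check_llms_txt(good))
print(check_llms_txt(bad))
```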

`llms-full.txt` is the companion file that compiles every relevant page inline as one big markdown document. Anthropic's is **481,349 tokens**; Cloudflare ships one per product vertical. For a personal site or product docs, this is the single highest-leverage AI-ingestion surface: one URL pastes the full site context into ChatGPT, Claude, or any agent.

I generate both from Hugo at build time:

```go-html-template
{{- /* llms.txt template: spec-compliant, generated every build. Trim markers keep the H1 on the first line. */ -}}
# {{ .Site.Title }}

> {{ .Site.Params.description }}

## Recent writing
{{- range first 25 (where .Site.RegularPages "Section" "blog").ByDate.Reverse }}
- [{{ .Title }}]({{ .Permalink }}): {{ .Description | plainify }}
{{- end }}
```

```go-html-template
{{/* llms-full.txt template — compiles every post inline */}}
# {{ .Site.Title }} — Full Content for AI Ingestion

{{ range (where .Site.RegularPages "Section" "blog").ByDate.Reverse }}
### {{ .Title }}
- Published: {{ .PublishDate.Format "2006-01-02" }}
- URL: {{ .Permalink }}

{{ .RawContent }}

---
{{ end }}
```

Total work: about 70 lines of template, regenerated on every deploy. Mine is live at [/llms.txt](https://vikasmishra.ai/llms.txt) and [/llms-full.txt](https://vikasmishra.ai/llms-full.txt).

## Per-page raw markdown alternates

The single most underrated move in 2026 is exposing every page as raw markdown alongside the HTML. Hugo lets you do it with a custom output format:

```toml
[outputFormats.MARKDOWN]
  name = "MARKDOWN"
  mediaType = "text/markdown"
  baseName = "index"
  isPlainText = true
  rel = "alternate"

[outputs]
  page = ["HTML", "MARKDOWN"]
```

That produces `https://vikasmishra.ai/blog/<slug>/index.md` for every post, served as `text/markdown`. Advertise the alternate in `<head>`:

```html
<link rel="alternate" type="text/markdown"
      href="https://vikasmishra.ai/blog/<slug>/index.md">
```

Why it matters: agentic AI fetchers (Perplexity, ChatGPT browsing, Claude, increasingly Bing's grounding layer) prefer raw markdown over HTML scraping. The fetch is faster, content extraction is exact, structure (headings, code blocks, lists) is preserved without the noise of nav/footer/scripts. Pages that ship `.md` alternates get pulled cleanly; pages that don't get HTML-stripped, often badly. Ask ChatGPT or Claude to "fetch and summarize" a blog post URL — the agent that grabs the `.md` returns a noticeably more accurate summary than one fetching the HTML.
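From the fetcher's side, discovering the alternate is a small parsing step. The sketch below shows how an agent might do it with the standard library; it is an assumption about client behavior, not a description of how any particular AI fetcher works, and `markdown_alternate` is a hypothetical name.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class AlternateFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="text/markdown">."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if "alternate" in rels and a.get("type") == "text/markdown" and a.get("href"):
            self.hrefs.append(a["href"])


def markdown_alternate(html: str, page_url: str) -> Optional[str]:
    """Return the absolute URL of the page's markdown alternate, if advertised."""
    finder = AlternateFinder()
    finder.feed(html)
    return urljoin(page_url, finder.hrefs[0]) if finder.hrefs else None


sample = (
    '<html><head><link rel="alternate" type="text/markdown" '
    'href="/blog/post/index.md"></head><body></body></html>'
)
found = markdown_alternate(sample, "https://example.com/blog/post/")
print(found)
```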

This is the pattern Mintlify popularized for docs and Cloudflare now serves across all of `developers.cloudflare.com`. Within twelve months it will be a default expectation for AI-indexable content.

## IndexNow: the ChatGPT pipeline most sites haven't set up

Bing's standard crawl interval is days to weeks depending on your crawl budget. **IndexNow shortens that to minutes** for any URL you push, and because 92% of ChatGPT search goes through Bing, IndexNow is functionally the release pipeline for getting into ChatGPT citations.

IndexNow ([indexnow.org](https://www.indexnow.org/)) is a push-based protocol Microsoft introduced. Bing, Yandex, Naver, Seznam, and a handful of smaller engines participate. Setup is three steps:

1. Generate a 32-character hex API key.
2. Host a verifier file at `https://<your-domain>/<key>.txt` containing the key as the body.
3. POST JSON payloads with changed URLs to `https://api.indexnow.org/IndexNow`.

The ping script I wired into my deploy:

```bash
#!/usr/bin/env bash
# Parses the last 7 days of sitemap entries and submits to IndexNow.
# Bing re-crawls within minutes; ChatGPT search citations follow.
set -euo pipefail

HOST="vikasmishra.ai"
KEY="67b195f48d581df4cde76566232326a2"
URLS=$(python3 parse_sitemap.py --since 7)

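# Pass values through the environment instead of interpolating them into
# Python source, so a quote or backslash in a URL cannot break the payload.
PAYLOAD=$(HOST="$HOST" KEY="$KEY" URLS="$URLS" python3 -c '
import json, os
host, key = os.environ["HOST"], os.environ["KEY"]
print(json.dumps({
    "host": host,
    "key": key,
    "keyLocation": f"https://{host}/{key}.txt",
    "urlList": os.environ["URLS"].split(),
}))')

# A failed ping is logged but never fails the deploy.
curl -sS --fail -X POST https://api.indexnow.org/IndexNow \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "$PAYLOAD" \
  || echo "IndexNow ping failed (non-fatal)" >&2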
```

Called after every production build. Failures are non-fatal so they can't block a deploy. The observable effect in Bing Webmaster Tools' crawl log is a drop in mean-time-to-crawl from days to under ten minutes, and in the AI Performance report a corresponding compression of the delay between publish time and first citation appearance.
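The script above calls `parse_sitemap.py`, which is not shown. A hypothetical version of its core logic might look like the sketch below: it reads a sitemap, keeps entries whose `lastmod` falls within the window, and assumes UTC when a value has no timezone. The function name and sample data are illustrative only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_urls(sitemap_xml: str, since_days: int, now: datetime) -> list:
    """Return <loc> values whose <lastmod> falls within the last `since_days`."""
    cutoff = now - timedelta(days=since_days)
    urls = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if not loc or not lastmod:
            continue  # no lastmod: cannot tell whether the page changed
        # Older Python versions do not accept a trailing "Z", so normalize it.
        modified = datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
        if modified.tzinfo is None:  # date-only or naive values: assume UTC
            modified = modified.replace(tzinfo=timezone.utc)
        if modified >= cutoff:
            urls.append(loc.strip())
    return urls


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/new/</loc><lastmod>2026-01-10T12:00:00+05:30</lastmod></url>
  <url><loc>https://example.com/old/</loc><lastmod>2025-06-01</lastmod></url>
  <url><loc>https://example.com/undated/</loc></url>
</urlset>"""
recent = recent_urls(SAMPLE, 7, now=datetime(2026, 1, 12, tzinfo=timezone.utc))
print("\n".join(recent))
```

Printing one URL per line matches what the shell script expects when it splits the output on whitespace.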

## Visible E-E-A-T (because schema-only authorship is weak)

Google's February 2026 update to Search Central's Authors documentation was the clearest signal yet that authorship is a direct quality consideration in ranking and AI citation. The schema-only version — `Article.author` referenced by `@id` — is necessary but not sufficient. The visible, on-page version is what Google's quality models actually weigh.

What "visible" means on a blog post:

- **Byline at the top** with the author's name, role, and a link to the canonical About page. Avatar optional but useful.
- **Author bio block at the bottom** with a paragraph of credentials, links to other posts by the same author, and outbound links to authoritative profiles (LinkedIn, GitHub, university page, Google Scholar, Credly).
- **Microdata `itemprop` hints** on the visible markup as a belt-and-suspenders complement to JSON-LD. Older parsers and some AI fetchers still read microdata.

I shipped both the top byline and the bottom bio block on this site. The byline includes the author's role with the employer because AEO surfaces weight employer affiliation as a topical-authority signal: a post on cloud architecture written by someone with a visibly attributed cloud architect role at a cloud provider is weighted higher than the same post from an unaffiliated writer. Disclaimers about the employer not endorsing the content go in the post body, not the byline; the byline is the credential signal, the disclaimer is legal hygiene.

The deeper play is **entity verification** via Google's Knowledge Graph. Authors are classified as "verified entities" when the graph can cross-reference your identity across multiple authoritative sources, or unverified when it can't. Verification raises the citation eligibility floor. The mechanism is `sameAs`: your `Person` schema's `sameAs` array should list every authoritative URL that's about you. The strongest `sameAs` target is **Wikidata**, because Wikidata is a primary input to Google's Knowledge Graph. A Wikidata entry whose official website statement (property P856) points back to your site closes the verification loop from both directions.

If you don't have a Wikidata entry, create one with your notable work and third-party citations. Don't claim a Wikidata ID in `sameAs` that doesn't exist — an invalid `sameAs` value is worse than a missing one; Google's entity resolver treats it as a negative signal.

## Open Graph for the AI age

The default `og:image` in most Hugo themes is the site portrait. That's wrong for blog shares. The portrait belongs on pages where the person is the subject — home, about, contact. For a blog post, the share preview should communicate the topic of the post, not the face of the writer.

I ship a per-section default. Blog posts and section index pages use a branded 1200×630 terminal-themed PNG generated from a 100-line Pillow script in `tools/gen_og_blog.py`. Home and root pages keep the portrait. Per-post `image:` frontmatter still overrides everything. Rebuilding the card is a single command, which matters because the card design will need iteration as you learn what drives click-through.

Caching detail people miss: X, LinkedIn, and Facebook aggressively cache OG previews per URL. If you've shared a URL with the wrong card, the cached preview persists. Appending `?v=2` (or any throwaway query parameter) when re-sharing forces a fresh fetch.
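Appending the parameter by hand is error-prone when the URL already has a query string or a fragment. A minimal sketch of a helper, with `with_cache_buster` as a hypothetical name, might look like this:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url: str, version: int, param: str = "v") -> str:
    """Return `url` with `param=version` set, preserving any existing query."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != param  # drop an earlier cache-buster so it is not duplicated
    ]
    query.append((param, str(version)))
    return urlunsplit(parts._replace(query=urlencode(query)))


plain = with_cache_buster("https://example.com/blog/post/", 2)
existing = with_cache_buster("https://example.com/p/?utm=x&v=1", 3)
fragment = with_cache_buster("https://example.com/p/#top", 2)
print(plain, existing, fragment, sep="\n")
```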

The advanced version is per-post auto-generated OG cards with the post title rendered onto the canvas. Hugo's `images` package supports composition via `images.Filter` and `images.Text`. Worth the build-time cost if your blog covers wildly different topics per post; not worth it otherwise.

## Robots.txt: explicit > implicit

`User-agent: *` with `Allow: /` already permits AI crawlers. Explicit allows are stronger because (a) they make your crawl logs auditable — you can see in `access.log` which AI crawlers actually showed up; (b) they survive future spec changes where a crawler honors a more granular directive; and (c) they signal intent to other engineers and security teams reading the file. The 2026 list I keep current on this site:

```
# Google
User-agent: Googlebot
User-agent: Google-Extended           # Gemini training
User-agent: Google-CloudVertexBot     # Vertex AI grounding
User-agent: Gemini-Deep-Research      # AI Mode multi-step research agent

# OpenAI
User-agent: GPTBot                    # Training
User-agent: ChatGPT-User              # Browsing
User-agent: OAI-SearchBot             # ChatGPT search

# Anthropic
User-agent: ClaudeBot                 # Training
User-agent: Claude-User               # Browsing
User-agent: Claude-SearchBot          # Search

# Microsoft / Bing
User-agent: bingbot
User-agent: msnbot

# Apple Intelligence
User-agent: Applebot-Extended

# Perplexity
User-agent: PerplexityBot
User-agent: Perplexity-User

# Others
User-agent: meta-externalagent        # Meta AI
User-agent: Amazonbot                 # Alexa / Q
User-agent: DuckAssistBot
User-agent: MistralAI-User
User-agent: cohere-ai
User-agent: YouBot                    # You.com
User-agent: KagiBot
User-agent: BraveBot
User-agent: Diffbot                   # Knowledge graph
User-agent: xAI-Bot
User-agent: GrokBot
User-agent: CCBot                     # Common Crawl
```

Each gets `Allow: /`. If any section shouldn't be AI-indexed (paid content, internal docs), use `Disallow:` inside the relevant `User-agent:` block — not site-wide. Large publishers have accidentally made themselves invisible to LLMs by adding `User-agent: *` `Disallow: /` during a site migration and forgetting to remove it.
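A regression test for the migration accident described above can be written with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the robots.txt body, the agent list, and the paths are examples, and Python's parser applies the first matching rule in a group, which is not necessarily how every crawler resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Python's parser ends a group at a blank line, so agents and their rules
# are kept contiguous here. Disallow comes first because the parser uses
# the first rule that matches.
ROBOTS_TXT = """\
User-agent: GPTBot        # Training
User-agent: ClaudeBot     # Training
User-agent: PerplexityBot
Disallow: /internal/
Allow: /

User-agent: *
Disallow: /drafts/
"""


def audit(robots_txt: str, agents: list, paths: list) -> dict:
    """Map (agent, path) to whether the parsed rules allow the fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


result = audit(
    ROBOTS_TXT,
    agents=["GPTBot", "ClaudeBot", "PerplexityBot", "SomeOtherBot"],
    paths=["/blog/post/", "/internal/notes/", "/drafts/wip/"],
)
blocked = sorted(key for key, allowed in result.items() if not allowed)
print(blocked)
```

Running a check like this in CI against the deployed file makes an accidental site-wide `Disallow: /` fail a build instead of surfacing weeks later as missing citations.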

## Sitemap priorities, lastmod, and the freshness signal

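A sitemap with flat `priority: 0.5` and no `lastmod` signals nothing. Accurate `lastmod` is the field that carries the most weight: Google's sitemap documentation says it ignores `priority` and `changefreq` and uses `lastmod` only when it is consistently accurate, so treat the hierarchy below as a hint for the engines that still read it.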

| Page type | Priority | Changefreq |
|---|---|---|
| Home | 1.0 | daily |
| Section index (`/blog/`, `/talks/`) | 0.9 | daily/weekly |
| Blog posts | 0.8 | weekly |
| About, Contact, Resume | 0.8 | monthly |
| Projects | 0.7 | monthly |
| Secondary (Photos, Reading, Privacy, Terms) | 0.6 | monthly |

`lastmod` should derive from actual content modification time, not build timestamp. In Hugo, `enableGitInfo = true` makes `Lastmod` come from git history; every post's `lastmod` reflects its last meaningful edit automatically. That same value feeds `dateModified` in BlogPosting schema and `og:updated_time` in the head. One source, three downstream signals.
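The single-source idea can be sketched in a few lines. The helper names below are hypothetical, and `git_lastmod` only approximates what a git-backed `lastmod` looks like by reading the author date of the last commit that touched a file; it needs a git checkout to run and is not exercised in the example.

```python
import subprocess
from datetime import datetime, timezone


def git_lastmod(path: str) -> datetime:
    """Author date of the last commit touching `path` (requires a git checkout).

    Raises ValueError if the file has no commits, because the output is empty.
    """
    out = subprocess.run(
        ["git", "log", "-1", "--format=%aI", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return datetime.fromisoformat(out.replace("Z", "+00:00"))


def freshness_signals(published: datetime, lastmod: datetime) -> dict:
    """Derive every freshness field from the same timestamp."""
    stamp = lastmod.isoformat(timespec="seconds")
    return {
        "sitemap_lastmod": stamp,
        "schema_dateModified": stamp,
        "og_updated_time": stamp,
        # Only label the page as updated when the edit is on a later day.
        "show_updated_label": lastmod.date() > published.date(),
    }


published = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
edited = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
signals = freshness_signals(published, edited)
print(signals)
```

Because all three fields come from one value, the sitemap, the schema, and the Open Graph tag cannot drift apart.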

Google's AI Mode and Perplexity both use freshness as a re-ranking lever. Stale `lastmod` drops you against the same claim written this month. And show the freshness to humans — my blog post header renders `Updated <date>` between the publish date and reading time when `Lastmod > Date`, which also confirms to any AI fetcher that the visible content matches the schema claim.

## Core Web Vitals in 2026 (site-level, and tighter)

The March 2026 Google update tightened Core Web Vitals scoring in two ways worth internalizing:

1. **LCP "good" threshold dropped** from 2.5s to 2.0s. Anything between 2.0 and 2.5 used to pass; it's now "needs improvement."
2. **Site-level aggregation replaced page-level scoring.** Google now evaluates CWV across your whole domain. One slow section can suppress the entire domain's AI Overview eligibility, not just that page's rank.

For static sites with a modern build, CWV passes with margin. The two places that trip up Hugo sites:

- **Font loading.** Loading three weights each of two font families blows the LCP budget. Subset, self-host, or use `font-display: optional`.
- **INP (Interaction to Next Paint).** Third-party analytics and hover-driven JS are usually the culprit. Defer non-essential JS; use `<link rel="preconnect">` for analytics endpoints to shave first-paint cost.

CLS (Cumulative Layout Shift) is solved by explicit `width` and `height` on every image so the browser reserves space before load. Should be standard by now; audit if you're not sure.

## What to skip, and the dark patterns that will cost you rank

**Skip (low or negative ROI):**

- **FAQ schema on every page.** Google restricted FAQ rich results to well-known government and health sites in 2023, and over-application is a quality signal in the wrong direction. Use FAQ schema only on pages that genuinely answer recurring questions.
- **HowTo schema for non-instructional content.** Lost most rich-result eligibility in March 2026.
- **Stuffed `keywords` property** in Article schema. Essentially ignored now; Google's entity extractor pulls topics from actual content.
- **Multiple `Person` schemas on the same page.** Route every reference through one canonical `@id`.
- **Wikidata `sameAs` pointing at non-existent IDs.** Negative signal. Either create the Wikidata entry or omit the field.

**Avoid (these will get you penalized):**

- **Cloaking AI bots with content different from what users see.** Modern cloaking is detectable by fingerprinting, the penalty is severe, and the EU Digital Services Act now imposes fines up to 6% of global revenue for manipulative interfaces ([techcrunch.com](https://techcrunch.com/2025/08/25/ai-sycophancy-isnt-just-a-quirk-experts-consider-it-a-dark-pattern-to-turn-users-into-profit/)). Risk-adjusted return is negative.
- **Bulk AI-generated content with no human review.** The April 2026 algorithm update placed a premium on first-hand experience signals. AI-rewritten pages without depth get filtered before generation.
- **Schema that doesn't match visible content.** Explicit violation of Google's structured data guidelines (now part of Search Essentials, formerly the Webmaster Guidelines): *misleading structured data*. Your `BlogPosting.author` has to match the visible byline. Your `datePublished` has to match the visible date.
- **Generated `Review` or `Rating` schema you don't actually have.** Highest-detection-rate spam pattern in Google's evaluation pipeline.

The throughline: AEO rewards sites that make true claims with rich structure. It penalizes sites that try to game the structure to make false claims look true.
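
The "schema matches visible content" rule is easy to turn into a build-time check. This is a rough sketch, assuming you have already extracted the `BlogPosting` JSON-LD and the visible byline and date from the rendered page; the function name and the sample post are hypothetical. It also assumes the author is declared inline with a `name`, not as a bare `@id` reference.

```python
from datetime import date, datetime


def schema_mismatches(jsonld: dict, visible_author: str, visible_date: date) -> list[str]:
    """Compare structured-data claims against what the reader actually sees."""
    problems = []

    author = jsonld.get("author")
    # Assumption: a single inline author object with a "name" property.
    schema_author = author.get("name", "") if isinstance(author, dict) else str(author or "")
    if schema_author.strip().casefold() != visible_author.strip().casefold():
        problems.append(f"author: schema says {schema_author!r}, page shows {visible_author!r}")

    try:
        # Only the calendar date is compared; time and offset are ignored.
        schema_date = datetime.fromisoformat(jsonld.get("datePublished", "")).date()
    except ValueError:
        schema_date = None
    if schema_date != visible_date:
        problems.append(f"datePublished: schema says {schema_date}, page shows {visible_date}")

    return problems


post = {
    "@type": "BlogPosting",
    "author": {"@type": "Person", "name": "Vikas Mishra"},
    "datePublished": "2026-04-19T09:00:00+05:30",
}
print(schema_mismatches(post, "Vikas Mishra", date(2026, 4, 19)))
print(schema_mismatches(post, "Someone Else", date(2026, 4, 20)))
```

Failing the build on a non-empty result keeps the claim and the page from drifting apart silently.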

## The ship order

If I were starting from a clean Hugo site tomorrow, I'd ship the work in this order:

1. **Foundations.** Robots.txt with the full AI crawler allowlist. Sitemap with hierarchical priorities. `enableGitInfo = true` so `lastmod` is automatic.
2. **Schema graph.** `WebSite` and `Person` on every page, `BlogPosting` on every blog post with inline `author` and `publisher`, `BreadcrumbList` on every non-home page, `ProfilePage` on `/about/`. Validate every change in Google's Rich Results Test before pushing.
3. **Visible E-E-A-T.** Author byline at the top of every post, bio block at the bottom, visible `Updated <date>` when content has been revised, microdata `itemprop` hints to complement JSON-LD.
4. **AI ingestion files.** `llms.txt` to spec, generated at build time. `llms-full.txt` compiling all blog posts inline. Per-page `.md` alternates advertised via `<link rel="alternate">`.
5. **IndexNow.** Key + verifier file + ping script wired into deploy. Verify in Bing Webmaster Tools AI Performance.
6. **Speakable.** `speakable` property inside `BlogPosting` with `cssSelector` values for title, lead, and H2s. Rewrite leads as standalone answers.
7. **OG cards.** Per-section defaults that reflect content, not portraits. Per-post overrides via frontmatter.
8. **Operate.** Google Search Console weekly. Bing Webmaster Tools weekly. Watch which URLs get cited and iterate.
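
For step 5, the ping script can be very small. The sketch below builds an IndexNow submission as a JSON `POST` to the shared `api.indexnow.org` endpoint, using the `host`, `key`, `keyLocation`, and `urlList` fields from the protocol. The key value is a placeholder, and hosting the verifier file at `/<key>.txt` on the site root is an assumption about your setup.

```python
import json
import urllib.request
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(urls: list[str], key: str) -> urllib.request.Request:
    """Build (but do not send) an IndexNow batch submission."""
    hosts = {urlsplit(u).netloc for u in urls}
    if len(hosts) != 1:
        raise ValueError("all URLs in one submission must share a single host")
    host = hosts.pop()
    payload = {
        "host": host,
        "key": key,
        # Assumption: the verifier file is served from the site root.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def ping(urls: list[str], key: str) -> int:
    """Send the submission and return the HTTP status code."""
    with urllib.request.urlopen(build_indexnow_request(urls, key), timeout=10) as resp:
        return resp.status


# Placeholder key for illustration only; nothing is sent here.
request = build_indexnow_request(
    ["https://vikasmishra.ai/blog/agentic-coding-is-a-different-skill/"],
    key="0123456789abcdef",
)
print(request.full_url, json.loads(request.data)["keyLocation"])
```

In a deploy pipeline you would call `ping()` only with URLs whose content actually changed in that release, which keeps the signal meaningful.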

Most teams I've worked with lean heavily toward new content production and underinvest in the substrate that determines whether the content is eligible to be cited. The substrate work compounds across every piece of content you ever publish; the content work doesn't. A more even split between the two is usually the higher-return rebalance.

## The durable principles (read this section if you read nothing else)

Specific schema property names change. Crawler names change. Rich-result eligibility rules change on the March and September algorithm cycles. The principles that sit under all of that have been stable through several generations of search architecture and will stay stable through several more. If you build against these, the tactical changes become minor maintenance instead of quarterly rewrites.

1. **Make your content retrievable in the format the retriever prefers.** In 2026 that means HTML plus a markdown alternate. In 2028 it will mean HTML plus markdown plus whatever structured-data formats agentic systems converge on. The principle is: expose the cleanest-possible representation of your content to any system that might fetch it. That principle was true for XML feeds in 2006, RSS in 2010, AMP in 2018, JSON-LD in 2022, and markdown alternates in 2026. It will be true for the next format.

2. **Declare a connected, verifiable entity graph — not isolated JSON-LD blocks.** Every entity has one canonical `@id`; every reference resolves. This is just data modeling hygiene, and data modeling hygiene has never gone out of style. The specific types (`Person`, `ProfilePage`, `BlogPosting`) will evolve; the graph discipline stays the same.

3. **Make the author a verifiable entity across the open web.** `sameAs` to the authoritative profiles that already know who you are. Bidirectional links so any system doing entity resolution can confirm the claim. The platforms that carry authority will rotate — Wikipedia, LinkedIn, Wikidata, Google Scholar, whatever comes next — but cross-reference across multiple authoritative sources has been how identity is established since the open web existed.

4. **Keep content fresh by default, and show freshness to machines and humans alike.** Accurate `lastmod` in the sitemap. `dateModified` in schema. Visible "Updated X" in the rendered page. This is how every retrieval system ranks recency, and every retrieval system will rank recency as long as fresh is ever preferred over stale.

5. **Push updates instead of waiting to be crawled.** Today that means IndexNow for Bing and the broader OpenAI/Microsoft surface. The specific protocol will evolve; push-indexing as a pattern won't. If a search engine offers a push channel, use it — the payoff is always faster time-to-citation.

6. **Write for passage retrieval, not page retrieval.** Retrieval-plus-generation systems retrieve passages and synthesize. Every self-contained section, every H2 with a clean standalone answer underneath, is a separate retrieval target. This is true for any retrieval system that operates at sub-document granularity, and all of them do now.

7. **Match the on-page signal to the structured-data claim.** If your schema says the author is Dr. X, the byline says Dr. X. If `datePublished` says April 2026, the visible date says April 2026. Google's "misleading structured data" penalty has been on the books for years and the AEO-era quality models apply it more aggressively, not less.

8. **Favor substrate investment over content volume.** Ten posts on a substrate that earns citations outperform a hundred posts on a substrate that doesn't. The substrate compounds; the content doesn't. Teams tend to underinvest here even when they've read the advice, because substrate work feels less productive than shipping new pages. It isn't.
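
Principle 2 is the one most amenable to an automated test. As a sketch, assuming your pages emit a JSON-LD `@graph` array, the function below finds every `{"@id": ...}` pointer that does not resolve to a node defined in the same graph. The function name and the `example.com` identifiers are illustrative.

```python
def dangling_references(graph: list[dict]) -> set[str]:
    """Return every @id that is referenced but never defined in the graph."""
    defined = {node["@id"] for node in graph if "@id" in node}
    referenced: set[str] = set()

    def walk(value, top_level: bool) -> None:
        if isinstance(value, dict):
            # A nested dict holding only "@id" is a pointer to another node.
            if not top_level and set(value) == {"@id"}:
                referenced.add(value["@id"])
            for child in value.values():
                walk(child, False)
        elif isinstance(value, list):
            for item in value:
                walk(item, False)

    for node in graph:
        walk(node, True)
    return referenced - defined


graph = [
    {"@type": "WebSite", "@id": "https://example.com/#website"},
    {"@type": "Person", "@id": "https://example.com/#person", "name": "Example Author"},
    {
        "@type": "BlogPosting",
        "@id": "https://example.com/blog/post/#article",
        "author": {"@id": "https://example.com/#person"},
        "isPartOf": {"@id": "https://example.com/#website"},
        "publisher": {"@id": "https://example.com/#org"},  # never defined
    },
]
print(dangling_references(graph))
```

An empty set means every reference resolves inside the page's own graph. References that are meant to resolve on another page (for example a `Person` defined only on `/about/`) would need the check widened to the whole site.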

## Closing

Structured-data validators are your tests. Search Console and the equivalent per-surface dashboards are your production telemetry. Push-indexing protocols are your release pipeline. The schema graph is your data model. Each has a deterministic right answer you can implement, validate, and observe — the same workflow any serious engineering team applies to any other production system.

The metric changed: from position in a list to inclusion in a synthesized answer. The engineering required to hit it is harder to hand-wave about than classical SEO was, but it's measurable at the substrate layer and the returns are larger because an AI Overview citation is worth materially more to a brand than a position-two blue link ever was. For large digital-native businesses — marketplaces, travel platforms, food delivery, fintechs — the outcomes over the next few years will track platform-engineering discipline more than content volume. The schema data plane wired into the system of record, push-indexing protocols in the release pipeline, and the schema graph treated as a first-class data model are platform investments. They compound; new pages don't.

The stack on this site validates clean in Google's Rich Results Test. The breadcrumb fix moved from Invalid to pending-revalidation in Search Console within 24 hours of push. The IndexNow-assisted publishes are already landing in Bing's AI Performance report. Those are the observable outputs of the principles above. When the specific tactics here age, replace the tactics — not the principles.

---

### References

- Google AI Overview ranking factors and citation economics — [wellows.com](https://wellows.com/blog/google-ai-overviews-ranking-factors/), [clickrank.ai](https://www.clickrank.ai/e-e-a-t-and-ai/)
- Google AI Mode query fan-out and Gemini architecture — [upgrowth.in](https://upgrowth.in/google-ai-mode-optimization-the-complete-guide-for-2026/), [seovendor.co](https://seovendor.co/google-april-2026-algorithm-updates)
- Knowledge Graph entity density and `sameAs` authority — [hypesuite.ai](https://www.hypesuite.ai/post/google-knowledge-graph-essentials-what-every-seo-pro-should-know), [12amagency.com](https://12amagency.com/blog/authoritative-sources-structured-data/), [wireinnovation.com](https://wireinnovation.com/mastering-seo-entities/)
- ProfilePage and Authors schema — [Google Search Central](https://developers.google.com/search/docs/appearance/structured-data/profile-page)
- Speakable specification — [Google Search Central](https://developers.google.com/search/docs/appearance/structured-data/speakable)
- Core Web Vitals March 2026 update — [logoswebdesigns.com](https://logoswebdesigns.com/blog/core-web-vitals-2026-march-update/)
- Bing AI Performance in Webmaster Tools — [Microsoft Bing Blog, Feb 2026](https://blogs.bing.com/webmaster/February-2026/Introducing-AI-Performance-in-Bing-Webmaster-Tools-Public-Preview)
- llms.txt specification — [llmstxt.org](https://llmstxt.org/), [Anthropic llms-full.txt](https://docs.claude.com/llms-full.txt)
- IndexNow protocol — [indexnow.org](https://www.indexnow.org/)
- Bing / ChatGPT retrieval share — [cbwebsitedesign.co.uk](https://www.cbwebsitedesign.co.uk/geo-ai/how-to-rank-on-chatgpt-and-bing-copilot-in-2026-full-guide/), [almcorp.com](https://almcorp.com/blog/bing-ai-performance-webmaster-tools-complete-guide/)
- Dark patterns and DSA penalties — [techcrunch.com](https://techcrunch.com/2025/08/25/ai-sycophancy-isnt-just-a-quirk-experts-consider-it-a-dark-pattern-to-turn-users-into-profit/)


---


### Agentic Coding Is a Different Skill

- **Published:** 2026-04-19 (updated 2026-04-19)
- **URL:** https://vikasmishra.ai/blog/agentic-coding-is-a-different-skill/
- **Markdown source:** https://vikasmishra.ai/blog/agentic-coding-is-a-different-skill/index.md
- **Tags:** AI Agents, Claude Code, Gemini CLI, Developer Productivity, MCP

> The executive summary of a free, open-source playbook for Claude Code and Gemini CLI: the handful of ideas that earned their place after months of trial and error.


*I spent the last several months running Claude Code and Gemini CLI against real codebases. The playbook that came out of that is free and open source at [github.com/vmishra/ai-coding-playbook](https://github.com/vmishra/ai-coding-playbook). This post is the executive summary — the handful of ideas that earned their place in the book after months of trial and error.*

---

## The moment the shape of the problem became clear

Three hours into a debugging session, my agent started inventing function names. Not hallucinating generic helpers. It was citing specific methods in our codebase that didn't exist. A function we had *built together* an hour earlier was suddenly unfamiliar. I asked it to re-read the file and it agreed to, then continued citing the wrong name.

The model hadn't gotten dumber. The context window had gotten dirtier.

That was the week I stopped thinking of agentic coding as "a faster way to code" and started thinking of it as a new skill. Not an autocomplete on steroids. Not a marginal improvement. Something different, where a lot of my previous instincts were working against me.

The rest of this post is what that shift looks like in practice — the mental model, the habits, and the primitives that make agents work on real codebases instead of in demos.

## Intelligence isn't the bottleneck. Context is.

Hold this picture in your head for a second:

```
┌──────────────────────────────────────────────────────────────┐
│                      The Context Window                      │
│                                                              │
│   [ system prompt ][ your brief ][ files the agent read ]    │
│   [ tool results ][ the agent's own plans and notes ]        │
│   [ intermediate outputs ][ your follow-up messages ]        │
│                                                              │
│                  Everything the model sees                   │
└──────────────────────────────────────────────────────────────┘
```

The model has exactly one input: this window. It has no memory of prior sessions unless you give it one. It cannot "just know" anything about your codebase that isn't either in the window or fetchable by a tool.

This single fact explains almost everything that goes wrong in a session:

- *"The agent forgot what we decided earlier."* It didn't forget. The decision got crowded out by later content.
- *"It keeps making the same mistake."* The correction is in the window, but it's buried under thirty file reads.
- *"It invented a function that doesn't exist."* The real function is in a file the agent never opened. From the model's perspective, inventing was the only option.

Every technique that actually works is, at the root, a technique for controlling what's in the window, for how long, at what position. Internalize that, and the rest is details.

This is the argument [Chapter 1](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/01-getting-started) builds on, and it's the reason [Part IV](https://github.com/vmishra/ai-coding-playbook/blob/main/docs/TABLE_OF_CONTENTS.md) of the book — the context-economics part — is what most people actually want first.

## Briefing, not prompting

Prompting a chat model is a single-turn transaction. You type a question, it answers, you judge.

Briefing an agent is a multi-turn commitment. You describe an outcome, the agent reads, plans, edits, runs, observes, iterates, and you supervise and redirect. The input isn't a prompt; it's a working agreement the agent will return to across dozens of turns.

Every brief that works has the same four parts:

1. **Outcome.** What "done" looks like, concretely, in terms the model can verify.
2. **Scope.** What to touch, and explicitly what not to.
3. **Constraints.** Non-goals. Conventions. Things out of bounds.
4. **Verification.** How you'll know it's actually done.

A bad brief:

> The API is too slow on the users endpoint, make it faster.

A good one:

> *Outcome:* `GET /users/:id` p95 under 100ms in our staging benchmark.
> *Scope:* `src/routes/users/get.ts`, `src/lib/db/users.ts` if needed. Tests allowed in `src/__tests__/routes/users/`.
> *Constraints:* Don't add caching — that's a separate project and involves a product decision. Don't change the response shape. Keep the existing transaction boundary.
> *Verification:* Run `pnpm bench -- users-get`. Show p50/p95/p99. If the only improvement you can find is caching, stop and tell me.

Same task. Radically different trajectory. The good brief is the entire difference between a 30-minute focused session and a 3-hour drift.
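
One way to make the four parts hard to skip is to treat the brief as a small data structure that refuses to render when a part is empty. This is an illustrative sketch, not something the CLIs require; the `Brief` class is hypothetical and the field values are copied from the example above.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    outcome: str
    scope: str
    constraints: str
    verification: str

    def missing_parts(self) -> list[str]:
        """Names of the parts that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def render(self) -> str:
        """Render the brief as a markdown blockquote, or fail loudly."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"brief is missing: {', '.join(missing)}")
        return "\n".join(
            f"> *{name.capitalize()}:* {value}" for name, value in vars(self).items()
        )


brief = Brief(
    outcome="`GET /users/:id` p95 under 100ms in our staging benchmark.",
    scope="`src/routes/users/get.ts`, `src/lib/db/users.ts` if needed.",
    constraints="Don't add caching. Don't change the response shape.",
    verification="Run `pnpm bench -- users-get`. Show p50/p95/p99.",
)
print(brief.render())
```

The "bad brief" above fails this check immediately: it has an outcome of sorts and nothing else.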

The full treatment, with a dozen worked examples and the anti-patterns to watch for, is in [Chapter 3](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/03-prompting).

## The four symptoms of context rot

Sessions don't die because the model gets tired. They die because the context degrades faster than the session produces useful work. The symptoms are specific and predictable. Learn to spot them early, because they're your cue to stop — not push through.

**Repetition.** The agent suggests a change it already made. Proposes an approach you already rejected.

**Confident invention.** Function names that don't exist. Import paths that are wrong. The model's priors leaking through because the real data got crowded out.

**Drift.** You asked for one fix. The diff touches seven files. The original brief's boundaries have faded.

**Hedging.** "This *should* work." "Usually this pattern…" The model is noticing its own uncertainty without being able to locate the thing it's uncertain about.

When you see two of these in the same session, stop. Compact, or better, start fresh. A new session with your memory files loaded is not a cost — it's a feature. The agent gets the best conditions for the new task, with nothing leaking in from the last one.

The most expensive thing you can do at this moment is keep going and "just give it more context." You don't need more signal. You need less noise. [Chapter 9](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/09-context-rot) is the full framework.

## Memory: durable context without re-briefing

The model has no persistent state. Every new session starts with an empty window. If you don't do something about that, you'll re-explain your codebase, your conventions, and your constraints every single time.

Both CLIs solve this the same way: specific files that load automatically at session start.

For Gemini CLI, that's `GEMINI.md` — user-global at `~/.gemini/GEMINI.md`, project at `./GEMINI.md`, and nested directory overrides. For Claude Code, it's `CLAUDE.md`, plus `.claude/rules/*.md` with `paths:` globs for path-scoped rules.

What belongs in these files:

- Build, test, and run commands with their real names — `pnpm test -- path/filter`, not "run the tests."
- Three sentences of architecture. Entry point, routing, persistence. No more.
- Conventions the model won't infer on its own — indent width, error handling patterns, naming across layers.
- Non-goals — the things you *don't* want the agent to do without asking.
- Pointers — "for anything about billing, start from `billing/README.md`."

What doesn't belong: secrets, bulk documentation, contradictory rules, transient state like "working on X this week."

The heuristic that's served me well: target under 200 lines. Memory files that stay small are load-bearing. Memory files at 2,000 lines are dead weight the model pays for every single turn. When yours starts creeping up, break detail out into separate files and pull them in with the `@path/to/file.md` import syntax that both CLIs support.
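
A line budget only holds if something checks it. Here is a minimal sketch of a check you could run in CI or a pre-commit hook; the 200-line figure is the heuristic above, not a limit either CLI enforces, and the function name is mine. The demo writes to a temporary directory so it runs anywhere.

```python
import tempfile
from pathlib import Path

MEMORY_FILENAMES = {"CLAUDE.md", "GEMINI.md"}
LINE_BUDGET = 200  # heuristic from the text, not a tool-enforced limit


def oversized_memory_files(root, budget: int = LINE_BUDGET) -> list[tuple[str, int]]:
    """Return (relative path, line count) for each memory file over budget."""
    root = Path(root)
    over = []
    for path in sorted(root.rglob("*.md")):
        if path.name not in MEMORY_FILENAMES:
            continue
        line_count = len(path.read_text(encoding="utf-8").splitlines())
        if line_count > budget:
            over.append((path.relative_to(root).as_posix(), line_count))
    return over


# Demo against a throwaway project layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CLAUDE.md").write_text("rule\n" * 250, encoding="utf-8")
    (Path(tmp) / "GEMINI.md").write_text("rule\n" * 40, encoding="utf-8")
    report = oversized_memory_files(tmp)

print(report)
```

This counts only the memory files themselves. Anything they import still lands in context, so a fuller version would follow imports and total the lines.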

[Chapter 4](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/04-memory) has the full hierarchy, starter templates for both CLIs, and the mistakes everyone makes on their first pass.

## The four primitives that matter

Once you have memory, you compose. Four primitives, each with a distinct sweet spot. Using the wrong one for a given capability is the source of most "my workflow is a mess" pain.

**Slash commands** are named prompts you invoke with `/name`. They're best for workflows *you* run yourself: `/research <topic>`, `/commit` that inspects the staged diff and proposes a conventional-commit message, `/review` with your team's checklist baked in. Written in TOML for Gemini, Markdown with YAML frontmatter for Claude. [Chapter 5.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/05-slash-commands)

**Skills** are capabilities the *agent* decides when to use. A skill is a directory with `SKILL.md` — the description loads up-front so the agent knows it exists; the body loads lazily when it's actually relevant. Good for "when the user asks for a PR description, produce it in our format." [Chapter 6.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/06-skills)

**MCP servers** expose tools to the agent: `search_issues`, `query_bigquery`, `create_branch`. The Model Context Protocol is the USB of agentic coding — one protocol, many servers, many clients. Anthropic introduced it; Google adopted it across Gemini CLI and the Agent Development Kit. Every server you connect contributes its tool descriptions to every single turn, so connect deliberately — the single most common MCP mistake is wiring up everything "just in case" and watching context costs explode. [Chapter 7.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/07-mcp)

**Hooks** are scripts the runtime executes on specific events — before a tool runs, after a file changes, when the session ends. They're where policy lives: block `rm -rf`, auto-format on edit, log every tool call for audit. Claude Code and Gemini CLI have closely similar hook models with slightly different event names but the same output protocol. [Chapter 8.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/08-hooks)
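
To make the "block `rm -rf`" case concrete, here is a sketch of the decision logic for a pre-tool hook. The event shape (`tool_name`, `tool_input.command`) and the "exit code 2 means block" convention are assumptions modeled on Claude Code's PreToolUse hooks; confirm both against your CLI's hook documentation before relying on them. The function names are mine.

```python
import json
import shlex
import sys


def is_recursive_force_rm(command: str) -> bool:
    """True if the command contains an rm call with both recursive and force flags."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable shell input: fail closed
    for i, token in enumerate(tokens):
        if token != "rm":
            continue
        recursive = force = False
        # Coarse on purpose: flags anywhere after `rm` count, even past `&&`.
        for arg in tokens[i + 1:]:
            if arg == "--recursive":
                recursive = True
            elif arg == "--force":
                force = True
            elif arg.startswith("-") and not arg.startswith("--"):
                recursive = recursive or "r" in arg or "R" in arg
                force = force or "f" in arg
        if recursive and force:
            return True
    return False


def decide(event: dict) -> tuple[int, str]:
    """Map a tool-call event to (exit code, message). Assumed: 2 blocks, 0 allows."""
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    if is_recursive_force_rm(command):
        return 2, f"Blocked by policy: recursive forced rm in {command!r}"
    return 0, ""


def main(stdin_text: str) -> int:
    code, message = decide(json.loads(stdin_text))
    if message:
        print(message, file=sys.stderr)
    return code


# A real hook script would end with: sys.exit(main(sys.stdin.read()))
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf build"}}))
```

Pattern-matching shell strings is a seatbelt, not a sandbox. It catches the common accident, and anything adversarial needs real isolation.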

The decision rule: a command that keeps growing "if" conditions wants to become a skill. A skill that must always fire wants to be a hook. A hook that handles rich domain logic wants to be a real tool exposed via MCP. Each primitive has a sweet spot. The book has working examples of every shape, and the [assets directory](https://github.com/vmishra/ai-coding-playbook/tree/main/assets) ships drop-in versions you can copy into `~/.claude/` or `~/.gemini/` today.

## Where the money actually goes

At individual-developer scale, an agentic CLI costs $20–$200 a month. Not a problem.

At team scale — 50, 100, 500 engineers running agents daily — it stops being a rounding error. The difference between a disciplined team and an undisciplined one is five figures a month, and it isn't about "which model." It's about context discipline.

Here's where tokens actually go, in rough order of size:

1. **File reads.** A single 1,000-line source file is 3–8k tokens. Do that twenty times in a session and you have 60–160k tokens of file content crowding out everything else.
2. **Tool descriptions.** Every MCP server, every skill, every command contributes to every turn's header.
3. **Intermediate reasoning.** Modern models produce substantial thinking that gets appended to context.
4. **Tool output.** Verbose `git log`, jest output, shell output.
5. **Memory files.** Loaded every session. A bloated `CLAUDE.md` is a recurring tax.
6. **Your prompts and the model's replies.** Usually the smallest slice.

Notice what's *not* on the list: the actual code changes. Writing a file is cheap. Reading files is expensive. That asymmetry is the core economic insight of agentic coding, and almost every optimization is an application of it.

The seven levers that matter, in decreasing impact:

1. **Explore first, read narrowly.** "Use Grep to find the three files relevant to X, then read only those in full" saves ~70% vs "understand the auth system."
2. **Offload reads to subagents.** A subagent that reads 40 files and returns a 500-token summary pulls ~60k tokens out of your primary context.
3. **Use prompt caching.** Keep stable stuff at the top, volatile stuff at the bottom. Don't edit `CLAUDE.md` mid-session — you invalidate the cache for everything after.
4. **Compact at the right moment** — not when you hit the limit, when your signal-to-noise drops. Often just after a heavy exploration phase.
5. **Route to cheaper models for cheap work.** Gemini Flash and Claude Haiku are 80–90% cheaper and completely fine for summarization, rote formatting, and unfamiliar-directory exploration.
6. **Scope MCP tool exposure.** A server you use once a week shouldn't load every session. Use per-project `.mcp.json`.
7. **Trim memory files.** Target under 200 lines and use `@path/to/file.md` imports for detail.

Worked math from real sessions: a typical "add a new endpoint with tests" task on an undisciplined run burns around 190k input tokens. The same task with discipline (trimmed memory, scoped MCP, subagent-delegated exploration, compaction between phases, Flash for summarization) runs at around 70k primary plus 30k on the subagent at cheap-tier pricing. With the subagent tokens priced 80–90% cheaper, that is roughly a 2.5× reduction in cost-weighted input tokens, and about 2.7× less in the primary context, at the same output quality.
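
The arithmetic behind that comparison is worth checking, because the ratio depends on what you count. The sketch below uses the token figures from the paragraph above and assumes subagent tokens are billed at 10–20% of the primary model's price, which is my reading of the "80–90% cheaper" figure for the cheap tier. It is not a measured price list.

```python
BASELINE_TOKENS = 190_000   # undisciplined run, all on the primary model
PRIMARY_TOKENS = 70_000     # disciplined run, primary context
SUBAGENT_TOKENS = 30_000    # disciplined run, delegated to a cheap-tier subagent


def weighted_tokens(primary: int, subagent: int, cheap_discount: float) -> float:
    """Express subagent tokens in primary-model-price equivalents."""
    return primary + subagent * (1 - cheap_discount)


raw_ratio = BASELINE_TOKENS / (PRIMARY_TOKENS + SUBAGENT_TOKENS)
primary_ratio = BASELINE_TOKENS / PRIMARY_TOKENS
print(f"raw tokens:      {raw_ratio:.2f}x fewer")
print(f"primary context: {primary_ratio:.2f}x fewer")

for discount in (0.80, 0.90):
    weighted = weighted_tokens(PRIMARY_TOKENS, SUBAGENT_TOKENS, discount)
    print(f"cost-weighted at {discount:.0%} cheaper: {BASELINE_TOKENS / weighted:.2f}x")
```

Raw token count drops by a bit under 2×, the primary context by about 2.7×, and the cost-weighted total by about 2.5–2.6×. The primary-context number is the one that matters for session quality; the cost-weighted number is the one that shows up on the bill.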

[Chapter 10](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/10-saving-tokens) is the one most teams want first. It's the economics chapter.

## Subagents: the biggest single lever

A subagent is a second instance of the agent, spawned with a narrow task, running in its own context window. When it finishes, it returns a summary.

Two benefits: **context isolation** (heavy reads happen elsewhere, your primary stays sharp) and **parallelism** (several subagents work simultaneously on independent pieces).

Two risks: they still cost tokens (you just moved the cost off your screen), and parallelism magnifies briefing errors — four subagents can solve the wrong problem fast.

The discipline is the same one that makes delegation work with humans: give them a scoped goal, trust them to execute, verify the output. And critically, tell them *what not to return* — otherwise you get the full transcript back and the whole point of the subagent evaporates.

[Chapter 11](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/11-subagents) is the full patterns library.

## The wider ecosystem the book covers

The playbook devotes chapters to things that matter but aren't strictly about the CLI:

- **Research as a first-class activity.** Web search and fetch, `/research` as a habit, durable research docs vs ephemeral lookups. [Chapter 12.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/12-research)
- **Tool connectivity.** Three auth shapes for wiring agents into Google Cloud, BigQuery, Firebase, and internal APIs, with principle-of-least-privilege patterns. [Chapter 13.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/13-connectivity)
- **IDE integration.** VS Code and JetBrains extensions, Firebase Studio (which replaced Project IDX in April 2025), and Cloud Workstations for team-scoped remote environments. [Chapter 14.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/14-ide)
- **Test generation.** The failure mode of AI-generated tests is tests that pass while the feature is broken. [Chapter 16](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/16-testing) is about avoiding that specifically.
- **UI generation.** [Stitch](https://stitch.withgoogle.com) from Google Labs with its agent-friendly `DESIGN.md` export, the newly launched Claude Design, v0 for production React, and Figma Dev Mode MCP for teams where Figma is source of truth. [Chapter 17.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/17-ui-generation)
- **Local models.** Qwen3-Coder, Gemma 4, DeepSeek. Ollama and vLLM. When local beats hosted. [Chapter 18.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/18-local-models)
- **Google Antigravity.** The agentic IDE with dual Editor and Mission Control views, up to 5 parallel agents. [Chapter 19.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/19-antigravity)
- **Agent SDKs.** [Google ADK](https://google.github.io/adk-docs/) for Python multi-agent systems on Vertex, and the Claude Agent SDK for TypeScript and Python. When the CLI isn't enough. [Chapter 20.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/20-agent-sdks)

## Six end-to-end playbooks

The book closes with six worked examples you can steal directly: triaging a vague bug report without skipping to a fix, a multi-file migration with subagent fan-out gated by pattern approval, a feature from spec to shipped with plan-before-edit, on-call triage that's explicitly read-only, research-driven decisions that produce committable ADRs, and PR triage across eight open PRs in ninety seconds.

Each one applies the same disciplines from the earlier chapters. If you've read Parts I–V, they feel obvious. If you haven't, they're the short path to seeing the whole system in motion. [Chapter 21.](https://github.com/vmishra/ai-coding-playbook/tree/main/chapters/21-workflows)

## What I wish I'd known on day one

Three things, if I could go back.

**Sessions should end at task boundaries.** Not at the end of the day. When a task is done, close the session. Open a new one for the next task. I resisted this for months and paid for every minute of resistance.

**Plan before edit, every time.** Gemini's `--plan` and Claude's `--permission-mode plan` are the single most useful flags for non-trivial work. Read the plan, push back, approve, *then* let the agent edit. It turns the session into a series of tiny reviewable steps instead of one giant "it did a thing" at the end.

**Write the spec in a file before you paste it.** The four-part brief — outcome, scope, constraints, verification — committed to `specs/` in the repo. It forces clarity before the agent starts working, which is vastly cheaper than redirecting a drifting agent forty turns in.

None of these are clever. All of them took me longer to adopt than they should have.

## Going deeper

This post compressed months of habits into around 2,800 words. The source material has real depth, and it's free:

- **The full book:** [github.com/vmishra/ai-coding-playbook](https://github.com/vmishra/ai-coding-playbook) — 21 chapters, both Gemini CLI and Claude Code, primary sources throughout.
- **The assets:** [`assets/`](https://github.com/vmishra/ai-coding-playbook/tree/main/assets) — runnable slash commands, skills, hook scripts, MCP configs, and settings starters you can drop into `~/.claude/` or `~/.gemini/` right now.
- **The annotated table of contents:** [`docs/TABLE_OF_CONTENTS.md`](https://github.com/vmishra/ai-coding-playbook/blob/main/docs/TABLE_OF_CONTENTS.md).

The book is MIT-licensed. Fork it, remix it, build a course on top of it, translate it — just don't claim you wrote it.

If you try something from it on a real codebase and it does (or doesn't) work the way a chapter claimed, please [open a field-report issue](https://github.com/vmishra/ai-coding-playbook/issues). That's the best feedback loop the book has, and it's how the next version stays honest.

---

*I'm [Vikas Mishra](https://vikasmishra.ai), an engineer at Google. The views here and in the playbook are mine, not my employer's.*


---


### Agentic Mesh, or Just Another Mesh? Auditing the Lineage Argument

- **Published:** 2026-04-19 (updated 2026-04-19)
- **URL:** https://vikasmishra.ai/blog/agentic-mesh-or-just-another-mesh/
- **Markdown source:** https://vikasmishra.ai/blog/agentic-mesh-or-just-another-mesh/index.md
- **Tags:** AI Agents, Architecture, Service Mesh, Platform Engineering, LLM

> The mesh metaphor for agent systems inherits a lineage that's rewriting itself. A careful audit of where the load-bearing work actually is.


*Disclaimer: All opinions expressed in this post are my own and do not represent the views or positions of my employer.*

The service mesh pattern is quietly rewriting itself. Istio's Ambient Mode ships without sidecars and went GA in Istio 1.24 (late 2024). Linkerd is healthy. Cilium's mesh is shipping. Managed offerings across the major clouds are shifting customers off per-pod sidecar models toward substrate-level connectivity. Service mesh isn't dead, but the pattern is admitting its original shape was wrong and re-architecting at the substrate, per-node instead of per-pod.

I bring this up because Eric Broda and Davis Broda's *Agentic Mesh* leans on service mesh as the first link in a lineage argument. The pitch goes service mesh, then data mesh, then agentic mesh. Each generation supposedly inherits the previous one's abstractions: discovery, telemetry, policy, a marketplace. Just as a service mesh abstracts the plumbing of network communication, the agentic mesh is supposed to abstract the orchestration of reasoning.

Here's the awkward bit. The first link in that lineage is in the middle of rewriting itself toward a model that isn't recognizably a "mesh" in the 2017 sense. The second link, data mesh, has been publicly repudiated by its creator. And the substrate the third link is asked to govern violates the one assumption every earlier mesh relied on: that the thing being meshed is deterministic enough to sign a contract.

I want to walk through the audit carefully, because I've watched enough "we need a mesh for this" conversations go sideways to know the cost of accepting the framing. The mesh metaphor isn't wrong in a way that will blow up a production system next Tuesday. It's wrong in a way that misdirects where the load-bearing work actually is. Misdirection on architecture is how $75M budgets get spent on the wrong tier.

## The second link has been repudiated by its own author

Zhamak Dehghani coined data mesh in 2019. In 2025 she launched Nextdata OS and published "Data mesh: the beginning, revisited." Her own summary of what went wrong: first-generation execution collapsed into data-product management, with "data products" in practice reduced to clean datasets plus metadata, deployed with old data tooling, and with only marginal domain engagement. Her reformulation recenters the primitive on autonomous data products as containers. A substrate, not a federation.

This is awkward for the lineage argument. The inheritance claim is being made at the exact moment the ancestor disowns the inheritance. And if you take Dehghani's 2025 reformulation seriously, what survives is closer to *microagent* (autonomous, self-contained, self-governing) than to *mesh* (federation, registry, marketplace). The wrong half of the book validates the right half.

You could say: fine, a mesh doesn't have to mean federation, the word is used loosely. But look at what the book concretely prescribes.

- A central registry storing agents, tools, policies, certifications, and security metadata — any metadata needed for the operation of the agentic mesh lives here.
- A single proxy as entry and exit point for the entire mesh.
- A curated marketplace framed explicitly as an Apple App Store analogue.
- Admin kill switches for misbehaving agents.
- Policies stored in a central registry, accessible to people and systems.

Every concrete enforcement primitive named is centralized. The "mesh" is a central control plane wearing federation as nomenclature, and it's internally consistent across the book's architectural prescriptions. Calling it a mesh imports the wrong mental model.

The security consequences of that mismatched model are where the design breaks first.

## Discovery is a trifecta-composition engine

This is the argument that most cleanly inverts the book's premise, so I'll lead with it.

Simon Willison's lethal trifecta: an agent becomes dangerous when it combines private data access, exposure to untrusted content, and external communication capability. It's a per-agent capability model. The security frameworks that actually work in 2026, zero trust and capability-based isolation, work on exactly this composition principle. Inheritance of capabilities is the attack surface.

Now watch what happens in a discoverable mesh. The stated design goal is to turn the mesh from a series of isolated agents into a single unified whole. Agents find and compose with peers through the registry.

Take a modest enterprise mesh of 100 agents. Assume each agent independently holds each of the three trifecta capabilities with probability 0.3. The numbers are illustrative; pick your own priors and the direction doesn't change.

Single-agent trifecta probability is 0.3³ = 0.027. About 3 agents out of 100 are dangerous in isolation.

Composition across ordered 3-agent paths gives 100 × 99 × 98 = 970,200 ordered triples. For a given triple (A, B, C), the probability that the three agents collectively cover all three trifecta categories, under independence, is roughly 0.28. Expected trifecta-covering paths: about 270,000. Even if you tighten the model to "each agent holds exactly one capability," the ratio only drops to the ~48,000× range.

Call it tens of thousands of fold either way, versus 3. The discovery feature multiplies the trifecta attack surface by four to five orders of magnitude. Drop the capability probability to 0.1 and the same model still leaves roughly 19,000 trifecta-covering paths.
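The composition arithmetic is easy to re-run under your own priors. A minimal sketch, assuming the same independence model as above; the function names are illustrative, not from the book or any library.

```python
from math import perm


def coverage_probability(p: float, k: int = 3, capabilities: int = 3) -> float:
    """P(k agents jointly hold every capability), each held independently with prob p."""
    covered_one = 1 - (1 - p) ** k  # at least one of the k agents holds a given capability
    return covered_one ** capabilities


def expected_counts(n_agents: int, p: float) -> tuple[float, float]:
    """Return (expected single-agent trifectas, expected covering ordered triples)."""
    singles = n_agents * p ** 3
    triples = perm(n_agents, 3) * coverage_probability(p)
    return singles, triples


singles, triples = expected_counts(100, 0.3)
print(f"single-agent trifectas:   {singles:.1f}")
print(f"covering ordered triples: {triples:,.0f}")
```

Swap in a different `p` or mesh size to see how quickly the path count outgrows the per-agent count.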

Per-agent certification cannot prevent this, because the composition happens *across* certified agents. Agent A is certified to access data. Agent B is certified to receive external messages. Agent C is certified to send outbound. Each is clean alone. A→B→C is a trifecta path. The registry's job is to make that path easy to find.

What you want instead is capability-based isolation at the gateway, with trifecta composition rejected at path-construction time before the planner returns. That's a single enforcement primitive, not a federated certification regime. The mesh ships the opposite.
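One way to picture that enforcement primitive: the gateway unions the capabilities along a proposed path and refuses any path that covers all three. This is a sketch under stated assumptions, not a production design; `Cap`, `path_allowed`, and the registry layout are hypothetical names of my own.

```python
from enum import Flag, auto


class Cap(Flag):
    """The three lethal-trifecta capability classes (names are illustrative)."""
    NONE = 0
    PRIVATE_DATA = auto()
    UNTRUSTED_INPUT = auto()
    EXTERNAL_COMMS = auto()


TRIFECTA = Cap.PRIVATE_DATA | Cap.UNTRUSTED_INPUT | Cap.EXTERNAL_COMMS


def path_capabilities(path, registry):
    """Union of the capabilities held by every agent on the path."""
    combined = Cap.NONE
    for agent in path:
        combined |= registry[agent]
    return combined


def path_allowed(path, registry):
    """Reject any path whose combined capabilities cover the full trifecta."""
    return (path_capabilities(path, registry) & TRIFECTA) != TRIFECTA


# Each agent is clean alone; only the three-hop composition is rejected.
registry = {"A": Cap.PRIVATE_DATA, "B": Cap.UNTRUSTED_INPUT, "C": Cap.EXTERNAL_COMMS}
candidates = [["A", "B"], ["B", "C"], ["A", "B", "C"]]
allowed = [p for p in candidates if path_allowed(p, registry)]
print(allowed)
```

The check runs on the path, not on the agent, which is exactly what per-agent certification cannot do.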

Security isn't the only axis where the mesh is optimizing the wrong variable. The reliability math is on the wrong axis too.

## The math is on the wrong axis

The book's motivating claim is that as more workflows are automated and more intelligence is distributed, coordination costs rise exponentially. The implicit cost function is N² coordination links. More participants means more coordination cost, so you need a mesh.

That's the wrong cost function.

End-to-end reliability of an agent chain is p^D, where p is per-step reliability and D is chain depth, assuming step failures are independent. Not N². D.

Here is the table. Per-step reliability down the rows, chain depth across the columns.

| Per-step p | D=3 | D=5 | D=10 |
|---|---|---|---|
| 0.95 | 0.857 | 0.774 | 0.599 |
| 0.98 | 0.941 | 0.904 | 0.817 |
| 0.99 | 0.970 | 0.951 | 0.904 |
| 0.999 | 0.997 | 0.995 | 0.990 |

A note on the numbers. τ-bench leaders land at about 60% pass@1, but that's end-to-end on whole tasks, and the task itself is a chain. I'm not plugging 0.60 in as per-step reliability; that would be the wrong denominator. The honest question is what per-step `p` the underlying tool-call and reasoning loop actually hits, and the answer from published traces lives somewhere in the 0.95–0.98 range on narrow tasks, dropping fast as tasks widen. Even charitably at 0.98, D=10 gets you 82% end-to-end. To hit a 4-nines SLA, the bar the book explicitly invokes, you need 0.9999^(1/10) ≈ 0.99999 per hop. Five nines per step.
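The table and the per-hop requirement both fall out of two one-line functions. A minimal sketch, assuming independent step failures; the function names are mine.

```python
def chain_reliability(p: float, depth: int) -> float:
    """End-to-end success probability of a chain of `depth` independent steps."""
    return p ** depth


def required_per_step(target: float, depth: int) -> float:
    """Per-step reliability needed to hit an end-to-end target at a given depth."""
    return target ** (1 / depth)


for p in (0.95, 0.98, 0.99, 0.999):
    row = "  ".join(f"{chain_reliability(p, d):.3f}" for d in (3, 5, 10))
    print(f"p={p}: {row}")

print(f"per-hop for four nines at D=10: {required_per_step(0.9999, 10):.6f}")
```

Plug in your own measured per-step `p` and observed chain depths before accepting anyone's SLA claim.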

Karpathy's frame for this on Dwarkesh was the "march of nines": each additional nine of reliability costs as much engineering as all the previous ones combined. Tesla Autopilot got two or three nines in five years on a narrower task surface. Agents are being asked for five nines per hop on open-ended tasks while the model distribution shifts every release.

Here's the part that inverts the book's argument. The mesh's discovery layer *lengthens chains*. A planner that can discover 500 peers constructs longer compositions than one that can only see three. The feature the book is most proud of, unified discoverability, makes the reliability problem strictly worse.

More discoverability yields deeper chains yields worse end-to-end success. The mesh is optimizing N while the universe punishes D.

And even if you solve the reliability problem, you have to pay for it. The coordination isn't free.

## Priced honestly, coordination isn't free

I want to be careful here, because it's easy to stretch a number and this number is load-bearing.

Anthropic's engineering blog reports their multi-agent research system uses roughly 15× the tokens of a single-shot chat baseline. That figure is their *specific* system on an *open-ended research workload* against a *no-orchestration* comparator. It isn't "the mesh tax" in the abstract.

But the book aspires to exactly the use case Anthropic measured, microagents composing to solve tasks no single agent could solve alone. So if you build the mesh to do the work Anthropic is doing, here is the arithmetic you inherit.

1M tasks a month, 50K tokens per task at the single-agent baseline, Sonnet 2026 blended pricing around $9 per million tokens.

- Single-agent: 50B tokens × $9/M = $450K/month.
- 15× coordinated: 750B tokens × $9/M = $6.75M/month.

Delta of $75.6M a year. Roughly 380 fully-loaded senior engineers. One full S&P 500 mid-cap IT line.
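The same arithmetic as a re-runnable sketch, so you can substitute your own volume, price, and multiplier. The inputs are the figures from the text; the function name is illustrative.

```python
def monthly_cost(tasks: int, tokens_per_task: int, usd_per_million: float,
                 multiplier: float = 1.0) -> float:
    """Monthly token spend in USD for a given task volume and coordination multiplier."""
    total_tokens = tasks * tokens_per_task * multiplier
    return total_tokens / 1_000_000 * usd_per_million


single = monthly_cost(1_000_000, 50_000, 9.0)
coordinated = monthly_cost(1_000_000, 50_000, 9.0, multiplier=15)
annual_delta = (coordinated - single) * 12

print(f"single-agent:    ${single:,.0f}/month")
print(f"15x coordinated: ${coordinated:,.0f}/month")
print(f"annual delta:    ${annual_delta:,.0f}")
```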

The critical Anthropic data point isn't the 15× itself. It's their finding that token usage explained 80% of the variance in eval scores. Which means the 15× isn't waste, it's the coordination doing the work. Which in turn means you cannot get the mesh's coordination benefit at a discount. The bill is the benefit.

Cognition (the Devin team) took the other side publicly: don't build multi-agents. Their argument is information-theoretic. Messages between agents are lossy compressions of the trace, and the loss is exactly the context that makes systems reliable. Their production stance is single-threaded with extended context and full trace sharing.

The live practitioner debate in 2026 isn't "mesh vs. no mesh." It's "no multi-agent" against "multi-agent with full trace propagation, paying the 15× tax knowingly." The book's middle position (multi-agent plus IID correlation plus metadata) isn't on that map. Neither of the people paying the bill picked it.

If the bill *is* the benefit, then certification is the mechanism that's supposed to guarantee you're buying the benefit. It doesn't.

## Certification is noisier than the thing it's certifying

The book proposes a trust framework. A certifier defines a behavioral standard. The agent gets tested. Certification attaches to agent metadata. Purpose-and-policy declarations are framed as public commitments that define the agent's operating contract.

Let's stress-test that contract. Assume a reasonable rubric: the reviewer runs 10 test cases and requires at least 8 passes. Assume a 60%-capable agent and independent trials. With X ~ Binomial(10, 0.6) counting passes, P(X ≥ 8) = 0.167. Run the same rubric twice and the probability of certify-then-fail-recert is 0.167 × 0.833 = 0.139.
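The rubric numbers can be checked directly. A minimal sketch of the binomial tail, assuming independent trials as in the text; the helper name is mine.

```python
from math import comb


def pass_probability(n_tests: int, min_passes: int, p: float) -> float:
    """P(at least `min_passes` successes in `n_tests` independent trials)."""
    return sum(
        comb(n_tests, k) * p ** k * (1 - p) ** (n_tests - k)
        for k in range(min_passes, n_tests + 1)
    )


certify = pass_probability(10, 8, 0.6)
flip = certify * (1 - certify)  # certified on the first run, failed on the second

print(f"P(certify)          = {certify:.3f}")
print(f"P(certify then fail) = {flip:.3f}")
```

Changing the threshold or the test count moves the numbers, but a 10-case rubric never gets the flip rate near zero for a mid-capability agent.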

That's the floor, and it gets worse.

τ-bench reports 25% cross-trial consistency, which means trials aren't independent. They're correlated in ways that make the Binomial model optimistic. When trials correlate positively, the variance of the sum grows, which inflates both false positives and false negatives at any threshold. The 14% instability rate is a *best case* under an assumption the underlying data contradicts.

I won't push the FDA 510(k) analogy here. Medical-device clearance and agent-quality signals are different disciplines, and the analogy is rhetorically strong but substantively sloppy. The honest claim is narrower: a certification regime whose false-recert rate is bounded below by 14% and unbounded above isn't a load-bearing quality gate.

The book claims that if an agent deviates from its stated purpose or breaches a defined policy, that deviation becomes a verifiable event, not a matter of interpretation. That sentence is false at the substrate level. When the agent's own pass rate is a noisy estimator with 25% cross-trial consistency, every event carries a non-trivial "same agent, unlucky sample" alternative hypothesis.

You can't certify away a noise-floor problem with a 10-test rubric. Hamel Husain's entire public thesis on evals is that AI product teams either do continuous trace-based evaluation on real traffic, or they firefight. The mesh has no evals primitive. It has certification, and certification is not evals.

Certification is the pre-flight check. The runtime control loop is what's supposed to catch the failures that slip through. That loop fails harder.

## The control loop is 60× slower than the damage

The book says behavioral anomalies escalate to human reviewers and promises SLAs around that escalation. Fine. Now look at the damage function.

- Published LangChain incident: $47,000 burned over 11 days in a retry loop. $2.97/minute sustained.
- Reddit r/AI_Agents anecdote: a $30K spike, roughly $10/minute.
- Worst-case rates cited: $100/minute during tight reasoning loops.

Typical enterprise SRE review SLA for a non-critical alert is 4 hours. 240 minutes.

| Scenario | $/min | $ per 4-hour review |
|---|---|---|
| Sustained retry loop | $2.97 | $712 |
| Spike | $10 | $2,400 |
| Worst case | $100 | $24,000 |
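The table is burn rate times review window. A short sketch using the figures quoted above; the function names are illustrative.

```python
REVIEW_WINDOW_MIN = 4 * 60  # the 4-hour review SLA from the text, in minutes


def burn_rate_per_min(total_usd: float, days: float) -> float:
    """Average spend per minute for an incident of the given size and duration."""
    return total_usd / (days * 24 * 60)


def exposure(usd_per_min: float, window_min: int = REVIEW_WINDOW_MIN) -> float:
    """Spend accumulated while waiting for a human review."""
    return usd_per_min * window_min


sustained = burn_rate_per_min(47_000, 11)
for label, rate in (("sustained", sustained), ("spike", 10.0), ("worst case", 100.0)):
    print(f"{label:>10}: ${rate:,.2f}/min -> ${exposure(rate):,.0f} per review window")
```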

The control loop is structurally 60× slower than the failure it controls. This isn't a tuning problem, it's a tier problem. High-frequency trading risk controls run in microseconds because the damage function runs in microseconds. Credit-card fraud pre-authorization runs under 100ms because chargebacks are expensive. Agent token spend is a pre-authorization control problem, not a post-hoc escalation problem.

What works is a gateway primitive: hard pre-call token budgets enforced at a single choke point, evaluated before the next API call fires. That isn't a mesh, it's a guardrail.
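A pre-call budget can be very small. The sketch below is a hypothetical gateway helper, not any vendor's API: it reserves an estimated token count before the call fires and settles to actual usage afterwards. `TokenBudget` and `guarded_call` are names I made up for illustration.

```python
class TokenBudget:
    """Hard per-run token budget checked before each model call."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.committed = 0  # tokens reserved or already spent

    def try_reserve(self, estimated_tokens: int) -> bool:
        """Reserve tokens for the next call; return False if it would exceed the budget."""
        if self.committed + estimated_tokens > self.max_tokens:
            return False
        self.committed += estimated_tokens
        return True

    def settle(self, reserved_tokens: int, actual_tokens: int) -> None:
        """Replace a reservation with the call's actual usage once it returns."""
        self.committed += actual_tokens - reserved_tokens


def guarded_call(budget: TokenBudget, estimated_tokens: int, call):
    """Run `call` only if the budget admits it; `call` returns (result, tokens_used)."""
    if not budget.try_reserve(estimated_tokens):
        raise RuntimeError("token budget exhausted; call blocked before it fired")
    result, used = call()
    budget.settle(estimated_tokens, used)
    return result
```

The check happens on the request path, so a retry loop stops at the budget rather than at the next human review. If `call` raises, the reservation is deliberately left in place, which errs on the side of blocking.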

Guardrails and substrate isolation aren't hypothetical. Two production incidents from last year show which primitives actually carry the load when things go wrong.

## The failure modes the mesh has no answer for

Two 2025 incidents worth naming carefully. Neither was a mesh deployment. Both are clean illustrations of what the mesh doesn't solve.

**Replit / SaaStr (July 2025).** Jason Lemkin's agent deleted a production database during an explicit code freeze, then initially reported that rollback wouldn't work. Lemkin recovered manually, proving the agent had fabricated the assessment. Replit's CEO shipped automatic dev/prod environment separation the following weekend.

The point isn't that a mesh failed. There was no mesh. The point is that the *fix*, hard environment isolation at the substrate, is the thing that would have helped in *any* architecture, including the book's. Meanwhile the thing the mesh actually ships (certification, policy, verifiable deviation) is defeated by the agent confabulating a cover story. The trust framework has no primitive for "the agent's self-report is unreliable," which is the entire category of failure production teams actually see.

**Klarna.** Their OpenAI bot handled 2.3M conversations in its first month of global operation, was marketed as replacing 700 FTE, and was rolled back in mid-2025 because customer satisfaction cratered. CEO: "We focused too much on efficiency and cost. The result was lower quality, and that's not sustainable." The valuation delta between AI-era peak and post-rollback reporting was around $39B. Attribution is contested; direction isn't.

Klarna wasn't a mesh deployment either. The point is that *any* marketplace-as-curation model inherits the same blind spot. Supply-side certification cannot insure demand-side product-market fit. The mesh marketplace is a supply-side guarantee wearing demand-side language: only authorized, verified, and policy-compliant agents get deployed. Being authorized to run isn't the same as being something anyone should run. The marketplace primitive can't close that gap, because the gap is between the system and the world, not inside the system.

These incidents aren't mesh failures. They're the shape of failure that any mesh built on the book's primitives would share.

A reasonable reader will push back at this point: if the primitives are that wrong, why is every big vendor shipping something that looks like a mesh?

## The counterargument I have to take seriously

You might say: every major vendor is shipping mesh-shaped agent platforms — AutoGen, Agent Broker, watsonx Orchestrate, and the various hyperscaler agent runtimes. If the mesh were wrong, why is every large vendor building one?

That's a real counterargument, and it deserves a real answer.

First, the hyperscalers are building registries plus identity plus gateway enforcement plus catalog UIs. The load-bearing primitives are orchestration-graph runtime, identity propagation, tool-call guardrails. None of them ships a "federated policy plane" or a "governance federation across agent fleets." They ship a single enforcement point, a catalog, and a trace backbone. If you tell me that's what the book means by a mesh, fine, but then the book's contrast with "just another platform" collapses.

Second, Salesforce's Agent Broker is in beta and its distinguishing feature is *deterministic* orchestration, which is the opposite of what the book's trust framework builds on. Microsoft's positioning around AutoGen has quietly moved from "multi-agent" to "agent loop with tools," which is the narrow Willison definition. The hyperscaler bets are compatible with the reframing I'm arguing for, not with the book's federated-mesh version.

Third, Gartner projects that over 40% of agentic AI projects will be cancelled by the end of 2027 and estimates that only about 130 of "thousands" of agentic AI vendors are legitimate. The hyperscaler platforms will be among the survivors. Mesh-shaped architectures will be a minority of those survivors. I won't put a specific number on it; the composed estimate is low-confidence and the point doesn't need false precision. The order of magnitude is small. "Mesh-shaped production deployment" is going to look more like a specialty configuration than a dominant pattern in any 2027 snapshot.

The more useful question isn't whether the big vendors are shipping meshes. It's what shape the best-shipping *practitioners* are converging on, and whether that shape looks like a mesh.

## The convergence nobody planned

If the mesh pattern worked as described, you'd see the best-shipping practitioners converging *toward* mesh primitives. Here is what they're converging on instead. This list is synthesis rather than original observation — I didn't discover any of these positions. What's new is putting them next to what the book proposes and noticing they don't overlap.

- **Cognition (Devin).** Single-threaded linear agents, full trace sharing, no multi-agent. The information-theoretic claim under it is sharp: every agent-to-agent message is a lossy compression of the context that would have made the next step reliable.
- **Anthropic.** Multi-agent with full trace propagation, a known 15× token tax, and rainbow deployments so upgrades don't kill in-flight runs. They pay the coordination bill in tokens rather than hope they can route around it — and their own data says token spend *is* the quality.
- **Replit (post-incident).** Automatic dev/prod environment separation, hard rails below the app layer. The fix that shipped under public pressure was substrate, not policy. That choice is diagnostic.
- **Chip Huyen.** A centralized platform of guardrails, model routing, caching, and monitoring; plan decoupled from execution; explicit resistance to adding agents where a deterministic workflow will do the job. Agents are added reluctantly, not architecturally.
- **Hamel Husain.** Continuous trace-based error analysis on real production traffic, every release. The gate is the eval on last week's traffic, not a certificate on paper.
- **Dehghani (2025).** Autonomous, self-governing products as the primitive — not a federation plane sitting above them. This is the coiner of "data mesh" pivoting away from the federated read of her own term.

Six practitioners, different problems, different companies, different years. None names "mesh" as the word for the load-bearing primitive. The unmined ore here is the *disagreement* between Cognition and Anthropic. One side says multi-agent messaging is lossy by construction; the other says it can be recovered with full-trace propagation at a 15× cost. That disagreement is a live research question. The mesh takes neither side and sits in a position nobody is defending.

If the mesh doesn't survive the audit intact, a quieter version of it still does.

## What survives

Strip the framing and the architecture contains real primitives.

What survives: a registry (identity plus metadata plus lookup), an audit trail (distributed tracing, which is OpenTelemetry rebranded for agents), pre-call enforcement at the gateway (the central proxy, if you take it seriously as a centralized choke point and drop the federation rhetoric), and environment isolation at the substrate (which the book barely mentions but Replit paid for in public).

What doesn't survive: the federation framing, the contract-as-policy framing, the marketplace-as-trust-guarantee framing, the "scale cognition via exponential coordination" argument, and the certification-as-quality-gate. Also the service-mesh analogy that holds the lineage together, because service mesh in 2026 is walking back toward the substrate, not forward into federation.

What's left is a policy gateway with a catalog. Centralized where enforcement has to be fast. Federated only where the cost of federation is lower than the cost of a bottleneck. Opinionated at the substrate. That shape already has prior art; it's what the shipping agent platforms actually provide underneath the marketing. It's also what Dehghani's 2025 pivot describes.

Architecture words recruit backlogs. Call something a mesh and you staff a coordination-fabric team. You buy a sidecar product. You build federated policy tooling. You spend budget on N² routing. Meanwhile the load-bearing work (hard gateway enforcement, environment isolation, eval discipline, context engineering) sits in somebody else's quarter.

Call it what it is. It will ship better that way.


---


### Inference's Techniques Age in Years. Its Economics Age in Weeks.

- **Published:** 2026-04-17 (updated 2026-04-18)
- **URL:** https://vikasmishra.ai/blog/inference-techniques-vs-economics/
- **Markdown source:** https://vikasmishra.ai/blog/inference-techniques-vs-economics/index.md
- **Tags:** LLM, Inference, AI Infrastructure, MLOps, Economics

> What survives and what inverts in modern inference engineering: the physics holds, but the build-vs-buy math shifts every quarter.


I read Philip Kiely's *Inference Engineering* cover to cover last weekend. It's a good book. The mechanics are clean, the diagrams are the right diagrams, and the author has clearly served real traffic. If a new grad joined my team tomorrow and asked what to read to stop embarrassing themselves in design reviews, I'd still hand them this book.

Half of it is aging badly anyway.

The math isn't wrong. The problem is that the market the book prices against moves so fast that even a manuscript finished in January 2026 and launched in February sits in a different world by spring. The build-vs-buy arithmetic that held at press time now points somewhere different. The techniques chapters age like Knuth. The economics chapters age like a Best Buy circular.

This post is my attempt to separate the two. What in a recent inference book still holds? What has quietly inverted since the ink dried? And the part I care about most: what's the replacement decision framework when the numbers shift again next quarter, which they will.

One caveat up front. I'm arguing with the *shape* of claims common across current inference writing, Kiely's book included. Where I cite a specific chapter, I name it. Where I'm responding to a framing ("self-host to save money," "speculative decoding is nearly free"), I'll call that out as framing rather than as a direct quote.

## What survives

Start with what ages well. The physics of decoder-only transformers hasn't moved since 2023. Memory bandwidth still bounds decode. Compute still bounds prefill. The roofline still has two sides.

That means the following lessons will still be true when my kids are in high school.

The KV cache is the bottleneck, not the weights. Lilian Weng made this observation in 2023 and it has only gotten truer. A 7B Llama-class model burns roughly 0.5 MB of KV per token. On an A100-80GB, after weights and framework overhead, you have about 56 GB of KV budget, which works out to 14 concurrent sequences at 8K context. The crossover where one sequence's KV cache exceeds the entire model weight footprint sits around 28K tokens. Teams routinely size clusters from parameter count and get blindsided at concurrency=14. It's the "works at 1 QPS, dies at 10 QPS" regression, and it will still bite people in 2030.
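The sizing numbers in that paragraph can be reproduced from model shape. A sketch under assumptions I'm supplying: a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, FP16), "GB" read as GiB, and a 14 GiB weight footprint, which is the round number the 28K crossover implies.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent(kv_budget_bytes: int, context_tokens: int, per_token_bytes: int) -> int:
    """Whole sequences of the given context length that fit in the KV budget."""
    return kv_budget_bytes // (context_tokens * per_token_bytes)


def crossover_tokens(weight_bytes: int, per_token_bytes: int) -> int:
    """Context length at which one sequence's KV cache matches the weight footprint."""
    return weight_bytes // per_token_bytes


per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)  # assumed 7B shape, FP16
print(per_token / 1024 ** 2, "MiB per token")
print(max_concurrent(56 * GIB, 8192, per_token), "sequences at 8K context")
print(crossover_tokens(14 * GIB, per_token), "tokens to match 14 GiB of weights")
```

Models with grouped-query attention have fewer KV heads, so rerun this with the real shape before sizing a cluster.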

Continuous batching is the first optimization that matters. Everything downstream (prefix caching, speculative decoding, tensor parallelism tuning) depends on you being batched first. Kiely gets this right, and so does every production team I've talked to.

Quantization is a quality signature, not a free win. Tim Dettmers' LLM.int8 result was specifically about outlier feature handling at 175B. The field extrapolated it to 7B and got burned. Recent benchmarks show INT4 on a 7B dropping 8–10% on tool-use evals, which wrecks agentic pipelines. FP8 is the current sweet spot. The mechanism (outliers, vector-wise scales, calibration sets) is evergreen.

Roofline reasoning. Horace He's "Making Deep Learning Go Brrrr" is cited across the inference literature for good reason. Identify which wall you're hitting before you optimize. Kiely internalizes this.

All of it belongs on the shelf next to Hennessy & Patterson.

## What inverted

The framing that stops aging well is the argument that you save money by self-hosting a 7B. That argument was priced against a market that's already moved.

The arithmetic matters here, because this is the kind of thing you need to be able to re-run yourself.

A well-batched 7B on an H100 sustains roughly 3,000 tokens per second on vLLM under favorable conditions. That's a public benchmark number on a single-node setup with saturated batching. AWS on-demand H100 pricing as of mid-2025 was $3.90/hr after the June price cut. GCP spot dipped to $2.25/hr. Do the division:

```
1,000,000 tokens ÷ 3,000 tok/s = 333 seconds = 0.0926 hours
On-demand: 0.0926 × $3.90 = $0.361 / M tokens
Spot:      0.0926 × $2.25 = $0.208 / M tokens
```

The bare hardware floor for a saturated single-tenant 7B is $0.21 to $0.36 per million tokens. That's a lab number. Your production number will be worse, because real workloads don't sustain saturated batching, latency SLOs force you below peak utilization, and you're paying for salaries, on-call, and observability on top.
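Since the point is to re-run this yourself, here is the division as a function. The `utilization` parameter is my addition, a crude stand-in for the gap between saturated lab throughput and what production traffic sustains.

```python
def cost_per_million_tokens(tokens_per_second: float, usd_per_gpu_hour: float,
                            utilization: float = 1.0) -> float:
    """GPU cost in USD to generate one million tokens at a sustained throughput."""
    effective_tps = tokens_per_second * utilization
    gpu_hours = 1_000_000 / effective_tps / 3600
    return gpu_hours * usd_per_gpu_hour


on_demand = cost_per_million_tokens(3000, 3.90)
spot = cost_per_million_tokens(3000, 2.25)
half_utilized = cost_per_million_tokens(3000, 3.90, utilization=0.5)  # assumed, not measured

print(f"on-demand:        ${on_demand:.3f} / M tokens")
print(f"spot:             ${spot:.3f} / M tokens")
print(f"on-demand at 50%: ${half_utilized:.3f} / M tokens")
```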

DeepSeek V3.2 is currently priced at roughly $0.20–$0.30 per million blended, depending on how heavily you weight cache hits against cold input and output. Two honest caveats before I use this number. First, V3.2 is a sparse MoE (~671B total, ~37B active). It's not a dense 7B, and its per-token compute cost is higher, not lower, than a 7B's. Second, Chinese frontier labs have been pricing aggressively to win share. These prices are plausibly below the vendor's marginal cost right now.

The conclusion I need doesn't require "API beats your hardware floor." It requires "API is close enough to your floor that the overhead eats the gap," and that claim survives both caveats. You're being offered a much stronger model, hosted, managed, with an SLA, at roughly the price of your unloaded GPU time. The math for single-workload self-hosting doesn't pencil, subsidy or not.

You won't beat that with a credit card and good intentions. You won't beat it with better kernels either. The hosted vendor is multiplexing thousands of tenants onto shared capacity at utilizations you structurally cannot match as a single-workload shop. It's the same pattern that beat enterprise colo in 2012 and self-run Hadoop in 2019.

When does self-host still win? I see two cases.

One is compliance. HIPAA, PCI, data residency. The traffic can't leave your VPC, so cost stops being the decision variable.

The other is a platform amortizing across many workloads. This is the case most current inference framing misses, and it's the one I want to spend real time on.

## Workloads, not tokens

The standard rule of thumb is that self-hosting pays off above roughly 2M tokens per day. I've quoted this number in design docs. I was wrong to, because the 2M figure assumes a single workload holding one GPU at ~50% utilization.

Production platforms don't look like that. Atlassian publicly reports greater than 60% cost reduction on an internal serving platform carrying LLMs, search encoders, and content moderation. The important detail is that those workloads are partially uncorrelated. LLM traffic peaks during US business hours. Moderation runs on a different diurnal curve. Search encoders fire on a third schedule.

The math changes when you amortize:

```
Single workload @ 2M tok/day:            50% util → break-even
Three uncorrelated workloads @ ~700K ea: ~50% combined util → break-even
```

Stated as a heuristic:

> Multi-workload break-even (tok/day) ≈ (single-workload break-even) ÷ N,
> where N is the number of uncorrelated workloads sharing capacity.

The correlation caveat is load-bearing. If all your AI features are user-facing chat that fires during the same business hours, N≈1 no matter how many features you ship. The win comes from mixing user-facing LLMs with batch moderation, offline search encoders, nightly summarization jobs, and so on.

When that mix exists and N is 5–10, per-workload break-even drops from 2M to somewhere in the 200K–400K tokens/day range. That's well within reach of companies I previously would have told to stay on an API.
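The heuristic as code, for dropping into a planning notebook. This is a sketch of the division rule as stated, with the 2M tok/day single-workload figure as the default; it assumes the workloads really are uncorrelated, which is the part you have to verify from traffic data.

```python
def per_workload_break_even(single_break_even: float, n_uncorrelated: int) -> float:
    """Per-workload break-even volume (tok/day) when N uncorrelated workloads share capacity."""
    if n_uncorrelated < 1:
        raise ValueError("need at least one workload")
    return single_break_even / n_uncorrelated


def clears_break_even(daily_tokens: list[float], single_break_even: float = 2_000_000) -> bool:
    """True if the workloads' average volume meets the per-workload break-even."""
    threshold = per_workload_break_even(single_break_even, len(daily_tokens))
    return sum(daily_tokens) / len(daily_tokens) >= threshold


print(per_workload_break_even(2_000_000, 5))    # N = 5
print(per_workload_break_even(2_000_000, 10))   # N = 10
print(clears_break_even([700_000, 700_000, 700_000]))
```

If the workloads peak at the same hours, pass them as a single entry; the rule only applies to the uncorrelated count.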

The counterintuitive consequence: a 500-person SaaS with eight uncorrelated AI features has better self-host economics than a 5,000-person SaaS with one heavily-loaded AI feature. The correct framing isn't "are we big enough?" It's "do we have enough inference-shaped workloads with different traffic shapes to share a GPU?"

Hazy Research and Character.AI have gestured at workload shape as the limiting variable. I haven't seen the generalization to a per-workload division rule stated crisply anywhere, and I'd be happy to be pointed at prior art if it exists.

## Why Stripe's 73% isn't your 73%

The other production number everyone quotes is Stripe's 73% inference cost reduction on 50M daily calls, down to a third of the prior GPU fleet after migrating from Hugging Face Transformers to vLLM. I've seen this cited in three vendor pitches in the last six months, usually to justify adopting whatever engine Stripe adopted.

That's the wrong takeaway, because the number isn't portable.

The total effective throughput gain works out to roughly 11×. Three mechanisms contribute, using rough public benchmarks.

Continuous batching over a naive HF Transformers baseline is about 4×. vLLM's original paper claimed 2.7×; it has improved since.

Prefix caching on templated prompts adds another 2–3×. Character.AI reports 95% prefix hit rates on conversational traffic, and separately has driven KV-cache size down by more than 20× via MQA, hybrid attention horizons, and cross-layer KV sharing. Stripe's workload (fraud classification, merchant categorization) is heavily templated in the same way.

FP8/INT8 numeric format contributes 1.5–2× on top of that.

Product: 4 × 2.5 × 1.7 ≈ 17× theoretical; 11× realized; the delta is interaction losses. That checks out qualitatively.

What I want to resist is pretending I can cleanly split 73% into decimal contributions. Taking logs, roughly: batching is a bit under half, workload shape a third, numeric format the rest. That's handwave-calibrated, not derived.
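For the record, this is the log split behind that sentence. The three factors are the midpoints used above, so the shares are only as good as those rough benchmarks.

```python
from math import log, prod

# Midpoint multipliers from the text; rough public benchmarks, not measurements of Stripe.
factors = {"continuous batching": 4.0, "prefix caching": 2.5, "numeric format": 1.7}


def log_shares(multipliers: dict[str, float]) -> dict[str, float]:
    """Each factor's share of the total gain, measured in log space."""
    total = sum(log(v) for v in multipliers.values())
    return {name: log(v) / total for name, v in multipliers.items()}


shares = log_shares(factors)
print(f"theoretical product: {prod(factors.values()):.1f}x")
for name, share in shares.items():
    print(f"{name:>20}: {share:.0%}")
```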

The qualitative claim is the load-bearing one. Most of the Stripe win is workload-shaped, not engine-shaped. Your open-ended chat workload, your RAG pipeline with fresh retrieval context, your code generation: none of those get a 95% prefix hit. Strip out the prefix contribution and what's left is roughly half of Stripe's number. Still great. But if you promised 73% in a design doc and delivered 40–50%, your credibility takes the hit.

This pattern keeps showing up. Production wins get credited to the engine when the bulk of the win was the shape of the workload the team built around that engine. Character.AI's 20× KV-cache reduction is a workload story (long dialogues plus custom MQA variants). Stripe's 73% is a workload story (templated classification plus prefix caching). Atlassian's 60% is a workload story (multi-tenant amortization). In every headline case, the engine did its job. But the win was shaped, not selected.

If an honest chapter on this existed, it would carry a title like *"Your workload is the optimization, not the engine."* The existing inference literature is scattered on this point. Hazy Research and Character.AI have the right instinct, but nobody has put it in a serving book as the central thesis.

## The retrofit tax

One more piece before the take-home. The "advanced" optimizations chapter, speculative decoding in particular, inherits Karpathy's framing: at batch=1, decode is memory-bandwidth-bound, so you have spare compute to verify K tokens in parallel. It's essentially free.

Karpathy is describing a mechanism. The usual inference-book treatment implicitly extends that into a project plan. Those are different things, and the gap is expensive.

Price it out. Retrofitting speculative decoding into an existing serving stack touches continuous batching, KV allocation, tokenizer alignment, observability, and the eval harness. Call it nine engineer-months, which is on the low end of what I'd budget for a serious integration. At a US-coastal fully-loaded $29K per engineer-month (this number roughly halves in Eastern Europe or India, so adjust), that's $261K of sunk cost.

Now the break-even. Here's where I have to correct the first draft of my own reasoning. The relevant opportunity cost isn't the API price. It's the self-host blended cost of the team considering the retrofit. By this post's own argument, a team at the scale where spec decoding retrofit is on the table is already self-hosting at multi-workload amortized rates, plausibly $0.05–$0.10/M. Using the API price here would understate the token volume required.

Recent production writeups suggest realized speedups come in well below the theoretical 2–3×. I'll use 25% as a midpoint for planning purposes; treat that as my assumption, not a cited number.

```
Break-even tokens:
$261K ÷ ($0.08/M × 0.25 savings) ≈ 13 trillion tokens
```

Thirteen trillion tokens a year is ~36 billion tokens a day. That's well above Stripe's published volume. For the handful of operators globally at that scale (a few dozen, not fifty), spec decoding is a rational retrofit. For everyone else, it's an NPV-negative project dressed up as a feature. Couple that with ZenML corpus data showing 9–18 month production model lifetimes, and the retrofit doesn't finish paying back before the model itself gets deprecated.
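The break-even arithmetic is easy to re-run against your own numbers. A minimal sketch, using the figures from this section as inputs; the function and parameter names are illustrative and not from any library:

```python
def break_even_tokens(retrofit_cost_usd, blended_cost_per_million_usd, savings_fraction):
    """Tokens that must be served before the retrofit pays for itself."""
    saved_per_million = blended_cost_per_million_usd * savings_fraction
    return retrofit_cost_usd / saved_per_million * 1_000_000


# Figures from the text: 9 engineer-months at $29K, $0.08/M blended cost,
# and an assumed 25% realized saving.
retrofit_cost = 9 * 29_000
tokens = break_even_tokens(retrofit_cost, 0.08, 0.25)
tokens_per_day = tokens / 365  # only meaningful if the payback window is one year

print(f"Break-even: {tokens / 1e12:.2f} trillion tokens")
print(f"Daily volume for a 1-year payback: {tokens_per_day / 1e9:.1f} billion tokens")
```

Swap in your own blended cost and a realized speedup you have measured; the result is very sensitive to both.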

If you're starting fresh, bake spec decoding in from day one. If you're retrofitting, run the arithmetic against your own blended cost, not mine. Most readers will find the number is "don't."

## The take-home rule

Two rules survive the numbers drifting.

First, read inference books for the physics and re-run the economics quarterly. Epoch AI puts the frontier inference API price decline at roughly 5–10× per year at constant capability. That doesn't translate one-to-one to build-vs-buy math, because GPU rental rates have fallen more like 2× over the same window, so the ratio that drives the flip has moved a few-fold over the last cycle. Not enough to invalidate the techniques, but enough that any specific dollar figure in even a recent manuscript is probably pointing at a drifting conclusion. Techniques age well because they're grounded in hardware ratios that move slowly. Economics age fast because they're priced off a commodity on a learning curve. Don't conflate the two.

Second, the right break-even question is "how many workloads?", not "how many tokens?". Self-hosting economics are governed by utilization, and utilization is governed by uncorrelated workload count. Everything else is secondary. If your AI features all fire at the same time on the same traffic shape, you don't have five workloads. You have one, five times over.
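A toy simulation makes the correlation point concrete. This is a sketch under loud assumptions: each workload is busy 20% of the time, capacity is provisioned to cover 99% of time steps, and all names are illustrative. Real traffic is messier, but the direction of the effect is the same.

```python
import random


def utilization(demand_series, percentile=0.99):
    """Mean demand divided by the capacity needed to cover `percentile` of time steps."""
    ordered = sorted(demand_series)
    capacity = ordered[int(percentile * (len(ordered) - 1))]
    return sum(demand_series) / len(demand_series) / capacity


def bursty_demand(rng):
    """Toy workload: busy (1.0) 20% of the time, near idle (0.1) otherwise."""
    return 1.0 if rng.random() < 0.2 else 0.1


rng = random.Random(0)
steps, workloads = 10_000, 5

# Five features that fire together: one traffic shape, five times over.
correlated = [workloads * bursty_demand(rng) for _ in range(steps)]

# Five features with independent traffic shapes.
uncorrelated = [
    sum(bursty_demand(rng) for _ in range(workloads)) for _ in range(steps)
]

util_correlated = utilization(correlated)
util_uncorrelated = utilization(uncorrelated)
print(f"correlated:   {util_correlated:.2f}")
print(f"uncorrelated: {util_uncorrelated:.2f}")
```

The same mean demand needs less provisioned capacity when the peaks do not line up, which is why the independent case shows higher utilization.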

Kiely's book will still be on my shelf next year. The KV cache chapter, the batching chapter, the roofline reasoning: those are the things you can't Google your way to in a hurry. But the chapters that quote prices, the ones that argue build-vs-buy from specific current numbers, those I mark with a post-it that says *check the market before you cite this.* Every quarter. Because the market keeps moving, and the book, being a book, does not.

The half-life of an inference technique is years. The half-life of an inference dollar figure is weeks. Books can't really help with the second problem. Spreadsheets can.

---

*The views and opinions expressed in this blog are my own and do not represent those of my employer.*


---


### Unlocking the Secret Sauce of LLMs: Why the Math Matters - A Technical Deep Dive

- **Published:** 2025-06-05 (updated 2026-01-26)
- **URL:** https://vikasmishra.ai/blog/unlocking-secret-sauce-llms-why-math-matters/
- **Markdown source:** https://vikasmishra.ai/blog/unlocking-secret-sauce-llms-why-math-matters/index.md
- **Tags:** LLM, Machine Learning, Mathematics, AI, Deep Learning

> Exploring the essential mathematical foundations that practitioners need when working with Large Language Models, from linear algebra to optimization.


*This article was originally published on [LinkedIn](https://www.linkedin.com/pulse/unlocking-secret-sauce-llms-why-math-matters-technical-vikas-mishra-phdac).*

When working with Large Language Models, understanding the underlying mathematics isn't just academic—it's the difference between blindly tuning hyperparameters and systematically solving problems. Let me share the essential mathematical concepts that have helped me debug, optimize, and deploy LLMs in production.

## The Five Mathematical Pillars of LLMs

### 1. Linear Algebra: The Language of Data

At its core, an LLM represents everything as vectors and matrices. Every word, every sentence, every concept lives in a high-dimensional vector space.

**Why it matters:**
- **Embeddings** are vectors that capture semantic meaning
- **Attention mechanisms** are essentially matrix operations
- **Dimensionality reduction** techniques like SVD help optimize performance

```python
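import numpy as np

# Word embeddings are vectors (768 or more dimensions in practice; 4 shown here)
word_embedding = np.array([0.2, -0.5, 0.8, 0.1])

# Attention is computed via dot products
def attention_score(query, key):
    return np.dot(query, key) / np.sqrt(len(key))

# Row-wise softmax so each token's attention weights sum to 1
def softmax(x, axis=-1):
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Self-attention across all tokens
def self_attention(Q, K, V):
    scores = np.matmul(Q, K.T) / np.sqrt(K.shape[-1])
    weights = softmax(scores)
    return np.matmul(weights, V)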
```

**Practical Application:** When I needed to reduce embedding dimensions for a production RAG system, understanding SVD and eigenvalue decomposition allowed me to compress 768-dimensional embeddings to 128 dimensions while retaining about 95% of the variance in the original embeddings.

### 2. Calculus: The Engine of Learning

Calculus, particularly derivatives and gradients, is how models learn. Every parameter update is driven by the gradient of the loss function.

**Key concepts:**
- **Gradients** tell us which direction to adjust parameters
- **Chain rule** enables backpropagation through deep networks
- **Higher-order derivatives** reveal the optimization landscape

```python
# Simplified gradient descent
def update_parameters(params, gradients, learning_rate):
    return params - learning_rate * gradients

# The chain rule in action during backpropagation
# If y = f(g(x)), then dy/dx = (df/dg) * (dg/dx)
def backward_pass(loss, layers):
    gradient = compute_gradient(loss)
    for layer in reversed(layers):
        gradient = layer.backward(gradient)  # Chain rule applied
    return gradient
```
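The chain rule is easy to sanity-check numerically. A minimal sketch with a toy composition `y = sin(x² + 1)`; the functions here are illustrative, not part of any training framework. The analytic gradient should agree with a finite-difference estimate, which is the same check you would run on a hand-written `backward` method.

```python
import math


def g(x):
    return x ** 2 + 1.0


def f(u):
    return math.sin(u)


def analytic_grad(x):
    """dy/dx for y = f(g(x)) via the chain rule: cos(g(x)) * 2x."""
    return math.cos(g(x)) * (2.0 * x)


def numeric_grad(x, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)


x = 0.7
print(analytic_grad(x), numeric_grad(x))
```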

### 3. Optimization: Finding the Best Parameters

Training an LLM means finding optimal values for billions of parameters. This is a massive optimization problem.

**Essential techniques:**
- **Gradient Descent** and its variants (SGD, Adam, AdamW)
- **Learning rate scheduling** (warmup, cosine decay)
- **Gradient clipping** for stability

```python
# AdamW optimizer - the workhorse of LLM training
class AdamW:
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01):
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.weight_decay = weight_decay
        self.m = {p: 0 for p in params}  # First moment
        self.v = {p: 0 for p in params}  # Second moment
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        for p in params:
            # Momentum and RMSprop components
            self.m[p] = self.beta1 * self.m[p] + (1 - self.beta1) * grads[p]
            self.v[p] = self.beta2 * self.v[p] + (1 - self.beta2) * grads[p]**2

            # Bias correction
            m_hat = self.m[p] / (1 - self.beta1**self.t)
            v_hat = self.v[p] / (1 - self.beta2**self.t)

            # Update with weight decay (decoupled)
            params[p] -= self.lr * (m_hat / (np.sqrt(v_hat) + 1e-8) +
                                    self.weight_decay * params[p])
```

### 4. Probability & Statistics: The Foundation of Generation

LLMs are fundamentally probabilistic models. They predict the probability distribution over the next token.

**Core concepts:**
- **Softmax** converts logits to probabilities
- **Cross-entropy loss** measures prediction quality
- **Sampling strategies** (temperature, top-k, top-p) control generation

```python
def softmax(logits, temperature=1.0, axis=-1):
    """Convert logits to probabilities with temperature scaling."""
    scaled = logits / temperature
    # Subtract the max along `axis` for numerical stability
    exp_scaled = np.exp(scaled - np.max(scaled, axis=axis, keepdims=True))
    return exp_scaled / np.sum(exp_scaled, axis=axis, keepdims=True)

def top_p_sampling(probs, p=0.9):
    """Nucleus sampling - sample from smallest set with cumulative prob >= p."""
    sorted_indices = np.argsort(probs)[::-1]
    cumsum = np.cumsum(probs[sorted_indices])
    cutoff_idx = np.searchsorted(cumsum, p)
    allowed_indices = sorted_indices[:cutoff_idx + 1]
    renormalized = probs[allowed_indices] / probs[allowed_indices].sum()
    return np.random.choice(allowed_indices, p=renormalized)
```
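Top-k is the third strategy in the list above. A minimal pure-Python sketch follows; the helper names are illustrative. It also shows how temperature changes the shape of the distribution before any filtering is applied.

```python
import math


def softmax_list(logits, temperature=1.0):
    """Softmax over a plain list of logits, with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
sharp = softmax_list(logits, temperature=0.5)  # lower temperature: more peaked
flat = softmax_list(logits, temperature=2.0)   # higher temperature: flatter
top2 = top_k_filter(softmax_list(logits), k=2)
```

Unlike top-p, the number of surviving tokens is fixed at `k` regardless of how confident the model is.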

### 5. Transformers & Attention: Where It All Comes Together

The transformer architecture is where all these mathematical concepts converge.

```python
class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Learnable projection matrices
        self.W_q = np.random.randn(d_model, d_model) * 0.02
        self.W_k = np.random.randn(d_model, d_model) * 0.02
        self.W_v = np.random.randn(d_model, d_model) * 0.02
        self.W_o = np.random.randn(d_model, d_model) * 0.02

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape

        # Project to Q, K, V
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        # Reshape for multi-head attention
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)

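        # Scaled dot-product attention. Swap the last two axes of K: NumPy's
        # .transpose() needs a full axis permutation, unlike torch's transpose(-2, -1)
        scores = (Q @ K.swapaxes(-2, -1)) / np.sqrt(self.d_k)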
        attn_weights = softmax(scores, axis=-1)
        context = attn_weights @ V

        # Concatenate heads and project
        context = context.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)
        return context @ self.W_o
```

## Real-World Case Studies

### Case 1: Debugging Training Instability

**Problem:** Fine-tuning Gemma resulted in loss spikes and NaN gradients.

**Mathematical diagnosis:** Analyzed gradient distributions and found exploding gradients in early layers.

**Solution:**
```python
# Gradient clipping with norm monitoring
def clip_grad_norm(parameters, max_norm=1.0):
    parameters = list(parameters)  # Materialize: we iterate over it twice
    total_norm = 0
    for p in parameters:
        total_norm += np.sum(p.grad ** 2)
    total_norm = np.sqrt(total_norm)

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            p.grad *= clip_coef

    return total_norm  # Monitor this!

# Cosine decay with warmup
def cosine_schedule(step, warmup_steps, total_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * progress))
```

### Case 2: Optimizing Embedding Search

**Problem:** RAG system with 768-dimensional embeddings had 800ms query latency.

**Mathematical approach:** Applied PCA via eigenvalue decomposition to find optimal dimensionality.

**Result:** Reduced to 128 dimensions, achieving 120ms latency (6.7x improvement) with minimal accuracy loss.

```python
from sklearn.decomposition import PCA

# Analyze variance explained by each dimension
pca = PCA(n_components=768)
pca.fit(embeddings)

# Find dimensions that capture 95% variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
optimal_dims = np.searchsorted(cumsum, 0.95) + 1  # Often ~100-150

# Reduce dimensions
pca_reduced = PCA(n_components=optimal_dims)
reduced_embeddings = pca_reduced.fit_transform(embeddings)
```

### Case 3: Attention Analysis for Classification

**Problem:** Text classifier stuck at 76% accuracy.

**Mathematical insight:** Calculated attention entropy to identify problematic patterns.

```python
def attention_entropy(attn_weights):
    """High entropy = diffuse attention, Low entropy = focused attention."""
    # Add small epsilon for numerical stability
    entropy = -np.sum(attn_weights * np.log(attn_weights + 1e-10), axis=-1)
    return entropy

# Found that model attention was too concentrated on [CLS] token
# Solution: Added auxiliary loss to encourage distributed attention
```

**Result:** Attention regularization improved accuracy from 76% to 92%.

## Key Takeaways

1. **Linear algebra** helps you understand and manipulate embeddings effectively
2. **Calculus** enables you to debug training issues by analyzing gradients
3. **Optimization** knowledge lets you choose the right training configuration
4. **Probability** is essential for controlling generation behavior
5. **Understanding attention** helps diagnose model behavior issues

The mathematics isn't just theory—it's your debugging toolkit. When something goes wrong (and it will), mathematical intuition helps you identify the root cause instead of randomly trying different configurations.

---

*What mathematical concepts have you found most useful when working with LLMs? I'd love to hear about your experiences.*


---


### Beyond Frontier Models: A Technical Deep Dive & Practitioner's Guide to LLM Tuning, Optimization, Deployment, and Architecture

- **Published:** 2025-05-01 (updated 2026-01-26)
- **URL:** https://vikasmishra.ai/blog/beyond-frontier-models-llm-tuning-guide/
- **Markdown source:** https://vikasmishra.ai/blog/beyond-frontier-models-llm-tuning-guide/index.md
- **Tags:** LLM, Machine Learning, MLOps, Fine-tuning, RAG, AI Infrastructure

> A comprehensive guide covering post-training techniques, production deployment, retrieval augmentation, evaluation, hallucination management, and advanced architectures for LLMs.


*This article was originally published on [LinkedIn](https://www.linkedin.com/pulse/beyond-frontier-models-technical-deep-dive-guide-llm-tuning-mishra-jiagc/).*

*Disclaimer: All opinions expressed in this post are my own and do not represent the views or positions of my employer.*

The most effective LLM systems aren't built by treating training, deployment, and evaluation as isolated steps. They emerge from understanding how these components interconnect. In this guide, I'll share the technical details that matter for building production-grade LLM applications.

---

## Part 1: Technical Foundations of LLM Post-Training

Post-training transforms a general-purpose language model into something useful for your specific domain. There are three key phases.

### Phase 1: Supervised Fine-Tuning (SFT)

SFT converts a raw model into an instruction follower using curated instruction-response pairs.

```python
# Training data format for instruction tuning
training_examples = [
    {
        "instruction": "Summarize the following article in 3 bullet points.",
        "input": "<article text>",
        "output": "• Key point 1\n• Key point 2\n• Key point 3"
    },
    # Diverse phrasing matters!
    {
        "instruction": "Give me a 3-point summary of this text.",
        "input": "<article text>",
        "output": "..."
    }
]

# SFT training loop
def sft_training_step(model, batch, optimizer):
    inputs = tokenize(batch["instruction"] + batch["input"])
    targets = tokenize(batch["output"])

    logits = model(inputs)
    loss = cross_entropy_loss(logits, targets)

    optimizer.zero_grad()  # Clear gradients left over from the previous step
    loss.backward()
    optimizer.step()
    return loss
```

**Key considerations:**
- Diverse instruction phrasing prevents overfitting to specific formats
- Quality over quantity—1000 excellent examples beat 100,000 mediocre ones
- Include edge cases and failure modes in training data

### Phase 2: Preference Optimization (RLHF, DPO, RLAIF)

Preference optimization teaches models to generate responses humans prefer.

```python
# Direct Preference Optimization (DPO) - simpler than RLHF
def dpo_loss(model, ref_model, preferred, rejected, beta=0.1):
    """
    DPO loss without explicit reward modeling.
    beta controls deviation from reference model.
    """
    # Log probabilities under current and reference models
    pi_preferred = model.log_prob(preferred)
    pi_rejected = model.log_prob(rejected)
    ref_preferred = ref_model.log_prob(preferred)
    ref_rejected = ref_model.log_prob(rejected)

    # DPO objective
    preferred_ratio = pi_preferred - ref_preferred
    rejected_ratio = pi_rejected - ref_rejected

    # logsigmoid is the numerically stable form of log(sigmoid(x))
    loss = -F.logsigmoid(beta * (preferred_ratio - rejected_ratio))
    return loss.mean()
```

**When to use what:**
- **RLHF**: Maximum control, but complex (requires reward model + PPO)
- **DPO**: Simpler, often comparable results
- **RLAIF**: When human feedback is expensive, use AI feedback

### Phase 3: Domain-Adaptive Continued Pre-training

For specialized domains (legal, medical, finance), continued pre-training on domain corpora bridges the knowledge gap.

```python
# Domain adaptation strategy
domain_training_config = {
    "learning_rate": 1e-5,  # Lower than initial pre-training
    "warmup_ratio": 0.1,
    "epochs": 2,  # Don't overtrain - causes forgetting
    "data_mix": {
        "domain_specific": 0.7,
        "general": 0.3  # Prevents catastrophic forgetting
    }
}
```
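One way to realize the `data_mix` ratios is to pick the source corpus per example. This is a minimal sketch with placeholder corpora and a hypothetical `sample_mixed_batch` helper; a production pipeline would more likely mix at the dataset or shard level.

```python
import random


def sample_mixed_batch(domain_docs, general_docs, batch_size, domain_ratio=0.7, seed=None):
    """Draw a batch where each slot is domain-specific with probability `domain_ratio`."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = domain_docs if rng.random() < domain_ratio else general_docs
        batch.append(rng.choice(source))
    return batch


domain_docs = [f"legal-{i}" for i in range(100)]     # placeholder corpus
general_docs = [f"general-{i}" for i in range(100)]  # placeholder corpus
batch = sample_mixed_batch(domain_docs, general_docs, batch_size=10_000, seed=0)
domain_share = sum(doc.startswith("legal-") for doc in batch) / len(batch)
print(f"domain share: {domain_share:.2f}")
```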

### Loss Functions for Different Objectives

```python
# Cross-Entropy: Standard next-token prediction
def cross_entropy_loss(logits, targets):
    # Flatten to (num_tokens, vocab_size); the vocab size is the last logits dimension
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# Triplet Loss: For embedding training
def triplet_loss(anchor, positive, negative, margin=0.2):
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    return F.relu(pos_dist - neg_dist + margin).mean()

# KL Divergence: For distillation or regularization
def kl_divergence_loss(student_logits, teacher_logits, temperature=2.0):
    # F.kl_div expects log-probabilities as input and probabilities as target
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')

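# Contrastive Loss (InfoNCE): For embedding models
def contrastive_loss(anchors, positives, temperature=0.07):
    # Row i of `positives` is the positive for row i of `anchors`; every other
    # row in the batch serves as a negative. Comparing a batch with itself
    # would make the loss trivial, since each embedding matches itself best.
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    similarity = anchors @ positives.T / temperature
    labels = torch.arange(len(anchors), device=anchors.device)
    return F.cross_entropy(similarity, labels)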
```

### Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning is expensive. PEFT methods update only a fraction of parameters.

#### LoRA (Low-Rank Adaptation)

```python
class LoRALayer(nn.Module):
    """
    LoRA: Learns low-rank updates to frozen weights.
    W_new = W_frozen + (B @ A) * scaling, with A: (rank, in) and B: (out, rank)
    """
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.rank = rank
        self.scaling = alpha / rank

        # Frozen original weights
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)

        # Trainable low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Original forward + low-rank update
        original = F.linear(x, self.weight)
        lora_update = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return original + lora_update * self.scaling
```
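The savings come from simple arithmetic: a rank-`r` update stores `r * (in + out)` values instead of `in * out`. A small sketch, using an illustrative 4096 × 4096 projection; the shape is an assumption and not tied to any particular model.

```python
def lora_param_counts(in_features, out_features, rank):
    """Return (full_params, lora_params) for one linear layer."""
    full = in_features * out_features
    lora = rank * in_features + out_features * rank  # lora_A plus lora_B
    return full, lora


# Illustrative shape: a 4096 x 4096 projection with rank 8
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For this shape the adapter trains 65,536 parameters against 16,777,216 frozen ones, which is 1/256 of the layer.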

#### QLoRA: Quantization + LoRA

```python
# QLoRA configuration for memory-efficient fine-tuning
qlora_config = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",  # NormalFloat4 - optimal for weights
    "bnb_4bit_compute_dtype": torch.bfloat16,
    "bnb_4bit_use_double_quant": True,  # Quantize the quantization constants
    "lora_r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
}
```

**Memory comparison for 7B model (rough figures; actual usage depends on optimizer, precision, batch size, and sequence length):**
- Full fine-tuning: ~60GB VRAM
- LoRA (r=8): ~16GB VRAM
- QLoRA (4-bit + LoRA): ~6GB VRAM

---

## Part 2: Production Deployment Architecture

### Inference Parameters That Matter

```python
generation_config = {
    # Temperature: Controls randomness
    # 0.0-0.3: Factual, deterministic tasks
    # 0.7-1.0: Creative tasks
    "temperature": 0.7,

    # Top-p (nucleus sampling): Dynamic vocabulary filtering
    # Samples from smallest set with cumulative probability >= p
    "top_p": 0.9,

    # Top-k: Fixed vocabulary limit
    "top_k": 50,

    # Repetition control
    "presence_penalty": 0.6,   # Penalize tokens that appeared
    "frequency_penalty": 0.3,  # Penalize based on frequency
}

def generate_with_config(model, prompt, config):
    logits = model(prompt)

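    # Apply temperature (0 means greedy decoding; avoid dividing by zero)
    if config["temperature"] == 0:
        return torch.argmax(logits, dim=-1, keepdim=True)
    logits = logits / config["temperature"]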

    # Apply top-k filtering
    if config["top_k"] > 0:
        indices_to_remove = logits < torch.topk(logits, config["top_k"])[0][..., -1, None]
        logits[indices_to_remove] = float('-inf')

    # Apply top-p filtering
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    sorted_indices_to_remove = cumulative_probs > config["top_p"]
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = 0
    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
    logits[indices_to_remove] = float('-inf')

    # Sample
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```
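The config above sets `presence_penalty` and `frequency_penalty`, but the sampling function does not apply them. A minimal sketch of one common formulation follows: subtract a flat amount for any token that has appeared, plus an amount proportional to its count. Treat the exact formula as an assumption, since serving stacks differ in the details; the function name is illustrative.

```python
from collections import Counter


def apply_repetition_penalties(logits, generated_token_ids,
                               presence_penalty=0.6, frequency_penalty=0.3):
    """Lower the logits of tokens that already appeared in the output.

    logits: list of floats indexed by token id.
    generated_token_ids: token ids produced so far.
    """
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= presence_penalty           # flat: token appeared at all
        adjusted[token_id] -= frequency_penalty * count  # grows with each repeat
    return adjusted


logits = [1.0, 1.0, 1.0, 1.0]
adjusted = apply_repetition_penalties(logits, generated_token_ids=[2, 2, 3])
print(adjusted)
```

Penalties like these would be applied to the logits before temperature scaling and top-k/top-p filtering.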

### Hardware & Memory Considerations

```python
def calculate_memory_requirements(
    num_params_billions,
    precision="fp16",
    batch_size=1,
    sequence_length=2048
):
    """Estimate GPU memory for inference."""

    bytes_per_param = {
        "fp32": 4,
        "fp16": 2,
        "bf16": 2,
        "int8": 1,
        "int4": 0.5
    }

    # Model weights
    model_memory_gb = (num_params_billions * 1e9 *
                       bytes_per_param[precision]) / 1e9

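    # KV cache: K and V vectors for every layer, per token.
    # Crude heuristics: hidden_size ≈ sqrt(params in billions) * 1000 and
    # ~10 layers per billion params, capped at 100. Ignores GQA/MQA and
    # assumes the cache uses the same precision as the weights.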
    hidden_size = int((num_params_billions ** 0.5) * 1000)
    num_layers = min(int(num_params_billions * 10), 100)
    kv_cache_per_token = 2 * num_layers * hidden_size * bytes_per_param[precision]
    kv_cache_gb = (batch_size * sequence_length * kv_cache_per_token) / 1e9

    # Activation memory (rough estimate)
    activation_gb = model_memory_gb * 0.1

    return {
        "model_weights_gb": model_memory_gb,
        "kv_cache_gb": kv_cache_gb,
        "activations_gb": activation_gb,
        "total_gb": model_memory_gb + kv_cache_gb + activation_gb
    }

# Example: 70B model (defaults: batch_size=1, sequence_length=2048)
# calculate_memory_requirements(70, "fp16")["total_gb"] → ~161GB
# calculate_memory_requirements(70, "int4")["total_gb"] → ~40GB
```

### Inference Optimization Techniques

#### Speculative Decoding

```python
def speculative_decode(target_model, draft_model, prompt, num_speculative=4):
    """
    Use small draft model to propose tokens, verify with target model.
    Can achieve 2-3x speedup.
    """
    generated = prompt

    while not is_complete(generated):
        # Draft model generates multiple tokens quickly
        draft_tokens = []
        draft_probs = []
        for _ in range(num_speculative):
            logits = draft_model(generated + draft_tokens)
            token = sample(logits)
            draft_tokens.append(token)
            draft_probs.append(F.softmax(logits, dim=-1))

        # Target model verifies all at once (single forward pass)
        target_logits = target_model(generated + draft_tokens)
        target_probs = F.softmax(target_logits, dim=-1)

        # Accept tokens where target agrees with draft
        accepted = 0
        for i, (draft_p, target_p) in enumerate(zip(draft_probs, target_probs)):
            # Acceptance probability
            r = random.random()
            if r < min(1, target_p[draft_tokens[i]] / draft_p[draft_tokens[i]]):
                accepted += 1
            else:
                break

        generated += draft_tokens[:accepted]

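        # Sample one more from target if we rejected early.
        # Simplification: exact speculative sampling draws this token from the
        # normalized residual max(0, target_p - draft_p), not from target_p itself.
        if accepted < num_speculative:
            generated += sample(target_logits[accepted])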

    return generated
```
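The accept/reject rule is the part that keeps the output distribution identical to the target model's. A self-contained toy sketch of a single step over a three-token vocabulary follows. The two distributions are made-up numbers and `speculative_accept` is an illustrative name, so treat this as a check of the mechanism and not a serving implementation.

```python
import random


def speculative_accept(draft_token, draft_probs, target_probs, rng):
    """One accept/reject step of speculative sampling over a toy vocabulary.

    Accept the drafted token with probability min(1, p_target / p_draft);
    on rejection, resample from the normalized residual max(0, p_target - p_draft).
    """
    ratio = target_probs[draft_token] / draft_probs[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    return rng.choices(range(len(residual)), weights=residual)[0]


rng = random.Random(0)
draft_probs = [0.6, 0.3, 0.1]   # toy draft-model distribution (assumed)
target_probs = [0.3, 0.5, 0.2]  # toy target-model distribution (assumed)

counts = [0, 0, 0]
trials = 100_000
for _ in range(trials):
    drafted = rng.choices(range(3), weights=draft_probs)[0]
    counts[speculative_accept(drafted, draft_probs, target_probs, rng)] += 1

empirical = [c / trials for c in counts]
print(empirical)  # should sit close to target_probs
```

Even though every candidate comes from the draft distribution, the accepted-or-resampled token follows the target distribution, which is why the technique does not change output quality.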

#### KV Cache Optimization

```python
class PagedKVCache:
    """
    vLLM-style paged attention for efficient memory usage.
    Allocates memory in pages, enables sharing across requests.
    """
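    def __init__(self, page_size=16, max_pages=1000, hidden_size=4096):
        self.page_size = page_size
        self.max_pages = max_pages
        self.hidden_size = hidden_size  # Width of each cached vector (illustrative default)
        self.pages = {}  # page_id -> tensor
        self.page_table = {}  # request_id -> [page_ids]

    def allocate_page(self):
        if len(self.pages) >= self.max_pages:
            raise MemoryError("KV cache is out of pages")
        page_id = len(self.pages)
        self.pages[page_id] = torch.zeros(self.page_size, self.hidden_size)
        return page_id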

    def get_kv_cache(self, request_id, position):
        page_idx = position // self.page_size
        offset = position % self.page_size

        if request_id not in self.page_table:
            self.page_table[request_id] = []

        while len(self.page_table[request_id]) <= page_idx:
            self.page_table[request_id].append(self.allocate_page())

        page_id = self.page_table[request_id][page_idx]
        return self.pages[page_id][offset]
```

---

## Part 3: RAG and Retrieval Strategies

### Hybrid Retrieval Architecture

```python
class HybridRetriever:
    """
    Combines dense (semantic) and sparse (keyword) retrieval.
    """
    def __init__(self, dense_model, sparse_index, dense_index):
        self.dense_model = dense_model
        self.sparse_index = sparse_index  # BM25, TF-IDF
        self.dense_index = dense_index    # FAISS, Pinecone

    def retrieve(self, query, k=10, dense_weight=0.7):
        # Dense retrieval
        query_embedding = self.dense_model.encode(query)
        dense_results = self.dense_index.search(query_embedding, k=k*2)

        # Sparse retrieval
        sparse_results = self.sparse_index.search(query, k=k*2)

        # Reciprocal Rank Fusion
        scores = {}
        for rank, (doc_id, _) in enumerate(dense_results):
            scores[doc_id] = scores.get(doc_id, 0) + dense_weight / (rank + 60)

        for rank, (doc_id, _) in enumerate(sparse_results):
            scores[doc_id] = scores.get(doc_id, 0) + (1 - dense_weight) / (rank + 60)

        # Return top-k by fused score
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return ranked[:k]
```
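The fusion step can be tried in isolation. A minimal standalone sketch with toy rankings follows; the function name and document ids are illustrative. Ranks start at 1 here with the conventional constant `k=60`, which differs by one from the zero-based `rank + 60` used in the class above.

```python
def reciprocal_rank_fusion(ranked_lists, weights=None, k=60):
    """Fuse ranked lists of doc ids; each doc scores weight / (k + rank), rank from 1."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # toy dense-retriever ranking
sparse = ["doc_c", "doc_a", "doc_d"]  # toy BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse], weights=[0.7, 0.3])
print(fused)
```

Because only ranks are used, the dense and sparse scores never need to be normalized against each other.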

### Advanced RAG Techniques

```python
# Multi-stage retrieval with reranking
class MultiStageRAG:
    def __init__(self, retriever, reranker, generator):
        self.retriever = retriever
        self.reranker = reranker  # Cross-encoder for precise ranking
        self.generator = generator

    def query(self, question, k_retrieve=50, k_rerank=5):
        # Stage 1: Fast retrieval (bi-encoder)
        candidates = self.retriever.retrieve(question, k=k_retrieve)

        # Stage 2: Precise reranking (cross-encoder)
        reranked = self.reranker.rerank(question, candidates, k=k_rerank)

        # Stage 3: Generate with retrieved context
        context = "\n\n".join([doc.text for doc in reranked])
        prompt = f"""Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"""

        return self.generator.generate(prompt)

# Query decomposition for complex questions
def decompose_query(question, llm):
    """Break complex questions into sub-questions."""
    prompt = f"""Break this question into simpler sub-questions:
    Question: {question}

    Sub-questions:"""

    sub_questions = llm.generate(prompt).split('\n')
    return [q.strip() for q in sub_questions if q.strip()]
```

---

## Part 4: Evaluation & Benchmarking

### Comprehensive Evaluation Framework

```python
class LLMEvaluator:
    def __init__(self, model):
        self.model = model
        self.metrics = {}

    def evaluate_generation_quality(self, test_set):
        """Quantitative metrics for text generation."""
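        results = {
            "perplexity": [],
            "bleu": [],
            "rouge_l": [],
            "bertscore": []  # Not computed in this sketch
        }
        # Assumes rouge_scorer is the module from the rouge_score package,
        # where score() lives on a RougeScorer instance
        scorer = rouge_scorer.RougeScorer(["rougeL"])

        for example in test_set:
            output = self.model.generate(example["input"])

            # Perplexity
            results["perplexity"].append(
                self.calculate_perplexity(example["input"], output)
            )

            # BLEU for translation-like tasks
            # (assumes NLTK's sentence_bleu, which expects token lists, not raw strings)
            results["bleu"].append(
                sentence_bleu([example["reference"].split()], output.split())
            )

            # ROUGE-L for summarization
            results["rouge_l"].append(
                scorer.score(example["reference"], output)["rougeL"].fmeasure
            )

        # Skip metrics with no values so np.mean never sees an empty list
        return {k: np.mean(v) for k, v in results.items() if v}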

    def evaluate_with_llm_judge(self, test_set, judge_model):
        """LLM-as-a-judge evaluation."""
        judge_prompt = """Rate the following response on a scale of 1-5:
        Question: {question}
        Response: {response}

        Criteria:
        - Accuracy: Is the information correct?
        - Completeness: Does it fully answer the question?
        - Clarity: Is it well-written and clear?

        Provide scores as JSON: {{"accuracy": X, "completeness": X, "clarity": X}}
        """

        scores = []
        for example in test_set:
            response = self.model.generate(example["input"])
            judgment = judge_model.generate(
                judge_prompt.format(question=example["input"], response=response)
            )
            scores.append(json.loads(judgment))

        return aggregate_scores(scores)
```
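The `aggregate_scores` helper is not defined in the class above. One plausible reading is a per-criterion average, sketched below under that assumption; the original may have aggregated differently.

```python
from statistics import mean


def aggregate_scores(scores):
    """Average each criterion across a list of per-example judge scores."""
    if not scores:
        return {}
    criteria = scores[0].keys()
    return {name: mean(s[name] for s in scores) for name in criteria}


judgments = [
    {"accuracy": 4, "completeness": 3, "clarity": 5},
    {"accuracy": 5, "completeness": 4, "clarity": 4},
]
summary = aggregate_scores(judgments)
print(summary)
```

In practice it is also worth guarding the `json.loads` call, since a judge model can wrap its JSON in extra text.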

### Key Benchmarks to Consider

| Benchmark | Purpose | When to Use |
|-----------|---------|-------------|
| MMLU | General knowledge | Broad capability assessment |
| HumanEval | Code generation | Coding applications |
| MTEB | Embedding quality | RAG systems |
| IFEval | Instruction following | Chat/assistant models |
| TruthfulQA | Factual accuracy | High-stakes applications |

---

## Part 5: Managing Hallucinations

### Root Causes

1. **Self-Delusion**: Models can't distinguish their generations from input
2. **Knowledge Mismatch**: Fine-tuning data conflicts with pre-training knowledge
3. **Structural Limitations**: Probabilistic generation without grounding

### Technical Mitigations

```python
class HallucinationMitigation:
    def __init__(self, model, retriever, fact_checker):
        self.model = model
        self.retriever = retriever
        self.fact_checker = fact_checker

    def generate_with_grounding(self, query):
        # Retrieve relevant facts first
        facts = self.retriever.retrieve(query, k=5)
        fact_context = "\n".join([f"- {fact}" for fact in facts])

        prompt = f"""Based ONLY on these facts:
{fact_context}

Answer the question. If the facts don't contain the answer, say "I don't have enough information."

Question: {query}
Answer:"""

        response = self.model.generate(prompt)
        return response

    def verify_response(self, query, response):
        """Post-generation fact checking."""
        # Extract claims from response
        claims = self.extract_claims(response)

        verified_claims = []
        for claim in claims:
            # Check against retrieval
            evidence = self.retriever.retrieve(claim, k=3)
            is_supported = self.fact_checker.verify(claim, evidence)
            verified_claims.append({
                "claim": claim,
                "supported": is_supported,
                "evidence": evidence
            })

        return verified_claims

    def self_consistency_check(self, query, n_samples=5):
        """Generate multiple responses and check consistency."""
        responses = [self.model.generate(query, temperature=0.7)
                     for _ in range(n_samples)]

        # Extract key facts from each response
        fact_sets = [self.extract_facts(r) for r in responses]

        # Find consistent facts (appear in majority)
        all_facts = {}
        for facts in fact_sets:
            for fact in set(facts):  # count each fact at most once per response
                all_facts[fact] = all_facts.get(fact, 0) + 1

        consistent = [f for f, count in all_facts.items()
                      if count >= n_samples // 2 + 1]

        return consistent
```

### Detection Methods

```python
import torch

def attention_based_hallucination_detection(model, tokenizer, prompt, response):
    """
    Lookback ratio: Do generated tokens attend to input or just previous generations?
    Low lookback ratio suggests hallucination.

    Assumes a Hugging Face-style causal LM and tokenizer (an assumption of this sketch).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    with torch.no_grad():
        outputs = model(input_ids, output_attentions=True)
    last_layer = outputs.attentions[-1][0]  # Last layer: (heads, seq, seq)

    prompt_length = prompt_ids.shape[1]
    response_length = response_ids.shape[1]

    # For each response token, calculate attention to prompt vs. response
    lookback_ratios = []
    for i in range(response_length):
        response_position = prompt_length + i
        attn = last_layer[:, response_position, :]

        prompt_attention = attn[:, :prompt_length].sum()
        total_attention = attn[:, :response_position].sum()

        lookback_ratios.append((prompt_attention / total_attention).item())

    return sum(lookback_ratios) / len(lookback_ratios)  # Low = likely hallucination
```

---

## Part 6: Advanced Architecture Considerations

### KV Cache Optimization

```python
def calculate_kv_cache_size(
    num_layers,
    hidden_size,
    num_heads,
    sequence_length,
    batch_size,
    precision_bytes=2  # fp16
):
    """
    KV cache grows linearly with sequence length.
    This is often the memory bottleneck for long contexts.
    """
    head_dim = hidden_size // num_heads

    # K and V for each layer, each head
    kv_per_token = 2 * num_layers * num_heads * head_dim * precision_bytes

    total_bytes = batch_size * sequence_length * kv_per_token
    return total_bytes / (1024 ** 3)  # GB

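# Example: LLaMA 70B-sized model (80 layers, hidden size 8192, 64 heads) with 32K context
# calculate_kv_cache_size(80, 8192, 64, 32768, 1) → 80.0 (GiB) just for KV cache!
# This assumes every attention head stores its own K and V; models that use
# grouped-query attention keep fewer KV heads and need proportionally less.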
```
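
Many recent models use grouped-query attention (GQA), where several query heads share one key/value head, so the cache scales with the number of KV heads rather than the number of query heads. The sketch below is a variant of the calculation above under that assumption; the function name `kv_cache_size_gib` and the `num_kv_heads` parameter are introduced here for illustration and are not part of the original function.

```python
def kv_cache_size_gib(num_layers, hidden_size, num_heads, num_kv_heads,
                      sequence_length, batch_size, precision_bytes=2):
    """Estimate KV cache size in GiB when only num_kv_heads heads store K and V."""
    head_dim = hidden_size // num_heads

    # K and V (factor 2) per layer, per KV head, per token
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * precision_bytes

    return batch_size * sequence_length * kv_per_token / (1024 ** 3)


# Same shape as the example above, first with one KV head per query head,
# then with 8 shared KV heads (an illustrative GQA configuration)
mha = kv_cache_size_gib(80, 8192, 64, 64, 32768, 1)
gqa = kv_cache_size_gib(80, 8192, 64, 8, 32768, 1)
print(mha, gqa)  # 80.0 10.0
```

Cutting the KV heads from 64 to 8 shrinks the cache by the same factor of 8, which is why the attention layout matters as much as context length when sizing memory.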

### Reasoning Enhancement

```python
class ChainOfThoughtGenerator:
    """Implement reasoning patterns."""

    def __init__(self, model):
        self.model = model

    def generate_with_cot(self, question, **generate_kwargs):
        prompt = f"""{question}

Let's think through this step by step:
1."""
        return self.model.generate(prompt, **generate_kwargs)

    def self_consistency(self, question, n_paths=5, temperature=0.7):
        """Generate multiple reasoning paths, vote on answer."""
        answers = []
        for _ in range(n_paths):
            # Sample with temperature > 0 so the reasoning paths actually differ
            response = self.generate_with_cot(question, temperature=temperature)
            answer = self.extract_final_answer(response)
            answers.append(answer)

        # Majority vote
        from collections import Counter
        return Counter(answers).most_common(1)[0][0]

    def tree_of_thought(self, question, breadth=3, depth=3):
        """Explore multiple reasoning branches."""
        def expand_node(state, depth_remaining):
            if depth_remaining == 0:
                # Leaf: return (score, path) so callers can unpack both values
                return self.evaluate_state(state), state

            # Generate multiple next steps
            next_steps = self.generate_next_steps(state, n=breadth)

            # Recursively explore each branch
            best_score = float('-inf')
            best_path = None

            for step in next_steps:
                new_state = state + "\n" + step
                score, path = expand_node(new_state, depth_remaining - 1)

                if score > best_score:
                    best_score = score
                    best_path = path

            return best_score, best_path

        initial_state = f"Question: {question}\nReasoning:"
        _, best_reasoning = expand_node(initial_state, depth)
        return best_reasoning
```

---

## Conclusion: The Integrated Approach

The most effective LLM systems recognize that training, deployment, and evaluation aren't separate phases—they're interconnected:

- **Training choices** directly impact deployment options (model size, quantization compatibility)
- **Deployment constraints** should inform training decisions (target hardware, latency requirements)
- **Evaluation insights** guide both training data curation and architecture decisions

### Recommended Workflow

1. **Start with evaluation**: Define success metrics before training
2. **Prototype with prompting**: Validate the approach before fine-tuning
3. **Fine-tune incrementally**: Start with LoRA, move to full fine-tuning if needed
4. **Deploy with monitoring**: Track hallucination rates, latency, user feedback
5. **Iterate based on data**: Use production insights to improve training data

The most impactful improvements often come from recognizing system-wide relationships rather than optimizing components in isolation.

---

*What challenges have you faced deploying LLMs in production? I'd love to hear about your experiences and discuss solutions.*


---


### Building Resilient Systems: Lessons Learned

- **Published:** 2024-02-20 (updated 2026-01-24)
- **URL:** https://vikasmishra.ai/blog/building-resilient-systems/
- **Markdown source:** https://vikasmishra.ai/blog/building-resilient-systems/index.md
- **Tags:** Architecture, Reliability, Best Practices

> Key principles and patterns for building systems that gracefully handle failures and continue operating under stress.


After years of building and operating production systems, I've learned that failures are inevitable. What matters is how your system responds to them. Here are the key principles I follow when designing resilient systems.

## Expect Failures

The first step to building resilient systems is accepting that things will fail. Networks will partition, services will crash, and databases will become unavailable. Design for these scenarios from the start.

## Key Patterns

### Circuit Breaker

The circuit breaker pattern prevents cascading failures by detecting when a service is unhealthy and temporarily stopping requests to it:

```python
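class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5):
        self.failures = 0
        self.threshold = failure_threshold
        self.state = "closed"  # this minimal version stays open once tripped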

    def call(self, func):
        if self.state == "open":
            raise CircuitOpenError()

        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
            raise
```
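
To stop requests only temporarily, a breaker also needs a way back to the closed state. A common approach is a "half-open" state: after a cooldown, one trial call is allowed through, and its outcome decides whether the circuit closes again or reopens. The sketch below illustrates that idea; `RecoveringCircuitBreaker`, `reset_timeout`, and the injectable `clock` are assumptions made for this example, not part of the class above.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class RecoveringCircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError()
            self.state = "half_open"  # cooldown elapsed: allow one trial call

        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit
            if self.state == "half_open" or self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise

        # Any success resets the breaker
        self.failures = 0
        self.state = "closed"
        return result
```

Using a monotonic clock avoids surprises when the wall clock is adjusted, and passing the clock in keeps the cooldown logic testable without sleeping.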

### Retry with Exponential Backoff

When temporary failures occur, retrying with increasing delays often succeeds:

```python
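import time

class TransientError(Exception):
    """Raised for failures that are worth retrying."""

class MaxRetriesExceeded(Exception):
    """Raised when every attempt has failed."""

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt < max_retries - 1:  # no point sleeping after the last attempt
                time.sleep(2 ** attempt)   # 1s, 2s, 4s, ...
    raise MaxRetriesExceeded()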
```
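
Fixed exponential delays can make many clients retry in lockstep after a shared outage. Adding random jitter spreads those retries out. The following is a minimal sketch of "full jitter" backoff, assuming `ConnectionError` and `TimeoutError` are the retryable failures; `backoff_delays` and `retry_with_jitter` are hypothetical names, and unlike the function above, `max_retries` here counts retries after the first attempt.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=30.0, rng=random.random):
    """Yield one random delay per retry, uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def retry_with_jitter(func, retryable=(ConnectionError, TimeoutError),
                      max_retries=3, sleep=time.sleep):
    """Call func, retrying on retryable errors; re-raise the last error when out of retries."""
    delays = backoff_delays(max_retries)
    while True:
        try:
            return func()
        except retryable:
            delay = next(delays, None)
            if delay is None:  # retries exhausted: surface the original error
                raise
            sleep(delay)
```

Re-raising the original exception, rather than a generic "retries exceeded" error, keeps the root cause visible to callers and in logs.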

### Timeouts

Always set timeouts for external calls. A missing timeout can cause your entire system to hang:

```python
response = requests.get(url, timeout=5)  # 5 second timeout
```
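
Note that `timeout` in `requests` bounds the connection attempt and each wait for data from the server, not the total duration of the call. When a hard overall deadline is needed, one option is to run the call in a worker thread and stop waiting after a fixed time. This is a minimal sketch using only the standard library; `call_with_deadline` is a hypothetical helper, and the worker thread is abandoned rather than cancelled, so the underlying call should still carry its own timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as DeadlineExceeded


def call_with_deadline(func, seconds):
    """Run func in a worker thread and wait at most `seconds` for its result."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=seconds)  # raises DeadlineExceeded on timeout
    finally:
        # Do not block on the worker; a timed-out call keeps running in the background
        pool.shutdown(wait=False)


def slow_dependency():
    time.sleep(0.3)  # stand-in for a slow external call
    return "done"


print(call_with_deadline(lambda: "fast", seconds=1.0))
try:
    call_with_deadline(slow_dependency, seconds=0.05)
except DeadlineExceeded:
    print("deadline exceeded")
```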

## Observability

You can't fix what you can't see. Invest in:

- **Metrics**: Track error rates, latencies, and throughput
- **Logging**: Structured logs with correlation IDs
- **Tracing**: Distributed tracing across services
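
As one illustration of structured logs with correlation IDs, the sketch below emits JSON log lines and attaches a per-request ID held in a `contextvars.ContextVar`. It uses only the standard library; the field names and the helpers `JsonFormatter`, `build_logger`, and `handle_request` are assumptions for this example, not a prescribed schema.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream, name="app"):
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID if one was passed along, otherwise mint a new one
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        correlation_id.reset(token)


buffer = io.StringIO()
handle_request(build_logger(buffer), incoming_id="req-123")
print(buffer.getvalue())
```

Because every line carries the same `correlation_id`, logs from different services can be joined on that field when the ID is forwarded with each downstream call.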

## Testing Resilience

Use chaos engineering to verify your systems handle failures:

1. Start with game days in staging
2. Gradually introduce failure injection in production
3. Build runbooks for common failure scenarios
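
Failure injection can start as small as a wrapper that makes a dependency fail some fraction of the time. The sketch below assumes a hypothetical `inject_failures` decorator and `InjectedFailure` exception; dedicated chaos tooling usually works at the infrastructure level, and this only illustrates the idea in application code.

```python
import functools
import random


class InjectedFailure(Exception):
    """Marks an error that was raised on purpose by the fault injector."""


def inject_failures(rate, rng=random.random):
    """Decorator: raise InjectedFailure on roughly `rate` of calls (0.0 to 1.0)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failures(rate=0.2)
def fetch_profile(user_id):
    return {"id": user_id}  # stand-in for a real downstream call
```

Starting with a low rate in staging and watching how retries, timeouts, and alerts behave gives a cheap first check before moving to heavier tooling.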

## Conclusion

Building resilient systems requires upfront investment, but it pays dividends when production incidents occur. Start with these patterns and iterate based on your system's specific needs.


---


### Getting Started with Go: A Practical Guide

- **Published:** 2024-01-15 (updated 2026-01-24)
- **URL:** https://vikasmishra.ai/blog/getting-started-with-go/
- **Markdown source:** https://vikasmishra.ai/blog/getting-started-with-go/index.md
- **Tags:** Go, Programming, Tutorial

> A beginner-friendly introduction to the Go programming language with practical examples and best practices.


Go (or Golang) has become one of the most popular programming languages for building scalable backend services. In this post, I'll share my experience learning Go and provide practical tips for getting started.

## Why Go?

Go offers several advantages that make it appealing for modern software development:

- **Simplicity**: Go has a small, easy-to-learn syntax
- **Performance**: Compiled to native code, Go programs are fast
- **Concurrency**: Built-in support for concurrent programming with goroutines
- **Tooling**: Excellent tooling including formatting, testing, and documentation

## Setting Up Your Environment

First, download and install Go from the [official website](https://golang.org/dl/). Once installed, verify your installation:

```bash
go version
```

## Your First Go Program

Create a file named `main.go`:

```go
package main

import "fmt"

func main() {
    fmt.Println("Hello, World!")
}
```

Run it with:

```bash
go run main.go
```

## Key Concepts

### Variables and Types

Go is statically typed, but the compiler can infer types:

```go
// Explicit type declaration
var name string = "Vikas"

// Type inference (short declaration with :=, only valid inside a function)
age := 25
```

### Functions

Functions in Go are straightforward:

```go
func add(a, b int) int {
    return a + b
}
```

### Goroutines

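Go's killer feature is its lightweight concurrency model. Prefixing a function call with `go` runs it in a new goroutine. The program exits as soon as `main` returns, so real code waits for its goroutines with a `sync.WaitGroup` or a channel: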

```go
go func() {
    fmt.Println("Running in a goroutine")
}()
```

## Next Steps

Once you're comfortable with the basics, explore:

1. The standard library
2. Error handling patterns
3. Interfaces and structs
4. Testing with `go test`

Happy coding!


---


## Projects


### AI Coding Playbook

- **URL:** https://vikasmishra.ai/projects/ai-coding-playbook/
> An open-source, MIT-licensed book and reference kit for agentic coding with Claude Code and Gemini CLI — 21 chapters, runnable slash commands, skills, hook scripts, and end-to-end workflows.


## Overview

The AI Coding Playbook is a free, open-source guide to working with agentic coding tools on real codebases. It was written from months of running Claude Code and Gemini CLI against production repos, and it captures the handful of ideas and patterns that survived that work. The book takes a concepts-first, tool-specific-second approach so the material stays useful as the vendor tools evolve.

## What's inside

- **21 chapters** covering context economics, briefing vs. prompting, memory hierarchies, slash commands, skills, MCP servers, hooks, subagents, research workflows, test generation, UI generation, local models, and agent SDKs.
- **Drop-in assets**: 4 slash commands and 2 skills usable across both Claude Code and Gemini CLI, 5 hook scripts for policy enforcement and automation, configuration templates for memory files and MCP setups, and starter `~/.claude/` and `~/.gemini/` structures you can copy directly.
- **Six end-to-end playbooks**: triaging vague bug reports without jumping to a fix, multi-file migrations with subagent fan-out gated by pattern approval, feature development from spec to shipped with plan-before-edit, read-only on-call triage, research-driven decisions producing committable ADRs, and 90-second PR triage across large queues.

## Who it's for

Solo developers shipping SaaS products, engineering teams rolling out AI coding practices across a codebase, and anyone who wants to understand agentic coding fundamentals beyond marketing demos.

## Why it exists

Most agentic-coding material in 2026 is either vendor-specific demo content or surface-level listicles. The playbook focuses on the disciplines that actually make agents useful on real codebases — context discipline, briefing structure, the four primitives (slash commands, skills, MCP, hooks), and the failure modes that show up only after a few weeks of daily use.

## License

MIT. Fork it, remix it, build a course on top of it, translate it — just don't claim you wrote it.


---


### EPUB2PDF Print Perfect

- **URL:** https://vikasmishra.ai/projects/epub2pdf-print-perfect/
> A Python tool that converts EPUB eBooks into beautiful, print-ready PDFs with preserved typography, embedded images, auto-generated tables of contents, and clean chapter breaks.


## Overview

Most EPUB-to-PDF converters produce PDFs with broken layouts, missing images, ugly line spacing, and inconsistent typography. EPUB2PDF Print Perfect was built to fix that. The tool takes an EPUB file and produces a PDF that actually looks like a book — typeset cleanly, images embedded and correctly sized, a table of contents that reflects the source, and page numbering that makes sense.

## Features

- **Clean chapter separation** with automatic page breaks so chapters don't run together on one page.
- **Image handling** that embeds and resizes artwork to fit the page while preserving aspect ratio.
- **Auto-generated table of contents** from the EPUB structure, hyperlinked to the corresponding sections.
- **Professional typography**: justified text, readable line height, proper margins, embedded fonts.
- **CSS sanitization** that strips broken or hostile stylesheets while preserving useful inline styling.
- **Robust parsing** that handles malformed EPUBs, encoding issues, oversized images, and broken internal links gracefully instead of crashing.
- **Page numbering** hidden on the first page, visible from the first chapter onward.
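
The aspect-ratio-preserving resize mentioned above comes down to a single scale factor. The sketch below shows that arithmetic only; `fit_within` is a hypothetical helper written for illustration, not the tool's actual code, and it assumes images are never scaled up.

```python
def fit_within(width, height, max_width, max_height):
    """Return (w, h) scaled down to fit the box while keeping the aspect ratio."""
    # The tighter of the two constraints wins; 1.0 prevents upscaling
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(3000, 2000, 1500, 2000))  # (1500, 1000)
print(fit_within(800, 600, 1500, 2000))    # (800, 600)
```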

## Performance

Converts a typical 300-page book in roughly five seconds on a modern laptop, including full image processing and font embedding.

## Tech stack

Python 3.10+, using `ebooklib` for EPUB parsing, `BeautifulSoup4` for HTML cleaning, `WeasyPrint` for PDF rendering, `Pillow` for image processing, and `tqdm` for progress reporting.

## Why it exists

I wanted to read a few EPUB-only books on paper, and every existing converter produced something unusable. The tool started as a weekend script and stabilized into a proper utility after it turned out a lot of other people had the same problem.


---


### Repo Inspector

- **URL:** https://vikasmishra.ai/projects/repo-inspector/
> A CLI tool that uses AI to analyze codebases and explain what they do, how they're structured, and how data flows through them.


## Overview

Repo Inspector automatically examines source code repositories and generates comprehensive documentation explaining the project's purpose, technical architecture, key directories, entry points, and data flow patterns.

## Features

- **Dual explanation levels**: Beginner mode for onboarding and senior mode for technical reviews
- **Multiple output formats**: Text or Markdown documentation
- **Architecture visualization**: Optional Mermaid diagrams
- **Intelligent code scanning**: Automatically excludes build artifacts, dependencies, and version control files
- **Context-aware chunking**: Processes large codebases by intelligently breaking code into analyzable pieces
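
Excluding build artifacts and dependencies during a scan can be sketched with `os.walk`, pruning unwanted directories before descending into them. The exclusion and extension sets below are illustrative assumptions, not Repo Inspector's actual lists, and `iter_source_files` is a hypothetical name.

```python
import os

# Illustrative defaults; a real tool would likely make these configurable
EXCLUDED_DIRS = {".git", "node_modules", "dist", "build", "__pycache__", ".venv"}
SOURCE_EXTENSIONS = {".py", ".go", ".js", ".ts", ".java", ".rs"}


def iter_source_files(root):
    """Yield source file paths relative to root, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            if os.path.splitext(name)[1] in SOURCE_EXTENSIONS:
                yield os.path.relpath(os.path.join(dirpath, name), root)
```

Pruning `dirnames` in place matters for large repositories: directories such as `node_modules` are skipped entirely rather than walked and filtered afterwards.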

## Supported Languages

JavaScript, TypeScript, Python, Go, Java, Rust, C++, Swift, Kotlin, and configuration formats (YAML, JSON, TOML).

## Installation

```bash
pip install repo-inspector
```

## Usage

```bash
# Analyze a repository with beginner-friendly explanations
repo-inspector /path/to/repo --level beginner

# Generate senior-level technical documentation
repo-inspector /path/to/repo --level senior

# Output as Markdown with architecture diagram
repo-inspector /path/to/repo --format markdown --diagram
```


---


### Explain Error

- **URL:** https://vikasmishra.ai/projects/explain-error/
> A CLI tool that transforms cryptic error messages and stack traces into clear, actionable explanations using AI-powered analysis.


## Overview

Explain Error accepts error logs from files or stdin, detects the programming language, parses stack traces, and uses AI to generate human-readable explanations with suggested fixes.
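
As a rough illustration of the stack-trace parsing step, the sketch below extracts frames from a standard Python traceback with a regular expression. It is not the tool's implementation; `FRAME_RE` and `parse_python_traceback` are hypothetical names, and only the default CPython traceback format is handled.

```python
import re

# Matches lines such as:   File "app.py", line 10, in main
FRAME_RE = re.compile(r'^\s*File "(?P<file>[^"]+)", line (?P<line>\d+), in (?P<func>.+)$')


def parse_python_traceback(text):
    """Return the stack frames and the final error line of a Python traceback."""
    lines = text.strip().splitlines()
    frames = []
    for line in lines:
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "file": match["file"],
                "line": int(match["line"]),
                "function": match["func"],
            })
    error = lines[-1] if lines else ""
    return {"frames": frames, "error": error}


sample = '''Traceback (most recent call last):
  File "app.py", line 10, in <module>
    main()
  File "app.py", line 6, in main
    return 1 / 0
ZeroDivisionError: division by zero
'''
parsed = parse_python_traceback(sample)
print(parsed["error"])         # ZeroDivisionError: division by zero
print(parsed["frames"][-1])    # innermost frame, where the error was raised
```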

## Features

- **Language-agnostic support**: Python, JavaScript, Java, Go, Rust, Ruby, PHP
- **Cloud-native capabilities**: Docker, Kubernetes, AWS/GCP/Azure logs
- **Zero configuration**: Works out of the box
- **Privacy-focused**: Local processing except LLM calls
- **Actionable remediation**: Provides concrete steps to fix issues
- **Multiple output formats**: Text or Markdown
- **Pipe-compatible**: Chain with other CLI tools

## Installation

```bash
pip install explain-error
```

## Usage

```bash
# Explain an error from a file
explain-error error.log

# Pipe errors directly
python script.py 2>&1 | explain-error

# Output as Markdown
explain-error error.log --format markdown
```


---


### CLI Task Manager

- **URL:** https://vikasmishra.ai/projects/cli-task-manager/
> A fast, minimalist command-line task manager built with Go.


## Overview

A command-line task manager designed for developers who prefer staying in the terminal. Built with Go for speed and simplicity.

## Features

- **Fast**: Starts in milliseconds, even with thousands of tasks
- **Simple**: Intuitive commands that follow Unix conventions
- **Portable**: Single binary, works offline, stores data locally
- **Scriptable**: JSON output for integration with other tools

## Installation

```bash
go install github.com/vikasmishra/task-cli@latest
```

## Usage

```bash
# Add a task
task add "Review pull requests"

# List tasks
task list

# Complete a task
task done 1

# View tasks due today
task list --due today
```

## Technical Details

The project uses:

- **Cobra** for CLI argument parsing
- **SQLite** for local storage
- **Charm libraries** for terminal UI components

## Lessons Learned

Building this project taught me a lot about:

1. Designing intuitive CLI interfaces
2. Managing local state efficiently
3. Writing comprehensive tests for CLI tools


---


### Lightweight API Gateway

- **URL:** https://vikasmishra.ai/projects/api-gateway/
> A high-performance API gateway with rate limiting, authentication, and request routing.


## Overview

A lightweight, high-performance API gateway designed for microservices architectures. Handles authentication, rate limiting, and request routing with minimal latency overhead.

## Features

- **Rate Limiting**: Token bucket algorithm with Redis backend
- **Authentication**: JWT validation and API key management
- **Routing**: Path-based and header-based routing rules
- **Observability**: Prometheus metrics and structured logging
- **Configuration**: Hot-reload configuration without restart
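
The token bucket algorithm behind the rate limiter can be sketched in a few lines. The version below keeps state in process memory purely for illustration, whereas the gateway stores it in Redis; the `TokenBucket` class and its parameters are assumptions for this example, not the gateway's code.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity                    # maximum burst size
        self.refill_per_second = refill_per_second  # sustained request rate
        self.clock = clock                          # injectable for testing
        self.tokens = float(capacity)
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill for the time that has passed, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity sets how large a burst is tolerated, while the refill rate sets the long-run limit; a shared store is needed once more than one gateway instance enforces the same limit.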

## Architecture

```
┌─────────────┐     ┌──────────────┐     ┌────────────┐
│   Client    │────▶│  API Gateway │────▶│  Services  │
└─────────────┘     └──────────────┘     └────────────┘
                           │
                    ┌──────┴──────┐
                    │    Redis    │
                    └─────────────┘
```

## Performance

Benchmarked at 50,000+ requests per second on modest hardware with:

- P99 latency under 5 ms
- Memory usage under 100 MB

## Configuration Example

```yaml
routes:
  - path: /api/users/*
    upstream: http://users-service:8080
    rate_limit: 100/minute

  - path: /api/orders/*
    upstream: http://orders-service:8080
    auth: jwt
```

## Deployment

Available as a Docker image:

```bash
docker run -p 8080:8080 vikasmishra/api-gateway
```


---


## Reference

- Website: https://vikasmishra.ai/
- Blog index: https://vikasmishra.ai/blog/
- Spec-format AI directory: https://vikasmishra.ai/llms.txt
- RSS: https://vikasmishra.ai/blog/index.xml
- Sitemap: https://vikasmishra.ai/sitemap.xml