Chapter 8 of 11

Chapter 6 - Modern AI Systems: RAG, Agents, and Glue Code

The Crux

Models alone are useless. Real AI systems are models + data pipelines + retrieval + guardrails + monitoring + glue code. This chapter is about engineering AI into production, not just training models.

Why Models Alone Are Useless

You've trained a great model. Congratulations. Now what?

Reality:

  • The model needs to integrate with existing systems (databases, APIs, user interfaces)
  • Users don't send perfectly formatted inputs
  • The model's accuracy degrades as the world changes and incoming data drifts away from the training data
  • You need to monitor failures, log predictions, and retrain periodically
  • You need to handle errors gracefully (what if the API is down?)
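As a rough illustration of the last point, here is a minimal sketch of a wrapper that retries a failing model call and then falls back to a canned response. The names `call_with_fallback`, `predict`, and `FALLBACK` are hypothetical, and the retry count and delays are arbitrary assumptions rather than recommended values.

```python
import logging
import time

# Hypothetical canned reply used when the model cannot be reached.
FALLBACK = "Sorry, the service is unavailable right now. Please try again later."


def call_with_fallback(predict, text, retries=3, base_delay=0.5):
    """Call predict(text), retrying on connection errors, then fall back.

    `predict` is any callable that takes a string and returns a string,
    for example a thin client around a model API (an assumption here).
    """
    cleaned = text.strip()
    if not cleaned:
        # Users do not always send well-formed input; skip the model call.
        return FALLBACK
    for attempt in range(retries):
        try:
            return predict(cleaned)
        except (ConnectionError, TimeoutError) as exc:
            logging.warning("model call failed (attempt %d): %r", attempt + 1, exc)
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
    return FALLBACK
```

A caller would pass its own model client as `predict`; everything else in the application then sees either a real answer or the fallback string, never an unhandled exception from a network outage.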

As a rule of thumb, the model is about 10% of the system. The other 90% is infrastructure.

RAG: Retrieval-Augmented Generation

LLMs hallucinate in part because they answer from memorized training data, which can be incomplete or out of date. What if we give them access to external knowledge?

The Idea

Instead of asking the LLM to answer directly:

  1. Retrieve relevant documents from a database
  2. Augment the prompt with retrieved information
  3. Generate the answer based on retrieved context

Example:

  • User: "What's the return policy?"
  • System retrieves: Company policy doc mentioning "30-day returns"
  • Prompt: "Based on this policy: [retrieved text], answer: What's the return policy?"
  • LLM: "We offer 30-day returns."
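The "augment" step in this example amounts to string assembly. A minimal sketch follows, assuming the retrieved passages arrive as a list of strings; `build_prompt` is a hypothetical helper, and the numbering of passages is an addition that is not part of the example above.

```python
def build_prompt(question, retrieved_docs):
    """Combine retrieved passages and the user's question into one prompt."""
    # Number each passage so the answer can point back to its source.
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return f"Based on this policy:\n{context}\n\nAnswer: {question}"


prompt = build_prompt(
    "What's the return policy?",
    ["We offer 30-day returns on all items."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM in place of the bare question.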

Why It Works

The LLM doesn't need to memorize every fact. It just needs to read the context and extract the answer, which is something LLMs are good at.

Architecture

  1. Document store: Database of knowledge (vector database, Elasticsearch, etc.)
  2. Embedding model: Convert queries and documents to vectors
  3. Retrieval: Find the top-k documents most similar to the query (e.g., by cosine similarity between vectors)
  4. LLM: Generate answer given query + retrieved docs
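The four components can be sketched end to end in a few lines. This is a toy under stated assumptions, not a production design: an in-memory list stands in for the document store, a bag-of-words count vector stands in for a real embedding model, and `llm` is any callable you supply. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# 1. Document store: an in-memory list standing in for a vector database.
DOCS = [
    "Return policy: we offer 30-day returns on all items.",
    "Shipping takes 3 to 5 business days.",
    "Our support team is available on weekdays.",
]


def embed(text):
    """2. Toy embedding: sparse word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """3. Retrieval: return the k documents most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def answer(query, documents, llm, k=2):
    """4. Generation: pass the query plus retrieved context to the LLM."""
    context = "\n".join(retrieve(query, documents, k))
    prompt = f"Based on this context:\n{context}\n\nAnswer: {query}"
    return llm(prompt)


top = retrieve("What is the return policy?", DOCS, k=1)
print(top[0])
```

In a real deployment, document vectors would be computed once and indexed rather than re-embedded on every query as this sketch does.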

When to Use RAG vs Fine-Tuning

RAG:

  • Knowledge changes frequently (e.g., product docs updated weekly)
  • You need to cite sources
  • You have limited GPU resources (RAG needs no training run)

Fine-tuning:

  • Knowledge is stable
  • You want the model to internalize a style or domain-specific reasoning
  • You have labeled data and compute

Often, you use both: fine-tune for style/domain, RAG for up-to-date facts.