Chapter 11 of 16

8. CAPSTONE PROJECT

Enterprise "Company Brain" - The Ultimate RAG + KG System

Project Overview

Build a complete enterprise knowledge management system that ingests company documents, builds a knowledge graph, and answers questions using hybrid RAG + KG retrieval.

(This is the final boss. This project integrates everything you've learned - chunking, embeddings, graph construction, hybrid retrieval, deployment, monitoring, cost optimization. It will take weeks, not days. It will break in frustrating ways. You will question your life choices. When it finally works, you'll have a portfolio piece that actually demonstrates competence, not just "followed a tutorial." That's worth the pain.)

System Requirements

Input Sources:

PDF documents (reports, papers, manuals)
Markdown files (wikis, docs)
CSV data (employee directory, project list)
Web pages (company blog, documentation)

Capabilities:

Document Ingestion: Async pipeline processing all formats
Knowledge Graph: Auto-build from all documents
Hybrid Search: Combine structured + unstructured retrieval
Query Interface: Natural language queries with explanations
Admin Dashboard: Monitor usage, costs, data sources

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     INGESTION LAYER                         │
│  PDF│MD│CSV│Web → Process → Extract → Split                │
└────────────────────┬────────────────────────────────────────┘
                     │
         ┌───────────┴──────────┐
         ↓                      ↓
┌────────────────┐    ┌────────────────────┐
│  VECTOR STORE  │    │ KNOWLEDGE GRAPH    │
│  (Pinecone/    │    │  (Neo4j)           │
│   Chroma)      │    │                    │
│                │    │  Entities          │
│  Embeddings    │    │  Relationships     │
│  Metadata      │    │  Properties        │
└────────┬───────┘    └──────┬─────────────┘
         │                   │
         └────────┬──────────┘
                  ↓
         ┌────────────────────┐
         │  HYBRID RETRIEVER  │
         │                    │
         │  - Query Router    │
         │  - Entity Linker   │
         │  - Context Fusion  │
         └────────┬───────────┘
                  ↓
         ┌────────────────────┐
         │   LLM GENERATION   │
         │   + REASONING      │
         └────────┬───────────┘
                  ↓
         ┌────────────────────┐
         │   API + FRONTEND   │
         │   (FastAPI+React)  │
         └────────────────────┘

Technical Specifications

Tech Stack:

Backend: Python 3.11, FastAPI
Vector DB: Pinecone or ChromaDB
Graph DB: Neo4j
LLM: GPT-4 (primary), GPT-3.5-turbo (fallback)
Frontend: React or Streamlit
Deployment: Docker + Docker Compose
Monitoring: Prometheus + Grafana (bonus)
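
The primary/fallback LLM pairing above can be sketched as a small wrapper. This is a minimal illustration, not a prescribed design: `generate_with_fallback`, `flaky_primary`, and `stub_fallback` are hypothetical names, and the two stub functions stand in for real API clients so the example runs without network access.

```python
from typing import Callable, Sequence

# Hypothetical type: any function that takes a prompt and returns text.
LLMCall = Callable[[str], str]


def generate_with_fallback(
    prompt: str, models: Sequence[tuple[str, LLMCall]]
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (model_name, answer)."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the client's specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Stand-ins for real API clients (assumptions for this sketch only).
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def stub_fallback(prompt: str) -> str:
    return f"answer to: {prompt}"


model_used, answer = generate_with_fallback(
    "Who owns the billing service?",
    [("gpt-4", flaky_primary), ("gpt-3.5-turbo", stub_fallback)],
)
```

Returning the model name alongside the answer makes it easy to log how often the fallback fires, which feeds directly into the cost numbers on the admin dashboard.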

Core Features (Must-Have):

Document upload (drag-and-drop)
Automatic KG construction
Natural language queries
Cited answers with source links
Reasoning explanation ("how I found this")
Query routing (auto-select KG vs RAG vs Hybrid)
Admin dashboard (stats, costs)

Advanced Features (Nice-to-Have):

Multi-user support with authentication
Document versioning
Query history and analytics
Custom entity types
Graph visualization
Export answers as reports

Implementation Steps

Phase 1: Data Ingestion (Week 1)

Build DocumentProcessor for all file types
Implement async processing queue
Add metadata extraction
Test with 50+ documents
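
One possible shape for the Phase 1 pipeline is a `DocumentProcessor` that dispatches on file extension, fed by an `asyncio.Queue` with a few workers. Everything beyond the class name is an assumption of this sketch: the `Document` fields, the extension map, and the placeholder parsing (a real build would call actual PDF, HTML, and CSV parsers where the comment indicates).

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    source: str
    kind: str
    text: str
    metadata: dict = field(default_factory=dict)


class DocumentProcessor:
    """Routes each file to a parser based on its extension."""

    KINDS = {".pdf": "pdf", ".md": "markdown", ".csv": "csv", ".html": "web"}

    def process(self, source: str, raw: str) -> Document:
        suffix = Path(source).suffix.lower()
        kind = self.KINDS.get(suffix)
        if kind is None:
            raise ValueError(f"unsupported file type: {suffix!r}")
        # Placeholder parsing: a real pipeline would call a format-specific parser here.
        text = raw.strip()
        return Document(source, kind, text, {"chars": len(text)})


async def worker(queue, processor, docs, failures):
    while True:
        source, raw = await queue.get()
        try:
            # Parsing is blocking work, so run it in a thread to keep the loop responsive.
            docs.append(await asyncio.to_thread(processor.process, source, raw))
        except ValueError:
            failures.append(source)
        finally:
            queue.task_done()


async def ingest(items, n_workers: int = 3):
    """Process (source, raw_text) pairs concurrently; return (documents, failed_sources)."""
    queue: asyncio.Queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    docs, failures = [], []
    processor = DocumentProcessor()
    workers = [
        asyncio.create_task(worker(queue, processor, docs, failures))
        for _ in range(n_workers)
    ]
    await queue.join()  # wait until every queued item has been marked done
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return docs, failures
```

Calling `asyncio.run(ingest([("handbook.md", "# Welcome"), ("staff.csv", "name,team")]))` returns the parsed documents plus a list of sources that failed, so one bad file does not stall the rest of the batch.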

Phase 2: Knowledge Graph Construction (Week 2)

Entity and relationship extraction
Entity linking and deduplication
Load into Neo4j
Build basic Cypher query interface
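
For the deduplication and loading steps, a rough sketch: collapse entity mentions that differ only in case, punctuation, or spacing, then build a parameterized Cypher `MERGE` for each surviving entity. The mention format (`name`, `type`), the `key` property, and the function names are assumptions made for illustration. String normalization is a deliberately simple stand-in for real entity linking, and the statements are only built here, not sent to Neo4j.

```python
import re


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace to form a match key."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def deduplicate(mentions: list[dict]) -> list[dict]:
    """Merge mentions sharing (type, normalized name); keep the first surface form."""
    merged: dict[tuple[str, str], dict] = {}
    for m in mentions:
        key = (m["type"], normalize(m["name"]))
        entity = merged.setdefault(
            key,
            {"key": key[1], "type": m["type"], "name": m["name"], "aliases": set()},
        )
        entity["aliases"].add(m["name"])
    return list(merged.values())


def merge_statement(entity: dict) -> tuple[str, dict]:
    """Build a parameterized MERGE so re-ingesting a document does not duplicate nodes."""
    label = entity["type"]
    # The label is interpolated into the query text, so validate it strictly.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unsafe label: {label!r}")
    query = f"MERGE (e:{label} {{key: $key}}) SET e.name = $name, e.aliases = $aliases"
    params = {
        "key": entity["key"],
        "name": entity["name"],
        "aliases": sorted(entity["aliases"]),
    }
    return query, params


entities = deduplicate(
    [
        {"name": "Acme Corp.", "type": "Organization"},
        {"name": "acme  corp", "type": "Organization"},
        {"name": "Ada Lovelace", "type": "Person"},
    ]
)
statements = [merge_statement(e) for e in entities]
```

Each `(query, params)` pair can then be handed to whatever Neo4j client you choose. Keeping values in parameters rather than in the query string avoids injection problems and lets the database reuse query plans.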

Phase 3: RAG System (Week 2-3)

Chunking strategy implementation
Embedding generation and storage
Hybrid retriever (BM25 + semantic)
Reranker integration
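
The brief does not say how to combine the BM25 and semantic result lists. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a required method. It needs only the rank positions from each retriever, so the two scoring scales never have to be reconciled. The document IDs are made up, and `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["d1", "d2", "d3"]      # hypothetical keyword-search ranking
semantic_hits = ["d3", "d1", "d4"]  # hypothetical embedding-search ranking
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever. The fused list is what you would pass to the reranker.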

Phase 4: Hybrid System (Week 3-4)

Query classification and routing
KG-augmented retrieval
Context fusion
Answer generation with citations
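
A deliberately naive sketch of routing and context fusion follows. The router is a keyword-and-entity-count heuristic invented for this example (a real system might use an LLM classifier instead), and the `[G1]` / `[D1]` citation markers are an assumed convention that lets the generated answer point back to graph facts and document chunks.

```python
from enum import Enum


class Route(Enum):
    KG = "kg"
    RAG = "rag"
    HYBRID = "hybrid"


# Hypothetical cue words for fact-lookup questions.
FACT_CUES = ("who ", "when ", "which ", "how many ")


def route_query(query: str, known_entities: set[str]) -> Route:
    """Pick a retrieval path from simple surface features of the query."""
    q = query.lower().strip() + " "
    n_entities = sum(1 for e in known_entities if e.lower() in q)
    if n_entities >= 2:
        return Route.HYBRID  # several linked entities suggests a multi-hop question
    if n_entities == 1 and q.startswith(FACT_CUES):
        return Route.KG      # single-entity fact lookup
    return Route.RAG         # open-ended or analytical question


def fuse_context(graph_facts: list[str], chunks: list[dict]) -> str:
    """Label every piece of evidence so the LLM can cite it by marker."""
    lines = [f"[G{i}] {fact}" for i, fact in enumerate(graph_facts, 1)]
    lines += [f"[D{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, 1)]
    return "\n".join(lines)


entities = {"Project Atlas", "Dana Kim"}  # made-up names for the example
routes = [
    route_query("Who manages Project Atlas?", entities),
    route_query("Summarize the Q3 security report", entities),
    route_query("Which projects led by Dana Kim depend on Project Atlas?", entities),
]
context = fuse_context(
    ["(Dana Kim)-[:MANAGES]->(Project Atlas)"],
    [{"source": "atlas-kickoff.pdf", "text": "Atlas launches in Q4."}],
)
```

The three sample queries route to KG, RAG, and Hybrid respectively. Because the markers map back to sources, the same structure can drive both the cited answer and the "how I found this" explanation.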

Phase 5: API & Frontend (Week 4-5)

FastAPI endpoints
Frontend (chat interface)
Admin dashboard
Authentication

Phase 6: Testing & Deployment (Week 5-6)

Unit tests (>80% coverage)
Integration tests
Load testing
Docker deployment
Documentation

Evaluation Benchmarks

Quantitative Metrics:

Accuracy: 85%+ on 100-question test set
Latency: p95 < 2 seconds
Concurrency: 50+ simultaneous users
Cost: < $0.10 per query
Uptime: 99.5%+
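
A small sketch of how the accuracy, latency, and cost targets could be checked from logged data. The nearest-rank percentile is one of several valid p95 definitions. The token counts and per-1K-token prices in the example are placeholders, not real vendor pricing, so substitute your provider's current rates.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """LLM cost in dollars for one query, given per-1K-token prices."""
    return (
        prompt_tokens / 1000 * price_in_per_1k
        + completion_tokens / 1000 * price_out_per_1k
    )


def meets_targets(
    latencies_s: list[float], correct: int, total: int, cost_usd: float
) -> dict[str, bool]:
    """Compare measurements against the accuracy, latency, and cost targets above."""
    return {
        "accuracy": correct / total >= 0.85,
        "latency_p95": percentile(latencies_s, 95) < 2.0,
        "cost": cost_usd < 0.10,
    }


latencies = [0.5] * 18 + [1.9, 3.0]              # 20 sample latencies in seconds
cost = cost_per_query(2000, 500, 0.03, 0.06)     # placeholder prices, not real rates
report = meets_targets(latencies, correct=87, total=100, cost_usd=cost)
```

With 20 samples, the p95 is the 19th smallest value (1.9 s here), so a single 3-second outlier does not fail the latency target, but two would.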

Qualitative Assessment:

Answer quality (human evaluation)
Citation accuracy (source verification)
Reasoning clarity (explanation quality)
User experience (UI/UX review)

Evaluation Rubric

Component	Weight	Criteria
Data Ingestion	15%	Handles all file types, metadata extraction, async processing
Knowledge Graph	20%	Entity/relation extraction quality, graph completeness, Cypher queries work
RAG System	20%	Retrieval quality, chunking strategy, embedding optimization
Hybrid Integration	25%	Query routing, context fusion, answer quality
Production Quality	20%	API design, testing, deployment, documentation, monitoring

Total: 100 points (each percentage point of weight is worth one point)

90-100: Exceptional - Production-ready, innovative features
80-89: Excellent - All core features working well
70-79: Good - Core features present, some rough edges
60-69: Adequate - Basic functionality works
<60: Needs improvement
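
As a worked example of the rubric arithmetic, the sketch below assumes each component is graded as a fraction between 0 and 1 of its weight. That grading scheme is an assumption, since the rubric does not say how partial credit is assigned. The weights and band thresholds are taken from the table and list above.

```python
WEIGHTS = {
    "Data Ingestion": 15,
    "Knowledge Graph": 20,
    "RAG System": 20,
    "Hybrid Integration": 25,
    "Production Quality": 20,
}

BANDS = [(90, "Exceptional"), (80, "Excellent"), (70, "Good"), (60, "Adequate")]


def total_score(fractions: dict[str, float]) -> float:
    """Weighted total out of 100, given the fraction earned per component."""
    if set(fractions) != set(WEIGHTS):
        raise ValueError("fractions must cover exactly the rubric components")
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


def band(score: float) -> str:
    """Map a total score to its rubric band."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "Needs improvement"


# Hypothetical self-assessment: strong ingestion, weaker production polish.
example = {
    "Data Ingestion": 0.9,
    "Knowledge Graph": 0.8,
    "RAG System": 0.85,
    "Hybrid Integration": 0.8,
    "Production Quality": 0.7,
}
score = total_score(example)  # 13.5 + 16 + 17 + 20 + 14
grade = band(score)
```

Because Hybrid Integration carries the largest weight, ten percent there moves the total by 2.5 points, against 1.5 for the same improvement in Data Ingestion.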

What Your Portfolio Demo Should Show

5-6 Minute Video Covering:

Intro (30s): Problem statement and solution overview
Data Ingestion (60s): Upload docs, show processing pipeline
Knowledge Graph (60s): Visualize graph, run Cypher query
Query Demo (90s):
- Factual query (KG-routed)
- Analytical query (RAG-routed)
- Multi-hop query (Hybrid)
Advanced Features (60s): Citations, reasoning, admin dashboard
Technical Deep-Dive (30s): Architecture diagram, tech stack

How This Signals Hire-Readiness

What Employers See:

✅ Full-stack skills: Backend + Frontend + DevOps
✅ AI/ML expertise: LLMs, embeddings, vector DBs
✅ Data engineering: Pipelines, async processing
✅ Production thinking: Testing, monitoring, deployment
✅ Problem-solving: Complex system design
✅ Communication: Clear documentation and demo

Conversation Starters in Interviews:

"Tell me about your approach to query routing"
"How did you optimize for cost and latency?"
"What were the biggest challenges in building this?"
"How would you scale this to 10M documents?"
