>_TheQuery

Chapter 11 of 16

8. CAPSTONE PROJECT

Enterprise "Company Brain" - The Ultimate RAG + KG System

Project Overview

Build a complete enterprise knowledge management system that ingests company documents, builds a knowledge graph, and answers questions using hybrid RAG + KG retrieval.

(This is the final boss. This project integrates everything you've learned - chunking, embeddings, graph construction, hybrid retrieval, deployment, monitoring, cost optimization. It will take weeks, not days. It will break in frustrating ways. You will question your life choices. When it finally works, you'll have a portfolio piece that actually demonstrates competence, not just "followed a tutorial." That's worth the pain.)

System Requirements

Input Sources:

  • PDF documents (reports, papers, manuals)
  • Markdown files (wikis, docs)
  • CSV data (employee directory, project list)
  • Web pages (company blog, documentation)

Capabilities:

  1. Document Ingestion: Async pipeline processing all formats
  2. Knowledge Graph: Auto-build from all documents
  3. Hybrid Search: Combine structured + unstructured retrieval
  4. Query Interface: Natural language queries with explanations
  5. Admin Dashboard: Monitor usage, costs, data sources

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     INGESTION LAYER                         │
│  PDF│MD│CSV│Web → Process → Extract → Split                │
└────────────────────┬────────────────────────────────────────┘
                     │
         ┌───────────┴──────────┐
         ↓                      ↓
┌────────────────┐    ┌────────────────────┐
│  VECTOR STORE  │    │ KNOWLEDGE GRAPH    │
│  (Pinecone/    │    │  (Neo4j)           │
│   Chroma)      │    │                    │
│                │    │  Entities          │
│  Embeddings    │    │  Relationships     │
│  Metadata      │    │  Properties        │
└────────┬───────┘    └──────┬─────────────┘
         │                   │
         └────────┬──────────┘
                  ↓
         ┌────────────────────┐
         │  HYBRID RETRIEVER  │
         │                    │
         │  - Query Router    │
         │  - Entity Linker   │
         │  - Context Fusion  │
         └────────┬───────────┘
                  ↓
         ┌────────────────────┐
         │   LLM GENERATION   │
         │   + REASONING      │
         └────────┬───────────┘
                  ↓
         ┌────────────────────┐
         │   API + FRONTEND   │
         │   (FastAPI+React)  │
         └────────────────────┘

Technical Specifications

Tech Stack:

  • Backend: Python 3.11, FastAPI
  • Vector DB: Pinecone or ChromaDB
  • Graph DB: Neo4j
  • LLM: GPT-4 (primary), GPT-3.5-turbo (fallback)
  • Frontend: React or Streamlit
  • Deployment: Docker + Docker Compose
  • Monitoring: Prometheus + Grafana (bonus)

Core Features (Must-Have):

  1. Document upload (drag-and-drop)
  2. Automatic KG construction
  3. Natural language queries
  4. Cited answers with source links
  5. Reasoning explanation ("how I found this")
  6. Query routing (auto-select KG vs RAG vs Hybrid)
  7. Admin dashboard (stats, costs)

Advanced Features (Nice-to-Have): 8. Multi-user support with authentication 9. Document versioning 10. Query history and analytics 11. Custom entity types 12. Graph visualization 13. Export answers as reports

Implementation Steps

Phase 1: Data Ingestion (Week 1)

  1. Build DocumentProcessor for all file types
  2. Implement async processing queue
  3. Add metadata extraction
  4. Test with 50+ documents

Phase 2: Knowledge Graph Construction (Week 2)

  1. Entity and relationship extraction
  2. Entity linking and deduplication
  3. Load into Neo4j
  4. Build basic Cypher query interface

Phase 3: RAG System (Week 2-3)

  1. Chunking strategy implementation
  2. Embedding generation and storage
  3. Hybrid retriever (BM25 + semantic)
  4. Reranker integration

Phase 4: Hybrid System (Week 3-4)

  1. Query classification and routing
  2. KG-augmented retrieval
  3. Context fusion
  4. Answer generation with citations

Phase 5: API & Frontend (Week 4-5)

  1. FastAPI endpoints
  2. Frontend (chat interface)
  3. Admin dashboard
  4. Authentication

Phase 6: Testing & Deployment (Week 5-6)

  1. Unit tests (>80% coverage)
  2. Integration tests
  3. Load testing
  4. Docker deployment
  5. Documentation

Evaluation Benchmarks

Quantitative Metrics:

  • Accuracy: 85%+ on 100-question test set
  • Latency: p95 < 2 seconds
  • Throughput: 50+ concurrent users
  • Cost: < $0.10 per query
  • Uptime: 99.5%+

Qualitative Assessment:

  • Answer quality (human evaluation)
  • Citation accuracy (source verification)
  • Reasoning clarity (explanation quality)
  • User experience (UI/UX review)

Evaluation Rubric

ComponentWeightCriteria
Data Ingestion15%Handles all file types, metadata extraction, async processing
Knowledge Graph20%Entity/relation extraction quality, graph completeness, Cypher queries work
RAG System20%Retrieval quality, chunking strategy, embedding optimization
Hybrid Integration25%Query routing, context fusion, answer quality
Production Quality20%API design, testing, deployment, documentation, monitoring

Total: 100 points

  • 90-100: Exceptional - Production-ready, innovative features
  • 80-89: Excellent - All core features working well
  • 70-79: Good - Core features present, some rough edges
  • 60-69: Adequate - Basic functionality works
  • <60: Needs improvement

What Your Portfolio Demo Should Show

5-Minute Video Covering:

  1. Intro (30s): Problem statement and solution overview
  2. Data Ingestion (60s): Upload docs, show processing pipeline
  3. Knowledge Graph (60s): Visualize graph, run Cypher query
  4. Query Demo (90s):
    • Factual query (KG-routed)
    • Analytical query (RAG-routed)
    • Multi-hop query (Hybrid)
  5. Advanced Features (60s): Citations, reasoning, admin dashboard
  6. Technical Deep-Dive (30s): Architecture diagram, tech stack

How This Signals Hire-Readiness

What Employers See:

  • Full-stack skills: Backend + Frontend + DevOps
  • AI/ML expertise: LLMs, embeddings, vector DBs
  • Data engineering: Pipelines, async processing
  • Production thinking: Testing, monitoring, deployment
  • Problem-solving: Complex system design
  • Communication: Clear documentation and demo

Conversation Starters in Interviews:

  • "Tell me about your approach to query routing"
  • "How did you optimize for cost and latency?"
  • "What were the biggest challenges in building this?"
  • "How would you scale this to 10M documents?"