What is RAG?

Imagine you ask a friend:

“What’s the penalty for breaking my loan agreement early?”

If your friend doesn’t know but quickly checks the contract, finds the exact clause, and then answers you — that’s RAG in action.

RAG (Retrieval-Augmented Generation) combines two steps:

  1. Retrieve → Search your documents or knowledge base for relevant info.

  2. Generate → Use an LLM to answer, based only on that retrieved info.

This grounds the AI's answers in real data, which reduces hallucinations and makes responses accurate and traceable.
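In pseudocode, the whole idea fits in a few lines. This is only a conceptual sketch: `retrieve` and `llm` are placeholders for whatever search function and language model you plug in, not part of the project code.

# Conceptual sketch of RAG: retrieve first, then generate from what was retrieved.
# `retrieve` and `llm` are placeholders, not real library calls.
def answer_with_rag(question: str, retrieve, llm) -> str:
    # 1. Retrieve: find the document chunks most relevant to the question
    context = "\n\n".join(retrieve(question))
    # 2. Generate: ask the LLM to answer using only that context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)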



A Quick Note on LangChain

To build our assistant, we’ll use LangChain — a framework that makes it easier to work with LLMs by providing building blocks.

Key features of LangChain:

  • Document Loaders → Extract text from PDFs, Word files, websites, etc.

  • Text Splitters → Break large documents into chunks.

  • Embeddings & Vector Stores → Convert chunks into embeddings and store them for fast semantic search.

  • Chains & Agents → Orchestrate retrieval + reasoning flows.

Analogy: If AI development is cooking, LangChain is your pre-stocked kitchen with ingredients (loaders, splitters, embeddings) and utensils (chains, retrievers).

In our project, LangChain will stitch together all steps of the RAG pipeline: Load PDFs → Split → Embed → Store → Retrieve → Answer.


Roadmap: From Basics to Agentic Workflows

Here’s the phased roadmap we’ll follow (see image):

[Image: Agentic RAG roadmap]

  1. Foundational RAG for Single-Source Data

    • Core pipeline that works with one source (like a single PDF).

  2. Data Persistence & Stateful Management

    • Store embeddings and chat history in a database for multi-session use.

  3. Multi-Source Integration & Advanced Retrieval

    • Support multiple data sources with refined retrieval strategies.

  4. Agentic Workflows (Future Phase)

    • Move beyond Q&A to assistants that can take proactive, goal-oriented actions.


High-Level Design (HLD)

[Image: Phase 1 high-level design (HLD)]

Let’s walk through the architecture row by row (see diagram):

1. User Interaction (Top Row)

  • The user uploads a PDF or types a question in a React web app.

  • The React app sends requests to the backend.

  • This keeps the interface clean: upload → ask → get an answer.


2. Flask/FastAPI Backend (Middle Row)

The React app talks to a backend server via two endpoints:

  • /upload-pdf → Handles document upload and processing.

  • /chat → Handles queries and returns responses.

The backend acts like a traffic controller, directing data between the UI and the RAG pipeline.


3. RAG Pipeline (Row 3)

This is the core engine powered by LangChain:

  • PyMuPDFLoader → Extracts text and metadata from PDFs.

  • RecursiveCharacterTextSplitter → Splits large text into overlapping chunks.

  • OpenAIEmbeddings → Converts chunks into embeddings (vectors).

  • FAISS Vector Store → Stores embeddings for fast similarity search.


4. Vector DB & LLM (Bottom Row)

  • FAISS DB → Think of it as Google Search for your document.

  • Retriever & QnA Chain → Fetches context + combines with your query.

  • ChatOpenAI (GPT-4o) → Generates the final human-like answer.


Diving into the Code: The RAG Pipeline

1. Data Ingestion: Loading and Splitting

from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDF
loader = PyMuPDFLoader("sample.pdf")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

  • Loader extracts raw text + metadata (page numbers, file name).

  • Splitter ensures text is chopped neatly without losing context.
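
As a quick sanity check (not part of the pipeline itself), you can print what the loader and splitter produced. This continues directly from the snippet above; the exact metadata keys depend on the loader version.

# Optional: inspect what ingestion produced
print(f"Loaded {len(documents)} pages, split into {len(chunks)} chunks")
print(chunks[0].metadata)            # typically includes the source file and page number
print(chunks[0].page_content[:200])  # first 200 characters of the first chunk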


2. Embedding and Indexing

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Generate embeddings
embeddings = OpenAIEmbeddings()

# Create FAISS index
vector_store = FAISS.from_documents(chunks, embeddings)

  • Embeddings turn text into dense vectors.

  • FAISS stores them in memory for quick similarity search.
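
To see what the index returns before wiring up the full chain, FAISS's `similarity_search` method is handy. This is a quick check continuing from the snippet above; the query string is just an example.

# Optional: peek at the top matches for a sample query
results = vector_store.similarity_search("early termination penalty", k=3)
for doc in results:
    print(doc.metadata.get("page"), "→", doc.page_content[:100])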


3. Retrieval and Answer Generation

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Create retriever
retriever = vector_store.as_retriever()

# Q&A chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever
)

# Ask a question
response = qa_chain.run("What is the penalty for early termination?")
print(response)

  • Retriever finds the top relevant chunks.

  • Q&A chain combines user question + retrieved chunks → sends to LLM.

  • LLM is instructed to answer using only that context.
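
To make that traceability concrete, RetrievalQA can also hand back the chunks it used. A small variation on the chain above (same imports; calling the chain with a dict instead of `.run()` is required when sources are returned):

# Variant that also returns the retrieved chunks, so answers can be traced to pages
qa_chain_with_sources = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain_with_sources({"query": "What is the penalty for early termination?"})
print(result["result"])                  # the generated answer
for doc in result["source_documents"]:   # the chunks the answer was grounded in
    print(doc.metadata.get("page"))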


4. FastAPI Endpoints

In main.py, we expose simple APIs:

from fastapi import FastAPI, UploadFile
from rag_pipeline import process_pdf, qa_chain

app = FastAPI()

@app.post("/upload-pdf")
async def upload_pdf(file: UploadFile):
    return process_pdf(file)

@app.post("/chat")
async def chat(question: str):
    return {"answer": qa_chain.run(question)}


Example Flow in Action

  • User uploads contract.pdf in the React app.

  • Backend processes it → loads, splits, embeds, stores in FAISS.

  • User asks: “What is the penalty for early withdrawal?”

  • Backend retrieves the relevant clause (page 12).

  • ChatOpenAI forms a natural response:

“According to page 12, early withdrawal incurs a 2% fee if done within 6 months.”

Wrapping Up

With Phase 1, we now have:

✅ A working RAG pipeline for single PDFs

✅ A clean separation between frontend, backend, and AI pipeline

✅ Grounded, document-aware answers

Next steps:

  • Add persistence so embeddings don’t vanish on restart (a small FAISS example follows this list)

  • Support multiple documents

  • Explore advanced retrieval techniques

  • Move towards agentic workflows
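
As a small preview of the persistence step, the FAISS index we built can already be written to disk and reloaded instead of re-embedding on every restart. A minimal sketch, continuing from the pipeline code above (the folder name is a placeholder, and newer LangChain releases may also require an `allow_dangerous_deserialization=True` argument when loading):

# Save the in-memory FAISS index after ingestion
vector_store.save_local("faiss_index")

# Later (e.g. after a restart), reload it instead of re-processing the PDF
restored_store = FAISS.load_local("faiss_index", embeddings)
retriever = restored_store.as_retriever()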

This is just the beginning of turning your PDFs into a personal knowledge assistant.


Resources & Connect

If you’d like to dive deeper or try this project yourself, follow along with the blog and the GitHub repo.

I’ll continue updating both the blog and the GitHub repo as we progress through the later phases of the Agentic RAG roadmap. Stay tuned; exciting things ahead!