
Talk to Your Docs

Custom RAG Chatbot

A RAG-based chatbot for PDF documents evolving from basic LangChain to custom PyTorch components, with Arabic OCR support and multilingual embeddings.

Click here for the GitHub repository

Purpose of the Project

This project was built first and foremost as a learning exercise. The goal is not to ship a production-ready chatbot, but to deeply understand the core building blocks of modern NLP systems - including Retrieval-Augmented Generation (RAG), transformers, and embedding models - and to learn how LLMs can be used effectively with private, unstructured data such as PDFs.

Rather than relying entirely on high-level libraries like LangChain, this project breaks down each component of the RAG pipeline and re-implements it where possible using PyTorch, HuggingFace Transformers, and lower-level NLP tools. This hands-on approach allows for a much clearer understanding of how these systems actually work under the hood.

The learning focus includes:

  • How transformer-based models process and embed text
  • How RAG pipelines retrieve relevant context from long documents
  • How to build and evaluate embedding-based search using cosine similarity and FAISS (see the sketch after this list)
  • How to process both English and Arabic documents, including challenges with OCR and tokenization
  • How to progressively move from API-based tools (OpenAI) to local, fully controlled models (HuggingFace)
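
For example, once chunks are embedded, cosine-similarity search can be implemented with FAISS by L2-normalizing the vectors and using an inner-product index. The sketch below is illustrative rather than the project's exact code; the example chunks and query are made up, and the model is the multilingual one described later:

```python
# Minimal sketch: cosine-similarity search over embedded chunks with FAISS.
# L2-normalized vectors + inner-product index == cosine similarity.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Returns must be requested within 30 days of delivery.",
    "International shipping takes 7 to 14 business days.",
]

embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(embeddings)              # in-place L2 normalization

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = model.encode(["How long do I have to return an item?"],
                     convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 2)        # top-2 most similar chunks
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```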

Problem Context

LLMs such as Gemini 2.5 or GPT-4o do not have access to your private files. If you are working with confidential documents or internal company data, uploading them to the cloud can raise serious concerns around privacy, compliance, or performance. This project explores how to build a local, secure question-answering system over your own documents without depending entirely on third-party APIs.

Overview

“Talk to Your Docs” is a modular RAG-based chatbot that takes in PDF documents and allows natural language queries. It supports both English and Arabic text, and is designed to gradually replace high-level dependencies with custom implementations to support deeper technical learning.

Development Process

The system was developed in progressive versions, each one targeting a specific learning milestone:

  • Version 1: LangChain-based MVP with OpenAI and FAISS
  • Version 2: Replaced embedding logic with custom PyTorch-based BERT embeddings
  • Version 3: Replaced LangChain chunking with manual sentence chunking using NLTK
  • Version 4: Arabic OCR support with multilingual embeddings and hybrid PDF processing

Each step was designed to isolate a component, replace it manually, and study how it fits into the overall pipeline.
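
As an illustration of the Version 2 milestone, sentence embeddings can be produced directly in PyTorch by mean-pooling the token outputs of a HuggingFace BERT model. This is a minimal sketch under assumed defaults; the model name is illustrative, not necessarily the one used in the project:

```python
# Minimal sketch: sentence embeddings via mean pooling over BERT token outputs.
# The model choice is an assumption for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences: list[str]) -> torch.Tensor:
    # Tokenize all sentences as one padded batch.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # Mean-pool token embeddings, masking out padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

print(embed(["RAG retrieves relevant context before generation."]).shape)  # torch.Size([1, 768])
```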

Technical Components

  • PDF ingestion and text extraction - pdfplumber for text-based PDFs, with PaddleOCR fallback for scanned documents
  • Sentence chunking and overlap tuning - NLTK sentence tokenization with language-aware chunking for Arabic and English (a rough sketch follows this list)
  • Embedding generation - Multilingual sentence-transformers (paraphrase-multilingual-mpnet-base-v2) for both Arabic and English text
  • Vector search - FAISS for efficient similarity search
  • Answer generation - OpenAI GPT for question-answering
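
The chunking step groups NLTK sentence tokens into size-bounded chunks and carries a sentence of overlap between neighbours so retrieved context is not cut mid-thought. The sketch below uses illustrative defaults for chunk size and overlap rather than the project's tuned values:

```python
# Rough sketch: sentence-based chunking with overlap using NLTK.
# max_chars and overlap_sentences are illustrative, not the project's tuned values.
import nltk

nltk.download("punkt", quiet=True)  # newer NLTK releases may also need "punkt_tab"

def chunk_text(text: str, max_chars: int = 1000, overlap_sentences: int = 1) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so adjacent chunks share context.
            current = current[-overlap_sentences:]
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```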

Version 4 added comprehensive Arabic support through:

  • Hybrid PDF processing: Automatically detects whether a PDF contains text or is a scanned image, using fast text extraction when possible and falling back to OCR when needed (see the sketch after this list)
  • PaddleOCR integration: Production-ready OCR with >74% accuracy on Arabic text, significantly outperforming alternatives like Tesseract
  • Arabic text reshaping: Implements proper right-to-left (RTL) display using arabic_reshaper and BiDi algorithm to fix reversed or mangled Arabic text
  • Multilingual embeddings: Single embedding model that handles both Arabic and English, eliminating dimension mismatches and simplifying the architecture
  • Language detection: Automatic language detection with character-based validation
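
Two pieces of this flow are easy to sketch: deciding whether a PDF needs OCR, and repairing Arabic letter shapes and right-to-left ordering. The threshold and helper names below are assumptions for illustration, not the project's exact code:

```python
# Sketch: detect scanned PDFs (for the OCR fallback) and fix Arabic display.
# The min_chars threshold and function names are illustrative assumptions.
import arabic_reshaper
import pdfplumber
from bidi.algorithm import get_display

def pdf_seems_scanned(path: str, min_chars: int = 50) -> bool:
    # If pdfplumber finds almost no embedded text, treat the PDF as a scanned image.
    with pdfplumber.open(path) as pdf:
        text = "".join((page.extract_text() or "") for page in pdf.pages)
    return len(text.strip()) < min_chars

def fix_arabic_display(text: str) -> str:
    # Join isolated Arabic letters into their contextual forms, then apply the
    # BiDi algorithm so right-to-left runs come out in the correct visual order.
    reshaped = arabic_reshaper.reshape(text)
    return get_display(reshaped)
```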

While the Arabic text reshaping fixes work for most PDFs, some Arabic PDFs may still display incorrectly due to non-standard fonts, custom encodings, or complex layouts. The system handles the most common cases where PDFs store Arabic in logical order with isolated characters.

Next Steps

Future improvements will continue to follow this learning-driven approach:

  • Custom re-ranker to improve retrieval accuracy beyond simple cosine similarity (one possible approach is sketched after this list)
  • Replace OpenAI API with local HuggingFace LLMs for fully offline operation
  • Multi-modal support to handle images and text together
  • Fine-tuning embeddings on domain-specific documents
  • Model serving with FastAPI for production deployment
  • Conversation memory to maintain context across multiple questions
  • Advanced Arabic PDF handling to support edge cases with non-standard encodings and complex layouts
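
For the re-ranker, one possible direction is a cross-encoder that re-scores the top FAISS candidates against the query. The sketch below is purely exploratory: the model named here is a common English-only example, so a multilingual cross-encoder would be needed to keep Arabic support:

```python
# Hypothetical sketch of a re-ranking step: a cross-encoder re-scores the
# chunks returned by FAISS. Model name and top_k are illustrative only.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```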