
Talk to Your Docs

Custom RAG Chatbot

A RAG-based chatbot for PDF documents evolving from basic LangChain to custom PyTorch components, with Arabic OCR support and multilingual embeddings.

Click here for the GitHub repository

Purpose of the Project

This project was built first and foremost as a learning exercise. The goal is not to ship a production-ready chatbot, but to deeply understand the core building blocks of modern NLP systems - including Retrieval-Augmented Generation (RAG), transformers, and embedding models - and to learn how LLMs can be used effectively with private, unstructured data such as PDFs.

Rather than relying entirely on high-level libraries like LangChain, this project breaks down each component of the RAG pipeline and re-implements it where possible using PyTorch, HuggingFace Transformers, and lower-level NLP tools. This hands-on approach allows for a much clearer understanding of how these systems actually work under the hood.

The learning focus includes:

  • How transformer-based models process and embed text
  • How RAG pipelines retrieve relevant context from long documents
  • How to build and evaluate embedding-based search using cosine similarity and FAISS (see the sketch after this list)
  • How to process both English and Arabic documents, including challenges with OCR and tokenization
  • How to progressively move from API-based tools (OpenAI) to local, fully controlled models (HuggingFace)
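
For example, once chunks are embedded, cosine-similarity search can be implemented with FAISS by L2-normalizing the vectors and using an inner-product index. The sketch below is illustrative rather than the project's exact code; the example chunks and query are made up, and the model is the multilingual one described later:

```python
# Minimal sketch: cosine-similarity search over embedded chunks with FAISS.
# L2-normalized vectors + inner-product index == cosine similarity.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Returns must be requested within 30 days of delivery.",
    "International shipping takes 7 to 14 business days.",
]

embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(embeddings)              # in-place L2 normalization

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = model.encode(["How long do I have to return an item?"],
                     convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 2)        # top-2 most similar chunks
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```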

Problem Context

LLMs such as Gemini 2.5 or GPT-4o do not have access to your private files. If you are working with confidential documents or internal company data, uploading them to the cloud can raise serious concerns around privacy, compliance, or performance. This project explores how to build a local, secure question-answering system over your own documents without depending entirely on third-party APIs.

Overview

“Talk to Your Docs” is a modular RAG-based chatbot that takes in PDF documents and allows natural language queries. It supports both English and Arabic text, and is designed to gradually replace high-level dependencies with custom implementations to support deeper technical learning.

Development Process

The system was developed in progressive versions, each one targeting a specific learning milestone:

  • Version 1: LangChain-based MVP with OpenAI and FAISS
  • Version 2: Replaced embedding logic with custom PyTorch-based BERT embeddings
  • Version 3: Replaced LangChain chunking with manual sentence chunking using NLTK
  • Version 4: Arabic OCR support with multilingual embeddings and hybrid PDF processing

Each step was designed to isolate a component, replace it manually, and study how it fits into the overall pipeline.
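
As an illustration of the Version 2 milestone, sentence embeddings can be produced directly in PyTorch by mean-pooling the token outputs of a HuggingFace BERT model. This is a minimal sketch under assumed defaults; the model name is illustrative, not necessarily the one used in the project:

```python
# Minimal sketch: sentence embeddings via mean pooling over BERT token outputs.
# The model choice is an assumption for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences: list[str]) -> torch.Tensor:
    # Tokenize all sentences as one padded batch.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # Mean-pool token embeddings, masking out padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

print(embed(["RAG retrieves relevant context before generation."]).shape)  # torch.Size([1, 768])
```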

Technical Components

  • PDF ingestion and text extraction - pdfplumber for text-based PDFs, with PaddleOCR fallback for scanned documents
  • Sentence chunking and overlap tuning - NLTK sentence tokenization with language-aware chunking for Arabic and English (a rough sketch follows this list)
  • Embedding generation - Multilingual sentence-transformers (paraphrase-multilingual-mpnet-base-v2) for both Arabic and English text
  • Vector search - FAISS for efficient similarity search
  • Answer generation - OpenAI GPT for question-answering
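
The chunking step groups NLTK sentence tokens into size-bounded chunks and carries a sentence of overlap between neighbours so retrieved context is not cut mid-thought. The sketch below uses illustrative defaults for chunk size and overlap rather than the project's tuned values:

```python
# Rough sketch: sentence-based chunking with overlap using NLTK.
# max_chars and overlap_sentences are illustrative, not the project's tuned values.
import nltk

nltk.download("punkt", quiet=True)  # newer NLTK releases may also need "punkt_tab"

def chunk_text(text: str, max_chars: int = 1000, overlap_sentences: int = 1) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    chunks, current, length = [], [], 0
    for sentence in sentences:
        if current and length + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so adjacent chunks share context.
            current = current[-overlap_sentences:]
            length = sum(len(s) for s in current)
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```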

Version 4 added comprehensive Arabic support through:

  • Hybrid PDF processing: Automatically detects whether a PDF contains text or is a scanned image, using fast text extraction when possible and falling back to OCR when needed (see the sketch after this list)
  • PaddleOCR integration: Production-ready OCR with >74% accuracy on Arabic text, significantly outperforming alternatives like Tesseract
  • Arabic text reshaping: Implements proper right-to-left (RTL) display using arabic_reshaper and BiDi algorithm to fix reversed or mangled Arabic text
  • Multilingual embeddings: Single embedding model that handles both Arabic and English, eliminating dimension mismatches and simplifying the architecture
  • Language detection: Automatic language detection with character-based validation
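
Two pieces of this flow are easy to sketch: deciding whether a PDF needs OCR, and repairing Arabic letter shapes and right-to-left ordering. The threshold and helper names below are assumptions for illustration, not the project's exact code:

```python
# Sketch: detect scanned PDFs (for the OCR fallback) and fix Arabic display.
# The min_chars threshold and function names are illustrative assumptions.
import arabic_reshaper
import pdfplumber
from bidi.algorithm import get_display

def pdf_seems_scanned(path: str, min_chars: int = 50) -> bool:
    # If pdfplumber finds almost no embedded text, treat the PDF as a scanned image.
    with pdfplumber.open(path) as pdf:
        text = "".join((page.extract_text() or "") for page in pdf.pages)
    return len(text.strip()) < min_chars

def fix_arabic_display(text: str) -> str:
    # Join isolated Arabic letters into their contextual forms, then apply the
    # BiDi algorithm so right-to-left runs come out in the correct visual order.
    reshaped = arabic_reshaper.reshape(text)
    return get_display(reshaped)
```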

While the Arabic text reshaping fixes work for most PDFs, some Arabic PDFs may still display incorrectly due to non-standard fonts, custom encodings, or complex layouts. The system handles the most common cases where PDFs store Arabic in logical order with isolated characters.

Next Steps

Future improvements will continue to follow this learning-driven approach:

  • Custom re-ranker to improve retrieval accuracy beyond simple cosine similarity (one possible approach is sketched after this list)
  • Replace OpenAI API with local HuggingFace LLMs for fully offline operation
  • Multi-modal support to handle images and text together
  • Fine-tuning embeddings on domain-specific documents
  • Model serving with FastAPI for production deployment
  • Conversation memory to maintain context across multiple questions
  • Advanced Arabic PDF handling to support edge cases with non-standard encodings and complex layouts
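
For the re-ranker, one possible direction is a cross-encoder that re-scores the top FAISS candidates against the query. The sketch below is purely exploratory: the model named here is a common English-only example, so a multilingual cross-encoder would be needed to keep Arabic support:

```python
# Hypothetical sketch of a re-ranking step: a cross-encoder re-scores the
# chunks returned by FAISS. Model name and top_k are illustrative only.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```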