
# RAG Implementation - Complete Analysis


## Table of Contents

  1. Overview
  2. Architecture
  3. Vector Store (ChromaDB)
  4. Memory Systems
  5. Document Processing
  6. Retrieval Mechanisms
  7. Complete Workflow
  8. Configuration
  9. Code References

## Overview

The CrawlLama system implements RAG (Retrieval Augmented Generation) with ChromaDB as its vector store. RAG enables semantic search over crawled web content, which is then used as context for LLM responses.

### Core Components

| Component | File | Function |
|-----------|------|----------|
| RAGManager | `tools/rag.py` | Main class for RAG operations |
| ChromaDB | Vector store | Semantic document search |
| ContextManager | `core/context_manager.py` | Chunking & prompt building |
| ToolRegistry | `tools/tool_registry.py` | Integration into the agent workflow |

## Architecture

    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                      USER QUERY                             β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                   AGENT (SearchAgent)                       β”‚
    β”‚  - Analyzes query complexity                                β”‚
    β”‚  - Selects tool: web_search, wiki_lookup, or rag_search    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                   RAG SEARCH TOOL                           β”‚
    β”‚  RAGManager.search(query, top_k=5)                          β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                   CHROMADB QUERY                            β”‚
    β”‚  - Query embedding with "nomic-embed-text"                  β”‚
    β”‚  - Cosine Similarity Search                                 β”‚
    β”‚  - Returns: Top-K most similar documents                    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚               FORMAT RAG RESULTS                            β”‚
    β”‚  - Truncate text (max 300 characters)                       β”‚
    β”‚  - Add source & relevance score                             β”‚
    β”‚  Format: "[Source: url] (Relevance: 0.85)"                  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚              CONTEXT BUILDING                               β”‚
    β”‚  ContextManager.build_prompt()                              β”‚
    β”‚  - Combines: system_prompt + rag_results + user_query       β”‚
    β”‚  - Truncate to max_context_tokens (4000-16000)              β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚             LLM GENERATION (Ollama)                         β”‚
    β”‚  - Sends prompt to local LLM                                β”‚
    β”‚  - Optional: Hallucination Detection                        β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                   RESPONSE                                  β”‚
    β”‚  Formatted response with source attribution                 β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    

## Vector Store (ChromaDB)

### Initialization

**File:** `tools/rag.py:49-63`

```python
# ChromaDB settings
Settings(
    persist_directory="data/embeddings",
    anonymized_telemetry=False
)
# Client creation
self.client = chromadb.PersistentClient(
    path="data/embeddings"
)
# Collection setup
self.collection = self.client.get_or_create_collection(
    name="web_documents",
    metadata={"hnsw:space": "cosine"}  # cosine similarity
)
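```

For reference, a minimal self-contained sketch of the same setup outside the class (assumes only the `chromadb` package; note that without a custom embedding function ChromaDB falls back to its built-in default model, whereas the project uses `nomic-embed-text`):

```python
import chromadb

# Persistent client: embeddings are stored on disk and survive restarts.
client = chromadb.PersistentClient(path="data/embeddings")

# Cosine distance for the HNSW index, matching the snippet above.
collection = client.get_or_create_collection(
    name="web_documents",
    metadata={"hnsw:space": "cosine"}
)

collection.add(
    documents=["RAG combines retrieval with generation."],
    metadatas=[{"source": "https://example.com"}],
    ids=["doc-0001"]
)
print(collection.count())  # -> 1
```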

### Embedding Model

Documents and queries are embedded with the `nomic-embed-text` model (see Configuration); ChromaDB creates the embeddings automatically and compares them by cosine similarity.

## Document Processing

### 1. Chunking

- **Character estimation:** 4 characters β‰ˆ 1 token
- **Max chunk size:** `chunk_size * 4` characters
- **Overlap:** `overlap * 4` characters between chunks
- **Smart splitting** (see the sketch after the example):
  - Breaks at sentence boundaries (`.`, `?`, `!`)
  - Prevents truncation mid-sentence
  - Maintains context continuity via the overlap

**Example:**

```
Chunk 1: "This is the first sentence. And the second sentence."
    ↓ (overlap: 50 tokens)
Chunk 2: "And the second sentence. Here comes the third."
```
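
A minimal sketch of this sentence-boundary chunking with overlap, using the 4-characters-per-token estimate above (the function name `chunk_text` is illustrative, not the project's actual API):

```python
import re

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size-token chunks at sentence boundaries,
    carrying ~overlap tokens of trailing context into the next chunk."""
    max_chars = chunk_size * 4        # 4 characters ~= 1 token
    overlap_chars = overlap * 4
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one (overlap).
            current = current[-overlap_chars:]
        current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```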
      

### 2. Adding Documents

**File:** `tools/rag.py:78-127`

```python
def add_documents(texts: List[str],
                  metadatas: Optional[List[dict]] = None,
                  ids: Optional[List[str]] = None,
                  use_batch: bool = True) -> None
```

**Pipeline:**

```
INPUT: texts, metadatas, ids
        ↓
1. ID GENERATION (if not provided)
   - MD5 hash of text (first 16 characters)
   - Prevents duplicates
        ↓
2. METADATA PREPARATION
   - Default: {"source": "unknown"}
   - Customizable per document
        ↓
3. BATCH PROCESSING
   - Threshold: 100 documents
   - Processes in batches
   - Progress logging
        ↓
4. CHROMADB INDEXING
   - collection.add(documents=texts, metadatas=metadatas, ids=ids)
   - Automatic embedding with nomic-embed-text
   - Persistence to disk
        ↓
OUTPUT: Documents indexed & searchable
```

**Important note:**

- **No automatic indexing** of web search results!
- **Manual indexing required:**

```python
agent.add_to_knowledge_base(
    texts=["Document 1", "Document 2"],
    metadatas=[{"source": "url1"}, {"source": "url2"}]
)
```
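
For illustration, a sketch of the ID-generation, metadata-default, and batching steps above (the 100-document threshold comes from the pipeline; the function name and structure are assumptions, not the project's exact code):

```python
import hashlib

BATCH_SIZE = 100  # threshold described in the pipeline above

def add_documents_sketch(collection, texts, metadatas=None, ids=None):
    # 1. ID generation: MD5 hash of the text, first 16 characters.
    ids = ids or [hashlib.md5(t.encode()).hexdigest()[:16] for t in texts]
    # 2. Metadata preparation: default source is "unknown".
    metadatas = metadatas or [{"source": "unknown"} for _ in texts]
    # 3./4. Batch processing + indexing: ChromaDB embeds and persists.
    for start in range(0, len(texts), BATCH_SIZE):
        end = start + BATCH_SIZE
        collection.add(
            documents=texts[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
```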
      

### 3. Deduplication

**File:** `tools/rag.py:103-104`

```python
# Generate MD5 hash as ID
doc_id = hashlib.md5(text.encode()).hexdigest()[:16]
```

β†’ Same text = same ID β†’ automatic deduplication by ChromaDB.
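
A quick illustration of why this deduplicates (`hashlib` is in the standard library):

```python
import hashlib

a = hashlib.md5("same text".encode()).hexdigest()[:16]
b = hashlib.md5("same text".encode()).hexdigest()[:16]
assert a == b  # identical text always yields the identical document ID
```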

## Retrieval Mechanisms

### 1. Standard Search: `search()`

**File:** `tools/rag.py:167-226`

```python
def search(query: str,
           top_k: int = 5,
           filter_metadata: Optional[dict] = None,
           min_relevance: float = 0.0) -> List[Dict]
```

**Process:**

1. Query embedding with `nomic-embed-text`
2. ChromaDB `.query()` with cosine similarity
3. Conversion: `relevance = 1.0 - distance`
4. Filtering: only results with `relevance >= min_relevance`

**Return format:**

```python
{
    "text": "Document content...",
    "metadata": {"source": "https://example.com"},
    "distance": 0.15,        # cosine distance
    "relevance": 0.85,       # 1.0 - distance
    "id": "abc123def456"
}
```
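
A sketch of steps 3 and 4, converting a raw ChromaDB query response into this return format (assumes documents, metadatas, and distances are included in the response; the helper name is illustrative, the actual code lives at `tools/rag.py:167-226`):

```python
def to_results(db_results: dict, min_relevance: float = 0.0) -> list[dict]:
    """Flatten a ChromaDB query response (lists of lists) and filter it."""
    results = []
    for text, meta, distance, doc_id in zip(
        db_results["documents"][0],
        db_results["metadatas"][0],
        db_results["distances"][0],
        db_results["ids"][0],
    ):
        relevance = 1.0 - distance      # cosine distance -> relevance
        if relevance >= min_relevance:  # drop weak matches
            results.append({
                "text": text,
                "metadata": meta,
                "distance": distance,
                "relevance": relevance,
                "id": doc_id,
            })
    # ChromaDB already returns nearest-first, so no re-sorting is needed.
    return results
```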
    

### 2. Multi-Query Search: `multi_query_search()`

**File:** `tools/rag.py:227-279`

```python
def multi_query_search(queries: List[str],
                       top_k: int = 5,
                       deduplicate: bool = True) -> List[Dict]
```

**Use case:** query expansion / reformulation

**Process:**

1. Parallel execution with `ThreadPoolExecutor` (max 4 workers)
2. Each query is searched separately
3. Deduplication (if enabled):
   - Groups results by document ID
   - Keeps the best relevance score per document
4. Sorting by relevance (highest first)

**Example:**

```python
queries = [
    "AI in healthcare",
    "medical AI systems",
    "artificial intelligence diagnosis"
]
results = rag.multi_query_search(queries, top_k=10)
```
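
A sketch of the parallel fan-out and per-document deduplication described above (assumes the `search()` method from the previous section; not the project's verbatim code):

```python
from concurrent.futures import ThreadPoolExecutor

def multi_query_search_sketch(rag, queries, top_k=5, deduplicate=True):
    # Fan out: run each query concurrently (max 4 workers).
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_query = list(pool.map(lambda q: rag.search(q, top_k=top_k), queries))

    merged = [r for results in per_query for r in results]
    if deduplicate:
        # Keep only the best-scoring hit per document ID.
        best = {}
        for r in merged:
            if r["id"] not in best or r["relevance"] > best[r["id"]]["relevance"]:
                best[r["id"]] = r
        merged = list(best.values())

    # Highest relevance first.
    return sorted(merged, key=lambda r: r["relevance"], reverse=True)
```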
    

### 3. Hybrid Search: `hybrid_search()`

**File:** `tools/rag.py:281-318`

```python
def hybrid_search(query: str,
                  top_k: int = 5,
                  semantic_weight: float = 0.7) -> List[Dict]
```

**Combination:** semantic + keyword search

**Process:**

1. Generate query variants (see the sketch below):
   - Original query
   - Lowercase version
   - First 3 words
2. Multi-query search with all variants
3. Weighting via `semantic_weight` (0.0 - 1.0)

## Complete Workflow

**Scenario:** User asks "What is RAG in AI?"

#### Step 1: Query Analysis

**File:** `core/agent.py:138-143`

```python
# Agent receives query
query = "What is RAG in AI?"
# Agent decides tool usage
# Options: web_search, wiki_lookup, rag_search
```

#### Step 2: RAG Search (if selected)

**File:** `tools/rag.py:167-226`

```python
results = rag_manager.search(
    query="What is RAG in AI?",
    top_k=5,
    min_relevance=0.5
)
```

**Internal ChromaDB operations:**

```python
# 1. Query embedding
query_embedding = embed_model.encode("What is RAG in AI?")
# 2. Similarity search
db_results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)
# 3. Relevance calculation
for i, distance in enumerate(db_results['distances'][0]):
    relevance = 1.0 - distance
```

#### Step 3: Result Formatting

**File:** `tools/rag.py:380-413`

```python
formatted = format_rag_results(results, max_length=300)
```

**Output:**

```
1. [Source: https://example.com/rag] (Relevance: 0.92)
   RAG (Retrieval Augmented Generation) is a technique that…
2. [Source: https://ai-docs.org/rag-guide] (Relevance: 0.87)
   In RAG, relevant documents are retrieved from a database…
```
#### Step 4: Context Building

**File:** `core/context_manager.py:119-155`

```python
prompt = build_prompt(
    system_prompt="You are a helpful AI assistant.",
    user_query="What is RAG in AI?",
    context=formatted_rag_results,
    max_context_tokens=4000
)
```

**Final prompt:**

```
You are a helpful AI assistant.

Context:
1. [Source: https://example.com/rag] (Relevance: 0.92)
   RAG (Retrieval Augmented Generation) is a technique that…
2. [Source: https://ai-docs.org/rag-guide] (Relevance: 0.87)
   In RAG, relevant documents are retrieved from a database…

Question: What is RAG in AI?
```
#### Step 5: Token Management

**File:** `core/context_manager.py:133-145`

```python
# Token counting
system_tokens = count_tokens(system_prompt)
query_tokens = count_tokens(user_query)
context_tokens = count_tokens(context)
# Truncation if needed
if context_tokens > max_context_tokens:
    context = truncate_text(context, max_context_tokens)
```
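
One plausible implementation of these helpers, consistent with the 4-characters-per-token estimate from Document Processing (the real versions live in `core/context_manager.py`; this is an assumption, not the project's code):

```python
def count_tokens(text: str) -> int:
    # Heuristic: 4 characters ~= 1 token.
    return len(text) // 4

def truncate_text(text: str, max_tokens: int) -> str:
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    # Prefer cutting at the last sentence boundary before the limit.
    cut = text[:max_chars]
    boundary = cut.rfind(". ")
    return cut[: boundary + 1] if boundary > 0 else cut
```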
    

#### Step 6: LLM Generation

**File:** `core/agent.py` + `core/ollama_client.py`

```python
response = ollama_client.generate(
    prompt=final_prompt,
    model="llama2",  # or the configured model
    temperature=0.7
)
```

#### Step 7: Response to User

```
RAG (Retrieval Augmented Generation) combines information
retrieval with LLM generation. Relevant documents are retrieved
from a knowledge base and used as context for the LLM response.
This reduces hallucinations and enables up-to-date, fact-based answers.

Sources:
- https://example.com/rag (Relevance: 0.92)
- https://ai-docs.org/rag-guide (Relevance: 0.87)
```
  • Configuration

    config.json RAG Settings

    File: config.json:24-30

    {
      "rag": {
     "enabled": true,
     "embedding_model": "nomic-embed-text",
     "chunk_size": 500,
     "chunk_overlap": 50,
     "top_k": 10
      }
    }
    

### Path Configuration

```json
{
  "paths": {
    "embeddings_dir": "data/embeddings"
  }
}
```

### Context Limits

```json
{
  "context_limits": {
    "small": 4000,
    "medium": 6000,
    "large": 8000,
    "xlarge": 12000,
    "max_storage": 8000
  }
}
```

## ✨ Strengths

Optimized for an RTX 3080:

- **Graceful degradation:** falls back if ChromaDB is unavailable
- **Parallel processing:** `ThreadPoolExecutor` for multi-query search
- **Smart chunking:** sentence-based splitting with overlap
- **Flexible search:** standard, multi-query, and hybrid modes
- **Lazy loading:** RAG is a "heavy" tool, loaded on demand
- **Deduplication:** MD5 hashing prevents duplicates

## ⚠️ Limitations

- **No auto-indexing:** web search results are NOT automatically added to the RAG store
- **Manual population:** the user must explicitly index documents
- **Single collection:** all documents live in one collection
- **No update mechanism:** no built-in method to update existing documents
- **Simple metadata:** only a source-based metadata structure

## 🎯 Use Cases

**Currently implemented:**

- Manual indexing of documents
- Semantic search over indexed content
- Multi-query expansion for better recall

**Not implemented:**

- Automatic indexing of web crawl results
- Incremental updates of documents
- Multi-tenancy (separate collections per user)

## Usage Examples

### 1. Adding Documents

```python
# Via Agent API
agent = SearchAgent(config)
agent.add_to_knowledge_base(
    texts=[
        "RAG combines retrieval with generation.",
        "ChromaDB is a vector database."
    ],
    metadatas=[
        {"source": "https://rag-tutorial.com"},
        {"source": "https://chromadb.docs"}
    ]
)
```
### 2. Searching

```python
# Via RAGManager
rag = RAGManager()
results = rag.search(
    query="What is a vector database?",
    top_k=5,
    min_relevance=0.5
)
for result in results:
    print(f"[{result['relevance']:.2f}] {result['text'][:100]}...")
```
### 3. Multi-Query Search

```python
# Query expansion
queries = [
    "Vector Database",
    "Embedding storage systems",
    "Semantic search databases"
]
results = rag.multi_query_search(
    queries=queries,
    top_k=10,
    deduplicate=True
)
```
### 4. Hybrid Search

```python
# Semantic + keyword combination
results = rag.hybrid_search(
    query="ChromaDB features",
    top_k=5,
    semantic_weight=0.7  # 70% semantic, 30% keyword
)
```
  • Summary

    The CrawlLama RAG system uses ChromaDB for semantic document search with the nomic-embed-text embedding model. The workflow is:

  • Indexing: Documents manually added via add_to_knowledge_base()
  • Chunking: Texts split into 500-token chunks with 50-token overlap
  • Embedding: ChromaDB automatically creates embeddings
  • Retrieval: Cosine similarity search finds most similar documents
  • Context Building: Results formatted and inserted into LLM prompt
  • Generation: Local Ollama LLM generates response based on context Important: The system clearly separates:
    • RAG Store (ChromaDB) for semantic search
    • Memory Store (JSON) for structured OSINT data Web search results are NOT automatically indexed - this must be done manually.