The CrawlLama system implements RAG (Retrieval Augmented Generation) using ChromaDB as the vector store. RAG enables semantic search over crawled web content and feeds the retrieved documents to the LLM as context for its responses.
| Component | File | Function |
|-----------|------|----------|
| RAGManager | tools/rag.py | Main class for RAG operations |
| ChromaDB | Vector Store | Semantic document search |
| ContextManager | core/context_manager.py | Chunking & Prompt Building |
| ToolRegistry | tools/tool_registry.py | Integration into Agent Workflow |
```
┌───────────────────────────────────────────────────────────────┐
│                          USER QUERY                           │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                      AGENT (SearchAgent)                      │
│  - Analyzes query complexity                                  │
│  - Selects tool: web_search, wiki_lookup, or rag_search       │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                        RAG SEARCH TOOL                        │
│               RAGManager.search(query, top_k=5)               │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                        CHROMADB QUERY                         │
│  - Query embedding with "nomic-embed-text"                    │
│  - Cosine Similarity Search                                   │
│  - Returns: Top-K most similar documents                      │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                      FORMAT RAG RESULTS                       │
│  - Truncate text (max 300 characters)                         │
│  - Add source & relevance score                               │
│    Format: "[Source: url] (Relevance: 0.85)"                  │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                        CONTEXT BUILDING                       │
│  ContextManager.build_prompt()                                │
│  - Combines: system_prompt + rag_results + user_query         │
│  - Truncate to max_context_tokens (4000-16000)                │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                    LLM GENERATION (Ollama)                    │
│  - Sends prompt to local LLM                                  │
│  - Optional: Hallucination Detection                          │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────────────────────┐
│                           RESPONSE                            │
│          Formatted response with source attribution           │
└───────────────────────────────────────────────────────────────┘
```
**File:** `tools/rag.py:49-63`

```python
# ChromaDB Settings
Settings(
    persist_directory="data/embeddings",
    anonymized_telemetry=False
)

# Client Creation
self.client = chromadb.PersistentClient(
    path="data/embeddings"
)

# Collection Setup
self.collection = self.client.get_or_create_collection(
    name="web_documents",
    metadata={"hnsw:space": "cosine"}  # Cosine Similarity
)
```
- Embedding model: `nomic-embed-text` (see `data/models/config.json:26`)
- Storage: `data/embeddings/`
- Documents are added via `.add()` or `.upsert()`
- Collection: `web_documents` (configurable)
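A minimal sketch of how a query could be embedded with `nomic-embed-text` through the Ollama Python client and run against this collection (the `embed_query` helper and the local Ollama endpoint are illustrative assumptions, not the documented `RAGManager` internals):

```python
import chromadb
import ollama

def embed_query(text: str) -> list:
    # Assumption: a local Ollama server serves the "nomic-embed-text" model.
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

client = chromadb.PersistentClient(path="data/embeddings")
collection = client.get_or_create_collection(
    name="web_documents",
    metadata={"hnsw:space": "cosine"},
)

# Query the collection with the precomputed embedding.
hits = collection.query(
    query_embeddings=[embed_query("What is RAG in AI?")],
    n_results=5,
)
```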
**File:** `tools/rag.py:46-76, 94-96`
If ChromaDB is unavailable:

```python
except Exception as e:
    logger.warning(f"RAG system initialization failed: {e}")
    logger.warning("RAG functionality will be disabled")
    self.client = None
    self.collection = None
```

→ The system continues without RAG capability.
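A sketch of how callers can respect this degraded state (the guard below is illustrative, not the literal code in `tools/rag.py`):

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

def safe_rag_search(rag_manager, query: str, top_k: int = 5) -> List[Dict]:
    """Query RAG only if ChromaDB was initialized successfully."""
    if rag_manager is None or rag_manager.collection is None:
        logger.warning("RAG is disabled - returning empty context")
        return []
    return rag_manager.search(query=query, top_k=top_k)
```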
Purpose: Semantic search over web content
| Property | Value |
|----------|-------|
| Storage | data/embeddings/ |
| Persistence | Automatic via ChromaDB |
| Content | Crawled web pages, search results |
| Access | Semantic similarity search |
Purpose: OSINT data (emails, phones, IPs, usernames)
| Property | Value |
|----------|-------|
| File | core/memory_store.py |
| Storage | data/memory.json |
| Content | Structured intelligence data with metadata |
| Access | Direct JSON access |
**IMPORTANT:** The Memory Store is NOT used for RAG!
**File:** `core/context_manager.py:56-101`
```python
def split_into_chunks(text: str,
                      chunk_size: int = 500,  # Tokens
                      overlap: int = 50) -> List[str]:
```
Algorithm:

1. Target chunk length is `chunk_size * 4` characters (heuristic: 1 token ≈ 4 characters).
2. Consecutive chunks overlap by `overlap * 4` characters.
3. Chunks are split preferentially at sentence boundaries (`.`, `?`, `!`).

Example (50-token overlap):

- Chunk 1: "This is the first sentence. And the second sentence."
- Chunk 2: "And the second sentence. Here comes the third."
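A simplified sketch of this sentence-aware chunking (the real `split_into_chunks` in `core/context_manager.py:56-101` may differ in detail; the 4-characters-per-token heuristic is taken from the algorithm above):

```python
import re
from typing import List

def split_into_chunks(text: str, chunk_size: int = 500,
                      overlap: int = 50) -> List[str]:
    """Split text into ~chunk_size-token chunks with ~overlap tokens of overlap."""
    max_chars = chunk_size * 4        # heuristic: 1 token is roughly 4 characters
    overlap_chars = overlap * 4

    # Prefer splitting at sentence boundaries (., ?, !).
    sentences = re.split(r"(?<=[.?!])\s+", text)

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = current[-overlap_chars:]   # carry the tail forward as overlap
        current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```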
**File:** `tools/rag.py:78-127`
```python
def add_documents(texts: List[str],
                  metadatas: Optional[List[dict]] = None,
                  ids: Optional[List[str]] = None,
                  use_batch: bool = True) -> None
```
Pipeline: `texts`, `metadatas`, and `ids` go in; missing IDs are generated from the text content (see deduplication below) and the documents are written to the ChromaDB collection, in batches when `use_batch=True`.
**Important Note:**
- **No automatic indexing** of web search results!
- **Manual indexing required:**
```python
agent.add_to_knowledge_base(
texts=["Document 1", "Document 2"],
metadatas=[{"source": "url1"}, {"source": "url2"}]
)
```
**File:** `tools/rag.py:103-104`
```python
# Generate MD5 hash as ID
doc_id = hashlib.md5(text.encode()).hexdigest()[:16]
```
→ Same text = same ID → automatic deduplication by ChromaDB.
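Because the ID is derived from the content, re-adding the same text overwrites the existing entry instead of creating a duplicate. A small illustration (the `upsert` call is standard ChromaDB; the surrounding code is a sketch using the collection's default embedding function):

```python
import hashlib
import chromadb

client = chromadb.PersistentClient(path="data/embeddings")
collection = client.get_or_create_collection(name="web_documents")

def doc_id(text: str) -> str:
    # Same text -> same MD5 prefix -> same document ID.
    return hashlib.md5(text.encode()).hexdigest()[:16]

text = "ChromaDB is a vector database."
collection.upsert(documents=[text], ids=[doc_id(text)])
collection.upsert(documents=[text], ids=[doc_id(text)])  # no duplicate is created
```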
**File:** `tools/rag.py:167-226`
```python
def search(query: str,
           top_k: int = 5,
           filter_metadata: Optional[dict] = None,
           min_relevance: float = 0.0) -> List[Dict]
```
Process:

1. The query is embedded with `nomic-embed-text`.
2. The collection is queried via `.query()` with cosine similarity.
3. Distances are converted to relevance scores: `relevance = 1.0 - distance`.
4. Results below `min_relevance` are filtered out.
Return Format:

```python
{
    "text": "Document content...",
    "metadata": {"source": "https://example.com"},
    "distance": 0.15,   # Cosine distance
    "relevance": 0.85,  # 1.0 - distance
    "id": "abc123def456"
}
```
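A sketch of the distance-to-relevance conversion behind this format (the helper name `to_results` is illustrative; the field layout mirrors the return format shown above):

```python
from typing import Dict, List

def to_results(db_results: dict, min_relevance: float = 0.0) -> List[Dict]:
    """Turn a raw ChromaDB query result into the documented result dicts."""
    results = []
    for text, meta, distance, doc_id in zip(
        db_results["documents"][0],
        db_results["metadatas"][0],
        db_results["distances"][0],
        db_results["ids"][0],
    ):
        relevance = 1.0 - distance            # cosine distance -> relevance score
        if relevance < min_relevance:         # drop matches below the threshold
            continue
        results.append({
            "text": text,
            "metadata": meta,
            "distance": distance,
            "relevance": relevance,
            "id": doc_id,
        })
    return results
```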
**File:** `tools/rag.py:227-279`
```python
def multi_query_search(queries: List[str],
                       top_k: int = 5,
                       deduplicate: bool = True) -> List[Dict]
```
Use Case: query expansion / reformulation.

Process: the queries run in parallel via a `ThreadPoolExecutor` (max 4 workers); results are merged and, if `deduplicate=True`, duplicate documents are removed.

```python
queries = [
    "AI in healthcare",
    "medical AI systems",
    "artificial intelligence diagnosis"
]
results = rag.multi_query_search(queries, top_k=10)
```
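A simplified sketch of the parallel fan-out with deduplication by document ID (it assumes `rag.search()` as documented above; the real `multi_query_search` in `tools/rag.py` may differ):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def multi_query_search(rag, queries: List[str], top_k: int = 5,
                       deduplicate: bool = True) -> List[Dict]:
    """Run several query reformulations in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_query = pool.map(lambda q: rag.search(query=q, top_k=top_k), queries)

    merged = [hit for hits in per_query for hit in hits]
    if deduplicate:
        seen, unique = set(), []
        for hit in merged:
            if hit["id"] not in seen:         # same document reached via several queries
                seen.add(hit["id"])
                unique.append(hit)
        merged = unique

    # Most relevant documents first.
    return sorted(merged, key=lambda h: h["relevance"], reverse=True)
```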
**File:** `tools/rag.py:281-318`
```python
def hybrid_search(query: str,
                  top_k: int = 5,
                  semantic_weight: float = 0.7) -> List[Dict]
```
Combination: semantic + keyword search.

Process: semantic and keyword scores are blended, weighted by `semantic_weight` (0.0 - 1.0), as sketched below.
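A sketch of such a weighted combination (the keyword-overlap scoring is an assumption for illustration; only the `semantic_weight` blending is taken from the description):

```python
from typing import Dict, List

def hybrid_score(query: str, hit: Dict, semantic_weight: float = 0.7) -> float:
    """Blend the semantic relevance with a simple keyword-overlap score."""
    query_terms = set(query.lower().split())
    doc_terms = set(hit["text"].lower().split())
    keyword_score = len(query_terms & doc_terms) / max(len(query_terms), 1)
    return semantic_weight * hit["relevance"] + (1.0 - semantic_weight) * keyword_score

def hybrid_search(rag, query: str, top_k: int = 5,
                  semantic_weight: float = 0.7) -> List[Dict]:
    # Over-fetch semantically, then re-rank with the blended score.
    candidates = rag.search(query=query, top_k=top_k * 2)
    candidates.sort(key=lambda h: hybrid_score(query, h, semantic_weight), reverse=True)
    return candidates[:top_k]
```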
#### Step 1: Query Analysis & Tool Selection

**File:** `core/agent.py:138-143`
```python
# Agent receives query
query = "What is RAG in AI?"

# Agent decides tool usage
# Options: web_search, wiki_lookup, rag_search
```
#### Step 2: RAG Search

**File:** `tools/rag.py:167-226`
```python
results = rag_manager.search(
    query="What is RAG in AI?",
    top_k=5,
    min_relevance=0.5
)
```
Internal ChromaDB Operations:
```python
# 1. Query Embedding
query_embedding = embed_model.encode("What is RAG in AI?")

# 2. Similarity Search
db_results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)

# 3. Relevance Calculation
for i, distance in enumerate(db_results['distances'][0]):
    relevance = 1.0 - distance
```
#### Step 3: Result Formatting

**File:** `tools/rag.py:380-413`
```python
formatted = format_rag_results(results, max_length=300)
```
Output: each document is truncated to at most 300 characters and annotated with its source and relevance score, e.g. `[Source: url] (Relevance: 0.85)`.
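An illustrative version of this formatting step (truncation length and label format follow the description above; the real `format_rag_results` in `tools/rag.py:380-413` may differ):

```python
from typing import Dict, List

def format_rag_results(results: List[Dict], max_length: int = 300) -> str:
    """Render retrieved documents as compact, source-attributed context."""
    blocks = []
    for result in results:
        text = result["text"][:max_length]                 # truncate long documents
        source = result["metadata"].get("source", "unknown")
        blocks.append(f"[Source: {source}] (Relevance: {result['relevance']:.2f})\n{text}")
    return "\n\n".join(blocks)
```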
#### Step 4: Context Building
**File:** `core/context_manager.py:119-155`
```python
prompt = build_prompt(
system_prompt="You are a helpful AI assistant.",
user_query="What is RAG in AI?",
context=formatted_rag_results,
max_context_tokens=4000
)
```

Final Prompt:

```
You are a helpful AI assistant.

Context:
...
```
#### Step 5: Token Management
**File:** `core/context_manager.py:133-145`
```python
# Token Counting
system_tokens = count_tokens(system_prompt)
query_tokens = count_tokens(user_query)
context_tokens = count_tokens(context)
# Truncation if needed
if context_tokens > max_context_tokens:
    context = truncate_text(context, max_context_tokens)
```
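Taken together, Steps 4 and 5 could be sketched like this (the 4-characters-per-token heuristic and the exact prompt layout are assumptions; the actual implementation lives in `core/context_manager.py:119-155`):

```python
def count_tokens(text: str) -> int:
    # Rough heuristic: 1 token is roughly 4 characters.
    return len(text) // 4

def truncate_text(text: str, max_tokens: int) -> str:
    # Cut the context down to the token budget (character-based approximation).
    return text[: max_tokens * 4]

def build_prompt(system_prompt: str, user_query: str,
                 context: str, max_context_tokens: int = 4000) -> str:
    """Combine system prompt, RAG context, and user query into one prompt."""
    if count_tokens(context) > max_context_tokens:
        context = truncate_text(context, max_context_tokens)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {user_query}"
```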
#### Step 6: LLM Generation

**File:** `core/agent.py` + `core/ollama_client.py`
```python
response = ollama_client.generate(
    prompt=final_prompt,
    model="llama2",  # or configured model
    temperature=0.7
)
```
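For reference, a minimal sketch of the underlying Ollama call when issued directly against the local REST API (the endpoint and payload follow Ollama's public `/api/generate` interface; the project's `ollama_client` wraps this):

```python
import requests

def ollama_generate(prompt: str, model: str = "llama2",
                    temperature: float = 0.7) -> str:
    """Send the final prompt to a local Ollama server and return the answer."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": temperature},
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```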
Example response:

RAG (Retrieval Augmented Generation) combines information
retrieval with LLM generation. Relevant documents are retrieved
from a knowledge base and used as context for the LLM response.
This reduces hallucinations and enables up-to-date, fact-based answers.
Sources:
- https://example.com/rag (Relevance: 0.92)
- https://ai-docs.org/rag-guide (Relevance: 0.87)
**File:** `config.json:24-30`
```json
{
  "rag": {
    "enabled": true,
    "embedding_model": "nomic-embed-text",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "top_k": 10
  }
}
```
```json
{
  "paths": {
    "embeddings_dir": "data/embeddings"
  }
}
```
```json
{
  "context_limits": {
    "small": 4000,
    "medium": 6000,
    "large": 8000,
    "xlarge": 12000,
    "max_storage": 8000
  }
}
```
Optimized for RTX 3080:
| Parameter | Value | Description |
|-----------|-------|-------------|
| batch_size | 100 | Documents per batch |
| max_workers | 4 | Threads for parallel search |
| similarity_metric | cosine | Distance metric |
| min_relevance | 0.0 | Minimum relevance score |
| File | Lines | Description |
|------|-------|-------------|
| tools/rag.py | 1-413 | Complete RAG Implementation |
| | 49-63 | ChromaDB Initialization |
| | 78-127 | add_documents() - Document Indexing |
| | 167-226 | search() - Standard Search |
| | 227-279 | multi_query_search() - Parallel Multi-Query |
| | 281-318 | hybrid_search() - Hybrid Semantic+Keyword |
| | 380-413 | format_rag_results() - Result Formatting |
| File | Lines | Description |
|------|-------|-------------|
| tools/tool_registry.py | 8, 34-42 | RAG Import & Lazy Loading |
| | 79-89 | RAG Tool Definition |
| | 144-178 | add_documents_to_rag() API |
| core/agent.py | 138-143 | RAG Tool Initialization |
| | 1490-1501 | add_to_knowledge_base() Public API |
| File | Lines | Description |
|------|-------|-------------|
| core/context_manager.py | 56-101 | split_into_chunks() - Chunking |
| | 119-155 | build_prompt() - Prompt Construction |
| | 103-117 | truncate_text() - Smart Truncation |
| File | Lines | Description |
|------|-------|-------------|
| config.json | 24-30 | RAG Settings |
| core/unified_loader.py | 54-58 | Lazy-Loading Config |
```python
# Via Agent API
agent = SearchAgent(config)
agent.add_to_knowledge_base(
    texts=[
        "RAG combines retrieval with generation.",
        "ChromaDB is a vector database."
    ],
    metadatas=[
        {"source": "https://rag-tutorial.com"},
        {"source": "https://chromadb.docs"}
    ]
)
```
```python
# Via RAGManager
rag = RAGManager()
results = rag.search(
    query="What is a vector database?",
    top_k=5,
    min_relevance=0.5
)

for result in results:
    print(f"[{result['relevance']:.2f}] {result['text'][:100]}...")
```
```python
# Query Expansion
queries = [
    "Vector Database",
    "Embedding storage systems",
    "Semantic search databases"
]
results = rag.multi_query_search(
    queries=queries,
    top_k=10,
    deduplicate=True
)
```
```python
# Semantic + Keyword Combination
results = rag.hybrid_search(
    query="ChromaDB features",
    top_k=5,
    semantic_weight=0.7  # 70% semantic, 30% keyword
)
```
The CrawlLama RAG system uses ChromaDB for semantic document search with the nomic-embed-text embedding model. The workflow is: documents are indexed manually via `add_to_knowledge_base()`, retrieved with semantic (or multi-query/hybrid) search, formatted with source attribution, and injected as context into the LLM prompt.