# πŸ€– Chatbot Architecture Overview: Krishna's Personal AI Assistant (old and intial one) This document outlines the technical architecture and modular design of Krishna Vamsi Dhulipalla’s personal AI chatbot system, implemented using **LangChain**, **OpenAI**, **NVIDIA NIMs**, and **Gradio**. The assistant is built for intelligent, retriever-augmented, memory-aware interaction tailored to Krishna’s background and user context. --- ## 🧱 Core Components ### 1. **LLMs Used and Their Roles** | Purpose | Model Name | Role Description | | ----------------------------------- | ---------------------------------------- | ---------------------------------------------------------------- | | **Rephraser LLM** | `microsoft/phi-3-mini-4k-instruct` | Rewrites vague/short queries into detailed, keyword-rich queries | | **Relevance Classifier + Reranker** | `mistralai/mixtral-8x22b-instruct-v0.1` | Classifies query relevance to KB and reranks retrieved chunks | | **Answer Generator** | `nvidia/llama-3.1-nemotron-70b-instruct` | Provides rich, structured answers (replacing GPT-4o for testing) | | **Fallback Humor Model** | `mistralai/mixtral-8x22b-instruct-v0.1` | Responds humorously and redirects when out-of-scope | | **KnowledgeBase Updater** | `mistralai/mistral-7b-instruct-v0.3` | Extracts and updates structured memory about the user | All models are integrated via **LangChain RunnableChains**, supporting both streaming and structured execution. --- ## πŸ” Retrieval Architecture ### βœ… **Hybrid Retrieval System** The assistant combines: - **BM25Retriever**: Lexical keyword match - **FAISS Vector Search**: Dense embeddings from `sentence-transformers/all-MiniLM-L6-v2` ### 🧠 Rephrasing for Retrieval - The **user's query** is expanded using the Rephraser LLM, with awareness of `last_followups` and memory - **Rewritten query** is used throughout retrieval, validation, and reranking ### πŸ“Š Scoring & Ranking - Each subquery is run through both BM25 and FAISS - Results are merged via weighted formula: `final_score = Ξ± * vector_score + (1 - Ξ±) * bm25_score` - Deduplication via fingerprinting - Top-k (default: 15) results are passed forward --- ## πŸ”Ž Validation + Chunk Reranking ### πŸ” Relevance Classification - LLM2 evaluates: - Whether the query (or rewritten query) is **in-scope** - If so, returns a **reranked list of chunk indices** - Memory (`last_input`, `last_output`, `last_followups`) and `rewritten_query` are included for better context ### ❌ If Out-of-Scope - Chunks are discarded - Response is generated using fallback LLM with humor and redirection --- ## 🧠 Memory + Personalization ### πŸ“˜ KnowledgeBase Model Tracks structured user data: - `user_name`, `company`, `last_input`, `last_output` - `summary_history`, `recent_interests`, `last_followups`, `tone` ### πŸ”„ Memory Updates - After every response, assistant extracts and updates memory - Handled via `RExtract` pipeline using `PydanticOutputParser` and KB LLM --- ## 🧭 Orchestration Flow ```text User Input ↓ Rephraser LLM (phi-3-mini) ↓ Hybrid Retrieval (BM25 + FAISS) ↓ Validation + Reranking (mixtral-8x22b) ↓ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ In-Scope β”‚ β”‚ Out-of-Scope Query β”‚ β”‚ (Top-k Chunks)β”‚ β”‚ (Memory-based only)β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ ↓ ↓ Answer LLM (nemotron-70b) Fallback Humor LLM ``` --- ## πŸ’¬ Frontend Interface (Gradio) - Built using **Gradio ChatInterface + Blocks** - Features: - Responsive design - Custom CSS - Streaming markdown responses - Preloaded examples and auto-scroll --- ## 🧩 Additional Design Highlights - **Streaming**: Nemotron-70B used via LangChain streaming - **Prompt Engineering**: Answer prompts use markdown formatting, section headers, bullet points, and personalized sign-offs - **Memory-Aware Rewriting**: Handles vague replies like `"yes"` or `"A"` by mapping them to `last_followups` - **Knowledge Chunk Enrichment**: Each FAISS chunk includes synthetic summary and 3 QA-style synthetic queries --- ## πŸš€ Future Enhancements - Tool calling for tasks like calendar access or Google search - Multi-model reranking agents - Memory summarization agents for long dialogs - Topic planners to group conversations - Retrieval filtering based on user interest and session --- This architecture is modular, extensible, and designed to simulate a memory-grounded, expert-aware personal assistant tailored to Krishna’s evolving knowledge and conversational goals. # πŸ€– Chatbot Architecture Overview: Krishna's Personal AI Assistant (LangGraph Version) (New and current one) This document details the updated architecture of **Krishna Vamsi Dhulipalla’s** personal AI assistant, now fully implemented with **LangGraph** for orchestrated state management and tool execution. The system is designed for **retrieval-augmented, memory-grounded, and multi-turn conversational intelligence**, integrating **OpenAI GPT-4o**, **Hugging Face embeddings**, and **cross-encoder reranking**. --- ## 🧱 Core Components ### 1. **Models & Their Roles** | Purpose | Model Name | Role Description | | -------------------------- | ---------------------------------------- | ------------------------------------------------ | | **Main Chat Model** | `gpt-4o` | Handles conversation, tool calls, and reasoning | | **Retriever Embeddings** | `sentence-transformers/all-MiniLM-L6-v2` | Embedding generation for FAISS vector search | | **Cross-Encoder Reranker** | `cross-encoder/ms-marco-MiniLM-L-6-v2` | Reranks retrieval results for semantic relevance | | **BM25 Retriever** | (LangChain BM25Retriever) | Keyword-based search complementing vector search | All models are bound to LangGraph **StateGraph** nodes for structured execution. --- ## πŸ” Retrieval System ### βœ… **Hybrid Retrieval** - **FAISS Vector Search** with normalized embeddings - **BM25Retriever** for lexical keyword matching - Combined using **Reciprocal Rank Fusion (RRF)** ### πŸ“Š **Reranking & Diversity** 1. Initial retrieval with FAISS & BM25 (top-K per retriever) 2. Fusion via RRF scoring 3. **Cross-Encoder reranking** (top-N candidates) 4. **Maximal Marginal Relevance (MMR)** selection for diversity ### πŸ”Ž Retriever Tool (`@tool retriever`) - Returns top passages with minimal duplication - Used in-system prompt to fetch accurate facts about Krishna --- ## 🧠 Memory System ### Long-Term Memory - **FAISS-based memory vector store** stored at `backend/data/memory_faiss` - Stores conversation summaries per thread ID ### Memory Search Tool (`@tool memory_search`) - Retrieves relevant conversation snippets by semantic similarity - Supports **thread-scoped** search for contextual continuity ### Memory Write Node - After each AI response, stores `[Q]: ... [A]: ...` summary - Autosaves after every `MEM_AUTOSAVE_EVERY` turns or on thread end --- ## 🧭 Orchestration Flow (LangGraph) ```mermaid graph TD A[START] --> B[agent node] B -->|tool call| C[tools node] B -->|no tool| D[memory_write] C --> B D --> E[END] ``` ### **Nodes**: - **agent**: Calls main LLM with conversation window + system prompt - **tools**: Executes retriever or memory search tools - **memory_write**: Persists summaries to long-term memory ### **Conditional Edges**: - From **agent** β†’ `tools` if tool call detected - From **agent** β†’ `memory_write` if no tool call --- ## πŸ’¬ System Prompt The assistant: - Uses retriever and memory search tools to gather facts about Krishna - Avoids fabrication and requests clarification when needed - Responds humorously when off-topic but steers back to Krishna’s expertise - Formats with Markdown, headings, and bullet points Embedded **Krishna’s Bio** provides static grounding context. --- ## 🌐 API & Streaming - **Backend**: FastAPI (`backend/api.py`) - `/chat` SSE endpoint streams tokens in real-time - Passes `thread_id` & `is_final` to LangGraph for stateful conversations - **Frontend**: React + Tailwind (custom chat UI) - Threaded conversation storage in browser `localStorage` - Real-time token rendering via `EventSource` - Features: new chat, clear chat, delete thread, suggestions --- ## πŸ–₯️ Frontend Highlights - Dark theme ChatGPT-style UI - Sidebar for thread management - Live streaming responses with Markdown rendering - Suggestion prompts for quick interactions - Message actions: copy, edit, regenerate --- ## 🧩 Design Improvements Over Previous Version - **LangGraph StateGraph** ensures explicit control of message flow - **Thread-scoped memory** enables multi-session personalization - **Hybrid RRF + Cross-Encoder + MMR** retrieval pipeline improves relevance & diversity - **SSE streaming** for low-latency feedback - Decoupled **retrieval** and **memory** as separate tools for modularity --- ## πŸš€ Future Enhancements - Integrate **tool calling** for external APIs (calendar, search) - Summarization agents for condensing memory store - Interest-based retrieval filtering - Multi-agent orchestration for complex tasks --- This LangGraph-powered architecture delivers a **stateful, retrieval-augmented, memory-aware personal assistant** optimized for Krishna’s profile and designed for **extensibility, performance, and precision**.