# πŸ€– Chatbot Architecture Overview: Krishna's Personal AI Assistant (Legacy Version)

This document outlines the technical architecture and modular design of Krishna Vamsi Dhulipalla’s personal AI chatbot system, implemented using **LangChain**, **OpenAI**, **NVIDIA NIMs**, and **Gradio**. The assistant is built for intelligent, retriever-augmented, memory-aware interaction tailored to Krishna’s background and user context.

---

## 🧱 Core Components

### 1. **LLMs Used and Their Roles**

| Purpose                             | Model Name                               | Role Description                                                 |
| ----------------------------------- | ---------------------------------------- | ---------------------------------------------------------------- |
| **Rephraser LLM**                   | `microsoft/phi-3-mini-4k-instruct`       | Rewrites vague/short queries into detailed, keyword-rich queries |
| **Relevance Classifier + Reranker** | `mistralai/mixtral-8x22b-instruct-v0.1`  | Classifies query relevance to KB and reranks retrieved chunks    |
| **Answer Generator**                | `nvidia/llama-3.1-nemotron-70b-instruct` | Provides rich, structured answers (replacing GPT-4o for testing) |
| **Fallback Humor Model**            | `mistralai/mixtral-8x22b-instruct-v0.1`  | Responds humorously and redirects when out-of-scope              |
| **KnowledgeBase Updater**           | `mistralai/mistral-7b-instruct-v0.3`     | Extracts and updates structured memory about the user            |

All models are integrated via **LangChain RunnableChains**, supporting both streaming and structured execution.

---

## πŸ” Retrieval Architecture

### βœ… **Hybrid Retrieval System**

The assistant combines:

- **BM25Retriever**: Lexical keyword match
- **FAISS Vector Search**: Dense embeddings from `sentence-transformers/all-MiniLM-L6-v2`

### 🧠 Rephrasing for Retrieval

- The **user's query** is expanded using the Rephraser LLM, with awareness of `last_followups` and memory
- **Rewritten query** is used throughout retrieval, validation, and reranking

### πŸ“Š Scoring & Ranking

- Each subquery is run through both BM25 and FAISS
- Results are merged via a weighted formula:  
  `final_score = Ξ± * vector_score + (1 - Ξ±) * bm25_score`
- Duplicates are removed via content fingerprinting
- The top-k results (default: 15) are passed forward
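The merge-and-deduplicate step above can be sketched in plain Python. This is a minimal illustration, not the assistant's actual code: the `fuse_scores` name is hypothetical, and SHA-1 over normalized text is one plausible choice of fingerprint.

```python
import hashlib

def fuse_scores(vector_hits, bm25_hits, alpha=0.7, top_k=15):
    """Merge FAISS and BM25 results using
    final_score = alpha * vector_score + (1 - alpha) * bm25_score,
    deduplicating chunks by a content fingerprint."""
    merged = {}  # fingerprint -> (chunk_text, accumulated score)
    for hits, weight in ((vector_hits, alpha), (bm25_hits, 1 - alpha)):
        for chunk, score in hits:
            # Fingerprint on normalized text so near-identical chunks collapse
            fp = hashlib.sha1(chunk.strip().lower().encode()).hexdigest()
            text, total = merged.get(fp, (chunk, 0.0))
            merged[fp] = (text, total + weight * score)
    ranked = sorted(merged.values(), key=lambda t: t[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

A chunk returned by both retrievers accumulates both weighted scores, so agreement between the lexical and dense retrievers naturally pushes it up the ranking.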

---

## πŸ”Ž Validation + Chunk Reranking

### πŸ” Relevance Classification

- The relevance classifier (Mixtral) evaluates:
  - Whether the query (or rewritten query) is **in-scope**
  - If so, returns a **reranked list of chunk indices**
- Memory (`last_input`, `last_output`, `last_followups`) and `rewritten_query` are included for better context

### ❌ If Out-of-Scope

- Chunks are discarded
- Response is generated using fallback LLM with humor and redirection

---

## 🧠 Memory + Personalization

### πŸ“˜ KnowledgeBase Model

Tracks structured user data:

- `user_name`, `company`, `last_input`, `last_output`
- `summary_history`, `recent_interests`, `last_followups`, `tone`
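The tracked fields can be pictured as a simple schema. The sketch below uses a stdlib `dataclass` purely for illustration; the actual system defines this as a Pydantic model so it can be parsed with `PydanticOutputParser`.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeBase:
    """Structured user memory tracked across turns (illustrative defaults)."""
    user_name: str = "unknown"
    company: str = ""
    last_input: str = ""            # most recent user message
    last_output: str = ""           # most recent assistant reply
    summary_history: List[str] = field(default_factory=list)
    recent_interests: List[str] = field(default_factory=list)
    last_followups: List[str] = field(default_factory=list)
    tone: str = "friendly"
```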

### πŸ”„ Memory Updates

- After every response, assistant extracts and updates memory
- Handled via `RExtract` pipeline using `PydanticOutputParser` and KB LLM

---

## 🧭 Orchestration Flow

```text
User Input
   ↓
Rephraser LLM (phi-3-mini)
   ↓
Hybrid Retrieval (BM25 + FAISS)
   ↓
Validation + Reranking (mixtral-8x22b)
   ↓
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ In-Scope       β”‚     β”‚ Out-of-Scope Query β”‚
 β”‚ (Top-k Chunks) β”‚     β”‚ (Memory-based only)β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓                          ↓
 Answer LLM (nemotron-70b)  Fallback Humor LLM
```

---

## πŸ’¬ Frontend Interface (Gradio)

- Built using **Gradio ChatInterface + Blocks**
- Features:
  - Responsive design
  - Custom CSS
  - Streaming markdown responses
  - Preloaded examples and auto-scroll

---

## 🧩 Additional Design Highlights

- **Streaming**: Nemotron-70B used via LangChain streaming
- **Prompt Engineering**: Answer prompts use markdown formatting, section headers, bullet points, and personalized sign-offs
- **Memory-Aware Rewriting**: Handles vague replies like `"yes"` or `"A"` by mapping them to `last_followups`
- **Knowledge Chunk Enrichment**: Each FAISS chunk includes a synthetic summary and three QA-style synthetic queries
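The memory-aware rewriting behavior can be sketched as a small pre-processing step. The function name and the exact mapping rules here are assumptions for illustration; the real system performs this resolution inside the rephraser prompt.

```python
def resolve_vague_reply(user_input, last_followups):
    """Map terse replies like 'yes' or 'A' onto the stored follow-up
    suggestions so downstream retrieval gets a fully specified query."""
    text = user_input.strip().lower()
    if not last_followups:
        return user_input
    if text in {"yes", "sure", "ok", "okay"}:
        return last_followups[0]          # plain affirmation -> first suggestion
    letters = "abcdefghijklmnopqrstuvwxyz"
    if len(text) == 1 and text in letters:
        idx = letters.index(text)
        if idx < len(last_followups):
            return last_followups[idx]    # 'a'/'b'/... -> nth suggestion
    return user_input                     # already specific; pass through
```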

---

## πŸš€ Future Enhancements

- Tool calling for tasks like calendar access or Google search
- Multi-model reranking agents
- Memory summarization agents for long dialogs
- Topic planners to group conversations
- Retrieval filtering based on user interest and session

---

This architecture is modular, extensible, and designed to simulate a memory-grounded, expert-aware personal assistant tailored to Krishna’s evolving knowledge and conversational goals.

# πŸ€– Chatbot Architecture Overview: Krishna's Personal AI Assistant (LangGraph Version, Current)

This document details the updated architecture of **Krishna Vamsi Dhulipalla’s** personal AI assistant, now fully implemented with **LangGraph** for orchestrated state management and tool execution. The system is designed for **retrieval-augmented, memory-grounded, and multi-turn conversational intelligence**, integrating **OpenAI GPT-4o**, **Hugging Face embeddings**, and **cross-encoder reranking**.

---

## 🧱 Core Components

### 1. **Models & Their Roles**

| Purpose                    | Model Name                               | Role Description                                 |
| -------------------------- | ---------------------------------------- | ------------------------------------------------ |
| **Main Chat Model**        | `gpt-4o`                                 | Handles conversation, tool calls, and reasoning  |
| **Retriever Embeddings**   | `sentence-transformers/all-MiniLM-L6-v2` | Embedding generation for FAISS vector search     |
| **Cross-Encoder Reranker** | `cross-encoder/ms-marco-MiniLM-L-6-v2`   | Reranks retrieval results for semantic relevance |
| **BM25 Retriever**         | (LangChain BM25Retriever)                | Keyword-based search complementing vector search |

All models are bound to LangGraph **StateGraph** nodes for structured execution.

---

## πŸ” Retrieval System

### βœ… **Hybrid Retrieval**

- **FAISS Vector Search** with normalized embeddings
- **BM25Retriever** for lexical keyword matching
- Combined using **Reciprocal Rank Fusion (RRF)**

### πŸ“Š **Reranking & Diversity**

1. Initial retrieval with FAISS & BM25 (top-K per retriever)
2. Fusion via RRF scoring
3. **Cross-Encoder reranking** (top-N candidates)
4. **Maximal Marginal Relevance (MMR)** selection for diversity
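Step 2, the RRF fusion, is the standard formula `score(d) = Ξ£ 1 / (k + rank_d)` summed over every ranking that contains document `d`. A minimal stand-alone sketch (function name and `k = 60` default are conventional choices, not taken from the actual code):

```python
def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion: each document accumulates 1 / (k + rank)
    from every ranked list (e.g. FAISS and BM25) it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores.items(), key=lambda t: t[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```

Documents ranked highly by both retrievers dominate the fused list, after which the cross-encoder and MMR stages refine relevance and diversity.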

### πŸ”Ž Retriever Tool (`@tool retriever`)

- Returns top passages with minimal duplication
- Invoked via the system prompt to fetch accurate facts about Krishna

---

## 🧠 Memory System

### Long-Term Memory

- **FAISS-based memory vector store** stored at `backend/data/memory_faiss`
- Stores conversation summaries per thread ID

### Memory Search Tool (`@tool memory_search`)

- Retrieves relevant conversation snippets by semantic similarity
- Supports **thread-scoped** search for contextual continuity

### Memory Write Node

- After each AI response, stores `[Q]: ... [A]: ...` summary
- Autosaves after every `MEM_AUTOSAVE_EVERY` turns or on thread end
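The write-and-autosave behavior can be sketched as follows. This is an illustrative stand-in: the class name is hypothetical, `MEM_AUTOSAVE_EVERY = 5` is an assumed default, and `flush` stands in for the FAISS `add_texts` call the real node would make.

```python
MEM_AUTOSAVE_EVERY = 5  # assumed default; the real value lives in config

class MemoryWriter:
    """Sketch of the memory_write node: format a [Q]/[A] summary each
    turn and flush to the store every MEM_AUTOSAVE_EVERY turns or on end."""

    def __init__(self, flush):
        self.flush = flush   # callable that persists a batch of summaries
        self.buffer = []
        self.turns = 0

    def write(self, question, answer, final=False):
        self.buffer.append(f"[Q]: {question} [A]: {answer}")
        self.turns += 1
        if final or self.turns % MEM_AUTOSAVE_EVERY == 0:
            self.flush(list(self.buffer))
            self.buffer.clear()
```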

---

## 🧭 Orchestration Flow (LangGraph)

```mermaid
graph TD
    A[START] --> B[agent node]
    B -->|tool call| C[tools node]
    B -->|no tool| D[memory_write]
    C --> B
    D --> E[END]
```

### **Nodes**:

- **agent**: Calls main LLM with conversation window + system prompt
- **tools**: Executes retriever or memory search tools
- **memory_write**: Persists summaries to long-term memory

### **Conditional Edges**:

- From **agent** β†’ `tools` if tool call detected
- From **agent** β†’ `memory_write` if no tool call
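The conditional routing reduces to one decision on the agent's last message. A minimal sketch (the function name is illustrative; in the real graph this is the callable passed to `add_conditional_edges`):

```python
def route_after_agent(state):
    """Conditional edge: go to 'tools' when the agent's last message
    requested a tool call, otherwise to 'memory_write'."""
    last = state["messages"][-1]
    # Handle both message objects (with a .tool_calls attribute) and dicts
    tool_calls = getattr(last, "tool_calls", None) or (
        last.get("tool_calls") if isinstance(last, dict) else None
    )
    return "tools" if tool_calls else "memory_write"
```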

---

## πŸ’¬ System Prompt

The assistant:

- Uses retriever and memory search tools to gather facts about Krishna
- Avoids fabrication and requests clarification when needed
- Responds humorously when off-topic but steers back to Krishna’s expertise
- Formats with Markdown, headings, and bullet points

Embedded **Krishna’s Bio** provides static grounding context.

---

## 🌐 API & Streaming

- **Backend**: FastAPI (`backend/api.py`)
  - `/chat` SSE endpoint streams tokens in real-time
  - Passes `thread_id` & `is_final` to LangGraph for stateful conversations
- **Frontend**: React + Tailwind (custom chat UI)
  - Threaded conversation storage in browser `localStorage`
  - Real-time token rendering via `EventSource`
  - Features: new chat, clear chat, delete thread, suggestions
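The SSE wire format behind the `/chat` endpoint can be illustrated with a small generator. The payload shape below (`thread_id`, `token`, `is_final` keys) is an assumption based on the fields the document mentions, not the endpoint's exact schema.

```python
import json

def sse_events(tokens, thread_id):
    """Frame streamed tokens as Server-Sent Events: each event is a
    'data: <json>' line followed by a blank line, ending with is_final."""
    for tok in tokens:
        payload = {"thread_id": thread_id, "token": tok, "is_final": False}
        yield f"data: {json.dumps(payload)}\n\n"
    done = {"thread_id": thread_id, "token": "", "is_final": True}
    yield f"data: {json.dumps(done)}\n\n"
```

On the frontend, an `EventSource` (or fetch-based reader) consumes these events and appends each `token` to the rendered Markdown until `is_final` arrives.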

---

## πŸ–₯️ Frontend Highlights

- Dark theme ChatGPT-style UI
- Sidebar for thread management
- Live streaming responses with Markdown rendering
- Suggestion prompts for quick interactions
- Message actions: copy, edit, regenerate

---

## 🧩 Design Improvements Over Previous Version

- **LangGraph StateGraph** ensures explicit control of message flow
- **Thread-scoped memory** enables multi-session personalization
- **Hybrid RRF + Cross-Encoder + MMR** retrieval pipeline improves relevance & diversity
- **SSE streaming** for low-latency feedback
- Decoupled **retrieval** and **memory** as separate tools for modularity

---

## πŸš€ Future Enhancements

- Integrate **tool calling** for external APIs (calendar, search)
- Summarization agents for condensing memory store
- Interest-based retrieval filtering
- Multi-agent orchestration for complex tasks

---

This LangGraph-powered architecture delivers a **stateful, retrieval-augmented, memory-aware personal assistant** optimized for Krishna’s profile and designed for **extensibility, performance, and precision**.