Enhance project structure and documentation
- Update Dockerfile to include additional application files.
- Revise README.md for improved clarity and detail on application usage and technology stack.
- Expand chainlit.md with usage instructions and examples.
- Implement main.py as the command-line entry point for running the application and updating the vector database.
- Create .env.example for environment variable configuration.
- Add comprehensive bug report and feature request templates.
- Establish Python CI workflow for automated testing and linting.
- Develop CONTRIBUTING.md to guide new contributors.
- Include LICENSE and SECURITY.md for legal and security guidelines.
- .env.example +25 -0
- .github/ISSUE_TEMPLATE/bug_report.md +37 -0
- .github/ISSUE_TEMPLATE/feature_request.md +25 -0
- .github/workflows/python-ci.yml +45 -0
- CONTRIBUTING.md +130 -0
- Dockerfile +3 -1
- LICENSE +21 -0
- README.md +111 -3
- SECURITY.md +35 -0
- chainlit.md +31 -3
- main.py +54 -3
.env.example
ADDED
@@ -0,0 +1,25 @@
+# TheDataGuy Chat Configuration
+# Copy this file to .env and fill in your values
+
+# OpenAI API Key - Required for LLM and embeddings
+OPENAI_API_KEY=your_openai_api_key_here
+
+# Vector Store Configuration
+VECTOR_STORAGE_PATH=./db/vector_store_tdg
+QDRANT_COLLECTION=thedataguy_documents
+
+# Model Configuration
+EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+LLM_MODEL=gpt-4o-mini
+LLM_TEMPERATURE=0
+
+# For evaluation and synthetic data generation (optional)
+SDG_LLM_MODEL=gpt-4.1
+EVAL_LLM_MODEL=gpt-4.1
+
+# Blog Configuration
+DATA_DIR=data/
+BLOG_BASE_URL=https://thedataguy.pro/blog/
+
+# Search Configuration
+MAX_SEARCH_RESULTS=5
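The variables in `.env.example` could be read into a typed settings object along the following lines. This is a minimal sketch, not the project's `config.py`: the `Settings` class and the `load_settings` function are hypothetical names, and the fallback values are simply the ones listed in the file above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env.example."""

    openai_api_key: str
    vector_storage_path: str
    qdrant_collection: str
    embedding_model: str
    llm_model: str
    llm_temperature: float
    max_search_results: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to os.environ).

    Numeric values arrive as strings, so they are converted explicitly.
    """
    env = os.environ if env is None else env
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY", ""),
        vector_storage_path=env.get("VECTOR_STORAGE_PATH", "./db/vector_store_tdg"),
        qdrant_collection=env.get("QDRANT_COLLECTION", "thedataguy_documents"),
        embedding_model=env.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0")),
        max_search_results=int(env.get("MAX_SEARCH_RESULTS", "5")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in tests; calling `load_dotenv()` first, as `main.py` does, would populate the environment from `.env`.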
.github/ISSUE_TEMPLATE/bug_report.md
ADDED
@@ -0,0 +1,37 @@
+---
+name: Bug Report
+about: Create a report to help us improve
+title: '[BUG] '
+labels: bug
+assignees: ''
+---
+
+## Bug Description
+A clear and concise description of what the bug is.
+
+## Steps to Reproduce
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+## Expected Behavior
+A clear and concise description of what you expected to happen.
+
+## Actual Behavior
+What actually happened instead.
+
+## Screenshots
+If applicable, add screenshots to help explain your problem.
+
+## Environment
+- OS: [e.g. Windows, macOS, Linux]
+- Browser: [e.g. Chrome, Safari, Firefox]
+- Version: [e.g. 1.0.0]
+- Python Version: [e.g. 3.13.0]
+
+## Additional Context
+Add any other context about the problem here, such as:
+- Error messages or logs
+- Relevant configuration details
+- Any recent changes that might have caused the issue
.github/ISSUE_TEMPLATE/feature_request.md
ADDED
@@ -0,0 +1,25 @@
+---
+name: Feature Request
+about: Suggest an idea for this project
+title: '[FEATURE] '
+labels: enhancement
+assignees: ''
+---
+
+## Feature Description
+A clear and concise description of the feature you'd like to see implemented.
+
+## Use Case
+Describe the context and use case for this feature. How would it benefit the project and its users?
+
+## Proposed Solution
+If you have ideas about how to implement this feature, describe them here.
+
+## Alternatives Considered
+Have you considered any alternative solutions or features? If so, please describe them.
+
+## Additional Context
+Add any other context, screenshots, or mockups about the feature request here.
+
+## Impact
+How would this feature impact the current functionality? Would it require any changes to existing features?
.github/workflows/python-ci.yml
ADDED
@@ -0,0 +1,45 @@
+name: Python CI
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.13']
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install uv
+        uv init
+        uv sync
+
+    - name: Lint with flake8
+      run: |
+        uv pip install flake8
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+        # exit-zero treats all errors as warnings
+        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+
+    - name: Check if vector store can be built
+      run: |
+        python py-src/pipeline.py --ci --output-dir ./artifacts
+      env:
+        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+        VECTOR_STORAGE_PATH: ./db/vector_store_ci
+        EMBEDDING_MODEL: Snowflake/snowflake-arctic-embed-l
CONTRIBUTING.md
ADDED
@@ -0,0 +1,130 @@
+# Contributing to TheDataGuy Chat
+
+Thank you for your interest in contributing to the TheDataGuy Chat project! This document provides guidelines and instructions for contributing to this repository.
+
+## Project Overview
+
+TheDataGuy Chat is a Q&A chatbot powered by the content from [TheDataGuy blog](https://thedataguy.pro/blog/). It uses RAG (Retrieval Augmented Generation) to provide informative answers about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.
+
+## Development Environment Setup
+
+### Prerequisites
+
+- Python 3.13 or higher
+- [uv](https://github.com/astral-sh/uv) for Python package management
+- Docker (optional, for containerized development)
+- OpenAI API key
+
+### Local Setup
+
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/mafzaal/lets-talk.git
+   cd lets-talk
+   ```
+
+2. Create a `.env` file with the necessary environment variables:
+   ```
+   OPENAI_API_KEY=your_openai_api_key
+   VECTOR_STORAGE_PATH=./db/vector_store_tdg
+   LLM_MODEL=gpt-4o-mini
+   EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+   ```
+
+3. Install dependencies:
+   ```bash
+   uv init && uv sync
+   ```
+
+4. Build the vector store:
+   ```bash
+   ./scripts/build-vector-store.sh
+   ```
+
+5. Run the application:
+   ```bash
+   chainlit run py-src/app.py --host 0.0.0.0 --port 7860
+   ```
+
+### Using Docker
+
+1. Build the Docker image:
+   ```bash
+   docker build -t lets-talk .
+   ```
+
+2. Run the container:
+   ```bash
+   docker run -p 7860:7860 --env-file ./.env lets-talk
+   ```
+
+## Project Structure
+
+```
+lets-talk/
+├── data/                  # Raw blog post content
+├── py-src/                # Python source code
+│   ├── lets_talk/         # Core application modules
+│   │   ├── agent.py       # Agent implementation
+│   │   ├── config.py      # Configuration settings
+│   │   ├── models.py      # Data models
+│   │   ├── prompts.py     # LLM prompt templates
+│   │   ├── rag.py         # RAG implementation
+│   │   ├── rss_tool.py    # RSS feed integration
+│   │   ├── tools.py       # Tool implementations
+│   │   └── utils/         # Utility functions
+│   ├── app.py             # Main application entry point
+│   ├── pipeline.py        # Data processing pipeline
+│   └── notebooks/         # Jupyter notebooks for analysis
+├── db/                    # Vector database storage
+├── evals/                 # Evaluation datasets and results
+└── scripts/               # Utility scripts
+```
+
+## Adding New Blog Posts
+
+When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:
+
+1. Add the markdown content to the `data/` directory in a new folder named after the post slug
+2. Run the vector store update script:
+   ```bash
+   python py-src/pipeline.py --force-recreate
+   ```
+
+## Workflow
+
+1. **Fork** the repository on GitHub
+2. **Clone** your fork to your local machine
+3. Create a new **branch** for your feature or bug fix
+4. Make your changes
+5. Run the tests to ensure everything works
+6. **Commit** your changes with clear, descriptive commit messages
+7. **Push** your branch to your fork on GitHub
+8. Submit a **Pull Request** to the main repository
+
+## Code Style
+
+- Follow PEP 8 style guidelines for Python code
+- Use meaningful variable and function names
+- Add docstrings to all functions and classes
+- Include type hints where appropriate
+
+## Testing
+
+- Write tests for new features and bug fixes
+- Ensure all tests pass before submitting a Pull Request
+- Use the Ragas evaluation framework to test RAG performance
+
+## Documentation
+
+- Update relevant documentation when making changes
+- Add docstrings to all functions, classes, and modules
+- Keep the README and other documentation up to date
+
+## License
+
+By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).
+
+## Contact
+
+If you have any questions or need further clarification, please reach out to the project maintainer at [contact form](https://thedataguy.pro/contact/).
Dockerfile
CHANGED
@@ -26,11 +26,13 @@ RUN uv sync
 
 # Copy the app to the container
 COPY --chown=user ./py-src/ $HOME/app
-
+COPY --chown=user ./.chainlit/ $HOME/app
+COPY --chown=user ./chainlit.md $HOME/app
 
 #TODO: Fix this to download
 #copy posts to container
 COPY --chown=user ./data/ $HOME/app/data
+
 # Expose the port
 EXPOSE 7860
 
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Muhammad Afzaal
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -9,7 +9,7 @@ pinned: false
 
 # Welcome to TheDataGuy Chat! 👋
 
-This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topics covered in the blog, such as:
+This is a Q&A chatbot powered by [TheDataGuy blog](https://thedataguy.pro/blog/) blog posts. Ask questions about topics covered in the blog, such as:
 
 - RAGAS and RAG evaluation
 - Building research agents
@@ -21,15 +21,80 @@ This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topi
 Under the hood, this application uses:
 
 1. **Snowflake Arctic Embeddings**: To convert text into vector representations
+   - Base model: `Snowflake/snowflake-arctic-embed-l`
+   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
+
 2. **Qdrant Vector Database**: To store and search for similar content
+   - Efficiently indexes blog post content for fast semantic search
+   - Supports real-time updates when new blog posts are published
+
 3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
+   - Primary model: OpenAI `gpt-4o-mini` for production inference
+   - Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation
+
 4. **LangChain**: For building the RAG workflow
+   - Orchestrates the retrieval and generation components
+   - Provides flexible components for LLM application development
+   - Structured for easy maintenance and future enhancements
+
 5. **Chainlit**: For the chat interface
+   - Offers an interactive UI with message threading
+   - Supports file uploads and custom components
+
+## Technology Stack
+
+### Core Components
+- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
+- **Embedding Model**: Snowflake Arctic Embeddings
+- **LLM**: OpenAI GPT-4o-mini
+- **Framework**: LangChain + Chainlit
+- **Development Language**: Python 3.13
+
+### Advanced Features
+- **Evaluation**: Ragas metrics for evaluating RAG performance:
+  - Faithfulness
+  - Context Relevancy
+  - Answer Relevancy
+  - Topic Adherence
+- **Synthetic Data Generation**: For training and testing
+- **Vector Store Updates**: Automated pipeline to update when new blog content is published
+- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
+
+## Project Structure
 
-
+```
+lets-talk/
+├── data/                  # Raw blog post content
+├── py-src/                # Python source code
+│   ├── lets_talk/         # Core application modules
+│   │   ├── agent.py       # Agent implementation
+│   │   ├── config.py      # Configuration settings
+│   │   ├── models.py      # Data models
+│   │   ├── prompts.py     # LLM prompt templates
+│   │   ├── rag.py         # RAG implementation
+│   │   ├── rss_tool.py    # RSS feed integration
+│   │   └── tools.py       # Tool implementations
+│   ├── app.py             # Main application entry point
+│   └── pipeline.py        # Data processing pipeline
+├── db/                    # Vector database storage
+├── evals/                 # Evaluation datasets and results
+└── notebooks/             # Jupyter notebooks for analysis
+```
+
+## Environment Setup
+
+The application requires the following environment variables:
 
-
+```
+OPENAI_API_KEY=your_openai_api_key
+VECTOR_STORAGE_PATH=./db/vector_store_tdg
+LLM_MODEL=gpt-4o-mini
+EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+```
 
+## Running Locally
+
+### Using Docker
 
 ```bash
 docker build -t lets-talk .
@@ -38,4 +103,47 @@ docker run -p 7860:7860 \
 lets-talk
 ```
 
+### Using Python
+
+```bash
+# Install dependencies
+uv init && uv sync
+
+# Run the application
+chainlit run py-src/app.py --host 0.0.0.0 --port 7860
+```
+
+## Deployment
+
+The application is designed to be deployed on:
+
+- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
+- **Production**: Azure Container Apps (planned)
+
+## Evaluation
+
+This project includes extensive evaluation capabilities using the Ragas framework:
+
+- **Synthetic Data Generation**: For creating test datasets
+- **Metric Evaluation**: Measuring faithfulness, relevance, and more
+- **Fine-tuning Analysis**: Comparing different embedding models
+
+## Future Enhancements
+
+- **Agentic Reasoning**: Adding more sophisticated agent capabilities
+- **Web UI Integration**: Custom Svelte component for the blog
+- **CI/CD**: GitHub Actions workflow for automated deployment
+- **Monitoring**: LangSmith integration for observability
+
+## License
+
+This project is available under the MIT License.
+
+## Acknowledgements
+
+- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
+- [Ragas](https://docs.ragas.io/) for evaluation framework
+- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
+- [Chainlit](https://docs.chainlit.io/) for the chat interface
+
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
SECURITY.md
ADDED
@@ -0,0 +1,35 @@
+# Security Policy
+
+## Supported Versions
+
+Use this section to tell people about which versions of your project are currently being supported with security updates.
+
+| Version | Supported          |
+| ------- | ------------------ |
+| 0.1.x   | :white_check_mark: |
+
+## Reporting a Vulnerability
+
+We take the security of TheDataGuy Chat seriously. If you believe you've found a security vulnerability, please follow these steps:
+
+1. **Do not** disclose the vulnerability publicly
+2. **Do not** create a public GitHub issue for the vulnerability
+3. Email your findings to [contact form](https://thedataguy.pro/contact/)
+
+Please include the following in your report:
+
+- A description of the vulnerability
+- Steps to reproduce the issue
+- Potential impact of the vulnerability
+- Any potential solutions you've identified
+
+## What to Expect
+
+When you report a vulnerability:
+
+- You'll receive acknowledgment of your report within 48 hours
+- We'll investigate and provide an estimated timeline for a fix
+- We'll keep you updated as we work on resolving the issue
+- Once fixed, we'll publicly acknowledge your responsible disclosure (unless you prefer to remain anonymous)
+
+Thank you for helping to keep TheDataGuy Chat and its users safe!
chainlit.md
CHANGED
@@ -1,6 +1,34 @@
-#
+# Welcome to TheDataGuy Chat! 👋
 
-
+## About
 
-
+This chat application allows you to ask questions about topics covered in [TheDataGuy](https://thedataguy.pro)'s blog, including:
+
+- **RAGAS**: Evaluation frameworks for LLM applications
+- **Research Agents**: Building and evaluating AI agents
+- **Metric-Driven Development**: Data-centric approaches to development
+- **RAG Systems**: Retrieval Augmented Generation techniques
+- **Data Science Best Practices**: Strategies for effective data work
+
+## How To Use
+
+1. **Ask a question** related to any topic covered in the blog
+2. The system will **search for relevant content** from the blog posts
+3. You'll receive an **informative response** with links to the original articles
+
+## Examples
+
+Try asking questions like:
+- "What is RAGAS and how does it help evaluate LLM applications?"
+- "How can I build a research agent with RSS feed support?"
+- "What are the key principles of metric-driven development?"
+- "How do I evaluate RAG systems effectively?"
+
+## Under The Hood
+
+This application uses Snowflake Arctic Embeddings, Qdrant Vector Database, LangChain, and GPT-4o-mini to provide accurate and helpful responses based on blog content.
+
+For more details, check out the [GitHub repository](https://github.com/mafzaal/lets-talk).
+
+Happy chatting! 💬
 
main.py
CHANGED
@@ -1,9 +1,60 @@
 
 
 
+#!/usr/bin/env python3
+"""
+TheDataGuy Chat - Main Entry Point
+
+This script serves as the main entry point for the TheDataGuy Chat application.
+It provides a command-line interface to run the app and update the vector database.
+"""
+
+import os
+import sys
+import argparse
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv()
+
 def main():
-    """Main function to update blog data"""
-
+    """Main function to run the application or update blog data"""
+    parser = argparse.ArgumentParser(description="TheDataGuy Chat - RAG-powered blog assistant")
+
+    # Define commands
+    subparsers = parser.add_subparsers(dest="command", help="Command to run")
+
+    # Run app command
+    run_parser = subparsers.add_parser("run", help="Run the chat application")
+    run_parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
+    run_parser.add_argument("--port", type=int, default=7860, help="Port to bind to")
+
+    # Update vector store command
+    update_parser = subparsers.add_parser("update", help="Update the vector database")
+    update_parser.add_argument("--force", action="store_true", help="Force recreation of the vector store")
+
+    # Parse arguments
+    args = parser.parse_args()
+
+    # Handle commands
+    if args.command == "run":
+        # Import here to avoid circular imports
+        import chainlit as cl
+        os.system(f"chainlit run py-src/app.py --host {args.host} --port {args.port}")
+
+    elif args.command == "update":
+        # Import here to avoid loading heavy dependencies if not needed
+        from py_src.pipeline import create_vector_database
+        force_flag = "--force-recreate" if args.force else ""
+        print(f"Updating vector database (force={args.force})")
+        create_vector_database(force_recreate=args.force)
+
+    else:
+        # Show help if no command provided
+        parser.print_help()
+        return 1
+
+    return 0
 
 if __name__ == "__main__":
-    main()
+    sys.exit(main())
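Two details in the `run` and `update` branches of `main.py` may deserve a second look. `os.system` passes an interpolated string through the shell, and a directory named `py-src` would not normally be importable as `py_src`, since the hyphen is not valid in a module name. The following is a minimal sketch of one alternative, assuming the same `chainlit run` invocation as the diff; `build_chainlit_command`, `run_app`, and `load_module_from_path` are hypothetical helpers and are not part of the commit.

```python
import importlib.util
import subprocess
from pathlib import Path
from types import ModuleType


def build_chainlit_command(host: str, port: int, app_path: str = "py-src/app.py") -> list[str]:
    """Build the argument list for `chainlit run` without going through a shell."""
    return ["chainlit", "run", app_path, "--host", host, "--port", str(port)]


def run_app(host: str, port: int) -> int:
    """Launch the chat app as a child process and return its exit code."""
    return subprocess.run(build_chainlit_command(host, port), check=False).returncode


def load_module_from_path(name: str, path: str) -> ModuleType:
    """Import a module from a file path, e.g. a file under a hyphenated directory."""
    spec = importlib.util.spec_from_file_location(name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name!r} from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

With helpers like these, the `run` branch could return `run_app(args.host, args.port)` so that the child's exit status propagates through `sys.exit`, and the `update` branch could load `py-src/pipeline.py` by path. That `create_vector_database` accepts a `force_recreate` keyword is taken from the diff and has not been verified here.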