Spaces: mafzaal/lets_talk

mafzaal committed on May 13

Commit 2754790 (1 parent: dba85e7)

Enhance project structure and documentation

- Update Dockerfile to include additional application files.
- Revise README.md for improved clarity and detail on application usage and technology stack.
- Expand chainlit.md with usage instructions and examples.
- Implement main.py as the command-line entry point for running the application and updating the vector database.
- Create .env.example for environment variable configuration.
- Add comprehensive bug report and feature request templates.
- Establish Python CI workflow for automated testing and linting.
- Develop CONTRIBUTING.md to guide new contributors.
- Include LICENSE and SECURITY.md for legal and security guidelines.

Files changed (11)

.env.example +25 -0
.github/ISSUE_TEMPLATE/bug_report.md +37 -0
.github/ISSUE_TEMPLATE/feature_request.md +25 -0
.github/workflows/python-ci.yml +45 -0
CONTRIBUTING.md +130 -0
Dockerfile +3 -1
LICENSE +21 -0
README.md +111 -3
SECURITY.md +35 -0
chainlit.md +31 -3
main.py +54 -3

.env.example ADDED

	@@ -0,0 +1,25 @@

+# TheDataGuy Chat Configuration
+# Copy this file to .env and fill in your values
+# OpenAI API Key - Required for LLM and embeddings
+OPENAI_API_KEY=your_openai_api_key_here
+# Vector Store Configuration
+VECTOR_STORAGE_PATH=./db/vector_store_tdg
+QDRANT_COLLECTION=thedataguy_documents
+# Model Configuration
+EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+LLM_MODEL=gpt-4o-mini
+LLM_TEMPERATURE=0
+# For evaluation and synthetic data generation (optional)
+SDG_LLM_MODEL=gpt-4.1
+EVAL_LLM_MODEL=gpt-4.1
+# Blog Configuration
+DATA_DIR=data/
+BLOG_BASE_URL=https://thedataguy.pro/blog/
+# Search Configuration
+MAX_SEARCH_RESULTS=5
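
The variables above could be read into a typed settings object along the following lines. This is a minimal sketch, not the repository's actual `config.py`: the `Settings` class and `load_settings` function are hypothetical names, and the fallback values simply mirror the defaults shown in `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed view of a subset of the variables listed in .env.example (hypothetical)."""
    openai_api_key: str
    vector_storage_path: str
    qdrant_collection: str
    embedding_model: str
    llm_model: str
    llm_temperature: float
    max_search_results: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the .env.example defaults."""
    api_key = env.get("OPENAI_API_KEY", "")
    # Reject a missing key as well as the untouched placeholder value.
    if not api_key or api_key == "your_openai_api_key_here":
        raise ValueError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=api_key,
        vector_storage_path=env.get("VECTOR_STORAGE_PATH", "./db/vector_store_tdg"),
        qdrant_collection=env.get("QDRANT_COLLECTION", "thedataguy_documents"),
        embedding_model=env.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
        llm_model=env.get("LLM_MODEL", "gpt-4o-mini"),
        # Environment values are always strings, so numeric settings are converted here.
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0")),
        max_search_results=int(env.get("MAX_SEARCH_RESULTS", "5")),
    )
```

Passing the mapping as a parameter keeps the loader easy to test without touching the real process environment.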

.github/ISSUE_TEMPLATE/bug_report.md ADDED

	@@ -0,0 +1,37 @@

+---
+name: Bug Report
+about: Create a report to help us improve
+title: '[BUG] '
+labels: bug
+assignees: ''
+---
+## Bug Description
+A clear and concise description of what the bug is.
+## Steps to Reproduce
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+## Expected Behavior
+A clear and concise description of what you expected to happen.
+## Actual Behavior
+What actually happened instead.
+## Screenshots
+If applicable, add screenshots to help explain your problem.
+## Environment
+- OS: [e.g. Windows, macOS, Linux]
+- Browser: [e.g. Chrome, Safari, Firefox]
+- Version: [e.g. 1.0.0]
+- Python Version: [e.g. 3.13.0]
+## Additional Context
+Add any other context about the problem here, such as:
+- Error messages or logs
+- Relevant configuration details
+- Any recent changes that might have caused the issue

.github/ISSUE_TEMPLATE/feature_request.md ADDED

	@@ -0,0 +1,25 @@

+---
+name: Feature Request
+about: Suggest an idea for this project
+title: '[FEATURE] '
+labels: enhancement
+assignees: ''
+---
+## Feature Description
+A clear and concise description of the feature you'd like to see implemented.
+## Use Case
+Describe the context and use case for this feature. How would it benefit the project and its users?
+## Proposed Solution
+If you have ideas about how to implement this feature, describe them here.
+## Alternatives Considered
+Have you considered any alternative solutions or features? If so, please describe them.
+## Additional Context
+Add any other context, screenshots, or mockups about the feature request here.
+## Impact
+How would this feature impact the current functionality? Would it require any changes to existing features?

.github/workflows/python-ci.yml ADDED

	@@ -0,0 +1,45 @@

+name: Python CI
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.13']
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install uv
+        # The project already has a pyproject.toml, so `uv init` is not needed here
+        uv sync
+    - name: Lint with flake8
+      run: |
+        uv pip install flake8
+        # stop the build if there are Python syntax errors or undefined names
+        uv run flake8 . --exclude=.venv --count --select=E9,F63,F7,F82 --show-source --statistics
+        # exit-zero treats all errors as warnings
+        uv run flake8 . --exclude=.venv --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+    - name: Check if vector store can be built
+      run: |
+        uv run python py-src/pipeline.py --ci --output-dir ./artifacts
+      env:
+        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+        VECTOR_STORAGE_PATH: ./db/vector_store_ci
+        EMBEDDING_MODEL: Snowflake/snowflake-arctic-embed-l

CONTRIBUTING.md ADDED

	@@ -0,0 +1,130 @@

+# Contributing to TheDataGuy Chat
+Thank you for your interest in contributing to TheDataGuy Chat! This document provides guidelines and instructions for contributing to this repository.
+## Project Overview
+TheDataGuy Chat is a Q&A chatbot powered by the content from [TheDataGuy blog](https://thedataguy.pro/blog/). It uses RAG (Retrieval Augmented Generation) to provide informative answers about topics such as RAGAS, RAG evaluation, building research agents, metric-driven development, and data science best practices.
+## Development Environment Setup
+### Prerequisites
+- Python 3.13 or higher
+- [uv](https://github.com/astral-sh/uv) for Python package management
+- Docker (optional, for containerized development)
+- OpenAI API key
+### Local Setup
+1. Clone the repository:
+   ```bash
+   git clone https://github.com/mafzaal/lets-talk.git
+   cd lets-talk
+   ```
+2. Create a `.env` file with the necessary environment variables:
+   ```
+   OPENAI_API_KEY=your_openai_api_key
+   VECTOR_STORAGE_PATH=./db/vector_store_tdg
+   LLM_MODEL=gpt-4o-mini
+   EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+   ```
+3. Install dependencies:
+   ```bash
+   uv sync
+   ```
+4. Build the vector store:
+   ```bash
+   ./scripts/build-vector-store.sh
+   ```
+5. Run the application:
+   ```bash
+   chainlit run py-src/app.py --host 0.0.0.0 --port 7860
+   ```
+### Using Docker
+1. Build the Docker image:
+   ```bash
+   docker build -t lets-talk .
+   ```
+2. Run the container:
+   ```bash
+   docker run -p 7860:7860 --env-file ./.env lets-talk
+   ```
+## Project Structure
+```
+lets-talk/
+├── data/                  # Raw blog post content
+├── py-src/                # Python source code
+│   ├── lets_talk/         # Core application modules
+│   │   ├── agent.py       # Agent implementation
+│   │   ├── config.py      # Configuration settings
+│   │   ├── models.py      # Data models
+│   │   ├── prompts.py     # LLM prompt templates
+│   │   ├── rag.py         # RAG implementation
+│   │   ├── rss_tool.py    # RSS feed integration
+│   │   ├── tools.py       # Tool implementations
+│   │   └── utils/         # Utility functions
+│   ├── app.py             # Main application entry point
+│   ├── pipeline.py        # Data processing pipeline
+│   └── notebooks/         # Jupyter notebooks for analysis
+├── db/                    # Vector database storage
+├── evals/                 # Evaluation datasets and results
+└── scripts/               # Utility scripts
+```
+## Adding New Blog Posts
+When new blog posts are published on TheDataGuy.pro, follow these steps to add them to the chat application:
+1. Add the markdown content to the `data/` directory in a new folder named after the post slug
+2. Run the vector store update script:
+   ```bash
+   python py-src/pipeline.py --force-recreate
+   ```
+## Workflow
+1. **Fork** the repository on GitHub
+2. **Clone** your fork to your local machine
+3. Create a new **branch** for your feature or bug fix
+4. Make your changes
+5. Run the tests to ensure everything works
+6. **Commit** your changes with clear, descriptive commit messages
+7. **Push** your branch to your fork on GitHub
+8. Submit a **Pull Request** to the main repository
+## Code Style
+- Follow PEP 8 style guidelines for Python code
+- Use meaningful variable and function names
+- Add docstrings to all functions and classes
+- Include type hints where appropriate
+## Testing
+- Write tests for new features and bug fixes
+- Ensure all tests pass before submitting a Pull Request
+- Use the Ragas evaluation framework to test RAG performance
+## Documentation
+- Update relevant documentation when making changes
+- Add docstrings to all functions, classes, and modules
+- Keep the README and other documentation up to date
+## License
+By contributing to this project, you agree that your contributions will be licensed under the same license as the project (MIT License).
+## Contact
+If you have any questions or need further clarification, please reach out to the project maintainer through the [contact form](https://thedataguy.pro/contact/).
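
The "Adding New Blog Posts" steps in CONTRIBUTING.md describe one folder per post slug under `data/`. A quick way to check which posts the pipeline would see might look like the sketch below. The `discover_posts` helper is hypothetical, and the slug-to-URL mapping (`BLOG_BASE_URL` plus the folder name) is an assumption rather than something confirmed by the repository.

```python
from pathlib import Path
from typing import Dict


def discover_posts(data_dir: Path, base_url: str) -> Dict[str, str]:
    """Map each post slug (a sub-folder containing Markdown) to an assumed blog URL."""
    posts: Dict[str, str] = {}
    for folder in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Only folders that actually contain a Markdown file count as posts.
        if any(folder.glob("*.md")):
            posts[folder.name] = f"{base_url.rstrip('/')}/{folder.name}/"
    return posts
```

Calling `discover_posts(Path("data"), "https://thedataguy.pro/blog/")` before and after adding a post is a cheap sanity check that the new folder is picked up.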

Dockerfile CHANGED

@@ -26,11 +26,13 @@ RUN uv sync
 # Copy the app to the container
 COPY --chown=user ./py-src/ $HOME/app
+COPY --chown=user ./.chainlit/ $HOME/app/.chainlit
+COPY --chown=user ./chainlit.md $HOME/app
 #TODO: Fix this to download
 #copy posts to container
 COPY --chown=user ./data/ $HOME/app/data
 # Expose the port
 EXPOSE 7860

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 Muhammad Afzaal
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -9,7 +9,7 @@ pinned: false
 # Welcome to TheDataGuy Chat! 👋
-This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topics covered in the blog, such as:
 - RAGAS and RAG evaluation
 - Building research agents
@@ -21,15 +21,80 @@ This is a Q&A chatbot powered by TheDataGuy blog posts. Ask questions about topi
 Under the hood, this application uses:
 1. **Snowflake Arctic Embeddings**: To convert text into vector representations
 2. **Qdrant Vector Database**: To store and search for similar content
 3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
 4. **LangChain**: For building the RAG workflow
 5. **Chainlit**: For the chat interface
-## Sources
-All answers are generated based on content from [TheDataGuy blog](https://thedataguy.pro/blog/). Sources are shown for each response so you can read more about the topic.
 ```bash
 docker build -t lets-talk .
@@ -38,4 +103,47 @@ docker run -p 7860:7860 \
     lets-talk
 ```
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 # Welcome to TheDataGuy Chat! 👋
+This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered in the blog, such as:
 - RAGAS and RAG evaluation
 - Building research agents
 Under the hood, this application uses:
 1. **Snowflake Arctic Embeddings**: To convert text into vector representations
+   - Base model: `Snowflake/snowflake-arctic-embed-l`
+   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)
 2. **Qdrant Vector Database**: To store and search for similar content
+   - Efficiently indexes blog post content for fast semantic search
+   - Supports real-time updates when new blog posts are published
 3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
+   - Primary model: OpenAI `gpt-4o-mini` for production inference
+   - Evaluation model: OpenAI `gpt-4.1` for complex tasks including synthetic data generation and evaluation
 4. **LangChain**: For building the RAG workflow
+   - Orchestrates the retrieval and generation components
+   - Provides flexible components for LLM application development
+   - Structured for easy maintenance and future enhancements
 5. **Chainlit**: For the chat interface
+   - Offers an interactive UI with message threading
+   - Supports file uploads and custom components
+## Technology Stack
+### Core Components
+- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
+- **Embedding Model**: Snowflake Arctic Embeddings
+- **LLM**: OpenAI GPT-4o-mini
+- **Framework**: LangChain + Chainlit
+- **Development Language**: Python 3.13
+### Advanced Features
+- **Evaluation**: Ragas metrics for evaluating RAG performance:
+  - Faithfulness
+  - Context Relevancy
+  - Answer Relevancy
+  - Topic Adherence
+- **Synthetic Data Generation**: For training and testing
+- **Vector Store Updates**: Automated pipeline to update when new blog content is published
+- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content
+## Project Structure
+```
+lets-talk/
+├── data/                  # Raw blog post content
+├── py-src/                # Python source code
+│   ├── lets_talk/         # Core application modules
+│   │   ├── agent.py       # Agent implementation
+│   │   ├── config.py      # Configuration settings
+│   │   ├── models.py      # Data models
+│   │   ├── prompts.py     # LLM prompt templates
+│   │   ├── rag.py         # RAG implementation
+│   │   ├── rss_tool.py    # RSS feed integration
+│   │   └── tools.py       # Tool implementations
+│   ├── app.py             # Main application entry point
+│   └── pipeline.py        # Data processing pipeline
+├── db/                    # Vector database storage
+├── evals/                 # Evaluation datasets and results
+└── notebooks/             # Jupyter notebooks for analysis
+```
+## Environment Setup
+The application requires the following environment variables:
+```
+OPENAI_API_KEY=your_openai_api_key
+VECTOR_STORAGE_PATH=./db/vector_store_tdg
+LLM_MODEL=gpt-4o-mini
+EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
+```
+## Running Locally
+### Using Docker
 ```bash
 docker build -t lets-talk .
     lets-talk
 ```
+### Using Python
+```bash
+# Install dependencies
+uv sync
+# Run the application
+chainlit run py-src/app.py --host 0.0.0.0 --port 7860
+```
+## Deployment
+The application is designed to be deployed on:
+- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
+- **Production**: Azure Container Apps (planned)
+## Evaluation
+This project includes extensive evaluation capabilities using the Ragas framework:
+- **Synthetic Data Generation**: For creating test datasets
+- **Metric Evaluation**: Measuring faithfulness, relevance, and more
+- **Fine-tuning Analysis**: Comparing different embedding models
+## Future Enhancements
+- **Agentic Reasoning**: Adding more sophisticated agent capabilities
+- **Web UI Integration**: Custom Svelte component for the blog
+- **CI/CD**: GitHub Actions workflow for automated deployment
+- **Monitoring**: LangSmith integration for observability
+## License
+This project is available under the MIT License.
+## Acknowledgements
+- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
+- [Ragas](https://docs.ragas.io/) for evaluation framework
+- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
+- [Chainlit](https://docs.chainlit.io/) for the chat interface
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
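
The README's "under the hood" list describes a retrieve-then-generate flow: text is embedded, similar content is looked up in the vector store, and the top matches are handed to the LLM. The retrieval step can be illustrated with a small pure-Python sketch. It deliberately does not use Qdrant, LangChain, or the Arctic embedding model; the toy vectors and the `cosine_similarity` and `top_k` helpers are illustrative assumptions only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k (document id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
    docs = [("post-a", [1.0, 0.0]), ("post-b", [0.0, 1.0]), ("post-c", [1.0, 1.0])]
    for doc_id, score in top_k([1.0, 0.0], docs, k=2):
        print(f"{doc_id}: {score:.3f}")
```

The default `k=5` matches `MAX_SEARCH_RESULTS=5` from `.env.example`; a vector database performs the same ranking with an index instead of a full scan.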

SECURITY.md ADDED

	@@ -0,0 +1,35 @@

+# Security Policy
+## Supported Versions
+The versions below currently receive security updates.
+| Version | Supported          |
+| ------- | ------------------ |
+| 0.1.x   | :white_check_mark: |
+## Reporting a Vulnerability
+We take the security of TheDataGuy Chat seriously. If you believe you've found a security vulnerability, please follow these steps:
+1. **Do not** disclose the vulnerability publicly
+2. **Do not** create a public GitHub issue for the vulnerability
+3. Send your findings through the [contact form](https://thedataguy.pro/contact/)
+Please include the following in your report:
+- A description of the vulnerability
+- Steps to reproduce the issue
+- Potential impact of the vulnerability
+- Any potential solutions you've identified
+## What to Expect
+When you report a vulnerability:
+- You'll receive acknowledgment of your report within 48 hours
+- We'll investigate and provide an estimated timeline for a fix
+- We'll keep you updated as we work on resolving the issue
+- Once fixed, we'll publicly acknowledge your responsible disclosure (unless you prefer to remain anonymous)
+Thank you for helping to keep TheDataGuy Chat and its users safe!

chainlit.md CHANGED

@@ -1,6 +1,34 @@
-# Let's Talk
-`Let's Talk` is chat app based on contents from [TheDataGuy](https://thedataguy.pro)'s blog posts.
-More information at [Let's Talk](https://github.com/mafzaal/lets-talk)

+# Welcome to TheDataGuy Chat! 👋
+## About
+This chat application allows you to ask questions about topics covered in [TheDataGuy](https://thedataguy.pro)'s blog, including:
+- **RAGAS**: Evaluation frameworks for LLM applications
+- **Research Agents**: Building and evaluating AI agents
+- **Metric-Driven Development**: Data-centric approaches to development
+- **RAG Systems**: Retrieval Augmented Generation techniques
+- **Data Science Best Practices**: Strategies for effective data work
+## How To Use
+1. **Ask a question** related to any topic covered in the blog
+2. The system will **search for relevant content** from the blog posts
+3. You'll receive an **informative response** with links to the original articles
+## Examples
+Try asking questions like:
+- "What is RAGAS and how does it help evaluate LLM applications?"
+- "How can I build a research agent with RSS feed support?"
+- "What are the key principles of metric-driven development?"
+- "How do I evaluate RAG systems effectively?"
+## Under The Hood
+This application uses Snowflake Arctic Embeddings, Qdrant Vector Database, LangChain, and GPT-4o-mini to provide accurate and helpful responses based on blog content.
+For more details, check out the [GitHub repository](https://github.com/mafzaal/lets-talk).
+Happy chatting! 💬

main.py CHANGED

@@ -1,9 +1,60 @@
 def main():
-    """Main function to update blog data"""
-    print("=== Blog Data Update ===")
 if __name__ == "__main__":
-    main()

+#!/usr/bin/env python3
+"""
+TheDataGuy Chat - Main Entry Point
+This script serves as the main entry point for TheDataGuy Chat.
+It provides a command-line interface to run the app and update the vector database.
+"""
+import os
+import sys
+import argparse
+from dotenv import load_dotenv
+# Load environment variables from .env file
+load_dotenv()
 def main():
+    """Main function to run the application or update blog data"""
+    parser = argparse.ArgumentParser(description="TheDataGuy Chat - RAG-powered blog assistant")
+    # Define commands
+    subparsers = parser.add_subparsers(dest="command", help="Command to run")
+    # Run app command
+    run_parser = subparsers.add_parser("run", help="Run the chat application")
+    run_parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
+    run_parser.add_argument("--port", type=int, default=7860, help="Port to bind to")
+    # Update vector store command
+    update_parser = subparsers.add_parser("update", help="Update the vector database")
+    update_parser.add_argument("--force", action="store_true", help="Force recreation of the vector store")
+    # Parse arguments
+    args = parser.parse_args()
+    # Handle commands
+    if args.command == "run":
+        # Launch Chainlit as a child process; an argument list avoids shell quoting problems
+        import subprocess
+        result = subprocess.run(
+            ["chainlit", "run", "py-src/app.py", "--host", args.host, "--port", str(args.port)]
+        )
+        return result.returncode
+    elif args.command == "update":
+        # Import here to avoid loading heavy dependencies if not needed.
+        # "py-src" is not a valid package name, so add the directory to sys.path first.
+        sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "py-src"))
+        from pipeline import create_vector_database
+        print(f"Updating vector database (force={args.force})")
+        create_vector_database(force_recreate=args.force)
+    else:
+        # Show help if no command provided
+        parser.print_help()
+        return 1
+    return 0
 if __name__ == "__main__":
+    sys.exit(main())
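
The command-line interface in `main.py` is easier to exercise if argument parsing is separated from execution. The sketch below rebuilds the same `run` and `update` subcommands with only the standard library. `build_parser` and `chainlit_command` are hypothetical helper names, not functions in this commit; the command list assumes the `chainlit run py-src/app.py` invocation shown in the README.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Recreate the two subcommands defined in main.py."""
    parser = argparse.ArgumentParser(description="TheDataGuy Chat - RAG-powered blog assistant")
    subparsers = parser.add_subparsers(dest="command", help="Command to run")

    run_parser = subparsers.add_parser("run", help="Run the chat application")
    run_parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
    run_parser.add_argument("--port", type=int, default=7860, help="Port to bind to")

    update_parser = subparsers.add_parser("update", help="Update the vector database")
    update_parser.add_argument("--force", action="store_true", help="Force recreation of the vector store")
    return parser


def chainlit_command(args: argparse.Namespace) -> List[str]:
    """Build the argument list for launching Chainlit (suitable for subprocess.run)."""
    return ["chainlit", "run", "py-src/app.py", "--host", args.host, "--port", str(args.port)]


if __name__ == "__main__":
    parsed = build_parser().parse_args(["run", "--port", "8000"])
    print(chainlit_command(parsed))
```

With this split, the parsing rules can be checked in a unit test without starting Chainlit or touching the vector store.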