Spaces:

mafzaal
/

lets_talk

Running

App Files Files Community

mafzaal commited on May 13

Commit

5b07cdb

1 Parent(s): 3379e0a

Remove Blog Data Utilities documentation to streamline project structure

Browse files

Files changed (1) hide show

BLOG_DATA_UTILS.md +0 -94

BLOG_DATA_UTILS.md DELETED Viewed

@@ -1,94 +0,0 @@
-# Blog Data Utilities
-This directory contains utilities for loading, processing, and maintaining blog post data for the RAG system.
-## Available Tools
-### `blog_utils.py`
-This Python module contains utility functions for:
-- Loading blog posts from the data directory
-- Processing and enriching metadata (adding URLs, titles, etc.)
-- Getting statistics about the documents
-- Creating and updating vector embeddings
-- Loading existing vector stores
-### `update_blog_data.py`
-This script allows you to:
-- Update the blog data when new posts are published
-- Process new blog posts
-- Update the vector store
-- Track changes over time
-### Legacy Notebooks (Reference Only)
-The following notebooks are kept for reference but the functionality has been moved to Python modules:
-- `utils_data_loading.ipynb` - Contains the original utility functions
-- `update_blog_data.ipynb` - Demonstrates the update workflow
-## How to Use
-### Updating Blog Data
-When new blog posts are published, follow these steps:
-1. Add the markdown files to the `data/` directory
-2. Run the update script:
-   ```bash
-   cd /home/mafzaal/source/lets-talk
-   uv run python update_blog_data.py
-   ```
-   You can also force recreation of the vector store:
-   ```bash
-   uv run python update_blog_data.py --force-recreate
-   ```
-   Or customize the chunking behavior:
-   ```bash
-   uv run python update_blog_data.py --chunk-size 1500 --chunk-overlap 300
-   ```
-   Or use whole documents without chunking:
-   ```bash
-   uv run python update_blog_data.py --no-chunking
-   ```
-This will:
-- Load all blog posts (including new ones)
-- Update the vector embeddings
-- Save statistics for tracking
-### Customizing the Process
-You can customize the process by editing the `.env` file:
-```
-DATA_DIR=data/                             # Directory containing blog posts
-VECTOR_STORAGE_PATH=./db/vectorstore_v3    # Path to vector store
-EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l  # Embedding model
-QDRANT_COLLECTION=thedataguy_documents     # Collection name
-BLOG_BASE_URL=https://thedataguy.pro/blog/ # Base URL for blog
-CHUNK_SIZE=1000                            # Size of each document chunk
-CHUNK_OVERLAP=200                          # Overlap between chunks
-```
-### In the Chainlit App
-The Chainlit app (`app.py`) has been updated to use these utility functions from the `blog_utils.py` module. It falls back to notebook import and direct initialization if there are any issues.
-## Adding Custom Processing
-To add custom processing for blog posts:
-1. Edit the `update_document_metadata` function in `blog_utils.py`
-2. Add any additional enrichment or processing steps
-3. Update the vector store using the `update_blog_data.py` script
-## Future Improvements
-- Add scheduled update process for automatically including new blog posts
-- Add tracking of embedding models and versions
-- Add webhook support to automatically update when new posts are published