# Blog Data Utilities

This directory contains utilities for loading, processing, and maintaining blog post data for the RAG system.

## Available Tools

### `blog_utils.py`

This Python module contains utility functions for:

- Loading blog posts from the data directory
- Processing and enriching metadata (adding URLs, titles, etc.)
- Getting statistics about the documents
- Creating and updating vector embeddings
- Loading existing vector stores

### `update_blog_data.py`

This script allows you to:

- Update the blog data when new posts are published
- Process new blog posts
- Update the vector store
- Track changes over time

### Legacy Notebooks (Reference Only)

The following notebooks are kept for reference, but the functionality has been moved to Python modules:

- `utils_data_loading.ipynb` - Contains the original utility functions
- `update_blog_data.ipynb` - Demonstrates the update workflow

## How to Use

### Updating Blog Data

When new blog posts are published, follow these steps:

1. Add the markdown files to the `data/` directory
2. Run the update script:

```bash
cd /home/mafzaal/source/lets-talk
uv run python update_blog_data.py
```

You can also force recreation of the vector store:

```bash
uv run python update_blog_data.py --force-recreate
```

This will:

- Load all blog posts (including new ones)
- Update the vector embeddings
- Save statistics for tracking

### Customizing the Process

You can customize the process by editing the `.env` file:

```
DATA_DIR=data/                                      # Directory containing blog posts
VECTOR_STORAGE_PATH=./db/vectorstore_v3             # Path to vector store
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l  # Embedding model
QDRANT_COLLECTION=thedataguy_documents              # Collection name
BLOG_BASE_URL=https://thedataguy.pro/blog/          # Base URL for blog
```

### In the Chainlit App

The Chainlit app (`app.py`) has been updated to use these utility functions from the `blog_utils.py` module. It falls back to notebook import and direct initialization if there are any issues.
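
The settings listed under "Customizing the Process" could be read in Python along these lines. This is a minimal sketch, assuming the values are available as environment variables (for example, after the `.env` file has been loaded into the process environment). The names `DEFAULTS` and `load_settings` are hypothetical and are not taken from `blog_utils.py`; only the variable names and example values come from this README.

```python
import os

# Defaults mirror the example .env values shown in this README.
DEFAULTS = {
    "DATA_DIR": "data/",
    "VECTOR_STORAGE_PATH": "./db/vectorstore_v3",
    "EMBEDDING_MODEL": "Snowflake/snowflake-arctic-embed-l",
    "QDRANT_COLLECTION": "thedataguy_documents",
    "BLOG_BASE_URL": "https://thedataguy.pro/blog/",
}


def load_settings(env=None):
    """Return a dict of settings, preferring values in `env` over the defaults.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}
```

Calling `load_settings({"DATA_DIR": "posts/"})` would override only the data directory and leave the other four values at their defaults, which keeps a partially filled `.env` file usable.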

## Adding Custom Processing

To add custom processing for blog posts:

1. Edit the `update_document_metadata` function in `blog_utils.py`
2. Add any additional enrichment or processing steps
3. Update the vector store using the `update_blog_data.py` script

## Future Improvements

- Add scheduled update process for automatically including new blog posts
- Add tracking of embedding models and versions
- Add webhook support to automatically update when new posts are published
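
As an illustration of the kind of enrichment step described under "Adding Custom Processing", the sketch below derives a URL and a title from a post's file path. It is not the actual `update_document_metadata` implementation. The function name `enrich_metadata`, the `source`, `url`, and `post_title` metadata keys, and the `<slug>/index.md` or `<slug>.md` file layout are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def enrich_metadata(metadata, base_url="https://thedataguy.pro/blog/"):
    """Return a copy of `metadata` with hypothetical 'url' and 'post_title' keys.

    Assumes metadata["source"] holds the post's file path and that a post
    lives at either <slug>/index.md or <slug>.md (both are assumptions).
    """
    enriched = dict(metadata)  # leave the caller's dict untouched
    path = PurePosixPath(metadata.get("source", ""))
    # Use the parent folder as the slug for index files, else the file stem.
    slug = path.parent.name if path.stem == "index" else path.stem
    if slug:
        enriched["url"] = base_url.rstrip("/") + "/" + slug + "/"
        # Keep an existing title if one is already present.
        enriched.setdefault("post_title", slug.replace("-", " ").title())
    return enriched
```

A step like this could be called for each document before the vector store is rebuilt, so the added fields are stored alongside the embeddings.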