mafzaal committed on
Commit 5b07cdb · 1 Parent(s): 3379e0a

Remove Blog Data Utilities documentation to streamline project structure

Files changed (1)
  1. BLOG_DATA_UTILS.md +0 -94
BLOG_DATA_UTILS.md DELETED
@@ -1,94 +0,0 @@
# Blog Data Utilities

This directory contains utilities for loading, processing, and maintaining blog post data for the RAG system.

## Available Tools
### `blog_utils.py`

This Python module contains utility functions for:
- Loading blog posts from the data directory
- Processing and enriching metadata (adding URLs, titles, etc.)
- Getting statistics about the documents
- Creating and updating vector embeddings
- Loading existing vector stores
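A typical workflow chains these helpers together. The sketch below is illustrative only: `update_document_metadata` is named elsewhere in this document, but the other function names and all signatures are assumptions, not the module's confirmed API.

```python
# Hypothetical usage of blog_utils.py. Only update_document_metadata is
# named in these docs; the other helpers and all signatures are assumed.
from blog_utils import (
    load_blog_posts,            # load posts from the data directory (assumed)
    update_document_metadata,   # enrich metadata with URLs, titles, etc.
    get_document_stats,         # summarize the loaded documents (assumed)
    create_vector_store,        # create/update vector embeddings (assumed)
)

docs = load_blog_posts("data/")
docs = [update_document_metadata(doc) for doc in docs]
print(get_document_stats(docs))          # e.g. document and chunk counts
vector_store = create_vector_store(docs)
```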
### `update_blog_data.py`

This script allows you to:
- Update the blog data when new posts are published
- Process new blog posts
- Update the vector store
- Track changes over time
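The flags documented under "How to Use" below imply a command-line interface along these lines. This is a sketch of the assumed argument parsing, not the script's actual code; the defaults mirror the `.env` example further down.

```python
# Sketch of the CLI implied by the documented flags; the real
# update_blog_data.py may parse its arguments differently.
import argparse

parser = argparse.ArgumentParser(description="Update blog data and vector store")
parser.add_argument("--force-recreate", action="store_true",
                    help="Rebuild the vector store from scratch")
parser.add_argument("--chunk-size", type=int, default=1000,
                    help="Size of each document chunk")
parser.add_argument("--chunk-overlap", type=int, default=200,
                    help="Overlap between consecutive chunks")
parser.add_argument("--no-chunking", action="store_true",
                    help="Index whole documents without chunking")
args = parser.parse_args()
```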
### Legacy Notebooks (Reference Only)

The following notebooks are kept for reference, but their functionality has been moved to the Python modules:

- `utils_data_loading.ipynb` - Contains the original utility functions
- `update_blog_data.ipynb` - Demonstrates the update workflow
## How to Use

### Updating Blog Data

When new blog posts are published, follow these steps:

1. Add the markdown files to the `data/` directory
2. Run the update script:

```bash
cd /home/mafzaal/source/lets-talk
uv run python update_blog_data.py
```
You can also force recreation of the vector store:

```bash
uv run python update_blog_data.py --force-recreate
```

Or customize the chunking behavior:

```bash
uv run python update_blog_data.py --chunk-size 1500 --chunk-overlap 300
```

Or use whole documents without chunking:

```bash
uv run python update_blog_data.py --no-chunking
```

This will:
- Load all blog posts (including new ones)
- Update the vector embeddings
- Save statistics for tracking
### Customizing the Process

You can customize the process by editing the `.env` file:

```
DATA_DIR=data/                                      # Directory containing blog posts
VECTOR_STORAGE_PATH=./db/vectorstore_v3             # Path to vector store
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l  # Embedding model
QDRANT_COLLECTION=thedataguy_documents              # Collection name
BLOG_BASE_URL=https://thedataguy.pro/blog/          # Base URL for blog
CHUNK_SIZE=1000                                     # Size of each document chunk
CHUNK_OVERLAP=200                                   # Overlap between chunks
```
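At startup, these settings would typically be read from the environment. The sketch below assumes python-dotenv is available; the variable names and defaults come from the file above, while the loading code itself is illustrative.

```python
# Illustrative settings loading; assumes the python-dotenv package.
# Names and defaults mirror the .env example above.
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the current working directory

DATA_DIR = os.getenv("DATA_DIR", "data/")
VECTOR_STORAGE_PATH = os.getenv("VECTOR_STORAGE_PATH", "./db/vectorstore_v3")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "thedataguy_documents")
BLOG_BASE_URL = os.getenv("BLOG_BASE_URL", "https://thedataguy.pro/blog/")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
```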
### In the Chainlit App

The Chainlit app (`app.py`) has been updated to use the utility functions from the `blog_utils.py` module. If that import or initialization fails for any reason, the app falls back to importing from the notebook and initializing directly.
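The fallback described above might look roughly like the following. Every name here is hypothetical; only the try/except structure reflects what the docs describe.

```python
# Sketch of the described fallback: prefer blog_utils, fall back to
# direct initialization if the import or call fails. All names are
# hypothetical; only the overall pattern comes from the docs.

def initialize_directly():
    """Hypothetical last-resort setup (e.g., inlined notebook logic)."""
    raise NotImplementedError("direct initialization would go here")

try:
    from blog_utils import load_vector_store  # assumed helper name
    vector_store = load_vector_store()
except Exception:
    vector_store = initialize_directly()
```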
## Adding Custom Processing

To add custom processing for blog posts:

1. Edit the `update_document_metadata` function in `blog_utils.py`
2. Add any additional enrichment or processing steps
3. Update the vector store using the `update_blog_data.py` script
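As an example of step 2, the function could be extended to add a reading-time estimate. The docs name `update_document_metadata` but not its interface, so the signature and fields below are assumptions.

```python
# Hypothetical shape of update_document_metadata in blog_utils.py.
# The name comes from these docs; the signature and fields are assumed.
def update_document_metadata(doc: dict, base_url: str = "https://thedataguy.pro/blog/") -> dict:
    metadata = doc.setdefault("metadata", {})
    # Existing enrichment: derive the post URL from the source file name.
    slug = metadata.get("source", "").removesuffix(".md")
    metadata["url"] = f"{base_url}{slug}"
    # Example custom step: estimate reading time at ~200 words per minute.
    word_count = len(doc.get("page_content", "").split())
    metadata["reading_time_min"] = max(1, round(word_count / 200))
    return doc
```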
## Future Improvements

- Add a scheduled update process to automatically include new blog posts
- Add tracking of embedding models and versions
- Add webhook support to automatically update when new posts are published