## How to run some of the code in this repository
### 1. Make sure Docker is installed on your machine
### 2. Clone the repository
### 3. `cd` into the repository
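### 4. Run the following command to build the Docker image
```bash
# Build the image for the oc-prototype service defined in the compose file.
# `docker compose build` takes a service name, not `-t <tag> .`;
# for a plain image build without compose, use: docker build -t oc-prototype .
docker compose build oc-prototype
```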
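### 5. Run the following commands to start the container and open a shell in it
```bash
# Start the oc-prototype service in the background
docker compose up -d oc-prototype
# Open an interactive shell (assumes the running container is named oc-prototype)
docker exec -it oc-prototype /bin/bash
```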
## Prototype TODOs
### Data
- [X] Process all misinfo claims and generate embeddings for a library namespace
- [X] Upsert claims into Pinecone
- [X] Upsert 300k into namespace
- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data
### Functions
- [X] Upsert vector
- [X] Batch upsert
- [X] Query against metadata
- [ ] Generate working Dockerfile for project reproducibility
- [ ] Load data into a database
- [ ] Test precision/recall of embeddings
- [ ] Generate working version of climate demo
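
For the open precision/recall item, one possible way to score retrieval is precision@k and recall@k over a small hand-labeled set. The sketch below is only an illustration: the function name `precision_recall_at_k`, the claim IDs, and the idea of a labeled set of relevant claims per query are assumptions, not code from this repository.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single query.

    retrieved: claim IDs returned by the vector search, best match first.
    relevant:  claim IDs a human labeled as correct for this query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved[:k])
    hits = sum(1 for claim_id in top_k if claim_id in relevant)
    # Precision divides by the number of results actually returned (at most k).
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs, for illustration only.
retrieved_ids = ["c7", "c2", "c9", "c4"]
relevant_ids = {"c2", "c4", "c5"}
p, r = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={p:.2f} recall@3={r:.2f}")
```

Averaging these two numbers over many labeled queries would give a single score per embedding setup.
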
## Embedding pricing
Embeddings are billed per token: $0.0001 per 1,000 tokens, which is $0.10 per 1M tokens.

One token is approximately 0.75 words, so 1,000 tokens is about 750 words. One token is also about 4 characters, so 1,000 tokens correspond to roughly 4 KB of text, or about $0.000025 per kilobyte.

On that basis you can approximate the cost of embedding a dataset as:

Cost in $ = Size of data in kilobytes * 0.000025
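
The rule of thumb above can be turned into a small helper. This is a rough sketch using only the figures quoted in this section ($0.0001 per 1,000 tokens, about 4 characters per token). The function name `estimate_embedding_cost` is made up for illustration, and real token counts depend on the tokenizer and the text.

```python
PRICE_PER_1K_TOKENS = 0.0001  # USD per 1,000 tokens, as quoted in this README
CHARS_PER_TOKEN = 4           # rough average, as quoted in this README


def estimate_embedding_cost(num_chars: int) -> float:
    """Estimate the embedding cost in USD from a character count."""
    tokens = num_chars / CHARS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_TOKENS


# 4,000 characters (about 4 KB) is roughly 1,000 tokens.
print(f"${estimate_embedding_cost(4_000):.4f}")
# About 1 GB of text, assuming one byte per character.
print(f"${estimate_embedding_cost(1_000_000_000):.2f}")
```

Under these assumptions, 4 KB of text costs about $0.0001 and 1 GB costs about $25, which matches the per-kilobyte formula.
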
## Credentials
Credentials for running Google Cloud queries: see `ostreacultura-credentials.json`.