	## How to run some of the code in this repository

	### 1. Make sure Docker is installed on your machine
	### 2. Clone the repository
	### 3. `cd` into the repository
	### 4. Run the following command to build the Docker image
	```bash
	# Assumes the compose file defines a service named oc-prototype
	docker compose build oc-prototype
	```
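	### 5. Run the following commands to start the container and open a shell in it
	```bash
	# Start the oc-prototype service in the background
	docker compose up -d oc-prototype
	# Open a shell in the running container (assumes it is named oc-prototype)
	docker exec -it oc-prototype /bin/bash
	```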


	## Prototype TODOs

	## Data
	- [X] Process all misinfo claims and generate embeddings for a library namespace
	- [X] Upsert claims into Pinecone
	- [X] Upsert 300k vectors into the namespace
	- [ ] Update claim format to be similar to: https://www.kaggle.com/datasets/shivkumarganesh/politifact-factcheck-data/data

	## Functions
	- [X] Upsert vector
	- [X] Batch upsert
	- [X] Query against metadata
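
The batch upsert item can be illustrated with a minimal Python sketch of the batching logic alone. It does not call Pinecone: `upsert_fn` is a hypothetical callback standing in for whatever client call the project uses, and the `(id, values, metadata)` record shape and the default batch size of 100 are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Assumed record shape: (id, embedding values, metadata).
Record = Tuple[str, List[float], Dict]


def chunked(records: Sequence[Record], batch_size: int) -> Iterator[Sequence[Record]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def batch_upsert(
    records: Sequence[Record],
    upsert_fn: Callable[[Sequence[Record]], None],  # hypothetical client call
    batch_size: int = 100,
) -> int:
    """Send records to upsert_fn one batch at a time; return the number sent."""
    sent = 0
    for batch in chunked(records, batch_size):
        upsert_fn(batch)
        sent += len(batch)
    return sent
```

Passing the upsert call in as a function keeps the chunking testable without network access or credentials.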

	- [ ] Generate working Dockerfile for project reproducibility
	- [ ] Load data into a database
	- [ ] Test precision/recall of embeddings
	- [ ] Generate working version of climate demo
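
For the precision/recall item, one possible starting point is precision and recall at k for a single query. This is a sketch under assumptions: it presumes a hand-labeled set of relevant claim IDs per query, which the repository may not have yet, and the claim IDs in the usage example are made up.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@k and recall@k for one query.

    retrieved: claim IDs returned by the vector search, best match first
               (assumed to contain no duplicates).
    relevant:  claim IDs labeled as correct matches for the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for claim_id in retrieved[:k] if claim_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical IDs: two of the top four results are relevant.
    p, r = precision_recall_at_k(["c1", "c7", "c3", "c9"], {"c1", "c3", "c4"}, k=4)
    print(p, r)
```

Averaging these values over a set of labeled queries gives a single figure for comparing embedding models or query settings.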


	## Embedding pricing

	Embeddings are billed at $0.0001 per 1,000 tokens, i.e. $0.10 per 1M tokens.
	One token is roughly 0.75 words (1,000 tokens ≈ 750 words) or about 4 characters, so 1,000 tokens cover about 4 KB of text.
	One kilobyte of text is therefore about 250 tokens, which gives the approximation:

	Cost in $ = Size of data in kilobytes * 0.000025
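
The rule of thumb above can be written as a small Python helper. This is only a sketch: the rate and the 4-characters-per-token ratio are the approximations quoted in this section, the function names are made up for illustration, and real token counts depend on the tokenizer and the text.

```python
# Assumed figures, taken from the pricing notes above.
PRICE_PER_1K_TOKENS = 0.0001  # dollars per 1,000 tokens
CHARS_PER_TOKEN = 4           # roughly 4 characters (bytes of plain text) per token


def estimate_tokens(num_bytes: int) -> float:
    """Approximate token count for num_bytes of plain text."""
    return num_bytes / CHARS_PER_TOKEN


def estimate_embedding_cost(num_bytes: int) -> float:
    """Approximate embedding cost in dollars for num_bytes of plain text."""
    return estimate_tokens(num_bytes) / 1000 * PRICE_PER_1K_TOKENS


if __name__ == "__main__":
    # 1 GB of text, counting 1 KB as 1,000 bytes as in the rule of thumb.
    print(f"${estimate_embedding_cost(1_000_000_000):.2f}")  # prints $25.00
```

At these rates, 1 KB of text costs about $0.000025, matching the formula above.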

	Credentials for running Google Cloud queries: see ostreacultura-credentials.json