burtenshaw committed on
Commit d12ff68 · 1 Parent(s): adb4caa

simplify readme

Files changed (1)
  1. README.md +0 -29
README.md CHANGED
@@ -61,35 +61,6 @@ After deduplication completes:
 
  The dataset will be saved as `your-username/dataset-name` and be publicly available.
 
- ## Technical Details
-
- - **Embedding Model**: Uses `minishlab/potion-base-8M` (Model2Vec) for fast, efficient text embeddings
- - **Deduplication Algorithm**: SemHash for scalable semantic similarity detection
- - **Backend**: Runs on CPU (may be slow for large datasets on free tier)
-
- ## Local Usage
-
- For faster processing of large datasets, run locally:
-
- ```bash
- git clone <repository-url>
- cd semantic-deduplication
- pip install -r requirements.txt
- python app.py
- ```
-
- ## Examples
-
- ### Cross-dataset Deduplication
- Remove test set contamination:
- - **Dataset 1**: `your-org/training-data` (split: `train`)
- - **Dataset 2**: `your-org/test-data` (split: `test`)
- - **Result**: Clean test set with training examples removed
-
- ### Single Dataset Cleaning
- Remove duplicates from a dataset:
- - **Dataset 1**: `common_voice` (split: `train`)
- - **Result**: Training set with duplicate audio transcriptions removed
 
  ## Notes
 
 