burtenshaw committed
Commit 074bcd7 · 1 Parent(s): f6c9d95

add friendly readme

Files changed (1): README.md (+86 −1)
README.md CHANGED
@@ -9,6 +9,91 @@ app_file: app.py
  pinned: false
  license: mit
  short_description: Deduplicate HuggingFace datasets in seconds
+ hf_oauth: true
+ hf_oauth_scopes:
+ - write-repo
+ - manage-repo
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Semantic Text Deduplication Using SemHash
+
+ This Gradio application performs **semantic deduplication** on HuggingFace datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.
+
+ ## Features
+
+ - **Two deduplication modes** (see the sketch below):
+   - **Single dataset**: Find and remove duplicates within one dataset
+   - **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1
+
+ - **Customizable similarity threshold**: Control how strict the deduplication should be (0.0 = very loose, 1.0 = exact matches only)
+
+ - **Detailed results**: View statistics and examples of found duplicates, with word-level differences highlighted
+
+ - **Hub Integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in
+
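+ Both modes map onto SemHash's two entry points. As a minimal sketch (the exact wiring in `app.py` may differ; the texts are toy placeholders):
+
+ ```python
+ from semhash import SemHash
+
+ train_texts = ["how do I reset my password", "how to reset my password", "play some jazz"]
+ test_texts = ["play some jazz", "what is the weather like"]
+
+ # Single dataset: index the records, then deduplicate them against each other.
+ semhash = SemHash.from_records(records=train_texts)
+ print(semhash.self_deduplicate().deduplicated)
+
+ # Cross-dataset: index Dataset 1, then drop Dataset 2 entries that match it.
+ print(semhash.deduplicate(records=test_texts).deduplicated)
+ ```
+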
+ ## How to Use
+
+ ### 1. Choose Deduplication Type
+ - **Cross-dataset**: Useful for removing training-data contamination from test sets
+ - **Single dataset**: Clean up duplicate entries within a single dataset
+
+ ### 2. Configure Datasets
+ - Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
+ - Specify the dataset splits (e.g., `train`, `test`, `validation`)
+ - Set the text column name (usually `text`, `sentence`, or `content`); see the snippet below
+
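+ Together, these fields amount to a standard `datasets` load plus a column lookup. Roughly (the split and column names are just the examples above):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load one split of a Hub dataset and pull out the configured text column.
+ ds = load_dataset("SetFit/amazon_massive_scenario_en-US", split="train")
+ texts = ds["text"]
+ print(f"Loaded {len(texts)} rows")
+ ```
+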
+ ### 3. Set Similarity Threshold
+ - **0.9** (default): Good balance between precision and recall
+ - **Higher values** (0.95-0.99): More conservative; only removes very similar texts
+ - **Lower values** (0.7-0.85): More aggressive; may remove texts that are semantically close but not true duplicates (see the sketch below)
+
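+ If you script this yourself, the slider corresponds to SemHash's `threshold` argument (assumed mapping; this reuses the `semhash` index from the earlier sketch):
+
+ ```python
+ strict = semhash.self_deduplicate(threshold=0.99)  # near-exact matches only
+ loose = semhash.self_deduplicate(threshold=0.75)   # paraphrases may be removed too
+ ```
+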
+ ### 4. Run Deduplication
+ Click **"Deduplicate"** to start the process. You'll see:
+ - Loading progress for datasets
+ - Deduplication progress
+ - Results with statistics and example duplicates
+
+ ### 5. Push to Hub (New!)
+ After deduplication completes:
+ 1. **Log in** with your Hugging Face account using the login button
+ 2. Enter a **dataset name** for your cleaned dataset
+ 3. Click **"Push to Hub"** to upload the deduplicated dataset
+
+ The dataset will be saved as `your-username/dataset-name` and will be publicly available; the equivalent API call is shown below.
+
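+ Outside the app, the same upload is a one-liner with `datasets` (a sketch: `deduplicated_ds` and the repo name are placeholders, and a valid Hub token is assumed):
+
+ ```python
+ # Push the cleaned dataset to your namespace on the Hub.
+ deduplicated_ds.push_to_hub("your-username/dataset-name")
+ ```
+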
+ ## Technical Details
+
+ - **Embedding Model**: Uses `minishlab/potion-base-8M` (Model2Vec) for fast, efficient text embeddings (sketched below)
+ - **Deduplication Algorithm**: SemHash for scalable semantic similarity detection
+ - **Backend**: Runs on CPU (may be slow for large datasets on the free tier)
+
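+ In isolation, the embedding step looks roughly like this (a sketch, not code lifted from `app.py`):
+
+ ```python
+ from model2vec import StaticModel
+
+ # Static embeddings: no transformer forward pass at inference time,
+ # which is what keeps this fast on CPU.
+ model = StaticModel.from_pretrained("minishlab/potion-base-8M")
+ embeddings = model.encode(["how do I reset my password", "how to reset my password"])
+ print(embeddings.shape)
+ ```
+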
+ ## Local Usage
+
+ For faster processing of large datasets, run locally:
+
+ ```bash
+ git clone <repository-url>
+ cd semantic-deduplication
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ ## Examples
+
+ ### Cross-dataset Deduplication
+ Remove test set contamination (end-to-end sketch below):
+ - **Dataset 1**: `your-org/training-data` (split: `train`)
+ - **Dataset 2**: `your-org/test-data` (split: `test`)
+ - **Result**: Clean test set with training examples removed
+
+ ### Single Dataset Cleaning
+ Remove duplicates from a dataset:
+ - **Dataset 1**: `common_voice` (split: `train`)
+ - **Result**: Training set with duplicate audio transcriptions removed
+
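+ The cross-dataset scenario, end to end (a sketch: the dataset names are the placeholders above, and `text` is an assumed column name):
+
+ ```python
+ from datasets import load_dataset
+ from semhash import SemHash
+
+ train = load_dataset("your-org/training-data", split="train")
+ test = load_dataset("your-org/test-data", split="test")
+
+ # Index the training texts, then keep only test rows that are not near-duplicates of them.
+ semhash = SemHash.from_records(records=list(train["text"]))
+ clean_test_texts = semhash.deduplicate(records=list(test["text"])).deduplicated
+ print(f"Kept {len(clean_test_texts)} of {len(test)} test rows")
+ ```
+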
+ ## Notes
+
+ - The app preserves all original columns from the datasets
+ - Only text similarity is used for deduplication decisions
+ - Deduplicated datasets maintain the same structure as the originals
+ - OAuth login is required only for pushing to the Hub, not for deduplication