---
title: Semantic Deduplication
emoji: 🧹
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.0.2
app_file: app.py
pinned: false
license: mit
short_description: Deduplicate HuggingFace datasets in seconds
hf_oauth: true
hf_oauth_scopes:
  - write-repo
  - manage-repo
---

# Semantic Text Deduplication Using SemHash

This Gradio application performs **semantic deduplication** on HuggingFace datasets using [SemHash](https://github.com/MinishLab/semhash) with [Model2Vec](https://github.com/MinishLab/model2vec) embeddings.

## Features

- **Two deduplication modes**:
  - **Single dataset**: Find and remove duplicates within one dataset
  - **Cross-dataset**: Remove entries from Dataset 2 that are similar to entries in Dataset 1
- **Customizable similarity threshold**: Control how strict the deduplication is (values close to 1.0 keep only near-exact matches out; lower values remove more aggressively)
- **Detailed results**: View statistics and examples of found duplicates, with word-level differences highlighted
- **Hub integration**: 🆕 **Push deduplicated datasets directly to the Hugging Face Hub** after logging in

## How to Use

### 1. Choose Deduplication Type

- **Cross-dataset**: Useful for removing training-data contamination from test sets
- **Single dataset**: Clean up duplicate entries within a single dataset

### 2. Configure Datasets

- Enter the HuggingFace dataset names (e.g., `SetFit/amazon_massive_scenario_en-US`)
- Specify the dataset splits (e.g., `train`, `test`, `validation`)
- Set the text column name (usually `text`, `sentence`, or `content`)

### 3. Set Similarity Threshold

- **0.9** (default): Good balance between precision and recall
- **Higher values** (0.95-0.99): More conservative; only removes very similar texts
- **Lower values** (0.7-0.85): More aggressive; may remove texts that are semantically related but not true duplicates

### 4. Run Deduplication

Click **"Deduplicate"** to start the process. You'll see:

- Loading progress for the datasets
- Deduplication progress
- Results with statistics and example duplicates

### 5. Push to Hub (New!)

After deduplication completes:

1. **Log in** with your Hugging Face account using the login button
2. Enter a **dataset name** for your cleaned dataset
3. Click **"Push to Hub"** to upload the deduplicated dataset

The dataset will be saved as `your-username/dataset-name` and be publicly available.

## Technical Details

- **Embedding model**: `minishlab/potion-base-8M` (Model2Vec) for fast, efficient static text embeddings
- **Deduplication algorithm**: SemHash for scalable semantic similarity detection
- **Backend**: Runs on CPU (may be slow for large datasets on the free tier)

A short programmatic sketch of both deduplication modes is included at the end of this README.

## Local Usage

For faster processing of large datasets, run the app locally:

```bash
git clone <repository-url>
cd semantic-deduplication
pip install -r requirements.txt
python app.py
```

## Examples

### Cross-dataset Deduplication

Remove test-set contamination:

- **Dataset 1**: `your-org/training-data` (split: `train`)
- **Dataset 2**: `your-org/test-data` (split: `test`)
- **Result**: Clean test set with training examples removed

### Single Dataset Cleaning

Remove duplicates from a dataset:

- **Dataset 1**: `common_voice` (split: `train`)
- **Result**: Training set with duplicate audio transcriptions removed

## Notes

- The app preserves all original columns from the datasets
- Only text similarity is used for deduplication decisions
- Deduplicated datasets maintain the same structure as the originals
- OAuth login is required only for pushing to the Hub, not for deduplication
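
## Programmatic Sketch (SemHash)

For reference, the snippet below is a minimal sketch of the two deduplication modes described above, written against SemHash's documented `from_records`, `self_deduplicate`, and `deduplicate` interface. The dataset name and `text` column are taken from the configuration example earlier in this README; the `threshold` keyword and the `.deduplicated` result attribute follow the SemHash README and may differ between library versions.

```python
# Minimal sketch, not the app's exact implementation.
from datasets import load_dataset
from semhash import SemHash

# Single-dataset mode: remove near-duplicates within one split.
train_texts = load_dataset("SetFit/amazon_massive_scenario_en-US", split="train")["text"]
semhash = SemHash.from_records(records=train_texts)  # uses a Model2Vec embedder by default
clean_train = semhash.self_deduplicate(threshold=0.9).deduplicated  # surviving records

# Cross-dataset mode: drop test entries that are too similar to the indexed train entries.
test_texts = load_dataset("SetFit/amazon_massive_scenario_en-US", split="test")["text"]
clean_test = semhash.deduplicate(records=test_texts, threshold=0.9).deduplicated

print(f"Kept {len(clean_train)} / {len(train_texts)} train texts")
print(f"Kept {len(clean_test)} / {len(test_texts)} test texts")
```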
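
## Pushing to the Hub Programmatically

The "Push to Hub" step can also be reproduced outside the UI with the `datasets` library's `push_to_hub`. The repo name and the `deduplicated_texts` variable below are illustrative placeholders; a write-scoped token is needed (for example via `huggingface-cli login`).

```python
from datasets import Dataset

# Stand-in for the list of texts produced by the deduplication step above.
deduplicated_texts = ["example one", "example two"]

# Build a single-column dataset and upload it under your namespace.
ds = Dataset.from_dict({"text": deduplicated_texts})
ds.push_to_hub("your-username/my-deduplicated-dataset")
```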