Commit 913c705 by vrashad (verified)
1 Parent(s): f0b5f77

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -23,7 +23,7 @@ It is designed for tasks involving both languages, such as training bilingual se
  * **Type:** SentencePiece Unigram
  * **Languages:** Azerbaijani (az), English (en)
  * **Vocabulary Size:** Approximately 50,000 (actual size might be slightly larger due to special tokens, e.g., 50001).
- * **Training Data:** Trained on a parallel corpus of ~4.14 million sentence pairs (total ~8.28 million sentences) sourced from `merged_data.csv`. The corpus was balanced between Azerbaijani and English.
+ * **Training Data:** Trained on a parallel corpus of ~4.14 million sentence pairs (total ~8.28 million sentences). The corpus was balanced between Azerbaijani and English.
  * **Normalization:** NFKC Unicode normalization (standard for SentencePiece).
  * **Character Coverage:** 0.9995 (ensuring good coverage for Azerbaijani specific characters: ç, ö, ə, ü, ğ, ş).
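
The tokenizer settings listed in the README bullets of this hunk could be reproduced roughly as in the sketch below. It is not the author's training script. The option values come from the README; the helper names (`TRAINER_OPTIONS`, `nfkc`, `train_tokenizer`), the corpus path, and the model prefix are hypothetical. The choice of the `nfkc` rule is also an assumption: the README only says "NFKC", while SentencePiece's default rule is `nmt_nfkc`.

```python
import unicodedata

# Values stated in the README: Unigram model, ~50,000 vocabulary, 0.9995 coverage.
TRAINER_OPTIONS = {
    "model_type": "unigram",
    "vocab_size": 50000,
    "character_coverage": 0.9995,
    # Assumption: README says "NFKC"; SentencePiece's own default is "nmt_nfkc".
    "normalization_rule_name": "nfkc",
}

# Azerbaijani-specific characters named in the README, in precomposed form.
AZ_CHARS = "\u00e7\u00f6\u0259\u00fc\u011f\u015f"  # ç ö ə ü ğ ş


def nfkc(text: str) -> str:
    """Apply NFKC normalization using the standard library."""
    return unicodedata.normalize("NFKC", text)


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a tokenizer with the options above (paths are hypothetical)."""
    # Imported lazily so the rest of the sketch runs without the third-party package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,
        **TRAINER_OPTIONS,
    )


if __name__ == "__main__":
    # Precomposed Azerbaijani letters are left unchanged by NFKC.
    print(nfkc(AZ_CHARS) == AZ_CHARS)
    # A decomposed "c" + combining cedilla is composed into a single "ç".
    print(nfkc("c\u0327") == "\u00e7")
```

The standard-library check only illustrates what NFKC does to the listed characters; the normalization applied at training time is whatever SentencePiece implements for the chosen rule name.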