Update README.md
README.md
CHANGED
@@ -23,7 +23,7 @@ It is designed for tasks involving both languages, such as training bilingual se
 * **Type:** SentencePiece Unigram
 * **Languages:** Azerbaijani (az), English (en)
 * **Vocabulary Size:** Approximately 50,000 (actual size might be slightly larger due to special tokens, e.g., 50001).
-* **Training Data:** Trained on a parallel corpus of ~4.14 million sentence pairs (total ~8.28 million sentences)
+* **Training Data:** Trained on a parallel corpus of ~4.14 million sentence pairs (total ~8.28 million sentences). The corpus was balanced between Azerbaijani and English.
 * **Normalization:** NFKC Unicode normalization (standard for SentencePiece).
 * **Character Coverage:** 0.9995 (ensuring good coverage for Azerbaijani specific characters: ç, ö, ə, ü, ğ, ş).
 
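
The normalization and character-coverage settings listed in the README can be illustrated with a small standard-library sketch. This is not SentencePiece's own implementation: `nfkc` and `covered_alphabet` are hypothetical helper names, `unicodedata` stands in for the library's internal normalizer, and the toy sentences are made up for illustration. The sketch assumes that "character coverage" means keeping the most frequent characters until their cumulative share of all character occurrences reaches the target (0.9995 in the README).

```python
import unicodedata
from collections import Counter


def nfkc(text: str) -> str:
    """Apply NFKC Unicode normalization (stdlib stand-in, an assumption)."""
    return unicodedata.normalize("NFKC", text)


def covered_alphabet(sentences, coverage=0.9995):
    """Return the most frequent characters whose cumulative share of all
    non-whitespace character occurrences reaches `coverage`."""
    counts = Counter(
        ch for s in sentences for ch in nfkc(s) if not ch.isspace()
    )
    total = sum(counts.values())
    kept, running = set(), 0
    for ch, n in counts.most_common():
        # Stop once the characters kept so far already meet the target.
        if running / total >= coverage:
            break
        kept.add(ch)
        running += n
    return kept


# Hypothetical toy corpus; the real training data is not part of this diff.
sample = ["Azərbaycan dili", "çörək, üzüm, dağ, şəhər"]
alphabet = covered_alphabet(sample, coverage=1.0)
print(sorted(alphabet))
print(set("çöəüğş") <= alphabet)
```

With a coverage target below 1.0, rare characters fall outside the kept alphabet, which is why a high value such as 0.9995 matters for letters like ə and ğ that may be infrequent in a mixed Azerbaijani and English corpus.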