Bilingual Azerbaijani-English Unigram Tokenizer (az-en-unigram-tokenizer-50k)

This repository contains a SentencePiece Unigram tokenizer trained on a bilingual corpus of Azerbaijani and English text. It is designed for tasks involving both languages, such as training bilingual sentence embeddings, machine translation, or cross-lingual information retrieval.

Tokenizer Details

  • Type: SentencePiece Unigram
  • Languages: Azerbaijani (az), English (en)
  • Vocabulary Size: 50,000 base pieces (the final vocabulary may be slightly larger once special tokens are added, e.g., 50,001 with [PAD]).
  • Training Data: Trained on a parallel corpus of ~4.14 million sentence pairs (total ~8.28 million sentences). The corpus was balanced between Azerbaijani and English.
  • Normalization: NFKC Unicode normalization (standard for SentencePiece).
  • Character Coverage: 0.9995, ensuring good coverage of Azerbaijani-specific characters (ç, ö, ə, ü, ğ, ş); a quick check is sketched below.
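
The character-coverage value mainly matters for the Azerbaijani-specific letters. The snippet below is a minimal sanity check (it only assumes the tokenizer loads as shown in the usage section) that none of these characters fall back to [UNK]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

# Azerbaijani-specific characters that the 0.9995 character coverage should retain
for ch in ["ç", "ö", "ə", "ü", "ğ", "ş"]:
    ids = tokenizer.encode(ch, add_special_tokens=False)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    assert tokenizer.unk_token_id not in ids, f"{ch} maps to [UNK]"
    print(ch, "->", pieces)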

Special Tokens

The tokenizer includes the following special tokens, which are standard for many transformer-based models:

  • [UNK] (Unknown Token): ID 0
  • [CLS] (Classification Token / Start of Sequence): ID 1
  • [SEP] (Separator Token / End of Sequence): ID 2
  • [MASK] (Mask Token): ID 3
  • [PAD] (Padding Token): ID 50000 (appended after the 50,000 base pieces; see the check below to confirm the ID assigned by the published files)
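
To confirm these IDs against the loaded tokenizer, here is a short check using standard transformers attributes (assuming the tokenizer loads as in the usage section below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

# Look up each special token's ID as assigned by the published tokenizer files
for token in ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))

# Attribute-based access and total vocabulary size (including added special tokens)
print("pad_token_id:", tokenizer.pad_token_id)
print("vocab size:", len(tokenizer))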

Intended Use

This tokenizer is intended to be used with transformer models for tasks that require processing of both Azerbaijani and English text. It can be particularly useful for:

  • Initializing the tokenizer for new bilingual (Azerbaijani-English) sentence transformer models.
  • Fine-tuning multilingual models on Azerbaijani-English data.
  • Pre-training new models from scratch on Azerbaijani and English text (a sizing sketch follows this list).
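
For pre-training from scratch, the tokenizer fixes the embedding-matrix size and the padding ID. The following is a minimal sketch assuming a BERT-style masked-language-model encoder; the architecture and layer sizes are illustrative choices, not part of this repository:

from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

# Size the model's embeddings and padding ID from the tokenizer
config = BertConfig(
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    hidden_size=384,          # illustrative; choose to suit your compute budget
    num_hidden_layers=6,
    num_attention_heads=6,
)
model = BertForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")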

How to Use

You can use this tokenizer directly with the transformers library:

from transformers import AutoTokenizer

tokenizer_id = "LocalDoc/az-en-unigram-tokenizer-50k"

try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    print(f"Tokenizer loaded successfully from {tokenizer_id}!")
except Exception as e:
    # Loading a SentencePiece-based tokenizer can require the protobuf package
    # (and sentencepiece_model_pb2.py in minimal environments).
    print("Failed to load tokenizer. Make sure 'protobuf' is installed and "
          "'sentencepiece_model_pb2.py' is available if the loading mechanism needs it.")
    print(f"Error: {e}")
    # Fallback for minimal environments:
    # !pip install protobuf
    # !wget https://raw.githubusercontent.com/google/sentencepiece/master/python/src/sentencepiece/sentencepiece_model_pb2.py
    # tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)


# Example Azerbaijani text
az_text = "Bu, Azərbaycan dilində bir test cümləsidir."
encoded_az = tokenizer.encode(az_text)
tokens_az = tokenizer.convert_ids_to_tokens(encoded_az)
print(f"Azerbaijani Original: {az_text}")
print(f"Azerbaijani Encoded IDs: {encoded_az}")
print(f"Azerbaijani Tokens: {tokens_az}")

# Example English text
en_text = "This is a test sentence in English."
encoded_en = tokenizer.encode(en_text)
tokens_en = tokenizer.convert_ids_to_tokens(encoded_en)
print(f"\nEnglish Original: {en_text}")
print(f"English Encoded IDs: {encoded_en}")
print(f"English Tokens: {tokens_en}")

# Example with special tokens
special_text = "[CLS] Bu bir cümlədir. [SEP] This is a sentence. [MASK]"
encoded_special = tokenizer.encode(special_text)
tokens_special = tokenizer.convert_ids_to_tokens(encoded_special)
print(f"\nSpecial Text Original: {special_text}")
print(f"Special Text Encoded IDs: {encoded_special}")
print(f"Special Text Tokens: {tokens_special}")
Example output:

Tokenizer loaded successfully from LocalDoc/az-en-unigram-tokenizer-50k!
Azerbaijani Original: Bu, Azərbaycan dilində bir test cümləsidir.
Azerbaijani Encoded IDs: [90, 4, 66, 2940, 30, 2248, 34485, 116, 5]
Azerbaijani Tokens: ['▁Bu', ',', '▁Azərbaycan', '▁dilində', '▁bir', '▁test', '▁cümləsi', 'dir', '.']

English Original: This is a test sentence in English.
English Encoded IDs: [283, 18, 14, 2248, 3841, 10, 2784, 5]
English Tokens: ['▁This', '▁is', '▁a', '▁test', '▁sentence', '▁in', '▁English', '.']

Special Text Original: [CLS] Bu bir cümlədir. [SEP] This is a sentence. [MASK]
Special Text Encoded IDs: [1, 90, 30, 10798, 116, 5, 15, 2, 283, 18, 14, 3841, 5, 15, 3]
Special Text Tokens: ['[CLS]', '▁Bu', '▁bir', '▁cümlə', 'dir', '.', '▁', '[SEP]', '▁This', '▁is', '▁a', '▁sentence', '.', '▁', '[MASK]']
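
For training or batched inference you would normally call the tokenizer on a list of sentences and let it handle padding and truncation. A short follow-up example (standard transformers usage; the max_length value is just an example, and return_tensors="pt" requires PyTorch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

batch = [
    "Bu, Azərbaycan dilində bir test cümləsidir.",
    "This is a test sentence in English.",
]
encoded = tokenizer(
    batch,
    padding=True,        # pads the shorter sentence with [PAD]
    truncation=True,
    max_length=128,      # example limit; adjust to your model
    return_tensors="pt",
)
print(encoded["input_ids"].shape)
print(encoded["attention_mask"])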

Tokenizer Training Procedure

The tokenizer was trained using the SentencePiece library with the following key parameters:

Input Data

  • A text file containing approximately 8.28 million sentences, consisting of concatenated Azerbaijani and English texts.

Model Configuration

  • Model Type: unigram
  • Vocabulary Size: 50000
  • Character Coverage: 0.9995

Special Tokens

The following special tokens were defined and included during training:

  • [UNK]: Unknown token
  • [PAD]: Padding token
  • [CLS]: Beginning-of-sentence token
  • [SEP]: End-of-sentence token
  • [MASK]: Mask token (user-defined symbol)

Token Definitions in SentencePiece:

  • unk_piece: [UNK]
  • pad_piece: [PAD]
  • bos_piece: [CLS]
  • eos_piece: [SEP]
  • user_defined_symbols: [MASK]
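
Putting these parameters together, a training call along the following lines would reproduce the setup. This is a sketch, not the exact training script: the input path is hypothetical, and because the published [PAD] token sits at ID 50000 (i.e., above the 50,000 base pieces), the sketch leaves SentencePiece's own pad ID disabled and assumes [PAD] was attached when the model was wrapped for transformers. Only the documented values (unigram model, 50,000 pieces, 0.9995 character coverage, the pieces listed above) are taken from this card:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="az_en_corpus.txt",            # hypothetical path: one sentence per line, az + en concatenated
    model_prefix="az_en_unigram_50k",
    model_type="unigram",
    vocab_size=50000,
    character_coverage=0.9995,
    unk_id=0, bos_id=1, eos_id=2,        # matches the documented [UNK]/[CLS]/[SEP] IDs
    unk_piece="[UNK]",
    bos_piece="[CLS]",
    eos_piece="[SEP]",
    user_defined_symbols=["[MASK]"],     # placed right after the control symbols, i.e. ID 3
    pad_id=-1,                           # assumption: [PAD] added later by the transformers wrapper
)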

Contact

For more information, questions, or issues, please contact LocalDoc at v.resad.89@gmail.com.
