Assamese GPT-2 Model

This is a GPT-2 language model trained from scratch on Assamese monolingual text, using data from IndicCorpV2. The model is developed for educational and research purposes to support natural language understanding and generation tasks in Assamese, a low-resource language.

📖 Model Description

The Assamese GPT-2 model is based on the standard GPT-2 decoder-only transformer architecture, with 12 layers, 12 attention heads, and a hidden size of 768 (a configuration sketch follows the list below). It is capable of generating grammatically coherent and contextually relevant Assamese text and serves as a foundation for downstream NLP tasks such as:

  • Language modeling
  • Text completion/generation
  • Fine-tuning for classification or summarization
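For reference, the snippet below is a minimal sketch of the corresponding Hugging Face GPT2Config; the field names follow the Transformers API, and the repository's published config.json remains the authoritative source for the exact values.

from transformers import GPT2Config, GPT2LMHeadModel

# Configuration mirroring the dimensions stated above
config = GPT2Config(
    vocab_size=50_000,   # tokenizer vocabulary size (see Hyperparameters below)
    n_positions=1024,    # context window
    n_embd=768,          # hidden size
    n_layer=12,          # decoder layers
    n_head=12,           # attention heads
)

# Randomly initialized model with this configuration; use from_pretrained()
# (see Example Usage) to load the trained weights instead.
model = GPT2LMHeadModel(config)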

✅ Intended Uses

  • Academic research on Assamese NLP
  • Training and benchmarking in educational settings
  • Exploration of low-resource language modeling

🚫 Limitations

  • Trained on general-domain monolingual data, so it may not perform well on domain-specific texts (e.g., legal, medical).
  • Might generate biased, incomplete, or hallucinated outputs.
  • Not suitable for production use or deployment in sensitive applications.

📚 Training and Evaluation Data

The model was trained using Assamese monolingual data collected from:

  • IndicCorpV2: A curated collection of web-crawled and processed data for Indic languages.

Data preprocessing included the following steps (an illustrative sketch follows the list):

  • Unicode normalization
  • Removal of noisy characters and malformed tokens
  • Sentence segmentation using Assamese-specific heuristics
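The exact cleaning heuristics are not published with the model card, so the snippet below is only an illustrative sketch of such a pipeline: NFC normalization, a character filter over the Bengali/Assamese Unicode block, and a danda-based sentence splitter.

import re
import unicodedata

def preprocess(text: str) -> list[str]:
    # Unicode normalization (NFC keeps composed Assamese characters intact)
    text = unicodedata.normalize("NFC", text)
    # Illustrative noise filter: keep the Bengali/Assamese block, the danda
    # marks, basic Latin, digits, whitespace, and common punctuation
    text = re.sub(r"[^\u0980-\u09FF\u0964\u0965A-Za-z0-9\s.,!?;:'\"()\-]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence segmentation on the danda (।), full stop, and question mark
    sentences = re.split(r"(?<=[\u0964.?!])\s+", text)
    return [s for s in sentences if s]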

🧪 Training Procedure

Hyperparameters

  • Architecture: GPT-2 (12 layers, 12 heads, hidden size 768)
  • Tokenizer vocab size: 50,000
  • Context window size: 1024 tokens
  • Learning rate: 5e-5
  • Epochs: 20
  • Batch size: 64
  • Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8)
  • Scheduler: Linear
  • Mixed Precision: Native AMP
  • Seed: 42
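The configuration above maps directly onto the Hugging Face Trainer API; the snippet below is an assumed reconstruction of the training arguments, not the exact training script.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="assamese-gpt2",      # hypothetical output path
    num_train_epochs=20,
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    adam_beta1=0.9,                  # AdamW is the Trainer's default optimizer
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,                       # native AMP mixed precision
    seed=42,
)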

Results

  • Final Evaluation Loss: -29.1890
  • Accuracy: 0.3452

🚀 Example Usage

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("BharatVLM/AssameseGPT2")
tokenizer = GPT2Tokenizer.from_pretrained("BharatVLM/AssameseGPT2")

prompt = "অসমৰ ইতিহাস"  # "History of Assam"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
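Generation can be tuned with the usual sampling controls; the values below are illustrative, not the settings used for the evaluation reported above.

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.95,             # nucleus sampling
    temperature=0.8,        # lower values give more conservative text
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))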

📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Commercial use is not permitted. Use is allowed for academic and research purposes only.

📬 Citation

Please cite this model as:

@misc{assamesegpt2,
  author       = {BharatVLM},
  title        = {Assamese GPT-2 Model},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/BharatVLM/AssameseGPT2}},
  note         = {Trained using IndicCorpV2 and OSCAR corpora}
}

🧰 Framework Versions

  • Transformers: 4.52.0.dev0
  • PyTorch: 2.5.1+cu121
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Contact Us

For questions or academic collaboration, please contact: ai.bharatvlm@gmail.com.
