---
library_name: mlx
license: apache-2.0
language:
- en
- bn
- hi
- kn
- gu
- mr
- ml
- or
- pa
- ta
- te
base_model: sarvamai/sarvam-m
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- quantized
- 4bit
- indian-languages
- multilingual
- apple-silicon
- sarvam
- mistral
---
# Sarvam-M 4-bit MLX
This is a 4-bit quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m) optimized for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).
## Model Details
- **Base Model**: [Sarvam-M](https://huggingface.co/sarvamai/sarvam-m) (24B parameters)
- **Quantization**: 4-bit (≈4.5 bits per weight effective, including group scales)
- **Framework**: MLX (optimized for Apple Silicon)
- **Model Size**: ~12GB (75% reduction from original ~48GB)
- **Languages**: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
## Key Features
- **🇮🇳 Indic Language Excellence**: Specifically optimized for Indian languages with cultural context
- **🧮 Hybrid Reasoning**: Supports both "thinking" and "non-thinking" modes for different use cases
- **⚡ Fast Inference**: Roughly 4-6x faster than running the model unquantized, with comparable output quality
- **🎯 Versatile**: Strong performance in math, programming, and multilingual tasks
- **💻 Apple Silicon Optimized**: Runs efficiently on M-series (M1-M4) Macs
## Installation
```bash
# Install MLX and dependencies
pip install mlx-lm transformers
# For chat functionality (optional)
pip install gradio
```
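To confirm that MLX can see the GPU before downloading the model, a quick sanity check:
```python
import mlx.core as mx

# On Apple Silicon this should print Device(gpu, 0)
print(mx.default_device())
```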
## 🛠️ LM Studio Setup
**Having issues with short responses or "EOS token" problems in LM Studio?**
👉 **[See the complete LM Studio Setup Guide](./LM_Studio_Setup_Guide.md)**
**Quick Fix:** Use proper chat formatting:
```
[INST] Your question here [/INST]
```
The model requires specific prompt formatting to work correctly in LM Studio.
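If your runtime does not apply the chat template for you, a minimal sketch of the Mistral-style wrapping (`format_inst` is a hypothetical helper; for multi-turn chats, the `tokenizer.apply_chat_template` shipped with the model is authoritative):
```python
def format_inst(user_message: str) -> str:
    """Wrap a single-turn user message in Mistral-style [INST] tags.

    Hypothetical helper; prefer the model's own chat template for
    multi-turn or system-prompt use.
    """
    return f"[INST] {user_message} [/INST]"

print(format_inst("What is the capital of India?"))
# [INST] What is the capital of India? [/INST]
```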
## Usage
### Basic Generation
```python
from mlx_lm import load, generate
# Load the model
model, tokenizer = load("your-username/sarvam-m-4bit-mlx")
# Simple generation
response = generate(
    model,
    tokenizer,
    prompt="What is the capital of India?",
    max_tokens=50
)
print(response)
```
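For interactive use, `mlx_lm` also provides `stream_generate`, which yields output incrementally. A sketch, assuming a recent `mlx-lm` where each yielded chunk is a response object with a `.text` field (older versions yield plain strings):
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# Print tokens as they are produced instead of waiting for the full reply
for chunk in stream_generate(model, tokenizer, prompt="Name three Indian rivers.", max_tokens=60):
    print(chunk.text, end="", flush=True)
print()
```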
### Chat with Thinking Mode Control
```python
from mlx_lm import load, generate
model, tokenizer = load("your-username/sarvam-m-4bit-mlx")
# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant turn prefix
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response) # Output: The sum of 2 and 2 is **4**.
# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response) # Output: <think>Let me calculate...</think> The answer is 345.
```
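With thinking mode on, the reasoning arrives inside `<think>...</think>` tags. If you only want the final answer, a hypothetical helper (not part of `mlx-lm`) can strip the reasoning block:
```python
import re

def strip_thinking(response: str) -> str:
    """Remove <think>...</think> blocks, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

print(strip_thinking("<think>15 * 20 = 300, plus 15 * 3 = 45</think> The answer is 345."))
# The answer is 345.
```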
### Hindi Language Example
```python
# Hindi conversation ("What is the capital of India?")
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
# ("The capital of India is **New Delhi**. It is the country's political, administrative...")
```
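The same pattern extends to the other supported languages. A sketch that asks the same question ("What is the capital of India?") in three of them:
```python
# Each prompt means "What is the capital of India?"
prompts = {
    "Hindi": "भारत की राजधानी क्या है?",
    "Bengali": "ভারতের রাজধানী কী?",
    "Tamil": "இந்தியாவின் தலைநகரம் எது?",
}
for lang, question in prompts.items():
    messages = [{'role': 'user', 'content': question}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )
    print(f"--- {lang} ---")
    print(generate(model, tokenizer, prompt=prompt, max_tokens=50))
```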
### Programming Example
```python
# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)
```
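Replies to coding prompts usually arrive wrapped in Markdown code fences. A hypothetical helper (not part of `mlx-lm`) to pull out just the code:
```python
import re

# Build the fence marker programmatically so this snippet can live
# inside a Markdown document without breaking its own code block
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(response: str) -> list[str]:
    """Return the bodies of fenced code blocks in a model response."""
    return CODE_BLOCK.findall(response)

for block in extract_code_blocks(response):
    print(block)
```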
## Command Line Usage
```bash
# Simple generation
python -m mlx_lm generate \
    --model your-username/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50
# Interactive chat
python -m mlx_lm chat --model your-username/sarvam-m-4bit-mlx
```
## Performance Benchmarks
| Metric | Value |
|--------|-------|
| Model Size | ~12GB |
| Peak Memory Usage | ~13.3GB |
| Generation Speed | 18-36 tokens/sec |
| Quantization Bits | 4.5 bits per weight |
| Supported Languages | 11 (English + 10 Indic) |
### Quality Highlights
- **Math**: Accurate arithmetic and reasoning
- **Hindi**: Native-level language understanding
- **Programming**: Strong code generation capabilities
- **Cultural Context**: Indian-specific knowledge and values
## Hardware Requirements
- **Minimum**: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
- **Recommended**: 32GB+ RAM for optimal performance
- **Storage**: ~15GB free space
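A quick way to check your machine against the memory bar (POSIX `sysconf`, available on macOS):
```python
import os

# Total physical RAM in GB
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
print(f"Total RAM: {total_gb:.1f} GB")
if total_gb < 16:
    print("Below the 16GB minimum recommended for this model.")
```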
## Supported Languages
1. **English** - Primary language
2. **Hindi** (हिन्दी) - 28% of Indic data
3. **Bengali** (বাংলা) - 8% of Indic data
4. **Gujarati** (ગુજરાતી) - 8% of Indic data
5. **Kannada** (ಕನ್ನಡ) - 8% of Indic data
6. **Malayalam** (മലയാളം) - 8% of Indic data
7. **Marathi** (मराठी) - 8% of Indic data
8. **Oriya** (ଓଡ଼ିଆ) - 8% of Indic data
9. **Punjabi** (ਪੰਜਾਬੀ) - 8% of Indic data
10. **Tamil** (தமிழ்) - 8% of Indic data
11. **Telugu** (తెలుగు) - 8% of Indic data
## License
This model follows the same license as the original Sarvam-M model. Please refer to the [original model card](https://huggingface.co/sarvamai/sarvam-m) for license details.
## Citation
```bibtex
@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/your-username/sarvam-m-4bit-mlx}
}
```
## Credits
- **Original Model**: [Sarvam AI](https://sarvam.ai/) for creating Sarvam-M
- **Base Model**: Built on [Mistral Small](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)
- **MLX Framework**: [Apple's MLX team](https://github.com/ml-explore/mlx)
- **Quantization**: Community contribution using MLX-LM tools
## Issues and Support
For issues specific to this MLX version:
- Check that you're using Apple Silicon hardware
- Ensure MLX is properly installed
- Verify you have sufficient RAM (16GB minimum)
For general model issues, refer to the [original Sarvam-M repository](https://huggingface.co/sarvamai/sarvam-m).
---
*This model was quantized using MLX-LM tools and optimized for Apple Silicon. It maintains the quality and capabilities of the original Sarvam-M while providing significant efficiency improvements.*