---
library_name: mlx
license: apache-2.0
language:
- en
- bn
- hi
- kn
- gu
- mr
- ml
- or
- pa
- ta
- te
base_model: sarvamai/sarvam-m
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- quantized
- 4bit
- indian-languages
- multilingual
- apple-silicon
- sarvam
- mistral
---

# Sarvam-M 4-bit MLX

This is a 4-bit quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m) optimized for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).

## Model Details

- **Base Model**: [Sarvam-M](https://huggingface.co/sarvamai/sarvam-m) (24B parameters)
- **Quantization**: 4-bit (≈4.5 effective bits per weight once per-group scales and biases are included)
- **Framework**: MLX (optimized for Apple Silicon)
- **Model Size**: ~12GB (75% reduction from original ~48GB)
- **Languages**: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
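
For reference, a checkpoint like this can be produced from the original weights with `mlx_lm`'s conversion utility. The sketch below shows the general recipe; the exact group size used for this checkpoint is an assumption (64 is the MLX default).

```python
from mlx_lm import convert

# Quantize the original Sarvam-M weights to 4-bit MLX format.
# q_bits=4 plus per-group scales/biases yields ~4.5 effective bits per weight.
convert(
    "sarvamai/sarvam-m",           # source weights on the Hugging Face Hub
    mlx_path="sarvam-m-4bit-mlx",  # local output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,               # assumed; the MLX default
)
```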

## Key Features

- **🇮🇳 Indic Language Excellence**: Specifically optimized for Indian languages with cultural context
- **🧮 Hybrid Reasoning**: Supports both "thinking" and "non-thinking" modes for different use cases
- **⚡ Fast Inference**: Generates 4-6x faster than comparable full-precision models while maintaining quality
- **🎯 Versatile**: Strong performance in math, programming, and multilingual tasks
- **💻 Apple Silicon Optimized**: Runs efficiently on M1/M2/M3/M4 Macs

## Installation

```bash
# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
```
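
Before downloading ~12GB of weights, it can be worth confirming the environment with a quick sanity check (standard library plus MLX only):

```python
import platform
from importlib.metadata import version

import mlx.core as mx

# MLX only runs on Apple Silicon (arm64); Intel Macs are not supported.
assert platform.machine() == "arm64", "MLX requires an Apple Silicon Mac"
print("mlx:", version("mlx"), "| mlx-lm:", version("mlx-lm"))
print("Default device:", mx.default_device())  # should report the Metal GPU
```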

## 🛠️ LM Studio Setup

**Having issues with short responses or "EOS token" problems in LM Studio?**

👉 **[See the complete LM Studio Setup Guide](./LM_Studio_Setup_Guide.md)** 

**Quick Fix:** Use proper chat formatting:
```
[INST] Your question here [/INST]
```

The model requires specific prompt formatting to work correctly in LM Studio.
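
If you are scripting against LM Studio's local server (OpenAI-compatible, default port 1234) rather than the chat UI, you can apply the same formatting yourself. A minimal sketch; the port and the hard-coded question are assumptions to adapt:

```python
import requests

def format_mistral_prompt(question: str) -> str:
    """Wrap a user question in the Mistral-style [INST] tags the model expects."""
    return f"[INST] {question} [/INST]"

# LM Studio's local server listens on port 1234 by default.
resp = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "prompt": format_mistral_prompt("What is the capital of India?"),
        "max_tokens": 100,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```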


## Usage

### Basic Generation

```python
from mlx_lm import load, generate

# Load the model
model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model, 
    tokenizer, 
    prompt="What is the capital of India?", 
    max_tokens=50
)
print(response)
```

### Chat with Thinking Mode Control

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)  # Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)  # Output: <think>Let me calculate...</think> The answer is 345.
```
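
When thinking mode is enabled, the reasoning arrives inline between `<think>` tags. If you only want the final answer for display, a small post-processing helper (a convenience sketch, not part of `mlx_lm`):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()

# Continuing from the example above:
reasoning, answer = split_thinking(response)
print("Reasoning:", reasoning)
print("Answer:", answer)
```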

### Hindi Language Example

```python
# Hindi conversation
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
```

### Programming Example

```python
# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)
```

## Command Line Usage

```bash
# Simple generation
python -m mlx_lm generate \
    --model your-username/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model your-username/sarvam-m-4bit-mlx
```

## Performance Benchmarks

| Metric | Value |
|--------|-------|
| Model Size | ~12GB |
| Peak Memory Usage | ~13.3GB |
| Generation Speed | 18-36 tokens/sec |
| Quantization Bits | 4.5 bits per weight |
| Supported Languages | 11 (English + 10 Indic) |
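
Throughput depends heavily on the specific chip and prompt length. To measure tokens/sec on your own machine, a rough timing sketch (the prompt is arbitrary; `generate(..., verbose=True)` also prints mlx_lm's own statistics):

```python
import time

from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")
prompt = "Explain the history of Indian railways in two paragraphs."

start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
elapsed = time.perf_counter() - start

# Output tokens divided by wall-clock time; includes prompt processing,
# so it slightly understates pure decode speed.
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```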

### Quality Highlights

- **Math**: Accurate arithmetic and reasoning
- **Hindi**: Native-level language understanding
- **Programming**: Strong code generation capabilities
- **Cultural Context**: Indian-specific knowledge and values

## Hardware Requirements

- **Minimum**: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
- **Recommended**: 32GB+ RAM for optimal performance
- **Storage**: ~15GB free space

## Supported Languages

1. **English** - Primary language
2. **Hindi** (हिन्दी)
3. **Bengali** (বাংলা)
4. **Gujarati** (ગુજરાતી)
5. **Kannada** (ಕನ್ನಡ)
6. **Malayalam** (മലയാളം)
7. **Marathi** (मराठी)
8. **Oriya** (ଓଡ଼ିଆ)
9. **Punjabi** (ਪੰਜਾਬੀ)
10. **Tamil** (தமிழ்)
11. **Telugu** (తెలుగు)

Hindi accounts for the largest share of the Indic training data, with the other nine Indic languages roughly evenly represented.
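
The chat-template calls from the usage examples work unchanged across all of these languages. A short sketch asking the same question ("What is the capital of India?") in three of them:

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

questions = [
    "भारत की राजधानी क्या है?",    # Hindi
    "இந்தியாவின் தலைநகரம் எது?",   # Tamil
    "ভারতের রাজধানী কী?",          # Bengali
]

for question in questions:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        enable_thinking=False,
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=50))
```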

## License

This model follows the same license as the original Sarvam-M model. Please refer to the [original model card](https://huggingface.co/sarvamai/sarvam-m) for license details.

## Citation

```bibtex
@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/your-username/sarvam-m-4bit-mlx}
}
```

## Credits

- **Original Model**: [Sarvam AI](https://sarvam.ai/) for creating Sarvam-M
- **Base Model**: Built on [Mistral Small](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)
- **MLX Framework**: [Apple's MLX team](https://github.com/ml-explore/mlx)
- **Quantization**: Community contribution using MLX-LM tools

## Issues and Support

For issues specific to this MLX version:
- Check that you're using Apple Silicon hardware
- Ensure MLX is properly installed
- Verify you have sufficient RAM (16GB minimum); a quick self-check covering all three points is sketched below
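
A minimal self-check (uses macOS's `sysctl` for the RAM figure):

```python
import platform
import subprocess

# 1. Apple Silicon?
print("Architecture:", platform.machine())  # expect "arm64"

# 2. MLX importable?
try:
    import mlx.core as mx
    print("MLX device:", mx.default_device())
except ImportError:
    print("MLX is not installed; run: pip install mlx-lm")

# 3. Enough RAM? macOS reports total memory via sysctl.
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
print(f"RAM: {mem_bytes / 1e9:.0f} GB (16 GB minimum recommended)")
```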

For general model issues, refer to the [original Sarvam-M repository](https://huggingface.co/sarvamai/sarvam-m).

---

*This model was quantized using MLX-LM tools and optimized for Apple Silicon. It preserves most of the quality and capabilities of the original Sarvam-M while cutting the memory footprint by roughly 75%.*