---
library_name: mlx
license: apache-2.0
language:
- en
- bn
- hi
- kn
- gu
- mr
- ml
- or
- pa
- ta
- te
base_model: sarvamai/sarvam-m
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- quantized
- 4bit
- indian-languages
- multilingual
- apple-silicon
- sarvam
- mistral
---

# Sarvam-M 4-bit MLX

This is a 4-bit quantized version of [sarvamai/sarvam-m](https://huggingface.co/sarvamai/sarvam-m) optimized for Apple Silicon using [MLX](https://github.com/ml-explore/mlx).

## Model Details

- **Base Model**: [Sarvam-M](https://huggingface.co/sarvamai/sarvam-m) (24B parameters)
- **Quantization**: 4-bit (≈4.5 effective bits per weight once per-group scales and biases are included)
- **Framework**: MLX (optimized for Apple Silicon)
- **Model Size**: ~12GB (75% reduction from original ~48GB)
- **Languages**: English + 10 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
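
For reference, a checkpoint like this can be produced from the original weights with `mlx_lm`'s conversion utility. The sketch below shows the general recipe; the exact group size used for this checkpoint is an assumption (64 is the MLX default).

```python
from mlx_lm import convert

# Quantize the original Sarvam-M weights to 4-bit MLX format.
# q_bits=4 plus per-group scales/biases yields ~4.5 effective bits per weight.
convert(
    "sarvamai/sarvam-m",           # source weights on the Hugging Face Hub
    mlx_path="sarvam-m-4bit-mlx",  # local output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,               # assumed; the MLX default
)
```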

## Key Features

- **🇮🇳 Indic Language Excellence**: Specifically optimized for Indian languages with cultural context
- **🧮 Hybrid Reasoning**: Supports both "thinking" and "non-thinking" modes for different use cases
- **⚡ Fast Inference**: Generates 4-6x faster than comparable full-precision models while maintaining quality
- **🎯 Versatile**: Strong performance in math, programming, and multilingual tasks
- **💻 Apple Silicon Optimized**: Runs efficiently on M1/M2/M3/M4 Macs

## Installation

```bash
# Install MLX and dependencies
pip install mlx-lm transformers

# For chat functionality (optional)
pip install gradio
```
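
Before downloading ~12GB of weights, it can be worth confirming the environment with a quick sanity check (standard library plus MLX only):

```python
import platform
from importlib.metadata import version

import mlx.core as mx

# MLX only runs on Apple Silicon (arm64); Intel Macs are not supported.
assert platform.machine() == "arm64", "MLX requires an Apple Silicon Mac"
print("mlx:", version("mlx"), "| mlx-lm:", version("mlx-lm"))
print("Default device:", mx.default_device())  # should report the Metal GPU
```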

## 🛠️ LM Studio Setup

**Having issues with short responses or "EOS token" problems in LM Studio?**

👉 **[See the complete LM Studio Setup Guide](./LM_Studio_Setup_Guide.md)** 

**Quick Fix:** Use proper chat formatting:
```
[INST] Your question here [/INST]
```

The model requires specific prompt formatting to work correctly in LM Studio.
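
If you are scripting against LM Studio's local server (OpenAI-compatible, default port 1234) rather than the chat UI, you can apply the same formatting yourself. A minimal sketch; the port and the hard-coded question are assumptions to adapt:

```python
import requests

def format_mistral_prompt(question: str) -> str:
    """Wrap a user question in the Mistral-style [INST] tags the model expects."""
    return f"[INST] {question} [/INST]"

# LM Studio's local server listens on port 1234 by default.
resp = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "prompt": format_mistral_prompt("What is the capital of India?"),
        "max_tokens": 100,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```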


## Usage

### Basic Generation

```python
from mlx_lm import load, generate

# Load the model
model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# Simple generation
response = generate(
    model, 
    tokenizer, 
    prompt="What is the capital of India?", 
    max_tokens=50
)
print(response)
```

### Chat with Thinking Mode Control

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

# No thinking mode (direct answers)
messages = [{'role': 'user', 'content': 'What is 2+2?'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=20)
print(response)  # Output: The sum of 2 and 2 is **4**.

# With thinking mode (shows reasoning)
messages = [{'role': 'user', 'content': 'Solve: 15 * 23'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)  # Output: <think>Let me calculate...</think> The answer is 345.
```
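
When thinking mode is enabled, the reasoning arrives inline between `<think>` tags. If you only want the final answer for display, a small post-processing helper (a convenience sketch, not part of `mlx_lm`):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()

# Continuing from the example above:
reasoning, answer = split_thinking(response)
print("Reasoning:", reasoning)
print("Answer:", answer)
```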

### Hindi Language Example

```python
# Hindi conversation
messages = [{'role': 'user', 'content': 'भारत की राजधानी क्या है?'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=50)
print(response)
# Output: भारत की राजधानी **नई दिल्ली** है। यह देश की राजनीतिक, प्रशासनिक...
```

### Programming Example

```python
# Code generation
messages = [{'role': 'user', 'content': 'Write a Python function to calculate fibonacci numbers'}]
prompt = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    enable_thinking=False
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=150)
print(response)
```

## Command Line Usage

```bash
# Simple generation
python -m mlx_lm generate \
    --model your-username/sarvam-m-4bit-mlx \
    --prompt "Hello, how are you?" \
    --max-tokens 50

# Interactive chat
python -m mlx_lm chat --model your-username/sarvam-m-4bit-mlx
```

## Performance Benchmarks

| Metric | Value |
|--------|-------|
| Model Size | ~12GB |
| Peak Memory Usage | ~13.3GB |
| Generation Speed | 18-36 tokens/sec |
| Quantization Bits | 4.5 bits per weight |
| Supported Languages | 11 (English + 10 Indic) |
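
Throughput depends heavily on the specific chip and prompt length. To measure tokens/sec on your own machine, a rough timing sketch (the prompt is arbitrary; `generate(..., verbose=True)` also prints mlx_lm's own statistics):

```python
import time

from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")
prompt = "Explain the history of Indian railways in two paragraphs."

start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
elapsed = time.perf_counter() - start

# Output tokens divided by wall-clock time; includes prompt processing,
# so it slightly understates pure decode speed.
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```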

### Quality Highlights

- **Math**: Accurate arithmetic and reasoning
- **Hindi**: Native-level language understanding
- **Programming**: Strong code generation capabilities
- **Cultural Context**: Indian-specific knowledge and values

## Hardware Requirements

- **Minimum**: Apple Silicon Mac (M1/M2/M3/M4) with 16GB RAM
- **Recommended**: 32GB+ RAM for optimal performance
- **Storage**: ~15GB free space

## Supported Languages

1. **English** - Primary language
2. **Hindi** (हिन्दी)
3. **Bengali** (বাংলা)
4. **Gujarati** (ગુજરાતી)
5. **Kannada** (ಕನ್ನಡ)
6. **Malayalam** (മലയാളം)
7. **Marathi** (मराठी)
8. **Oriya** (ଓଡ଼ିଆ)
9. **Punjabi** (ਪੰਜਾਬੀ)
10. **Tamil** (தமிழ்)
11. **Telugu** (తెలుగు)

Hindi accounts for the largest share of the Indic training data, with the other nine Indic languages roughly evenly represented.
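
The chat-template calls from the usage examples work unchanged across all of these languages. A short sketch asking the same question ("What is the capital of India?") in three of them:

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/sarvam-m-4bit-mlx")

questions = [
    "भारत की राजधानी क्या है?",    # Hindi
    "இந்தியாவின் தலைநகரம் எது?",   # Tamil
    "ভারতের রাজধানী কী?",          # Bengali
]

for question in questions:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        enable_thinking=False,
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=50))
```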

## License

This model follows the same license as the original Sarvam-M model. Please refer to the [original model card](https://huggingface.co/sarvamai/sarvam-m) for license details.

## Citation

```bibtex
@misc{sarvam-m-mlx,
  title={Sarvam-M 4-bit MLX: Quantized Indian Language Model for Apple Silicon},
  author={Community Contribution},
  year={2025},
  url={https://huggingface.co/your-username/sarvam-m-4bit-mlx}
}
```

## Credits

- **Original Model**: [Sarvam AI](https://sarvam.ai/) for creating Sarvam-M
- **Base Model**: Built on [Mistral Small](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503)
- **MLX Framework**: [Apple's MLX team](https://github.com/ml-explore/mlx)
- **Quantization**: Community contribution using MLX-LM tools

## Issues and Support

For issues specific to this MLX version:
- Check that you're using Apple Silicon hardware
- Ensure MLX is properly installed
- Verify you have sufficient RAM (16GB minimum); a quick self-check covering all three points is sketched below
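
A minimal self-check (uses macOS's `sysctl` for the RAM figure):

```python
import platform
import subprocess

# 1. Apple Silicon?
print("Architecture:", platform.machine())  # expect "arm64"

# 2. MLX importable?
try:
    import mlx.core as mx
    print("MLX device:", mx.default_device())
except ImportError:
    print("MLX is not installed; run: pip install mlx-lm")

# 3. Enough RAM? macOS reports total memory via sysctl.
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]))
print(f"RAM: {mem_bytes / 1e9:.0f} GB (16 GB minimum recommended)")
```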

For general model issues, refer to the [original Sarvam-M repository](https://huggingface.co/sarvamai/sarvam-m).

---

*This model was quantized using MLX-LM tools and optimized for Apple Silicon. It preserves most of the quality and capabilities of the original Sarvam-M while cutting the memory footprint by roughly 75%.*