Llama-3.2 1B 4-bit Quantized Model
Model Overview
- Base Model: meta-llama/Llama-3.2-1B
- Model Name: rautaditya/llama-3.2-1b-4bit-gptq
- Quantization: 4-bit GPTQ (Generative Pretrained Transformer Quantization)
Model Description
This is a 4-bit quantized version of the Llama-3.2 1B model, intended to reduce model size and inference latency while maintaining reasonable output quality. Quantization makes the model easier to deploy in resource-constrained environments.
Key Features
- Reduced model size
- Faster inference times
- Compatible with Hugging Face Transformers
- GPTQ quantization for optimal compression
Quantization Details
- Quantization Method: GPTQ (Generative Pretrained Transformer Quantization)
- Bit Depth: 4-bit
- Base Model: Llama-3.2 1B
- Quantization Library: AutoGPTQ (see the sketch below)
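The exact quantization settings are not documented here, but a minimal AutoGPTQ workflow along the following lines could produce a comparable 4-bit checkpoint. The group size, calibration text, and output directory below are illustrative assumptions, not values taken from this repository.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# 4-bit GPTQ configuration; group_size=128 is a common default, not confirmed for this model
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Load the full-precision base model with the quantization config attached
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# GPTQ needs a small calibration set of tokenized text to estimate quantization error;
# in practice this would be a few hundred representative samples
examples = [
    tokenizer("Quantization reduces the precision of model weights.", return_tensors="pt")
]
model.quantize(examples)

# Save the quantized weights and tokenizer locally (directory name is illustrative)
model.save_quantized("llama-3.2-1b-4bit-gptq")
tokenizer.save_pretrained("llama-3.2-1b-4bit-gptq")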
Installation Requirements
pip install transformers accelerate optimum auto-gptq torch
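As a quick sanity check after installation, the following imports should succeed (printed versions will vary by environment):

# Verify that the required packages are importable
import torch
import transformers
import accelerate
import optimum
import auto_gptq

print("transformers", transformers.__version__)
print("torch", torch.__version__)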
Usage
Transformers Pipeline
from transformers import AutoTokenizer, pipeline

model_id = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a text-generation pipeline; device_map="auto" places the model on a GPU when available
pipe = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    device_map="auto",
)

prompt = "What is the meaning of life?"
generated_text = pipe(prompt, max_new_tokens=100)
print(generated_text)
Direct Model Loading
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name = "rautaditya/llama-3.2-1b-4bit-gptq"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the already-quantized checkpoint with from_quantized
# (from_pretrained expects a full-precision model plus a quantize_config)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map="auto",
)
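Once loaded, the model can be used through the standard generate API. A minimal sketch, assuming the model and tokenizer objects from the snippet above (the generation settings are illustrative):

# Tokenize a prompt and move it to the model's device
inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to(model.device)

# Generate up to 64 new tokens; sampling parameters here are examples, not recommendations
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))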
Performance Considerations
- Memory Efficiency: Significantly reduced memory footprint compared to the full-precision model (see the rough estimate below)
- Inference Speed: Faster inference due to reduced computational requirements
- Potential Accuracy Trade-off: Minor performance degradation compared to full-precision model
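As a rough back-of-envelope illustration of the weight-storage savings (the parameter count is approximate, and overheads such as quantization scales and the KV cache are ignored):

# Approximate weight-storage estimate; not a measured figure for this checkpoint
params = 1.2e9                   # Llama-3.2-1B has roughly 1.2 billion parameters
fp16_gb = params * 2 / 1e9       # 2 bytes per weight in FP16
int4_gb = params * 0.5 / 1e9     # 0.5 bytes per weight at 4-bit
print(f"FP16 weights ~ {fp16_gb:.1f} GB, 4-bit weights ~ {int4_gb:.1f} GB")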
Limitations
- May show slight differences in output quality compared to the original model
- Performance can vary based on specific use case and inference environment
Recommended Use Cases
- Low-resource environments
- Edge computing
- Mobile applications
- Embedded systems
- Rapid prototyping
License
Please refer to the original Meta Llama 3.2 model license for usage restrictions and permissions.
Citation
If you use this model, please cite:
@misc{llama3.2_4bit_quantized,
  title     = {Llama-3.2 1B 4-bit Quantized Model},
  author    = {Raut, Aditya},
  year      = {2024},
  publisher = {Hugging Face}
}
Contributions and Feedback
- Open to suggestions and improvements
- Please file issues on the GitHub repository for any bugs or performance concerns
Acknowledgments
- Meta AI for the base Llama-3.2 model
- Hugging Face Transformers team
- AutoGPTQ library contributors