all-gguf-same-where

matrixportal committed on Mar 29

Commit 0c618fe (verified) · 1 parent: 01da532

Update app.py

Files changed (1)

app.py +0 -126

@@ -269,132 +269,6 @@ def process_model(model_id, q_method, use_imatrix, imatrix_q_method, private_rep
 | [Download](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |
 💡 **Tip:** Use `F16` for maximum precision when quality is critical
-# GGUF Model Quantization & Usage Guide with llama.cpp
-## What is GGUF and Quantization?
-**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
-- Supports multiple quantization levels
-- Works cross-platform
-- Enables fast loading and inference
-**Quantization** converts model weights to lower precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
-- Reduce model size
-- Decrease memory usage
-- Speed up inference
-- (With minor accuracy trade-offs)
-## Step-by-Step Guide
-### 1. Prerequisites
-```bash
-# System updates
-sudo apt update && sudo apt upgrade -y
-# Dependencies
-sudo apt install -y build-essential cmake python3-pip
-# Clone and build llama.cpp
-git clone https://github.com/ggerganov/llama.cpp
-cd llama.cpp
-make -j4
-```
-### 2. Using Quantized Models from Hugging Face
-My automated quantization script produces models in this format:
-```
-https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf
-```
-Download your quantized model directly:
-```bash
-wget https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf
-```
-### 3. Running the Quantized Model
-Basic usage:
-```bash
-./main -m {model_name.lower()}-q4_k_m.gguf -p "Your prompt here" -n 128
-```
-Example with a creative writing prompt:
-```bash
-./main -m {model_name.lower()}-q4_k_m.gguf \
-       -p "[INST] Write a short poem about AI quantization in the style of Shakespeare [/INST]" \
-       -n 256 -c 2048 -t 8 --temp 0.7
-```
-Advanced parameters:
-```bash
-./main -m {model_name.lower()}-q4_k_m.gguf \
-       -p "Question: What is the GGUF format?\nAnswer:" \
-       -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
-```
-### 4. Python Integration
-Install the Python package:
-```bash
-pip install llama-cpp-python
-```
-Example script:
-```python
-from llama_cpp import Llama
-# Initialize the model
-llm = Llama(
-    model_path="{model_name.lower()}-q4_k_m.gguf",
-    n_ctx=2048,
-    n_threads=8
-)
-# Run inference
-response = llm(
-    "[INST] Explain GGUF quantization to a beginner [/INST]",
-    max_tokens=256,
-    temperature=0.7,
-    top_p=0.9
-)
-print(response["choices"][0]["text"])
-```
-## Performance Tips
-1. **Hardware Utilization**:
-   - Set thread count with `-t` (typically CPU core count)
-   - Compile with CUDA/OpenCL for GPU support
-2. **Memory Optimization**:
-   - Lower quantization (like q4_k_m) uses less RAM
-   - Adjust context size with `-c` parameter
-3. **Speed/Accuracy Balance**:
-   - Higher bit quantization is slower but more accurate
-   - Reduce randomness with `--temp 0` for consistent results
-## FAQ
-**Q: What quantization levels are available?**
-A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0
-**Q: How much performance loss occurs with q4_k_m?**
-A: Typically 2-5% accuracy reduction but 4x smaller size
-**Q: How to enable GPU support?**
-A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs
-## Useful Resources
-1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
-2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
-3. [Hugging Face Model Hub](https://huggingface.co/models)
 """
             # Update the README (using ModelCard)

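
The download links in the template lines that this commit keeps all follow one URL pattern built from `new_repo_id` and `model_name.lower()`. A minimal sketch of how such a link could be assembled is below. The helper name `gguf_download_url` and the example repository and model names are hypothetical and are not taken from app.py.

```python
def gguf_download_url(new_repo_id: str, model_name: str, quant: str) -> str:
    """Build a download link in the pattern used by the README template.

    Hypothetical helper for illustration; app.py inlines this pattern
    in an f-string rather than calling a function like this.
    """
    return (
        f"https://huggingface.co/{new_repo_id}/resolve/main/"
        f"{model_name.lower()}-{quant}.gguf"
    )


# Placeholder repository and model names, chosen only for the example.
print(gguf_download_url("example-user/Example-Model-GGUF", "Example-Model", "f16"))
```

The model name is lowercased while the repository ID keeps its original case, matching the `{model_name.lower()}` expression in the template.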