matrixportal committed
Commit 0c618fe · verified · 1 Parent(s): 01da532

Update app.py

Files changed (1):
  1. app.py +0 -126

app.py CHANGED
@@ -269,132 +269,6 @@ def process_model(model_id, q_method, use_imatrix, imatrix_q_method, private_rep
  | [Download](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |

  💡 **Tip:** Use `F16` for maximum precision when quality is critical
-
- # GGUF Model Quantization & Usage Guide with llama.cpp
-
- ## What is GGUF and Quantization?
-
- **GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- - Supports multiple quantization levels
- - Works cross-platform
- - Enables fast loading and inference
-
- **Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- - Reduce model size
- - Decrease memory usage
- - Speed up inference
- - (With minor accuracy trade-offs; see the rough size estimate below)
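-
- The size and memory bullets translate directly into arithmetic. Here is a rough, illustrative estimate; the 7B parameter count and the bits-per-weight figures are assumptions for the example, not measurements of any specific model:
-
- ```python
- # Rough GGUF file-size estimate: parameters * bits per weight / 8 = bytes.
- # Real files also contain metadata and per-block scales, so treat this
- # as an order-of-magnitude sketch only.
- params = 7_000_000_000  # assumed 7B-parameter model (illustrative)
- for name, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_k_m", 4.8)]:
-     size_gb = params * bits / 8 / 1e9
-     print("%s: ~%.1f GB" % (name, size_gb))
- ```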
-
- ## Step-by-Step Guide
-
- ### 1. Prerequisites
-
- ```bash
- # System updates
- sudo apt update && sudo apt upgrade -y
-
- # Dependencies
- sudo apt install -y build-essential cmake python3-pip
-
- # Clone and build llama.cpp
- git clone https://github.com/ggerganov/llama.cpp
- cd llama.cpp
- make -j4
- ```
-
- ### 2. Using Quantized Models from Hugging Face
-
- My automated quantization script publishes models at URLs of this form:
- ```
- https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf
- ```
-
- Download your quantized model directly:
-
- ```bash
- wget https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf
- ```
-
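- If you prefer to stay in Python, the `huggingface_hub` package (`pip install huggingface_hub`) can fetch the same file into its local cache. A minimal sketch, reusing the URL placeholders above:
-
- ```python
- from huggingface_hub import hf_hub_download
-
- # Downloads the quantized GGUF file and returns its local cache path
- local_path = hf_hub_download(
-     repo_id="{new_repo_id}",
-     filename="{model_name.lower()}-q4_k_m.gguf",
- )
- print(local_path)
- ```
-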
- ### 3. Running the Quantized Model
-
- Basic usage:
- ```bash
- ./main -m {model_name.lower()}-q4_k_m.gguf -p "Your prompt here" -n 128
- ```
-
- Example with a creative writing prompt:
- ```bash
- ./main -m {model_name.lower()}-q4_k_m.gguf \
-   -p "[INST] Write a short poem about AI quantization in the style of Shakespeare [/INST]" \
-   -n 256 -c 2048 -t 8 --temp 0.7
- ```
-
- Advanced parameters:
- ```bash
- ./main -m {model_name.lower()}-q4_k_m.gguf \
-   -p "Question: What is the GGUF format?\nAnswer:" \
-   -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
- ```
-
- ### 4. Python Integration
-
- Install the Python package:
- ```bash
- pip install llama-cpp-python
- ```
-
- Example script:
- ```python
- from llama_cpp import Llama
-
- # Initialize the model
- llm = Llama(
-     model_path="{model_name.lower()}-q4_k_m.gguf",
-     n_ctx=2048,
-     n_threads=8
- )
-
- # Run inference
- response = llm(
-     "[INST] Explain GGUF quantization to a beginner [/INST]",
-     max_tokens=256,
-     temperature=0.7,
-     top_p=0.9
- )
-
- print(response["choices"][0]["text"])
- ```
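-
- Recent versions of `llama-cpp-python` also expose an OpenAI-style chat API that can apply the model's chat template for you instead of hand-writing `[INST]` tags. A minimal sketch; availability depends on your installed version:
-
- ```python
- # Chat-style call: list of messages in, assistant message out
- chat = llm.create_chat_completion(
-     messages=[
-         dict(role="user", content="Explain GGUF quantization to a beginner"),
-     ],
-     max_tokens=256,
-     temperature=0.7,
- )
- print(chat["choices"][0]["message"]["content"])
- ```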
-
- ## Performance Tips
-
- 1. **Hardware Utilization**:
-    - Set thread count with `-t` (typically your physical CPU core count)
-    - Compile with CUDA/OpenCL support to offload work to the GPU (see the sketch after this list)
-
- 2. **Memory Optimization**:
-    - Lower-bit quantizations (like q4_k_m) use less RAM
-    - Adjust context size with the `-c` parameter
-
- 3. **Speed/Accuracy Balance**:
-    - Higher-bit quantizations are slower but more accurate
-    - Reduce randomness with `--temp 0` for consistent results
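-
- As a rough illustration of tips 1 and 3 together, here is a minimal `llama-cpp-python` sketch that offloads layers to the GPU and disables sampling randomness. It assumes a GPU-enabled build of `llama-cpp-python`; with a CPU-only build, `n_gpu_layers` simply has no effect:
-
- ```python
- from llama_cpp import Llama
-
- llm = Llama(
-     model_path="{model_name.lower()}-q4_k_m.gguf",
-     n_ctx=2048,
-     n_threads=8,      # match your physical core count
-     n_gpu_layers=-1,  # offload all layers when a GPU build is available
- )
-
- # temperature=0.0 makes decoding effectively greedy, so repeated runs
- # on the same prompt produce stable output
- response = llm("Question: What is the GGUF format?\nAnswer:", max_tokens=128, temperature=0.0)
- print(response["choices"][0]["text"])
- ```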
-
- ## FAQ
-
- **Q: What quantization levels are available?**
- A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, and q8_0
-
- **Q: How much quality is lost with q4_k_m?**
- A: Typically a 2-5% accuracy reduction, for a file roughly 4x smaller than F16
-
- **Q: How do I enable GPU support?**
- A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs
-
- ## Useful Resources
-
- 1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- 2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- 3. [Hugging Face Model Hub](https://huggingface.co/models)
  """

  # Update the README (using ModelCard)
 