- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 143
- Elucidating the Design Space of Diffusion-Based Generative Models
  Paper • 2206.00364 • Published • 17
- GLU Variants Improve Transformer
  Paper • 2002.05202 • Published • 3
- StarCoder 2 and The Stack v2: The Next Generation
  Paper • 2402.19173 • Published • 147
Jeffrey Magder
jmagder
AI & ML interests
None yet
Recent Activity
upvoted an article 18 days ago: The Large Language Model Course
reacted to danielhanchen's post with 🔥 28 days ago:
🦥 Introducing Unsloth Dynamic v2.0 GGUFs!
Our v2.0 quants set new benchmarks on 5-shot MMLU and KL Divergence, meaning you can now run & fine-tune quantized LLMs while preserving as much accuracy as possible.
Llama 4: https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
DeepSeek-R1: https://huggingface.co/unsloth/DeepSeek-R1-GGUF-UD
Gemma 3: https://huggingface.co/unsloth/gemma-3-27b-it-GGUF
We made selective layer quantization much smarter. Instead of modifying only a subset of layers, we now dynamically quantize all layers, so each layer can get its own bit-width (a toy sketch of the idea appears after this post). Our dynamic method can now be applied to all LLM architectures, not just MoEs.
Blog with Details: https://docs.unsloth.ai/basics/dynamic-v2.0
All our future GGUF uploads will use Dynamic v2.0 and our hand-curated 300K–1.5M-token calibration dataset to improve conversational chat performance.
For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons of full-precision models against Dynamic v2.0, QAT, and standard iMatrix quants (a small KL divergence sketch also appears after this post).
Dynamic v2.0 aims to minimize the performance gap between full-precision models and their quantized counterparts.
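The post describes assigning a different bit-width to every layer rather than quantizing a fixed subset. The following is a minimal, hypothetical sketch of that idea: it maps a per-layer calibration sensitivity score to a bit-width. The names (`choose_bits`, `assign_layer_bits`, the sensitivity values, and the layer names) are illustrative assumptions, not Unsloth's actual API or recipe.

```python
# Illustrative sketch only: derive a per-layer bit-width from a calibration
# sensitivity score, mimicking the idea of layer-wise dynamic quantization.
# All names and thresholds here are hypothetical, not Unsloth's implementation.

def choose_bits(sensitivity: float) -> int:
    """Map a layer's calibration sensitivity to a bit-width."""
    if sensitivity > 0.8:
        return 8      # most sensitive layers keep more precision
    if sensitivity > 0.4:
        return 6
    return 4          # robust layers can be quantized more aggressively

def assign_layer_bits(layer_sensitivity: dict[str, float]) -> dict[str, int]:
    """Return a per-layer bit-width plan instead of one global bit-width."""
    return {name: choose_bits(s) for name, s in layer_sensitivity.items()}

# Example: the embedding layer is kept at higher precision than an attention
# projection (the sensitivity numbers are made up for illustration).
plan = assign_layer_bits({
    "token_embd": 0.9,
    "blk.0.attn_q": 0.3,
    "blk.0.ffn_down": 0.6,
})
print(plan)  # {'token_embd': 8, 'blk.0.attn_q': 4, 'blk.0.ffn_down': 6}
```

The point of the sketch is only the structure of the decision: one bit-width per layer, chosen from calibration data, rather than one global setting.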
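The post also benchmarks quants by KL divergence against the full-precision model. A hedged sketch of that comparison is below: it computes KL(P || Q) between the next-token distributions of a full-precision model (P) and a quantized model (Q). The logit arrays are stand-ins; real use would take logits from actual runs of both models.

```python
# Hedged sketch of the evaluation idea: compare a quantized model's next-token
# distribution against the full-precision model with KL divergence.
# The logits below are toy values, not outputs of any real model.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """KL(P || Q), where P is the full-precision model and Q the quantized one."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

# Toy example: a small logit perturbation (a good quant) gives a small KL.
full = np.array([2.0, 1.0, 0.1])
quant = np.array([1.9, 1.1, 0.1])
print(kl_divergence(full, quant))
```

A lower KL divergence means the quantized model's output distribution stays closer to the full-precision model, which is the gap Dynamic v2.0 aims to minimize.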
liked a Space 6 months ago: HuggingFaceH4/blogpost-scaling-test-time-compute
Organizations
None yet
Collections (2)
- Self-Play Preference Optimization for Language Model Alignment
  Paper • 2405.00675 • Published • 28
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  Paper • 2205.14135 • Published • 13
- Attention Is All You Need
  Paper • 1706.03762 • Published • 64
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  Paper • 2307.08691 • Published • 8
models (0)
None public yet
datasets (0)
None public yet