arXiv:2310.09259

Towards End-to-end 4-Bit Inference on Generative Large Language Models

Published on Oct 13, 2023

Abstract

AI-generated summary: Using a hybrid quantization strategy called QUIK, large generative models such as LLaMA and OPT can achieve significant speedups while maintaining accuracy, with most weights and activations compressed to 4-bit.

We show that the majority of the inference computations for large generative models such as LLaMA and OPT can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher precision. Crucially, our scheme is designed with computational efficiency in mind: we provide GPU kernels with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.1x relative to FP16 execution. Code and models are provided at https://github.com/IST-DASLab/QUIK.
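
To make the hybrid idea concrete, here is a minimal NumPy sketch of an outlier-aware W4A4 matmul: most input features go through a symmetric 4-bit integer multiplication, while a few high-magnitude "outlier" features stay in full precision. This is not the QUIK implementation (which relies on custom GPU kernels); the function names, the max-magnitude outlier heuristic, and the round-to-nearest quantizer below are illustrative assumptions.

```python
import numpy as np

def quantize_sym_4bit(x, axis):
    """Symmetric round-to-nearest 4-bit quantization along `axis`."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 7.0  # int4 range [-8, 7]
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def hybrid_matmul(x, w, num_outlier_cols=8):
    """Compute y = x @ w.T with most features in 4-bit and a few
    outlier features kept in full precision."""
    # Pick the input features with the largest magnitudes as outliers
    # (one simple heuristic; the paper's criterion may differ).
    col_norms = np.abs(x).max(axis=0)
    outliers = np.argsort(col_norms)[-num_outlier_cols:]
    regular = np.setdiff1d(np.arange(x.shape[1]), outliers)

    # 4-bit path: quantize activations per row and weights per output channel.
    xq, xs = quantize_sym_4bit(x[:, regular], axis=1)
    wq, ws = quantize_sym_4bit(w[:, regular], axis=1)
    # Integer matmul, then rescale back to floating point.
    y = (xq.astype(np.int32) @ wq.T.astype(np.int32)) * xs * ws.T

    # Full-precision path for the outlier features.
    y += x[:, outliers] @ w[:, outliers].T
    return y

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(16, 64).astype(np.float32)
print(np.abs(hybrid_matmul(x, w) - x @ w.T).max())  # small quantization error
```

Running the snippet prints the maximum absolute error against the full-precision reference matmul; raising num_outlier_cols should shrink it, mirroring the accuracy/efficiency trade-off the abstract describes.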
