arXiv:2306.04050

LLMZip: Lossless Text Compression using Large Language Models

Published on Jun 6, 2023

Submitted by akhaliq on Jun 8, 2023


Authors:

Chandra Shekhara Kaushik Valmeekam ,

AI-generated summary

A new estimate of English entropy using LLaMA-7B for next-token prediction leads to a compression algorithm that outperforms current text compression methods.

Abstract

We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in Cover and King (1978) and Lutati et al. (2023). A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
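The idea of pairing a next-symbol predictor with a lossless coder can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: a small adaptive character-level model (the hypothetical `ToyPredictor`) stands in for LLaMA-7B, symbols are characters rather than tokens, and the lossless stage is zlib applied to the rank of each true symbol in the predictor's sorted output. The average of -log2 q(symbol | context) under the predictor is a cross-entropy, which upper-bounds the entropy of the source and is the kind of quantity the abstract refers to.

```python
import math
import zlib
from collections import Counter, defaultdict


class ToyPredictor:
    """Adaptive order-k character model. Hypothetical stand-in for the LLM."""

    def __init__(self, alphabet, order=2):
        assert order >= 1 and len(alphabet) <= 256
        self.order = order
        self.alphabet = sorted(alphabet)
        self.counts = defaultdict(Counter)

    def predict(self, context):
        """Return (symbols sorted by decreasing probability, probability map)."""
        seen = self.counts[context]
        total = sum(seen.values()) + len(self.alphabet)  # add-one smoothing
        probs = {s: (seen[s] + 1) / total for s in self.alphabet}
        ranking = sorted(self.alphabet, key=lambda s: (-probs[s], s))
        return ranking, probs

    def update(self, context, symbol):
        self.counts[context][symbol] += 1


def encode(text, alphabet, order=2):
    """Map each character to its rank under the predictor; also sum -log2 q."""
    model = ToyPredictor(alphabet, order)
    ranks, bits = [], 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        ranking, probs = model.predict(context)
        ranks.append(ranking.index(ch))
        bits += -math.log2(probs[ch])
        model.update(context, ch)
    return ranks, bits


def decode(ranks, alphabet, order=2):
    """Replay the same predictor to turn ranks back into characters."""
    model = ToyPredictor(alphabet, order)
    out = []
    for r in ranks:
        context = "".join(out[-order:])
        ranking, _ = model.predict(context)
        ch = ranking[r]
        model.update(context, ch)
        out.append(ch)
    return "".join(out)


text = "the quick brown fox jumps over the lazy dog. " * 20
alphabet = set(text)  # assumed to be shared with the decoder as side information

ranks, bits = encode(text, alphabet)
payload = zlib.compress(bytes(ranks), 9)  # lossless stage on the rank stream
restored = decode(list(zlib.decompress(payload)), alphabet)

assert restored == text
print(f"cross-entropy of toy model: {bits / len(text):.3f} bits/char")
print(f"ranks + zlib:               {8 * len(payload) / len(text):.3f} bits/char")
```

A good predictor places the true symbol at rank 0 most of the time, so the rank stream is highly repetitive and compresses well. Because the decoder replays exactly the same predictor, the scheme stays lossless. An arithmetic coder driven directly by the predicted probabilities would typically land closer to the cross-entropy figure than zlib on ranks does.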


