---
license: mit
---

## GPT-2 Tokenizer with unmerged digits

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
```

A fork of the GPT-2 tokenizer, which **removes multi-digit tokens**:

```python
tokenizer('123.45')       # [16, 17, 18, 13, 19, 20]
gpt2_tokenizer('123.45')  # [10163, 13, 2231]
```

Backwards-compatible: unchanged tokens keep their GPT-2 IDs, so old token sequences still decode sensibly (removed multi-digit IDs such as 10163 decode to the empty string):

```python
tokenizer.decode([10163, 46387])       # ' pigeon'
gpt2_tokenizer.decode([10163, 46387])  # '123 pigeon'
```

- This is for my investigations into the arithmetic capabilities of large language models. There is no model here, only a tokenizer.
- [PaLM](https://arxiv.org/abs/2204.02311) does this.
- Many models (notably [GPT-3](https://arxiv.org/abs/2005.14165)) use the GPT-2 tokenizer, which doesn't do this.
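As a quick sanity check, the examples above can be verified directly. This is a minimal sketch; it assumes both tokenizers can be downloaded from the Hub and that `gpt2_tokenizer` is the stock `gpt2` tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('cyrilzhang/gpt2-numfix')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

for text in ['7', '42', '123.45', '31415926']:
    ids = tokenizer(text)['input_ids']
    assert len(ids) == len(text)          # one token per character
    assert tokenizer.decode(ids) == text  # encoding round-trips
    # stock GPT-2 merges digit runs, so it never needs more tokens
    assert len(gpt2_tokenizer(text)['input_ids']) <= len(ids)
```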
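This card doesn't document how the fork was built. A hypothetical sketch of one way to get the same encoding behavior is to filter GPT-2's BPE merge rules; the file names and the filtering rule here are assumptions, not the fork's actual recipe:

```python
# Hypothetical sketch: drop every BPE merge that would join two digits,
# so encoding can never build a multi-digit token. Assumes a local copy
# of GPT-2's merges.txt.

def joins_digits(rule: str) -> bool:
    left, right = rule.split()
    return left[-1].isdigit() and right[0].isdigit()

with open('merges.txt', encoding='utf-8') as f:
    header, *rules = f.read().splitlines()

kept = [rule for rule in rules if not joins_digits(rule)]

with open('merges-numfix.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join([header, *kept]) + '\n')
```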