CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Abstract
The study examines how different subtokenization methods affect large language models pretrained on source code. It finds that a proposed subtokenization reduces average sequence length by 17% without a drop in downstream performance, and that a carefully chosen subtokenization can improve quality by 0.5-2%, possibly at the cost of some length increase.
Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking into account code specifics. We propose a subtokenization that reduces average length by 17% without a downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5-2%, possibly with some length increase.
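To make the length comparison concrete, the following is a minimal sketch of how a relative change in average tokenized length between two subtokenizers could be computed. It is not the paper's evaluation code: the function names, the two toy tokenizers, and the two snippets are all hypothetical stand-ins chosen for illustration, and the resulting number has no relation to the 17% reported above.

```python
import re
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]


def average_length(tokenize: Tokenizer, snippets: Sequence[str]) -> float:
    """Mean number of subtokens per snippet under the given tokenizer."""
    if not snippets:
        raise ValueError("snippets must be non-empty")
    return sum(len(tokenize(s)) for s in snippets) / len(snippets)


def relative_length_change(
    baseline: Tokenizer, candidate: Tokenizer, snippets: Sequence[str]
) -> float:
    """(candidate - baseline) / baseline; negative means shorter sequences."""
    base = average_length(baseline, snippets)
    cand = average_length(candidate, snippets)
    return (cand - base) / base


# Hypothetical toy tokenizers, standing in for real subtokenizers.
def split_on_punctuation(code: str) -> List[str]:
    # Every identifier/number and every punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", code)


def split_on_whitespace(code: str) -> List[str]:
    # Punctuation stays glued to neighbouring characters, so tokens are longer.
    return code.split()


# Hypothetical toy corpus.
snippets = ["x = foo(a, b)", "return x + 1"]

change = relative_length_change(split_on_punctuation, split_on_whitespace, snippets)
print(f"relative length change: {change:.1%}")
```

A length reduction measured this way only says how many tokens the model has to process; whether downstream quality is preserved has to be checked separately by pretraining and fine-tuning with each subtokenization.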