🦉 Shuu12121/CodeModernBERT-Owl-2.0-Pre

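CodeModernBERT-Owl-2.0-Pre is the latest pretrained model in the CodeModernBERT-Owl series, built for multilingual code understanding and retrieval.

The model is pretrained exclusively on a high-quality corpus that was collected and built independently, roughly four times the size of the bimodal training data used for CodeBERT (Feng et al., 2020). It also uses about twice as much data as the previous version (CodeModernBERT-Owl-1.0), which lets it learn richer syntactic and semantic information.

In addition to the seven previously supported languages (Python, Java, JavaScript, PHP, Ruby, Go, Rust), TypeScript has been added to the corpus, broadening language coverage.

Training used code sequences of up to 2048 tokens, and the model can process inputs of up to 8192 tokens at inference time (the position embeddings have been extended).

The following custom preprocessing and filtering steps are combined to remove noise and improve training efficiency and accuracy:

Strict extraction of functions and docstrings through Tree-sitter-based parsing
Removal of non-English docstrings and meaningless boilerplate comments
Detection and automatic masking of API keys and other secrets
Exclusion of functions that contain license information
Deduplication of function/docstring pairs (to guard against data leakage)

Basic Information

Supported languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust, TypeScript
Maximum token length during training: 2048
Maximum token length during inference: 8192 (extended)
Tokenizer: custom-trained, BPE-based
Model size: about 150M parameters (ModernBERT-based)

Main use cases:

Function-level code search (natural language → code)
Downstream tasks such as code completion, summarization, classification, and code clone detection
Code retrieval backbone for Retrieval-Augmented Generation (RAG)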

English version

CodeModernBERT-Owl-2.0-Pre is the latest pretrained model in the CodeModernBERT-Owl series for multilingual code understanding and retrieval.

This model was trained entirely on a custom-built, high-quality corpus that is approximately four times the size of the bimodal dataset used for CodeBERT (Feng et al., 2020). Compared to the previous version (CodeModernBERT-Owl-1.0), it was trained on approximately twice as much data, allowing it to capture more structural and semantic patterns.

TypeScript has been added to the seven previously supported languages (Python, Java, JavaScript, PHP, Ruby, Go, Rust), further broadening the model's applicability.

The model was trained on inputs up to 2048 tokens, and supports inference up to 8192 tokens thanks to extended positional embeddings.
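
For inputs longer than the 8192-token inference limit, one option is to split the token sequence into overlapping windows before encoding. The sketch below is a minimal illustration using plain Python lists of token IDs; the function name `split_into_windows` and the 256-token overlap are assumptions made for this example, not part of the model or its tokenizer.

```python
from typing import List

MAX_TRAINING_TOKENS = 2048   # longest sequences used during pretraining (from this card)
MAX_INFERENCE_TOKENS = 8192  # inference limit stated in this card


def split_into_windows(
    token_ids: List[int],
    max_len: int = MAX_INFERENCE_TOKENS,
    overlap: int = 256,  # hypothetical value chosen for illustration
) -> List[List[int]]:
    """Split token IDs into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that a function
    straddling a boundary appears whole in at least one window more often.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    stride = max_len - overlap
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the input
        start += stride
    return windows
```

Since training used sequences of up to 2048 tokens, a smaller limit can be passed as `max_len=MAX_TRAINING_TOKENS` if shorter windows are preferred.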

A set of custom preprocessing and filtering steps was applied to ensure data quality and training stability:

Precise function and docstring extraction via Tree-sitter-based parsing
Removal of non-English docstrings and meaningless boilerplate comments
Automatic detection and masking of API keys and secrets
Exclusion of functions that contain license information
Deduplication of code/docstring pairs to prevent data leakage
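
The filtering code itself is not published in this card, so the following is only a rough sketch of how three of the steps above (secret masking, non-English filtering, and pair deduplication) could look. All function names, regular expressions, and the ASCII-ratio threshold are hypothetical stand-ins, and the ASCII heuristic is a crude substitute for real language detection.

```python
import hashlib
import re
from typing import Iterable, List, Tuple

# Hypothetical detectors; the actual rules used for the corpus are not published.
AWS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")  # shape of an AWS access key ID
ASSIGNED_SECRET = re.compile(
    r"""(?i)(api[_-]?key|secret|token)(\s*[:=]\s*)(["'])[^"']{8,}\3"""
)
MASK = "<MASKED>"


def mask_secrets(code: str) -> str:
    """Replace secret-looking literals with a placeholder, keeping the variable name."""
    code = AWS_KEY_ID.sub(MASK, code)
    return ASSIGNED_SECRET.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}{MASK}{m.group(3)}",
        code,
    )


def is_mostly_ascii(docstring: str, threshold: float = 0.9) -> bool:
    """Crude English filter: keep docstrings whose characters are mostly ASCII."""
    if not docstring:
        return False
    ascii_count = sum(1 for ch in docstring if ch.isascii())
    return ascii_count / len(docstring) >= threshold


def deduplicate_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop (code, docstring) pairs that are identical after whitespace normalization."""
    seen = set()
    unique: List[Tuple[str, str]] = []
    for code, doc in pairs:
        normalized = " ".join(code.split()) + "\x00" + " ".join(doc.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((code, doc))
    return unique
```

Exact-match hashing only catches copies that differ in whitespace; near-duplicate detection would need a fuzzier technique, which this sketch does not attempt.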

Supported Languages: Python, Java, JavaScript, PHP, Ruby, Go, Rust, TypeScript
Max Training Sequence Length: 2048 tokens
Max Inference Sequence Length: 8192 tokens (positionally extended)
Tokenizer: Custom-trained BPE
Model Size: ~150M parameters (ModernBERT backbone)

Primary Use Cases:

Function-level code search (natural language → code)
Downstream tasks such as code completion, summarization, classification, and clone detection
High-quality retrieval for RAG (Retrieval-Augmented Generation) systems
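
Function-level code search typically comes down to embedding the query and each candidate function, then ranking candidates by similarity. The sketch below assumes the embeddings already exist as plain Python lists, for example obtained by mean-pooling the model's token vectors (an assumption here, since this card does not specify a pooling method). `mean_pool`, `cosine_similarity`, and `rank_snippets` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(
    token_vectors: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose attention mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_snippets(
    query_vec: Sequence[float],
    snippet_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (snippet index, similarity) pairs, most similar first."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(snippet_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In a RAG setup, the indices returned by `rank_snippets` would select which functions to pass to the generator. A pretrained checkpoint may benefit from task-specific fine-tuning before its embeddings are used this way.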
