|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- multilingual |
|
tags: |
|
- code-to-docstring |
|
- code-summarization |
|
- code-documentation |
|
- encoder-decoder |
|
- code |
|
- python |
|
- java |
|
- transformers |
|
- huggingface |
|
- modernbert |
|
- gpt2 |
|
base_model: |
|
- Shuu12121/CodeModernBERT-Ghost |
|
- openai-community/gpt2-large |
|
pipeline_tag: text2text-generation |
|
--- |
|
|
|
# CodeEncoderDecoderModel-Ghost-large👻 |
|
|
|
A multilingual encoder-decoder model for generating **docstrings from code snippets**. |
|
It pairs a custom ModernBERT-style encoder pretrained on source code (`CodeModernBERT-Ghost`) with a large-scale GPT-2 decoder (`GPT2-large`).
|
|
|
## 🏗️ Model Architecture |
|
|
|
- **Encoder:** [`Shuu12121/CodeModernBERT-Ghost`](https://huggingface.co/Shuu12121/CodeModernBERT-Ghost) |
|
- **Decoder:** [`openai-community/gpt2-large`](https://huggingface.co/openai-community/gpt2-large) |
|
- Connected via Hugging Face's `EncoderDecoderModel`, with cross-attention layers added to the decoder (a construction sketch follows below).
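
This card does not include the construction script, but a minimal sketch of how such a pairing can be assembled with 🤗 Transformers is shown below. It is illustrative only: it assumes the standard `from_encoder_decoder_pretrained` helper and reuses GPT-2's EOS token as the padding token; the released checkpoint was built and fine-tuned separately.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Pair the code encoder with the GPT-2 decoder.
# from_encoder_decoder_pretrained adds randomly initialized cross-attention
# layers to the decoder; these must then be trained.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "Shuu12121/CodeModernBERT-Ghost",
    "openai-community/gpt2-large",
)

decoder_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token  # GPT-2 has no pad token

# Generation-related settings the EncoderDecoderModel needs before training.
model.config.decoder_start_token_id = decoder_tokenizer.bos_token_id
model.config.eos_token_id = decoder_tokenizer.eos_token_id
model.config.pad_token_id = decoder_tokenizer.pad_token_id
```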
|
|
|
## 🎯 Intended Use |
|
|
|
- Generating docstrings (documentation comments) for functions or methods in multiple languages. |
|
- Summarizing code for educational or review purposes. |
|
- Assisting in automated documentation generation pipelines. |
|
|
|
Supported languages (code input): |
|
- Python |
|
- Java |
|
|
|
## 📦 How to Use |
|
|
|
```python
from transformers import AutoTokenizer, EncoderDecoderModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned encoder-decoder model and the two tokenizers
# (encoder and decoder use different vocabularies, so each has its own tokenizer).
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large").to(device)
encoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer")

# GPT-2 has no padding token by default, so reuse the EOS token for padding.
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token

code = '''
def greet(name):
    return f"Hello, {name}!"
'''

# Tokenize the source code with the encoder tokenizer.
inputs = encoder_tokenizer(code, return_tensors="pt", truncation=True, padding=True, max_length=2048).to(device)

# Generate a docstring with beam search.
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    no_repeat_ngram_size=2,
)

docstring = decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(docstring)
```
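
To document several functions at once, the same tokenizers and `generate` call work on batched inputs. A minimal sketch continuing from the snippet above (the `snippets` list is illustrative, not part of the original example):

```python
snippets = [
    "def add(a, b):\n    return a + b",
    "def is_even(n):\n    return n % 2 == 0",
]

# Pad the batch to a common length so it can be processed in one forward pass.
batch = encoder_tokenizer(snippets, return_tensors="pt", truncation=True,
                          padding=True, max_length=2048).to(device)

generated = model.generate(
    **batch,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    no_repeat_ngram_size=2,
)

for doc in decoder_tokenizer.batch_decode(generated, skip_special_tokens=True):
    print(doc)
```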
|
|
|
## 🧪 Training Details |
|
|
|
- **Task:** Code-to-docstring generation |
|
- **Dataset:** [CodeXGLUE: Code-to-Text](https://github.com/microsoft/CodeXGLUE) – using subsets of Python, Java, JavaScript, Go, Ruby, PHP |
|
- **Loss:** Cross-entropy loss over tokenized docstrings |
|
- **Max lengths:** 2048 tokens (encoder input), 256 tokens (decoder output)
|
- **Decoder modifications:** GPT2-large adapted with a padding token and cross-attention layers (see the training sketch below)
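
The exact training script is not part of this card. The following is a minimal sketch of how a single code-docstring pair could be turned into a training loss with this architecture; the `code` and `docstring` values are placeholders, and the `-100` label masking is the usual Transformers convention rather than a confirmed detail of the original run.

```python
import torch
from transformers import AutoTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "Shuu12121/CodeModernBERT-Ghost", "openai-community/gpt2-large"
)
encoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Ghost")
decoder_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-large")
decoder_tokenizer.pad_token = decoder_tokenizer.eos_token

model.config.decoder_start_token_id = decoder_tokenizer.bos_token_id
model.config.eos_token_id = decoder_tokenizer.eos_token_id
model.config.pad_token_id = decoder_tokenizer.pad_token_id

code = "def add(a, b):\n    return a + b"
docstring = "Return the sum of a and b."

# Encoder sees the code; the docstring tokens become the labels.
inputs = encoder_tokenizer(code, return_tensors="pt", truncation=True, max_length=2048)
labels = decoder_tokenizer(docstring, return_tensors="pt", truncation=True, max_length=256).input_ids
# Padding positions are typically set to -100 so they are ignored by the cross-entropy loss.
labels[labels == decoder_tokenizer.pad_token_id] = -100

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
loss = outputs.loss  # cross-entropy over the docstring tokens
loss.backward()
```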
|
|
|
## ⚠️ Limitations & Risks |
|
|
|
1. **Generated documentation may be inaccurate, incomplete, or misleading**. Always review generated docstrings manually. |
|
2. **Formatting may not follow specific standards** (e.g., Google/NumPy style in Python or full Javadoc).
|
3. **Limited context:** Only considers single-function input; lacks broader project-level understanding. |
|
4. **Language variance:** Performance may differ depending on the programming language due to data distribution. |
|
5. **⚠️ Decoder risks (GPT2-large):** |
|
   GPT-2 models are known to occasionally generate inappropriate, offensive, or biased outputs, depending on the prompt.

   Although this model is fine-tuned on technical data (code-docstring pairs), it inherits these properties from `gpt2-large`, so similar risks **may still surface** in edge cases. Please exercise caution, especially when using the model in public or educational settings.
|
|
|
## 📄 License |
|
|
|
Apache-2.0 |
|
Model weights and tokenizer artifacts are released under the same license. You are free to use, modify, and redistribute with attribution. |
|
|
|
|