---
datasets:
  - fbnhnsl/FIM_Solidity_Dataset
language:
  - en
metrics:
  - bleu
  - meteor
base_model:
  - deepseek-ai/deepseek-coder-1.3b-base
pipeline_tag: text-generation
tags:
  - code
license: cc-by-4.0
---

This is deepseek-coder-1.3b-base fine-tuned for automatic completion of Solidity code. The model was fine-tuned with Quantized Low-Rank Adaptation (QLoRA), a parameter-efficient fine-tuning (PEFT) method, on a dataset that was Fill-in-the-Middle (FIM) transformed and audited with Slither.
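
To make the FIM transformation more concrete, here is a minimal sketch of how a single training sample might be built from a piece of Solidity code. It is an illustration under assumptions, not the procedure actually used for the dataset: the function name `fim_transform`, the random character-level split, and the returned dictionary layout are all hypothetical. Only the three sentinel tokens and their order are taken from the usage example in this README.

```python
import random

# Sentinel tokens as they appear in the usage example of this README.
FIM_BEGIN = "<|fim_begin|>"
FIM_HOLE = "<|fim_hole|>"
FIM_END = "<|fim_end|>"


def fim_transform(code: str, rng: random.Random) -> dict:
    """Split `code` into prefix/middle/suffix and build a FIM prompt.

    Hypothetical helper: the split points are chosen uniformly at random
    on character offsets, which is an assumption made for illustration.
    """
    # Two cut points, sorted so that start <= end.
    start, end = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:start], code[start:end], code[end:]

    # The model sees prefix and suffix and is trained to produce the middle.
    prompt = f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"
    return {"prompt": prompt, "target": middle, "prefix": prefix, "suffix": suffix}


source = "function add(uint256 a, uint256 b) internal pure returns (uint256) {\n    return a + b;\n}"
sample = fim_transform(source, random.Random(0))
print(sample["prompt"])
print(repr(sample["target"]))
```

Concatenating `prefix`, `target`, and `suffix` always reproduces the original code, so the transformation loses no information; it only reorders the text so that the missing middle comes last.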

Example usage:

# Load the fine-tuned model
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pretrained_checkpoint = 'deepseek-ai/deepseek-coder-1.3b-base'
finetuned_checkpoint = 'path/to/model'

tokenizer = AutoTokenizer.from_pretrained(finetuned_checkpoint)

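# Load the base model and resize its embeddings to match the fine-tuned tokenizer's vocabulary
old_model = AutoModelForCausalLM.from_pretrained(pretrained_checkpoint)
old_model.resize_token_embeddings(len(tokenizer))

# Attach the fine-tuned adapter weights to the base model
finetuned_model = PeftModel.from_pretrained(old_model, finetuned_checkpoint).to(device)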

# ----------------------------------------------------------------------------
# General automatic code completion
code_example = '''<|secure_function|>\tfunction add('''

model_inputs = tokenizer(code_example, return_tensors="pt").to(device)

input_ids = model_inputs["input_ids"]
attention_mask = model_inputs["attention_mask"]

generated_ids = finetuned_model.generate(input_ids,
                                         do_sample=True,
                                         max_length=256,
                                         num_beams=4,
                                         temperature=0.3,
                                         pad_token_id=tokenizer.eos_token_id,
                                         attention_mask=attention_mask)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

# Example output (sampling is enabled, so results may vary):
# 	function add(uint256 a, uint256 b) internal pure returns (uint256) {
#     return a + b;
#	}

# ----------------------------------------------------------------------------
# Fill-in-the-middle
def generate_fim(prefix, suffix, model, tokenizer, max_length=256):
    input_text = f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>"
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_beams=8,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    middle = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    return prefix + middle + suffix

prefix = '''pragma solidity ^0.8.0;\n\n'''

suffix = '''\n\ncontract FOO is Context, IERC20, Ownable {'''

print(generate_fim(prefix, suffix, finetuned_model, tokenizer))

# Example output (sampling is enabled, so results may vary):
# pragma solidity ^0.8.0;
#
# import "@openzeppelin/contracts/utils/Context.sol" as Context;
# import "@openzeppelin/contracts/interfaces/IERC20.sol" as IERC20;
# import "@openzeppelin/contracts/access/Ownable.sol" as Ownable;
#
# contract FOO is Context, IERC20, Ownable {

If you wish to use this model, you can cite it as follows:

@misc{hensel2025fim_model,
  title = {Finetuned deepseek-coder-1.3b-base model for automatic code completion of Solidity code},
  author={Fabian Hensel},
  year={2025}
}