---
language:
- ko
license: apache-2.0
tags:
- text2sql
- spider
- korean
- llama
- text-generation
- table-question-answering
datasets:
- spider
- huggingface-KREW/spider-ko
base_model: unsloth/Meta-Llama-3.1-8B-Instruct
model-index:
- name: Llama-3.1-8B-Spider-SQL-Ko
  results:
  - task:
      type: text2sql
      name: Text to SQL
    dataset:
      name: Spider (Korean)
      type: text2sql
    metrics:
    - type: exact_match
      value: 42.65
    - type: execution_accuracy
      value: 65.47
---
# Llama-3.1-8B-Spider-SQL-Ko

A Text-to-SQL model that converts Korean natural-language questions into SQL queries. It was fine-tuned on spider-ko, a Korean translation of the Spider training set.

## Performance

Results on the Korean Spider validation set (1,034 examples):

- Exact match: 42.65% (441/1034)
- Execution accuracy: 65.47% (677/1034)

Note: execution accuracy is higher than exact match because syntactically different SQL queries often return identical results.
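The gap between the two metrics can be illustrated with a toy example using Python's built-in `sqlite3` (a minimal sketch; the table contents and the two queries below are made up for illustration, not taken from the evaluation set):

```python
import sqlite3

# Toy stand-in for a Spider-style schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT, country TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?, ?, ?)",
                 [(1, "A", "KR", 30), (2, "B", "US", 45), (3, "C", "KR", 52)])

gold = "SELECT count(*) FROM singer WHERE age > 40"
pred = "SELECT COUNT(singer_id) FROM singer WHERE age >= 41"  # different string, same semantics

exact_match = (gold == pred)
execution_match = (conn.execute(gold).fetchall() == conn.execute(pred).fetchall())
print(exact_match, execution_match)  # -> False True
```

The predicted query fails exact match but passes execution match, which is exactly the case the note above describes.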
## Quick Start

```python
from unsloth import FastLanguageModel

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable inference mode

# Korean question -> SQL
question = "가수는 몇 명이 있나요?"  # "How many singers are there?"
schema = """테이블: singer
컬럼: singer_id, name, country, age"""  # "Table: singer / Columns: ..."

prompt = f"""데이터베이스 스키마:
{schema}
질문: {question}
SQL:"""

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=150, temperature=0.1)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
# Result: SELECT count(*) FROM singer
```
## Model Overview

- Base model: Llama 3.1 8B Instruct (4-bit quantized)
- Training data: spider-ko (1 epoch)
- Supported databases: 166 databases across diverse domains (Spider dataset)
- Training method: LoRA (r=16, alpha=32)
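With those LoRA settings, the trainable parameter count can be estimated from the base model's dimensions. A back-of-the-envelope sketch, assuming Llama 3.1 8B's published config (hidden size 4096, intermediate size 14336, 32 layers, grouped-query attention with 1024-dim K/V projections; these numbers come from the base model's config, not this card):

```python
# Each adapted Linear(d_in -> d_out) gains r * (d_in + d_out) parameters (the A and B matrices).
hidden, intermediate, layers, kv_dim, r = 4096, 14336, 32, 1024, 16

modules = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers
print(f"{total:,} trainable parameters (~{total / 8e9:.2%} of 8B)")
# -> 41,943,040 trainable parameters (~0.52% of 8B)
```

Only about half a percent of the weights are updated, which is why training fit on a single T4 (see the training environment below).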
## Usage Examples

### Basic Usage

```python
def generate_sql(question, schema_info):
    """Convert a Korean question into a SQL query."""
    # Korean prompt template: "Referring to the following database schema,
    # write a SQL query for the question."
    prompt = f"""다음 데이터베이스 스키마를 참고하여 질문에 대한 SQL 쿼리를 생성하세요.
### 데이터베이스 스키마:
{schema_info}
### 질문: {question}
### SQL 쿼리:"""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids=inputs, max_new_tokens=150, temperature=0.1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("### SQL 쿼리:")[-1].strip()
```
### Examples

```python
# Example 1: aggregation
question = "부서장들 중 56세보다 나이가 많은 사람이 몇 명입니까?"
# ("How many department heads are older than 56?")
# Result: SELECT count(*) FROM head WHERE age > 56

# Example 2: join
question = "가장 많은 대회를 개최한 도시의 상태는 무엇인가요?"
# ("What is the status of the city that hosted the most competitions?")
# Result: SELECT T1.Status FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID GROUP BY T2.Host_city_ID ORDER BY COUNT(*) DESC LIMIT 1

# Example 3: subquery
question = "기업가가 아닌 사람들의 이름은 무엇입니까?"
# ("What are the names of people who are not entrepreneurs?")
# Result: SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur)
```
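The generated queries are plain SQLite-compatible SQL, so they can be sanity-checked directly. A minimal sketch for Example 3 (the toy rows are invented; only the table and column names mirror the Spider database used above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (People_ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE entrepreneur (Entrepreneur_ID INTEGER PRIMARY KEY, People_ID INTEGER);
INSERT INTO people VALUES (1, 'Kim'), (2, 'Lee'), (3, 'Park');
INSERT INTO entrepreneur VALUES (1, 2);  -- only Lee is an entrepreneur
""")

# Example 3's generated query, executed verbatim:
sql = "SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur)"
names = [row[0] for row in conn.execute(sql)]
print(names)  # -> ['Kim', 'Park']
```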
## Caveats

### Limitations

- Uses English table/column names (Korean question → English SQL)
- Optimized for the Spider dataset's domains
- No support for NoSQL or graph databases
- Accuracy drops on very complex nested queries
## Technical Specifications

### Training Environment

- GPU: NVIDIA Tesla T4 (16GB)
- Training time: about 4 hours
- Memory usage: up to 7.6GB VRAM
### Hyperparameters

```python
training_args = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-4,
    "num_train_epochs": 1,
    "optimizer": "adamw_8bit",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
}

lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}
```
## References

### Citation

```bibtex
@misc{llama31_spider_sql_ko_2025,
  title={Llama-3.1-8B-Spider-SQL-Ko: Korean Text-to-SQL Model},
  author={Sohyun Sim and Youngjun Cho and Seongwoo Choi},
  year={2025},
  publisher={Hugging Face KREW},
  url={https://huggingface.co/huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko}
}
```
### Related Papers

- Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task (Yu et al., 2018)