nuatmochoi's picture
Update README.md
2b5dfe5 verified
metadata
language:
  - ko
license: apache-2.0
tags:
  - text2sql
  - spider
  - korean
  - llama
  - text-generation
  - table-question-answering
datasets:
  - spider
  - huggingface-KREW/spider-ko
base_model: unsloth/Meta-Llama-3.1-8B-Instruct
model-index:
  - name: Llama-3.1-8B-Spider-SQL-Ko
    results:
      - task:
          type: text2sql
          name: Text to SQL
        dataset:
          name: Spider (Korean)
          type: text2sql
        metrics:
          - type: exact_match
            value: 42.65
          - type: execution_accuracy
            value: 65.47

Llama-3.1-8B-Spider-SQL-Ko

ํ•œ๊ตญ์–ด ์งˆ๋ฌธ์„ SQL ์ฟผ๋ฆฌ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” Text-to-SQL ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. Spider train ๋ฐ์ดํ„ฐ์…‹์„ ํ•œ๊ตญ์–ด๋กœ ๋ฒˆ์—ญํ•œ spider-ko ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜์—ฌ ๋ฏธ์„ธ์กฐ์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๐Ÿ“Š ์ฃผ์š” ์„ฑ๋Šฅ

Spider ํ•œ๊ตญ์–ด ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ์…‹(1,034๊ฐœ) ํ‰๊ฐ€ ๊ฒฐ๊ณผ:

  • ์ •ํ™• ์ผ์น˜์œจ: 42.65% (441/1034)
  • ์‹คํ–‰ ์ •ํ™•๋„: 65.47% (677/1034)

๐Ÿ’ก ์‹คํ–‰ ์ •ํ™•๋„๊ฐ€ ์ •ํ™• ์ผ์น˜์œจ๋ณด๋‹ค ๋†’์€ ์ด์œ ๋Š”, SQL ๋ฌธ๋ฒ•์ด ๋‹ค๋ฅด๋”๋ผ๋„ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

๐Ÿš€ ๋ฐ”๋กœ ์‹œ์ž‘ํ•˜๊ธฐ

from unsloth import FastLanguageModel

# ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# ํ•œ๊ตญ์–ด ์งˆ๋ฌธ โ†’ SQL ๋ณ€ํ™˜
question = "๊ฐ€์ˆ˜๋Š” ๋ช‡ ๋ช…์ด ์žˆ๋‚˜์š”?"
schema = """ํ…Œ์ด๋ธ”: singer
์ปฌ๋Ÿผ: singer_id, name, country, age"""

prompt = f"""๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ์Šคํ‚ค๋งˆ:
{schema}

์งˆ๋ฌธ: {question}
SQL:"""

# ๊ฒฐ๊ณผ: SELECT count(*) FROM singer

๐Ÿ“ ๋ชจ๋ธ ์†Œ๊ฐœ

  • ๊ธฐ๋ฐ˜ ๋ชจ๋ธ: Llama 3.1 8B Instruct (4bit ์–‘์žํ™”)
  • ํ•™์Šต ๋ฐ์ดํ„ฐ: spider-ko (1-epoch)
  • ์ง€์› DB: 166๊ฐœ์˜ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ( spider dataset )
  • ํ•™์Šต ๋ฐฉ๋ฒ•: LoRA (r=16, alpha=32)

๐Ÿ’ฌ ํ™œ์šฉ ์˜ˆ์‹œ

๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•

def generate_sql(question, schema_info):
    """ํ•œ๊ตญ์–ด ์งˆ๋ฌธ์„ SQL๋กœ ๋ณ€ํ™˜"""
    prompt = f"""๋‹ค์Œ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ์Šคํ‚ค๋งˆ๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ์งˆ๋ฌธ์— ๋Œ€ํ•œ SQL ์ฟผ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜์„ธ์š”.

### ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค ์Šคํ‚ค๋งˆ:
{schema_info}

### ์งˆ๋ฌธ: {question}

### SQL ์ฟผ๋ฆฌ:"""
    
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    
    outputs = model.generate(inputs, max_new_tokens=150, temperature=0.1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return response.split("### SQL ์ฟผ๋ฆฌ:")[-1].strip()

์‹ค์ œ ์‚ฌ์šฉ ์˜ˆ์‹œ

# ์˜ˆ์‹œ 1: ์ง‘๊ณ„ ํ•จ์ˆ˜
question = "๋ถ€์„œ์žฅ๋“ค ์ค‘ 56์„ธ๋ณด๋‹ค ๋‚˜์ด๊ฐ€ ๋งŽ์€ ์‚ฌ๋žŒ์ด ๋ช‡ ๋ช…์ž…๋‹ˆ๊นŒ?"
# ๊ฒฐ๊ณผ: SELECT count(*) FROM head WHERE age > 56

# ์˜ˆ์‹œ 2: ์กฐ์ธ
question = "๊ฐ€์žฅ ๋งŽ์€ ๋Œ€ํšŒ๋ฅผ ๊ฐœ์ตœํ•œ ๋„์‹œ์˜ ์ƒํƒœ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?"
# ๊ฒฐ๊ณผ: SELECT T1.Status FROM city AS T1 JOIN farm_competition AS T2 ON T1.City_ID = T2.Host_city_ID GROUP BY T2.Host_city_ID ORDER BY COUNT(*) DESC LIMIT 1

# ์˜ˆ์‹œ 3: ์„œ๋ธŒ์ฟผ๋ฆฌ
question = "๊ธฐ์—…๊ฐ€๊ฐ€ ์•„๋‹Œ ์‚ฌ๋žŒ๋“ค์˜ ์ด๋ฆ„์€ ๋ฌด์—‡์ž…๋‹ˆ๊นŒ?"
# ๊ฒฐ๊ณผ: SELECT Name FROM people WHERE People_ID NOT IN (SELECT People_ID FROM entrepreneur)

โš ๏ธ ์‚ฌ์šฉ ์‹œ ์ฃผ์˜์‚ฌํ•ญ

์ œํ•œ์‚ฌํ•ญ

  • โœ… ์˜์–ด ํ…Œ์ด๋ธ”/์ปฌ๋Ÿผ๋ช… ์‚ฌ์šฉ (ํ•œ๊ตญ์–ด ์งˆ๋ฌธ โ†’ ์˜์–ด SQL)
  • โœ… Spider ๋ฐ์ดํ„ฐ์…‹ ๋„๋ฉ”์ธ์— ์ตœ์ ํ™”
  • โŒ NoSQL, ๊ทธ๋ž˜ํ”„ DB ๋ฏธ์ง€์›
  • โŒ ๋งค์šฐ ๋ณต์žกํ•œ ์ค‘์ฒฉ ์ฟผ๋ฆฌ๋Š” ์ •ํ™•๋„ ํ•˜๋ฝ

๐Ÿ”ง ๊ธฐ์ˆ  ์‚ฌ์–‘

ํ•™์Šต ํ™˜๊ฒฝ

  • GPU: NVIDIA Tesla T4 (16GB)
  • ํ•™์Šต ์‹œ๊ฐ„: ์•ฝ 4์‹œ๊ฐ„
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ: ์ตœ๋Œ€ 7.6GB VRAM

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

training_args = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-4,
    "num_train_epochs": 1,
    "optimizer": "adamw_8bit",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05
}

lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", 
                      "gate_proj", "up_proj", "down_proj"]
}

๐Ÿ“š ์ฐธ๊ณ  ์ž๋ฃŒ

์ธ์šฉ

@misc{llama31_spider_sql_ko_2025,
  title={Llama-3.1-8B-Spider-SQL-Ko: Korean Text-to-SQL Model},
  author={[Sohyun Sim, Youngjun Cho, Seongwoo Choi]},
  year={2025},
  publisher={Hugging Face KREW},
  url={https://huggingface.co/huggingface-KREW/Llama-3.1-8B-Spider-SQL-Ko}
}

๊ด€๋ จ ๋…ผ๋ฌธ

๐Ÿค ๊ธฐ์—ฌ์ž

@sim-so, @choincnp, @nuatmochoi