# Boolean Search Query Model

This model is fine-tuned to convert natural language queries into boolean search expressions for academic and research databases.

## Model Details

- **Base Model**: Meta-Llama-3.1-8B
- **Training Type**: LoRA fine-tuning
- **Task**: Converting natural language to boolean search queries
- **Languages**: English
- **License**: Same as base model

## Intended Use

- Converting natural language search requests into proper boolean expressions
- Academic and research database searching
- Information retrieval query formulation

## Performance

### Test Results

Base model vs. fine-tuned model comparison:

```
Natural Query: "Studies examining the relationship between exercise and mental health"
Base: exercise AND mental health
Fine-tuned: exercise AND "mental health"  # Properly handles multi-word terms

Natural Query: "Articles about artificial intelligence ethics and regulation or policy"
Base: "artificial intelligence ethics" AND ("regulation" OR "policy")  # Treats AI ethics as one concept
Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy)  # Properly splits concepts
```

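To illustrate why the quoting difference in the first example matters, here is a minimal sketch of how a search system might split a boolean query into tokens. The `tokenize` helper and its regular expression are hypothetical and written only for this illustration; real databases have their own parsers.

```python
import re

# Hypothetical tokenizer: a quoted phrase, a parenthesis, or a bare word.
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def tokenize(query: str) -> list[str]:
    """Split a boolean query into phrases, parentheses, operators, and bare words."""
    return TOKEN_RE.findall(query)


# Without quotes, "mental" and "health" become two separate terms.
print(tokenize('exercise AND mental health'))
# With quotes, "mental health" stays together as a single phrase token.
print(tokenize('exercise AND "mental health"'))
```

Under this sketch the base model's output yields four tokens (`exercise`, `AND`, `mental`, `health`), while the fine-tuned output yields three, with `"mental health"` kept as one phrase.
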
### Key Improvements

1. Meta-term Removal
   - Automatically removes terms like "articles", "papers", "research", "studies"
   - Focuses on actual search concepts

2. Proper Term Quoting
   - Only quotes multi-word phrases
   - Single words remain unquoted

3. Logical Grouping
   - Appropriate use of parentheses for OR groups
   - Clear operator precedence

4. Minimal Formatting
   - No unnecessary parentheses
   - No duplicate terms

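The rules above can be checked mechanically on a generated query. The following is a rough sketch of such a check, not something shipped with the model: the function name `lint_boolean_query`, the tokenizing regular expression, and the short meta-term set (taken from the first item above) are all assumptions made for illustration.

```python
import re

# Assumed for illustration: the meta-terms named in the list above.
META_TERMS = {"articles", "papers", "research", "studies"}
OPERATORS = {"AND", "OR", "NOT"}
TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def lint_boolean_query(query: str) -> list[str]:
    """Return a list of rule violations; an empty list means no issue was found."""
    problems = []
    seen = set()
    depth = 0
    prev_bare = False  # True when the previous token was an unquoted term
    for tok in TOKEN_RE.findall(query):
        if tok in ("(", ")"):
            depth += 1 if tok == "(" else -1
            if depth < 0:
                problems.append("unbalanced parentheses")
                depth = 0
            prev_bare = False
        elif tok in OPERATORS:
            prev_bare = False
        else:
            quoted = tok.startswith('"')
            term = tok.strip('"').lower()
            if quoted and " " not in term:
                problems.append(f"single word should not be quoted: {tok}")
            if not quoted and prev_bare:
                # Two bare words in a row suggest a phrase that lost its quotes.
                problems.append(f"unquoted multi-word phrase near: {tok}")
            if term in META_TERMS:
                problems.append(f"meta-term should be removed: {term}")
            if term in seen:
                problems.append(f"duplicate term: {term}")
            seen.add(term)
            prev_bare = not quoted
    if depth != 0:
        problems.append("unbalanced parentheses")
    return problems


print(lint_boolean_query('exercise AND "mental health"'))
print(lint_boolean_query('exercise AND mental health'))
```

A check like this could be used to filter or flag model outputs before sending them to a database, though it only covers the surface rules listed here.
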
## Limitations

- English language only
- May handle specialized domain terminology poorly
- Supports only the boolean operators AND, OR, and NOT
- Designed for academic and research contexts

## Training Data

The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics:

- Size: 135 examples
- Format: Natural query → Boolean expression pairs
- Source: Manually curated academic search examples
- Validation: Expert-reviewed for accuracy

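As a sketch of what "natural query → boolean expression pairs" could look like in code, the snippet below models one pair and renders it into instruction-style training text. The `QueryPair` class, its field names, the shortened instruction string, and the `to_training_text` helper are hypothetical. The two sample pairs are taken from the prompt examples in this card, not from the actual dataset, whose on-disk format is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPair:
    """One training example (hypothetical schema)."""
    natural: str   # natural language request
    boolean: str   # target boolean expression


# Sample pairs copied from the prompt examples in this card.
EXAMPLES = [
    QueryPair(
        "Research on machine learning for natural language processing",
        '"machine learning" AND "natural language processing"',
    ),
    QueryPair(
        "Studies examining anxiety depression stress in workplace",
        "(anxiety OR depression OR stress) AND workplace",
    ),
]


def to_training_text(pair: QueryPair, instruction: str) -> str:
    """Render a pair as instruction/input/response text (assumed layout)."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{pair.natural}\n\n"
        f"### Response:\n{pair.boolean}"
    )


instruction = "Convert this natural language query into a boolean search query."
print(to_training_text(EXAMPLES[1], instruction))
```
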
## Training Process

- **Method**: LoRA fine-tuning
- **Hardware**: NVIDIA GeForce RTX 4070 Ti SUPER

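For a sense of why LoRA fits on a single consumer GPU, the sketch below counts the parameters LoRA adds to one weight matrix. LoRA replaces the update to a `d_out x d_in` weight with two low-rank factors, so it trains `rank * (d_in + d_out)` values instead of `d_in * d_out`. The rank of 16 and the choice of a 4096 x 4096 projection are illustrative assumptions; this card does not state the rank or target modules actually used.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one weight: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative numbers only: one 4096 x 4096 projection with an assumed rank of 16.
full = 4096 * 4096
added = lora_trainable_params(4096, 4096, rank=16)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({added / full:.2%} of the full matrix)")
```
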
## How to Use

```python
from unsloth import FastLanguageModel

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    "Zwounds/boolean-search-model",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Build the prompt around the user's query
query = "Find papers about climate change and renewable energy"
formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert this natural language query into a boolean search query by following these rules:

1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
- articles, papers, research, studies
- examining, investigating, analyzing
- findings, documents, literature
- publications, journals, reviews
Example: "Research examining X" → just "X"

2. SECOND: Remove generic implied terms that don't add search value:
- Remove words like "practices," "techniques," "methods," "approaches," "strategies"
- Remove words like "impacts," "effects," "influences," "role," "applications"
- For example: "sustainable agriculture practices" → "sustainable agriculture"
- For example: "teaching methodologies" → "teaching"
- For example: "leadership styles" → "leadership"

3. THEN: Format the remaining terms:
CRITICAL QUOTING RULES:
- Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
- Examples of correct quoting:
- Wrong: machine learning AND deep learning
- Right: "machine learning" AND "deep learning"
- Wrong: natural language processing
- Right: "natural language processing"
- Single words must NEVER have quotes (e.g., science, research, learning)
- Use AND to connect required concepts
- Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))

Example conversions showing proper quoting:
"Research on machine learning for natural language processing"
→ "machine learning" AND "natural language processing"

"Studies examining anxiety depression stress in workplace"
→ (anxiety OR depression OR stress) AND workplace

"Articles about deep learning impact on computer vision"
→ "deep learning" AND "computer vision"

"Research on sustainable agriculture practices and their impact on soil health or biodiversity"
→ "sustainable agriculture" AND ("soil health" OR biodiversity)

"Articles about effective teaching methods for second language acquisition"
→ teaching AND "second language acquisition"

### Input:
{query}

### Response:
"""

# Generate the boolean query; inputs must be on the same device as the model
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[1]
result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(result)  # Expected along the lines of: "climate change" AND "renewable energy"
```

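Depending on how the output is decoded, the text may still contain the prompt, and generation may run on past the answer until `max_new_tokens` is reached. The sketch below shows one way to trim the decoded text down to the query. The `clean_response` helper is hypothetical, and it assumes the boolean query sits on the first non-empty line after the first `### Response:` marker, which matches the single-line examples in this card but is not guaranteed.

```python
def clean_response(decoded: str, marker: str = "### Response:") -> str:
    """Return the first non-empty line after the first response marker, if any."""
    # If the prompt was decoded too, drop everything up to the first marker.
    text = decoded.split(marker, 1)[1] if marker in decoded else decoded
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


# Works on a full decode (prompt included) and on a new-tokens-only decode.
full_decode = '### Input:\nq\n\n### Response:\n"climate change" AND "renewable energy"\n\n### Input:\nextra'
print(clean_response(full_decode))
print(clean_response('  "climate change" AND "renewable energy"\n'))
```
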
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{boolean-search-llm,
  title={Boolean Search Query LLM},
  author={Stephen Zweibel},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Zwounds/boolean-search-model}
}
```