Boolean Search Query Model
This model is a LoRA fine-tune of Meta-Llama-3.1-8B that converts natural language queries into boolean search expressions for academic and research databases.
Model Details
- Base Model: Meta-Llama-3.1-8B
- Training Type: LoRA fine-tuning
- Task: Converting natural language to boolean search queries
- Languages: English
- License: Same as base model (Llama 3.1 Community License)
Intended Use
- Converting natural language search requests into proper boolean expressions
- Academic and research database searching
- Information retrieval query formulation
Performance
Test Results
Base Model vs Fine-tuned Model comparison:
Natural Query: "Studies examining the relationship between exercise and mental health"
Base: exercise AND mental health
Fine-tuned: exercise AND "mental health" # Properly handles multi-word terms
Natural Query: "Articles about artificial intelligence ethics and regulation or policy"
Base: "artificial intelligence ethics" AND ("regulation" OR "policy") # Treats AI ethics as one concept
Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy) # Properly splits concepts
Key Improvements
Meta-term Removal
- Automatically removes terms like "articles", "papers", "research", "studies"
- Focuses on actual search concepts
Proper Term Quoting
- Only quotes multi-word phrases
- Single words remain unquoted
Logical Grouping
- Appropriate use of parentheses for OR groups
- Clear operator precedence
Minimal Formatting
- No unnecessary parentheses
- No duplicate terms
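The rules above can be checked mechanically on a generated query. The following is a minimal sketch, not part of the model or of any library: the names `META_TERMS`, `tokenize`, and `check_query` are hypothetical, and the meta-term set is only the short list given in this card.

```python
import re

# Assumption: meta-terms copied from the list in this card; extend as needed.
META_TERMS = {"articles", "papers", "research", "studies"}
OPERATORS = {"AND", "OR", "NOT"}


def tokenize(expr):
    """Split a query into quoted phrases, parentheses, and bare words."""
    return re.findall(r'"[^"]*"|[()]|[^\s()"]+', expr)


def check_query(expr):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    seen = set()
    prev_was_term = False
    for tok in tokenize(expr):
        if tok in OPERATORS or tok in ("(", ")"):
            prev_was_term = False
            continue
        is_quoted = tok.startswith('"')
        term = tok.strip('"').lower()
        words = term.split()
        # Proper term quoting: only multi-word phrases are quoted.
        if is_quoted and len(words) < 2:
            problems.append(f"single word should not be quoted: {tok}")
        # Two bare words in a row mean a phrase lost its quotes
        # (or an operator is missing).
        if not is_quoted and prev_was_term:
            problems.append(f"missing quotes or operator before: {tok}")
        # Meta-term removal.
        for word in words:
            if word in META_TERMS:
                problems.append(f"meta-term present: {word}")
        # Minimal formatting: no duplicate terms.
        if term in seen:
            problems.append(f"duplicate term: {term}")
        seen.add(term)
        prev_was_term = True
    return problems


print(check_query('exercise AND "mental health"'))
print(check_query('exercise AND mental health'))
```

The first call reports no problems, while the second flags `health` because `mental health` is not quoted. A check like this is only a surface-level filter; it does not judge whether the chosen concepts are the right ones.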
Limitations
- English language only
- May not handle specialized domain terminology optimally
- Limited to boolean operators (AND, OR, NOT)
- Designed for academic/research context
Training Data
The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics:
- Size: 135 examples
- Format: Natural query → Boolean expression pairs
- Source: Manually curated academic search examples
- Validation: Expert-reviewed for accuracy
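To make the pair format concrete, here is a sketch of how one query/boolean pair could be rendered into a single training text. It assumes the training data used the same Alpaca-style template as the inference prompt in this card, which the card does not state; `ALPACA_TEMPLATE`, `build_example`, and the shortened default instruction are hypothetical. The example pairs are taken from the prompt shown in the usage section.

```python
# Assumption: the Alpaca-style layout mirrors the inference prompt in this card.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{query}\n\n"
    "### Response:\n{boolean}"
)

# Hypothetical short instruction; the real one is the full rule list.
DEFAULT_INSTRUCTION = (
    "Convert this natural language query into a boolean search query."
)


def build_example(query, boolean, instruction=DEFAULT_INSTRUCTION):
    """Render one (natural query, boolean expression) pair as training text."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction, query=query, boolean=boolean
    )


PAIRS = [
    (
        "Research on machine learning for natural language processing",
        '"machine learning" AND "natural language processing"',
    ),
    (
        "Studies examining anxiety depression stress in workplace",
        "(anxiety OR depression OR stress) AND workplace",
    ),
]

records = [build_example(query, boolean) for query, boolean in PAIRS]
print(records[0])
```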
Training Process
- Method: LoRA fine-tuning
- Hardware: NVIDIA GeForce RTX 4070 Ti SUPER
How to Use
from unsloth import FastLanguageModel
# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
"Zwounds/boolean-search-model",
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=True
)
FastLanguageModel.for_inference(model)
# Format query
query = "Find papers about climate change and renewable energy"
formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Convert this natural language query into a boolean search query by following these rules:
1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
- articles, papers, research, studies
- examining, investigating, analyzing
- findings, documents, literature
- publications, journals, reviews
Example: "Research examining X" → just "X"
2. SECOND: Remove generic implied terms that don't add search value:
- Remove words like "practices," "techniques," "methods," "approaches," "strategies"
- Remove words like "impacts," "effects," "influences," "role," "applications"
- For example: "sustainable agriculture practices" → "sustainable agriculture"
- For example: "teaching methodologies" → "teaching"
- For example: "leadership styles" → "leadership"
3. THEN: Format the remaining terms:
CRITICAL QUOTING RULES:
- Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
- Examples of correct quoting:
- Wrong: machine learning AND deep learning
- Right: "machine learning" AND "deep learning"
- Wrong: natural language processing
- Right: "natural language processing"
- Single words must NEVER have quotes (e.g., science, research, learning)
- Use AND to connect required concepts
- Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))
Example conversions showing proper quoting:
"Research on machine learning for natural language processing"
→ "machine learning" AND "natural language processing"
"Studies examining anxiety depression stress in workplace"
→ (anxiety OR depression OR stress) AND workplace
"Articles about deep learning impact on computer vision"
→ "deep learning" AND "computer vision"
"Research on sustainable agriculture practices and their impact on soil health or biodiversity"
→ "sustainable agriculture" AND ("soil health" OR biodiversity)
"Articles about effective teaching methods for second language acquisition"
→ teaching AND "second language acquisition"
### Input:
{query}
### Response:
"""
# Generate boolean query
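# Move the inputs to the same device as the model
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# outputs[0] contains the prompt followed by the completion; decode only the new tokens
prompt_len = inputs["input_ids"].shape[1]
result = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
print(result) # Expected: "climate change" AND "renewable energy"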
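When the full sequence is decoded, the text contains the prompt followed by the model's answer, so the boolean query has to be separated from the template. The helper below is a minimal sketch of one way to do that on the decoded string; `extract_boolean_query` is a hypothetical name, and it assumes the answer is the first non-empty line after the last `### Response:` marker.

```python
def extract_boolean_query(decoded, marker="### Response:"):
    """Return the first non-empty line after the last response marker.

    If the marker is absent (for example, because only the new tokens
    were decoded), the first non-empty line of the text is returned.
    """
    tail = decoded.rsplit(marker, 1)[-1]
    for line in tail.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


# Usage with a hand-written stand-in for decoded model output
decoded = (
    "### Input:\n"
    "Find papers about climate change and renewable energy\n"
    "### Response:\n"
    '"climate change" AND "renewable energy"\n'
)
print(extract_boolean_query(decoded))
```

Taking only the first line also discards any trailing text the model may produce after the query before it reaches `max_new_tokens`.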
Citation
If you use this model in your research, please cite:
@misc{boolean-search-llm,
title={Boolean Search Query LLM},
author={Stephen Zweibel},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/Zwounds/boolean-search-model}
}