# Boolean Search Query Model

This model is fine-tuned to convert natural language queries into boolean search expressions, optimized for academic and research database searching.

## Model Details

- **Base Model**: Meta-Llama-3.1-8B
- **Training Type**: LoRA fine-tuning
- **Task**: Converting natural language to boolean search queries
- **Languages**: English
- **License**: Same as base model

## Intended Use

- Converting natural language search requests into proper boolean expressions
- Academic and research database searching
- Information retrieval query formulation

## Performance

### Test Results

Comparison of base and fine-tuned model outputs on the same queries:

```
Natural Query: "Studies examining the relationship between exercise and mental health"
Base: exercise AND mental health
Fine-tuned: exercise AND "mental health"  # Properly handles multi-word terms

Natural Query: "Articles about artificial intelligence ethics and regulation or policy"
Base: "artificial intelligence ethics" AND ("regulation" OR "policy")  # Treats AI ethics as one concept
Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy)  # Properly splits concepts
```

### Key Improvements

1. Meta-term Removal
   - Automatically removes terms like "articles", "papers", "research", "studies"
   - Focuses on actual search concepts

2. Proper Term Quoting
   - Only quotes multi-word phrases
   - Single words remain unquoted

3. Logical Grouping
   - Appropriate use of parentheses for OR groups
   - Clear operator precedence

4. Minimal Formatting
   - No unnecessary parentheses
   - No duplicate terms
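
The rules above lend themselves to a quick automated check on model output. The following is a minimal sketch, not part of the released model: `check_boolean_query` and its heuristics are hypothetical, and the meta-term set only covers the four words named in this list.

```python
import re

# Meta-terms named in the list above; extend as needed (assumption: lowercase match).
META_TERMS = {"articles", "papers", "research", "studies"}
OPERATORS = {"AND", "OR", "NOT"}


def check_boolean_query(query: str) -> list[str]:
    """Return a list of rule violations found in a boolean query (heuristic)."""
    problems = []

    # Separate quoted phrases from the bare (unquoted) tokens.
    quoted = re.findall(r'"([^"]*)"', query)
    remainder = re.sub(r'"[^"]*"', " ", query)
    bare = re.findall(r"[^\s()]+", remainder)

    # Rule: single words remain unquoted.
    for phrase in quoted:
        if len(phrase.split()) < 2:
            problems.append(f"single word should not be quoted: {phrase}")

    # Rule: multi-word phrases are quoted. Two bare terms in a row,
    # with no operator between them, suggest a missing pair of quotes.
    previous_was_term = False
    for token in bare:
        is_term = token not in OPERATORS
        if is_term and previous_was_term:
            problems.append(f"unquoted multi-word phrase near: {token}")
        previous_was_term = is_term

    # Rule: meta-terms are removed.
    terms = [t.lower() for t in bare if t not in OPERATORS]
    for phrase in quoted:
        terms.extend(phrase.lower().split())
    for term in terms:
        if term in META_TERMS:
            problems.append(f"meta-term should be removed: {term}")

    # Rule: parentheses are balanced (count check only).
    if query.count("(") != query.count(")"):
        problems.append("unbalanced parentheses")

    return problems


if __name__ == "__main__":
    print(check_boolean_query('exercise AND "mental health"'))
    print(check_boolean_query("exercise AND mental health"))
```

The check is deliberately shallow: it does not parse operator precedence or detect duplicate terms, so it is better suited to flagging obvious regressions than to proving a query is well formed.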

## Limitations

- English language only
- May not handle specialized domain terminology optimally
- Limited to boolean operators (AND, OR, NOT)
- Designed for academic/research context

## Training Data

The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics:

- Size: 135 examples
- Format: Natural query → Boolean expression pairs
- Source: Manually curated academic search examples
- Validation: Expert-reviewed for accuracy
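
As an illustration of what one such pair might look like once rendered for fine-tuning, here is a small sketch. The dataset schema is not published in this card, so the `QueryPair` class, its field names, and `to_training_text` are hypothetical; the section markers are taken from the prompt template shown under "How to Use", and the sample pair is one of the example conversions from that prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPair:
    """One training example (hypothetical field names)."""
    natural_query: str
    boolean_query: str


def to_training_text(pair: QueryPair, instruction: str) -> str:
    """Render a pair using the Instruction/Input/Response markers from the prompt."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{pair.natural_query}\n\n"
        f"### Response:\n{pair.boolean_query}"
    )


if __name__ == "__main__":
    pair = QueryPair(
        natural_query="Research on machine learning for natural language processing",
        boolean_query='"machine learning" AND "natural language processing"',
    )
    # Shortened instruction for illustration only; the full rule list is in "How to Use".
    print(to_training_text(pair, "Convert this natural language query into a boolean search query."))
```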

## Training Process

- **Method**: LoRA fine-tuning
- **Hardware**: NVIDIA GeForce RTX 4070 Ti SUPER

## How to Use

```python
from unsloth import FastLanguageModel

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    "Zwounds/boolean-search-model",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True
)
FastLanguageModel.for_inference(model)

# Format query
query = "Find papers about climate change and renewable energy"
formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert this natural language query into a boolean search query by following these rules:

1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
   - articles, papers, research, studies
   - examining, investigating, analyzing
   - findings, documents, literature
   - publications, journals, reviews
   Example: "Research examining X" → just "X"

2. SECOND: Remove generic implied terms that don't add search value:
   - Remove words like "practices," "techniques," "methods," "approaches," "strategies"
   - Remove words like "impacts," "effects," "influences," "role," "applications"
   - For example: "sustainable agriculture practices" → "sustainable agriculture"
   - For example: "teaching methodologies" → "teaching"
   - For example: "leadership styles" → "leadership"

3. THEN: Format the remaining terms:
   CRITICAL QUOTING RULES:
   - Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
   - Examples of correct quoting:
     - Wrong: machine learning AND deep learning
     - Right: "machine learning" AND "deep learning"
     - Wrong: natural language processing
     - Right: "natural language processing"
   - Single words must NEVER have quotes (e.g., science, research, learning)
   - Use AND to connect required concepts
   - Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))

Example conversions showing proper quoting:
"Research on machine learning for natural language processing"
→ "machine learning" AND "natural language processing"

"Studies examining anxiety depression stress in workplace"
→ (anxiety OR depression OR stress) AND workplace

"Articles about deep learning impact on computer vision"
→ "deep learning" AND "computer vision"

"Research on sustainable agriculture practices and their impact on soil health or biodiversity"
→ "sustainable agriculture" AND ("soil health" OR biodiversity)

"Articles about effective teaching methods for second language acquisition"
→ teaching AND "second language acquisition"

### Input:
{query}

### Response:
"""

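# Generate boolean query (inputs must be on the same device as the model)
inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# outputs[0] contains the prompt followed by the answer; decode only the new tokens
prompt_length = inputs["input_ids"].shape[1]
result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(result)  # expected output along the lines of: "climate change" AND "renewable energy"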
```
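
When a decoded string still contains the prompt (for example, when the full output sequence is decoded), the answer can be recovered by splitting on the `### Response:` marker from the template. A minimal sketch follows; `extract_response` is a hypothetical helper, and keeping only the first non-empty line assumes the boolean query fits on one line.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return the first non-empty line after the last response marker."""
    _, found, tail = decoded.rpartition(RESPONSE_MARKER)
    if not found:
        # No marker present: assume the text is already just the answer.
        return decoded.strip()
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    decoded = (
        "### Input:\nFind papers about climate change and renewable energy\n\n"
        '### Response:\n"climate change" AND "renewable energy"\n'
    )
    print(extract_response(decoded))
```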

## Citation

If you use this model in your research, please cite:
```bibtex
@misc{boolean-search-llm,
  title={Boolean Search Query LLM},
  author={Stephen Zweibel},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Zwounds/boolean-search-model}
}
```