---
tags:
  - transformers
  - llama
  - boolean-search
  - search
  - language-to-query
library_name: transformers
pipeline_tag: text2text-generation
license: llama2
title: boolean-search-query-model
emoji: πŸ”
sdk: gradio
sdk_version: 4.0.0
app_file: demo.py
---

# Boolean Search Query Model

Convert natural language queries into properly formatted boolean search expressions for academic databases. This model helps researchers and librarians turn plain-language descriptions into well-formed boolean search queries.

## Features

- Converts natural language to boolean search expressions
- Handles multi-word terms with quotes (mostly reliable)
- Removes meta-terms (articles, papers, research, etc.)
- Groups OR clauses appropriately
- Minimal, clean formatting

## Installation

```bash
pip install transformers torch unsloth
```

Load the model with Unsloth for fast 4-bit inference:

```python
from unsloth import FastLanguageModel

# Load the fine-tuned checkpoint in 4-bit to keep memory usage low
model, tokenizer = FastLanguageModel.from_pretrained(
    "Zwounds/boolean-search-model",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True
)
# Switch the model into Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)
```
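
If you prefer not to install Unsloth, the checkpoint should also load with plain `transformers` (a sketch, assuming the repository ships standard Hugging Face weights; 4-bit loading would additionally require `bitsandbytes`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zwounds/boolean-search-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place weights on GPU(s) if available
    torch_dtype="auto",  # let transformers pick a suitable dtype
)
model.eval()
```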

## Quick Start

```python
# Format your query
query = "Find papers about climate change and renewable energy"
prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert this natural language query into a boolean search query by following these rules:

1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
   - articles, papers, research, studies
   - examining, investigating, analyzing
   - findings, documents, literature
   - publications, journals, reviews
   Example: "Research examining X" β†’ just "X"

2. SECOND: Remove generic implied terms that don't add search value:
   - Remove words like "practices," "techniques," "methods," "approaches," "strategies"
   - Remove words like "impacts," "effects," "influences," "role," "applications"
   - For example: "sustainable agriculture practices" β†’ "sustainable agriculture"
   - For example: "teaching methodologies" β†’ "teaching"
   - For example: "leadership styles" β†’ "leadership"

3. THEN: Format the remaining terms:
   CRITICAL QUOTING RULES:
   - Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
   - Examples of correct quoting:
     - Wrong: machine learning AND deep learning
     - Right: "machine learning" AND "deep learning"
     - Wrong: natural language processing
     - Right: "natural language processing"
   - Single words must NEVER have quotes (e.g., science, research, learning)
   - Use AND to connect required concepts
   - Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))

Example conversions showing proper quoting:
"Research on machine learning for natural language processing"
β†’ "machine learning" AND "natural language processing"

"Studies examining anxiety depression stress in workplace"
β†’ (anxiety OR depression OR stress) AND workplace

"Articles about deep learning impact on computer vision"
β†’ "deep learning" AND "computer vision"

"Research on sustainable agriculture practices and their impact on soil health or biodiversity"
β†’ "sustainable agriculture" AND ("soil health" OR biodiversity)

"Articles about effective teaching methods for second language acquisition"
β†’ teaching AND "second language acquisition"

### Input:
{query}

### Response:
"""

# Generate boolean query
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)  # "climate change" AND "renewable energy"
```
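
For repeated calls, the prompt construction and decoding can be wrapped in a small helper. A minimal sketch, assuming `INSTRUCTIONS` holds the instruction text (rules and example conversions) from the prompt above; neither the constant nor `to_boolean_query` ships with the model:

```python
# Hypothetical helper; INSTRUCTIONS is assumed to contain the rule text above.
def to_boolean_query(query: str, max_new_tokens: int = 100) -> str:
    """Convert a natural language query into a boolean search expression."""
    prompt = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{INSTRUCTIONS}\n\n"
        f"### Input:\n{query}\n\n"
        "### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the tokens generated after the prompt
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

print(to_boolean_query("Find papers about climate change and renewable energy"))
```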

## Examples

Input queries and their boolean translations (a snippet for reproducing them follows the list):

1. Natural: "Studies about anxiety depression stress in workplace"
   - Boolean: (anxiety OR depression OR stress) AND workplace

2. Natural: "Articles about artificial intelligence ethics and regulation or policy"
   - Boolean: "artificial intelligence" AND (ethics OR regulation OR policy)

3. Natural: "Research on quantum computing applications in cryptography or optimization"
   - Boolean: "quantum computing" AND (cryptography OR optimization)
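
Assuming the `to_boolean_query` helper sketched in the Quick Start, these examples can be run directly (exact output depends on the model):

```python
examples = [
    "Studies about anxiety depression stress in workplace",
    "Articles about artificial intelligence ethics and regulation or policy",
    "Research on quantum computing applications in cryptography or optimization",
]
for query in examples:
    print(f"{query}\n  -> {to_boolean_query(query)}\n")
```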

## Rules

The model follows these formatting rules (a small validator sketch follows the list):

1. Meta-terms are removed:
   - "articles", "papers", "research", "studies"
   - Focus on actual search concepts

2. Quotes only for multi-word terms:
   - "artificial intelligence" AND ethics βœ“
   - NOT: "ethics" AND "ai" βœ—

3. Logical grouping:
   - Use parentheses for OR groups
   - (x OR y) AND z

4. Minimal formatting:
   - No unnecessary parentheses
   - No repeated terms
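
These rules can be spot-checked mechanically on model output. A minimal validator sketch (not part of the released code; it covers only the quoting, grouping, and balance rules, not meta-term removal):

```python
import re

def check_boolean_query(query: str) -> list[str]:
    """Return a list of formatting-rule violations found in a boolean query."""
    problems = []
    # Rule 2: quotes belong only around multi-word phrases
    for phrase in re.findall(r'"([^"]+)"', query):
        if " " not in phrase.strip():
            problems.append(f'single word should not be quoted: "{phrase}"')
    # Rule 2 (other direction): unquoted multi-word runs probably need quotes
    unquoted = re.sub(r'"[^"]*"', " ", query)
    tokens = [t for t in re.split(r"[()\s]+", unquoted) if t]
    run = 0
    for token in tokens + ["AND"]:  # trailing sentinel flushes the last run
        if token in ("AND", "OR", "NOT"):
            if run > 1:
                problems.append("possible unquoted multi-word phrase")
            run = 0
        else:
            run += 1
    # Rule 3: parentheses used for OR grouping must balance
    if query.count("(") != query.count(")"):
        problems.append("unbalanced parentheses")
    return problems

print(check_boolean_query('"machine learning" AND (ethics OR "ai")'))
# ['single word should not be quoted: "ai"']
```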

## Local Development

```bash
# Clone repo
git clone https://github.com/your-username/boolean-search-model.git
cd boolean-search-model

# Install dependencies
pip install -r requirements.txt

# Run tests
python test_boolean_model.py
```

## Contributing

1. Fork the repository
2. Create your feature branch
3. Add tests for any new functionality
4. Submit a pull request

## Model Card

See [MODEL_CARD.md](MODEL_CARD.md) for detailed model information including:
- Training data details
- Performance metrics
- Limitations
- Intended use cases

## License

This model is subject to the Llama 2 license. See the [LICENSE](LICENSE) file for details.

## Citation

If you use this model in your research, please cite:
```bibtex
@misc{boolean-search-llm,
  title={Boolean Search Query LLM},
  author={Stephen Zweibel},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Zwounds/boolean-search-model}
}
```

## Contact

Stephen Zweibel - [@szweibel](https://github.com/szweibel)