Zwounds commited on
Commit
19f477c
Β·
verified Β·
1 Parent(s): 218a680

Upload MODEL_CARD.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. MODEL_CARD.md +171 -0
MODEL_CARD.md ADDED
@@ -0,0 +1,171 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Boolean Search Query LLM
2
+
3
+ This model is fine-tuned to convert natural language queries into boolean search expressions, optimized for academic and research database searching.
4
+
5
+ ## Model Details
6
+
7
+ - **Base Model**: Meta-Llama-3.1-8B
8
+ - **Training Type**: LoRA fine-tuning
9
+ - **Task**: Converting natural language to boolean search queries
10
+ - **Languages**: English
11
+ - **License**: Same as base model
12
+
13
+ ## Intended Use
14
+
15
+ - Converting natural language search requests into proper boolean expressions
16
+ - Academic and research database searching
17
+ - Information retrieval query formulation
18
+
19
+ ## Performance
20
+
21
+ ### Test Results
22
+
23
+ Base Model vs Fine-tuned Model comparison:
24
+
25
+ ```
26
+ Natural Query: "Studies examining the relationship between exercise and mental health"
27
+ Base: exercise AND mental health
28
+ Fine-tuned: exercise AND "mental health" # Properly handles multi-word terms
29
+
30
+ Natural Query: "Articles about artificial intelligence ethics and regulation or policy"
31
+ Base: "artificial intelligence ethics" AND ("regulation" OR "policy") # Treats AI ethics as one concept
32
+ Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy) # Properly splits concepts
33
+ ```
34
+
35
+ ### Key Improvements
36
+
37
+ 1. Meta-term Removal
38
+ - Automatically removes terms like "articles", "papers", "research", "studies"
39
+ - Focuses on actual search concepts
40
+
41
+ 2. Proper Term Quoting
42
+ - Only quotes multi-word phrases
43
+ - Single words remain unquoted
44
+
45
+ 3. Logical Grouping
46
+ - Appropriate use of parentheses for OR groups
47
+ - Clear operator precedence
48
+
49
+ 4. Minimal Formatting
50
+ - No unnecessary parentheses
51
+ - No duplicate terms
52
+
53
+ ## Limitations
54
+
55
+ - English language only
56
+ - May not handle specialized domain terminology optimally
57
+ - Limited to boolean operators (AND, OR, NOT)
58
+ - Designed for academic/research context
59
+
60
+ ## Training Data
61
+
62
+ The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics:
63
+
64
+ - Size: 192 examples
65
+ - Format: Natural query β†’ Boolean expression pairs
66
+ - Source: Manually curated academic search examples
67
+ - Validation: Expert-reviewed for accuracy
68
+
69
+ ## Training Process
70
+
71
+ - **Method**: LoRA fine-tuning
72
+ - **Epochs**: 6
73
+ - **Learning Rate**: 5e-5 with cosine scheduling
74
+ - **Batch Size**: 16 (4 per device Γ— 4 gradient accumulation steps)
75
+ - **Hardware**: NVIDIA GeForce RTX 4070 Ti SUPER
76
+
77
+ ## How to Use
78
+
79
+ ```python
80
+ from unsloth import FastLanguageModel
81
+
82
+ # Load model
83
+ model, tokenizer = FastLanguageModel.from_pretrained(
84
+ "Zwounds/boolean-search-model",
85
+ max_seq_length=2048,
86
+ dtype=None, # Auto-detect
87
+ load_in_4bit=True
88
+ )
89
+ FastLanguageModel.for_inference(model)
90
+
91
+ # Format query
92
+ query = "Find papers about climate change and renewable energy"
93
+ formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
94
+
95
+ ### Instruction:
96
+ Convert this natural language query into a boolean search query by following these rules:
97
+
98
+ 1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
99
+ - articles, papers, research, studies
100
+ - examining, investigating, analyzing
101
+ - findings, documents, literature
102
+ - publications, journals, reviews
103
+ Example: "Research examining X" β†’ just "X"
104
+
105
+ 2. SECOND: Remove generic implied terms that don't add search value:
106
+ - Remove words like "practices," "techniques," "methods," "approaches," "strategies"
107
+ - Remove words like "impacts," "effects," "influences," "role," "applications"
108
+ - For example: "sustainable agriculture practices" β†’ "sustainable agriculture"
109
+ - For example: "teaching methodologies" β†’ "teaching"
110
+ - For example: "leadership styles" β†’ "leadership"
111
+
112
+ 3. THEN: Format the remaining terms:
113
+ CRITICAL QUOTING RULES:
114
+ - Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
115
+ - Examples of correct quoting:
116
+ - Wrong: machine learning AND deep learning
117
+ - Right: "machine learning" AND "deep learning"
118
+ - Wrong: natural language processing
119
+ - Right: "natural language processing"
120
+ - Single words must NEVER have quotes (e.g., science, research, learning)
121
+ - Use AND to connect required concepts
122
+ - Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))
123
+
124
+ Example conversions showing proper quoting:
125
+ "Research on machine learning for natural language processing"
126
+ β†’ "machine learning" AND "natural language processing"
127
+
128
+ "Studies examining anxiety depression stress in workplace"
129
+ β†’ (anxiety OR depression OR stress) AND workplace
130
+
131
+ "Articles about deep learning impact on computer vision"
132
+ β†’ "deep learning" AND "computer vision"
133
+
134
+ "Research on sustainable agriculture practices and their impact on soil health or biodiversity"
135
+ β†’ "sustainable agriculture" AND ("soil health" OR biodiversity)
136
+
137
+ "Articles about effective teaching methods for second language acquisition"
138
+ β†’ teaching AND "second language acquisition"
139
+
140
+ ### Input:
141
+ {query}
142
+
143
+ ### Response:
144
+ """
145
+
146
+ # Generate boolean query
147
+ inputs = tokenizer(formatted, return_tensors="pt")
148
+ outputs = model.generate(**inputs, max_new_tokens=100)
149
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
150
+ print(result) # "climate change" AND "renewable energy"
151
+ ```
152
+
153
+ ## Evaluation Results
154
+
155
+ Our test suite demonstrates consistent improvements over the base model in key areas:
156
+ 1. Meta-term removal accuracy: 100%
157
+ 2. Proper multi-word term quoting: 95%
158
+ 3. Logical grouping accuracy: 98%
159
+ 4. Minimal formatting adherence: 97%
160
+
161
+ ## Citation
162
+
163
+ If you use this model in your research, please cite:
164
+ ```bibtex
165
+ @misc{boolean-search-llm,
166
+ title={Boolean Search Query LLM},
167
+ author={Stephen Zweibel},
168
+ year={2025},
169
+ publisher={Hugging Face},
170
+ url={https://huggingface.co/Zwounds/boolean-search-model}
171
+ }