Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


phi3-hallucination-judge-merge - bnb 8bits
- Model creator: https://huggingface.co/grounded-ai/
- Original model: https://huggingface.co/grounded-ai/phi3-hallucination-judge-merge/
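
Since this repository holds the bnb 8-bit weights, a minimal loading sketch with bitsandbytes 8-bit quantization may be helpful; the exact quantization settings used for this upload are an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed 8-bit settings; the uploader's exact bitsandbytes config may differ.
model_id = "grounded-ai/phi3-hallucination-judge-merge"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # place the 8-bit weights on available devices
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```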

Original model description:
---
library_name: transformers
license: mit
tags: []
---

## Merged Model Performance

This repository contains our hallucination evaluation PEFT adapter model.

### Hallucination Detection Metrics

Our merged model achieves the following performance on a binary classification task for detecting hallucinations in language model outputs:

```
              precision    recall  f1-score   support

           0       0.85      0.71      0.77       100
           1       0.75      0.87      0.81       100

    accuracy                           0.79       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.79      0.79       200
```
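
The report above is in scikit-learn's `classification_report` format; as a point of reference, here is a minimal sketch of how such numbers are produced (the label lists below are placeholders, not our evaluation data):

```python
from sklearn.metrics import classification_report

# Placeholder data for illustration only: 1 = hallucination, 0 = grounded.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Map the judge's "yes"/"no" replies to 1/0 before scoring.
print(classification_report(y_true, y_pred, digits=2))
```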

### Model Usage

For best results, we recommend starting with the following prompting strategy (and encourage tweaks as you see fit):

```python
import torch
from transformers import AutoTokenizer, pipeline

base_model = "grounded-ai/phi3-hallucination-judge-merge"
attn_implementation = "eager"  # "flash_attention_2" is another option if installed
tokenizer = AutoTokenizer.from_pretrained(base_model)

def format_input(reference, query, response):
    prompt = f"""Your job is to evaluate whether a machine learning model has hallucinated or not.
A hallucination occurs when the response is coherent but factually incorrect or nonsensical
outputs that are not grounded in the provided context.
You are given the following information:
####INFO####
[Knowledge]: {reference}
[User Input]: {query}
[Model Response]: {response}
####END INFO####
Based on the information provided is the model output a hallucination? Respond with only "yes" or "no"
"""
    return prompt

# The knowledge/context goes in `reference`, the question in `query`.
text = format_input(
    reference='Walrus are the largest mammal',
    query='What is the best PC?',
    response='The best PC is the mac',
)

messages = [
    {"role": "user", "content": text}
]

pipe = pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={"attn_implementation": attn_implementation, "torch_dtype": torch.float16},
    tokenizer=tokenizer,
)
generation_args = {
    "max_new_tokens": 2,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": True,
}

output = pipe(messages, **generation_args)
print(f"Hallucination: {output[0]['generated_text'].strip().lower()}")
# Hallucination: yes
```
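
To consume the verdict programmatically rather than printing it, a thin wrapper such as the hypothetical `judge_hallucination` below (reusing `format_input`, `pipe`, and `generation_args` from the snippet above) can map the model's reply onto a boolean:

```python
def judge_hallucination(reference, query, response):
    """Return True when the judge model flags the response as a hallucination."""
    messages = [{"role": "user", "content": format_input(reference, query, response)}]
    reply = pipe(messages, **generation_args)[0]["generated_text"]
    return reply.strip().lower().startswith("yes")

# The response ignores the provided context, so we expect True here.
print(judge_hallucination(
    reference="Walrus are the largest mammal",
    query="What is the best PC?",
    response="The best PC is the mac",
))
```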

### Comparison with Other Models

We compared our merged model's performance on the hallucination detection benchmark against several other state-of-the-art language models:

| Model                  | Precision | Recall | F1   |
|------------------------|----------:|-------:|-----:|
| Our Merged Model       |      0.75 |   0.87 | 0.81 |
| GPT-4                  |      0.93 |   0.72 | 0.82 |
| GPT-4 Turbo            |      0.97 |   0.70 | 0.81 |
| Gemini Pro             |      0.89 |   0.53 | 0.67 |
| GPT-3.5                |      0.89 |   0.65 | 0.75 |
| GPT-3.5-turbo-instruct |      0.89 |   0.80 | 0.84 |
| Palm 2 (Text Bison)    |      1.00 |   0.44 | 0.61 |
| Claude V2              |      0.80 |   0.95 | 0.87 |

As the table shows, our merged model reaches an F1 score of 0.81, among the highest in this comparison, outperforming several state-of-the-art language models on this hallucination detection task.

We will continue to improve and fine-tune our merged model to achieve even better performance across various benchmarks and tasks.

Citations: scores for the comparison models are from arize/phoenix.

### Training Data

@misc{HaluEval,
  author = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  year = {2023},
  journal = {arXiv preprint arXiv:2305.11747},
  url = {https://arxiv.org/abs/2305.11747}
}

### Framework versions

- PEFT 0.11.1
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.2
- Tokenizers 0.19.1

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):

- learning_rate: 0.0001
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10
- training_steps: 150
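
For readers who want to reproduce this setup, here is a hedged sketch of how these values map onto `transformers.TrainingArguments`; the output directory is a placeholder, and the effective batch size of 8 comes from 2 x 4 gradient accumulation:

```python
from transformers import TrainingArguments

# A sketch mapping the listed hyperparameters onto TrainingArguments;
# "phi3-hallucination-judge" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="phi3-hallucination-judge",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,  # total train batch size: 2 * 4 = 8
    optim="adamw_torch",            # Adam variant; betas (0.9, 0.999) and eps 1e-8 are the defaults
    lr_scheduler_type="linear",
    warmup_steps=10,
    max_steps=150,                  # training_steps
)
```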