---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
datasets:
- rishanthrajendhran/VeriFastScore
---

# Model Card for VeriFastScore

<!-- Provide a quick summary of what the model is/does. -->
VeriFastScore is a factuality evaluation model designed for long-form LLM outputs. It jointly extracts and verifies factual claims in a single model pass, providing a faster alternative to pipeline-based evaluators like VeriScore.


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
This is a fine-tuned LLaMA 3.1 8B Instruct model trained to extract and verify factual claims in long-form text, given associated retrieved evidence. The model is designed to reduce inference latency and cost while maintaining high agreement with more expensive pipeline-based factuality metrics.

- **Developed by:** NGRAM at UMD, Lambda Labs
- **Model type:** Causal LM fine-tuned for factuality evaluation (joint claim extraction and verification)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** meta-llama/Llama-3.1-8B-Instruct

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [github.com/RishanthRajendhran/VeriFastScore](https://github.com/RishanthRajendhran/VeriFastScore)
- **Paper:** [arxiv.org/abs/2505.16973](https://arxiv.org/abs/2505.16973)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model takes as input a generated long-form response and a consolidated set of retrieved evidence sentences. It outputs a list of verifiable claims and corresponding factuality labels (Supported or Unsupported).

### Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
The model can be used to score factuality in evaluation pipelines (e.g., as supervision for RLHF), for dataset filtering, or for system-level benchmarking of LLM factuality.

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- Not intended for use without retrieved evidence.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The model inherits potential biases from its teacher supervision (VeriScore) and the base language model. It may underperform on ambiguous claims, noisy evidence, or non-English text.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Use caution in high-stakes domains and supplement with human review if used for system-level feedback or alignment. Avoid use cases without explicit, relevant evidence input.

## How to Get Started with the Model

Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rishanthrajendhran/VeriFastScore")
model = AutoModelForCausalLM.from_pretrained(
    "rishanthrajendhran/VeriFastScore",
    torch_dtype="auto",
    device_map="auto",  # requires `accelerate`; remove to load on CPU
)

system_prompt = '''You are trying to verify how factual a response is by extracting fine-grained, verifiable claims. Each claim must describe one single event or one single state (for example, “Nvidia was founded in 1993 in Sunnyvale, California, U.S.”) in one sentence with at most one embedded clause. Each fact should be understandable on its own and require no additional context. This means that all entities must be referred to by name but not by pronoun. Use the name of entities rather than definite noun phrases (e.g., “the teacher”) whenever possible. If a definite noun phrase is used, be sure to add modifiers (e.g., an embedded clause or a prepositional phrase). Each fact must be situated within relevant temporal and location details whenever needed.

All necessary specific details—including entities, dates, and locations—must be explicitly named, and verify here means that every detail of a claim is directly confirmed by the provided evidence. The verification process involves cross-checking each detail against the evidence; a detail is considered verified if it is clearly confirmed by the evidence.

Avoid extracting stories, personal experiences, hypotheticals (e.g., those using “would be” or the subjunctive mood), subjective opinions, suggestions, advice, instructions, or similarly non-factual content; however, biographical, historical, scientific, and similar texts are acceptable. Also, ignore any listed references.

For each extracted claim, classify it as follows:

Supported: Every detail of the claim (including entities, dates, and locations) is directly confirmed by the provided evidence with no contradictions.
Unsupported: One or more details of the claim are either missing from or contradicted by the provided evidence, even though the claim remains verifiable using external sources.

You do not need to justify what you extract.

Output format:
<fact 1>: <your judgment of fact 1>
<fact 2>: <your judgment of fact 2>

<fact n>: <your judgment of fact n>

If no verifiable claim can be extracted, simply output "No verifiable claim."'''

prompt = "### Response\n{response}\n### Evidence\n{evidence}".format(
  response=response,
  evidence=evidence
)

conversation_history = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    conversation_history,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (skip the prompt).
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
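
The model emits one claim per line in the `<fact>: <judgment>` format described in the system prompt. As a usage illustration, here is a minimal, hypothetical post-processing sketch; the helper names and the fraction-of-supported-claims aggregate are assumptions, not the exact scoring procedure from the paper.

```python
def parse_claims(output_text: str) -> list[tuple[str, str]]:
    """Parse '<fact>: <judgment>' lines into (claim, label) pairs.

    Hypothetical helper; assumes the Supported/Unsupported labels
    described in the system prompt above.
    """
    pairs = []
    for line in output_text.strip().splitlines():
        line = line.strip()
        if not line or line == "No verifiable claim.":
            continue
        # Split on the last colon so claims containing colons stay intact.
        claim, _, label = line.rpartition(":")
        claim, label = claim.strip(), label.strip().rstrip(".")
        if label in {"Supported", "Unsupported"}:
            pairs.append((claim, label))
    return pairs


def fraction_supported(pairs: list[tuple[str, str]]) -> float | None:
    """Illustrative response-level aggregate: share of Supported claims."""
    if not pairs:
        return None
    return sum(label == "Supported" for _, label in pairs) / len(pairs)
```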

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Synthetic (response, evidence, claim, label) examples generated by applying VeriScore to long-form prompts from datasets such as Tulu3-Personas. The dataset is available on [Hugging Face](https://huggingface.co/datasets/rishanthrajendhran/VeriFastScore); see the [paper](https://arxiv.org/abs/2505.16973) for more details.

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Two-stage fine-tuning:

- Stage 1: Supervision with claim-level evidence.
- Stage 2: Supervision with a mixture of claim- and sentence-level evidence.

#### Preprocessing

In the original VeriFastScore pipeline, evidence is aggregated at the sentence level per response, tokenized, and paired with output claims using a structured prompt template. However, the VeriFastScore model itself is agnostic to the provenance of the provided evidence.
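
As a rough illustration of that consolidation step, here is a minimal sketch (the input structure and helper name are hypothetical) that deduplicates per-sentence evidence snippets and flattens them into the single evidence string used in the prompt template above:

```python
def consolidate_evidence(per_sentence_evidence: list[list[str]]) -> str:
    """Flatten per-sentence evidence into one deduplicated evidence string.

    `per_sentence_evidence` is a hypothetical structure: one list of
    retrieved evidence snippets per sentence of the response.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for snippets in per_sentence_evidence:
        for snippet in snippets:
            snippet = snippet.strip()
            if snippet and snippet not in seen:
                seen.add(snippet)
                merged.append(snippet)
    return "\n".join(merged)
```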


#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Optimizer:** AdamW
- **Scheduler:** Cosine decay
- **Batch size:** 8 (effective)
- **Epochs:** 10 total (5 per stage)
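
For reference, a hedged sketch of a comparable configuration using the Hugging Face `TrainingArguments` API; only the values listed above come from this card, and everything else (learning rate, batch-size split, output path) is an assumption rather than the exact recipe used for VeriFastScore:

```python
from transformers import TrainingArguments

# Illustrative only: per-device batch size x accumulation steps chosen to
# reach the effective batch size of 8 reported above.
training_args = TrainingArguments(
    output_dir="verifastscore-sft",   # hypothetical path
    bf16=True,                        # bf16 mixed precision
    optim="adamw_torch",              # AdamW
    lr_scheduler_type="cosine",       # cosine decay
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,    # effective batch size 8
    num_train_epochs=5,               # 5 epochs per stage, run twice
    learning_rate=1e-5,               # assumed; not reported above
    logging_steps=50,
)
```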

#### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- Training time: ~96 GPU hours (24 hours on 4 GPUs; roughly 2 seconds per training instance)
- Model Size: 8B parameters

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->
- ~9k test instances using both claim-level and sentence-level evidence
- Model rankings: 100 prompts from the Tulu3-Personas test set with responses from 12 LLMs

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->
- Claim-level accuracy, precision, and recall (automatic judgments using GPT-4o-mini)
- Pearson correlation with factuality scores from VeriScore
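
A minimal sketch of the correlation computation between the two evaluators; the score lists below are placeholder values, not results from the paper:

```python
from scipy.stats import pearsonr

# Hypothetical per-response factuality scores from the two evaluators.
verifastscore_scores = [0.82, 0.64, 0.91, 0.55]
veriscore_scores = [0.80, 0.70, 0.88, 0.50]

r, p_value = pearsonr(verifastscore_scores, veriscore_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```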

### Results
- Claim-level evidence: Pearson r = 0.86 with VeriScore (p < 0.001)
- Sentence-level evidence: Pearson r = 0.80 with VeriScore (p < 0.001)
- Model rankings:
  - System-level Pearson r = 0.94 (p < 0.001)
  - Speedup over VeriScore: 6.6× (9.9× excluding retrieval)

See the paper for more details.

#### Summary
VeriFastScore delivers fast, interpretable factuality scores that closely track a strong multi-step baseline, while reducing cost and latency for large-scale evaluation.



## Model Examination

<!-- Relevant interpretability work for the model goes here -->
Future work could explore explainability or rationale generation via mode-switching techniques or chain-of-thought prompting.

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 (Training), GH200 (Evaluation, Testing)
- **Hours used:** 96 (Training)
- **Cloud Provider:** Lambda Labs
- **Compute Region:** us-central1
- **Carbon Emitted:** 10.37 (Training)

## Citation

**BibTeX:**

<pre>
@misc{rajendhran2025verifastscorespeedinglongformfactuality,
  title={VeriFastScore: Speeding up long-form factuality evaluation},
  author={Rishanth Rajendhran and Amir Zadeh and Matthew Sarte and Chuan Li and Mohit Iyyer},
  year={2025},
  eprint={2505.16973},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.16973},
}
</pre>

## Model Card Contact

rishanth@umd.edu