If no verifiable claim can be extracted, simply output "No verifiable claim."'''
prompt = "### Response\n{response}\n### Evidence\n{evidence}".format(
response=response,
evidence=evidence
)
conversation_history = [
{
"role": "system",
"content": system_prompt,
}, {
"role": "user",
"content": prompt
}
]
inputs = tokenizer.apply_chat_template(
conversation=conversation_history,
add_generation_prompt=True,
tokenize=True,
truncation=False,
padding="do_not_pad",
return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
```
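The snippet above prints the raw generation. Below is a minimal sketch of turning that text into a single factuality score, assuming the decoded string is stored in a variable `decoded` and that each output line ends with a Supported/Unsupported verdict; the exact output format and score definition are specified in the paper, so treat this parser as illustrative.
```python
import re

def parse_claims(decoded: str):
    """Illustrative parser: one claim per line with a trailing Supported/Unsupported verdict (assumed format)."""
    pairs = []
    for line in decoded.strip().splitlines():
        m = re.match(r"(.+?)\s*[-:]*\s*(Supported|Unsupported)\s*$", line.strip(), re.IGNORECASE)
        if m:
            pairs.append((m.group(1).strip(), m.group(2).capitalize()))
    return pairs

claims = parse_claims(decoded)  # `decoded` = the string printed in the snippet above
if claims:
    # One simple aggregate: the fraction of extracted claims judged Supported.
    score = sum(label == "Supported" for _, label in claims) / len(claims)
    print(f"{len(claims)} claims, factuality ≈ {score:.2f}")
else:
    print("No verifiable claim.")
```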
## Training Details
### Training Data
Synthetic (response, evidence, claim, label) examples generated by applying the VeriScore pipeline to long-form prompts from datasets such as Tulu3-Personas. The dataset is available on HuggingFace; see the paper for more details.
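For illustration, one synthetic example might look like the sketch below; the field names, the example text, and the claim/label serialization are assumptions, not the published schema (see the dataset card for the actual fields).
```python
# Hypothetical shape of one (response, evidence, claims) training example.
example = {
    "response": "Marie Curie won two Nobel Prizes, in physics and in chemistry.",
    "evidence": (
        "Search result 1: Marie Curie was awarded the 1903 Nobel Prize in "
        "Physics and the 1911 Nobel Prize in Chemistry."
    ),
    "claims": [
        ("Marie Curie won the Nobel Prize in Physics.", "Supported"),
        ("Marie Curie won the Nobel Prize in Chemistry.", "Supported"),
    ],
}

# (response, evidence) forms the model input (same template as the inference
# snippet above); the labeled claims form the supervision target.
model_input = "### Response\n{response}\n### Evidence\n{evidence}".format(
    response=example["response"], evidence=example["evidence"]
)
target = "\n".join(f"{claim} - {label}" for claim, label in example["claims"])
```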
### Training Procedure
Two-stage fine-tuning:
- Stage 1: Supervision with claim-level evidence.
- Stage 2: Supervision with a mixture of claim- and sentence-level evidence (a data-mixing sketch follows this list).
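A minimal sketch of that data mixing, assuming each example carries a hypothetical `evidence_granularity` field; the actual mixing ratio and sampling procedure are described in the paper.
```python
import random

def build_stage_mixtures(examples, seed=0):
    """Split examples into the two supervision stages (illustrative only)."""
    rng = random.Random(seed)
    claim_level = [ex for ex in examples if ex["evidence_granularity"] == "claim"]
    sentence_level = [ex for ex in examples if ex["evidence_granularity"] == "sentence"]

    # Stage 1: claim-level evidence only.
    stage1 = list(claim_level)
    rng.shuffle(stage1)

    # Stage 2: mixture of claim- and sentence-level evidence.
    stage2 = claim_level + sentence_level
    rng.shuffle(stage2)
    return stage1, stage2
```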
#### Preprocessing
In the original VeriFastScore pipeline, evidence is aggregated at the sentence level per response, tokenized, and paired with output claims using a structured prompt template. However, the VeriFastScore model is agnostic to the provenance of the provided evidence.
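A minimal sketch of that aggregation step, assuming per-sentence evidence arrives as a mapping from response sentences to retrieved snippets; the joining format is an illustrative choice, since any evidence string can be dropped into the prompt template shown above.
```python
def aggregate_evidence(sentence_evidence: dict) -> str:
    """Flatten per-sentence retrieved snippets into one evidence string (illustrative format)."""
    blocks = []
    for sentence, snippets in sentence_evidence.items():
        blocks.append(f"Sentence: {sentence}")
        blocks.extend(f"- {snippet}" for snippet in snippets)
    return "\n".join(blocks)

evidence = aggregate_evidence({
    "Marie Curie won two Nobel Prizes.": [
        "Marie Curie received the 1903 Nobel Prize in Physics.",
        "Marie Curie received the 1911 Nobel Prize in Chemistry.",
    ],
})
```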
#### Training Hyperparameters
- **Training regime**: bf16 mixed precision
- **Optimizer**: AdamW
- **Scheduler**: Cosine decay
- **Batch size**: 8 (effective)
- **Epochs**: 10 (5+5)
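For reference, these hyperparameters map roughly onto `transformers.TrainingArguments` as sketched below; the per-device batch size / gradient-accumulation split and the output directory are assumptions, and the learning rate is omitted because it is not listed here.
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="verifastscore-stage1",  # hypothetical path
    bf16=True,                          # bf16 mixed precision
    optim="adamw_torch",                # AdamW
    lr_scheduler_type="cosine",         # cosine decay
    per_device_train_batch_size=1,      # assumed split of the
    gradient_accumulation_steps=8,      # effective batch size of 8
    num_train_epochs=5,                 # 5 epochs per stage, run twice
)
```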
#### Speeds, Sizes, Times
- Training Time: ~96 GPU hours (4 GPUs × ~24 hours; roughly 2 seconds per training instance)
- Model Size: 8B parameters
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- ~9k test instances using both claim-level and sentence-level evidence
- Model rankings: 100 prompts from the Tulu3-Personas test set with responses from 12 LLMs
#### Metrics
- Claim-level accuracy, precision, and recall (automatic judgements using GPT-4o-mini)
- Pearson correlation with factuality scores from VeriScore (a correlation sketch follows below)
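A minimal sketch of the correlation computation, with illustrative per-response scores standing in for the real outputs of the two pipelines:
```python
from scipy.stats import pearsonr

# Per-response factuality scores from the two pipelines (illustrative values).
verifastscore_scores = [0.82, 0.64, 0.91, 0.55, 0.73]
veriscore_scores = [0.80, 0.60, 0.95, 0.50, 0.70]

r, p_value = pearsonr(verifastscore_scores, veriscore_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```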
### Results
- (Claim-level evidence) Pearson r with VeriScore: 0.86, p<0.001
- (Sentence-level evidence) Pearson r with VeriScore: 0.80, p<0.001
- Model rankings:
- System-level Pearson r: 0.94, p<0.001
- Speedup: 6.6× (9.9× if excluding retrieval)
See paper for more details.
#### Summary
VeriFastScore delivers fast, interpretable factuality scores that closely track a strong multi-step baseline, while reducing cost and latency for large-scale evaluation.
## Model Examination
Future work could explore explainability or rationale generation via mode-switching techniques or chain-of-thought prompting.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** A100 (Training), GH200 (Evaluation, Testing)
- **Hours used:** 96 (Training)
- **Cloud Provider:** Lambda Labs
- **Compute Region:** us-central1
- **Carbon Emitted:** 10.37 kg CO2eq (Training)
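A back-of-the-envelope version of that estimate, using the 96 GPU hours listed above; the per-GPU power draw and grid carbon intensity are assumed placeholder values, not figures reported in the paper.
```python
# Emissions ≈ energy (kWh) × grid carbon intensity (kg CO2eq / kWh),
# in the spirit of the ML CO2 Impact calculator.
gpu_hours = 96            # 4 × A100 for ~24 hours (reported above)
power_draw_kw = 0.25      # assumed average draw per A100
carbon_intensity = 0.432  # assumed kg CO2eq per kWh for the region

energy_kwh = gpu_hours * power_draw_kw
emissions_kg = energy_kwh * carbon_intensity
print(f"~{emissions_kg:.2f} kg CO2eq")  # ≈ 10.37 with these assumed values
```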
## Citation
**BibTeX:**
```bibtex
@misc{rajendhran2025verifastscorespeedinglongformfactuality,
  title={VeriFastScore: Speeding up long-form factuality evaluation},
  author={Rishanth Rajendhran and Amir Zadeh and Matthew Sarte and Chuan Li and Mohit Iyyer},
  year={2025},
  eprint={2505.16973},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.16973},
}
```
## Model Card Contact
rishanth@umd.edu