Model Card for VeriFastScore
VeriFastScore is a factuality evaluation model designed for long-form LLM outputs. It jointly extracts and verifies factual claims in a single model pass, providing a faster alternative to pipeline-based evaluators like VeriScore.
Model Details
Model Description
This is a fine-tuned LLaMA 3.1 8B Instruct model trained to extract and verify factual claims in long-form text, given associated retrieved evidence. The model is designed to reduce inference latency and cost while maintaining high agreement with more expensive pipeline-based factuality metrics.
- Developed by: NGRAM at UMD, Lambda Labs
- Model type: Causal LM for factuality evaluation (joint claim extraction and verification)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: meta-llama/Llama-3.1-8B-Instruct
Model Sources
- Repository: github.com/RishanthRajendhran/VeriFastScore
- Paper: arxiv.org/abs/2505.16973
Uses
Direct Use
The model takes as input a generated long-form response and a consolidated set of retrieved evidence sentences. It outputs a list of verifiable claims and corresponding factuality labels (Supported or Unsupported).
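For illustration, a toy input and the expected output shape (the response, evidence, claims, and labels below are invented for this card, not actual model output):

response = "Nvidia was founded in 1993 in Sunnyvale, California by Jensen Huang."
evidence = "Nvidia Corporation was founded on April 5, 1993. Nvidia is headquartered in Santa Clara, California."
# Expected output format, one "<fact>: <judgment>" pair per line, e.g.:
#   Nvidia was founded in 1993.: Supported
#   Nvidia was founded in Sunnyvale, California.: Unsupported
#   Jensen Huang founded Nvidia.: Unsupported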
Downstream Use
Can be used to score factuality in evaluation pipelines (e.g., RLHF supervision), dataset filtering, or system-level benchmarking of LLM factuality.
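For example, claim labels can be aggregated into response- and system-level scores. A minimal sketch, assuming the score is simply the fraction of Supported claims (the paper's exact aggregation may differ):

def response_score(labels):
    # labels: e.g. ["Supported", "Unsupported", ...] for one response.
    if not labels:  # the model output "No verifiable claim."
        return None
    return sum(label == "Supported" for label in labels) / len(labels)

def system_score(per_response_labels):
    # Average response-level scores across all prompts for one LLM.
    scores = [s for s in map(response_score, per_response_labels) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0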
Out-of-Scope Use
- Not intended for use without retrieved evidence.
Bias, Risks, and Limitations
The model inherits potential biases from its teacher supervision (VeriScore) and the base language model. It may underperform on ambiguous claims, noisy evidence, or non-English text.
Recommendations
Use caution in high-stakes domains and supplement with human review if used for system-level feedback or alignment. Avoid use cases without explicit, relevant evidence input.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("rishanthrajendhran/VeriFastScore")
model = AutoModelForCausalLM.from_pretrained("rishanthrajendhran/VeriFastScore")
system_prompt = '''You are trying to verify how factual a response is by extracting fine-grained, verifiable claims. Each claim must describe one single event or one single state (for example, “Nvidia was founded in 1993 in Sunnyvale, California, U.S.”) in one sentence with at most one embedded clause. Each fact should be understandable on its own and require no additional context. This means that all entities must be referred to by name but not by pronoun. Use the name of entities rather than definite noun phrases (e.g., “the teacher”) whenever possible. If a definite noun phrase is used, be sure to add modifiers (e.g., an embedded clause or a prepositional phrase). Each fact must be situated within relevant temporal and location details whenever needed.
All necessary specific details—including entities, dates, and locations—must be explicitly named, and verify here means that every detail of a claim is directly confirmed by the provided evidence. The verification process involves cross-checking each detail against the evidence; a detail is considered verified if it is clearly confirmed by the evidence.
Avoid extracting stories, personal experiences, hypotheticals (e.g., those using “would be” or the subjunctive mood), subjective opinions, suggestions, advice, instructions, or similarly non-factual content; however, biographical, historical, scientific, and similar texts are acceptable. Also, ignore any listed references.
For each extracted claim, classify it as follows:
Supported: Every detail of the claim (including entities, dates, and locations) is directly confirmed by the provided evidence with no contradictions.
Unsupported: One or more details of the claim are either missing from or contradicted by the provided evidence, even though the claim remains verifiable using external sources.
You do not need to justify what you extract.
Output format:
<fact 1>: <your judgment of fact 1>
<fact 2>: <your judgment of fact 2>
…
<fact n>: <your judgment of fact n>
If no verifiable claim can be extracted, simply output "No verifiable claim."'''
# `response` is the long-form LLM output to evaluate; `evidence` is the
# consolidated retrieved evidence (e.g., evidence sentences joined together).
prompt = "### Response\n{response}\n### Evidence\n{evidence}".format(
    response=response,
    evidence=evidence
)
conversation_history = [
    {
        "role": "system",
        "content": system_prompt,
    },
    {
        "role": "user",
        "content": prompt,
    },
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation_history,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    truncation=False,
    padding=False,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (the extracted claims and their labels).
generated_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
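The decoded text follows the "<fact>: <judgment>" format defined in the system prompt. A minimal parsing sketch (the helper name is ours; real outputs may need additional cleanup):

def parse_output(text):
    # Turn "<fact>: <judgment>" lines into (claim, label) pairs.
    if "No verifiable claim" in text:
        return []
    pairs = []
    for line in text.strip().splitlines():
        claim, sep, label = line.rpartition(":")
        if sep and label.strip() in {"Supported", "Unsupported"}:
            pairs.append((claim.strip(), label.strip()))
    return pairs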
Training Details
Training Data
Synthetic (response, evidence, claim, label) examples generated by applying VeriScore to long-form prompts from datasets such as Tulu3-Personas. The dataset is available on HuggingFace; see the paper for more details.
Training Procedure
Two-stage fine-tuning:
- Stage 1: Supervision with claim-level evidence.
- Stage 2: Supervision with a mixture of claim- and sentence-level evidence.
Preprocessing
In the original VeriFastScore pipeline, evidence is aggregated at the sentence level per response, tokenized, and paired with output claims using a structured prompt template. However, the VeriFastScore model is agnostic to the provenance of the provided evidence.
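A minimal sketch of one way to consolidate sentence-level evidence into the prompt format shown in the example above; the helper name is ours, and the official preprocessing in the repository may differ.

def build_prompt(response, evidence_sentences):
    # Deduplicate evidence sentences while preserving order, then join them
    # into a single evidence block for the "### Evidence" section.
    evidence = "\n".join(dict.fromkeys(evidence_sentences))
    return "### Response\n{response}\n### Evidence\n{evidence}".format(
        response=response, evidence=evidence
    )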
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW
- Scheduler: Cosine decay
- Batch size: 8 (effective)
- Epochs: 10 (5 per stage; see the configuration sketch below)
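The hyperparameters above map roughly onto a HuggingFace TrainingArguments configuration. The sketch below is illustrative only: values not listed above (such as the learning rate and output directory) are placeholders, and the actual training code may differ.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="verifastscore-stage1",  # placeholder path
    bf16=True,                          # bf16 mixed precision
    optim="adamw_torch",                # AdamW optimizer
    lr_scheduler_type="cosine",         # cosine decay schedule
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size 8 on a single device
    num_train_epochs=5,                 # one 5-epoch stage; run twice for the two stages
    learning_rate=2e-5,                 # placeholder; not specified in this card
)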
Speeds, Sizes, Times
- Training Time: ~96 GPU hours (4 GPUs × ~24 hours; roughly 2 seconds per training instance)
- Model Size: 8B parameters
Evaluation
Testing Data, Factors & Metrics
Testing Data
- ~9k test instances using both claim-level and sentence-level evidence
- Model rankings: 100 prompts from the Tulu3-Personas test set with responses from 12 LLMs
Metrics
- Claim-level accuracy, precision, recall (automatic judgements using GPT-4o-mini)
- Pearson correlation with factuality scores from VeriScore
Results
- (Claim-level evidence) Pearson r with VeriScore: 0.86, p<0.001
- (Sentence-level evidence) Pearson r with VeriScore: 0.80, p<0.001
- Model rankings:
- System-level Pearson r: 0.94, p<0.001
- Speedup: 6.6× (9.9× when excluding retrieval). See the paper for more details.
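The correlations above compare VeriFastScore's factuality scores with VeriScore's. A minimal sketch of computing such a correlation with scipy (the score lists below are hypothetical):

from scipy.stats import pearsonr

# Hypothetical per-system factuality scores from the two evaluators.
verifastscore_scores = [0.71, 0.64, 0.58, 0.80]
veriscore_scores = [0.69, 0.66, 0.55, 0.78]

r, p_value = pearsonr(verifastscore_scores, veriscore_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")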
Summary
VeriFastScore delivers fast, interpretable factuality scores that closely track a strong multi-step baseline, while reducing cost and latency for large-scale evaluation.
Model Examination
Future work could explore explainability or rationale generation via mode-switching techniques or chain-of-thought prompting.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: A100 (Training), GH200 (Evaluation, Testing)
- Hours used: 96 (Training)
- Cloud Provider: Lambda Labs
- Compute Region: us-central1
- Carbon Emitted: 10.37 kg CO2eq (Training)
Citation
BibTeX:
@misc{rajendhran2025verifastscorespeedinglongformfactuality,
  title={VeriFastScore: Speeding up long-form factuality evaluation},
  author={Rishanth Rajendhran and Amir Zadeh and Matthew Sarte and Chuan Li and Mohit Iyyer},
  year={2025},
  eprint={2505.16973},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.16973},
}