Model Card for VeriFastScore

VeriFastScore is a factuality evaluation model designed for long-form LLM outputs. It jointly extracts and verifies factual claims in a single model pass, providing a faster alternative to pipeline-based evaluators like VeriScore.

Model Details

Model Description

This is a fine-tuned LLaMA 3.1 8B Instruct model trained to extract and verify factual claims in long-form text, given associated retrieved evidence. The model is designed to reduce inference latency and cost while maintaining high agreement with more expensive pipeline-based factuality metrics.

  • Developed by: NGRAM at UMD, Lambda Labs
  • Model type: Factuality evaluation model (joint claim extraction and verification) (Causal LM)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct

Model Sources

  • Paper: https://arxiv.org/abs/2505.16973

Uses

Direct Use

The model takes as input a generated long-form response and a consolidated set of retrieved evidence sentences. It outputs a list of verifiable claims and corresponding factuality labels (Supported or Unsupported).
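
For illustration (the claims and labels here are hypothetical examples, not actual model output), the output lists one claim per line followed by its judgment:

  Nvidia was founded in 1993 in Sunnyvale, California, U.S.: Supported
  Nvidia was founded by a team of five engineers.: Unsupported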

Downstream Use

VeriFastScore can be used to score factuality in evaluation pipelines (e.g., as supervision for RLHF), for dataset filtering, or for system-level benchmarking of LLM factuality.

Out-of-Scope Use

  • Not intended for use without retrieved evidence.

Bias, Risks, and Limitations

The model inherits potential biases from its teacher supervision (VeriScore) and the base language model. It may underperform on ambiguous claims, noisy evidence, or non-English text.

Recommendations

Use caution in high-stakes domains and supplement with human review if used for system-level feedback or alignment. Avoid use cases without explicit, relevant evidence input.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rishanthrajendhran/VeriFastScore")
model = AutoModelForCausalLM.from_pretrained(
    "rishanthrajendhran/VeriFastScore", torch_dtype=torch.bfloat16, device_map="auto"
)

system_prompt = '''You are trying to verify how factual a response is by extracting fine-grained, verifiable claims. Each claim must describe one single event or one single state (for example, “Nvidia was founded in 1993 in Sunnyvale, California, U.S.”) in one sentence with at most one embedded clause. Each fact should be understandable on its own and require no additional context. This means that all entities must be referred to by name but not by pronoun. Use the name of entities rather than definite noun phrases (e.g., “the teacher”) whenever possible. If a definite noun phrase is used, be sure to add modifiers (e.g., an embedded clause or a prepositional phrase). Each fact must be situated within relevant temporal and location details whenever needed.

All necessary specific details—including entities, dates, and locations—must be explicitly named, and verify here means that every detail of a claim is directly confirmed by the provided evidence. The verification process involves cross-checking each detail against the evidence; a detail is considered verified if it is clearly confirmed by the evidence.

Avoid extracting stories, personal experiences, hypotheticals (e.g., those using “would be” or the subjunctive mood), subjective opinions, suggestions, advice, instructions, or similarly non-factual content; however, biographical, historical, scientific, and similar texts are acceptable. Also, ignore any listed references.

For each extracted claim, classify it as follows:

Supported: Every detail of the claim (including entities, dates, and locations) is directly confirmed by the provided evidence with no contradictions.
Unsupported: One or more details of the claim are either missing from or contradicted by the provided evidence, even though the claim remains verifiable using external sources.

You do not need to justify what you extract.

Output format:
<fact 1>: <your judgment of fact 1>
<fact 2>: <your judgment of fact 2>

<fact n>: <your judgment of fact n>

If no verifiable claim can be extracted, simply output "No verifiable claim."'''

prompt = "### Response\n{response}\n### Evidence\n{evidence}".format(
  response=response,
  evidence=evidence
)

conversation_history = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation_history,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (the extracted claims and their labels).
decoded_output = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(decoded_output)
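
The decoded text can then be parsed into (claim, label) pairs. The sketch below is illustrative only: the helper function is not part of the official release, and the aggregation shown (fraction of Supported claims) is an assumption rather than the paper's exact scoring formula.

def parse_verifastscore_output(text):
    # Each output line has the form "<claim>: <Supported|Unsupported>"; the model
    # outputs "No verifiable claim." when nothing verifiable can be extracted.
    pairs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line == "No verifiable claim.":
            continue
        claim, _, label = line.rpartition(":")
        pairs.append((claim.strip(), label.strip()))
    return pairs

claims = parse_verifastscore_output(decoded_output)
if claims:
    # Assumed aggregation for illustration: fraction of Supported claims.
    score = sum(label == "Supported" for _, label in claims) / len(claims)
    print(f"{len(claims)} claims extracted, fraction supported = {score:.2f}")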

Training Details

Training Data

Synthetic (response, evidence, claim, label) examples generated by applying VeriScore to long-form prompts from datasets such as Tulu3-Personas. The dataset is available on Hugging Face; see the paper for more details.

Training Procedure

Two-stage fine-tuning:

  • Stage 1: Supervision with claim-level evidence.
  • Stage 2: Supervision with a mixture of claim- and sentence-level evidence.

Preprocessing

In the original VeriFastScore pipeline, evidence is aggregated at the sentence level per response, tokenized, and paired with output claims using a structured prompt template. However, the VeriFastScore model is agnostic to the provenance of the provided evidence.
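
As a rough sketch of this consolidation step (the function and variable names below are hypothetical and not taken from the released pipeline), evidence retrieved per response sentence can be deduplicated and joined into the single evidence string used in the prompt above:

def consolidate_evidence(per_sentence_evidence):
    # Merge the evidence retrieved for each sentence of a response into one
    # deduplicated, newline-separated evidence string (illustrative only).
    seen, merged = set(), []
    for evidence_sentences in per_sentence_evidence:
        for sentence in evidence_sentences:
            if sentence not in seen:
                seen.add(sentence)
                merged.append(sentence)
    return "\n".join(merged)

# Hypothetical example: evidence retrieved for two sentences of one response.
per_sentence_evidence = [
    ["Nvidia was founded in 1993.", "Nvidia is headquartered in Santa Clara, California."],
    ["Nvidia was founded in 1993.", "Jensen Huang co-founded Nvidia."],
]
evidence = consolidate_evidence(per_sentence_evidence)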

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Optimizer: AdamW
  • Scheduler: Cosine decay
  • Batch size: 8 (effective)
  • Epochs: 10 (5 per fine-tuning stage)

Speeds, Sizes, Times

  • Training Time: ~96 GPU hours (4 GPUs × ~24 hours; roughly 2 seconds per training instance)
  • Model Size: 8B parameters

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • ~9k test instances using both claim-level and sentence-level evidence
  • Model rankings: 100 prompts from the Tulu3-Personas test set with responses from 12 LLMs

Metrics

  • Claim-level accuracy, precision, and recall (automatic judgments using GPT-4o-mini)
  • Pearson correlation with factuality scores from VeriScore (see the sketch after this list)
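
A minimal sketch of how such a correlation can be computed (the scores below are made-up placeholders, one per evaluated system; the paper's exact aggregation may differ):

from scipy.stats import pearsonr

# Hypothetical per-system average factuality scores (one entry per evaluated LLM).
verifastscore_scores = [0.62, 0.55, 0.71, 0.48]
veriscore_scores = [0.60, 0.57, 0.69, 0.50]

r, p_value = pearsonr(verifastscore_scores, veriscore_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")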

Results

  • (Claim-level evidence) Pearson r with VeriScore: 0.86, p<0.001
  • (Sentence-level evidence) Pearson r with VeriScore: 0.80, p<0.001
  • Model rankings:
    • System-level Pearson r: 0.94, p<0.001
    • Speedup: 6.6× (9.9× when excluding retrieval). See the paper for more details.

Summary

VeriFastScore delivers fast, interpretable factuality scores that closely track a strong multi-step baseline, while reducing cost and latency for large-scale evaluation.

Model Examination

Future work could explore explainability or rationale generation via mode-switching techniques or chain-of-thought prompting.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: A100 (Training), GH200 (Evaluation, Testing)
  • Hours used: 96 (Training)
  • Cloud Provider: Lambda Labs
  • Compute Region: us-central1
  • Carbon Emitted: 10.37 kg CO₂eq (Training)

Citation

BibTeX:

  @misc{rajendhran2025verifastscorespeedinglongformfactuality,
      title={VeriFastScore: Speeding up long-form factuality evaluation},
      author={Rishanth Rajendhran and Amir Zadeh and Matthew Sarte and Chuan Li and Mohit Iyyer},
      year={2025},
      eprint={2505.16973},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16973},
  }

Model Card Contact

rishanth@umd.edu
