Model Card for VeriFastScore

VeriFastScore is a factuality evaluation model designed for long-form LLM outputs. It jointly extracts and verifies factual claims in a single model pass, providing a faster alternative to pipeline-based evaluators like VeriScore.

Model Details

Model Description

This is a fine-tuned LLaMA 3.1 8B Instruct model trained to extract and verify factual claims in long-form text, given associated retrieved evidence. The model is designed to reduce inference latency and cost while maintaining high agreement with more expensive pipeline-based factuality metrics.

  • Developed by: NGRAM at UMD, Lambda Labs
  • Model type: Factuality evaluation model (joint claim extraction and verification) (Causal LM)
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct

Model Sources

  • Paper: https://arxiv.org/abs/2505.16973

Uses

Direct Use

The model takes as input a generated long-form response and a consolidated set of retrieved evidence sentences. It outputs a list of verifiable claims and corresponding factuality labels (Supported or Unsupported).
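
For illustration (the claims and labels here are hypothetical examples, not actual model output), the output lists one claim per line followed by its judgment:

  Nvidia was founded in 1993 in Sunnyvale, California, U.S.: Supported
  Nvidia was founded by a team of five engineers.: Unsupported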

Downstream Use

VeriFastScore can be used to score factuality in evaluation pipelines (e.g., as supervision for RLHF), for dataset filtering, or for system-level benchmarking of LLM factuality.

Out-of-Scope Use

  • Not intended for use without retrieved evidence.

Bias, Risks, and Limitations

The model inherits potential biases from its teacher supervision (VeriScore) and the base language model. It may underperform on ambiguous claims, noisy evidence, or non-English text.

Recommendations

Use caution in high-stakes domains and supplement with human review if used for system-level feedback or alignment. Avoid use cases without explicit, relevant evidence input.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rishanthrajendhran/VeriFastScore")
model = AutoModelForCausalLM.from_pretrained(
    "rishanthrajendhran/VeriFastScore", torch_dtype=torch.bfloat16, device_map="auto"
)

system_prompt = '''You are trying to verify how factual a response is by extracting fine-grained, verifiable claims. Each claim must describe one single event or one single state (for example, “Nvidia was founded in 1993 in Sunnyvale, California, U.S.”) in one sentence with at most one embedded clause. Each fact should be understandable on its own and require no additional context. This means that all entities must be referred to by name but not by pronoun. Use the name of entities rather than definite noun phrases (e.g., “the teacher”) whenever possible. If a definite noun phrase is used, be sure to add modifiers (e.g., an embedded clause or a prepositional phrase). Each fact must be situated within relevant temporal and location details whenever needed.

All necessary specific details—including entities, dates, and locations—must be explicitly named, and verify here means that every detail of a claim is directly confirmed by the provided evidence. The verification process involves cross-checking each detail against the evidence; a detail is considered verified if it is clearly confirmed by the evidence.

Avoid extracting stories, personal experiences, hypotheticals (e.g., those using “would be” or the subjunctive mood), subjective opinions, suggestions, advice, instructions, or similarly non-factual content; however, biographical, historical, scientific, and similar texts are acceptable. Also, ignore any listed references.

For each extracted claim, classify it as follows:

Supported: Every detail of the claim (including entities, dates, and locations) is directly confirmed by the provided evidence with no contradictions.
Unsupported: One or more details of the claim are either missing from or contradicted by the provided evidence, even though the claim remains verifiable using external sources.

You do not need to justify what you extract.

Output format:
<fact 1>: <your judgment of fact 1>
<fact 2>: <your judgment of fact 2>

<fact n>: <your judgment of fact n>

If no verifiable claim can be extracted, simply output "No verifiable claim."'''

prompt = "### Response\n{response}\n### Evidence\n{evidence}".format(
  response=response,
  evidence=evidence
)

conversation_history = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation_history,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (the extracted claims and their labels).
decoded_output = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(decoded_output)
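
The decoded text can then be parsed into (claim, label) pairs. The sketch below is illustrative only: the helper function is not part of the official release, and the aggregation shown (fraction of Supported claims) is an assumption rather than the paper's exact scoring formula.

def parse_verifastscore_output(text):
    # Each output line has the form "<claim>: <Supported|Unsupported>"; the model
    # outputs "No verifiable claim." when nothing verifiable can be extracted.
    pairs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line == "No verifiable claim.":
            continue
        claim, _, label = line.rpartition(":")
        pairs.append((claim.strip(), label.strip()))
    return pairs

claims = parse_verifastscore_output(decoded_output)
if claims:
    # Assumed aggregation for illustration: fraction of Supported claims.
    score = sum(label == "Supported" for _, label in claims) / len(claims)
    print(f"{len(claims)} claims extracted, fraction supported = {score:.2f}")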

Training Details

Training Data

Synthetic (response, evidence, claim, label) examples generated by applying VeriScore to long-form prompts from datasets such as Tulu3-Personas. The dataset is available on Hugging Face; see the paper for more details.

Training Procedure

Two-stage fine-tuning:

  • Stage 1: Supervision with claim-level evidence.
  • Stage 2: Supervision with a mixture of claim- and sentence-level evidence.

Preprocessing

In the original VeriFastScore pipeline, evidence is aggregated at the sentence level per response, tokenized, and paired with output claims using a structured prompt template. However, the VeriFastScore model is agnostic to the provenance of the provided evidence.
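
As a rough sketch of this consolidation step (the function and variable names below are hypothetical and not taken from the released pipeline), evidence retrieved per response sentence can be deduplicated and joined into the single evidence string used in the prompt above:

def consolidate_evidence(per_sentence_evidence):
    # Merge the evidence retrieved for each sentence of a response into one
    # deduplicated, newline-separated evidence string (illustrative only).
    seen, merged = set(), []
    for evidence_sentences in per_sentence_evidence:
        for sentence in evidence_sentences:
            if sentence not in seen:
                seen.add(sentence)
                merged.append(sentence)
    return "\n".join(merged)

# Hypothetical example: evidence retrieved for two sentences of one response.
per_sentence_evidence = [
    ["Nvidia was founded in 1993.", "Nvidia is headquartered in Santa Clara, California."],
    ["Nvidia was founded in 1993.", "Jensen Huang co-founded Nvidia."],
]
evidence = consolidate_evidence(per_sentence_evidence)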

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Optimizer: AdamW
  • Scheduler: Cosine decay
  • Batch size: 8 (effective)
  • Epochs: 10 (5 per fine-tuning stage)

Speeds, Sizes, Times

  • Training Time: ~96 GPU hours (4 GPUs × ~24 hours; roughly 2 seconds per training instance)
  • Model Size: 8B parameters

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • ~9k test instances using both claim-level and sentence-level evidence
  • Model rankings: 100 prompts from the Tulu3-Personas test set with responses from 12 LLMs

Metrics

  • Claim-level accuracy, precision, and recall (automatic judgments using GPT-4o-mini)
  • Pearson correlation with factuality scores from VeriScore (see the sketch after this list)
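
A minimal sketch of how such a correlation can be computed (the scores below are made-up placeholders, one per evaluated system; the paper's exact aggregation may differ):

from scipy.stats import pearsonr

# Hypothetical per-system average factuality scores (one entry per evaluated LLM).
verifastscore_scores = [0.62, 0.55, 0.71, 0.48]
veriscore_scores = [0.60, 0.57, 0.69, 0.50]

r, p_value = pearsonr(verifastscore_scores, veriscore_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")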

Results

  • (Claim-level evidence) Pearson r with VeriScore: 0.86, p<0.001
  • (Sentence-level evidence) Pearson r with VeriScore: 0.80, p<0.001
  • Model rankings:
    • System-level Pearson r: 0.94, p<0.001
    • Speedup: 6.6× (9.9× when excluding retrieval). See the paper for more details.

Summary

VeriFastScore delivers fast, interpretable factuality scores that closely track a strong multi-step baseline, while reducing cost and latency for large-scale evaluation.

Model Examination

Future work could explore explainability or rationale generation via mode-switching techniques or chain-of-thought prompting.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: A100 (Training), GH200 (Evaluation, Testing)
  • Hours used: 96 (Training)
  • Cloud Provider: Lambda Labs
  • Compute Region: us-central1
  • Carbon Emitted: 10.37 kg CO₂eq (Training)

Citation

BibTeX:

  @misc{rajendhran2025verifastscorespeedinglongformfactuality,
      title={VeriFastScore: Speeding up long-form factuality evaluation},
      author={Rishanth Rajendhran and Amir Zadeh and Matthew Sarte and Chuan Li and Mohit Iyyer},
      year={2025},
      eprint={2505.16973},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16973},
  }

Model Card Contact

rishanth@umd.edu
