license: llama3.1
datasets:
- teknium/OpenHermes-2.5
- nvidia/Llama-Nemotron-Post-Training-Dataset
language:
- en
base_model:
- meta-llama/Llama-3.1-70B
pipeline_tag: text-generation
Model Name: AstroSage-70B
Version: 1.0
Release Date: 2025-05-20
Developed by: AstroMLab (Tijmen de Haan, Yuan-Sen Ting, Tirthankar Ghosal, Tuan Dung Nguyen, Alberto Accomazzi, Emily Herron, Vanessa Lama, Azton Wells, Nesar Ramachandra, Rui Pan)
Corresponding Contact: Tijmen de Haan (tijmen.dehaan@gmail.com)
Funded by:
- Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility at Oak Ridge National Laboratory (U.S. Department of Energy).
- Microsoft’s Accelerating Foundation Models Research (AFMR) program.
- World Premier International Research Center Initiative (WPI), MEXT, Japan.
- National Science Foundation (NSF).
- UChicago Argonne LLC, Operator of Argonne National Laboratory (U.S. Department of Energy).
License: Llama 3.1 Community License
Reference Paper: Tijmen de Haan et al. (2025). "AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model" https://arxiv.org/abs/2505.17592
Model Details
Model Type: Autoregressive transformer-based LLM, specialized in astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation.
Base Model: Meta-Llama-3.1-70B
Model Architecture: AstroSage-70B is a fine-tuned derivative of the Meta-Llama-3.1-70B architecture, making no architectural changes. The Llama-3.1-70B-Instruct tokenizer is also used without modification.
Context Length: Fine-tuned on 8192-token sequences. Base model was trained to 128k context length.
Model Description
Overview: AstroSage-70B is a large-scale, domain-specialized language model tailored for research and education in astronomy, astrophysics, space science, cosmology, and astronomical instrumentation. It builds on the Llama-3.1-70B foundation model, enhanced through extensive continued pre-training (CPT) on a vast corpus of astronomical literature, further refined with supervised fine-tuning (SFT) on instruction-following datasets, and finally combined, via parameter averaging (model merging), with other popular fine-tunes. AstroSage-70B aims to achieve state-of-the-art performance on astronomy-specific tasks, providing researchers, students, and enthusiasts with an advanced AI assistant. This 70B-parameter model represents a significant scaling up from AstroSage-8B. The primary enhancements over AstroSage-8B are:
- Stronger base model, higher parameter count for increased capacity
- Improved datasets
- Improved learning hyperparameters
- Reasoning capability (can be enabled or disabled at inference time)
Training Lineage
- Base Model: Meta-Llama-3.1-70B.
- Continued Pre-Training (CPT): The base model underwent 2.5 epochs of CPT (168k GPU-hours) on a specialized astronomy corpus (details below, largely inherited from AstroSage-8B) to produce AstroSage-70B-CPT. This stage imbues the model with domain-specific knowledge and language nuances.
- Supervised Fine-Tuning (SFT): AstroSage-70B-CPT was then fine-tuned for 0.6 epochs (13k GPU-hours) using astronomy-relevant and general-purpose instruction-following datasets, resulting in AstroSage-70B-SFT.
- Final Mixture: The released AstroSage-70B model is created via parameter averaging / model merging (sketched below):
  - DARE-TIES with `rescale: true` and `lambda: 1.2`
  - AstroSage-70B-CPT designated as the "base model"
  - 70% AstroSage-70B-SFT (density 0.7)
  - 15% Llama-3.1-Nemotron-70B-Instruct (density 0.5)
  - 7.5% Llama-3.3-70B-Instruct (density 0.5)
  - 7.5% Llama-3.1-70B-Instruct (density 0.5)
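As a rough illustration, the recipe above could be expressed as a mergekit-style configuration. The snippet below is a hypothetical sketch, not the authors' actual merge file: the repository ids, the option names, and the placement of `rescale` and `lambda` are assumptions.

```python
# Hypothetical sketch of the DARE-TIES merge recipe as a mergekit-style config.
# Repository ids and option placement are assumptions, not the released recipe file.
import yaml

merge_config = {
    "merge_method": "dare_ties",
    "base_model": "AstroMLab/AstroSage-70B-CPT",  # assumed repo id
    "models": [
        {"model": "AstroMLab/AstroSage-70B-SFT",  # assumed repo id
         "parameters": {"weight": 0.70, "density": 0.7}},
        {"model": "nvidia/Llama-3.1-Nemotron-70B-Instruct",
         "parameters": {"weight": 0.15, "density": 0.5}},
        {"model": "meta-llama/Llama-3.3-70B-Instruct",
         "parameters": {"weight": 0.075, "density": 0.5}},
        {"model": "meta-llama/Llama-3.1-70B-Instruct",
         "parameters": {"weight": 0.075, "density": 0.5}},
    ],
    "parameters": {"rescale": True, "lambda": 1.2},
    "dtype": "bfloat16",
}

# Write the config so it could be passed to mergekit, e.g.:
#   mergekit-yaml astrosage_merge.yml ./AstroSage-70B-merged
with open("astrosage_merge.yml", "w") as f:
    yaml.safe_dump(merge_config, f, sort_keys=False)
```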
Intended Use: Like AstroSage-8B, this model can be used for a variety of LLM applications, including:
- Providing factual information and explanations in astronomy, astrophysics, cosmology, and instrumentation.
- Assisting with literature reviews and summarizing scientific papers.
- Answering domain-specific questions with high accuracy.
- Brainstorming research ideas and formulating hypotheses.
- Assisting with programming tasks related to astronomical data analysis.
- Serving as an educational tool for learning astronomical concepts.
- Potentially forming the core of future agentic research assistants capable of more autonomous scientific tasks.
We hope that the enhanced intelligence and reasoning ability of AstroSage-70B relative to AstroSage-8B will enable additional use cases.
Training Data
AstroSage-70B's training data is split into pre-training (Continued Pre-Training, or CPT for short) and post-training (Supervised Fine-Tuning, or SFT).
Continued Pre-Training (CPT) Data: The CPT data for AstroSage-70B starts with the AstroSage-8B training dataset (see https://arxiv.org/abs/2411.09012 for more detail) and adds:
- Data Processing & Cleaning: We apply `ftfy` post-processing (see the sketch below).
- Replay Data: We add a random selection of FineWeb samples to each training epoch. The added FineWeb samples are different for each epoch of training.
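As a minimal illustration of this kind of `ftfy` cleaning pass (the exact options used for the AstroSage corpus are not specified, so this is illustrative rather than the actual pipeline):

```python
# Minimal sketch of an ftfy cleaning pass; options are illustrative, not the
# exact configuration used for the AstroSage corpus.
import ftfy

raw = "The CMB temperature is 2.725â€‰K (Fixsen 2009)."  # mojibake example
clean = ftfy.fix_text(raw)  # repairs encoding artifacts and normalizes the text
print(clean)
```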
Supervised Fine-Tuning (SFT) Data for AstroSage-70B-SFT:
The SFT dataset is a diverse mix of astronomy-specific and general-purpose instruction-following data, totaling approximately 8.7 GB and over 7.5 million entries. The components are:
- NVIDIA Reasoning, Science: 2048.0 MB (252,847 entries)
- AstroSage Q&A: 1983.7 MB (4,683,569 entries) - Astronomy-specific question-answer pairs, see https://arxiv.org/abs/2411.09012 / https://doi.org/10.1038/s41598-025-97131-y
- Teknium OpenHermes 2.5: 1621.0 MB (1,001,551 entries)
- cosmosage Q&A: 616.0 MB (1,354,733 entries) - Cosmology-specific question-answer pairs, see https://arxiv.org/abs/2407.04420 / https://doi.org/10.1016/j.ascom.2025.100934
- NVIDIA Reasoning, Code: 600.0 MB (12,170 entries)
- NVIDIA Reasoning, Math: 600.0 MB (33,236 entries)
- NVIDIA Reasoning, Chat: 243.1 MB (36,395 entries)
- Miscellaneous Other Astronomy & Instruction Data: 6 additional datasets totaling 810.2 MB (125,923 entries)
Evaluation
Quantitative evaluation on the AstroMLab-1 benchmark shows state-of-the-art performance: AstroSage-70B answers 86.2% of questions correctly, a higher score than any other model evaluated at the time of writing (May 2025).
Bias, Risks, and Limitations
- Knowledge Cutoff: The model's knowledge is bounded by two cutoffs: December 2023 for the Llama-3.1-70B foundation model, and January 2024 for the astro-ph papers used in the training dataset. The model will not be aware of developments in astronomy after these dates.
- Hallucinations & Factual Accuracy: All LLMs, including AstroSage-70B, can generate outputs that sound plausible but are incorrect or nonsensical. All scientific or factual claims must be verified by users.
- Inherited Biases: The model may reflect biases present in its extensive training data (e.g., from scientific literature, general web data).
- Errors in Training Data: After training completed, the AstroSage Q&A SFT dataset was found to contain a non-trivial proportion of references to "the article" or "the text", artifacts of how the synthetic training data was generated. This insufficient cleaning means AstroSage-70B was inadvertently trained to occasionally refer to information that is not available in the current conversation. We find, however, that this failure mode is fairly rare.
- Depth of Specialization: AstroSage-70B is highly specialized, having been trained on all astro-ph papers from April 2007 to January 2024. However, LLM training does not result in complete memorization of all facts. We find it excels in topics that are discussed in approximately five or more papers.
Chat Template
AstroSage-70B follows the Llama-3.1 chat template. For example:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in astronomy, astrophysics, space science, cosmology, and astronomical instrumentation. Your role is to provide helpful, factual answers to the user's query.<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain the ISW effect.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
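A minimal inference sketch with Hugging Face Transformers is shown below. The repository id `AstroMLab/AstroSage-70B` is an assumption based on the model name; the pipeline applies the chat template above automatically.

```python
# Minimal usage sketch; repo id is assumed, adjust dtype/device settings to your hardware.
import torch
import transformers

model_id = "AstroMLab/AstroSage-70B"  # assumed repository id
pipe = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": (
        "You are an expert in astronomy, astrophysics, space science, cosmology, "
        "and astronomical instrumentation. Your role is to provide helpful, "
        "factual answers to the user's query."
    )},
    {"role": "user", "content": "Explain the ISW effect."},
]

# The pipeline formats `messages` with the Llama-3.1 chat template shown above.
out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])
```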
Enabling Reasoning (a.k.a. Thinking, Chain-of-Thought)
Like o1, QwQ, and DeepSeek-R1, AstroSage-70B is capable of reasoning through a problem before giving an answer. To enable this:
- Set the system prompt to `detailed thinking on`
- Prefill the assistant response with `<think>`
To enable reasoning for the example above, you would give:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

detailed thinking on<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain the ISW effect.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<think>
```
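As a concrete sketch of the same recipe in code (again assuming the repository id `AstroMLab/AstroSage-70B`), the chat template is applied and the `<think>` prefill is appended before generation:

```python
# Reasoning-mode sketch: system prompt "detailed thinking on" plus a "<think>" prefill.
# The repository id is assumed; generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AstroMLab/AstroSage-70B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Explain the ISW effect."},
]

# Render the Llama-3.1 chat template with the assistant header, then prefill "<think>"
# so the model reasons before producing its final answer.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>"

# add_special_tokens=False avoids inserting a second <|begin_of_text|> token.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```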