arxiv:2412.02595

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Published on Dec 3, 2024
Authors:
Dan Su, et al.

Abstract

Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html

AI-generated summary

Enhanced dataset filtering techniques improve model accuracy and data quantity, enabling state-of-the-art performance in long-horizon training over 15T tokens.

Community

Hi team, thanks a lot for open-sourcing the Nemotron-CC dataset — really appreciate the effort behind building such a large corpus.

While exploring the data, I noticed that around 15% of the samples appear to be near-duplicates. I ran a MinHash-based deduplication and was able to filter out a number of repeated entries.
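
For reference, here is a minimal sketch of this kind of MinHash check (shown here with the datasketch library; the exact tooling, shingle size, permutation count, and threshold I used may differ, so treat these as illustrative choices):

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # hash permutations per signature
THRESHOLD = 0.8   # approximate Jaccard similarity cutoff for "near-duplicate"

def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from word 5-gram shingles of the text."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 5]) for i in range(max(1, len(words) - 4))}
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

def find_near_duplicates(records):
    """records: iterable of (warc_record_id, text) pairs. Yields (id, candidate duplicate ids)."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    signatures = {}
    for rec_id, text in records:
        sig = minhash_of(text)
        signatures[rec_id] = sig
        lsh.insert(rec_id, sig)
    for rec_id, sig in signatures.items():
        candidates = [c for c in lsh.query(sig) if c != rec_id]
        if candidates:
            yield rec_id, candidates
```

The pairs below are close to identical, so they match at essentially any reasonable threshold.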

Here are a few examples of near-duplicate samples that were identified:

quality_tag warc_record_id pair content1 content2
high 01b263ef-40a1-4cd6-bfc4-63737f750fa9 vs fac16667-7e18-43f0-884f-091f61a6f2bf generate a class with properties from an ActiveX control? Greetings! I am using Visual Studio 2008. I need to generate a wrapper class for an ActiveX control that has public methods and a public property. The generated class has no methods for dealing with the ConnectionString property. What do I have to do to get it? The definition of the class, using Visual Studio 6's OLE/COM Object Viewer, is below, along with the generated class files. Also, if you would be so kind, could you point me to a replacement for the OLE/COM object viewer? It doesn't seem to ship with VS 2008 (or at least, I've never found it. And what happened to word wrapping in this forum? As I'm typing this, "least" got split into two lines. Re: How to generate a class with properties from an ActiveX control? Which do you recommend I use? I mainly ask out of curiosity, since I already found an answer. I don't know if it's the best answer. I had been using "MFC Class from ActiveX Control". That gave me the public methods of my control, but it did not give me access to public properties. But when I used "MFC Class from Type Library", I got Set and Get methods for the property, and I was able to proceed from there generate a class with properties from an ActiveX control? Greetings! I am using Visual Studio 2008. I need to generate a wrapper class for an ActiveX control that has public methods and a public property. The generated class has no methods for dealing with the ConnectionString property. What do I have to do to get it? The definition of the class, using Visual Studio 6's OLE/COM Object Viewer, is below, along with the generated class files. Also, if you would be so kind, could you point me to a replacement for the OLE/COM object viewer? It doesn't seem to ship with VS 2008 (or at least, I've never found it. And what happened to word wrapping in this forum? As I'm typing this, "least" got split into two lines. Re: How to generate a class with properties from an ActiveX control? Which do you recommend I use? I mainly ask out of curiosity, since I already found an answer. I don't know if it's the best answer. I had been using "MFC Class from ActiveX Control". That gave me the public methods of my control, but it did not give me access to public properties. But when I used "MFC Class from Type Library", I got Set and Get methods for the property, and I was able to proceed from there.
high 64eb6c00-a268-4d1a-906e-b0640b047fa4 vs b868a813-3bd5-4c3d-bf25-dbed563d0191 The endless news cycle of 2020 is brutal. So, let's get straight to it. Here are some Hollywood headlines you might have missed. The Headlines 1. Netflix pushed out a legacy executive with two decades at the company in favor of someone with more international experience. The entire industry has been playing a game of […] Hollywood Headlines: Netflix's Big Shakeup, Wonder Woman's Delay Share The endless news cycle of 2020 is brutal. So, let's get straight to it. Here are some Hollywood headlines you might have missed. The Headlines 1. Netflix pushed out a legacy executive with two decades at the company in favor of someone with more international experience. The entire industry has been playing a game
low e4995a50-8961-447e-b6f8-05efe50b4956 vs 7dfae74a-36f2-482d-a897-209c58e35f14 [Advanced Chemdry You can trust Chem-Dry to deliver quality and value. ABOUT Chem-Dry is the world's leading carpet and upholstery cleaner as ranked by Entrepreneur Magazine for 27 years in a row as the #1 in category. Chem-Dry also earned an award from Franchise Direct as one of the top 100 global franchises. With more than 3,500 locations worldwide, Chem-Dry is the world's leading carpet cleaner with international coverage by locally-owned franchises. We use only the industry's finest, most powerful equipment and proprietary cleaning solutions to ensure the best clean for your family. SERVICES RESIDENTIAL Carpet Cleaning Area & Oriental Rugs Upholstery Cleaning Protectant and Sanitizer for Carpets and Upholstery Leather Cleaning & Restoration COMMERCIAL Our Cleaning Process Why Choose Chem-Dry for Your Business Commercial Services CONTACT US TODAY FOR A QUOTE! View all photos (17](SOTOGRANDE:Magnific and luxury villa located in the fine complex of Sotogrande, Las Margaritas, a beautiful typical Andalusian villa where you can enjoy Sotogrande, Costa del Sol, Cádiz, Spain € 2,500,000.00 Property ID: R2509715 House Bedrooms: 6 SOTOGRANDE:Magnific and luxury villa located in the fine complex of Sotogrande, Las Margaritas, a beautiful typical Andalusian villa where you can enjoy amazing patios in an expansive but delightful residence. A true family home where you can enjoy the real Spanish way of living with all its traditional features including high ceilings, patios, nice stairways, great outside areas and much more. The villa reaches an incredible 862 m2, surrounded by an amazing mature garden of pure enjoyment that opens to a unique courtyard with a water fountain just upon the entrance of the house. It's convenient double garage, cozy living and dining room with a large fireplace, comfortable TV room, kitchen with an island and breakfast area, laundry and maids quarters suits the needs of many. For those who enjoy entertainment on the lower level there is a fitness gym, shower room, steam bath, pool table/games and a cinema room, as well as a professional wine cellar. It is worth mentioning as well the underground heating and air conditioning in the lower ground floor. Along with the outdoor terraces, you may also enjoy a lovely dining experience outside, overlooking the pool area and gardens for that summer escape moment with your family and friends. Interested? Contact us! Thank you. Interested? Contact us! Please check following form submission errors. I would like to receive more information about Loistawa Homes properties) 6 Bedroom Villa in Sotogrande Gallery Description SOTOGRANDE:Magnific and luxury villa located in the fine complex of Sotogrande, Las Margaritas, a beautiful typical Andalusian villa where you can enjoy amazing patios in an expansive but delightful residence. A true family home where you can enjoy the real Spanish way of living with all its traditional features including high ceilings, patios, nice stairways, great outside areas and much more. The villa reaches an incredible 862 m2, surrounded by an amazing mature garden of pure enjoyment that opens to a unique courtyard with a water fountain just upon the entrance of the house. It's convenient double garage, cozy living and dining room with a large fireplace, comfortable TV room, kitchen with an island and breakfast area, laundry and maids quarters suits the needs of many. 
For those who enjoy entertainment on the lower level there is a fitness gym, shower room, steam bath, pool table/games and a cinema room, as well as a professional wine cellar. It is worth mentioning as well the underground heating and air conditioning in the lower ground floor. Along with the outdoor terraces, you may also enjoy a lovely dining experience outside, overlooking the pool area and gardens for that summer escape moment with your family and friends.
low f6c7b823-110e-4b73-9636-598e82e6a9b0 vs 17a44416-e4d3-4b40-9e93-5f9af3a9bb24 Description Cozy 1 bedroom apartment for rent in BKK3 area – Phnom Penh. This apartment comprises a closed-door kitchen that comes with a range hood, electric stove, bottom and top pantry cabinets, and a spacious bedroom with a 2-seater desk and a smart TV. There is also a balcony attached to the bedroom that overlooks the quiet street. Whereas the bathroom sits on the right side next to the main door. This apartment is located in BKK3 area; 450m from BELTEI International School, 500m from Lucky Express Supermarket BKK3/ East-West International School, 1km from Chip Mong Noro Mall, 1.2km from Amass Central Tower, 1.3km from Russian Market, and 2.4 Description Cozy 1 bedroom apartment for rent in BKK3 area – Phnom Penh. This apartment comprises a closed-door kitchen that comes with a range hood, electric stove, bottom and top pantry cabinets, and a spacious bedroom with a 2-seater desk and a smart TV. There is also a balcony attached to the bedroom that overlooks the quiet street. Whereas the bathroom sits on the right side next to the main door. This apartment is located in BKK3 area; 450m from BELTEI International School, 500m from Lucky Express Supermarket BKK3/ East-West International School, 1km from Chip Mong Noro Mall, 1.2km from Amass Central Tower, 1.3km from Russian Market, and 2.4km from Olympic National Stadium. Rental price: $270 up to $350/ month (backside doesn't have a balcony/ front side has a private balcony.)

These kinds of duplicates are fairly common throughout the dataset and may impact downstream model training. Just wanted to raise this in case it’s helpful for future iterations.

Thanks again for sharing this great resource!

Best,
Brent

Paper author

Hi Brent! Thank you for sharing this observation. This could be due to the thresholds / hyperparameters we used for the global fuzzy deduplication. We will try to investigate in a bit more detail in the future.

I hope this message finds you well. I've been thoroughly impressed by the rigor and innovation demonstrated in your work on the Nemotron-CC dataset. As I explore this valuable resource, I would be grateful for some clarification regarding two key aspects of the dataset construction:

1. Dataset Composition Questions:
For the 15T-token dataset used in Nemotron-CC's performance evaluation, it is stated to consist of:
7.2T tokens from Nemotron-CC's own dataset
7.8T tokens from a fixed-ratio mix of specialized datasets (math, code, papers, books, patents, Wikipedia, etc.)

Could you please provide details on:
a) How was the original 6.3T-token Nemotron-CC dataset expanded to 7.2T tokens?
b) What is the exact composition of the 7.8T-token mixed dataset - specifically, which datasets were used for each category (math, code, papers, etc.) and in what proportions?

2. Deduplication Process:
For Nemotron-CC's core 6.3T-token dataset, could you clarify:
a) What deduplication methods were employed (e.g., exact matching, fuzzy matching, semantic deduplication)?
b) Was temporal deduplication performed? For example, if a webpage with the same URL (or identical content) is crawled in multiple years (e.g., 2013, 2014, 2015), how is this handled?

Thank you for your time and for making this valuable resource available!

Paper author

Hi AshleyLL! Thank you for your interest and questions. We have added some clarifications to the v2 version of the paper on arXiv: https://arxiv.org/abs/2412.02595 .

1.(a) For the 15T token training run, a two-phase curriculum was employed that is described in more detail in https://arxiv.org/abs/2412.15285v1 . The first phase of 9T tokens used 59% English Common Crawl data (5.31T) and the second phase of 6T tokens used 31% (1.86T), for a combined total of 47.8% (7.17T). In the first phase, we used medium, medium-high, and high
quality data (real and synthetic), and in the second phase we used only high quality data (real and synthetic). The weights were a bit different for different quality buckets. Generally speaking, for benchmarks we found using about 4-8 epochs of high quality data is best before it becomes more beneficial to start using tokens from medium-high and medium quality buckets. That said, we also wanted to make sure we have diverse long-tail data so we included more medium-quality data than is optimal from purely a benchmark perspective. The dataset is available broken down by buckets and synthetic data type, so we encourage the community to do their own blending and curriculum experiments.
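
To spell out the arithmetic behind those Common Crawl token counts (just a sanity check of the percentages quoted above, not a statement of the exact blend):

```python
# English Common Crawl share of the two-phase 15T-token curriculum.
phase1_total, phase1_cc_frac = 9e12, 0.59   # phase 1: 9T tokens, 59% English CC
phase2_total, phase2_cc_frac = 6e12, 0.31   # phase 2: 6T tokens, 31% English CC

phase1_cc = phase1_total * phase1_cc_frac   # 5.31T
phase2_cc = phase2_total * phase2_cc_frac   # 1.86T
total_cc = phase1_cc + phase2_cc            # 7.17T

print(f"{phase1_cc / 1e12:.2f}T + {phase2_cc / 1e12:.2f}T = {total_cc / 1e12:.2f}T")
print(f"Share of the 15T run: {total_cc / 15e12:.1%}")   # 47.8%
```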

Please also see the Nemotron-H paper, though that used an expanded 3 or 4 phase curriculum: https://arxiv.org/abs/2504.03624

1.(b) The 27% non-English-Common-Crawl portion includes a lot of internal datasets we cannot spell out here. But the category breakdown is roughly: Books and patents (9%), papers (9%), code (5%), conversational (3%), Wikipedia (1%). (See Table 12 in Appendix D of the updated paper.)

2.(a) We performed global exact and fuzzy deduplication across all 99 Common Crawl snapshots used. Additionally, we split each chunk into 8 roughly equal parts and did exact substring deduplication on each chunk.

2.(b) We did not do any temporal deduplication. If the same URL occurred in multiple years, then there could be multiple copies in the dataset. That said, if the content did not substantially change over the years, then most copies would probably get removed by the global exact and/or fuzzy deduplication from 2.(a).
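
For illustration only, the exact-duplicate stage of a pipeline like the one described in 2.(a) can be sketched as hashing normalized document text and keeping the first occurrence. This is a toy single-machine version with made-up helper names; the actual pipeline is distributed and also includes the fuzzy and exact-substring stages mentioned above:

```python
import hashlib

def content_key(text: str) -> str:
    """Hash of lightly normalized text, used as an exact-duplicate key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def exact_dedup(records):
    """records: iterable of (doc_id, text). Keeps only the first copy of each distinct text."""
    seen = set()
    for doc_id, text in records:
        key = content_key(text)
        if key in seen:
            continue  # drop exact duplicate
        seen.add(key)
        yield doc_id, text
```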

Hi mkliegl-nv,

Thanks for your excellent work!

I have the same confusion: where do the 15T training tokens come from?

I just cannot see how the numbers add up.

Could you please pin down the numbers and show the breakdown and sources of the 15T data?

Paper author

I'm sorry if this is confusing. Could you elaborate on what you mean by "I just cannot see how the numbers add up"? I suspect there may be some basic misunderstanding. For example, we are not talking about 15T tokens of unique data here. We are talking about a long-horizon training run for 15T tokens. Some of the higher-quality datasets will be trained on for multiple epochs, as is common practice. Please also see the references I linked to. Hope this helps.

Thanks a lot, mkliegl-nv!
Sorry for my confusion. I have been trying to figure out how your papers link together.
Let's start from the beginning:

  1. How did the original 6.3T-token Nemotron-CC dataset get expanded into 7.2T tokens?
  2. Then, 15T - 7.2T = 7.8T tokens: where does this data come from?
    I cannot seem to find a match in the other two papers.
  3. I have seen your update in the latest version of the Nemotron-CC paper, but it jumps directly to 9T without explaining the linkage to the original 6.3T and 7.2T of Nemotron-CC data.
    How is the 9T related to the 6.3T and 7.2T?
  4. It is okay to have 4-8 epochs of high-quality data, but which sub-portion of the dataset is that, and how much is it? 200B, 500B, or 1T tokens?
    Thanks a lot!

So I think, basically, you're asking for the exact datasets, blend and curriculum used for the 15T token run. That's a valid question but beyond the scope of this paper and what I can share. As mentioned, various internal datasets were used, but this paper was focused only on English Common Crawl data which we have publicly released. For the Crawl portion, I have given the general idea above in 1(a), the key point being 4-8 epochs of high quality Nemotron-CC data (real and synthetic - e.g., 2 epochs of real HQ + 2 epochs of synthetic diverse QA HQ + 1 epoch of synthetic distill HQ would count as 2 epochs of real + 3 epochs of synthetic). In any case the curriculum used for that model has been superseded. See, e.g., the curriculum described in the Nemotron-H paper, Figure 4, for a more recent high-level curriculum used. The Nemotron-H models have also been released publicly for research purposes.

Hi mkliegl-nv,
Thanks so much for your prompt reply!
I am not asking for the exact content of the dataset.
I am just trying to replicate the results in your paper as closely as possible.
Please feel free to withhold any information you cannot share; I fully understand.
Could you please:

  1. help me make the numbers add up (at least), starting from the 6.3T and 7.2T;
  2. point me to how I can replicate the results in your paper as closely as possible?
    Thanks a lot again for your kind help!
Paper author

Here's the exact Nemotron-CC subset breakdown for the 8B-15T run described in the Nemotron-CC paper. So roughly 5 epochs of HQ real data and 5.8 epochs of HQ synthetic data of various types in total.

Screenshot 2025-06-04 at 12.20.30 AM.png


Below is a summary of my current understanding. Could you please confirm whether this understanding and these calculations are accurate? If not, I would be grateful for any clarification or corrections.

  1. Data Labeling and Categorization
    The publicly downloadable version of the Nemotron-CC dataset has been scored and classified into five quality labels:

High / Medium-High / Medium / Medium-Low / Low, using Mixtral 8x22B-Instruct, Nemotron-4-340B-Instruct, and the DCLM classifier.

Specific categories like HQ-real have already been pre-sorted into their respective folders in the released dataset.

  2. Multi-pass Training Strategy
    As shown in the provided table, when multi-round training on a specific subset (e.g., "HQ-real") is mentioned, it means that all the data within that subset is used multiple times during training.

Based on this interpretation, I have calculated the total amount of token usage from the Nemotron-CC dataset across all categories (Table 1).

PS: Could you please confirm if the subset 'HQ-wrap_medium' in the table corresponds to High-Wikipedia (synthetic 372.9B tokens) as mentioned in the paper?

The ~7.17T tokens I calculated from the Nemotron-CC dataset represent exactly the same amount as the ~7.2T tokens attributed to Nemotron-CC in the paper.

  3. Non-Nemotron-CC Data
    The remaining ~7.8T tokens (52.2% of total) come from other non-Nemotron-CC sources, including:

Books & Patents
Papers
Code
Conversational Data
Wikipedia
These categories maintain the same relative proportions as observed in the 1T token ablation study, where the distribution was approximately:

Books & Patents : Papers : Code : Conversational Data : Wikipedia = 9 : 9 : 5 : 3 : 1

From this, I inferred the breakdown (Table 2).

  4. Distribution Across Stages
    In Stage 1 (9T total), Nemotron-CC contributes ~5.31T tokens, while non-Nemotron-CC data contributes ~3.69T tokens.

In Stage 2 (6T total), Nemotron-CC contributes ~1.86T tokens, and non-Nemotron-CC data contributes ~4.14T tokens.

Using the fixed ratio mentioned above, I derived the per-category token usage for each stage (Tables 3 and 4).

Thank you very much for your time and consideration.
TABLE1.JPG
TABLE2.JPG
TABLE3.JPG
TABLE4.JPG

Great! Thanks a lot for your help, mkliegl-nv!

Thank you for your earlier clarification regarding the Nemotron-CC dataset — it's an extremely valuable resource.

I have a couple of follow-up questions regarding the training curriculum:

The table you kindly provided outlines various subsets (e.g., HQ-real, HQ-extract_knowledge) used in different phases of pretraining. However, the token counts mentioned do not seem to align with the total tokens available in the public Nemotron-CC dataset. Could you clarify how the categories in the public Nemotron-CC dataset were processed to form these specific subsets? Specifically:
Does the public Nemotron-CC dataset include explicit quality labels (e.g., "High," "Medium-High," "Medium") for each sample?
Are there data type labels distinguishing real vs. synthetic data? If synthetic, are subtypes like "Distill" or "Diverse QA Pairs" specified?
Were certain categories filtered out, upsampled, downsampled, or reweighted to achieve the token counts shown in the table?

Thank you in advance for any insights — your input would be greatly appreciated.

Paper author
  1. The public version of Nemotron-CC is the complete dataset we used in the paper. Token counts match up for the tokenizer we used (same as the Nemotron-H tokenizer, you can find it on HuggingFace). If you used a different tokenizer, things can of course look a bit different. For example, you will find in the paper (Table 2) that we have 553B high-quality real tokens in the dataset, and in the table above 2.743T total "HQ-real" tokens were used => 2.743T / 553B = 4.96 epochs, or 5x upsampling if you prefer. "HQ-real" is the "contrib/Nemotron/Nemotron-CC/data-jsonl/quality=high/kind=actual/kind2=actual/" part of the public dataset.
  2. I think this is a great question but, as mentioned, this is beyond the scope of the Nemotron-CC paper and unfortunately I really can't be of much help here. Something like the OLMo 2 effort may be more interesting for you if you need a completely open recipe. And of course it's worth keeping an eye on HuggingFace, as NVIDIA and others in the community are constantly contributing valuable new datasets.
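
To make the epoch arithmetic in point 1 concrete (the figures are the ones quoted above; this is just the division, nothing more):

```python
def epochs_used(tokens_trained_on: float, unique_tokens_in_bucket: float) -> float:
    """How many passes over a bucket a given training token budget corresponds to."""
    return tokens_trained_on / unique_tokens_in_bucket

# HQ-real: 2.743T tokens used in the 15T run vs. 553B unique tokens in the dataset (paper Table 2)
print(f"{epochs_used(2.743e12, 553e9):.2f} epochs")  # ~4.96, i.e. roughly 5x upsampling
```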

Based on the interpretation of your earlier responses — that the non-Nemotron-CC portion follows an approximate distribution of Books & Patents : Papers : Code : Conversational Data : Wikipedia in the ratio 9 : 9 : 5 : 3 : 1 — a proxy dataset has been constructed to reflect this proportion.

While the relative proportions of the different data categories remain consistent across the two pre-training stages, the total number of tokens used differs significantly between them. Could you please clarify whether this difference arises from different processing or weighting applied to the same underlying dataset, or whether distinct datasets were used in each stage?

It would be helpful to hear your perspective on whether these choices of data sources and their proportions are broadly in line with what you would consider effective for such a training corpus.

TABLE1.JPG
TABLE2.JPG

Hi mkliegl-nv,

I appreciate your insights.

I'm trying to better understand the dataset progression across your Nemotron research. The documentation shows some variations in reported dataset sizes that I'd like to clarify:

  1. Could you explain how the Nemotron-CC dataset grew from the initially reported 6.3T tokens to 7.2T tokens?

From the table you gave, I have tried to do the reconciliation but failed; maybe I am missing something here?

  2. Is the 7.2T token count the one labelled "7169805149903" in your table?

If this could be confirmed, then we will be able to understand the basic starting point.

  3. Regarding the 15T tokens of total training data, I notice a gap (15T - 7.2T = 7.8T tokens) that isn't clearly attributed in the publications. Where does this additional data come from?

I'm not seeking confidential dataset details, but rather trying to understand the methodology sufficiently to align my research with yours and also replicate your work as much as possible. Any clarification you can provide would be valuable for my replication efforts.

Thank you for your consideration.

  1. Copy&pasting from one of my other responses above. "Token counts match up for the tokenizer we used (same as the Nemotron-H tokenizer, you can find it on HuggingFace). If you used a different tokenizer, things can of course look a bit different. For example, you will find in the paper (Table 2) that we have 553B high-quality real tokens in the dataset, and in the table above 2.743T total "HQ-real" tokens were used => 2.743T / 553B = 4.96 epochs, or 5x upsampling if you prefer. "HQ-real" is the "contrib/Nemotron/Nemotron-CC/data-jsonl/quality=high/kind=actual/kind2=actual/" part of the public dataset."
  2. Yes.
  3. The remaining 7.8T tokens are from datasets other than Nemotron-CC. You can see the curriculum and Nemotron-H papers to get some sense of the categories.

Thank you for providing detailed data.

1. I calculated the values based on the table you previously provided and noticed a slight discrepancy.
For example, summing and rounding the "Total tokens/Total Epochs (Rounded value)" column gives 4626.1B, whereas summing the corresponding data in the paper gives 4616.8B.
Could you please clarify the source of this minor difference and how it can be resolved?

TABLE2.JPG

  2. Regarding the 15T pre-training dataset, based on my understanding of your previous explanation, could you confirm whether the relative proportions of the five categories (books and patents, papers, code, conversational data, and Wikipedia) in the 7.8T non-Nemotron-CC data are indeed 9:9:5:3:1? Accordingly, would their token counts be approximately 2.6T, 2.6T, 1.444T, 0.867T, and 0.289T, respectively? (A quick check of this split is shown below.)
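
For reference, here is the quick check behind those per-category numbers, assuming the 9:9:5:3:1 ratio applies exactly to the 7.8T non-Nemotron-CC tokens:

```python
# Split 7.8T non-Nemotron-CC tokens by the assumed 9:9:5:3:1 category ratio.
total_non_cc = 7.8e12
ratio = {"books_patents": 9, "papers": 9, "code": 5, "conversational": 3, "wikipedia": 1}
parts = sum(ratio.values())  # 27

for category, weight in ratio.items():
    tokens = total_non_cc * weight / parts
    print(f"{category:>14}: {tokens / 1e12:.3f}T")
# books_patents: 2.600T, papers: 2.600T, code: 1.444T,
# conversational: 0.867T, wikipedia: 0.289T
```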

Thanks again for your time and clarification.
