Update README.md
README.md
CHANGED
@@ -33,7 +33,7 @@ Hyperparameters can be found in our [lingua config file](https://huggingface.co/
 
 Comma v0.1-1T was trained only on English-language data and code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).
 It will likely have poor performance on other languages or programming languages.
-While we aimed to train solely on openly licensed data, license laundering and inaccurate metadata can result in erroneous license information in the Common Pile (for further discussion of this limitation, please see [our paper](https://
+While we aimed to train solely on openly licensed data, license laundering and inaccurate metadata can result in erroneous license information in the Common Pile (for further discussion of this limitation, please see [our paper](https://huggingface.co/papers/2506.05209)).
 Consequently, we cannot make a guarantee that Comma v0.1-1T was trained exclusively on openly licensed text.
 When preparing Comma v0.1's pre-training data, we made use of the Toxicity tagger from [Dolma](https://github.com/allenai/dolma) to attempt to remove problematic content.
 However, Comma v0.1-1T may nevertheless reflect social biases present in its training data.