---
license: apache-2.0
datasets:
- oscar-corpus/OSCAR-2109
language:
- pl
- en
pipeline_tag: text-generation
library_name: transformers
---

# B-GPT_pl_en_simultaneous

This is a bilingual GPT-2 style model. For the first half of training, the model was trained only on Polish data. For the second half of training, it was trained on a 50%-50% mix of Polish and English data, so by the end of training 75% of the data the model had seen was Polish and 25% was English. The tokenizer was trained on data with the same overall proportions as those seen by the language model at its final training step.

This model was released alongside the paper [On the Acquisition of Shared Grammatical Representations in Bilingual Language Models](https://arxiv.org/abs/2503.03962), which contains more details about the models. Additionally, the [OSF page](https://osf.io/5cw2e/) provides all code and data related to the project. 

## Model Details

All models are trained with a [CLS] token (equivalent to [BOS]) prepended and a [SEP] token (equivalent to [EOS]) separating sequences.
For best results, make sure [CLS] is prepended to your input sequence (see the sample usage below and the short sketch after the model details)!
Details for this model specifically:

* Architecture: gpt2
* Parameters: 124770816
* Maximum sequence length: 512 tokens
* Training tokens: 12B
* Vocabulary size: 50000
* Compute cost: ~9 NVIDIA A6000 GPU hours
* CO2 Emission: 1.17 kg
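
As a quick illustration of the [CLS] convention above, the snippet below prepends the tokenizer's [CLS] token to a prompt before encoding it. This is a minimal sketch, assuming the tokenizer exposes the special tokens (it falls back to the literal `[CLS]` string otherwise); if in doubt, inspect `tokenizer.special_tokens_map` first.

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_pl_en_simultaneous")

# Check how the special tokens are registered (assumption: [CLS]/[SEP] appear here)
print(tokenizer.special_tokens_map)

# Prepend [CLS] to the prompt before encoding, as recommended above
cls = tokenizer.cls_token or "[CLS]"
prompt = cls + "To jest przykładowe zdanie."  # "This is an example sentence."
input_ids = tokenizer(prompt)["input_ids"]
print(tokenizer.convert_ids_to_tokens(input_ids))
```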

Training dataset: [OSCAR 2021/09](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)

Checkpoints are taken at training steps: 0, 10000, 20000, 30000, 40000, 50000, 64000, 64010, 64020, 64030, 64040, 64050, 64060, 64070, 64080, 64090, 64100, 64110, 64120, 64130, 64140, 64150, 64160, 64170, 64180, 64190, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 66000, 67000, 68000, 69000, 70000, 80000, 90000, 100000, 110000, 120000, 128000.

## Use This Model

Load the model:

Note: if you do not specify a revision, the final checkpoint of the model is loaded. See the list of checkpoints above; each checkpoint's training step is used as its revision name.

```
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the final checkpoint (training step 128000) of this model
tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_pl_en_simultaneous")
model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_pl_en_simultaneous", revision="128000")
```
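
To work with a different checkpoint, pass its training step as the `revision`. If you want to enumerate the available revisions programmatically, the sketch below uses `huggingface_hub.list_repo_refs`; it assumes the checkpoints are stored as branches named after their training steps, which is what the checkpoint list above suggests.

```
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM

# List the branches of the model repo; each checkpoint revision should show up here
refs = list_repo_refs("catherinearnett/B-GPT_pl_en_simultaneous")
print(sorted(branch.name for branch in refs.branches))

# Load an intermediate checkpoint, e.g. the mid-training step 64000
model_early = AutoModelForCausalLM.from_pretrained(
    "catherinearnett/B-GPT_pl_en_simultaneous", revision="64000"
)
```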

Text Generation:

```
from transformers import pipeline

# Build a text-generation pipeline from the final checkpoint of this model
pipe = pipeline("text-generation", model="catherinearnett/B-GPT_pl_en_simultaneous")

print(pipe("I am a", max_length=20)[0]["generated_text"])
```
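
Since prepending [CLS] is recommended above for best results, a generation call that does this explicitly might look like the following. This is a minimal sketch, not the authors' reference usage: the prompt, the sampling parameters, and the `cls_token` fallback are illustrative assumptions.

```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_pl_en_simultaneous")
model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_pl_en_simultaneous", revision="128000")

# Prepend the [CLS] token to the prompt, as recommended in the model details above
cls = tokenizer.cls_token or "[CLS]"
inputs = tokenizer(cls + "Wczoraj wieczorem", return_tensors="pt")  # "Yesterday evening"

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.95)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```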

## Citation

If you use this model, please cite:

```
@article{arnett2025acquisition,
  title={On the Acquisition of Shared Grammatical Representations in Bilingual Language Models},
  author={Arnett, Catherine and Chang, Tyler A and Michaelov, James A and Bergen, Benjamin K},
  journal={arXiv preprint arXiv:2503.03962},
  year={2025}
}
```