radinplaid committed on
Commit
3c69df2
·
verified ·
1 Parent(s): 4a094ac

Upload folder using huggingface_hub

README.md CHANGED
@@ -21,34 +21,40 @@ model-index:
21
  metrics:
22
  - name: BLEU
23
  type: bleu
24
- value: 28.58
25
  - name: CHRF
26
  type: chrf
27
- value: 57.46
28
  ---
29
 
30
 
31
  # `quickmt-zh-en` Neural Machine Translation Model
32
 
33
- # Usage
34
 
35
- ## Install `quickmt`
36
 
37
  ```bash
38
  git clone https://github.com/quickmt/quickmt.git
39
  pip install ./quickmt/
40
- ```
41
 
42
- ## Download model
43
-
44
- ```bash
45
  quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
46
  ```
47
 
48
- ## Use model
49
-
50
- Inference with `quickmt`:
51
-
52
  ```python
53
  from quickmt import Translator
54
 
@@ -63,125 +69,20 @@ t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但
63
  t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
64
  ```
65
 
66
- The model is in `ctranslate2` format, and the tokenizers are `sentencepiece`, so you can use the model files directly if you want. It would be fairly easy to get them to work with e.g. [LibreTranslate](https://libretranslate.com/) which also uses `ctranslate2` and `sentencepiece`.
67
 
68
- # Model Information
69
-
70
- * Trained using [`eole`](https://github.com/eole-nlp/eole)
71
- - It took about 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
72
- * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
73
- * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
74
 
75
  ## Metrics
76
 
77
- BLEU and CHRF2 calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("zho_Hans"->"eng_Latn").
78
-
79
- "Time" is the time to translate the following input with a single CPU core:
80
-
81
- > 2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。
82
-
83
- | Model | bleu | chrf2 | Time (s) |
84
- | -------------------------------- | ----- | ----- | ---- |
85
- | quickmt/quickmt-zh-en | 28.58 | 57.46 | 0.670 |
86
- | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.838 |
87
- | facebook/m2m100_418M | 18.96 | 50.06 | 11.5 |
88
- | facebook/nllb-200-distilled-600M | 26.22 | 55.17 | 13.2 |
89
- | facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 | 23.6 |
90
- | facebook/m2m100_1.2B | 24.68 | 54.68 | 25.7 |
91
- | google/madlad400-3b-mt | 28.74 | 58.01 | ??? |
92
-
93
- `quickmt-zh-en` is the fastest and delivers fairly high quality.
94
-
95
- Helsinki-NLP/opus-mt-zh-en is one of the most downloaded machine translation models on HuggingFace, and this model is considerably more accurate *and* a bit faster.
96
-
97
-
98
- ## Training Configuration
99
-
100
- ```yaml
101
- ### Vocab
102
- src_vocab_size: 20000
103
- tgt_vocab_size: 20000
104
- share_vocab: False
105
-
106
- data:
107
- corpus_1:
108
- path_src: hf://quickmt/quickmt-train-zh-en/zh
109
- path_tgt: hf://quickmt/quickmt-train-zh-en/en
110
- path_sco: hf://quickmt/quickmt-train-zh-en/sco
111
- valid:
112
- path_src: zh-en/dev.zho
113
- path_tgt: zh-en/dev.eng
114
-
115
- transforms: [sentencepiece, filtertoolong]
116
- transforms_configs:
117
- sentencepiece:
118
- src_subword_model: "zh-en/src.spm.model"
119
- tgt_subword_model: "zh-en/tgt.spm.model"
120
- filtertoolong:
121
- src_seq_length: 512
122
- tgt_seq_length: 512
123
-
124
- training:
125
- # Run configuration
126
- model_path: quickmt-zh-en
127
- keep_checkpoint: 4
128
- save_checkpoint_steps: 1000
129
- train_steps: 104000
130
- valid_steps: 1000
131
-
132
- # Train on a single GPU
133
- world_size: 1
134
- gpu_ranks: [0]
135
-
136
- # Batching
137
- batch_type: "tokens"
138
- batch_size: 13312
139
- valid_batch_size: 13312
140
- batch_size_multiple: 8
141
- accum_count: [4]
142
- accum_steps: [0]
143
-
144
- # Optimizer & Compute
145
- compute_dtype: "bfloat16"
146
- optim: "pagedadamw8bit"
147
- learning_rate: 1.0
148
- warmup_steps: 10000
149
- decay_method: "noam"
150
- adam_beta2: 0.998
151
-
152
- # Data loading
153
- bucket_size: 262144
154
- num_workers: 4
155
- prefetch_factor: 100
156
-
157
- # Hyperparams
158
- dropout_steps: [0]
159
- dropout: [0.1]
160
- attention_dropout: [0.1]
161
- max_grad_norm: 0
162
- label_smoothing: 0.1
163
- average_decay: 0.0001
164
- param_init_method: xavier_uniform
165
- normalization: "tokens"
166
-
167
- model:
168
- architecture: "transformer"
169
- layer_norm: standard
170
- share_embeddings: false
171
- share_decoder_embeddings: true
172
- add_ffnbias: true
173
- mlp_activation_fn: gated-silu
174
- add_estimator: false
175
- add_qkvbias: false
176
- norm_eps: 1e-6
177
- hidden_size: 1024
178
- encoder:
179
- layers: 8
180
- decoder:
181
- layers: 2
182
- heads: 16
183
- transformer_ff: 4096
184
- embeddings:
185
- word_vec_size: 1024
186
- position_encoding_type: "SinusoidalInterleaved"
187
- ```
 
21
  metrics:
22
  - name: BLEU
23
  type: bleu
24
+ value: 29.36
25
  - name: CHRF
26
  type: chrf
27
+ value: 58.10
28
  ---
29
 
30
 
31
  # `quickmt-zh-en` Neural Machine Translation Model
32
 
33
+ `quickmt-zh-en` is a reasonably fast and reasonably accurate neural machine translation model for translation from `zh` into `en`.
34
 
35
+
36
+ ## Model Information
37
+
38
+ * Trained using [`eole`](https://github.com/eole-nlp/eole)
39
+ * 200M parameter transformer 'big' with 8 encoder layers and 2 decoder layers
40
+ * Separate source and target SentencePiece tokenizers
41
+ * Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
42
+ * Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
43
+
44
+ See the `eole` model configuration in this repository for further details.
45
+
46
+
47
+ ## Usage with `quickmt`
48
+
49
+ First, install `quickmt` and download the model:
50
 
51
  ```bash
52
  git clone https://github.com/quickmt/quickmt.git
53
  pip install ./quickmt/
 
54
 
55
  quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
56
  ```
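If you prefer not to use the `quickmt-model-download` helper, the same model folder can presumably be fetched with `huggingface_hub` directly (a minimal sketch; it only assumes the `quickmt/quickmt-zh-en` repo id used above):

```python
# Alternative download sketch using huggingface_hub instead of quickmt-model-download
from huggingface_hub import snapshot_download

snapshot_download(repo_id="quickmt/quickmt-zh-en", local_dir="./quickmt-zh-en")
```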
57
 
58
  ```python
59
  from quickmt import Translator
60
 
 
69
  t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
70
  ```
71
 
72
+ The model is in `ctranslate2` format and the tokenizers are `sentencepiece`, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to use this model with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
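For example, a minimal sketch of direct `ctranslate2` + `sentencepiece` inference, assuming the model files were downloaded to `./quickmt-zh-en` as above (paths and decoding parameters are illustrative):

```python
# Minimal sketch: direct inference with ctranslate2 + sentencepiece, without quickmt.
# Assumes the repository was downloaded to ./quickmt-zh-en as shown above.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("./quickmt-zh-en", device="cpu")
src_sp = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/src.spm.model")
tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/tgt.spm.model")

src = "他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"
tokens = src_sp.encode(src, out_type=str)                  # tokenize with the source SentencePiece model
result = translator.translate_batch([tokens], beam_size=5)
print(tgt_sp.decode(result[0].hypotheses[0]))              # detokenize the best hypothesis
```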
73
 
74
 
75
  ## Metrics
76
 
77
+ BLEU and CHRF2 are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("zho_Hans"->"eng_Latn"). COMET22 is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) using `ctranslate2` on an RTX 4070s GPU with batch size 32, except for `madlad400-3b-mt`, which used a batch size of 1.
78
+
79
+ | Model | bleu | chrf2 | comet22 | Time (s) |
80
+ | -------------------------------- | ----- | ----- | ---- | ---- |
81
+ | quickmt/quickmt-zh-en | 29.36 | 58.10 | 0.8655 | 0.88 |
82
+ | Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.8426 | 3.78 |
83
+ | facebook/m2m100_418M | 15.99 | 50.13 | 0.7881 | 16.61 |
84
+ | facebook/nllb-200-distilled-600M | 26.22 | 55.18 | 0.8507 | 20.89 |
85
+ | facebook/m2m100_1.2B | 20.30 | 54.23 | 0.8206 | 33.12 |
86
+ | facebook/nllb-200-distilled-1.3B | 28.56 | 57.35 | 0.8620 | 36.64 |
87
+
88
+ Among the models evaluated, `quickmt-zh-en` is both the fastest *and* the highest quality.
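For reference, a hedged sketch of how such scores could be reproduced with the libraries named above; `srcs`, `hyps`, and `refs` are assumed to hold the Flores200 sources, the system outputs, and the references (they are not provided in this repository):

```python
# Sketch of the evaluation described above; srcs/hyps/refs are assumed inputs.
import sacrebleu
from comet import download_model, load_from_checkpoint

bleu = sacrebleu.corpus_bleu(hyps, [refs])   # BLEU
chrf = sacrebleu.corpus_chrf(hyps, [refs])   # chrF2 (beta=2 is sacrebleu's default)

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)],
    batch_size=32,
)
print(bleu.score, chrf.score, comet.system_score)
```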
 
eole-config.yaml ADDED
@@ -0,0 +1,99 @@
1
+ ## IO
2
+ save_data: zh_en/data_spm
3
+ overwrite: True
4
+ seed: 1234
5
+ report_every: 100
6
+ valid_metrics: ["BLEU"]
7
+ tensorboard: true
8
+ tensorboard_log_dir: tensorboard
9
+
10
+ ### Vocab
11
+ src_vocab: zh-en/src.eole.vocab
12
+ tgt_vocab: zh-en/tgt.eole.vocab
13
+ src_vocab_size: 32000
14
+ tgt_vocab_size: 32000
15
+ vocab_size_multiple: 8
16
+ share_vocab: False
17
+ n_sample: 0
18
+
19
+ data:
20
+ corpus_1:
21
+ path_src: hf://quickmt/quickmt-train-zh-en/zh
22
+ path_tgt: hf://quickmt/quickmt-train-zh-en/en
23
+ path_sco: hf://quickmt/quickmt-train-zh-en/sco
24
+ valid:
25
+ path_src: zh-en/dev.zho
26
+ path_tgt: zh-en/dev.eng
27
+
28
+ transforms: [sentencepiece, filtertoolong]
29
+ transforms_configs:
30
+ sentencepiece:
31
+ src_subword_model: "zh-en/src.spm.model"
32
+ tgt_subword_model: "zh-en/tgt.spm.model"
33
+ filtertoolong:
34
+ src_seq_length: 256
35
+ tgt_seq_length: 256
36
+
37
+ training:
38
+ # Run configuration
39
+ model_path: zh-en/model
40
+ keep_checkpoint: 4
41
+ save_checkpoint_steps: 2000
42
+ train_steps: 100000
43
+ valid_steps: 2000
44
+
45
+ # Train on a single GPU
46
+ world_size: 1
47
+ gpu_ranks: [0]
48
+
49
+ # Batching
50
+ batch_type: "tokens"
51
+ batch_size: 8192
52
+ valid_batch_size: 8192
53
+ batch_size_multiple: 8
54
+ accum_count: [16]
55
+ accum_steps: [0]
56
+
57
+ # Optimizer & Compute
58
+ compute_dtype: "bf16"
59
+ optim: "pagedadamw8bit"
60
+ learning_rate: 2.0
61
+ warmup_steps: 10000
62
+ decay_method: "noam"
63
+ adam_beta2: 0.998
64
+
65
+ # Data loading
66
+ bucket_size: 128000
67
+ num_workers: 4
68
+ prefetch_factor: 100
69
+
70
+ # Hyperparams
71
+ dropout_steps: [0]
72
+ dropout: [0.1]
73
+ attention_dropout: [0.1]
74
+ max_grad_norm: 2
75
+ label_smoothing: 0.1
76
+ average_decay: 0.0001
77
+ param_init_method: xavier_uniform
78
+ normalization: "tokens"
79
+
80
+ model:
81
+ architecture: "transformer"
82
+ layer_norm: standard
83
+ share_embeddings: false
84
+ share_decoder_embeddings: true
85
+ add_ffnbias: true
86
+ mlp_activation_fn: gelu
87
+ add_estimator: false
88
+ add_qkvbias: false
89
+ norm_eps: 1e-6
90
+ hidden_size: 1024
91
+ encoder:
92
+ layers: 8
93
+ decoder:
94
+ layers: 2
95
+ heads: 16
96
+ transformer_ff: 4096
97
+ embeddings:
98
+ word_vec_size: 1024
99
+ position_encoding_type: "SinusoidalInterleaved"
eole_model/config.json ADDED
@@ -0,0 +1,150 @@
1
+ {
2
+ "save_data": "zh_en/data_spm",
3
+ "src_vocab": "zh-en-benchmark/src.eole.vocab",
4
+ "report_every": 100,
5
+ "share_vocab": false,
6
+ "tgt_vocab": "zh-en-benchmark/tgt.eole.vocab",
7
+ "vocab_size_multiple": 8,
8
+ "tensorboard_log_dir_dated": "tensorboard/Feb-12_13-34-26",
9
+ "src_vocab_size": 32000,
10
+ "tensorboard": true,
11
+ "n_sample": 0,
12
+ "tgt_vocab_size": 32000,
13
+ "valid_metrics": [
14
+ "BLEU"
15
+ ],
16
+ "tensorboard_log_dir": "tensorboard",
17
+ "seed": 1234,
18
+ "overwrite": true,
19
+ "transforms": [
20
+ "sentencepiece",
21
+ "filtertoolong"
22
+ ],
23
+ "training": {
24
+ "accum_count": [
25
+ 16
26
+ ],
27
+ "train_steps": 100000,
28
+ "gpu_ranks": [
29
+ 0
30
+ ],
31
+ "save_checkpoint_steps": 2000,
32
+ "decay_method": "noam",
33
+ "bucket_size": 128000,
34
+ "world_size": 1,
35
+ "accum_steps": [
36
+ 0
37
+ ],
38
+ "optim": "pagedadamw8bit",
39
+ "prefetch_factor": 100,
40
+ "compute_dtype": "torch.bfloat16",
41
+ "normalization": "tokens",
42
+ "label_smoothing": 0.1,
43
+ "batch_size_multiple": 8,
44
+ "dropout_steps": [
45
+ 0
46
+ ],
47
+ "average_decay": 0.0001,
48
+ "dropout": [
49
+ 0.1
50
+ ],
51
+ "batch_type": "tokens",
52
+ "valid_batch_size": 8192,
53
+ "param_init_method": "xavier_uniform",
54
+ "adam_beta2": 0.998,
55
+ "model_path": "zh-en-benchmark/model",
56
+ "keep_checkpoint": 4,
57
+ "num_workers": 0,
58
+ "batch_size": 8192,
59
+ "attention_dropout": [
60
+ 0.1
61
+ ],
62
+ "warmup_steps": 10000,
63
+ "valid_steps": 2000,
64
+ "max_grad_norm": 2.0,
65
+ "learning_rate": 2.0
66
+ },
67
+ "data": {
68
+ "corpus_1": {
69
+ "path_align": null,
70
+ "path_src": "zh-en/train.ready.zh",
71
+ "path_tgt": "zh-en/train.ready.en",
72
+ "transforms": [
73
+ "sentencepiece",
74
+ "filtertoolong"
75
+ ]
76
+ },
77
+ "valid": {
78
+ "path_align": null,
79
+ "path_src": "zh-en-benchmark/dev.zho",
80
+ "path_tgt": "zh-en-benchmark/dev.eng",
81
+ "transforms": [
82
+ "sentencepiece",
83
+ "filtertoolong"
84
+ ]
85
+ }
86
+ },
87
+ "transforms_configs": {
88
+ "sentencepiece": {
89
+ "tgt_subword_model": "${MODEL_PATH}/tgt.spm.model",
90
+ "src_subword_model": "${MODEL_PATH}/src.spm.model"
91
+ },
92
+ "filtertoolong": {
93
+ "tgt_seq_length": 256,
94
+ "src_seq_length": 256
95
+ }
96
+ },
97
+ "model": {
98
+ "share_decoder_embeddings": true,
99
+ "position_encoding_type": "SinusoidalInterleaved",
100
+ "add_qkvbias": false,
101
+ "architecture": "transformer",
102
+ "add_ffnbias": true,
103
+ "hidden_size": 1024,
104
+ "transformer_ff": 4096,
105
+ "mlp_activation_fn": "gelu",
106
+ "norm_eps": 1e-06,
107
+ "layer_norm": "standard",
108
+ "heads": 16,
109
+ "add_estimator": false,
110
+ "share_embeddings": false,
111
+ "decoder": {
112
+ "heads": 16,
113
+ "decoder_type": "transformer",
114
+ "position_encoding_type": "SinusoidalInterleaved",
115
+ "add_qkvbias": false,
116
+ "layers": 2,
117
+ "add_ffnbias": true,
118
+ "hidden_size": 1024,
119
+ "n_positions": null,
120
+ "transformer_ff": 4096,
121
+ "rope_config": null,
122
+ "mlp_activation_fn": "gelu",
123
+ "norm_eps": 1e-06,
124
+ "layer_norm": "standard",
125
+ "tgt_word_vec_size": 1024
126
+ },
127
+ "embeddings": {
128
+ "word_vec_size": 1024,
129
+ "position_encoding_type": "SinusoidalInterleaved",
130
+ "tgt_word_vec_size": 1024,
131
+ "src_word_vec_size": 1024
132
+ },
133
+ "encoder": {
134
+ "heads": 16,
135
+ "position_encoding_type": "SinusoidalInterleaved",
136
+ "add_qkvbias": false,
137
+ "layers": 8,
138
+ "add_ffnbias": true,
139
+ "hidden_size": 1024,
140
+ "n_positions": null,
141
+ "src_word_vec_size": 1024,
142
+ "transformer_ff": 4096,
143
+ "rope_config": null,
144
+ "mlp_activation_fn": "gelu",
145
+ "norm_eps": 1e-06,
146
+ "layer_norm": "standard",
147
+ "encoder_type": "transformer"
148
+ }
149
+ }
150
+ }
eole_model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a6b78dad9fa4e560ce0abe0fc7c2b317ebe560fc23aa1897419253a1b334872d
3
+ size 820042008
eole_model/src.spm.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
3
+ size 733978
eole_model/tgt.spm.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c373f1d78753313b0dbc337058bf8450e1fdd6fe662a49e0941affce44ec14c5
3
+ size 800955
eole_model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:408f5484e3d983d52cabf867241f10e0159e4017b7cb05718fa580ab0f081b86
3
- size 444765910
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:494518d6282fcd01f850ab4ab096e6a5c937aa834290a62e5efd435275828c9d
3
+ size 409972810
source_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
src.spm.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:0631c1a3d400ac4f42c8d63fb94ae71c69ee00acd4648c05eb02d952e7f7d0ef
3
- size 538185
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:23d03d562fc3f8fe57e497dac0ece4827c254675a80c103fc4bb4040638ceb67
3
+ size 733978
target_vocabulary.json CHANGED
The diff for this file is too large to render. See raw diff
 
tgt.spm.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:83dcd0d44ad898117ae6c7fe24d996f186940d97265c4e91a78e3e07f657bc9e
3
- size 589008
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c373f1d78753313b0dbc337058bf8450e1fdd6fe662a49e0941affce44ec14c5
3
+ size 800955