ZhiyuanChen committed
Commit f99a197 · verified · 1 Parent(s): 67e1bb6

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,158 @@
---
language: rna
tags:
- Biology
- RNA
license: agpl-3.0
datasets:
- multimolecule/gencode
library_name: multimolecule
---

# SpliceAI

Convolutional neural network for predicting mRNA splicing from pre-mRNA sequences.

## Disclaimer

This is an UNOFFICIAL implementation of [Predicting Splicing from Primary Sequence with Deep Learning](https://doi.org/10.1016/j.cell.2018.12.015) by Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, et al.

The OFFICIAL repository of SpliceAI is at [Illumina/SpliceAI](https://github.com/Illumina/SpliceAI).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing SpliceAI did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

SpliceAI is a convolutional neural network (CNN) trained to predict mRNA splice site locations (acceptor and donor) from primary pre-mRNA sequences. The model was trained in a supervised manner on annotated splice junctions from human reference transcripts. It processes an input RNA sequence and, for each nucleotide, predicts the probability that it is a splice acceptor, a splice donor, or neither. This allows both the identification of canonical splice sites and the prediction of cryptic splice sites that may be activated or inactivated by sequence variants. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ---------- | ----------- | ------------------ | --------- | -------- |
| 16         | 32          | 3.48               | 70.39     | 35.11    |

### Links

- **Code**: [multimolecule.spliceai](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/spliceai)
- **Weights**: [multimolecule/spliceai](https://huggingface.co/multimolecule/spliceai)
- **Paper**: [Predicting Splicing from Primary Sequence with Deep Learning](https://doi.org/10.1016/j.cell.2018.12.015)
- **Developed by**: Kishore Jaganathan, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, Siavash Fazel Darbandi, David Knowles, Yang I. Li, Jack A. Kosmicki, Juan Arbelaez, Wenwu Cui, Grace B. Schwartz, Eric D. Chow, Efstathios Kanterakis, Hong Gao, Amirali Kia, Serafim Batzoglou, Stephan J. Sanders, Kyle Kai-How Farh
- **Original Repository**: [Illumina/SpliceAI](https://github.com/Illumina/SpliceAI)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model directly to predict the splice sites of an RNA sequence:

```python
>>> from multimolecule import RnaTokenizer, SpliceAiModel

>>> tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")
>>> model = SpliceAiModel.from_pretrained("multimolecule/spliceai")
>>> output = model(tokenizer("agcagucauuauggcgaa", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['logits'])

>>> output.logits.squeeze()
tensor([[ 8.5123, -4.9607, -7.6787],
        [ 8.6559, -4.4936, -8.6357],
        [ 5.8514, -1.9375, -6.8030],
        [ 7.3739, -5.3444, -5.2559],
        [ 8.6336, -5.3187, -7.5741],
        [ 6.1947, -1.5497, -7.6286],
        [ 9.0482, -6.1002, -7.1229],
        [ 7.9647, -5.6973, -6.5327],
        [ 8.8795, -6.3714, -7.0204],
        [ 7.9459, -5.4744, -6.0865],
        [ 8.4272, -5.2556, -7.9027],
        [ 7.7523, -5.8517, -6.9109],
        [ 7.3027, -4.6946, -5.9420],
        [ 8.1432, -4.3085, -7.7892],
        [ 7.9060, -4.9454, -7.0091],
        [ 8.9770, -5.3971, -7.3313],
        [ 8.4292, -5.7455, -6.7811],
        [ 8.2709, -6.1388, -6.6784]], grad_fn=<SqueezeBackward0>)
```
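Each row of the logits corresponds to one input nucleotide. Assuming the same channel ordering as the original SpliceAI output (neither, acceptor, donor), a softmax over the last dimension turns the logits into per-nucleotide probabilities. The sketch below continues the session above; the channel ordering and the 0.5 cutoff are assumptions for illustration only.

```python
>>> import torch

>>> # (batch, length, 3) logits -> (length, 3) per-nucleotide probabilities
>>> probs = torch.softmax(output.logits, dim=-1).squeeze(0)
>>> probs.shape
torch.Size([18, 3])

>>> # Assumed channel order: neither, acceptor, donor
>>> neither, acceptor, donor = probs.unbind(dim=-1)
>>> acceptor_sites = (acceptor > 0.5).nonzero(as_tuple=True)[0]  # illustrative cutoff
>>> donor_sites = (donor > 0.5).nonzero(as_tuple=True)[0]
```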

## Training Details

SpliceAI was trained to predict the locations of splice donor and acceptor sites from the primary pre-mRNA sequence.

### Training Data

The SpliceAI model was trained on human reference transcripts obtained from [GENCODE](https://multimolecule.danling.org/datasets/gencode) (release 24, GRCh38). This dataset comprises both protein-coding and non-protein-coding transcripts.

For training, a sequence window of 10,000 base pairs (bp), 5,000 bp upstream and 5,000 bp downstream, was used for each nucleotide whose splicing status was to be predicted. Sequences near transcript ends were padded with 'N' (unknown nucleotide) characters to maintain a consistent input length. Annotated splice donor and acceptor sites from GENCODE served as positive labels for their respective classes, and all other intronic and exonic positions within these transcripts were treated as negative (non-splice-site) labels.
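As an illustration of this windowing, the sketch below extracts 5,000 nt of flanking context on each side of a target position and pads with `N` where the transcript ends early. The function name and record layout are hypothetical, not part of the SpliceAI or MultiMolecule code.

```python
FLANK = 5_000  # flanking context on each side of the position being classified


def context_window(transcript: str, position: int, flank: int = FLANK) -> str:
    """Return the nucleotide at `position` with `flank` bases of context on
    each side, padding with 'N' wherever the transcript ends early."""
    start, end = position - flank, position + flank + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(transcript))
    return left_pad + transcript[max(0, start):min(len(transcript), end)] + right_pad


window = context_window("ACGUACGUACGU", position=3)
assert len(window) == 2 * FLANK + 1  # target nucleotide plus 5,000 nt per side
```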

The data was partitioned by chromosome:

- Chromosomes 1-19, X, and Y were designated for the training set.
- Chromosome 20 was reserved as a test set.
- A validation set, comprising 5% of transcripts from each training chromosome, was used for model selection and to monitor for overfitting.

Positions within 50 bp of a masked interval (an interval of >10 'N's) or within 50 bp of a transcript end were excluded from the training and validation datasets.
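A minimal sketch of this chromosome-level split follows; the record layout and the `TRAIN_CHROMS`/`TEST_CHROMS` names are hypothetical and only illustrate the partitioning described above, with a random 5% of training-chromosome transcripts held out for validation.

```python
import random

TRAIN_CHROMS = {f"chr{i}" for i in range(1, 20)} | {"chrX", "chrY"}
TEST_CHROMS = {"chr20"}


def partition(transcripts, val_fraction=0.05, seed=0):
    """Split (chromosome, transcript) records into train/validation/test sets."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for chrom, transcript in transcripts:
        if chrom in TEST_CHROMS:
            test.append(transcript)
        elif chrom in TRAIN_CHROMS:
            (val if rng.random() < val_fraction else train).append(transcript)
    return train, val, test
```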

To address class imbalance, training examples were weighted such that the total loss contribution from positive examples (acceptor or donor sites) equaled that from negative examples (non-splice sites). Within the positive examples, acceptor and donor sites were weighted equally.
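A minimal sketch of such a weighting scheme is shown below, assuming per-nucleotide labels encoded as 0 = neither, 1 = acceptor, 2 = donor; the encoding, function names, and use of a weighted cross-entropy are illustrative assumptions, not the MultiMolecule implementation.

```python
import torch
import torch.nn.functional as F


def class_weights(labels: torch.Tensor) -> torch.Tensor:
    """Per-class weights so that positives (acceptors + donors) contribute as
    much total loss as negatives, with acceptors and donors weighted equally."""
    n_total = float(labels.numel())
    n_neither = float((labels == 0).sum().clamp(min=1))
    n_acceptor = float((labels == 1).sum().clamp(min=1))
    n_donor = float((labels == 2).sum().clamp(min=1))
    return torch.tensor([
        0.50 * n_total / n_neither,   # neither
        0.25 * n_total / n_acceptor,  # acceptor
        0.25 * n_total / n_donor,     # donor
    ])


def weighted_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy over (batch, length, 3) logits and (batch, length) labels."""
    weight = class_weights(labels).to(logits.device)
    return F.cross_entropy(logits.reshape(-1, 3), labels.reshape(-1), weight=weight)
```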

### Training Procedure

#### Pre-training

The model was trained to minimize a cross-entropy loss, comparing its predicted splice site probabilities against the ground-truth labels from GENCODE, with the following hyperparameters (a PyTorch-style sketch of this setup follows the list):

- Batch size: 64
- Epochs: 4
- Optimizer: Adam
- Learning rate: 1e-3
- Learning rate scheduler: Exponential
- Minimum learning rate: 1e-5
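The list above does not state the decay factor or whether the exponential decay is applied per step or per epoch, so the loop below is only a sketch under stated assumptions (per-epoch decay with an assumed factor of 0.5, clamped at the stated minimum). It reuses the hypothetical `weighted_loss` from the Training Data section and a hypothetical `train_loader` yielding batches of 64 windows.

```python
import torch
from multimolecule import SpliceAiModel

model = SpliceAiModel.from_pretrained("multimolecule/spliceai")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # assumed decay factor

for epoch in range(4):  # 4 epochs, per the list above
    for batch in train_loader:  # hypothetical DataLoader with "input_ids"/"labels"
        optimizer.zero_grad()
        logits = model(batch["input_ids"]).logits
        loss = weighted_loss(logits, batch["labels"])
        loss.backward()
        optimizer.step()
    if scheduler.get_last_lr()[0] > 1e-5:  # never decay below the stated minimum
        scheduler.step()
```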

## Citation

**BibTeX**:

```bibtex
@article{jaganathan2019the,
  abstract = {The splicing of pre-mRNAs into mature transcripts is remarkable for its precision, but the mechanisms by which the cellular machinery achieves such specificity are incompletely understood. Here, we describe a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing. Synonymous and intronic mutations with predicted splice-altering consequence validate at a high rate on RNA-seq and are strongly deleterious in the human population. De novo mutations with predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and validate against RNA-seq in 21 out of 28 of these patients. We estimate that 9\%-11\% of pathogenic mutations in patients with rare genetic disorders are caused by this previously underappreciated class of disease variation.},
  author = {Jaganathan, Kishore and Kyriazopoulou Panagiotopoulou, Sofia and McRae, Jeremy F and Darbandi, Siavash Fazel and Knowles, David and Li, Yang I and Kosmicki, Jack A and Arbelaez, Juan and Cui, Wenwu and Schwartz, Grace B and Chow, Eric D and Kanterakis, Efstathios and Gao, Hong and Kia, Amirali and Batzoglou, Serafim and Sanders, Stephan J and Farh, Kyle Kai-How},
  copyright = {http://www.elsevier.com/open-access/userlicense/1.0/},
  journal = {Cell},
  keywords = {artificial intelligence; deep learning; genetics; splicing},
  language = {en},
  month = jan,
  number = 3,
  pages = {535--548.e24},
  publisher = {Elsevier BV},
  title = {Predicting splicing from primary sequence with deep learning},
  volume = 176,
  year = 2019
}
```

## Contact

Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [SpliceAI paper](https://doi.org/10.1016/j.cell.2018.12.015) for questions or comments on the paper/model.

## License

This model is licensed under the [AGPL-3.0 License](https://www.gnu.org/licenses/agpl-3.0.html) and the [CC-BY-NC-4.0 License](https://creativecommons.org/licenses/by-nc/4.0/).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later AND CC-BY-NC-4.0
```
config.json ADDED
@@ -0,0 +1,47 @@
{
  "architectures": [
    "SpliceAiModel"
  ],
  "batch_norm_eps": 0.001,
  "batch_norm_momentum": 0.01,
  "bos_token_id": 1,
  "context": 10000,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_size": 32,
  "id2label": null,
  "label2id": null,
  "mask_token_id": 4,
  "model_type": "spliceai",
  "null_token_id": 5,
  "num_labels": 3,
  "output_contexts": false,
  "pad_token_id": 0,
  "stages": [
    {
      "dilation": 1,
      "kernel_size": 11,
      "num_blocks": 4
    },
    {
      "dilation": 4,
      "kernel_size": 11,
      "num_blocks": 4
    },
    {
      "dilation": 10,
      "kernel_size": 21,
      "num_blocks": 4
    },
    {
      "dilation": 25,
      "kernel_size": 41,
      "num_blocks": 4
    }
  ],
  "torch_dtype": "float32",
  "transformers_version": "4.50.0",
  "unk_token_id": 3,
  "vocab_size": 4
}
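The `stages` list determines how much flanking sequence the model can see. Assuming each residual block contains two dilated convolutions, as in the original SpliceAI architecture, the total flanking context implied by these stages (summed over both sides) matches the `context` field above; a quick check is sketched below.

```python
# Quick check: receptive-field growth implied by "stages", assuming two
# dilated convolutions per residual block (as in the original SpliceAI).
stages = [
    {"dilation": 1, "kernel_size": 11, "num_blocks": 4},
    {"dilation": 4, "kernel_size": 11, "num_blocks": 4},
    {"dilation": 10, "kernel_size": 21, "num_blocks": 4},
    {"dilation": 25, "kernel_size": 41, "num_blocks": 4},
]

context = sum(
    stage["num_blocks"] * 2 * (stage["kernel_size"] - 1) * stage["dilation"]
    for stage in stages
)
print(context)  # 10000, matching the "context" value in config.json
```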
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c92fac0d306016a2302e14b0d3b6b79bd9931aee05cd2bb2b3406b0cfd86a8e8
size 14114652
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d9baa9e2b9308c5f590734ccd6ef7a71584cfe33aff0da5c1f2213dacbe77981
size 14367008
special_tokens_map.json ADDED
@@ -0,0 +1,4 @@
{
  "pad_token": "N",
  "unk_token": "N"
}
tokenizer_config.json ADDED
@@ -0,0 +1,27 @@
{
  "added_tokens_decoder": {
    "4": {
      "content": "N",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [],
  "bos_token": null,
  "clean_up_tokenization_spaces": true,
  "cls_token": null,
  "codon": false,
  "eos_token": null,
  "extra_special_tokens": {},
  "mask_token": null,
  "model_max_length": 1000000000000000019884624838656,
  "nmers": 1,
  "pad_token": "N",
  "replace_T_with_U": true,
  "sep_token": null,
  "tokenizer_class": "RnaTokenizer",
  "unk_token": "N"
}
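Since `replace_T_with_U` is enabled and unknown characters fall back to the `N` token, DNA-style input should tokenize to the same ids as the equivalent RNA sequence; the check below is a sketch of that expected behaviour rather than a documented guarantee.

```python
from multimolecule import RnaTokenizer

tokenizer = RnaTokenizer.from_pretrained("multimolecule/spliceai")

rna_ids = tokenizer("acguacgu")["input_ids"]
dna_ids = tokenizer("acgtacgt")["input_ids"]  # T is expected to be mapped to U
assert rna_ids == dna_ids
```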
vocab.txt ADDED
@@ -0,0 +1,5 @@
A
C
G
U
N