z-coder committed on
Commit
773c9ba
·
verified ·
1 Parent(s): aa074e1

Upload 14 files

README.md CHANGED
@@ -1,3 +1,486 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ library_name: transformers
3
+ pipeline_tag: image-text-to-text
4
+ license: mit
5
+ ---
6
+
7
+ # Model Card for Magma-8B
8
+
9
+ <!-- Provide a quick summary of what the model is/does. -->
10
+
11
+ <div align="center">
12
+ <h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
13
+
14
+ [Jianwei Yang](https://jwyang.github.io/)<sup>*</sup><sup>1</sup><sup>†</sup>&nbsp;
15
+ [Reuben Tan](https://cs-people.bu.edu/rxtan/)<sup>1</sup><sup>†</sup>&nbsp;
16
+ [Qianhui Wu](https://qianhuiwu.github.io/)<sup>1</sup><sup>†</sup>&nbsp;
17
+ [Ruijie Zheng](https://ruijiezheng.com/)<sup>2</sup><sup>‡</sup>&nbsp;
18
+ [Baolin Peng](https://scholar.google.com/citations?user=u1CNjgwAAAAJ&hl=en&oi=ao)<sup>1</sup><sup>‡</sup>&nbsp;
19
+ [Yongyuan Liang](https://cheryyunl.github.io)<sup>2</sup><sup>‡</sup>
20
+
21
+ [Yu Gu](http://yu-gu.me/)<sup>1</sup>&nbsp;
22
+ [Mu Cai](https://pages.cs.wisc.edu/~mucai/)<sup>3</sup>&nbsp;
23
+ [Seonghyeon Ye](https://seonghyeonye.github.io/)<sup>4</sup>&nbsp;
24
+ [Joel Jang](https://joeljang.github.io/)<sup>5</sup>&nbsp;
25
+ [Yuquan Deng](https://scholar.google.com/citations?user=LTC0Q6YAAAAJ&hl=en)<sup>5</sup>&nbsp;
26
+ [Lars Liden](https://sites.google.com/site/larsliden)<sup>1</sup>&nbsp;
27
+ [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)<sup>1</sup><sup>▽</sup>
28
+
29
+ <sup>1</sup> Microsoft Research; <sup>2</sup> University of Maryland; <sup>3</sup> University of Wisconsin-Madison
30
+ <sup>4</sup> KAIST; <sup>5</sup> University of Washington
31
+
32
+ <sup>*</sup> Project lead <sup>†</sup> First authors <sup>‡</sup> Second authors <sup>▽</sup> Leadership
33
+
34
+ \[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Hugging Face Paper](https://huggingface.co/papers/2502.13130)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Video](https://www.youtube.com/watch?v=SbfzvUU5yM8)\]
35
+
36
+ </div>
37
+
38
+ ## Agents
39
+
40
+ ### UI Navigation
41
+ <div align="center">
42
+ <div align="center" style="display: inline-block; width: 48%;">
43
+ <video autoplay muted loop controls playsinline style="margin-bottom: 2px;">
44
+ <source src="https://microsoft.github.io/Magma/static/videos/ui_weather_and_flight_mode.mp4" type="video/mp4">
45
+ </video>
46
+ <p class="is-5 has-text-centered" style="font-size: 14px;">What's the weather in Seattle? & turn on flight mode</p>
47
+ </div>
48
+ <div align="center" style="display: inline-block; width: 48%;">
49
+ <video autoplay muted loop controls playsinline style="margin-bottom: 2px;">
50
+ <source src="https://microsoft.github.io/Magma/static/videos/ui_wordle.mp4" type="video/mp4">
51
+ </video>
52
+ <p class="is-5 has-text-centered" style="font-size: 14px;">Share and message this to Bob Steve. Click send button</p>
53
+ </div>
54
+ </div>
55
+
56
+ ### Robot Manipulation
57
+ <div align="center">
58
+ <div align="center">
59
+ <div style="display: flex; justify-content: space-between; gap: 1%;">
60
+ <div style="width: 32%;">
61
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden; margin-bottom: 5px;">
62
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_hotdog.mp4" type="video/mp4">
63
+ </video>
64
+ </div>
65
+ <div style="width: 32%;">
66
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden; margin-bottom: 5px;">
67
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_mushroom.mp4" type="video/mp4">
68
+ </video>
69
+ </div>
70
+ <div style="width: 32%;">
71
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden; margin-bottom: 5px;">
72
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_left.mp4" type="video/mp4">
73
+ </video>
74
+ </div>
75
+ </div>
76
+ </div>
77
+ <div align="center">
78
+ <div style="display: flex; justify-content: space-between; gap: 1%;">
79
+ <div style="width: 32%;">
80
+ <p style="text-align: center;font-size: 14px;margin-top: 0;">Pick Place Hotdog Sausage</p>
81
+ </div>
82
+ <div style="width: 32%;">
83
+ <p style="text-align: center;font-size: 14px;margin-top: 0;">Put Mushroom Place Pot</p>
84
+ </div>
85
+ <div style="width: 32%;">
86
+ <p style="text-align: center;font-size: 14px;margin-top: 0;">Push Cloth Left to Right (Out-of-Dist.)</p>
87
+ </div>
88
+ </div>
89
+ </div>
90
+ </div>
91
+
92
+ ### Gaming
93
+
94
+ Task: The model controls the robot to collect green blocks.
95
+
96
+ <div align="center">
97
+ <div align="center" style="display: inline-block; width: 48%;">
98
+ <video autoplay muted loop controls playsinline style="margin-bottom: 2px;">
99
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_vs_llava.mp4" type="video/mp4">
100
+ </video>
101
+ <p class="is-5 has-text-centered" style="font-size: 14px;">Magma vs. LLaVA-OneVision</p>
102
+ </div>
103
+ <div align="center" style="display: inline-block; width: 48%;">
104
+ <video autoplay muted loop controls playsinline style="margin-bottom: 2px;">
105
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_vs_gpt4omini.mp4" type="video/mp4">
106
+ </video>
107
+ <p class="is-5 has-text-centered" style="font-size: 14px;">Magma vs. GPT-4o-mini</p>
108
+ </div>
109
+ </div>
110
+
111
+ ## Model Details
112
+
113
+ <div align="center">
114
+ <img src="https://github.com/microsoft/Magma/blob/main/assets/images/magma_teaser.png?raw=true" width="100%">
115
+ </div>
116
+
117
+ ### Model Description
118
+
119
+ <!-- Provide a longer summary of what this model is. -->
120
+
121
+ Magma is a multimodal agentic AI model that generates text based on input text and images. The model is designed for research purposes and aimed at knowledge sharing and accelerating research in multimodal AI, in particular multimodal agentic AI. Its main innovation lies in two techniques, **Set-of-Mark** and **Trace-of-Mark**, together with leveraging a **large amount of unlabeled video data** to learn spatial-temporal grounding and planning. Please refer to our paper for more technical details.
122
+
123
+ ### Highlights
124
+ * **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
125
+ * **Versatile Capabilities:** As a single model, Magma not only possesses generic image and video understanding ability, but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
126
+ * **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, and generic image and video understanding, in particular spatial understanding and reasoning!
127
+ * **Scalable Pretraining Strategy:** Magma is designed to be **trained scalably from unlabeled videos** in the wild in addition to existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!
128
+
129
+
130
+ ## License
131
+
132
+ The model is developed by Microsoft and is funded by Microsoft Research. The model is shared by Microsoft Research and is licensed under the MIT License.
133
+
134
+ <!-- {{ model_description | default("", true) }}
135
+
136
+ - **Developed by:** {{ developers | default("[More Information Needed]", true)}}
137
+ - **Funded by [optional]:** {{ funded_by | default("[More Information Needed]", true)}}
138
+ - **Shared by [optional]:** {{ shared_by | default("[More Information Needed]", true)}}
139
+ - **Model type:** {{ model_type | default("[More Information Needed]", true)}}
140
+ - **Language(s) (NLP):** {{ language | default("[More Information Needed]", true)}}
141
+ - **License:** {{ license | default("[More Information Needed]", true)}}
142
+ - **Finetuned from model [optional]:** {{ base_model | default("[More Information Needed]", true)}} -->
143
+
144
+ ## How to Get Started with the Model
145
+
146
+ <!-- {{ get_started_code | default("[More Information Needed]", true)}} -->
147
+
148
+ To get started with the model, first make sure that `transformers` and `torch` are installed, then install the following dependencies:
149
+
150
+ ```bash
151
+ pip install torchvision Pillow open_clip_torch
152
+ ```
153
+
154
+ ⚠️ Please note that you need to install our customized transformers lib:
155
+ ```bash
156
+ pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2
157
+ ```
158
+ See [here](https://github.com/microsoft/Magma?tab=readme-ov-file#installation) for the reason why you need this.
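+
+ If you want to double-check that the customized build is actually the one active in your environment, a quick sanity check (a minimal sketch, nothing Magma-specific) is to print the installed `transformers` version and install location before running inference:
+
+ ```python
+ # Sanity check: confirm which transformers build will be imported.
+ import transformers
+
+ print(transformers.__version__)  # the custom branch is based on v4.48.2
+ print(transformers.__file__)     # install path of the package that will be used
+ ```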
159
+
160
+ Then you can run the following code:
161
+
162
+ ```python
163
+ import torch
164
+ from PIL import Image
165
+ from io import BytesIO
166
+ import requests
167
+
168
+ from transformers import AutoModelForCausalLM, AutoProcessor
169
+
170
+ # Load the model and processor
171
+ dtype = torch.bfloat16
172
+ model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
173
+ processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
174
+ model.to("cuda")
175
+
176
+ # Inference
177
+ url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
178
+ image = Image.open(BytesIO(requests.get(url, stream=True).content))
179
+ image = image.convert("RGB")
180
+
181
+ convs = [
182
+     {"role": "system", "content": "You are agent that can see, talk and act."},
183
+     {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
184
+ ]
185
+ prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
186
+ inputs = processor(images=[image], texts=prompt, return_tensors="pt")
187
+ # add the batch dimension expected by the model
+ inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
188
+ inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
189
+ inputs = inputs.to("cuda").to(dtype)
190
+
191
+ generation_args = {
192
+     "max_new_tokens": 128,
193
+     "temperature": 0.0,
194
+     "do_sample": False,
195
+     "use_cache": True,
196
+     "num_beams": 1,
197
+ }
198
+
199
+ with torch.inference_mode():
200
+     generate_ids = model.generate(**inputs, **generation_args)
201
+
202
+ generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
203
+ response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
204
+ print(response)
205
+ ```
206
+
207
+ ## Training Details
208
+
209
+ ### Training Data
210
+
211
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
212
+
213
+ <!-- {{ training_data | default("[More Information Needed]", true)}} -->
214
+
215
+ Our training data consists of:
216
+
217
+ * Generic Image SFT Data: [LLaVA-Next](https://llava-vl.github.io/blog/2024-01-30-llava-next/), [InfographicVQA](https://www.docvqa.org/datasets/infographicvqa), [ChartQA_Augmented](https://github.com/vis-nlp/ChartQA), [FigureQA](https://www.microsoft.com/en-us/research/project/figureqa-dataset/), [TQA](https://paperswithcode.com/dataset/tqa), [ScienceQA](https://scienceqa.github.io/).
218
+
219
+ * Generic Video SFT Data: [ShareGPT4Video](https://sharegpt4video.github.io/) and [LLaVA-Video](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K).
220
+
221
+ * Instructional Video Data: [Ego4D](https://ego4d-data.org/), [Something-Something v2](https://www.qualcomm.com/developer/software/something-something-v-2-dataset), [EPIC-Kitchens](https://epic-kitchens.github.io/2025) and other related instructional videos.
222
+
223
+ * Robotics Manipulation Data: [Open-X-Embodiment](https://robotics-transformer-x.github.io/).
224
+
225
+ * UI Grounding Data: [SeeClick](https://github.com/njucckevin/SeeClick).
226
+
227
+ * UI Navigation Data: [Mind2web](https://osu-nlp-group.github.io/Mind2Web/) and [AITW](https://github.com/google-research/google-research/tree/master/android_in_the_wild).
228
+
229
+ The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.
230
+
231
+ More details can be found in our paper.
232
+
233
+ [Microsoft Privacy Notice](https://go.microsoft.com/fwlink/?LinkId=521839)
234
+
235
+ ### Training Procedure
236
+
237
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
238
+
239
+ #### Preprocessing
240
+
241
+ <!-- {{ preprocessing | default("[More Information Needed]", true)}} -->
242
+ In addition to the text-related preprocessing, we mainly undertake the following image and video preprocessing steps:
243
+
244
+ * UI Grounding and Navigation Data: For each UI screenshot, we extract the bounding boxes for the UI elements, and apply [Set-of-Mark Prompting](https://arxiv.org/abs/2310.11441) to overlay numeric marks on the raw image. The model is trained to generate the UI grounding text based on the image and the Set-of-Mark prompts.
245
+
246
+ * Instructional Video Data: For each video clip, we apply [Co-Tracker](https://co-tracker.github.io/) to extract grid traces and then apply a filtering algorithm to remove noisy or static points (a minimal filtering sketch is shown at the end of this subsection). For videos with camera motion, we further apply a homography transformation to stabilize the video clips. Finally, we assign a numeric mark to each trace, which gives us a set of trace-of-marks. The model is trained to generate the trace-of-mark given the video clips and instructional text.
247
+
248
+ * Robotics Manipulation Data: For robotics data in Open-X Embodiment, we extract the 7 DoF robot gripper state and also extract the trace-of-mark from the video clips. Similar filtering and stabilization steps are applied to the video clips. The model is trained to generate the robot manipulation action as well as the trace-of-mark given the video clips and instructional text.
249
+
250
+ After all these preprocessing steps, we combine the results with existing text annotations to form our final multimodal training data. We refer to our paper for more technical details.
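+
+ The following is a minimal, hypothetical sketch of the static-point filtering mentioned in the instructional-video step above (not the exact pipeline code). It assumes point tracks shaped `(num_points, num_frames, 2)`, as produced by a point tracker such as Co-Tracker, and drops tracks whose total motion falls below a threshold:
+
+ ```python
+ import numpy as np
+
+ def filter_static_traces(tracks: np.ndarray, min_displacement: float = 2.0) -> np.ndarray:
+     """Keep only the point tracks that actually move.
+
+     `tracks` is assumed to have shape (num_points, num_frames, 2) holding
+     (x, y) pixel coordinates per frame; `min_displacement` is a hypothetical
+     threshold on the total path length, in pixels.
+     """
+     # per-frame step length of each track: (num_points, num_frames - 1)
+     steps = np.linalg.norm(np.diff(tracks, axis=1), axis=-1)
+     # total path length travelled by each tracked point
+     total_motion = steps.sum(axis=1)
+     # discard near-static points; the survivors become trace-of-mark candidates
+     return tracks[total_motion >= min_displacement]
+
+ # toy example: 8 random tracks over 16 frames
+ tracks = np.random.rand(8, 16, 2) * 4
+ print(filter_static_traces(tracks).shape)
+ ```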
251
+
252
+ #### Training Hyperparameters
253
+
254
+ <!-- - **Training regime:** {{ training_regime | default("[More Information Needed]", true)}} fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
255
+
256
+ We used bf16 mixed precision for training on H100s and MI300s. We used the following hyperparameters for training:
257
+
258
+ * Batch size: 1024
259
+ * Learning rate: 1e-5
260
+ * Max sequence length: 4096
261
+ * Resolution: up to 1024x1024 for images, 512x512 for video frames.
262
+ * Pretraining Epochs: 3
263
+
264
+
265
+ ## Evaluation
266
+
267
+ <!-- This section describes the evaluation protocols and provides the results. -->
268
+ We evaluate the model in a zero-shot manner on a wide range of tasks, mostly agent-related.
269
+
270
+ ### Testing Data, Factors & Metrics
271
+ <!-- This should link to a Dataset Card if possible. -->
272
+
273
+ <!-- {{ testing_data | default("[More Information Needed]", true)}} -->
274
+
275
+ <!-- #### Factors
276
+
277
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
278
+
279
+ <!-- {{ testing_factors | default("[More Information Needed]", true)}} -->
280
+
281
+ #### Zero-shot Testing Data
282
+
283
+ We evaluate the model's zero-shot performance on the following datasets:
284
+
285
+ * UI Grounding: [ScreenSpot](https://huggingface.co/datasets/rootsautomation/ScreenSpot) and [VisualWebArena](https://jykoh.com/vwa).
286
+
287
+ * Robotics Manipulation: [SimplerEnv](https://github.com/simpler-env/SimplerEnv) and WidowX real robot.
288
+
289
+ * Spatial Understanding and Reasoning: [VSR](https://github.com/cambridgeltl/visual-spatial-reasoning), [BLINK](https://zeyofu.github.io/blink/) and [SpatialEval](https://spatialeval.github.io/).
290
+
291
+
292
+
293
+ #### Finetuned Testing Data
294
+
295
+ We evaluate the model's performance after finetuning on the following datasets:
296
+
297
+ * UI Navigation: [Mind2Web](https://osu-nlp-group.github.io/Mind2Web/) and [AITW](https://github.com/google-research/google-research/tree/master/android_in_the_wild).
298
+
299
+ * Robotics Manipulation: [SimplerEnv](https://github.com/simpler-env/SimplerEnv) and WidowX real robot.
300
+
301
+ * Multimodal Image Understanding and Reasoning: [VQAv2](https://visualqa.org/), [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html), [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), [POPE](https://huggingface.co/datasets/lmms-lab/POPE), [TextVQA](https://textvqa.org/), [ChartQA](https://github.com/vis-nlp/ChartQA), [DocVQA](https://www.docvqa.org/).
302
+
303
+ * Multimodal Video Understanding and Reasoning: [Next-QA](https://github.com/doc-doc/NExT-QA), [VideoMME](https://video-mme.github.io/home_page.html), [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench).
304
+
305
+ #### Metrics
306
+ <!-- {{ testing_metrics | default("[More Information Needed]", true)}} -->
307
+
308
+ We follow each dataset's own evaluation metrics. Please refer to the original datasets for more details.
309
+
310
+
311
+ ### Results on Agentic Intelligence
312
+
313
+ Zero-shot evaluation on agentic intelligence. We report results for the pretrained Magma model without any domain-specific finetuning. Magma is the only model that can cover the full task spectrum.
314
+
315
+ | Model | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
316
+ |-----------------------|------|--------|------|----------|-----------|------|----------|----------|---------------|-----------|
317
+ | GPT-4V | 77.2 | 78.0 | n/a | 23.6 | 16.0 | 9.0 | 67.5 | 75.7 | - | - |
318
+ | GPT-4V-OmniParser | n/a | n/a | n/a | 71.1 | 45.6 | 58.5 | - | - | - | - |
319
+ | LLaVA-1.5 | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
320
+ | LLaVA-Next | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
321
+ | Qwen-VL | 78.8 | 63.8 | n/a | 6.2 | 6.3 | 3.0 | 14.0 | 0.7 | - | - |
322
+ | Qwen-VL-Chat | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
323
+ | Fuyu | 74.2 | n/a | n/a | 21.2 | 20.8 | 19.2 | 19.4 | 15.5 | - | - |
324
+ | SeeClick | - | - | - | 65.0 | 51.1 | 44.1 | 9.9 | 1.9 | - | - |
325
+ | Octo | - | - | - | - | - | - | - | - | - | - |
326
+ | RT-1-X | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
327
+ | OpenVLA | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
328
+ | Magma-8B | 80.0 | 66.5 | 87.4 | 59.5 | 64.1 | 60.6 | 96.3 | 71.8 | 52.3 | 35.4 |
329
+
330
+ *Notes: SS - ScreenSpot, VWB - VisualWebArena, SE - SimplerEnv*
331
+ <!-- {{ results | default("[More Information Needed]", true)}} -->
332
+
333
+ <!-- {{ results_summary | default("", true) }} -->
334
+
335
+
336
+ ## Technical Specifications
337
+
338
+
339
+ ### Model Architecture and Objective
340
+
341
+ <!-- {{ model_specs | default("[More Information Needed]", true)}} -->
342
+
343
+ * Language Model: We use [Meta Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the backbone LLM.
344
+ * Vision Encoder: We use [CLIP-ConvNeXt-XXLarge](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg), trained by the LAION team, as the vision encoder to tokenize the images and videos.
345
+
346
+ The overall pipeline follows common practice in multimodal LLMs: the vision encoder tokenizes images and videos, and the resulting visual tokens are fed into the LLM together with the textual tokens to generate text outputs.
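+
+ The toy sketch below illustrates that wiring under simplified assumptions (tiny dimensions, a plain transformer encoder standing in for the decoder-only LLM, and a two-layer GELU MLP standing in for the `mlp2x_gelu` projector named in `config.json`). It is a conceptual illustration only, not Magma's actual implementation:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ # toy dimensions for illustration; the real model uses hidden_size 4096 and mm_hidden_size 3072
+ VIS_DIM, HIDDEN, VOCAB = 64, 128, 1000
+
+ class ToyMultimodalLM(nn.Module):
+     """Minimal wiring: visual tokens are projected, concatenated with text tokens, then decoded."""
+
+     def __init__(self):
+         super().__init__()
+         # stand-in for the mlp2x_gelu projector (vision feature dim -> LLM hidden dim)
+         self.projector = nn.Sequential(nn.Linear(VIS_DIM, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, HIDDEN))
+         self.embed = nn.Embedding(VOCAB, HIDDEN)  # text token embeddings
+         layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
+         self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
+         self.lm_head = nn.Linear(HIDDEN, VOCAB)
+
+     def forward(self, vis_feats: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
+         vis_tokens = self.projector(vis_feats)            # (B, N_vis, HIDDEN)
+         txt_tokens = self.embed(text_ids)                 # (B, N_txt, HIDDEN)
+         seq = torch.cat([vis_tokens, txt_tokens], dim=1)  # visual tokens first, then text
+         return self.lm_head(self.backbone(seq))           # next-token logits over the joint sequence
+
+ model = ToyMultimodalLM()
+ logits = model(torch.randn(1, 16, VIS_DIM), torch.randint(0, VOCAB, (1, 8)))
+ print(logits.shape)  # torch.Size([1, 24, 1000])
+ ```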
347
+
348
+
349
+ ### Compute Infrastructure
350
+ <!-- {{ compute_infrastructure | default("[More Information Needed]", true)}} -->
351
+
352
+ We used [Azure ML](https://azure.microsoft.com/en-us/products/machine-learning) for our model training.
353
+
354
+
355
+ #### Hardware
356
+ <!-- {{ hardware_requirements | default("[More Information Needed]", true)}} -->
357
+
358
+ Our model is trained on two types of GPUs:
359
+
360
+ * Nvidia H100
361
+ * AMD MI300
362
+
363
+
364
+
365
+ #### Software
366
+ <!-- {{ software | default("[More Information Needed]", true)}} -->
367
+
368
+ Our model is built on top of:
369
+
370
+ * [Pytorch](https://pytorch.org/)
371
+ * [Transformers](https://huggingface.co/transformers/)
372
+ * [TorchVision](https://pytorch.org/vision/stable/index.html)
373
+ * [DeepSpeed](https://www.deepspeed.ai/)
374
+ * [FlashAttention](https://github.com/HazyResearch/flash-attention)
375
+
376
+
377
+ ## Intended Uses
378
+
379
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
380
+
381
+ This model is intended for broad research use in English. It is designed only for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI, particularly in multimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
382
+
383
+ ### Direct Use
384
+
385
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
386
+
387
+ The model takes images and text as inputs and produces textual outputs for the following uses:
388
+
389
+ * **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image.
390
+
391
+ * **Visual Planning Capabilities:** The model can also produce a visual trace as a future plan for accomplishing a task (e.g., moving an object from one place to another).
392
+
393
+ * **Agentic Capabilities:** The model can also generate UI grounding actions (e.g., click the "search" button) and robotics manipulation actions (e.g., a 7-DoF pose for the robot gripper).
394
+
395
+
396
+
397
+ ### Downstream Use
398
+
399
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
400
+
401
+ <!-- {{ downstream_use | default("[More Information Needed]", true)}} -->
402
+
403
+ <!-- ### Out-of-Scope Use -->
404
+
405
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
406
+
407
+ <!-- {{ out_of_scope_use | default("[More Information Needed]", true)}} -->
408
+
409
+ The model can be further finetuned for different downstream tasks, such as:
410
+
411
+ * **Image Captioning and QA:** We can further finetune this model for image captioning and QA tasks within the multimodal LLM pipeline. Based on our experiments, the model achieves competitive performance with better spatial understanding and reasoning on these tasks.
412
+
413
+ * **Video Captioning and QA:** We can further finetune this model for video captioning and QA tasks within the multimodal LLM pipeline. Based on our experiments, the model achieves competitive performance with better temporal understanding and reasoning on these tasks.
414
+
415
+ * **UI Navigation:** We can finetune this model for specific UI navigation tasks, such as web navigation or mobile navigation. The model can achieve superior performance on these tasks.
416
+
417
+ * **Robotics Manipulation:** Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, our model significantly outperforms state-of-the-art models such as OpenVLA on robotics manipulation tasks.
418
+
419
+
420
+ ## Bias, Risks, and Limitations
421
+
422
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
423
+
424
+ <!-- {{ bias_risks_limitations | default("[More Information Needed]", true)}} -->
425
+
426
+ Please note that this model is not specifically designed or evaluated for all downstream purposes.
427
+
428
+ The model is not intended to be deployed in production settings. It should not be used in high-risk scenarios, such as military and defense, financial services, and critical infrastructure systems.
429
+
430
+ Developers should consider common limitations of multimodal models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case.
431
+
432
+ Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Like other multimodal models, Magma can potentially behave in ways that are unfair, unreliable, or offensive.
433
+
434
+ The models' outputs do not reflect the opinions of Microsoft.
435
+
436
+ Some of the limiting behaviors to be aware of include:
437
+
438
+ * **Quality of Service:** The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Magma is not intended to support multilingual use.
439
+
440
+ * **Representation of Harms & Perpetuation of Stereotypes:** These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
441
+
442
+ * **Inappropriate or Offensive Content:** These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
443
+
444
+ * **Information Reliability:** Multimodal models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
445
+
446
+ Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like [Azure AI Content Safety](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety) that have advanced guardrails is highly recommended.
447
+
448
+
449
+ ### Recommendations
450
+
451
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
452
+
453
+ <!-- {{ bias_recommendations | default("Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.", true)}} -->
454
+
455
+ Magma was developed for research purposes only. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
456
+
457
+ The recommended usage for the finetuned models is within the research settings they were trained on, namely:
458
+ - an Android simulator running on a computer for UI manipulation;
459
+ - an enclosure equipped with a robotic arm and everyday objects for robotic manipulation.
460
+
461
+ For the UI navigation task, researchers should make sure a human is in the loop and in control of every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences occur as a result of performing the UI action proposed by the model.
462
+
463
+ For the robotic manipulation task, some mitigation strategies to use for human safety when operating robotic arms include:
464
+
465
+ * **Safety Zones and Barriers:** Establish physical barriers or safety zones around robotic workspaces to prevent unauthorized access.
466
+ * **Emergency Stop Systems:** Equip robotic arms with easily accessible emergency stop buttons. Implement a fail-safe mechanism that triggers an immediate stop of operations in case of an emergency.
467
+ * **Safety Standards and Compliance:** Adhere to established safety standards (e.g., ISO 10218, ISO/TS 15066) for industrial robots and collaborative robots.
468
+ * **User Training and Awareness:** Provide comprehensive training for all personnel working around robotic arms to understand their functions, safety features, and emergency procedures. Promote awareness of the potential risks associated with robotic manipulation.
469
+
470
+
471
+ ## Citation
472
+
473
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
474
+
475
+ ```bibtex
476
+ @misc{yang2025magmafoundationmodelmultimodal,
477
+ title={Magma: A Foundation Model for Multimodal AI Agents},
478
+ author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},
479
+ year={2025},
480
+ eprint={2502.13130},
481
+ archivePrefix={arXiv},
482
+ primaryClass={cs.CV},
483
+ url={https://arxiv.org/abs/2502.13130},
484
+ }
485
+ ```
486
+ <!-- {{ citation_bibtex | default("[More Information Needed]", true)}} -->
config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "architectures": [
3
+ "LlamaForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "microsoft/Magma-8B--configuration_magma.MagmaConfig",
9
+ "AutoModelForCausalLM": "microsoft/Magma-8B--modeling_magma.MagmaForCausalLM"
10
+ },
11
+ "hidden_act": "silu",
12
+ "hidden_size": 4096,
13
+ "image_token_id": 128257,
14
+ "img_size": 512,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 14336,
17
+ "max_position_embeddings": 8192,
18
+ "mlp_bias": false,
19
+ "mm_use_image_history": true,
20
+ "mm_use_image_start_end": true,
21
+ "mm_use_trace_speed": true,
22
+ "mm_use_trace_start_end": false,
23
+ "model_type": "casual_lm",
24
+ "num_attention_heads": 32,
25
+ "num_hidden_layers": 32,
26
+ "num_key_value_heads": 8,
27
+ "pretraining_tp": 1,
28
+ "remove_static_trace_pts": true,
29
+ "rms_norm_eps": 1e-05,
30
+ "rope_scaling": null,
31
+ "rope_theta": 500000.0,
32
+ "spatial_quant_size": 256,
33
+ "text_config": {
34
+ "_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
35
+ "architectures": [
36
+ "LlamaForCausalLM"
37
+ ],
38
+ "attention_bias": false,
39
+ "attention_dropout": 0.0,
40
+ "bos_token_id": 128000,
41
+ "eos_token_id": 128009,
42
+ "head_dim": 128,
43
+ "hidden_act": "silu",
44
+ "hidden_size": 4096,
45
+ "initializer_range": 0.02,
46
+ "intermediate_size": 14336,
47
+ "max_position_embeddings": 8192,
48
+ "mlp_bias": false,
49
+ "model_type": "llama",
50
+ "num_attention_heads": 32,
51
+ "num_hidden_layers": 32,
52
+ "num_key_value_heads": 8,
53
+ "pad_token_id": 128256,
54
+ "pretraining_tp": 1,
55
+ "rms_norm_eps": 1e-05,
56
+ "rope_scaling": null,
57
+ "rope_theta": 500000.0,
58
+ "torch_dtype": "bfloat16",
59
+ "use_cache": true,
60
+ "vocab_size": 128261
61
+ },
62
+ "tie_word_embeddings": false,
63
+ "torch_dtype": "bfloat16",
64
+ "transformers_version": "4.52.0.dev0",
65
+ "use_cache": false,
66
+ "vision_config": {
67
+ "attention_bias": false,
68
+ "attention_dropout": 0.0,
69
+ "bos_token_id": 128000,
70
+ "eos_token_id": 128009,
71
+ "feature_outs": "encoder",
72
+ "freeze_mm_mlp_adapter": false,
73
+ "hidden_act": "silu",
74
+ "hidden_size": 4096,
75
+ "image_aspect_ratio": "square",
76
+ "img_anyres_strategy": "crop",
77
+ "img_size": 512,
78
+ "initializer_range": 0.02,
79
+ "intermediate_size": 14336,
80
+ "max_num_crops": 4,
81
+ "max_position_embeddings": 8192,
82
+ "mm_hidden_size": 3072,
83
+ "mm_projector_lr": null,
84
+ "mm_projector_type": "mlp2x_gelu",
85
+ "mm_use_im_patch_token": false,
86
+ "mm_use_im_start_end": false,
87
+ "mm_use_row_seperator": true,
88
+ "mm_vision_select_feature": "patch",
89
+ "mm_vision_select_layer": -2,
90
+ "model_type": "casual_lm",
91
+ "num_attention_heads": 32,
92
+ "num_hidden_layers": 32,
93
+ "num_key_value_heads": 8,
94
+ "pretraining_tp": 1,
95
+ "proj_vis_to_txt_tokens": false,
96
+ "rms_norm_eps": 1e-05,
97
+ "rope_scaling": null,
98
+ "rope_theta": 500000.0,
99
+ "tie_word_embeddings": false,
100
+ "tokenizer_model_max_length": 4096,
101
+ "tokenizer_padding_side": "right",
102
+ "torch_dtype": "bfloat16",
103
+ "tune_mm_mlp_adapter": false,
104
+ "tune_vision_tokenizer": "all",
105
+ "use_cache": true,
106
+ "use_mm_proj": true,
107
+ "vision_backbone": "convnextxxlarge",
108
+ "vision_feature_layer": "clip_vis_dense",
109
+ "vision_tokenizer_lr": null,
110
+ "vocab_size": 128257
111
+ },
112
+ "vocab_size": 128320
113
+ }
configuration_magma.py ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """Magma model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+ from transformers.utils import logging
24
+ from transformers.models.auto import CONFIG_MAPPING
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+
29
+ class MagmaConfig(PretrainedConfig):
30
+ r"""
31
+ This is the configuration class to store the configuration of a [`MagmaModel`]. It is used to instantiate an Magma
32
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
33
+ defaults will yield a similar configuration to that of the Magma-7B.
34
+
35
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
36
+ documentation from [`PretrainedConfig`] for more information.
37
+
38
+
39
+ Args:
40
+ vocab_size (`int`, *optional*, defaults to 32000):
41
+ Vocabulary size of the Magma model. Defines the number of different tokens that can be represented by the
42
+ `inputs_ids` passed when calling [`MagmaModel`]
43
+ hidden_size (`int`, *optional*, defaults to 4096):
44
+ Dimension of the hidden representations.
45
+ intermediate_size (`int`, *optional*, defaults to 11008):
46
+ Dimension of the MLP representations.
47
+ num_hidden_layers (`int`, *optional*, defaults to 32):
48
+ Number of hidden layers in the Transformer decoder.
49
+ num_attention_heads (`int`, *optional*, defaults to 32):
50
+ Number of attention heads for each attention layer in the Transformer decoder.
51
+ num_key_value_heads (`int`, *optional*):
52
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
53
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
54
+ `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
55
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
56
+ by meanpooling all the original heads within that group. For more details checkout [this
57
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
58
+ `num_attention_heads`.
59
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
60
+ The non-linear activation function (function or string) in the decoder.
61
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
62
+ The maximum sequence length that this model might ever be used with. Magma 1 supports up to 2048 tokens,
63
+ Magma 2 up to 4096, CodeMagma up to 16384.
64
+ initializer_range (`float`, *optional*, defaults to 0.02):
65
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
66
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
67
+ The epsilon used by the rms normalization layers.
68
+ use_cache (`bool`, *optional*, defaults to `True`):
69
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
70
+ relevant if `config.is_decoder=True`.
71
+ pad_token_id (`int`, *optional*):
72
+ Padding token id.
73
+ bos_token_id (`int`, *optional*, defaults to 1):
74
+ Beginning of stream token id.
75
+ eos_token_id (`int`, *optional*, defaults to 2):
76
+ End of stream token id.
77
+ pretraining_tp (`int`, *optional*, defaults to 1):
78
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
79
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
80
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
81
+ issue](https://github.com/pytorch/pytorch/issues/76232).
82
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
83
+ Whether to tie weight embeddings
84
+ rope_theta (`float`, *optional*, defaults to 10000.0):
85
+ The base period of the RoPE embeddings.
86
+ rope_scaling (`Dict`, *optional*):
87
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
88
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
89
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
90
+ `max_position_embeddings` to the expected new maximum.
91
+ attention_bias (`bool`, *optional*, defaults to `False`):
92
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
93
+ attention_dropout (`float`, *optional*, defaults to 0.0):
94
+ The dropout ratio for the attention probabilities.
95
+ mlp_bias (`bool`, *optional*, defaults to `False`):
96
+ Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
97
+
98
+ ```python
99
+ >>> from transformers import MagmaModel, MagmaConfig
100
+
101
+ >>> # Initializing a Magma magma-7b style configuration
102
+ >>> configuration = MagmaConfig()
103
+
104
+ >>> # Initializing a model from the magma-7b style configuration
105
+ >>> model = MagmaModel(configuration)
106
+
107
+ >>> # Accessing the model configuration
108
+ >>> configuration = model.config
109
+ ```"""
110
+
111
+ model_type = "magma"
112
+ keys_to_ignore_at_inference = ["past_key_values"]
113
+
114
+ def __init__(
115
+ self,
116
+ vision_config=None,
117
+ text_config=None,
118
+ image_token_id=None,
119
+ tie_word_embeddings=False,
120
+ **kwargs,
121
+ ):
122
+ self.vision_config = vision_config
123
+ self.image_token_index = image_token_id
124
+
125
+ if isinstance(text_config, dict):
126
+ text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
127
+ text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
128
+ elif text_config is None:
129
+ if "model_type" in kwargs:
130
+ text_config = CONFIG_MAPPING[kwargs["model_type"]](**kwargs)
131
+
132
+ if text_config is not None:
133
+ # copy all variables in text_config to self
134
+ for key, value in text_config.__dict__.items():
135
+ if not key.startswith("_") and not key.startswith("__"):
136
+ setattr(self, key, value)
137
+ self.text_config = text_config
138
+ else:
139
+ self.text_config = None
140
+
141
+ super().__init__(
142
+ tie_word_embeddings=tie_word_embeddings,
143
+ **kwargs,
144
+ )
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 128000,
4
+ "eos_token_id": 128009,
5
+ "pad_token_id": 128256,
6
+ "transformers_version": "4.52.0.dev0"
7
+ }
image_processing_magma.py ADDED
@@ -0,0 +1,257 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ """Image processor class for Magma."""
17
+
18
+ from typing import List, Optional, Union
19
+ import ast
20
+ import numpy as np
21
+ import torchvision
22
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
23
+ from transformers.image_transforms import (
24
+ convert_to_rgb,
25
+ )
26
+ from transformers.image_utils import (
27
+ OPENAI_CLIP_MEAN,
28
+ OPENAI_CLIP_STD,
29
+ ImageInput,
30
+ make_list_of_images,
31
+ valid_images,
32
+ )
33
+ from transformers.utils import TensorType, is_vision_available, logging
34
+
35
+ from transformers import AutoImageProcessor
36
+
37
+ logger = logging.get_logger(__name__)
38
+
39
+
40
+ if is_vision_available():
41
+ from PIL import Image
42
+
43
+ import torch
44
+ import torchvision
45
+
46
+ def select_best_resolution(original_size, possible_resolutions):
47
+ """
48
+ Selects the best resolution from a list of possible resolutions based on the original size.
49
+
50
+ Args:
51
+ original_size (tuple): The original size of the image in the format (width, height).
52
+ possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
53
+
54
+ Returns:
55
+ tuple: The best fit resolution in the format (width, height).
56
+ """
57
+ original_width, original_height = original_size
58
+ best_fit = None
59
+ max_effective_resolution = 0
60
+ min_wasted_resolution = float('inf')
61
+
62
+ for width, height in possible_resolutions:
63
+ scale = min(width / original_width, height / original_height)
64
+ downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
65
+ effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
66
+ wasted_resolution = (width * height) - effective_resolution
67
+
68
+ if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
69
+ max_effective_resolution = effective_resolution
70
+ min_wasted_resolution = wasted_resolution
71
+ best_fit = (width, height)
72
+
73
+ return best_fit
74
+
75
+ def process_anyres_image(image, max_num_crops=None, base_width=768, base_height=768):
76
+ """
77
+ Process an image with variable resolutions.
78
+
79
+ Args:
80
+ image (torch.Tensor): The input image to be processed.
81
+ max_num_crops (int): Maximum number of crops
82
+
83
+ Returns:
84
+ torch.Tensor: A tensor containing the processed image patches.
85
+ """
86
+ assert max_num_crops is not None
87
+ grid_pinpoints = []
88
+ for i in range(1, max_num_crops+1):
89
+ for j in range(1, max_num_crops // i + 1):
90
+ grid_pinpoints.append((i, j))
91
+ grid_pinpoints = [(int(res[0] * base_width), int(res[1] * base_height)) for res in grid_pinpoints]
92
+
93
+ if type(grid_pinpoints) is list:
94
+ possible_resolutions = grid_pinpoints
95
+ else:
96
+ possible_resolutions = ast.literal_eval(grid_pinpoints)
97
+
98
+ best_resolution = select_best_resolution((image.shape[2], image.shape[1]), possible_resolutions)
99
+ # NOTE: reverse best_resolution from (width, height) to (height, width)
100
+ best_resolution = (best_resolution[1], best_resolution[0])
101
+ best_resolution_grid = (best_resolution[0] // base_height, best_resolution[1] // base_width)
102
+
103
+ # resize image tensor to best resolution
104
+ image = torch.nn.functional.interpolate(image[None,:,:,:], size=best_resolution, mode='bilinear')
105
+ # divide image tensor into patches
106
+ patches = image.unfold(2, base_height, base_height).unfold(3, base_width, base_width)
107
+ patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(best_resolution_grid[0]*best_resolution_grid[1], -1, base_height, base_width)
108
+ return (patches, best_resolution_grid)
109
+
110
+ def process_anyres_image_global(image, max_num_crops=None, base_width=768, base_height=768):
111
+ """
112
+ Process an image with variable resolutions.
113
+
114
+ Args:
115
+ image (torch.Tensor): The input image to be processed.
116
+ max_num_crops (int): Maximum number of crops
117
+
118
+ Returns:
119
+ torch.Tensor: A tensor containing the processed image patches.
120
+ """
121
+ assert max_num_crops is not None
122
+ grid_pinpoints = []
123
+ for i in range(1, max_num_crops+1):
124
+ for j in range(1, max_num_crops // i + 1):
125
+ grid_pinpoints.append((i, j))
126
+ grid_pinpoints = [(int(res[0] * base_width), int(res[1] * base_height)) for res in grid_pinpoints]
127
+
128
+ if type(grid_pinpoints) is list:
129
+ possible_resolutions = grid_pinpoints
130
+ else:
131
+ possible_resolutions = ast.literal_eval(grid_pinpoints)
132
+
133
+ best_resolution = select_best_resolution((image.shape[2], image.shape[1]), possible_resolutions)
134
+ # NOTE: reverse best_resolution from (width, height) to (height, width)
135
+ best_resolution = (best_resolution[1], best_resolution[0])
136
+ best_resolution_grid = (best_resolution[0] // base_height, best_resolution[1] // base_width)
137
+
138
+ # resize image tensor to best resolution
139
+ image = torch.nn.functional.interpolate(image[None,:,:,:], size=best_resolution, mode='bilinear')
140
+ return image
141
+
142
+ class preprocessor():
143
+ def __init__(self, image_preprocessor, base_resolution=(256, 256)):
144
+ self.image_preprocessor = image_preprocessor
145
+ self.crop_size = {
146
+ 'height': base_resolution[0],
147
+ 'width': base_resolution[1]
148
+ }
149
+ self.image_mean = image_preprocessor.transforms[-1].mean
150
+
151
+ def preprocess(self, image, return_tensors='pt'):
152
+ image = self.image_preprocessor(image).unsqueeze(0)
153
+ return {
154
+ 'pixel_values': image,
155
+ }
156
+
157
+ class MagmaImageProcessor(BaseImageProcessor):
158
+ r"""
159
+ Constructs a Magma image processor. Based on [`CLIPImageProcessor`] with incorporation of additional techniques
160
+ for processing high resolution images as explained in the [InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512)
161
+
162
+ Args:
163
+ anyres_strategy (`str`):
164
+ strategy to cope with high-resolution images. one conventional way is multi-crop and many other works to accomadate clip-vit models.
165
+ however, since we are using convnext, which is essentially convnet, so we can use arbitary resolution images. as such, we use global strategy by defualt,
166
+ i.e., directly resize image holistically to a certain resolution.
167
+ base_img_size (int, *optional*, defaults to 768):
168
+ as convnext has 1/32 downsample rate, we use 768 as the base resolution so that the resulted feature map is 24x24.
169
+ num_crops (int, *optional*, defaults to 1):
170
+ number of effective crops when coping with images with higher resolution than 768x768. note that num_crops > 1 does not mean we are cropping the image.
171
+ """
172
+
173
+ model_input_names = ["pixel_values"]
174
+
175
+ def __init__(
176
+ self,
177
+ anyres_strategy: str = 'global',
178
+ base_img_size: int = 768,
179
+ num_crops: int = 1,
180
+ do_convert_rgb: bool = True,
181
+ image_mean: List[float] = OPENAI_CLIP_MEAN,
182
+ image_std: List[float] = OPENAI_CLIP_STD,
183
+ **kwargs,
184
+ ) -> None:
185
+ super().__init__(**kwargs)
186
+ self.base_img_size = base_img_size
187
+ self.anyres_strategy = anyres_strategy
188
+ self.num_crops = num_crops
189
+ self.do_convert_rgb = do_convert_rgb
190
+ self.image_mean = image_mean
191
+ self.image_std = image_std
192
+
193
+ def preprocess(
194
+ self,
195
+ images: Union[ImageInput, List[ImageInput]],
196
+ do_pad: bool = False,
197
+ do_convert_rgb: bool = None,
198
+ return_tensors: Optional[Union[str, TensorType]] = None,
199
+ num_crops: int = None,
200
+ ):
201
+ """
202
+ Args:
203
+ images (`ImageInput` or `List[ImageInput]`):
204
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
205
+ passing in images with pixel values between 0 and 1, set `do_rescale=False`.
206
+ image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
207
+ Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
208
+ image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
209
+ Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
210
+ `True`.
211
+ do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
212
+ Whether to convert the image to RGB.
213
+ return_tensors (`str` or `TensorType`, *optional*):
214
+ The type of tensors to return. Can be one of:
215
+ - Unset: Return a list of `np.ndarray`.
216
+ - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
217
+ - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
218
+ - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
219
+ - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
220
+ """
221
+ images = make_list_of_images(images)
222
+
223
+ if not valid_images(images):
224
+ raise ValueError(
225
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
226
+ "torch.Tensor, tf.Tensor or jax.ndarray."
227
+ )
228
+
229
+ if do_convert_rgb:
230
+ images = [convert_to_rgb(image) for image in images]
231
+
232
+ # tensor transform and normalize
233
+ img_processor = torchvision.transforms.Compose([
234
+ torchvision.transforms.ToTensor(),
235
+ torchvision.transforms.Normalize(self.image_mean, self.image_std)
236
+ ])
237
+
238
+ images = [img_processor(image) for image in images]
239
+ image_data_type = 'half' if images[0].type() == 'torch.HalfTensor' else 'float'
240
+ images = [image.float() for image in images]
241
+
242
+ # crop images to the same size
243
+ image_patches = [process_anyres_image(image, self.num_crops if num_crops is None else num_crops, base_width=self.base_img_size, base_height=self.base_img_size) for image in images]
244
+ pixel_values = torch.cat([image[0] for image in image_patches], dim=0)
245
+ # pixel_values = [image[0] for image in image_patches]
246
+ image_sizes = [image_patch[1] for image_patch in image_patches]
247
+
248
+ if image_data_type == 'half':
249
+ pixel_values = pixel_values.half()
250
+
251
+ data = {
252
+ "pixel_values": pixel_values,
253
+ "image_sizes": image_sizes,
254
+ }
255
+ return BatchFeature(data=data, tensor_type=return_tensors)
256
+
257
+ AutoImageProcessor.register("MagmaImageProcessor", MagmaImageProcessor)
image_tower_magma.py ADDED
@@ -0,0 +1,371 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ """Image processor class for Magma."""
17
+
18
+ from typing import List, Optional, Union
19
+ import logging
20
+
21
+ # Configure root logger
22
+ logging.basicConfig(level=logging.INFO)
23
+
24
+ import numpy as np
25
+ import torchvision
26
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
27
+ from transformers.image_transforms import (
28
+ convert_to_rgb,
29
+ )
30
+ from transformers.image_utils import (
31
+ OPENAI_CLIP_MEAN,
32
+ OPENAI_CLIP_STD,
33
+ ImageInput,
34
+ make_list_of_images,
35
+ valid_images,
36
+ )
37
+
38
+ from transformers.utils import TensorType, is_vision_available, logging
39
+ logger = logging.get_logger(__name__)
40
+
41
+
42
+ if is_vision_available():
43
+ from PIL import Image
44
+
45
+ import torchvision
46
+
47
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
48
+ # All rights reserved.
49
+
50
+ # This source code is licensed under the license found in the
51
+ # LICENSE file in the root directory of this source tree.
52
+ import json
+ import os  # needed for os.path.exists() when `pretrained` is a local checkpoint path
53
+ import torch
54
+ import torch.nn as nn
55
+ import torch.nn.functional as F
56
+
57
+ import open_clip
58
+ from open_clip.transform import image_transform_v2, AugmentationCfg, PreprocessCfg, merge_preprocess_dict, merge_preprocess_kwargs
59
+ from open_clip.pretrained import is_pretrained_cfg, get_pretrained_cfg, download_pretrained,\
60
+ list_pretrained_tags_by_model, download_pretrained_from_hf
61
+ from open_clip.model import CLIP, CustomTextCLIP, convert_weights_to_lp, convert_to_custom_text_state_dict,\
62
+ resize_pos_embed, get_cast_dtype, resize_text_pos_embed, set_model_preprocess_cfg
63
+ from pathlib import Path
64
+ from typing import Optional, Tuple, Type
65
+ from functools import partial
66
+ import torch.utils.checkpoint as checkpoint
67
+ from typing import Any, Dict, Optional, Tuple, Union
68
+ from dataclasses import asdict
69
+ HF_HUB_PREFIX = 'hf-hub:'
70
+
71
+ def _get_hf_config(model_id, cache_dir=None):
72
+ config_path = download_pretrained_from_hf(model_id, filename='open_clip_config.json', cache_dir=cache_dir)
73
+ with open(config_path, 'r', encoding='utf-8') as f:
74
+ config = json.load(f)
75
+ return config
76
+
77
+ def create_model(
78
+ model_name: str,
79
+ pretrained: Optional[str] = None,
80
+ precision: str = 'fp32',
81
+ device: Union[str, torch.device] = 'cpu',
82
+ jit: bool = False,
83
+ force_quick_gelu: bool = False,
84
+ force_custom_text: bool = False,
85
+ force_patch_dropout: Optional[float] = None,
86
+ force_path_dropout: Optional[float] = None,
87
+ force_image_size: Optional[Union[int, Tuple[int, int]]] = None,
88
+ force_preprocess_cfg: Optional[Dict[str, Any]] = None,
89
+ pretrained_image: bool = False,
90
+ pretrained_hf: bool = True,
91
+ cache_dir: Optional[str] = None,
92
+ output_dict: Optional[bool] = None,
93
+ require_pretrained: bool = False,
94
+ **model_kwargs,
95
+ ):
96
+ force_preprocess_cfg = force_preprocess_cfg or {}
97
+ preprocess_cfg = asdict(PreprocessCfg())
98
+ has_hf_hub_prefix = model_name.startswith(HF_HUB_PREFIX)
99
+ if has_hf_hub_prefix:
100
+ model_id = model_name[len(HF_HUB_PREFIX):]
101
+ checkpoint_path = download_pretrained_from_hf(model_id, cache_dir=cache_dir)
102
+ config = _get_hf_config(model_id, cache_dir)
103
+ preprocess_cfg = merge_preprocess_dict(preprocess_cfg, config['preprocess_cfg'])
104
+ model_cfg = config['model_cfg']
105
+ pretrained_hf = False # override, no need to load original HF text weights
106
+ else:
107
+ model_name = model_name.replace('/', '-') # for callers using old naming with / in ViT names
108
+ checkpoint_path = None
109
+ model_cfg = None
110
+
111
+ if device == "auto":
112
+ device = {'': device}
113
+ else:
114
+ device = torch.device(device)
115
+
116
+ if pretrained and pretrained.lower() == 'openai':
117
+ logger.info(f'Loading pretrained {model_name} from OpenAI.')
118
+ model = load_openai_model(
119
+ model_name,
120
+ precision=precision,
121
+ device=device,
122
+ cache_dir=cache_dir,
123
+ )
124
+ else:
125
+ model_cfg = model_cfg or get_model_config(model_name)
126
+ if model_cfg is not None:
127
+ logger.info(f'Loaded {model_name} model config.')
128
+ else:
129
+ logger.error(f'Model config for {model_name} not found; available models {list_models()}.')
130
+ raise RuntimeError(f'Model config for {model_name} not found.')
131
+
132
+ if force_quick_gelu:
133
+ # override for use of QuickGELU on non-OpenAI transformer models
134
+ model_cfg["quick_gelu"] = True
135
+
136
+ if force_patch_dropout is not None:
137
+ # override the default patch dropout value
138
+ model_cfg["vision_cfg"]["patch_dropout"] = force_patch_dropout
139
+
140
+ if force_path_dropout is not None:
141
+ # override the default drop path (stochastic depth) rate
142
+ model_cfg["vision_cfg"]["timm_drop_path"] = force_path_dropout
143
+
144
+ if force_image_size is not None:
145
+ # override model config's image size
146
+ model_cfg["vision_cfg"]["image_size"] = force_image_size
147
+
148
+ is_timm_model = 'timm_model_name' in model_cfg.get('vision_cfg', {})
149
+ if pretrained_image:
150
+ if is_timm_model:
151
+ # pretrained weight loading for timm models set via vision_cfg
152
+ model_cfg['vision_cfg']['timm_model_pretrained'] = True
153
+ else:
154
+ assert False, 'pretrained image towers currently only supported for timm models'
155
+
156
+ # cast_dtype set for fp16 and bf16 (manual mixed-precision), not set for 'amp' or 'pure' modes
157
+ cast_dtype = get_cast_dtype(precision)
158
+ is_hf_model = 'hf_model_name' in model_cfg.get('text_cfg', {})
159
+ if is_hf_model:
160
+ # load pretrained weights for HF text model IFF no CLIP weights being loaded
161
+ model_cfg['text_cfg']['hf_model_pretrained'] = pretrained_hf and not pretrained
162
+ custom_text = model_cfg.pop('custom_text', False) or force_custom_text or is_hf_model
163
+
164
+ # model_cfg = dict(model_cfg, **model_kwargs) # merge cfg dict w/ kwargs (kwargs overrides cfg)
165
+ if custom_text:
166
+ if "multimodal_cfg" in model_cfg:
167
+ model = CoCa(**model_cfg, cast_dtype=cast_dtype)
168
+ else:
169
+ model = CustomTextCLIP(**model_cfg, cast_dtype=cast_dtype)
170
+ else:
171
+ model = CLIP(**model_cfg, cast_dtype=cast_dtype)
172
+
173
+ if precision in ("fp16", "bf16"):
174
+ dtype = torch.float16 if 'fp16' in precision else torch.bfloat16
175
+ # manual mixed precision that matches original OpenAI behaviour
176
+ if is_timm_model:
177
+ # FIXME this is a bit janky, create timm based model in low-precision and
178
+ # then cast only LayerNormFp32 instances back to float32 so they don't break.
179
+ # Why? The convert_weights_to_lp fn only works with native models.
180
+ if device != {'':'auto'}:
181
+ model.to(device=device, dtype=dtype)
182
+ else:
183
+ model.to(dtype=dtype)
184
+ else:
185
+ model.to(device=device)
186
+ convert_weights_to_lp(model, dtype=dtype)
187
+ elif precision in ("pure_fp16", "pure_bf16"):
188
+ dtype = torch.float16 if 'fp16' in precision else torch.bfloat16
189
+ model.to(device=device, dtype=dtype)
190
+ # else:
191
+ # model.to(device=device)
192
+
193
+ pretrained_loaded = False
194
+ if pretrained:
195
+ checkpoint_path = ''
196
+ pretrained_cfg = get_pretrained_cfg(model_name, pretrained)
197
+ if pretrained_cfg:
198
+ checkpoint_path = download_pretrained(pretrained_cfg, cache_dir=cache_dir)
199
+ preprocess_cfg = merge_preprocess_dict(preprocess_cfg, pretrained_cfg)
200
+ elif os.path.exists(pretrained):
201
+ checkpoint_path = pretrained
202
+
203
+ # if checkpoint_path:
204
+ # logger.info(f'Loading pretrained {model_name} weights ({pretrained}).')
205
+ # open_clip.load_checkpoint(model, checkpoint_path)
206
+ # else:
207
+ # error_str = (
208
+ # f'Pretrained weights ({pretrained}) not found for model {model_name}.'
209
+ # f' Available pretrained tags ({list_pretrained_tags_by_model(model_name)}.')
210
+ # logger.warning(error_str)
211
+ # raise RuntimeError(error_str)
212
+ # pretrained_loaded = True
213
+ elif has_hf_hub_prefix and require_pretrained:
214
+ logger.info(f'Loading pretrained {model_name} weights ({checkpoint_path}).')
215
+ print(f'Loading pretrained {model_name} weights ({checkpoint_path}).')
216
+ open_clip.load_checkpoint(model, checkpoint_path)
217
+ pretrained_loaded = True
218
+
219
+ if require_pretrained and not pretrained_loaded:
220
+ # callers of create_model_from_pretrained always expect pretrained weights
221
+ raise RuntimeError(
222
+ f'Pretrained weights were required for (model: {model_name}, pretrained: {pretrained}) but not loaded.')
223
+
224
+ if output_dict and hasattr(model, "output_dict"):
225
+ model.output_dict = True
226
+
227
+ if jit:
228
+ model = torch.jit.script(model)
229
+
230
+ # set image preprocessing configuration in model attributes for convenience
231
+ if getattr(model.visual, 'image_size', None) is not None:
232
+ # use image_size set on model creation (via config or force_image_size arg)
233
+ force_preprocess_cfg['size'] = model.visual.image_size
234
+ set_model_preprocess_cfg(model, merge_preprocess_dict(preprocess_cfg, force_preprocess_cfg))
235
+
236
+ return model
237
+
238
+ def create_model_and_transforms(
239
+ model_name: str,
240
+ pretrained: Optional[str] = None,
241
+ precision: str = 'fp32',
242
+ device: Union[str, torch.device] = 'cpu',
243
+ jit: bool = False,
244
+ force_quick_gelu: bool = False,
245
+ force_custom_text: bool = False,
246
+ force_patch_dropout: Optional[float] = None,
247
+ force_path_dropout: Optional[float] = None,
248
+ force_image_size: Optional[Union[int, Tuple[int, int]]] = None,
249
+ image_mean: Optional[Tuple[float, ...]] = None,
250
+ image_std: Optional[Tuple[float, ...]] = None,
251
+ image_interpolation: Optional[str] = None,
252
+ image_resize_mode: Optional[str] = None, # only effective for inference
253
+ aug_cfg: Optional[Union[Dict[str, Any], AugmentationCfg]] = None,
254
+ pretrained_image: bool = False,
255
+ pretrained_hf: bool = True,
256
+ cache_dir: Optional[str] = None,
257
+ output_dict: Optional[bool] = None,
258
+ **model_kwargs,
259
+ ):
260
+ force_preprocess_cfg = merge_preprocess_kwargs(
261
+ {}, mean=image_mean, std=image_std, interpolation=image_interpolation, resize_mode=image_resize_mode)
262
+
263
+ return create_model(
264
+ model_name,
265
+ pretrained,
266
+ precision=precision,
267
+ device=device,
268
+ jit=jit,
269
+ force_quick_gelu=force_quick_gelu,
270
+ force_custom_text=force_custom_text,
271
+ force_patch_dropout=force_patch_dropout,
272
+ force_path_dropout=force_path_dropout,
273
+ force_image_size=force_image_size,
274
+ force_preprocess_cfg=force_preprocess_cfg,
275
+ pretrained_image=pretrained_image,
276
+ pretrained_hf=pretrained_hf,
277
+ cache_dir=cache_dir,
278
+ output_dict=output_dict,
279
+ **model_kwargs,
280
+ )
281
+
282
+ class D2CLIP_HF(nn.Module):
283
+ def __init__(self, config, **kwargs):
284
+ super().__init__()
285
+ self.model_name = config['vision_backbone']
286
+
287
+ require_pretrained = kwargs.get('require_pretrained', False)
288
+ if self.model_name == "convnextxxlarge":
289
+ clip_model = create_model_and_transforms('hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg', require_pretrained=require_pretrained)
290
+ elif self.model_name == "convnextlarge":
291
+ clip_model = create_model_and_transforms('hf-hub:laion/CLIP-convnext_large-laion2B-s34B-b82K-augreg', require_pretrained=require_pretrained)
292
+
293
+ self.clip_vision_model = clip_model.visual
294
+
295
+ model_name = self.model_name.lower()
296
+ assert 'convnext' in model_name, f"Only convnext backbone is supported for Magma model, but got {model_name}"
297
+ self.model_type = 'convnext'
298
+ if 'xxlarge' in model_name:
299
+ self.output_channels = [384, 384, 768, 1536, 3072]
300
+ elif 'large' in model_name:
301
+ self.output_channels = [192, 192, 384, 768, 1536]
302
+ elif 'base' in model_name:
303
+ self.output_channels = [128, 128, 256, 512, 1024]
304
+
305
+ self._out_feature_strides = {
306
+ "res2": 4,
307
+ "res3": 8,
308
+ "res4": 16,
309
+ "res5": 32,
310
+ }
311
+ self._out_feature_channels = {
312
+ "res2": self.output_channels[1],
313
+ "res3": self.output_channels[2],
314
+ "res4": self.output_channels[3],
315
+ "res5": self.output_channels[4],
316
+ }
317
+
318
+ def extract_features_convnext(self, x, gradient_checkpointing=True):
319
+ out = {}
320
+ x = self.clip_vision_model.trunk.stem(x)
321
+ if gradient_checkpointing:
322
+ x = checkpoint.checkpoint(self.clip_vision_model.trunk.stages, x)
323
+ else:
324
+ x = self.clip_vision_model.trunk.stages(x)
325
+ out['clip_vis_dense'] = x
326
+ return out
327
+
328
+
329
+ def forward(self, x, gradient_checkpointing=True):
330
+ """
331
+ Args:
332
+ x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.
333
+ Returns:
334
+ dict[str->Tensor]: names and the corresponding features
335
+ """
336
+ return self.extract_features_convnext(x, gradient_checkpointing=gradient_checkpointing)
337
+
338
+ @property
339
+ def size_divisibility(self):
340
+ return 32
341
+
342
+ class MagmaImageTower(D2CLIP_HF):
343
+ r"""
344
+ Constructs the Magma image tower (vision backbone). It wraps a ConvNeXt-based CLIP visual trunk and exposes its
345
+ feature maps for the high-resolution (any-resolution) crops produced by the image processor, following [InternLM-XComposer2-4KHD](https://arxiv.org/pdf/2404.06512).
346
+
347
+ Args:
348
+ config (dict): Configuration dictionary for the image tower; must contain the `vision_backbone` key (e.g. `"convnextxxlarge"`).
349
+ """
350
+
351
+ def __init__(
352
+ self,
353
+ config,
354
+ **kwargs
355
+ ) -> None:
356
+ super().__init__(config, **kwargs)
357
+
358
+ @property
359
+ def hidden_size(self):
360
+ return self.output_channels[-1]
361
+
362
+
363
+ def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
364
+ r"""
365
+ Args:
366
+ x (torch.Tensor): A tensor of shape (N, C, H, W) representing an image.
367
+
368
+ Returns:
369
+ Dict[str, torch.Tensor]: Feature dictionary; "clip_vis_dense" holds the final ConvNeXt stage output of shape (N, hidden_size, H/32, W/32).
370
+ """
371
+ return super().forward(x)
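
As a rough illustration of how this tower is used, the sketch below instantiates it directly and runs a dummy forward pass. The minimal `config` dict, `require_pretrained=False` (which skips loading the CLIP checkpoint), and the standalone import path are assumptions for illustration; inside Magma the tower is constructed by the model itself.

```python
import torch
from image_tower_magma import MagmaImageTower  # assumes this file is importable as a module

config = {"vision_backbone": "convnextxxlarge"}            # minimal hypothetical config
tower = MagmaImageTower(config, require_pretrained=False)  # randomly initialized trunk, no CLIP weights
tower.eval()

# H and W must be multiples of tower.size_divisibility (32).
x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    feats = tower.extract_features_convnext(x, gradient_checkpointing=False)

# 'clip_vis_dense' is the final ConvNeXt stage: (1, 3072, 16, 16) for the xxlarge trunk at this input size.
print(feats["clip_vis_dense"].shape)
print(tower.hidden_size)  # 3072 for convnextxxlarge
```
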
model.safetensors.index.json ADDED
@@ -0,0 +1,679 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 17806132992
4
+ },
5
+ "weight_map": {
6
+ "language_model.lm_head.weight": "model-00004-of-00004.safetensors",
7
+ "language_model.model.embed_tokens.weight": "model-00001-of-00004.safetensors",
8
+ "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
9
+ "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
10
+ "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
11
+ "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
12
+ "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
13
+ "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
14
+ "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
15
+ "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
16
+ "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
17
+ "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
18
+ "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
19
+ "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
20
+ "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
21
+ "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
22
+ "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
23
+ "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
24
+ "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
25
+ "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
26
+ "language_model.model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
27
+ "language_model.model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
28
+ "language_model.model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
29
+ "language_model.model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
30
+ "language_model.model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
31
+ "language_model.model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
32
+ "language_model.model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
33
+ "language_model.model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
34
+ "language_model.model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
35
+ "language_model.model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
36
+ "language_model.model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
37
+ "language_model.model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
38
+ "language_model.model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
39
+ "language_model.model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
40
+ "language_model.model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
41
+ "language_model.model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
42
+ "language_model.model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
43
+ "language_model.model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
44
+ "language_model.model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
45
+ "language_model.model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
46
+ "language_model.model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
47
+ "language_model.model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
48
+ "language_model.model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
49
+ "language_model.model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
50
+ "language_model.model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
51
+ "language_model.model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
52
+ "language_model.model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
53
+ "language_model.model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
54
+ "language_model.model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
55
+ "language_model.model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
56
+ "language_model.model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
57
+ "language_model.model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
58
+ "language_model.model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
59
+ "language_model.model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
60
+ "language_model.model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
61
+ "language_model.model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
62
+ "language_model.model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
63
+ "language_model.model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
64
+ "language_model.model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
65
+ "language_model.model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
66
+ "language_model.model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
67
+ "language_model.model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
68
+ "language_model.model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
69
+ "language_model.model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
70
+ "language_model.model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
71
+ "language_model.model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
72
+ "language_model.model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
73
+ "language_model.model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
74
+ "language_model.model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
75
+ "language_model.model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
76
+ "language_model.model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
77
+ "language_model.model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
78
+ "language_model.model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
79
+ "language_model.model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
80
+ "language_model.model.layers.16.input_layernorm.weight": "model-00003-of-00004.safetensors",
81
+ "language_model.model.layers.16.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
82
+ "language_model.model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
83
+ "language_model.model.layers.16.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
84
+ "language_model.model.layers.16.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
85
+ "language_model.model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
86
+ "language_model.model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
87
+ "language_model.model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
88
+ "language_model.model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
89
+ "language_model.model.layers.17.input_layernorm.weight": "model-00003-of-00004.safetensors",
90
+ "language_model.model.layers.17.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
91
+ "language_model.model.layers.17.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
92
+ "language_model.model.layers.17.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
93
+ "language_model.model.layers.17.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
94
+ "language_model.model.layers.17.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
95
+ "language_model.model.layers.17.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
96
+ "language_model.model.layers.17.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
97
+ "language_model.model.layers.17.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
98
+ "language_model.model.layers.18.input_layernorm.weight": "model-00003-of-00004.safetensors",
99
+ "language_model.model.layers.18.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
100
+ "language_model.model.layers.18.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
101
+ "language_model.model.layers.18.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
102
+ "language_model.model.layers.18.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
103
+ "language_model.model.layers.18.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
104
+ "language_model.model.layers.18.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
105
+ "language_model.model.layers.18.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
106
+ "language_model.model.layers.18.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
107
+ "language_model.model.layers.19.input_layernorm.weight": "model-00003-of-00004.safetensors",
108
+ "language_model.model.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
109
+ "language_model.model.layers.19.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
110
+ "language_model.model.layers.19.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
111
+ "language_model.model.layers.19.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
112
+ "language_model.model.layers.19.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
113
+ "language_model.model.layers.19.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
114
+ "language_model.model.layers.19.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
115
+ "language_model.model.layers.19.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
116
+ "language_model.model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
117
+ "language_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
118
+ "language_model.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
119
+ "language_model.model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
120
+ "language_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
121
+ "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
122
+ "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
123
+ "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
124
+ "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
125
+ "language_model.model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
126
+ "language_model.model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
127
+ "language_model.model.layers.20.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
128
+ "language_model.model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
129
+ "language_model.model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
130
+ "language_model.model.layers.20.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
131
+ "language_model.model.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
132
+ "language_model.model.layers.20.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
133
+ "language_model.model.layers.20.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
134
+ "language_model.model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
135
+ "language_model.model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
136
+ "language_model.model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
137
+ "language_model.model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
138
+ "language_model.model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
139
+ "language_model.model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
140
+ "language_model.model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
141
+ "language_model.model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
142
+ "language_model.model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
143
+ "language_model.model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
144
+ "language_model.model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
145
+ "language_model.model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
146
+ "language_model.model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
147
+ "language_model.model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
148
+ "language_model.model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
149
+ "language_model.model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
150
+ "language_model.model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
151
+ "language_model.model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
152
+ "language_model.model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
153
+ "language_model.model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
154
+ "language_model.model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
155
+ "language_model.model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
156
+ "language_model.model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
157
+ "language_model.model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
158
+ "language_model.model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
159
+ "language_model.model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
160
+ "language_model.model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
161
+ "language_model.model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
162
+ "language_model.model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
163
+ "language_model.model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
164
+ "language_model.model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
165
+ "language_model.model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
166
+ "language_model.model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
167
+ "language_model.model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
168
+ "language_model.model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
169
+ "language_model.model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
170
+ "language_model.model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
171
+ "language_model.model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
172
+ "language_model.model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
173
+ "language_model.model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
174
+ "language_model.model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
175
+ "language_model.model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
176
+ "language_model.model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
177
+ "language_model.model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
178
+ "language_model.model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
179
+ "language_model.model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
180
+ "language_model.model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
181
+ "language_model.model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
182
+ "language_model.model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
183
+ "language_model.model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
184
+ "language_model.model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
185
+ "language_model.model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
186
+ "language_model.model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
187
+ "language_model.model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
188
+ "language_model.model.layers.27.input_layernorm.weight": "model-00004-of-00004.safetensors",
189
+ "language_model.model.layers.27.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
190
+ "language_model.model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
191
+ "language_model.model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
192
+ "language_model.model.layers.27.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
193
+ "language_model.model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
194
+ "language_model.model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
195
+ "language_model.model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
196
+ "language_model.model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
197
+ "language_model.model.layers.28.input_layernorm.weight": "model-00004-of-00004.safetensors",
198
+ "language_model.model.layers.28.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
199
+ "language_model.model.layers.28.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
200
+ "language_model.model.layers.28.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
201
+ "language_model.model.layers.28.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
202
+ "language_model.model.layers.28.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
203
+ "language_model.model.layers.28.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
204
+ "language_model.model.layers.28.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
205
+ "language_model.model.layers.28.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
206
+ "language_model.model.layers.29.input_layernorm.weight": "model-00004-of-00004.safetensors",
207
+ "language_model.model.layers.29.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
208
+ "language_model.model.layers.29.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
209
+ "language_model.model.layers.29.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
210
+ "language_model.model.layers.29.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
211
+ "language_model.model.layers.29.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
212
+ "language_model.model.layers.29.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
213
+ "language_model.model.layers.29.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
214
+ "language_model.model.layers.29.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
215
+ "language_model.model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
216
+ "language_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
217
+ "language_model.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
218
+ "language_model.model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
219
+ "language_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
220
+ "language_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
221
+ "language_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
222
+ "language_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
223
+ "language_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
224
+ "language_model.model.layers.30.input_layernorm.weight": "model-00004-of-00004.safetensors",
225
+ "language_model.model.layers.30.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
226
+ "language_model.model.layers.30.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
227
+ "language_model.model.layers.30.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
228
+ "language_model.model.layers.30.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
229
+ "language_model.model.layers.30.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
230
+ "language_model.model.layers.30.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
231
+ "language_model.model.layers.30.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
232
+ "language_model.model.layers.30.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
233
+ "language_model.model.layers.31.input_layernorm.weight": "model-00004-of-00004.safetensors",
234
+ "language_model.model.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
235
+ "language_model.model.layers.31.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
236
+ "language_model.model.layers.31.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
237
+ "language_model.model.layers.31.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
238
+ "language_model.model.layers.31.self_attn.k_proj.weight": "model-00004-of-00004.safetensors",
239
+ "language_model.model.layers.31.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
240
+ "language_model.model.layers.31.self_attn.q_proj.weight": "model-00004-of-00004.safetensors",
241
+ "language_model.model.layers.31.self_attn.v_proj.weight": "model-00004-of-00004.safetensors",
242
+ "language_model.model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
243
+ "language_model.model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
244
+ "language_model.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
245
+ "language_model.model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
246
+ "language_model.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
247
+ "language_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
248
+ "language_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
249
+ "language_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
250
+ "language_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
251
+ "language_model.model.layers.5.input_layernorm.weight": "model-00002-of-00004.safetensors",
252
+ "language_model.model.layers.5.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
253
+ "language_model.model.layers.5.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
254
+ "language_model.model.layers.5.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
255
+ "language_model.model.layers.5.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
256
+ "language_model.model.layers.5.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
257
+ "language_model.model.layers.5.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
258
+ "language_model.model.layers.5.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
259
+ "language_model.model.layers.5.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
260
+ "language_model.model.layers.6.input_layernorm.weight": "model-00002-of-00004.safetensors",
261
+ "language_model.model.layers.6.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
262
+ "language_model.model.layers.6.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
263
+ "language_model.model.layers.6.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
264
+ "language_model.model.layers.6.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
265
+ "language_model.model.layers.6.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
266
+ "language_model.model.layers.6.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
267
+ "language_model.model.layers.6.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
268
+ "language_model.model.layers.6.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
269
+ "language_model.model.layers.7.input_layernorm.weight": "model-00002-of-00004.safetensors",
270
+ "language_model.model.layers.7.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
271
+ "language_model.model.layers.7.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
272
+ "language_model.model.layers.7.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
273
+ "language_model.model.layers.7.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
274
+ "language_model.model.layers.7.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
275
+ "language_model.model.layers.7.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
276
+ "language_model.model.layers.7.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
277
+ "language_model.model.layers.7.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
278
+ "language_model.model.layers.8.input_layernorm.weight": "model-00002-of-00004.safetensors",
279
+ "language_model.model.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
280
+ "language_model.model.layers.8.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
281
+ "language_model.model.layers.8.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
282
+ "language_model.model.layers.8.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
283
+ "language_model.model.layers.8.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
284
+ "language_model.model.layers.8.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
285
+ "language_model.model.layers.8.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
286
+ "language_model.model.layers.8.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
287
+ "language_model.model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
288
+ "language_model.model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
289
+ "language_model.model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
290
+ "language_model.model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
291
+ "language_model.model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
292
+ "language_model.model.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
293
+ "language_model.model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
294
+ "language_model.model.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
295
+ "language_model.model.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
296
+ "language_model.model.norm.weight": "model-00004-of-00004.safetensors",
297
+ "multi_modal_projector.proj.0.bias": "model-00001-of-00004.safetensors",
298
+ "multi_modal_projector.proj.0.weight": "model-00001-of-00004.safetensors",
299
+ "multi_modal_projector.proj.2.bias": "model-00001-of-00004.safetensors",
300
+ "multi_modal_projector.proj.2.weight": "model-00001-of-00004.safetensors",
301
+ "multi_modal_projector.row_seperator": "model-00001-of-00004.safetensors",
302
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.conv_dw.bias": "model-00001-of-00004.safetensors",
303
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.conv_dw.weight": "model-00001-of-00004.safetensors",
304
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.gamma": "model-00001-of-00004.safetensors",
305
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
306
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
307
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
308
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
309
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.norm.bias": "model-00001-of-00004.safetensors",
310
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.0.norm.weight": "model-00001-of-00004.safetensors",
311
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.conv_dw.bias": "model-00001-of-00004.safetensors",
312
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.conv_dw.weight": "model-00001-of-00004.safetensors",
313
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.gamma": "model-00001-of-00004.safetensors",
314
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
315
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
316
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
317
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
318
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.norm.bias": "model-00001-of-00004.safetensors",
319
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.1.norm.weight": "model-00001-of-00004.safetensors",
320
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.conv_dw.bias": "model-00001-of-00004.safetensors",
321
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.conv_dw.weight": "model-00001-of-00004.safetensors",
322
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.gamma": "model-00001-of-00004.safetensors",
323
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
324
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
325
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
326
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
327
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.norm.bias": "model-00001-of-00004.safetensors",
328
+ "vision_tower.clip_vision_model.trunk.stages.0.blocks.2.norm.weight": "model-00001-of-00004.safetensors",
329
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.conv_dw.bias": "model-00001-of-00004.safetensors",
330
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.conv_dw.weight": "model-00001-of-00004.safetensors",
331
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.gamma": "model-00001-of-00004.safetensors",
332
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
333
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
334
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
335
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
336
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.norm.bias": "model-00001-of-00004.safetensors",
337
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.0.norm.weight": "model-00001-of-00004.safetensors",
338
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.conv_dw.bias": "model-00001-of-00004.safetensors",
339
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.conv_dw.weight": "model-00001-of-00004.safetensors",
340
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.gamma": "model-00001-of-00004.safetensors",
341
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
342
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
343
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
344
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
345
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.norm.bias": "model-00001-of-00004.safetensors",
346
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.1.norm.weight": "model-00001-of-00004.safetensors",
347
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.conv_dw.bias": "model-00001-of-00004.safetensors",
348
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.conv_dw.weight": "model-00001-of-00004.safetensors",
349
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.gamma": "model-00001-of-00004.safetensors",
350
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
351
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
352
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
353
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
354
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.norm.bias": "model-00001-of-00004.safetensors",
355
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.2.norm.weight": "model-00001-of-00004.safetensors",
356
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.conv_dw.bias": "model-00001-of-00004.safetensors",
357
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.conv_dw.weight": "model-00001-of-00004.safetensors",
358
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.gamma": "model-00001-of-00004.safetensors",
359
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.mlp.fc1.bias": "model-00001-of-00004.safetensors",
360
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.mlp.fc1.weight": "model-00001-of-00004.safetensors",
361
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.mlp.fc2.bias": "model-00001-of-00004.safetensors",
362
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.mlp.fc2.weight": "model-00001-of-00004.safetensors",
363
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.norm.bias": "model-00001-of-00004.safetensors",
364
+ "vision_tower.clip_vision_model.trunk.stages.1.blocks.3.norm.weight": "model-00001-of-00004.safetensors",
365
+ "vision_tower.clip_vision_model.trunk.stages.1.downsample.0.bias": "model-00001-of-00004.safetensors",
366
+ "vision_tower.clip_vision_model.trunk.stages.1.downsample.0.weight": "model-00001-of-00004.safetensors",
367
+ "vision_tower.clip_vision_model.trunk.stages.1.downsample.1.bias": "model-00001-of-00004.safetensors",
368
+ "vision_tower.clip_vision_model.trunk.stages.1.downsample.1.weight": "model-00001-of-00004.safetensors",
369
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.conv_dw.bias": "model-00001-of-00004.safetensors",
370
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.conv_dw.weight": "model-00001-of-00004.safetensors",
371
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.gamma": "model-00001-of-00004.safetensors",
372
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
373
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
374
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
375
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
376
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.norm.bias": "model-00001-of-00004.safetensors",
377
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.0.norm.weight": "model-00001-of-00004.safetensors",
378
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.conv_dw.bias": "model-00001-of-00004.safetensors",
379
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.conv_dw.weight": "model-00001-of-00004.safetensors",
380
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.gamma": "model-00001-of-00004.safetensors",
381
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
382
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
383
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
384
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
385
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.norm.bias": "model-00001-of-00004.safetensors",
386
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.1.norm.weight": "model-00001-of-00004.safetensors",
387
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.conv_dw.bias": "model-00001-of-00004.safetensors",
388
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.conv_dw.weight": "model-00001-of-00004.safetensors",
389
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.gamma": "model-00001-of-00004.safetensors",
390
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.mlp.fc1.bias": "model-00001-of-00004.safetensors",
391
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.mlp.fc1.weight": "model-00001-of-00004.safetensors",
392
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.mlp.fc2.bias": "model-00001-of-00004.safetensors",
393
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.mlp.fc2.weight": "model-00001-of-00004.safetensors",
394
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.norm.bias": "model-00001-of-00004.safetensors",
395
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.10.norm.weight": "model-00001-of-00004.safetensors",
396
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.conv_dw.bias": "model-00001-of-00004.safetensors",
397
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.conv_dw.weight": "model-00001-of-00004.safetensors",
398
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.gamma": "model-00001-of-00004.safetensors",
399
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.mlp.fc1.bias": "model-00001-of-00004.safetensors",
400
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.mlp.fc1.weight": "model-00001-of-00004.safetensors",
401
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.mlp.fc2.bias": "model-00001-of-00004.safetensors",
402
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.mlp.fc2.weight": "model-00001-of-00004.safetensors",
403
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.norm.bias": "model-00001-of-00004.safetensors",
404
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.11.norm.weight": "model-00001-of-00004.safetensors",
405
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.conv_dw.bias": "model-00001-of-00004.safetensors",
406
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.conv_dw.weight": "model-00001-of-00004.safetensors",
407
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.gamma": "model-00001-of-00004.safetensors",
408
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.mlp.fc1.bias": "model-00001-of-00004.safetensors",
409
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.mlp.fc1.weight": "model-00001-of-00004.safetensors",
410
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.mlp.fc2.bias": "model-00001-of-00004.safetensors",
411
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.mlp.fc2.weight": "model-00001-of-00004.safetensors",
412
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.norm.bias": "model-00001-of-00004.safetensors",
413
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.12.norm.weight": "model-00001-of-00004.safetensors",
414
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.conv_dw.bias": "model-00001-of-00004.safetensors",
415
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.conv_dw.weight": "model-00001-of-00004.safetensors",
416
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.gamma": "model-00001-of-00004.safetensors",
417
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.mlp.fc1.bias": "model-00001-of-00004.safetensors",
418
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.mlp.fc1.weight": "model-00001-of-00004.safetensors",
419
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.mlp.fc2.bias": "model-00001-of-00004.safetensors",
420
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.mlp.fc2.weight": "model-00001-of-00004.safetensors",
421
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.norm.bias": "model-00001-of-00004.safetensors",
422
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.13.norm.weight": "model-00001-of-00004.safetensors",
423
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.conv_dw.bias": "model-00001-of-00004.safetensors",
424
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.conv_dw.weight": "model-00001-of-00004.safetensors",
425
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.gamma": "model-00001-of-00004.safetensors",
426
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.mlp.fc1.bias": "model-00001-of-00004.safetensors",
427
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.mlp.fc1.weight": "model-00001-of-00004.safetensors",
428
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.mlp.fc2.bias": "model-00001-of-00004.safetensors",
429
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.mlp.fc2.weight": "model-00001-of-00004.safetensors",
430
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.norm.bias": "model-00001-of-00004.safetensors",
431
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.14.norm.weight": "model-00001-of-00004.safetensors",
432
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.conv_dw.bias": "model-00001-of-00004.safetensors",
433
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.conv_dw.weight": "model-00001-of-00004.safetensors",
434
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.gamma": "model-00001-of-00004.safetensors",
435
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.mlp.fc1.bias": "model-00001-of-00004.safetensors",
436
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.mlp.fc1.weight": "model-00001-of-00004.safetensors",
437
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.mlp.fc2.bias": "model-00001-of-00004.safetensors",
438
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.mlp.fc2.weight": "model-00001-of-00004.safetensors",
439
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.norm.bias": "model-00001-of-00004.safetensors",
440
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.15.norm.weight": "model-00001-of-00004.safetensors",
441
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.conv_dw.bias": "model-00001-of-00004.safetensors",
442
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.conv_dw.weight": "model-00001-of-00004.safetensors",
443
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.gamma": "model-00001-of-00004.safetensors",
444
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.mlp.fc1.bias": "model-00001-of-00004.safetensors",
445
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.mlp.fc1.weight": "model-00001-of-00004.safetensors",
446
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.mlp.fc2.bias": "model-00001-of-00004.safetensors",
447
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.mlp.fc2.weight": "model-00001-of-00004.safetensors",
448
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.norm.bias": "model-00001-of-00004.safetensors",
449
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.16.norm.weight": "model-00001-of-00004.safetensors",
450
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.conv_dw.bias": "model-00001-of-00004.safetensors",
451
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.conv_dw.weight": "model-00001-of-00004.safetensors",
452
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.gamma": "model-00001-of-00004.safetensors",
453
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.mlp.fc1.bias": "model-00001-of-00004.safetensors",
454
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.mlp.fc1.weight": "model-00001-of-00004.safetensors",
455
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.mlp.fc2.bias": "model-00001-of-00004.safetensors",
456
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.mlp.fc2.weight": "model-00001-of-00004.safetensors",
457
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.norm.bias": "model-00001-of-00004.safetensors",
458
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.17.norm.weight": "model-00001-of-00004.safetensors",
459
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.conv_dw.bias": "model-00001-of-00004.safetensors",
460
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.conv_dw.weight": "model-00001-of-00004.safetensors",
461
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.gamma": "model-00001-of-00004.safetensors",
462
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.mlp.fc1.bias": "model-00001-of-00004.safetensors",
463
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.mlp.fc1.weight": "model-00001-of-00004.safetensors",
464
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.mlp.fc2.bias": "model-00001-of-00004.safetensors",
465
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.mlp.fc2.weight": "model-00001-of-00004.safetensors",
466
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.norm.bias": "model-00001-of-00004.safetensors",
467
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.18.norm.weight": "model-00001-of-00004.safetensors",
468
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.conv_dw.bias": "model-00001-of-00004.safetensors",
469
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.conv_dw.weight": "model-00001-of-00004.safetensors",
470
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.gamma": "model-00001-of-00004.safetensors",
471
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.mlp.fc1.bias": "model-00001-of-00004.safetensors",
472
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.mlp.fc1.weight": "model-00001-of-00004.safetensors",
473
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.mlp.fc2.bias": "model-00001-of-00004.safetensors",
474
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.mlp.fc2.weight": "model-00001-of-00004.safetensors",
475
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.norm.bias": "model-00001-of-00004.safetensors",
476
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.19.norm.weight": "model-00001-of-00004.safetensors",
477
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.conv_dw.bias": "model-00001-of-00004.safetensors",
478
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.conv_dw.weight": "model-00001-of-00004.safetensors",
479
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.gamma": "model-00001-of-00004.safetensors",
480
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
481
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
482
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
483
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
484
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.norm.bias": "model-00001-of-00004.safetensors",
485
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.2.norm.weight": "model-00001-of-00004.safetensors",
486
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.conv_dw.bias": "model-00001-of-00004.safetensors",
487
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.conv_dw.weight": "model-00001-of-00004.safetensors",
488
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.gamma": "model-00001-of-00004.safetensors",
489
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.mlp.fc1.bias": "model-00001-of-00004.safetensors",
490
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.mlp.fc1.weight": "model-00001-of-00004.safetensors",
491
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.mlp.fc2.bias": "model-00001-of-00004.safetensors",
492
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.mlp.fc2.weight": "model-00001-of-00004.safetensors",
493
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.norm.bias": "model-00001-of-00004.safetensors",
494
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.20.norm.weight": "model-00001-of-00004.safetensors",
495
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.conv_dw.bias": "model-00001-of-00004.safetensors",
496
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.conv_dw.weight": "model-00001-of-00004.safetensors",
497
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.gamma": "model-00001-of-00004.safetensors",
498
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.mlp.fc1.bias": "model-00001-of-00004.safetensors",
499
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.mlp.fc1.weight": "model-00001-of-00004.safetensors",
500
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.mlp.fc2.bias": "model-00001-of-00004.safetensors",
501
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.mlp.fc2.weight": "model-00001-of-00004.safetensors",
502
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.norm.bias": "model-00001-of-00004.safetensors",
503
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.21.norm.weight": "model-00001-of-00004.safetensors",
504
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.conv_dw.bias": "model-00001-of-00004.safetensors",
505
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.conv_dw.weight": "model-00001-of-00004.safetensors",
506
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.gamma": "model-00001-of-00004.safetensors",
507
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.mlp.fc1.bias": "model-00001-of-00004.safetensors",
508
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.mlp.fc1.weight": "model-00001-of-00004.safetensors",
509
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.mlp.fc2.bias": "model-00001-of-00004.safetensors",
510
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.mlp.fc2.weight": "model-00001-of-00004.safetensors",
511
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.norm.bias": "model-00001-of-00004.safetensors",
512
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.22.norm.weight": "model-00001-of-00004.safetensors",
513
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.conv_dw.bias": "model-00001-of-00004.safetensors",
514
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.conv_dw.weight": "model-00001-of-00004.safetensors",
515
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.gamma": "model-00001-of-00004.safetensors",
516
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.mlp.fc1.bias": "model-00001-of-00004.safetensors",
517
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.mlp.fc1.weight": "model-00001-of-00004.safetensors",
518
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.mlp.fc2.bias": "model-00001-of-00004.safetensors",
519
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.mlp.fc2.weight": "model-00001-of-00004.safetensors",
520
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.norm.bias": "model-00001-of-00004.safetensors",
521
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.23.norm.weight": "model-00001-of-00004.safetensors",
522
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.conv_dw.bias": "model-00001-of-00004.safetensors",
523
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.conv_dw.weight": "model-00001-of-00004.safetensors",
524
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.gamma": "model-00001-of-00004.safetensors",
525
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.mlp.fc1.bias": "model-00001-of-00004.safetensors",
526
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.mlp.fc1.weight": "model-00001-of-00004.safetensors",
527
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.mlp.fc2.bias": "model-00001-of-00004.safetensors",
528
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.mlp.fc2.weight": "model-00001-of-00004.safetensors",
529
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.norm.bias": "model-00001-of-00004.safetensors",
530
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.24.norm.weight": "model-00001-of-00004.safetensors",
531
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.conv_dw.bias": "model-00001-of-00004.safetensors",
532
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.conv_dw.weight": "model-00001-of-00004.safetensors",
533
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.gamma": "model-00001-of-00004.safetensors",
534
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.mlp.fc1.bias": "model-00001-of-00004.safetensors",
535
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.mlp.fc1.weight": "model-00001-of-00004.safetensors",
536
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.mlp.fc2.bias": "model-00001-of-00004.safetensors",
537
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.mlp.fc2.weight": "model-00001-of-00004.safetensors",
538
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.norm.bias": "model-00001-of-00004.safetensors",
539
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.25.norm.weight": "model-00001-of-00004.safetensors",
540
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.conv_dw.bias": "model-00001-of-00004.safetensors",
541
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.conv_dw.weight": "model-00001-of-00004.safetensors",
542
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.gamma": "model-00001-of-00004.safetensors",
543
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.mlp.fc1.bias": "model-00001-of-00004.safetensors",
544
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.mlp.fc1.weight": "model-00001-of-00004.safetensors",
545
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.mlp.fc2.bias": "model-00001-of-00004.safetensors",
546
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.mlp.fc2.weight": "model-00001-of-00004.safetensors",
547
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.norm.bias": "model-00001-of-00004.safetensors",
548
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.26.norm.weight": "model-00001-of-00004.safetensors",
549
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.conv_dw.bias": "model-00001-of-00004.safetensors",
550
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.conv_dw.weight": "model-00001-of-00004.safetensors",
551
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.gamma": "model-00001-of-00004.safetensors",
552
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.mlp.fc1.bias": "model-00001-of-00004.safetensors",
553
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.mlp.fc1.weight": "model-00001-of-00004.safetensors",
554
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.mlp.fc2.bias": "model-00001-of-00004.safetensors",
555
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.mlp.fc2.weight": "model-00001-of-00004.safetensors",
556
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.norm.bias": "model-00001-of-00004.safetensors",
557
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.27.norm.weight": "model-00001-of-00004.safetensors",
558
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.conv_dw.bias": "model-00001-of-00004.safetensors",
559
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.conv_dw.weight": "model-00001-of-00004.safetensors",
560
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.gamma": "model-00001-of-00004.safetensors",
561
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.mlp.fc1.bias": "model-00001-of-00004.safetensors",
562
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.mlp.fc1.weight": "model-00001-of-00004.safetensors",
563
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.mlp.fc2.bias": "model-00001-of-00004.safetensors",
564
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.mlp.fc2.weight": "model-00001-of-00004.safetensors",
565
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.norm.bias": "model-00001-of-00004.safetensors",
566
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.28.norm.weight": "model-00001-of-00004.safetensors",
567
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.conv_dw.bias": "model-00001-of-00004.safetensors",
568
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.conv_dw.weight": "model-00001-of-00004.safetensors",
569
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.gamma": "model-00001-of-00004.safetensors",
570
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.mlp.fc1.bias": "model-00001-of-00004.safetensors",
571
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.mlp.fc1.weight": "model-00001-of-00004.safetensors",
572
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.mlp.fc2.bias": "model-00001-of-00004.safetensors",
573
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.mlp.fc2.weight": "model-00001-of-00004.safetensors",
574
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.norm.bias": "model-00001-of-00004.safetensors",
575
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.29.norm.weight": "model-00001-of-00004.safetensors",
576
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.conv_dw.bias": "model-00001-of-00004.safetensors",
577
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.conv_dw.weight": "model-00001-of-00004.safetensors",
578
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.gamma": "model-00001-of-00004.safetensors",
579
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.mlp.fc1.bias": "model-00001-of-00004.safetensors",
580
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.mlp.fc1.weight": "model-00001-of-00004.safetensors",
581
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.mlp.fc2.bias": "model-00001-of-00004.safetensors",
582
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.mlp.fc2.weight": "model-00001-of-00004.safetensors",
583
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.norm.bias": "model-00001-of-00004.safetensors",
584
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.3.norm.weight": "model-00001-of-00004.safetensors",
585
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.conv_dw.bias": "model-00001-of-00004.safetensors",
586
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.conv_dw.weight": "model-00001-of-00004.safetensors",
587
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.gamma": "model-00001-of-00004.safetensors",
588
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.mlp.fc1.bias": "model-00001-of-00004.safetensors",
589
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.mlp.fc1.weight": "model-00001-of-00004.safetensors",
590
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.mlp.fc2.bias": "model-00001-of-00004.safetensors",
591
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.mlp.fc2.weight": "model-00001-of-00004.safetensors",
592
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.norm.bias": "model-00001-of-00004.safetensors",
593
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.4.norm.weight": "model-00001-of-00004.safetensors",
594
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.conv_dw.bias": "model-00001-of-00004.safetensors",
595
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.conv_dw.weight": "model-00001-of-00004.safetensors",
596
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.gamma": "model-00001-of-00004.safetensors",
597
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.mlp.fc1.bias": "model-00001-of-00004.safetensors",
598
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.mlp.fc1.weight": "model-00001-of-00004.safetensors",
599
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.mlp.fc2.bias": "model-00001-of-00004.safetensors",
600
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.mlp.fc2.weight": "model-00001-of-00004.safetensors",
601
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.norm.bias": "model-00001-of-00004.safetensors",
602
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.5.norm.weight": "model-00001-of-00004.safetensors",
603
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.conv_dw.bias": "model-00001-of-00004.safetensors",
604
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.conv_dw.weight": "model-00001-of-00004.safetensors",
605
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.gamma": "model-00001-of-00004.safetensors",
606
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.mlp.fc1.bias": "model-00001-of-00004.safetensors",
607
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.mlp.fc1.weight": "model-00001-of-00004.safetensors",
608
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.mlp.fc2.bias": "model-00001-of-00004.safetensors",
609
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.mlp.fc2.weight": "model-00001-of-00004.safetensors",
610
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.norm.bias": "model-00001-of-00004.safetensors",
611
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.6.norm.weight": "model-00001-of-00004.safetensors",
612
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.conv_dw.bias": "model-00001-of-00004.safetensors",
613
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.conv_dw.weight": "model-00001-of-00004.safetensors",
614
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.gamma": "model-00001-of-00004.safetensors",
615
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.mlp.fc1.bias": "model-00001-of-00004.safetensors",
616
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.mlp.fc1.weight": "model-00001-of-00004.safetensors",
617
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.mlp.fc2.bias": "model-00001-of-00004.safetensors",
618
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.mlp.fc2.weight": "model-00001-of-00004.safetensors",
619
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.norm.bias": "model-00001-of-00004.safetensors",
620
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.7.norm.weight": "model-00001-of-00004.safetensors",
621
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.conv_dw.bias": "model-00001-of-00004.safetensors",
622
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.conv_dw.weight": "model-00001-of-00004.safetensors",
623
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.gamma": "model-00001-of-00004.safetensors",
624
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.mlp.fc1.bias": "model-00001-of-00004.safetensors",
625
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.mlp.fc1.weight": "model-00001-of-00004.safetensors",
626
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.mlp.fc2.bias": "model-00001-of-00004.safetensors",
627
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.mlp.fc2.weight": "model-00001-of-00004.safetensors",
628
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.norm.bias": "model-00001-of-00004.safetensors",
629
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.8.norm.weight": "model-00001-of-00004.safetensors",
630
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.conv_dw.bias": "model-00001-of-00004.safetensors",
631
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.conv_dw.weight": "model-00001-of-00004.safetensors",
632
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.gamma": "model-00001-of-00004.safetensors",
633
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.mlp.fc1.bias": "model-00001-of-00004.safetensors",
634
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.mlp.fc1.weight": "model-00001-of-00004.safetensors",
635
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.mlp.fc2.bias": "model-00001-of-00004.safetensors",
636
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.mlp.fc2.weight": "model-00001-of-00004.safetensors",
637
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.norm.bias": "model-00001-of-00004.safetensors",
638
+ "vision_tower.clip_vision_model.trunk.stages.2.blocks.9.norm.weight": "model-00001-of-00004.safetensors",
639
+ "vision_tower.clip_vision_model.trunk.stages.2.downsample.0.bias": "model-00001-of-00004.safetensors",
640
+ "vision_tower.clip_vision_model.trunk.stages.2.downsample.0.weight": "model-00001-of-00004.safetensors",
641
+ "vision_tower.clip_vision_model.trunk.stages.2.downsample.1.bias": "model-00001-of-00004.safetensors",
642
+ "vision_tower.clip_vision_model.trunk.stages.2.downsample.1.weight": "model-00001-of-00004.safetensors",
643
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.conv_dw.bias": "model-00001-of-00004.safetensors",
644
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.conv_dw.weight": "model-00001-of-00004.safetensors",
645
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.gamma": "model-00001-of-00004.safetensors",
646
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
647
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
648
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
649
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
650
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.norm.bias": "model-00001-of-00004.safetensors",
651
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.0.norm.weight": "model-00001-of-00004.safetensors",
652
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.conv_dw.bias": "model-00001-of-00004.safetensors",
653
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.conv_dw.weight": "model-00001-of-00004.safetensors",
654
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.gamma": "model-00001-of-00004.safetensors",
655
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
656
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
657
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
658
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
659
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.norm.bias": "model-00001-of-00004.safetensors",
660
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.1.norm.weight": "model-00001-of-00004.safetensors",
661
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.conv_dw.bias": "model-00001-of-00004.safetensors",
662
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.conv_dw.weight": "model-00001-of-00004.safetensors",
663
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.gamma": "model-00001-of-00004.safetensors",
664
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
665
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
666
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
667
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
668
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.norm.bias": "model-00001-of-00004.safetensors",
669
+ "vision_tower.clip_vision_model.trunk.stages.3.blocks.2.norm.weight": "model-00001-of-00004.safetensors",
670
+ "vision_tower.clip_vision_model.trunk.stages.3.downsample.0.bias": "model-00001-of-00004.safetensors",
671
+ "vision_tower.clip_vision_model.trunk.stages.3.downsample.0.weight": "model-00001-of-00004.safetensors",
672
+ "vision_tower.clip_vision_model.trunk.stages.3.downsample.1.bias": "model-00001-of-00004.safetensors",
673
+ "vision_tower.clip_vision_model.trunk.stages.3.downsample.1.weight": "model-00001-of-00004.safetensors",
674
+ "vision_tower.clip_vision_model.trunk.stem.0.bias": "model-00001-of-00004.safetensors",
675
+ "vision_tower.clip_vision_model.trunk.stem.0.weight": "model-00001-of-00004.safetensors",
676
+ "vision_tower.clip_vision_model.trunk.stem.1.bias": "model-00001-of-00004.safetensors",
677
+ "vision_tower.clip_vision_model.trunk.stem.1.weight": "model-00001-of-00004.safetensors"
678
+ }
679
+ }
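
For reference, the `weight_map` above only records which shard each parameter is stored in. Below is a minimal sketch of how such an index is typically consumed (assuming the standard `json` and `safetensors` APIs, and that the shards sit next to `model.safetensors.index.json`; this snippet is not part of the commit):

```python
import json
from safetensors.torch import load_file  # pip install safetensors

# Look up which shard holds a given parameter, then load only that shard.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

name = "vision_tower.clip_vision_model.trunk.stem.0.weight"
shard = index["weight_map"][name]      # e.g. "model-00001-of-00004.safetensors"
state_dict = load_file(shard)          # CPU tensors for every key in that shard
print(name, tuple(state_dict[name].shape))
```

In practice `transformers`' `from_pretrained` performs this shard resolution automatically; the sketch only shows what the index encodes.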
modeling_magma.py ADDED
@@ -0,0 +1,1412 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """PyTorch Magma model."""
16
+
17
+ import math
18
+ import re
19
+ import os
20
+ from dataclasses import dataclass
21
+ from typing import List, Optional, Tuple, Union
22
+
23
+ import numpy as np
24
+ import torch
25
+ import torch.utils.checkpoint
26
+ from torch import nn
27
+ import torch.distributed as dist
28
+ from transformers.modeling_utils import PreTrainedModel
29
+ from transformers.activations import ACT2FN
30
+ from transformers.cache_utils import Cache, DynamicCache
31
+ from transformers.utils import ModelOutput
32
+ from transformers.utils import (
33
+ add_code_sample_docstrings,
34
+ add_start_docstrings,
35
+ add_start_docstrings_to_model_forward,
36
+ logging,
37
+ replace_return_docstrings,
38
+ )
39
+ from transformers import AutoConfig, AutoModelForCausalLM
40
+ from .configuration_magma import MagmaConfig
41
+ from .image_tower_magma import MagmaImageTower
42
+
43
+ logger = logging.get_logger(__name__)
44
+
45
+ _CONFIG_FOR_DOC = "MagmaConfig"
46
+
47
+ @dataclass
48
+ # Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Magma
49
+ class MagmaCausalLMOutputWithPast(ModelOutput):
50
+ """
51
+ Base class for Magma causal language model (or autoregressive) outputs.
52
+
53
+ Args:
54
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
55
+ Language modeling loss (for next-token prediction).
56
+ logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
57
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
58
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
59
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
60
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`)
61
+
62
+ Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
63
+ `past_key_values` input) to speed up sequential decoding.
64
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
65
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
66
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
67
+
68
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
69
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
70
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
71
+ sequence_length)`.
72
+
73
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
74
+ heads.
75
+ image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
76
+ Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
77
+ sequence_length, hidden_size)`.
78
+
79
+ image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
80
+ """
81
+
82
+ loss: Optional[torch.FloatTensor] = None
83
+ logits: torch.FloatTensor = None
84
+ past_key_values: Optional[List[torch.FloatTensor]] = None
85
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
86
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
87
+ image_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
88
+
89
+
90
+ class MagmaMultiModalProjector(nn.Module):
91
+ def __init__(self, config):
92
+ super().__init__()
93
+ self.config = config
94
+
95
+ dim_vision = {'base': 640, 'large': 768, 'xxlarge': 1024}
96
+ vision_backbone = config.get('vision_backbone', 'convnextxxlarge')
97
+ vision_backbone_size = vision_backbone.replace('convnext', '')
98
+ projector_type = config.get('mm_projector_type', 'linear')
99
+ mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projector_type)
100
+ if mlp_gelu_match:
101
+ mlp_depth = int(mlp_gelu_match.group(1))
102
+ modules = [nn.Linear(config['mm_hidden_size'], config['hidden_size'])]
103
+ for _ in range(1, mlp_depth):
104
+ modules.append(nn.GELU())
105
+ modules.append(nn.Linear(config['hidden_size'], config['hidden_size']))
106
+ self.proj = nn.Sequential(*modules)
107
+
108
+ # define a row separator
109
+ self.row_seperator = nn.Parameter(torch.zeros(1, 1, config['hidden_size']))
110
+ if config.get('mm_use_im_start_end', False):
111
+ self.img_start_seperator = nn.Parameter(torch.zeros(1, config['hidden_size']))
112
+ self.img_end_seperator = nn.Parameter(torch.zeros(1, config['hidden_size']))
113
+
114
+ def forward(self, x):
115
+ return self.proj(x)
116
+
117
+
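# Editor's illustrative aside (not part of the uploaded modeling_magma.py): the
# `mm_projector_type` regex above turns e.g. "mlp2x_gelu" into a 2-layer GELU MLP.
# Standalone check of the parsing, with hypothetical sizes 3072 -> 4096:
#
#     import re
#     from torch import nn
#     m = re.match(r'^mlp(\d+)x_gelu$', 'mlp2x_gelu')
#     assert m is not None and int(m.group(1)) == 2
#     proj = nn.Sequential(nn.Linear(3072, 4096), nn.GELU(), nn.Linear(4096, 4096))
#     # i.e. the same structure MagmaMultiModalProjector.proj would build for this config.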
118
+ MAGMA_START_DOCSTRING = r"""
119
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
120
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
121
+ etc.)
122
+
123
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
124
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
125
+ and behavior.
126
+
127
+ Parameters:
128
+ config ([`MagmaConfig`] or [`MagmaVisionConfig`]):
129
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
130
+ load the weights associated with the model, only the configuration. Check out the
131
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
132
+ """
133
+
134
+
135
+ @add_start_docstrings(
136
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
137
+ MAGMA_START_DOCSTRING,
138
+ )
139
+
140
+ class MagmaPreTrainedModel(PreTrainedModel):
141
+ config_class = MagmaConfig
142
+ base_model_prefix = "model"
143
+ supports_gradient_checkpointing = True
144
+ _no_split_modules = ["MagmaImageTower"]
145
+ _skip_keys_device_placement = "past_key_values"
146
+ _supports_flash_attn_2 = True
147
+
148
+ def _init_weights(self, module):
149
+ std = (
150
+ self.config.initializer_range
151
+ if hasattr(self.config, "initializer_range")
152
+ else self.config.text_config.initializer_range
153
+ )
154
+
155
+ if hasattr(module, "class_embedding"):
156
+ module.class_embedding.data.normal_(mean=0.0, std=std)
157
+
158
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
159
+ module.weight.data.normal_(mean=0.0, std=std)
160
+ if module.bias is not None:
161
+ module.bias.data.zero_()
162
+ elif isinstance(module, nn.Embedding):
163
+ module.weight.data.normal_(mean=0.0, std=std)
164
+ if module.padding_idx is not None:
165
+ module.weight.data[module.padding_idx].zero_()
166
+
167
+ @property
168
+ def _supports_sdpa(self):
169
+ """
170
+ Retrieve language_model's attribute to check whether the model supports
171
+ SDPA or not.
172
+ """
173
+ return self.language_model._supports_sdpa
174
+
175
+
176
+ MAGMA_INPUTS_DOCSTRING = r"""
177
+ Args:
178
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
179
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
180
+ it.
181
+
182
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
183
+ [`PreTrainedTokenizer.__call__`] for details.
184
+
185
+ [What are input IDs?](../glossary#input-ids)
186
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)):
187
+ The tensors corresponding to the input images. Pixel values can be obtained using
188
+ [`AutoImageProcessor`]. See [`MagmaImageProcessor.__call__`] for details. [`MagmaProcessor`] uses
189
+ [`MagmaImageProcessor`] for processing images.
190
+ image_sizes (`torch.LongTensor` of shape `(batch_size, 2)`, *optional*):
191
+ The sizes of the images in the batch, being (height, width) for each image.
192
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
193
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
194
+
195
+ - 1 for tokens that are **not masked**,
196
+ - 0 for tokens that are **masked**.
197
+
198
+ [What are attention masks?](../glossary#attention-mask)
199
+
200
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
201
+ [`PreTrainedTokenizer.__call__`] for details.
202
+
203
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
204
+ `past_key_values`).
205
+
206
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
207
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
208
+ information on the default strategy.
209
+
210
+ - 1 indicates the head is **not masked**,
211
+ - 0 indicates the head is **masked**.
212
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
213
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
214
+ config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids)
215
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
216
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
217
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
218
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
219
+
220
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
221
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
222
+
223
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
224
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
225
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
226
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
227
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
228
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
229
+ model's internal embedding lookup matrix.
230
+ vision_feature_layer (`int`, *optional*, defaults to -2):
231
+ The index of the layer to select the vision feature.
232
+ vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
233
+ The feature selection strategy used to select the vision feature from the vision backbone.
234
+ Can be one of `"default"` or `"full"`. If `"default"`, the CLS token is removed from the vision features.
235
+ If `"full"`, the full vision features are used.
236
+ use_cache (`bool`, *optional*):
237
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
238
+ `past_key_values`).
239
+ output_attentions (`bool`, *optional*):
240
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
241
+ tensors for more detail.
242
+ output_hidden_states (`bool`, *optional*):
243
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
244
+ more detail.
245
+ return_dict (`bool`, *optional*):
246
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
247
+ """
248
+
249
+ @add_start_docstrings(
250
+ """The Magma model which consists of a vision backbone and a language model.""",
251
+ MAGMA_START_DOCSTRING,
252
+ )
253
+ class MagmaForCausalLM(MagmaPreTrainedModel):
254
+ def __init__(self, config: MagmaConfig):
255
+ super().__init__(config)
256
+
257
+ self.vision_tower = MagmaImageTower(config.vision_config, require_pretrained=False)
258
+ config.vision_config['mm_hidden_size'] = config.vision_config['mm_hidden_size'] \
259
+ if 'mm_hidden_size' in config.vision_config else self.vision_tower.hidden_size
260
+ config.vision_config['hidden_size'] = config.vision_config['hidden_size'] \
261
+ if 'hidden_size' in config.vision_config else self.config.text_config.hidden_size
262
+ self.multi_modal_projector = MagmaMultiModalProjector(config.vision_config)
263
+
264
+ self.vocab_size = config.text_config.vocab_size
265
+ if hasattr(config.text_config, 'auto_map'):
266
+ del config.text_config.auto_map
267
+
268
+ try:
269
+ self.language_model = AutoModelForCausalLM.from_config(
270
+ config.text_config,
271
+ # attn_implementation=config._attn_implementation,
272
+ trust_remote_code=True
273
+ )
274
+ except Exception:
275
+ self.language_model = AutoModelForCausalLM.from_pretrained(
276
+ config.text_config._name_or_path,
277
+ # attn_implementation=config._attn_implementation,
278
+ trust_remote_code=True
279
+ )
280
+
281
+ self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
282
+ self._padding_side = "left" # set it to left by default, user can use setter to change padding_sides
283
+
284
+ self.post_init()
285
+
286
+ # def from_pretrained(self, pretrained_model_name_or_path, *model_args, **kwargs):
287
+ # import pdb; pdb.set_trace()
288
+ # kwargs["_from_auto"] = True
289
+ # return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
290
+
291
+ @property
292
+ def padding_side(self):
293
+ return self._padding_side
294
+
295
+ @padding_side.setter
296
+ def padding_side(self, padding_side: str):
297
+ if padding_side not in ["left", "right"]:
298
+ raise ValueError(f"{padding_side} is not `left` or `right`.")
299
+ self._padding_side = padding_side
300
+
301
+ def get_input_embeddings(self):
302
+ return self.language_model.get_input_embeddings()
303
+
304
+ def set_input_embeddings(self, value):
305
+ self.language_model.set_input_embeddings(value)
306
+
307
+ def get_output_embeddings(self):
308
+ return self.language_model.get_output_embeddings()
309
+
310
+ def set_output_embeddings(self, new_embeddings):
311
+ self.language_model.set_output_embeddings(new_embeddings)
312
+
313
+ def set_decoder(self, decoder):
314
+ self.language_model.set_decoder(decoder)
315
+
316
+ def get_decoder(self):
317
+ return self.language_model.get_decoder()
318
+
319
+ def tie_weights(self):
320
+ return self.language_model.tie_weights()
321
+
322
+ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
323
+ model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
324
+ # update vocab size
325
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
326
+ self.vocab_size = model_embeds.num_embeddings
327
+ return model_embeds
328
+
329
+ def _merge_input_ids_with_image_features(
330
+ self,
331
+ image_features,
332
+ feature_lens,
333
+ inputs_embeds,
334
+ input_ids,
335
+ attention_mask,
336
+ position_ids=None,
337
+ labels=None,
338
+ image_token_index=None,
339
+ ignore_index=-100,
340
+ ):
341
+ """
342
+ Merge input_ids with image features into final embeddings
343
+
344
+ Args:
345
+ image_features (`torch.Tensor` of shape `(all_feature_lens, embed_dim)`):
346
+ All vision vectors of all images in the batch
347
+ feature_lens (`torch.LongTensor` of shape `(num_images)`):
348
+ The length of visual embeddings of each image as stacked in `image_features`
349
+ inputs_embeds (`torch.Tensor` of shape `(batch_size, sequence_length, embed_dim)`):
350
+ Token embeddings before merging with visual embeddings
351
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
352
+ Input_ids of tokens, possibly filled with image token
353
+ attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
354
+ Mask to avoid performing attention on padding token indices.
355
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
356
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
357
+ config.n_positions - 1]`.
358
+ labels (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*)
359
+ Labels need to be recalculated to support training (if provided)
360
+ image_token_index (`int`, *optional*)
361
+ Token id used to indicate the special "image" token. Defaults to `config.image_token_index`
362
+ ignore_index (`int`, *optional*)
363
+ Value that is used to pad `labels` and will be ignored when calculated loss. Default: -100.
364
+ Returns:
365
+ final_embedding, final_attention_mask, position_ids, final_labels
366
+
367
+ Explanation:
368
+ each image has variable length embeddings, with length specified by feature_lens
369
+ image_features is concatenation of all visual embed vectors
370
+ task: fill each <image> with the correct number of visual embeddings
371
+ Example:
372
+ X (5 patches), Y (3 patches), Z (8)
373
+ X, Y are in the same sequence (in-context learning)
374
+ if right padding
375
+ input_ids: [
376
+ a b c d e f X g h i j k Y l m
377
+ o p q r Z s t u v _ _ _ _ _ _
378
+ ]
379
+ input_ids should be: [
380
+ a b c d e f X X X X X g h i j k Y Y Y l m
381
+ o p q r Z Z Z Z Z Z Z Z s t u v _ _ _ _ _
382
+ ]
383
+ labels should be: [
384
+ a b c d e f _ _ _ _ _ g h i j k _ _ _ l m
385
+ o p q r _ _ _ _ _ _ _ _ s t u v _ _ _ _ _
386
+ ]
387
+ elif left padding
388
+ input_ids: [
389
+ a b c d e f X g h i j k Y l m
390
+ _ _ _ _ _ _ o p q r Z s t u v
391
+ ]
392
+ input_ids should be: [
393
+ a b c d e f X X X X X g h i j k Y Y Y l m
394
+ _ _ _ _ _ o p q r Z Z Z Z Z Z Z Z s t u v
395
+ ]
396
+ labels should be: [
397
+ a b c d e f _ _ _ _ _ g h i j k _ _ _ l m
398
+ _ _ _ _ _ o p q r _ _ _ _ _ _ _ _ s t u v
399
+ ]
400
+ Edge cases:
401
+ * If the text tokens are the same but the image token counts differ, left vs. right padding cannot be inferred
402
+
403
+ input_ids: [
404
+ a b c d X g h
405
+ i j Y k l m n
406
+ ]
407
+ where X is 3 tokens while Y is 5, this means that after merging
408
+ if left-padding (batched generation)
409
+ input_ids should be: [
410
+ _ _ a b c d X X X g h
411
+ i j Y Y Y Y Y k l m n
412
+ ]
413
+ elif (right padding) (training)
414
+ input_ids should be: [
415
+ a b c d X X X g h _ _
416
+ i j Y Y Y Y Y k l m n
417
+ ]
418
+ """
419
+ image_token_index = image_token_index if image_token_index is not None else self.config.image_token_index
420
+ ignore_index = ignore_index if ignore_index is not None else self.config.ignore_index
421
+
422
+ with torch.no_grad():
423
+ num_images = feature_lens.size(0)
424
+ num_image_features, embed_dim = image_features.shape
425
+ if feature_lens.sum() != num_image_features:
426
+ raise ValueError(f"{feature_lens=} / {feature_lens.sum()} != {image_features.shape=}")
427
+ batch_size = input_ids.shape[0]
428
+ _left_padding = torch.any(attention_mask[:, 0] == 0)
429
+ _right_padding = torch.any(attention_mask[:, -1] == 0)
430
+
431
+ left_padding = True
432
+ if batch_size > 1:
433
+ if _left_padding and not _right_padding:
434
+ left_padding = True
435
+ elif not _left_padding and _right_padding:
436
+ left_padding = False
437
+ elif not _left_padding and not _right_padding:
438
+ # both side is 1, so cannot tell
439
+ left_padding = self.padding_side == "left"
440
+ else:
441
+ # invalid attention_mask
442
+ raise ValueError(f"both side of attention_mask has zero, invalid. {attention_mask}")
443
+
444
+ # Whether to turn off right padding
445
+ # 1. Create a mask to know where special image tokens are
446
+ special_image_token_mask = input_ids == image_token_index
447
+ # special_image_token_mask: [bsz, seqlen]
448
+ num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1)
449
+ # num_special_image_tokens: [bsz]
450
+ # Reserve for padding of num_images
451
+ total_num_special_image_tokens = torch.sum(special_image_token_mask)
452
+ if total_num_special_image_tokens != num_images:
453
+ raise ValueError(
454
+ f"Number of image tokens in input_ids ({total_num_special_image_tokens}) different from num_images ({num_images})."
455
+ )
456
+ # Compute the maximum embed dimension
457
+ # max_image_feature_lens is max_feature_lens per batch
458
+ feature_lens_batch = feature_lens.split(num_special_image_tokens.tolist(), dim=0)
459
+ feature_lens_batch_sum = torch.tensor([x.sum() for x in feature_lens_batch], device=feature_lens.device)
460
+ embed_sequence_lengths = (
461
+ (attention_mask == 1).long().sum(-1) - num_special_image_tokens + feature_lens_batch_sum
462
+ )
463
+ max_embed_dim = embed_sequence_lengths.max()
464
+
465
+ batch_indices, non_image_indices = torch.where((input_ids != image_token_index) & (attention_mask == 1))
466
+ # 2. Compute the positions where text should be written
467
+ # Calculate new positions for text tokens in merged image-text sequence.
468
+ # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images` text tokens.
469
+ # `torch.cumsum` computes how each image token shifts subsequent text token positions.
470
+ # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
471
+ # ! instead of special_image_token_mask * (num_image_patches - 1)
472
+ # special_image_token_mask * (num_feature_len - 1)
473
+ special_image_token_mask = special_image_token_mask.long()
474
+ special_image_token_mask[special_image_token_mask == 1] = feature_lens - 1
475
+ new_token_positions = torch.cumsum((special_image_token_mask + 1), -1) - 1
476
+ if left_padding:
477
+ # shift right token positions so that they are ending at the same number
478
+ # the below here was incorrect? new_token_positions += new_token_positions[:, -1].max() - new_token_positions[:, -1:]
479
+ new_token_positions += max_embed_dim - 1 - new_token_positions[:, -1:]
480
+
481
+ text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
482
+
483
+ # 3. Create the full embedding, already padded to the maximum position
484
+ final_embedding = torch.zeros(
485
+ batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
486
+ )
487
+ final_attention_mask = torch.zeros(
488
+ batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
489
+ )
490
+ final_labels = None
491
+ if labels is not None:
492
+ # NOTE: this is a bug in the original code!!!
493
+ final_labels = torch.full_like(final_attention_mask.long(), ignore_index).to(torch.long)
494
+ # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
495
+ # set the corresponding tensors into their correct target device.
496
+ target_device = inputs_embeds.device
497
+ batch_indices, non_image_indices, text_to_overwrite = (
498
+ batch_indices.to(target_device),
499
+ non_image_indices.to(target_device),
500
+ text_to_overwrite.to(target_device),
501
+ )
502
+ attention_mask = attention_mask.to(target_device)
503
+
504
+ # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
505
+ # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
506
+ final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
507
+ final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
508
+ if labels is not None:
509
+ final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
510
+
511
+ # 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
512
+ with torch.no_grad():
513
+ image_to_overwrite = torch.full(
514
+ (batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
515
+ )
516
+ image_to_overwrite[batch_indices, text_to_overwrite] = False
517
+ embed_indices = torch.arange(max_embed_dim).unsqueeze(0).to(target_device)
518
+ embed_indices = embed_indices.expand(batch_size, max_embed_dim)
519
+ embed_seq_lens = embed_sequence_lengths[:, None].to(target_device)
520
+
521
+ if left_padding:
522
+ # exclude padding on the left
523
+ val = (max_embed_dim - embed_indices) <= embed_seq_lens
524
+ else:
525
+ # exclude padding on the right
526
+ val = embed_indices < embed_seq_lens
527
+ image_to_overwrite &= val
528
+
529
+ if image_to_overwrite.sum() != num_image_features:
530
+ raise ValueError(
531
+ f"{image_to_overwrite.sum()=} != {num_image_features=} The input provided to the model are wrong. "
532
+ f"The number of image tokens is {torch.sum(special_image_token_mask)} while"
533
+ f" the number of image given to the model is {num_images}. "
534
+ f"This prevents correct indexing and breaks batch generation."
535
+ )
536
+ final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
537
+ final_attention_mask |= image_to_overwrite
538
+ position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
539
+
540
+ return final_embedding, final_attention_mask, position_ids, final_labels
541
+
542
+ @add_start_docstrings_to_model_forward(MAGMA_INPUTS_DOCSTRING)
543
+ @replace_return_docstrings(output_type=MagmaCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
544
+ def forward(
545
+ self,
546
+ input_ids: torch.LongTensor = None,
547
+ pixel_values: Union[torch.FloatTensor, List[torch.FloatTensor], List[List[torch.FloatTensor]]] = None,
548
+ image_sizes: Union[torch.LongTensor, List[torch.LongTensor], List[List[torch.LongTensor]]] = None,
549
+ attention_mask: Optional[torch.Tensor] = None,
550
+ position_ids: Optional[torch.LongTensor] = None,
551
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
552
+ inputs_embeds: Optional[torch.FloatTensor] = None,
553
+ vision_feature_layer: Optional[int] = None,
554
+ vision_feature_select_strategy: Optional[str] = None,
555
+ labels: Optional[torch.LongTensor] = None,
556
+ use_cache: Optional[bool] = None,
557
+ output_attentions: Optional[bool] = None,
558
+ output_hidden_states: Optional[bool] = None,
559
+ return_dict: Optional[bool] = None,
560
+ ) -> Union[Tuple, MagmaCausalLMOutputWithPast]:
561
+ r"""
562
+ Args:
563
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
564
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
565
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
566
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
567
+
568
+ Returns:
569
+
570
+ Example:
571
+
572
+ ```python
573
+ >>> from PIL import Image
574
+ >>> import requests
575
+ >>> from transformers import AutoProcessor, MagmaForConditionalGeneration
576
+
577
+ >>> model = MagmaForConditionalGeneration.from_pretrained("microsoft/magma-8b-hf")
578
+ >>> processor = AutoProcessor.from_pretrained("microsoft/magma-8b-hf")
579
+
580
+ >>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
581
+ >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
582
+ >>> image = Image.open(requests.get(url, stream=True).raw)
583
+
584
+ >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
585
+
586
+ >>> # Generate
587
+ >>> generate_ids = model.generate(**inputs, max_length=30)
588
+ >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
589
+ "[INST] \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"
590
+ ```"""
591
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
592
+ output_hidden_states = (
593
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
594
+ )
595
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
596
+ vision_feature_layer = (
597
+ vision_feature_layer if vision_feature_layer is not None else self.config.vision_config['vision_feature_layer']
598
+ )
599
+
600
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
601
+
602
+ if inputs_embeds is None:
603
+ # 1. Extract the input embeddings
604
+ # In case image_token_index is not in the embeddings (extra token, but the embedding matrix doesn't have it)
605
+ for_inputs_embeds_ids = input_ids.clone()
606
+ for_inputs_embeds_ids[(input_ids == self.config.image_token_index)] = 0
607
+ inputs_embeds = self.get_input_embeddings()(for_inputs_embeds_ids)
608
+
609
+ # 2. Merge text and images
610
+ if pixel_values is not None and input_ids.shape[1] != 1 and len(pixel_values) > 0:
611
+ # ! infer image_num_patches from image_sizes
612
+ if type(pixel_values) == list:
613
+ # nested list of pixel_values: each element is a list of pixel_values for one training instance; an instance can hold multiple images for video or interleaved settings
614
+ # e.g., pixel_values = [[img1, img2], [img1, img2, img3]]
615
+ n_imgs_per_sample = [len(pv) for pv in pixel_values]
616
+ pixels_values_list = sum(pixel_values, [])
617
+ image_sizes_list = sum(image_sizes, [])
618
+ else:
619
+ image_num_patches = [(imsize[imsize.sum(1) > 0,0] * imsize[imsize.sum(1) > 0,1]).tolist() for imsize in image_sizes]
620
+ # image_num_patches = [(imsize[:,0]*imsize[:,1]).tolist() for imsize in image_sizes]
621
+ # figure out if pixel_values is concatenated or stacked
622
+ if pixel_values.dim() == 5:
623
+ # stacking when input is (batch_size, num_patches, num_channels, height, width)
624
+ _pixel_values_list = [
625
+ pix_val[:sum(num_patch)].split(num_patch, dim=0) for pix_val, num_patch in zip(pixel_values, image_num_patches)
626
+ ]
627
+ _image_sizes_list = [image_size[image_size.sum(-1) > 0].tolist() for image_size in image_sizes]
628
+ elif pixel_values.dim() != 4:
629
+ # otherwise has to be stacked from list of (num_patches, num_channels, height, width)
630
+ raise ValueError(f"pixel_values of shape {pixel_values.shape}, expect to be of 4 or 5 dimensions")
631
+
632
+ if self.config.vision_config['img_anyres_strategy'] == "global":
633
+ selected_image_features = []
634
+ # NOTE: both _image_sizes_list and _pixel_values_list are lists of lists, each item represents an training instance with one or multiple images
635
+ for idx, (image_size_for_instance, pixel_values_for_instance) in enumerate(zip(_image_sizes_list, _pixel_values_list)):
636
+ assert len(image_size_for_instance) == len(pixel_values_for_instance), f"{len(image_size_for_instance)} != {len(pixel_values_for_instance)}"
637
+ for image_size, pixel_values_for_image in zip(image_size_for_instance, pixel_values_for_instance):
638
+ pixel_values_for_image = pixel_values_for_image.view(image_size[0], image_size[1], *pixel_values_for_image.shape[1:])
639
+ pixel_values_for_image = pixel_values_for_image.permute(2, 0, 3, 1, 4).flatten(3, 4).flatten(1, 2).unsqueeze(0)
640
+ image_features = self.vision_tower(pixel_values_for_image)
641
+ selected_image_feature = image_features[vision_feature_layer][0].permute(1, 2, 0)
642
+ selected_image_feature = self.multi_modal_projector(selected_image_feature)
643
+ selected_image_feature = torch.cat((selected_image_feature, self.multi_modal_projector.row_seperator.repeat(selected_image_feature.shape[0],1,1)), dim=1)
644
+ selected_image_features.append(selected_image_feature.flatten(0, 1))
645
+ elif self.config.vision_config['img_anyres_strategy'] == "crop":
646
+ # calculate number of crops for each instance in the batch given _image_sizes_list
647
+ _image_sizes_list_temp = sum(_image_sizes_list, [])
648
+ # concatenate all images in _pixel_values_list
649
+ _pixel_values_list_temp = sum(_pixel_values_list, ())
650
+ _pixel_values_list_temp = torch.cat(_pixel_values_list_temp, dim=0)
651
+ image_features = self.vision_tower(_pixel_values_list_temp)[vision_feature_layer].permute(0, 2, 3, 1)
652
+ image_features = self.multi_modal_projector(image_features)
653
+
654
+ num_crops_list = [_image_size[0]*_image_size[1] for _image_size in _image_sizes_list_temp]
655
+ image_features_split = torch.split(image_features, num_crops_list, dim=0)
656
+ selected_image_features = []
657
+ for image_feature, image_size in zip(image_features_split, _image_sizes_list_temp):
658
+ image_feature = image_feature.view(image_size[0], image_size[1], *image_feature.shape[1:])
659
+ image_feature = image_feature.permute(0, 2, 1, 3, 4).flatten(2, 3).flatten(0, 1)
660
+ image_feature = torch.cat((image_feature, self.multi_modal_projector.row_seperator.repeat(image_feature.shape[0],1,1)), dim=1)
661
+ selected_image_features.append(image_feature.flatten(0, 1))
662
+
663
+ # raise NotImplementedError("crop strategy is not implemented yet")
664
+ # image_features = self.vision_tower(pixel_values)
665
+ # selected_image_feature = image_features[vision_feature_layer]
666
+ # image_features = torch.split(image_features, image_num_patches, dim=0)
667
+
668
+ # NOTE we only support multimodal_patch_merge_type == "spatial_unpad"
669
+ feature_lens = [elem.shape[0] for elem in selected_image_features]
670
+ image_features = torch.cat(selected_image_features, 0)
671
+ feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=image_features.device)
672
+
673
+ # inputs_embeds = inputs_embeds.to(image_features.dtype)
674
+ inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features(
675
+ image_features,
676
+ feature_lens,
677
+ inputs_embeds,
678
+ input_ids,
679
+ attention_mask,
680
+ position_ids,
681
+ labels=labels,
682
+ )
683
+
684
+ # pixel_values is not None but is empty ---> text only cases
685
+ elif pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) == 0:
686
+ # there are no images
687
+ pass
688
+
689
+ # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
690
+ # generation with cache
691
+ elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
692
+ # Retrieve the first layer to inspect the logits and mask out the hidden states
693
+ # that are set to 0
694
+ first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
695
+
696
+ # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
697
+ batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
698
+
699
+ # Get the target length
700
+ target_length = input_ids.shape[1]
701
+ past_length = first_layer_past_key_value.shape[-1]
702
+
703
+ extended_attention_mask = torch.ones(
704
+ (attention_mask.shape[0], past_length),
705
+ dtype=attention_mask.dtype,
706
+ device=attention_mask.device,
707
+ )
708
+
709
+ # Filter out only the tokens that can be un-attended, this can happen
710
+ # if one uses Llava + Fused modules where the cache on the
711
+ # first iteration is already big enough, or if one passes custom cache
712
+ valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
713
+ new_batch_index = batch_index[valid_indices]
714
+ new_non_attended_tokens = non_attended_tokens[valid_indices]
715
+
716
+ # Zero-out the places where we don't need to attend
717
+ extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
718
+
719
+ attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
720
+
721
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
722
+
723
+ # outputs = self.language_model(
724
+ # attention_mask=attention_mask,
725
+ # position_ids=position_ids,
726
+ # past_key_values=past_key_values,
727
+ # inputs_embeds=inputs_embeds,
728
+ # use_cache=use_cache,
729
+ # output_attentions=output_attentions,
730
+ # output_hidden_states=output_hidden_states,
731
+ # return_dict=return_dict,
732
+ # )
733
+
734
+ # logits = outputs[0]
735
+ # loss = None
736
+ # if labels is not None:
737
+ # # Shift so that tokens < n predict n
738
+ # if attention_mask is not None:
739
+ # shift_attention_mask = attention_mask[..., 1:]
740
+ # shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
741
+ # shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
742
+ # else:
743
+ # shift_logits = logits[..., :-1, :].contiguous()
744
+ # shift_labels = labels[..., 1:].contiguous()
745
+ # # Flatten the tokens
746
+ # loss_fct = nn.CrossEntropyLoss()
747
+ # loss = loss_fct(
748
+ # shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
749
+ # )
750
+
751
+ outputs = self.language_model.model(
752
+ attention_mask=attention_mask,
753
+ position_ids=position_ids,
754
+ past_key_values=past_key_values,
755
+ inputs_embeds=inputs_embeds,
756
+ use_cache=use_cache,
757
+ output_attentions=output_attentions,
758
+ output_hidden_states=output_hidden_states,
759
+ return_dict=return_dict
760
+ )
761
+
762
+ hidden_states = outputs[0]
763
+
764
+ loss = None
765
+
766
+ if labels is not None and self.training:
767
+ valid_mask = labels[..., 1:] != -100
768
+ shift_logits = self.language_model.lm_head(hidden_states[:,:-1][valid_mask]).contiguous()
769
+ shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
770
+ logits = shift_logits # dummy logits
771
+ shift_labels = labels[..., 1:][valid_mask].contiguous()
772
+ shift_labels = shift_labels.to(shift_logits.device)
773
+ loss_fct = nn.CrossEntropyLoss()
774
+ loss = loss_fct(shift_logits, shift_labels)
775
+
776
+ # locate the positions in shift_labels where the id falls within [config.tokenizer_vocab_size-256, config.tokenizer_vocab_size)
777
+ valid_indices = (shift_labels<self.config.tokenizer_vocab_size) & (shift_labels>=self.config.tokenizer_vocab_size-256)
778
+ if valid_indices.sum() > 0:
779
+ action_labels = shift_labels[valid_indices]
780
+ action_logits = shift_logits[valid_indices]
781
+ # calculate the accuracy
782
+ action_accuracy = (action_logits.argmax(-1) == action_labels).float().mean()
783
+ # log the action accuracy
784
+ else:
785
+ action_accuracy = torch.tensor(0.0).to(shift_logits.device)
786
+ # gather the action accuracy across all devices via torch.distributed
787
+ action_accuracy = action_accuracy.unsqueeze(0)
788
+ # gather the action accuracy across all devices
789
+ action_accuracy_gather = [torch.zeros_like(action_accuracy) for _ in range(dist.get_world_size())]
790
+ dist.all_gather(action_accuracy_gather, action_accuracy)
791
+ # concatenate the action accuracy across all devices
792
+ action_accuracy = torch.cat(action_accuracy_gather)
793
+
794
+ else:
795
+ logits = self.language_model.lm_head(hidden_states)
796
+ logits = logits.float()
797
+
798
+ if not return_dict:
799
+ output = (logits,) + outputs[1:]
800
+ return (loss,) + output if loss is not None else output
801
+
802
+ return MagmaCausalLMOutputWithPast(
803
+ loss=loss,
804
+ logits=logits,
805
+ past_key_values=outputs.past_key_values,
806
+ hidden_states=outputs.hidden_states,
807
+ attentions=outputs.attentions,
808
+ )
809
+
810
+ def prepare_inputs_for_generation(
811
+ self,
812
+ input_ids,
813
+ past_key_values=None,
814
+ inputs_embeds=None,
815
+ pixel_values=None,
816
+ image_sizes=None,
817
+ attention_mask=None,
818
+ **kwargs,
819
+ ):
820
+ if past_key_values is not None:
821
+ if isinstance(past_key_values, Cache):
822
+ cache_length = past_key_values.get_seq_length()
823
+ past_length = past_key_values.seen_tokens
824
+ else:
825
+ cache_length = past_length = past_key_values[0][0].shape[2]
826
+
827
+ # Keep only the unprocessed tokens:
828
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
829
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
830
+ # input)
831
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
832
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
833
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
834
+ # input_ids based on the past_length.
835
+ elif past_length < input_ids.shape[1]:
836
+ input_ids = input_ids[:, past_length:]
837
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
838
+ elif self.config.image_token_index in input_ids:
839
+ input_ids = input_ids[:, input_ids.shape[1] - 1 :]
840
+ # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
841
+ # older attention values, as their corresponding values are not part of the input.
842
+ if cache_length < past_length and attention_mask is not None:
843
+ attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]
844
+
845
+ position_ids = kwargs.get("position_ids", None)
846
+ if attention_mask is not None and position_ids is None:
847
+ # create position_ids on the fly for batch generation
848
+ position_ids = attention_mask.long().cumsum(-1) - 1
849
+ position_ids.masked_fill_(attention_mask == 0, 1)
850
+ if past_key_values:
851
+ position_ids = position_ids[:, -input_ids.shape[1] :]
852
+
853
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
854
+ if inputs_embeds is not None and past_key_values is None:
855
+ model_inputs = {"inputs_embeds": inputs_embeds}
856
+ else:
857
+ model_inputs = {"input_ids": input_ids}
858
+
859
+ model_inputs.update(
860
+ {
861
+ "position_ids": position_ids,
862
+ "past_key_values": past_key_values,
863
+ "use_cache": kwargs.get("use_cache"),
864
+ "attention_mask": attention_mask,
865
+ "pixel_values": pixel_values,
866
+ "image_sizes": image_sizes,
867
+ }
868
+ )
869
+ return model_inputs
870
+
871
+ def _reorder_cache(self, *args, **kwargs):
872
+ return self.language_model._reorder_cache(*args, **kwargs)
873
+
874
+ @add_start_docstrings(
875
+ """The Magma model which consists of a vision backbone and a language model.""",
876
+ MAGMA_START_DOCSTRING,
877
+ )
878
+ class MagmaForConditionalGeneration(MagmaPreTrainedModel):
879
+ def __init__(self, config: MagmaConfig):
880
+ super().__init__(config)
881
+
882
+ self.vision_tower = MagmaImageTower(config.vision_config, require_pretrained=('magma' not in config.name_or_path))
883
+ self.multi_modal_projector = MagmaMultiModalProjector(config.vision_config)
884
+
885
+ self.vocab_size = config.text_config.vocab_size
886
+ self.language_model = AutoModelForCausalLM.from_config(
887
+ config.text_config,
888
+ # attn_implementation=config._attn_implementation,
889
+ trust_remote_code=True
890
+ )
891
+
892
+ self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
893
+ self._padding_side = "left" # set it to left by default, user can use setter to change padding_sides
894
+
895
+ self.post_init()
896
+
897
+ @property
898
+ def padding_side(self):
899
+ return self._padding_side
900
+
901
+ @padding_side.setter
902
+ def padding_side(self, padding_side: str):
903
+ if padding_side not in ["left", "right"]:
904
+ raise ValueError(f"{padding_side} is not `left` or `right`.")
905
+ self._padding_side = padding_side
906
+
907
+ def get_input_embeddings(self):
908
+ return self.language_model.get_input_embeddings()
909
+
910
+ def set_input_embeddings(self, value):
911
+ self.language_model.set_input_embeddings(value)
912
+
913
+ def get_output_embeddings(self):
914
+ return self.language_model.get_output_embeddings()
915
+
916
+ def set_output_embeddings(self, new_embeddings):
917
+ self.language_model.set_output_embeddings(new_embeddings)
918
+
919
+ def set_decoder(self, decoder):
920
+ self.language_model.set_decoder(decoder)
921
+
922
+ def get_decoder(self):
923
+ return self.language_model.get_decoder()
924
+
925
+ def tie_weights(self):
926
+ return self.language_model.tie_weights()
927
+
928
+ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
929
+ model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
930
+ # update vocab size
931
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
932
+ self.vocab_size = model_embeds.num_embeddings
933
+ return model_embeds
934
+
935
+ def _merge_input_ids_with_image_features(
936
+ self,
937
+ image_features,
938
+ feature_lens,
939
+ inputs_embeds,
940
+ input_ids,
941
+ attention_mask,
942
+ position_ids=None,
943
+ labels=None,
944
+ image_token_index=None,
945
+ ignore_index=-100,
946
+ ):
947
+ """
948
+ Merge input_ids with with image features into final embeddings
949
+
950
+ Args:
951
+ image_features (`torch.Tensor` of shape `(all_feature_lens, embed_dim)`):
952
+ All vision vectors of all images in the batch
953
+ feature_lens (`torch.LongTensor` of shape `(num_images)`):
954
+ The length of visual embeddings of each image as stacked in `image_features`
955
+ inputs_embeds (`torch.Tensor` of shape `(batch_size, sequence_length, embed_dim)`):
956
+ Token embeddings before merging with visual embeddings
957
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
958
+ Input_ids of tokens, possibly filled with image token
959
+ attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
960
+ Mask to avoid performing attention on padding token indices.
961
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
962
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
963
+ config.n_positions - 1]`.
964
+ labels (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*)
965
+ Labels need to be recalculated to support training (if provided)
966
+ image_token_index (`int`, *optional*)
967
+ Token id used to indicate the special "image" token. Defaults to `config.image_token_index`
968
+ ignore_index (`int`, *optional*)
969
+ Value that is used to pad `labels` and will be ignored when calculated loss. Default: -100.
970
+ Returns:
971
+ final_embedding, final_attention_mask, position_ids, final_labels
972
+
973
+ Explanation:
974
+ each image has variable length embeddings, with length specified by feature_lens
975
+ image_features is concatenation of all visual embed vectors
976
+ task: fill each <image> with the correct number of visual embeddings
977
+ Example:
978
+ X (5 patches), Y (3 patches), Z (8)
979
+ X, Y are in the same sequence (in-context learning)
980
+ if right padding
981
+ input_ids: [
982
+ a b c d e f X g h i j k Y l m
983
+ o p q r Z s t u v _ _ _ _ _ _
984
+ ]
985
+ input_ids should be: [
986
+ a b c d e f X X X X X g h i j k Y Y Y l m
987
+ o p q r Z Z Z Z Z Z Z Z s t u v _ _ _ _ _
988
+ ]
989
+ labels should be: [
990
+ a b c d e f _ _ _ _ _ g h i j k _ _ _ l m
991
+ o p q r _ _ _ _ _ _ _ _ s t u v _ _ _ _ _
992
+ ]
993
+ elif left padding
994
+ input_ids: [
995
+ a b c d e f X g h i j k Y l m
996
+ _ _ _ _ _ _ o p q r Z s t u v
997
+ ]
998
+ input_ids should be: [
999
+ a b c d e f X X X X X g h i j k Y Y Y l m
1000
+ _ _ _ _ _ o p q r Z Z Z Z Z Z Z Z s t u v
1001
+ ]
1002
+ labels should be: [
1003
+ a b c d e f _ _ _ _ _ g h i j k _ _ _ l m
1004
+ _ _ _ _ _ o p q r _ _ _ _ _ _ _ _ s t u v
1005
+ ]
1006
+ Edge cases:
1007
+ * If the tokens are the same but the image token sizes differ, then left vs. right padding cannot be inferred
1008
+
1009
+ input_ids: [
1010
+ a b c d X g h
1011
+ i j Y k l m n
1012
+ ]
1013
+ where X is 3 tokens while Y is 5, this means that after merging
1014
+ if left-padding (batched generation)
1015
+ input_ids should be: [
1016
+ _ _ a b c d X X X g h
1017
+ i j Y Y Y Y Y k l m n
1018
+ ]
1019
+ elif (right padding) (training)
1020
+ input_ids should be: [
1021
+ a b c d X X X g h _ _
1022
+ i j Y Y Y Y Y k l m n
1023
+ ]
1024
+ """
1025
+ image_token_index = image_token_index if image_token_index is not None else self.config.image_token_index
1026
+ ignore_index = ignore_index if ignore_index is not None else self.config.ignore_index
1027
+
1028
+ with torch.no_grad():
1029
+ num_images = feature_lens.size(0)
1030
+ num_image_features, embed_dim = image_features.shape
1031
+ if feature_lens.sum() != num_image_features:
1032
+ raise ValueError(f"{feature_lens=} / {feature_lens.sum()} != {image_features.shape=}")
1033
+ batch_size = input_ids.shape[0]
1034
+ _left_padding = torch.any(attention_mask[:, 0] == 0)
1035
+ _right_padding = torch.any(attention_mask[:, -1] == 0)
1036
+
1037
+ left_padding = True
1038
+ if batch_size > 1:
1039
+ if _left_padding and not _right_padding:
1040
+ left_padding = True
1041
+ elif not _left_padding and _right_padding:
1042
+ left_padding = False
1043
+ elif not _left_padding and not _right_padding:
1044
+ # neither side is padded (both ends are 1), so we cannot tell
1045
+ left_padding = self.padding_side == "left"
1046
+ else:
1047
+ # invalid attention_mask
1048
+ raise ValueError(f"both side of attention_mask has zero, invalid. {attention_mask}")
1049
+
1050
+ # Whether to turn off right padding
1051
+ # 1. Create a mask to know where special image tokens are
1052
+ special_image_token_mask = input_ids == image_token_index
1053
+ # special_image_token_mask: [bsz, seqlen]
1054
+ num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1)
1055
+ # num_special_image_tokens: [bsz]
1056
+ # Reserve for padding of num_images
1057
+ total_num_special_image_tokens = torch.sum(special_image_token_mask)
1058
+ if total_num_special_image_tokens != num_images:
1059
+ raise ValueError(
1060
+ f"Number of image tokens in input_ids ({total_num_special_image_tokens}) different from num_images ({num_images})."
1061
+ )
1062
+ # Compute the maximum embed dimension
1063
+ # max_image_feature_lens is max_feature_lens per batch
1064
+ feature_lens_batch = feature_lens.split(num_special_image_tokens.tolist(), dim=0)
1065
+ feature_lens_batch_sum = torch.tensor([x.sum() for x in feature_lens_batch], device=feature_lens.device)
1066
+ embed_sequence_lengths = (
1067
+ (attention_mask == 1).long().sum(-1) - num_special_image_tokens + feature_lens_batch_sum
1068
+ )
1069
+ max_embed_dim = embed_sequence_lengths.max()
1070
+
1071
+ batch_indices, non_image_indices = torch.where((input_ids != image_token_index) & (attention_mask == 1))
1072
+ # 2. Compute the positions where text should be written
1073
+ # Calculate new positions for text tokens in merged image-text sequence.
1074
+ # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images` text tokens.
1075
+ # `torch.cumsum` computes how each image token shifts subsequent text token positions.
1076
+ # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
1077
+ # ! instead of special_image_token_mask * (num_image_patches - 1)
1078
+ # special_image_token_mask * (num_feature_len - 1)
1079
+ special_image_token_mask = special_image_token_mask.long()
1080
+ special_image_token_mask[special_image_token_mask == 1] = feature_lens - 1
1081
+ new_token_positions = torch.cumsum((special_image_token_mask + 1), -1) - 1
1082
+ if left_padding:
1083
+ # shift right token positions so that they are ending at the same number
1084
+ # the below here was incorrect? new_token_positions += new_token_positions[:, -1].max() - new_token_positions[:, -1:]
1085
+ new_token_positions += max_embed_dim - 1 - new_token_positions[:, -1:]
1086
+
1087
+ text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
1088
+
1089
+ # 3. Create the full embedding, already padded to the maximum position
1090
+ final_embedding = torch.zeros(
1091
+ batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
1092
+ )
1093
+ final_attention_mask = torch.zeros(
1094
+ batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
1095
+ )
1096
+ final_labels = None
1097
+ if labels is not None:
1098
+ final_labels = torch.full_like(final_attention_mask, ignore_index).to(torch.long)
1099
+ # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
1100
+ # set the corresponding tensors into their correct target device.
1101
+ target_device = inputs_embeds.device
1102
+ batch_indices, non_image_indices, text_to_overwrite = (
1103
+ batch_indices.to(target_device),
1104
+ non_image_indices.to(target_device),
1105
+ text_to_overwrite.to(target_device),
1106
+ )
1107
+ attention_mask = attention_mask.to(target_device)
1108
+
1109
+ # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
1110
+ # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
1111
+ final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
1112
+ final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
1113
+ if labels is not None:
1114
+ final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
1115
+
1116
+ # 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
1117
+ with torch.no_grad():
1118
+ image_to_overwrite = torch.full(
1119
+ (batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
1120
+ )
1121
+ image_to_overwrite[batch_indices, text_to_overwrite] = False
1122
+ embed_indices = torch.arange(max_embed_dim).unsqueeze(0).to(target_device)
1123
+ embed_indices = embed_indices.expand(batch_size, max_embed_dim)
1124
+ embed_seq_lens = embed_sequence_lengths[:, None].to(target_device)
1125
+
1126
+ if left_padding:
1127
+ # exclude padding on the left
1128
+ val = (max_embed_dim - embed_indices) <= embed_seq_lens
1129
+ else:
1130
+ # exclude padding on the right
1131
+ val = embed_indices < embed_seq_lens
1132
+ image_to_overwrite &= val
1133
+
1134
+ if image_to_overwrite.sum() != num_image_features:
1135
+ raise ValueError(
1136
+ f"{image_to_overwrite.sum()=} != {num_image_features=} The input provided to the model are wrong. "
1137
+ f"The number of image tokens is {torch.sum(special_image_token_mask)} while"
1138
+ f" the number of image given to the model is {num_images}. "
1139
+ f"This prevents correct indexing and breaks batch generation."
1140
+ )
1141
+ final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
1142
+ final_attention_mask |= image_to_overwrite
1143
+ position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
1144
+
1145
+ return final_embedding, final_attention_mask, position_ids, final_labels
1146
+
1147
+ @add_start_docstrings_to_model_forward(MAGMA_INPUTS_DOCSTRING)
1148
+ @replace_return_docstrings(output_type=MagmaCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1149
+ def forward(
1150
+ self,
1151
+ input_ids: torch.LongTensor = None,
1152
+ pixel_values: torch.FloatTensor = None,
1153
+ image_sizes: Optional[torch.LongTensor] = None,
1154
+ attention_mask: Optional[torch.Tensor] = None,
1155
+ position_ids: Optional[torch.LongTensor] = None,
1156
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1157
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1158
+ vision_feature_layer: Optional[int] = None,
1159
+ vision_feature_select_strategy: Optional[str] = None,
1160
+ labels: Optional[torch.LongTensor] = None,
1161
+ use_cache: Optional[bool] = None,
1162
+ output_attentions: Optional[bool] = None,
1163
+ output_hidden_states: Optional[bool] = None,
1164
+ return_dict: Optional[bool] = None,
1165
+ ) -> Union[Tuple, MagmaCausalLMOutputWithPast]:
1166
+ r"""
1167
+ Args:
1168
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1169
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1170
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1171
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1172
+
1173
+ Returns:
1174
+
1175
+ Example:
1176
+
1177
+ ```python
1178
+ >>> from PIL import Image
1179
+ >>> import requests
1180
+ >>> from transformers import AutoProcessor, MagmaForConditionalGeneration
1181
+
1182
+ >>> model = MagmaForConditionalGeneration.from_pretrained("microsoft/magma-8b-hf")
1183
+ >>> processor = AutoProcessor.from_pretrained("microsoft/magma-8b-hf")
1184
+
1185
+ >>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
1186
+ >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
1187
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1188
+
1189
+ >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
1190
+
1191
+ >>> # Generate
1192
+ >>> generate_ids = model.generate(**inputs, max_length=30)
1193
+ >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1194
+ "[INST] \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"
1195
+ ```"""
1196
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1197
+ output_hidden_states = (
1198
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1199
+ )
1200
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1201
+ vision_feature_layer = (
1202
+ vision_feature_layer if vision_feature_layer is not None else self.config.vision_config['vision_feature_layer']
1203
+ )
1204
+
1205
+ if inputs_embeds is None:
1206
+ # 1. Extract the input embeddings
1207
+ # In case image_token_index is not in the embeddings (extra token, but the embedding matrix doesn't have it)
1208
+ for_inputs_embeds_ids = input_ids.clone()
1209
+ for_inputs_embeds_ids[(input_ids == self.config.image_token_index)] = 0
1210
+ inputs_embeds = self.get_input_embeddings()(for_inputs_embeds_ids)
1211
+
1212
+ # 2. Merge text and images
1213
+ if pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) > 0:
1214
+ # ! infer image_num_patches from image_sizes
1215
+ # figure out if pixel_values is concatenated or stacked
1216
+ if pixel_values.dim() == 5:
1217
+ image_num_patches = [(imsize[:,0]*imsize[:,1]).tolist() for imsize in image_sizes]
1218
+ # stacking when input is (batch_size, num_patches, num_channels, height, width)
1219
+ _pixel_values_list = [
1220
+ pix_val[:num_patch] for pix_val, num_patch in zip(pixel_values, image_num_patches)
1221
+ ]
1222
+ pixel_values = torch.cat(_pixel_values_list, dim=0)
1223
+ elif pixel_values.dim() != 4:
1224
+ # otherwise has to be stacked from list of (num_patches, num_channels, height, width)
1225
+ raise ValueError(f"pixel_values of shape {pixel_values.shape}, expect to be of 4 or 5 dimensions")
1226
+
1227
+ if self.config.vision_config['img_anyres_strategy'] == "global":
1228
+ num_patches_for_images = [(imsize[0]*imsize[1]).item() for imsize in image_sizes]
1229
+ pixel_values_for_images = pixel_values.split(num_patches_for_images, dim=0)
1230
+ selected_image_features = []
1231
+ for idx, (image_size, pixel_values_for_image) in enumerate(zip(image_sizes, pixel_values_for_images)):
1232
+ pixel_values_for_image = pixel_values_for_image.view(image_size[0], image_size[1], *pixel_values_for_image.shape[1:])
1233
+ pixel_values_for_image = pixel_values_for_image.permute(2, 0, 3, 1, 4).flatten(3, 4).flatten(1, 2).unsqueeze(0)
1234
+ image_features = self.vision_tower(pixel_values_for_image)
1235
+ selected_image_feature = image_features[vision_feature_layer][0].permute(1, 2, 0)
1236
+ selected_image_feature = self.multi_modal_projector(selected_image_feature)
1237
+ selected_image_feature = torch.cat((selected_image_feature, self.multi_modal_projector.row_seperator.repeat(selected_image_feature.shape[0],1,1)), dim=1)
1238
+ selected_image_features.append(selected_image_feature)
1239
+ elif self.config.vision_config['img_anyres_strategy'] == "crop":
1240
+ image_features = self.vision_tower(pixel_values)[vision_feature_layer].permute(0, 2, 3, 1)
1241
+ image_features = self.multi_modal_projector(image_features)
1242
+ num_patches_for_images = [(imsize[0]*imsize[1]).item() for imsize in image_sizes]
1243
+ image_features_split = torch.split(image_features, num_patches_for_images, dim=0)
1244
+ selected_image_features = []
1245
+ for image_feature, image_size in zip(image_features_split, image_sizes):
1246
+ image_feature = image_feature.view(image_size[0], image_size[1], *image_feature.shape[1:])
1247
+ image_feature = image_feature.permute(0, 2, 1, 3, 4).flatten(2, 3).flatten(0, 1)
1248
+ image_feature = torch.cat((image_feature, self.multi_modal_projector.row_seperator.repeat(image_feature.shape[0],1,1)), dim=1)
1249
+ selected_image_features.append(image_feature)
1250
+
1251
+ # NOTE we only support multimodal_patch_merge_type == "spatial_unpad"
1252
+ feature_lens = [elem.shape[0]*elem.shape[1] for elem in selected_image_features]
1253
+ image_features = torch.cat([elem.flatten(0, 1) for elem in selected_image_features], 0)
1254
+ feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=image_features.device)
1255
+
1256
+ # inputs_embeds = inputs_embeds.to(image_features.dtype)
1257
+ inputs_embeds, attention_mask, position_ids, labels = self._merge_input_ids_with_image_features(
1258
+ image_features,
1259
+ feature_lens,
1260
+ inputs_embeds,
1261
+ input_ids,
1262
+ attention_mask,
1263
+ position_ids,
1264
+ labels=labels,
1265
+ )
1266
+
1267
+ # pixel_values is not None but is empty ---> text only cases
1268
+ elif pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) == 0:
1269
+ # there are no images
1270
+ pass
1271
+
1272
+ # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
1273
+ # generation with cache
1274
+ elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
1275
+ # Retrieve the first layer to inspect the logits and mask out the hidden states
1276
+ # that are set to 0
1277
+ first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
1278
+
1279
+ # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
1280
+ batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
1281
+
1282
+ # Get the target length
1283
+ target_length = input_ids.shape[1]
1284
+ past_length = first_layer_past_key_value.shape[-1]
1285
+
1286
+ extended_attention_mask = torch.ones(
1287
+ (attention_mask.shape[0], past_length),
1288
+ dtype=attention_mask.dtype,
1289
+ device=attention_mask.device,
1290
+ )
1291
+
1292
+ # Filter out only the tokens that can be un-attended, this can happen
1293
+ # if one uses Llava + Fused modules where the cache on the
1294
+ # first iteration is already big enough, or if one passes custom cache
1295
+ valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
1296
+ new_batch_index = batch_index[valid_indices]
1297
+ new_non_attended_tokens = non_attended_tokens[valid_indices]
1298
+
1299
+ # Zero-out the places where we don't need to attend
1300
+ extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
1301
+
1302
+ attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
1303
+
1304
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
1305
+
1306
+ outputs = self.language_model(
1307
+ attention_mask=attention_mask,
1308
+ position_ids=position_ids,
1309
+ past_key_values=past_key_values,
1310
+ inputs_embeds=inputs_embeds,
1311
+ use_cache=use_cache,
1312
+ output_attentions=output_attentions,
1313
+ output_hidden_states=output_hidden_states,
1314
+ return_dict=return_dict,
1315
+ )
1316
+
1317
+ logits = outputs[0]
1318
+
1319
+ loss = None
1320
+ if labels is not None:
1321
+ # Shift so that tokens < n predict n
1322
+ if attention_mask is not None:
1323
+ shift_attention_mask = attention_mask[..., 1:]
1324
+ shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
1325
+ shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
1326
+ else:
1327
+ shift_logits = logits[..., :-1, :].contiguous()
1328
+ shift_labels = labels[..., 1:].contiguous()
1329
+ # Flatten the tokens
1330
+ loss_fct = nn.CrossEntropyLoss()
1331
+ loss = loss_fct(
1332
+ shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
1333
+ )
1334
+
1335
+ if not return_dict:
1336
+ output = (logits,) + outputs[1:]
1337
+ return (loss,) + output if loss is not None else output
1338
+
1339
+ return MagmaCausalLMOutputWithPast(
1340
+ loss=loss,
1341
+ logits=logits,
1342
+ past_key_values=outputs.past_key_values,
1343
+ hidden_states=outputs.hidden_states,
1344
+ attentions=outputs.attentions,
1345
+ )
1346
+
1347
+ def prepare_inputs_for_generation(
1348
+ self,
1349
+ input_ids,
1350
+ past_key_values=None,
1351
+ inputs_embeds=None,
1352
+ pixel_values=None,
1353
+ image_sizes=None,
1354
+ attention_mask=None,
1355
+ **kwargs,
1356
+ ):
1357
+ if past_key_values is not None:
1358
+ if isinstance(past_key_values, Cache):
1359
+ cache_length = past_key_values.get_seq_length()
1360
+ past_length = past_key_values.seen_tokens
1361
+ else:
1362
+ cache_length = past_length = past_key_values[0][0].shape[2]
1363
+
1364
+ # Keep only the unprocessed tokens:
1365
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1366
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
1367
+ # input)
1368
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1369
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1370
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1371
+ # input_ids based on the past_length.
1372
+ elif past_length < input_ids.shape[1]:
1373
+ input_ids = input_ids[:, past_length:]
1374
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1375
+ elif self.config.image_token_index in input_ids:
1376
+ input_ids = input_ids[:, input_ids.shape[1] - 1 :]
1377
+ # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
1378
+ # older attention values, as their corresponding values are not part of the input.
1379
+ if cache_length < past_length and attention_mask is not None:
1380
+ attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]
1381
+
1382
+ position_ids = kwargs.get("position_ids", None)
1383
+ if attention_mask is not None and position_ids is None:
1384
+ # create position_ids on the fly for batch generation
1385
+ position_ids = attention_mask.long().cumsum(-1) - 1
1386
+ position_ids.masked_fill_(attention_mask == 0, 1)
1387
+ if past_key_values:
1388
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1389
+
1390
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1391
+ if inputs_embeds is not None and past_key_values is None:
1392
+ model_inputs = {"inputs_embeds": inputs_embeds}
1393
+ else:
1394
+ model_inputs = {"input_ids": input_ids}
1395
+
1396
+ model_inputs.update(
1397
+ {
1398
+ "position_ids": position_ids,
1399
+ "past_key_values": past_key_values,
1400
+ "use_cache": kwargs.get("use_cache"),
1401
+ "attention_mask": attention_mask,
1402
+ "pixel_values": pixel_values,
1403
+ "image_sizes": image_sizes,
1404
+ }
1405
+ )
1406
+ return model_inputs
1407
+
1408
+ def _reorder_cache(self, *args, **kwargs):
1409
+ return self.language_model._reorder_cache(*args, **kwargs)
1410
+
1411
+ AutoConfig.register("magma", MagmaConfig)
1412
+ AutoModelForCausalLM.register(MagmaConfig, MagmaForConditionalGeneration)
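A minimal, self-contained sketch of the position bookkeeping used by `_merge_input_ids_with_image_features` above: each `<image>` placeholder is expanded to a variable number of visual embeddings, and the cumulative-sum trick computes where the surrounding text tokens land in the merged sequence. The token ids and feature lengths below are toy values chosen for illustration, not taken from the checkpoint.

```python
import torch

image_token_index = 99                            # hypothetical id of the <image> placeholder
input_ids = torch.tensor([[7, 8, 99, 9, 10]])     # one sequence: 4 text tokens + 1 image token
feature_lens = torch.tensor([3])                  # this image contributes 3 visual embeddings

# Each image token is replaced by `feature_len` slots, so it shifts the
# positions of all subsequent text tokens by (feature_len - 1).
mask = (input_ids == image_token_index).long()
mask[mask == 1] = feature_lens - 1
new_token_positions = torch.cumsum(mask + 1, dim=-1) - 1

print(new_token_positions)  # tensor([[0, 1, 4, 5, 6]])
# Text tokens land at positions 0, 1, 5, 6; slots 2-4 are left free for the
# three visual embeddings, giving a merged length of 5 - 1 + 3 = 7.
```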
preprocessor_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_magma.MagmaProcessor",
4
+ "AutoImageProcessor": "image_processing_magma.MagmaImageProcessor"
5
+ },
6
+ "anyres_strategy": "crop",
7
+ "base_img_size": 512,
8
+ "do_convert_rgb": true,
9
+ "image_mean": [
10
+ 0.48145466,
11
+ 0.4578275,
12
+ 0.40821073
13
+ ],
14
+ "image_processor_type": "MagmaImageProcessor",
15
+ "image_std": [
16
+ 0.26862954,
17
+ 0.26130258,
18
+ 0.27577711
19
+ ],
20
+ "num_crops": 4,
21
+ "processor_class": "MagmaProcessor"
22
+ }
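A hedged illustration of the normalization implied by the `image_mean`/`image_std` values above (the standard CLIP statistics). The 512x512 random tensor stands in for a single `base_img_size` crop; how the `"crop"` anyres strategy tiles a larger image into up to `num_crops` such crops is handled inside `MagmaImageProcessor` and is not reproduced here.

```python
import torch

image_mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)
image_std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)

crop = torch.rand(3, 512, 512)                # dummy RGB crop scaled to [0, 1]
normalized = (crop - image_mean) / image_std  # per-crop normalization with the config statistics
print(normalized.shape)                       # torch.Size([3, 512, 512])
```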
processing_magma.py ADDED
@@ -0,0 +1,145 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The HuggingFace Inc. team.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """
16
+ Processor class for Magma.
17
+ """
18
+
19
+ from typing import List, Optional, Union
20
+
21
+ import transformers
22
+ from transformers.feature_extraction_utils import BatchFeature
23
+ from transformers.image_utils import ImageInput
24
+ from transformers.processing_utils import ProcessorMixin
25
+ from transformers.tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
26
+ from transformers.utils import TensorType
27
+ from .configuration_magma import MagmaConfig
28
+
29
+
30
+ class MagmaProcessor(ProcessorMixin):
31
+ r"""
32
+ Constructs a Magma processor which wraps a Magma image processor and a LLaMa tokenizer into a single processor.
33
+
34
+ [`MagmaProcessor`] offers all the functionalities of [`MagmaImageProcessor`] and [`LlamaTokenizerFast`]. See the
35
+ [`~MagmaProcessor.__call__`] and [`~MagmaProcessor.decode`] for more information.
36
+
37
+ Args:
38
+ image_processor ([`MagmaImageProcessor`], *optional*):
39
+ The image processor is a required input.
40
+ tokenizer ([`LlamaTokenizerFast`], *optional*):
41
+ The tokenizer is a required input.
42
+ """
43
+
44
+ attributes = ["image_processor", "tokenizer"]
45
+ image_processor_class = "AutoImageProcessor"
46
+ tokenizer_class = "AutoTokenizer"
47
+
48
+ def __init__(self, image_processor=None, tokenizer=None):
49
+ # super().__init__(image_processor, tokenizer)
50
+ self.image_processor = image_processor
51
+ self.tokenizer = tokenizer
52
+
53
+ def __call__(
54
+ self,
55
+ texts: Union[TextInput, List[TextInput]],
56
+ images: Union[ImageInput, List[ImageInput]],
57
+ padding: Union[bool, str, PaddingStrategy] = False,
58
+ truncation: Union[bool, str, TruncationStrategy] = None,
59
+ max_length: Optional[int] = None,
60
+ do_pad: Optional[bool] = False,
61
+ return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
62
+ ) -> BatchFeature:
63
+ """
64
+ Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
65
+ and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
66
+ the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
67
+ MagmaImageProcessor's [`~MagmaImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
68
+ of the above two methods for more information.
69
+
70
+ Args:
71
+ texts (`str`, `List[str]`, `List[List[str]]`):
72
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
73
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
74
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
75
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
76
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
77
+ tensor. Both channels-first and channels-last formats are supported.
78
+ padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
79
+ Select a strategy to pad the returned sequences (according to the model's padding side and padding
80
+ index) among:
81
+ - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
82
+ sequence is provided).
83
+ - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
84
+ acceptable input length for the model if that argument is not provided.
85
+ - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
86
+ lengths).
87
+ max_length (`int`, *optional*):
88
+ Maximum length of the returned list and optionally padding length (see above).
89
+ do_pad (`bool`, *optional*, defaults to self.do_pad):
90
+ Whether to pad the image. If `True` will pad the images in the batch to the largest image in the batch
91
+ and create a pixel mask. Padding will be applied to the bottom and right of the image with zeros.
92
+ truncation (`bool`, *optional*):
93
+ Activates truncation to cut input sequences longer than `max_length` to `max_length`.
94
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
95
+ If set, will return tensors of a particular framework. Acceptable values are:
96
+
97
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
98
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
99
+ - `'np'`: Return NumPy `np.ndarray` objects.
100
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
101
+
102
+ Returns:
103
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
104
+
105
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
106
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
107
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
108
+ `None`).
109
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
110
+ """
111
+ if images is not None:
112
+ image_inputs = self.image_processor(images, do_pad=do_pad, return_tensors=return_tensors)
113
+ else:
114
+ image_inputs = {}
115
+ text_inputs = self.tokenizer(
116
+ texts, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
117
+ )
118
+
119
+ return BatchFeature(data={**text_inputs, **image_inputs})
120
+
121
+ def apply_chat_template(self, *args, **kwargs):
122
+ return self.tokenizer.apply_chat_template(*args, **kwargs)
123
+
124
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
125
+ def batch_decode(self, *args, **kwargs):
126
+ """
127
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
128
+ refer to the docstring of this method for more information.
129
+ """
130
+ return self.tokenizer.batch_decode(*args, **kwargs)
131
+
132
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
133
+ def decode(self, *args, **kwargs):
134
+ """
135
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
136
+ the docstring of this method for more information.
137
+ """
138
+ return self.tokenizer.decode(*args, **kwargs)
139
+
140
+ @property
141
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
142
+ def model_input_names(self):
143
+ tokenizer_input_names = self.tokenizer.model_input_names
144
+ image_processor_input_names = self.image_processor.model_input_names
145
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
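A short usage sketch for the `MagmaProcessor.__call__` defined above. The prompt format and repo id mirror the docstring example earlier in this commit; whether `trust_remote_code=True` is strictly required, and the exact set of returned keys beyond `input_ids`, `attention_mask`, and `pixel_values`, are assumptions based on the `auto_map` entry and on what `MagmaForConditionalGeneration.forward` consumes.

```python
import requests
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/magma-8b-hf", trust_remote_code=True)

prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Returns a BatchFeature holding the tokenized text and preprocessed image crops.
inputs = processor(texts=prompt, images=image, return_tensors="pt")
print(inputs.keys())
```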
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|begin_of_text|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|eot_id|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<pad>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
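A quick sanity check of the special tokens declared above, assuming the tokenizer files in this commit are loaded from the same repo id used in the docstring examples.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/magma-8b-hf")
print(tokenizer.bos_token)  # "<|begin_of_text|>"
print(tokenizer.eos_token)  # "<|eot_id|>"
print(tokenizer.pad_token)  # "<pad>"
```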
test.txt ADDED
@@ -0,0 +1 @@
1
+ test
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,2108 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "128000": {
4
+ "content": "<|begin_of_text|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "128001": {
12
+ "content": "<|end_of_text|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "128002": {
20
+ "content": "<|reserved_special_token_0|>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "128003": {
28
+ "content": "<|reserved_special_token_1|>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "128004": {
36
+ "content": "<|reserved_special_token_2|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "128005": {
44
+ "content": "<|reserved_special_token_3|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "128006": {
52
+ "content": "<|start_header_id|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "128007": {
60
+ "content": "<|end_header_id|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "128008": {
68
+ "content": "<|reserved_special_token_4|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "128009": {
76
+ "content": "<|eot_id|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "128010": {
84
+ "content": "<|reserved_special_token_5|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "128011": {
92
+ "content": "<|reserved_special_token_6|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "128012": {
100
+ "content": "<|reserved_special_token_7|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "128013": {
108
+ "content": "<|reserved_special_token_8|>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "128014": {
116
+ "content": "<|reserved_special_token_9|>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "128015": {
124
+ "content": "<|reserved_special_token_10|>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "128016": {
132
+ "content": "<|reserved_special_token_11|>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "128017": {
140
+ "content": "<|reserved_special_token_12|>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ },
147
+ "128018": {
148
+ "content": "<|reserved_special_token_13|>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": true
154
+ },
155
+ "128019": {
156
+ "content": "<|reserved_special_token_14|>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "128020": {
164
+ "content": "<|reserved_special_token_15|>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "128021": {
172
+ "content": "<|reserved_special_token_16|>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": true
178
+ },
179
+ "128022": {
180
+ "content": "<|reserved_special_token_17|>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": true
186
+ },
187
+ "128023": {
188
+ "content": "<|reserved_special_token_18|>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": true
194
+ },
195
+ "128024": {
196
+ "content": "<|reserved_special_token_19|>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": true
202
+ },
203
+ "128025": {
204
+ "content": "<|reserved_special_token_20|>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": true
210
+ },
211
+ "128026": {
212
+ "content": "<|reserved_special_token_21|>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": true
218
+ },
219
+ "128027": {
220
+ "content": "<|reserved_special_token_22|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "128028": {
228
+ "content": "<|reserved_special_token_23|>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "128029": {
236
+ "content": "<|reserved_special_token_24|>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "128030": {
244
+ "content": "<|reserved_special_token_25|>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "128031": {
252
+ "content": "<|reserved_special_token_26|>",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "128032": {
260
+ "content": "<|reserved_special_token_27|>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "128033": {
268
+ "content": "<|reserved_special_token_28|>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": true
274
+ },
275
+ "128034": {
276
+ "content": "<|reserved_special_token_29|>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": true
282
+ },
283
+ "128035": {
284
+ "content": "<|reserved_special_token_30|>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": true
290
+ },
291
+ "128036": {
292
+ "content": "<|reserved_special_token_31|>",
293
+ "lstrip": false,
294
+ "normalized": false,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": true
298
+ },
299
+ "128037": {
300
+ "content": "<|reserved_special_token_32|>",
301
+ "lstrip": false,
302
+ "normalized": false,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": true
306
+ },
307
+ "128038": {
308
+ "content": "<|reserved_special_token_33|>",
309
+ "lstrip": false,
310
+ "normalized": false,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": true
314
+ },
315
+ "128039": {
316
+ "content": "<|reserved_special_token_34|>",
317
+ "lstrip": false,
318
+ "normalized": false,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": true
322
+ },
323
+ "128040": {
324
+ "content": "<|reserved_special_token_35|>",
325
+ "lstrip": false,
326
+ "normalized": false,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": true
330
+ },
331
+ "128041": {
332
+ "content": "<|reserved_special_token_36|>",
333
+ "lstrip": false,
334
+ "normalized": false,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": true
338
+ },
339
+ "128042": {
340
+ "content": "<|reserved_special_token_37|>",
341
+ "lstrip": false,
342
+ "normalized": false,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": true
346
+ },
347
+ "128043": {
348
+ "content": "<|reserved_special_token_38|>",
349
+ "lstrip": false,
350
+ "normalized": false,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": true
354
+ },
355
+ "128044": {
356
+ "content": "<|reserved_special_token_39|>",
357
+ "lstrip": false,
358
+ "normalized": false,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": true
362
+ },
363
+ "128045": {
364
+ "content": "<|reserved_special_token_40|>",
365
+ "lstrip": false,
366
+ "normalized": false,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": true
370
+ },
371
+ "128046": {
372
+ "content": "<|reserved_special_token_41|>",
373
+ "lstrip": false,
374
+ "normalized": false,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": true
378
+ },
379
+ "128047": {
380
+ "content": "<|reserved_special_token_42|>",
381
+ "lstrip": false,
382
+ "normalized": false,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": true
386
+ },
387
+ "128048": {
388
+ "content": "<|reserved_special_token_43|>",
389
+ "lstrip": false,
390
+ "normalized": false,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": true
394
+ },
395
+ "128049": {
396
+ "content": "<|reserved_special_token_44|>",
397
+ "lstrip": false,
398
+ "normalized": false,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": true
402
+ },
403
+ "128050": {
404
+ "content": "<|reserved_special_token_45|>",
405
+ "lstrip": false,
406
+ "normalized": false,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": true
410
+ },
411
+ "128051": {
412
+ "content": "<|reserved_special_token_46|>",
413
+ "lstrip": false,
414
+ "normalized": false,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": true
418
+ },
419
+ "128052": {
420
+ "content": "<|reserved_special_token_47|>",
421
+ "lstrip": false,
422
+ "normalized": false,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": true
426
+ },
427
+ "128053": {
428
+ "content": "<|reserved_special_token_48|>",
429
+ "lstrip": false,
430
+ "normalized": false,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": true
434
+ },
435
+ "128054": {
436
+ "content": "<|reserved_special_token_49|>",
437
+ "lstrip": false,
438
+ "normalized": false,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": true
442
+ },
443
+ "128055": {
444
+ "content": "<|reserved_special_token_50|>",
445
+ "lstrip": false,
446
+ "normalized": false,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": true
450
+ },
451
+ "128056": {
452
+ "content": "<|reserved_special_token_51|>",
453
+ "lstrip": false,
454
+ "normalized": false,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": true
458
+ },
459
+ "128057": {
460
+ "content": "<|reserved_special_token_52|>",
461
+ "lstrip": false,
462
+ "normalized": false,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": true
466
+ },
467
+ "128058": {
468
+ "content": "<|reserved_special_token_53|>",
469
+ "lstrip": false,
470
+ "normalized": false,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": true
474
+ },
475
+ "128059": {
476
+ "content": "<|reserved_special_token_54|>",
477
+ "lstrip": false,
478
+ "normalized": false,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": true
482
+ },
483
+ "128060": {
484
+ "content": "<|reserved_special_token_55|>",
485
+ "lstrip": false,
486
+ "normalized": false,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": true
490
+ },
491
+ "128061": {
492
+ "content": "<|reserved_special_token_56|>",
493
+ "lstrip": false,
494
+ "normalized": false,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": true
498
+ },
499
+ "128062": {
500
+ "content": "<|reserved_special_token_57|>",
501
+ "lstrip": false,
502
+ "normalized": false,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": true
506
+ },
507
+ "128063": {
508
+ "content": "<|reserved_special_token_58|>",
509
+ "lstrip": false,
510
+ "normalized": false,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": true
514
+ },
515
+ "128064": {
516
+ "content": "<|reserved_special_token_59|>",
517
+ "lstrip": false,
518
+ "normalized": false,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": true
522
+ },
523
+ "128065": {
524
+ "content": "<|reserved_special_token_60|>",
525
+ "lstrip": false,
526
+ "normalized": false,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": true
530
+ },
531
+ "128066": {
532
+ "content": "<|reserved_special_token_61|>",
533
+ "lstrip": false,
534
+ "normalized": false,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": true
538
+ },
539
+ "128067": {
540
+ "content": "<|reserved_special_token_62|>",
541
+ "lstrip": false,
542
+ "normalized": false,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": true
546
+ },
547
+ "128068": {
548
+ "content": "<|reserved_special_token_63|>",
549
+ "lstrip": false,
550
+ "normalized": false,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": true
554
+ },
555
+ "128069": {
556
+ "content": "<|reserved_special_token_64|>",
557
+ "lstrip": false,
558
+ "normalized": false,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": true
562
+ },
563
+ "128070": {
564
+ "content": "<|reserved_special_token_65|>",
565
+ "lstrip": false,
566
+ "normalized": false,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": true
570
+ },
571
+ "128071": {
572
+ "content": "<|reserved_special_token_66|>",
573
+ "lstrip": false,
574
+ "normalized": false,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": true
578
+ },
579
+ "128072": {
580
+ "content": "<|reserved_special_token_67|>",
581
+ "lstrip": false,
582
+ "normalized": false,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": true
586
+ },
587
+ "128073": {
588
+ "content": "<|reserved_special_token_68|>",
589
+ "lstrip": false,
590
+ "normalized": false,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": true
594
+ },
595
+ "128074": {
596
+ "content": "<|reserved_special_token_69|>",
597
+ "lstrip": false,
598
+ "normalized": false,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": true
602
+ },
603
+ "128075": {
604
+ "content": "<|reserved_special_token_70|>",
605
+ "lstrip": false,
606
+ "normalized": false,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": true
610
+ },
611
+ "128076": {
612
+ "content": "<|reserved_special_token_71|>",
613
+ "lstrip": false,
614
+ "normalized": false,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": true
618
+ },
619
+ "128077": {
620
+ "content": "<|reserved_special_token_72|>",
621
+ "lstrip": false,
622
+ "normalized": false,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": true
626
+ },
627
+ "128078": {
628
+ "content": "<|reserved_special_token_73|>",
629
+ "lstrip": false,
630
+ "normalized": false,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": true
634
+ },
635
+ "128079": {
636
+ "content": "<|reserved_special_token_74|>",
637
+ "lstrip": false,
638
+ "normalized": false,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": true
642
+ },
643
+ "128080": {
644
+ "content": "<|reserved_special_token_75|>",
645
+ "lstrip": false,
646
+ "normalized": false,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": true
650
+ },
651
+ "128081": {
652
+ "content": "<|reserved_special_token_76|>",
653
+ "lstrip": false,
654
+ "normalized": false,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": true
658
+ },
659
+ "128082": {
660
+ "content": "<|reserved_special_token_77|>",
661
+ "lstrip": false,
662
+ "normalized": false,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": true
666
+ },
667
+ "128083": {
668
+ "content": "<|reserved_special_token_78|>",
669
+ "lstrip": false,
670
+ "normalized": false,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": true
674
+ },
675
+ "128084": {
676
+ "content": "<|reserved_special_token_79|>",
677
+ "lstrip": false,
678
+ "normalized": false,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": true
682
+ },
683
+ "128085": {
684
+ "content": "<|reserved_special_token_80|>",
685
+ "lstrip": false,
686
+ "normalized": false,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": true
690
+ },
691
+ "128086": {
692
+ "content": "<|reserved_special_token_81|>",
693
+ "lstrip": false,
694
+ "normalized": false,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": true
698
+ },
699
+ "128087": {
700
+ "content": "<|reserved_special_token_82|>",
701
+ "lstrip": false,
702
+ "normalized": false,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": true
706
+ },
707
+ "128088": {
708
+ "content": "<|reserved_special_token_83|>",
709
+ "lstrip": false,
710
+ "normalized": false,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": true
714
+ },
715
+ "128089": {
716
+ "content": "<|reserved_special_token_84|>",
717
+ "lstrip": false,
718
+ "normalized": false,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": true
722
+ },
723
+ "128090": {
724
+ "content": "<|reserved_special_token_85|>",
725
+ "lstrip": false,
726
+ "normalized": false,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": true
730
+ },
731
+ "128091": {
732
+ "content": "<|reserved_special_token_86|>",
733
+ "lstrip": false,
734
+ "normalized": false,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": true
738
+ },
739
+ "128092": {
740
+ "content": "<|reserved_special_token_87|>",
741
+ "lstrip": false,
742
+ "normalized": false,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": true
746
+ },
747
+ "128093": {
748
+ "content": "<|reserved_special_token_88|>",
749
+ "lstrip": false,
750
+ "normalized": false,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": true
754
+ },
755
+ "128094": {
756
+ "content": "<|reserved_special_token_89|>",
757
+ "lstrip": false,
758
+ "normalized": false,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": true
762
+ },
763
+ "128095": {
764
+ "content": "<|reserved_special_token_90|>",
765
+ "lstrip": false,
766
+ "normalized": false,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": true
770
+ },
771
+ "128096": {
772
+ "content": "<|reserved_special_token_91|>",
773
+ "lstrip": false,
774
+ "normalized": false,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": true
778
+ },
779
+ "128097": {
780
+ "content": "<|reserved_special_token_92|>",
781
+ "lstrip": false,
782
+ "normalized": false,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": true
786
+ },
787
+ "128098": {
788
+ "content": "<|reserved_special_token_93|>",
789
+ "lstrip": false,
790
+ "normalized": false,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": true
794
+ },
795
+ "128099": {
796
+ "content": "<|reserved_special_token_94|>",
797
+ "lstrip": false,
798
+ "normalized": false,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": true
802
+ },
803
+ "128100": {
804
+ "content": "<|reserved_special_token_95|>",
805
+ "lstrip": false,
806
+ "normalized": false,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": true
810
+ },
811
+ "128101": {
812
+ "content": "<|reserved_special_token_96|>",
813
+ "lstrip": false,
814
+ "normalized": false,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": true
818
+ },
819
+ "128102": {
820
+ "content": "<|reserved_special_token_97|>",
821
+ "lstrip": false,
822
+ "normalized": false,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": true
826
+ },
827
+ "128103": {
828
+ "content": "<|reserved_special_token_98|>",
829
+ "lstrip": false,
830
+ "normalized": false,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": true
834
+ },
835
+ "128104": {
836
+ "content": "<|reserved_special_token_99|>",
837
+ "lstrip": false,
838
+ "normalized": false,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": true
842
+ },
843
+ "128105": {
844
+ "content": "<|reserved_special_token_100|>",
845
+ "lstrip": false,
846
+ "normalized": false,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": true
850
+ },
851
+ "128106": {
852
+ "content": "<|reserved_special_token_101|>",
853
+ "lstrip": false,
854
+ "normalized": false,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": true
858
+ },
859
+ "128107": {
860
+ "content": "<|reserved_special_token_102|>",
861
+ "lstrip": false,
862
+ "normalized": false,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": true
866
+ },
867
+ "128108": {
868
+ "content": "<|reserved_special_token_103|>",
869
+ "lstrip": false,
870
+ "normalized": false,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": true
874
+ },
875
+ "128109": {
876
+ "content": "<|reserved_special_token_104|>",
877
+ "lstrip": false,
878
+ "normalized": false,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": true
882
+ },
883
+ "128110": {
884
+ "content": "<|reserved_special_token_105|>",
885
+ "lstrip": false,
886
+ "normalized": false,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": true
890
+ },
891
+ "128111": {
892
+ "content": "<|reserved_special_token_106|>",
893
+ "lstrip": false,
894
+ "normalized": false,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": true
898
+ },
899
+ "128112": {
900
+ "content": "<|reserved_special_token_107|>",
901
+ "lstrip": false,
902
+ "normalized": false,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": true
906
+ },
907
+ "128113": {
908
+ "content": "<|reserved_special_token_108|>",
909
+ "lstrip": false,
910
+ "normalized": false,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": true
914
+ },
915
+ "128114": {
916
+ "content": "<|reserved_special_token_109|>",
917
+ "lstrip": false,
918
+ "normalized": false,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": true
922
+ },
923
+ "128115": {
924
+ "content": "<|reserved_special_token_110|>",
925
+ "lstrip": false,
926
+ "normalized": false,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": true
930
+ },
931
+ "128116": {
932
+ "content": "<|reserved_special_token_111|>",
933
+ "lstrip": false,
934
+ "normalized": false,
935
+ "rstrip": false,
936
+ "single_word": false,
937
+ "special": true
938
+ },
939
+ "128117": {
940
+ "content": "<|reserved_special_token_112|>",
941
+ "lstrip": false,
942
+ "normalized": false,
943
+ "rstrip": false,
944
+ "single_word": false,
945
+ "special": true
946
+ },
947
+ "128118": {
948
+ "content": "<|reserved_special_token_113|>",
949
+ "lstrip": false,
950
+ "normalized": false,
951
+ "rstrip": false,
952
+ "single_word": false,
953
+ "special": true
954
+ },
955
+ "128119": {
956
+ "content": "<|reserved_special_token_114|>",
957
+ "lstrip": false,
958
+ "normalized": false,
959
+ "rstrip": false,
960
+ "single_word": false,
961
+ "special": true
962
+ },
963
+ "128120": {
964
+ "content": "<|reserved_special_token_115|>",
965
+ "lstrip": false,
966
+ "normalized": false,
967
+ "rstrip": false,
968
+ "single_word": false,
969
+ "special": true
970
+ },
971
+ "128121": {
972
+ "content": "<|reserved_special_token_116|>",
973
+ "lstrip": false,
974
+ "normalized": false,
975
+ "rstrip": false,
976
+ "single_word": false,
977
+ "special": true
978
+ },
979
+ "128122": {
980
+ "content": "<|reserved_special_token_117|>",
981
+ "lstrip": false,
982
+ "normalized": false,
983
+ "rstrip": false,
984
+ "single_word": false,
985
+ "special": true
986
+ },
987
+ "128123": {
988
+ "content": "<|reserved_special_token_118|>",
989
+ "lstrip": false,
990
+ "normalized": false,
991
+ "rstrip": false,
992
+ "single_word": false,
993
+ "special": true
994
+ },
995
+ "128124": {
996
+ "content": "<|reserved_special_token_119|>",
997
+ "lstrip": false,
998
+ "normalized": false,
999
+ "rstrip": false,
1000
+ "single_word": false,
1001
+ "special": true
1002
+ },
1003
+ "128125": {
1004
+ "content": "<|reserved_special_token_120|>",
1005
+ "lstrip": false,
1006
+ "normalized": false,
1007
+ "rstrip": false,
1008
+ "single_word": false,
1009
+ "special": true
1010
+ },
1011
+ "128126": {
1012
+ "content": "<|reserved_special_token_121|>",
1013
+ "lstrip": false,
1014
+ "normalized": false,
1015
+ "rstrip": false,
1016
+ "single_word": false,
1017
+ "special": true
1018
+ },
1019
+ "128127": {
1020
+ "content": "<|reserved_special_token_122|>",
1021
+ "lstrip": false,
1022
+ "normalized": false,
1023
+ "rstrip": false,
1024
+ "single_word": false,
1025
+ "special": true
1026
+ },
1027
+ "128128": {
1028
+ "content": "<|reserved_special_token_123|>",
1029
+ "lstrip": false,
1030
+ "normalized": false,
1031
+ "rstrip": false,
1032
+ "single_word": false,
1033
+ "special": true
1034
+ },
1035
+ "128129": {
1036
+ "content": "<|reserved_special_token_124|>",
1037
+ "lstrip": false,
1038
+ "normalized": false,
1039
+ "rstrip": false,
1040
+ "single_word": false,
1041
+ "special": true
1042
+ },
1043
+ "128130": {
1044
+ "content": "<|reserved_special_token_125|>",
1045
+ "lstrip": false,
1046
+ "normalized": false,
1047
+ "rstrip": false,
1048
+ "single_word": false,
1049
+ "special": true
1050
+ },
1051
+ "128131": {
1052
+ "content": "<|reserved_special_token_126|>",
1053
+ "lstrip": false,
1054
+ "normalized": false,
1055
+ "rstrip": false,
1056
+ "single_word": false,
1057
+ "special": true
1058
+ },
1059
+ "128132": {
1060
+ "content": "<|reserved_special_token_127|>",
1061
+ "lstrip": false,
1062
+ "normalized": false,
1063
+ "rstrip": false,
1064
+ "single_word": false,
1065
+ "special": true
1066
+ },
1067
+ "128133": {
1068
+ "content": "<|reserved_special_token_128|>",
1069
+ "lstrip": false,
1070
+ "normalized": false,
1071
+ "rstrip": false,
1072
+ "single_word": false,
1073
+ "special": true
1074
+ },
1075
+ "128134": {
1076
+ "content": "<|reserved_special_token_129|>",
1077
+ "lstrip": false,
1078
+ "normalized": false,
1079
+ "rstrip": false,
1080
+ "single_word": false,
1081
+ "special": true
1082
+ },
1083
+ "128135": {
1084
+ "content": "<|reserved_special_token_130|>",
1085
+ "lstrip": false,
1086
+ "normalized": false,
1087
+ "rstrip": false,
1088
+ "single_word": false,
1089
+ "special": true
1090
+ },
1091
+ "128136": {
1092
+ "content": "<|reserved_special_token_131|>",
1093
+ "lstrip": false,
1094
+ "normalized": false,
1095
+ "rstrip": false,
1096
+ "single_word": false,
1097
+ "special": true
1098
+ },
1099
+ "128137": {
1100
+ "content": "<|reserved_special_token_132|>",
1101
+ "lstrip": false,
1102
+ "normalized": false,
1103
+ "rstrip": false,
1104
+ "single_word": false,
1105
+ "special": true
1106
+ },
1107
+ "128138": {
1108
+ "content": "<|reserved_special_token_133|>",
1109
+ "lstrip": false,
1110
+ "normalized": false,
1111
+ "rstrip": false,
1112
+ "single_word": false,
1113
+ "special": true
1114
+ },
1115
+ "128139": {
1116
+ "content": "<|reserved_special_token_134|>",
1117
+ "lstrip": false,
1118
+ "normalized": false,
1119
+ "rstrip": false,
1120
+ "single_word": false,
1121
+ "special": true
1122
+ },
1123
+ "128140": {
1124
+ "content": "<|reserved_special_token_135|>",
1125
+ "lstrip": false,
1126
+ "normalized": false,
1127
+ "rstrip": false,
1128
+ "single_word": false,
1129
+ "special": true
1130
+ },
1131
+ "128141": {
1132
+ "content": "<|reserved_special_token_136|>",
1133
+ "lstrip": false,
1134
+ "normalized": false,
1135
+ "rstrip": false,
1136
+ "single_word": false,
1137
+ "special": true
1138
+ },
1139
+ "128142": {
1140
+ "content": "<|reserved_special_token_137|>",
1141
+ "lstrip": false,
1142
+ "normalized": false,
1143
+ "rstrip": false,
1144
+ "single_word": false,
1145
+ "special": true
1146
+ },
1147
+ "128143": {
1148
+ "content": "<|reserved_special_token_138|>",
1149
+ "lstrip": false,
1150
+ "normalized": false,
1151
+ "rstrip": false,
1152
+ "single_word": false,
1153
+ "special": true
1154
+ },
1155
+ "128144": {
1156
+ "content": "<|reserved_special_token_139|>",
1157
+ "lstrip": false,
1158
+ "normalized": false,
1159
+ "rstrip": false,
1160
+ "single_word": false,
1161
+ "special": true
1162
+ },
1163
+ "128145": {
1164
+ "content": "<|reserved_special_token_140|>",
1165
+ "lstrip": false,
1166
+ "normalized": false,
1167
+ "rstrip": false,
1168
+ "single_word": false,
1169
+ "special": true
1170
+ },
1171
+ "128146": {
1172
+ "content": "<|reserved_special_token_141|>",
1173
+ "lstrip": false,
1174
+ "normalized": false,
1175
+ "rstrip": false,
1176
+ "single_word": false,
1177
+ "special": true
1178
+ },
1179
+ "128147": {
1180
+ "content": "<|reserved_special_token_142|>",
1181
+ "lstrip": false,
1182
+ "normalized": false,
1183
+ "rstrip": false,
1184
+ "single_word": false,
1185
+ "special": true
1186
+ },
1187
+ "128148": {
1188
+ "content": "<|reserved_special_token_143|>",
1189
+ "lstrip": false,
1190
+ "normalized": false,
1191
+ "rstrip": false,
1192
+ "single_word": false,
1193
+ "special": true
1194
+ },
1195
+ "128149": {
1196
+ "content": "<|reserved_special_token_144|>",
1197
+ "lstrip": false,
1198
+ "normalized": false,
1199
+ "rstrip": false,
1200
+ "single_word": false,
1201
+ "special": true
1202
+ },
1203
+ "128150": {
1204
+ "content": "<|reserved_special_token_145|>",
1205
+ "lstrip": false,
1206
+ "normalized": false,
1207
+ "rstrip": false,
1208
+ "single_word": false,
1209
+ "special": true
1210
+ },
1211
+ "128151": {
1212
+ "content": "<|reserved_special_token_146|>",
1213
+ "lstrip": false,
1214
+ "normalized": false,
1215
+ "rstrip": false,
1216
+ "single_word": false,
1217
+ "special": true
1218
+ },
1219
+ "128152": {
1220
+ "content": "<|reserved_special_token_147|>",
1221
+ "lstrip": false,
1222
+ "normalized": false,
1223
+ "rstrip": false,
1224
+ "single_word": false,
1225
+ "special": true
1226
+ },
1227
+ "128153": {
1228
+ "content": "<|reserved_special_token_148|>",
1229
+ "lstrip": false,
1230
+ "normalized": false,
1231
+ "rstrip": false,
1232
+ "single_word": false,
1233
+ "special": true
1234
+ },
1235
+ "128154": {
1236
+ "content": "<|reserved_special_token_149|>",
1237
+ "lstrip": false,
1238
+ "normalized": false,
1239
+ "rstrip": false,
1240
+ "single_word": false,
1241
+ "special": true
1242
+ },
1243
+ "128155": {
1244
+ "content": "<|reserved_special_token_150|>",
1245
+ "lstrip": false,
1246
+ "normalized": false,
1247
+ "rstrip": false,
1248
+ "single_word": false,
1249
+ "special": true
1250
+ },
1251
+ "128156": {
1252
+ "content": "<|reserved_special_token_151|>",
1253
+ "lstrip": false,
1254
+ "normalized": false,
1255
+ "rstrip": false,
1256
+ "single_word": false,
1257
+ "special": true
1258
+ },
1259
+ "128157": {
1260
+ "content": "<|reserved_special_token_152|>",
1261
+ "lstrip": false,
1262
+ "normalized": false,
1263
+ "rstrip": false,
1264
+ "single_word": false,
1265
+ "special": true
1266
+ },
1267
+ "128158": {
1268
+ "content": "<|reserved_special_token_153|>",
1269
+ "lstrip": false,
1270
+ "normalized": false,
1271
+ "rstrip": false,
1272
+ "single_word": false,
1273
+ "special": true
1274
+ },
1275
+ "128159": {
1276
+ "content": "<|reserved_special_token_154|>",
1277
+ "lstrip": false,
1278
+ "normalized": false,
1279
+ "rstrip": false,
1280
+ "single_word": false,
1281
+ "special": true
1282
+ },
1283
+ "128160": {
1284
+ "content": "<|reserved_special_token_155|>",
1285
+ "lstrip": false,
1286
+ "normalized": false,
1287
+ "rstrip": false,
1288
+ "single_word": false,
1289
+ "special": true
1290
+ },
1291
+ "128161": {
1292
+ "content": "<|reserved_special_token_156|>",
1293
+ "lstrip": false,
1294
+ "normalized": false,
1295
+ "rstrip": false,
1296
+ "single_word": false,
1297
+ "special": true
1298
+ },
1299
+ "128162": {
1300
+ "content": "<|reserved_special_token_157|>",
1301
+ "lstrip": false,
1302
+ "normalized": false,
1303
+ "rstrip": false,
1304
+ "single_word": false,
1305
+ "special": true
1306
+ },
1307
+ "128163": {
1308
+ "content": "<|reserved_special_token_158|>",
1309
+ "lstrip": false,
1310
+ "normalized": false,
1311
+ "rstrip": false,
1312
+ "single_word": false,
1313
+ "special": true
1314
+ },
1315
+ "128164": {
1316
+ "content": "<|reserved_special_token_159|>",
1317
+ "lstrip": false,
1318
+ "normalized": false,
1319
+ "rstrip": false,
1320
+ "single_word": false,
1321
+ "special": true
1322
+ },
1323
+ "128165": {
1324
+ "content": "<|reserved_special_token_160|>",
1325
+ "lstrip": false,
1326
+ "normalized": false,
1327
+ "rstrip": false,
1328
+ "single_word": false,
1329
+ "special": true
1330
+ },
1331
+ "128166": {
1332
+ "content": "<|reserved_special_token_161|>",
1333
+ "lstrip": false,
1334
+ "normalized": false,
1335
+ "rstrip": false,
1336
+ "single_word": false,
1337
+ "special": true
1338
+ },
1339
+ "128167": {
1340
+ "content": "<|reserved_special_token_162|>",
1341
+ "lstrip": false,
1342
+ "normalized": false,
1343
+ "rstrip": false,
1344
+ "single_word": false,
1345
+ "special": true
1346
+ },
1347
+ "128168": {
1348
+ "content": "<|reserved_special_token_163|>",
1349
+ "lstrip": false,
1350
+ "normalized": false,
1351
+ "rstrip": false,
1352
+ "single_word": false,
1353
+ "special": true
1354
+ },
1355
+ "128169": {
1356
+ "content": "<|reserved_special_token_164|>",
1357
+ "lstrip": false,
1358
+ "normalized": false,
1359
+ "rstrip": false,
1360
+ "single_word": false,
1361
+ "special": true
1362
+ },
1363
+ "128170": {
1364
+ "content": "<|reserved_special_token_165|>",
1365
+ "lstrip": false,
1366
+ "normalized": false,
1367
+ "rstrip": false,
1368
+ "single_word": false,
1369
+ "special": true
1370
+ },
1371
+ "128171": {
1372
+ "content": "<|reserved_special_token_166|>",
1373
+ "lstrip": false,
1374
+ "normalized": false,
1375
+ "rstrip": false,
1376
+ "single_word": false,
1377
+ "special": true
1378
+ },
1379
+ "128172": {
1380
+ "content": "<|reserved_special_token_167|>",
1381
+ "lstrip": false,
1382
+ "normalized": false,
1383
+ "rstrip": false,
1384
+ "single_word": false,
1385
+ "special": true
1386
+ },
1387
+ "128173": {
1388
+ "content": "<|reserved_special_token_168|>",
1389
+ "lstrip": false,
1390
+ "normalized": false,
1391
+ "rstrip": false,
1392
+ "single_word": false,
1393
+ "special": true
1394
+ },
1395
+ "128174": {
1396
+ "content": "<|reserved_special_token_169|>",
1397
+ "lstrip": false,
1398
+ "normalized": false,
1399
+ "rstrip": false,
1400
+ "single_word": false,
1401
+ "special": true
1402
+ },
1403
+ "128175": {
1404
+ "content": "<|reserved_special_token_170|>",
1405
+ "lstrip": false,
1406
+ "normalized": false,
1407
+ "rstrip": false,
1408
+ "single_word": false,
1409
+ "special": true
1410
+ },
1411
+ "128176": {
1412
+ "content": "<|reserved_special_token_171|>",
1413
+ "lstrip": false,
1414
+ "normalized": false,
1415
+ "rstrip": false,
1416
+ "single_word": false,
1417
+ "special": true
1418
+ },
1419
+ "128177": {
1420
+ "content": "<|reserved_special_token_172|>",
1421
+ "lstrip": false,
1422
+ "normalized": false,
1423
+ "rstrip": false,
1424
+ "single_word": false,
1425
+ "special": true
1426
+ },
1427
+ "128178": {
1428
+ "content": "<|reserved_special_token_173|>",
1429
+ "lstrip": false,
1430
+ "normalized": false,
1431
+ "rstrip": false,
1432
+ "single_word": false,
1433
+ "special": true
1434
+ },
1435
+ "128179": {
1436
+ "content": "<|reserved_special_token_174|>",
1437
+ "lstrip": false,
1438
+ "normalized": false,
1439
+ "rstrip": false,
1440
+ "single_word": false,
1441
+ "special": true
1442
+ },
1443
+ "128180": {
1444
+ "content": "<|reserved_special_token_175|>",
1445
+ "lstrip": false,
1446
+ "normalized": false,
1447
+ "rstrip": false,
1448
+ "single_word": false,
1449
+ "special": true
1450
+ },
1451
+ "128181": {
1452
+ "content": "<|reserved_special_token_176|>",
1453
+ "lstrip": false,
1454
+ "normalized": false,
1455
+ "rstrip": false,
1456
+ "single_word": false,
1457
+ "special": true
1458
+ },
1459
+ "128182": {
1460
+ "content": "<|reserved_special_token_177|>",
1461
+ "lstrip": false,
1462
+ "normalized": false,
1463
+ "rstrip": false,
1464
+ "single_word": false,
1465
+ "special": true
1466
+ },
1467
+ "128183": {
1468
+ "content": "<|reserved_special_token_178|>",
1469
+ "lstrip": false,
1470
+ "normalized": false,
1471
+ "rstrip": false,
1472
+ "single_word": false,
1473
+ "special": true
1474
+ },
1475
+ "128184": {
1476
+ "content": "<|reserved_special_token_179|>",
1477
+ "lstrip": false,
1478
+ "normalized": false,
1479
+ "rstrip": false,
1480
+ "single_word": false,
1481
+ "special": true
1482
+ },
1483
+ "128185": {
1484
+ "content": "<|reserved_special_token_180|>",
1485
+ "lstrip": false,
1486
+ "normalized": false,
1487
+ "rstrip": false,
1488
+ "single_word": false,
1489
+ "special": true
1490
+ },
1491
+ "128186": {
1492
+ "content": "<|reserved_special_token_181|>",
1493
+ "lstrip": false,
1494
+ "normalized": false,
1495
+ "rstrip": false,
1496
+ "single_word": false,
1497
+ "special": true
1498
+ },
1499
+ "128187": {
1500
+ "content": "<|reserved_special_token_182|>",
1501
+ "lstrip": false,
1502
+ "normalized": false,
1503
+ "rstrip": false,
1504
+ "single_word": false,
1505
+ "special": true
1506
+ },
1507
+ "128188": {
1508
+ "content": "<|reserved_special_token_183|>",
1509
+ "lstrip": false,
1510
+ "normalized": false,
1511
+ "rstrip": false,
1512
+ "single_word": false,
1513
+ "special": true
1514
+ },
1515
+ "128189": {
1516
+ "content": "<|reserved_special_token_184|>",
1517
+ "lstrip": false,
1518
+ "normalized": false,
1519
+ "rstrip": false,
1520
+ "single_word": false,
1521
+ "special": true
1522
+ },
1523
+ "128190": {
1524
+ "content": "<|reserved_special_token_185|>",
1525
+ "lstrip": false,
1526
+ "normalized": false,
1527
+ "rstrip": false,
1528
+ "single_word": false,
1529
+ "special": true
1530
+ },
1531
+ "128191": {
1532
+ "content": "<|reserved_special_token_186|>",
1533
+ "lstrip": false,
1534
+ "normalized": false,
1535
+ "rstrip": false,
1536
+ "single_word": false,
1537
+ "special": true
1538
+ },
1539
+ "128192": {
1540
+ "content": "<|reserved_special_token_187|>",
1541
+ "lstrip": false,
1542
+ "normalized": false,
1543
+ "rstrip": false,
1544
+ "single_word": false,
1545
+ "special": true
1546
+ },
1547
+ "128193": {
1548
+ "content": "<|reserved_special_token_188|>",
1549
+ "lstrip": false,
1550
+ "normalized": false,
1551
+ "rstrip": false,
1552
+ "single_word": false,
1553
+ "special": true
1554
+ },
1555
+ "128194": {
1556
+ "content": "<|reserved_special_token_189|>",
1557
+ "lstrip": false,
1558
+ "normalized": false,
1559
+ "rstrip": false,
1560
+ "single_word": false,
1561
+ "special": true
1562
+ },
1563
+ "128195": {
1564
+ "content": "<|reserved_special_token_190|>",
1565
+ "lstrip": false,
1566
+ "normalized": false,
1567
+ "rstrip": false,
1568
+ "single_word": false,
1569
+ "special": true
1570
+ },
1571
+ "128196": {
1572
+ "content": "<|reserved_special_token_191|>",
1573
+ "lstrip": false,
1574
+ "normalized": false,
1575
+ "rstrip": false,
1576
+ "single_word": false,
1577
+ "special": true
1578
+ },
1579
+ "128197": {
1580
+ "content": "<|reserved_special_token_192|>",
1581
+ "lstrip": false,
1582
+ "normalized": false,
1583
+ "rstrip": false,
1584
+ "single_word": false,
1585
+ "special": true
1586
+ },
1587
+ "128198": {
1588
+ "content": "<|reserved_special_token_193|>",
1589
+ "lstrip": false,
1590
+ "normalized": false,
1591
+ "rstrip": false,
1592
+ "single_word": false,
1593
+ "special": true
1594
+ },
1595
+ "128199": {
1596
+ "content": "<|reserved_special_token_194|>",
1597
+ "lstrip": false,
1598
+ "normalized": false,
1599
+ "rstrip": false,
1600
+ "single_word": false,
1601
+ "special": true
1602
+ },
1603
+ "128200": {
1604
+ "content": "<|reserved_special_token_195|>",
1605
+ "lstrip": false,
1606
+ "normalized": false,
1607
+ "rstrip": false,
1608
+ "single_word": false,
1609
+ "special": true
1610
+ },
1611
+ "128201": {
1612
+ "content": "<|reserved_special_token_196|>",
1613
+ "lstrip": false,
1614
+ "normalized": false,
1615
+ "rstrip": false,
1616
+ "single_word": false,
1617
+ "special": true
1618
+ },
1619
+ "128202": {
1620
+ "content": "<|reserved_special_token_197|>",
1621
+ "lstrip": false,
1622
+ "normalized": false,
1623
+ "rstrip": false,
1624
+ "single_word": false,
1625
+ "special": true
1626
+ },
1627
+ "128203": {
1628
+ "content": "<|reserved_special_token_198|>",
1629
+ "lstrip": false,
1630
+ "normalized": false,
1631
+ "rstrip": false,
1632
+ "single_word": false,
1633
+ "special": true
1634
+ },
1635
+ "128204": {
1636
+ "content": "<|reserved_special_token_199|>",
1637
+ "lstrip": false,
1638
+ "normalized": false,
1639
+ "rstrip": false,
1640
+ "single_word": false,
1641
+ "special": true
1642
+ },
1643
+ "128205": {
1644
+ "content": "<|reserved_special_token_200|>",
1645
+ "lstrip": false,
1646
+ "normalized": false,
1647
+ "rstrip": false,
1648
+ "single_word": false,
1649
+ "special": true
1650
+ },
1651
+ "128206": {
1652
+ "content": "<|reserved_special_token_201|>",
1653
+ "lstrip": false,
1654
+ "normalized": false,
1655
+ "rstrip": false,
1656
+ "single_word": false,
1657
+ "special": true
1658
+ },
1659
+ "128207": {
1660
+ "content": "<|reserved_special_token_202|>",
1661
+ "lstrip": false,
1662
+ "normalized": false,
1663
+ "rstrip": false,
1664
+ "single_word": false,
1665
+ "special": true
1666
+ },
1667
+ "128208": {
1668
+ "content": "<|reserved_special_token_203|>",
1669
+ "lstrip": false,
1670
+ "normalized": false,
1671
+ "rstrip": false,
1672
+ "single_word": false,
1673
+ "special": true
1674
+ },
1675
+ "128209": {
1676
+ "content": "<|reserved_special_token_204|>",
1677
+ "lstrip": false,
1678
+ "normalized": false,
1679
+ "rstrip": false,
1680
+ "single_word": false,
1681
+ "special": true
1682
+ },
1683
+ "128210": {
1684
+ "content": "<|reserved_special_token_205|>",
1685
+ "lstrip": false,
1686
+ "normalized": false,
1687
+ "rstrip": false,
1688
+ "single_word": false,
1689
+ "special": true
1690
+ },
1691
+ "128211": {
1692
+ "content": "<|reserved_special_token_206|>",
1693
+ "lstrip": false,
1694
+ "normalized": false,
1695
+ "rstrip": false,
1696
+ "single_word": false,
1697
+ "special": true
1698
+ },
1699
+ "128212": {
1700
+ "content": "<|reserved_special_token_207|>",
1701
+ "lstrip": false,
1702
+ "normalized": false,
1703
+ "rstrip": false,
1704
+ "single_word": false,
1705
+ "special": true
1706
+ },
1707
+ "128213": {
1708
+ "content": "<|reserved_special_token_208|>",
1709
+ "lstrip": false,
1710
+ "normalized": false,
1711
+ "rstrip": false,
1712
+ "single_word": false,
1713
+ "special": true
1714
+ },
1715
+ "128214": {
1716
+ "content": "<|reserved_special_token_209|>",
1717
+ "lstrip": false,
1718
+ "normalized": false,
1719
+ "rstrip": false,
1720
+ "single_word": false,
1721
+ "special": true
1722
+ },
1723
+ "128215": {
1724
+ "content": "<|reserved_special_token_210|>",
1725
+ "lstrip": false,
1726
+ "normalized": false,
1727
+ "rstrip": false,
1728
+ "single_word": false,
1729
+ "special": true
1730
+ },
1731
+ "128216": {
1732
+ "content": "<|reserved_special_token_211|>",
1733
+ "lstrip": false,
1734
+ "normalized": false,
1735
+ "rstrip": false,
1736
+ "single_word": false,
1737
+ "special": true
1738
+ },
1739
+ "128217": {
1740
+ "content": "<|reserved_special_token_212|>",
1741
+ "lstrip": false,
1742
+ "normalized": false,
1743
+ "rstrip": false,
1744
+ "single_word": false,
1745
+ "special": true
1746
+ },
1747
+ "128218": {
1748
+ "content": "<|reserved_special_token_213|>",
1749
+ "lstrip": false,
1750
+ "normalized": false,
1751
+ "rstrip": false,
1752
+ "single_word": false,
1753
+ "special": true
1754
+ },
1755
+ "128219": {
1756
+ "content": "<|reserved_special_token_214|>",
1757
+ "lstrip": false,
1758
+ "normalized": false,
1759
+ "rstrip": false,
1760
+ "single_word": false,
1761
+ "special": true
1762
+ },
1763
+ "128220": {
1764
+ "content": "<|reserved_special_token_215|>",
1765
+ "lstrip": false,
1766
+ "normalized": false,
1767
+ "rstrip": false,
1768
+ "single_word": false,
1769
+ "special": true
1770
+ },
1771
+ "128221": {
1772
+ "content": "<|reserved_special_token_216|>",
1773
+ "lstrip": false,
1774
+ "normalized": false,
1775
+ "rstrip": false,
1776
+ "single_word": false,
1777
+ "special": true
1778
+ },
1779
+ "128222": {
1780
+ "content": "<|reserved_special_token_217|>",
1781
+ "lstrip": false,
1782
+ "normalized": false,
1783
+ "rstrip": false,
1784
+ "single_word": false,
1785
+ "special": true
1786
+ },
1787
+ "128223": {
1788
+ "content": "<|reserved_special_token_218|>",
1789
+ "lstrip": false,
1790
+ "normalized": false,
1791
+ "rstrip": false,
1792
+ "single_word": false,
1793
+ "special": true
1794
+ },
1795
+ "128224": {
1796
+ "content": "<|reserved_special_token_219|>",
1797
+ "lstrip": false,
1798
+ "normalized": false,
1799
+ "rstrip": false,
1800
+ "single_word": false,
1801
+ "special": true
1802
+ },
1803
+ "128225": {
1804
+ "content": "<|reserved_special_token_220|>",
1805
+ "lstrip": false,
1806
+ "normalized": false,
1807
+ "rstrip": false,
1808
+ "single_word": false,
1809
+ "special": true
1810
+ },
1811
+ "128226": {
1812
+ "content": "<|reserved_special_token_221|>",
1813
+ "lstrip": false,
1814
+ "normalized": false,
1815
+ "rstrip": false,
1816
+ "single_word": false,
1817
+ "special": true
1818
+ },
1819
+ "128227": {
1820
+ "content": "<|reserved_special_token_222|>",
1821
+ "lstrip": false,
1822
+ "normalized": false,
1823
+ "rstrip": false,
1824
+ "single_word": false,
1825
+ "special": true
1826
+ },
1827
+ "128228": {
1828
+ "content": "<|reserved_special_token_223|>",
1829
+ "lstrip": false,
1830
+ "normalized": false,
1831
+ "rstrip": false,
1832
+ "single_word": false,
1833
+ "special": true
1834
+ },
1835
+ "128229": {
1836
+ "content": "<|reserved_special_token_224|>",
1837
+ "lstrip": false,
1838
+ "normalized": false,
1839
+ "rstrip": false,
1840
+ "single_word": false,
1841
+ "special": true
1842
+ },
1843
+ "128230": {
1844
+ "content": "<|reserved_special_token_225|>",
1845
+ "lstrip": false,
1846
+ "normalized": false,
1847
+ "rstrip": false,
1848
+ "single_word": false,
1849
+ "special": true
1850
+ },
1851
+ "128231": {
1852
+ "content": "<|reserved_special_token_226|>",
1853
+ "lstrip": false,
1854
+ "normalized": false,
1855
+ "rstrip": false,
1856
+ "single_word": false,
1857
+ "special": true
1858
+ },
1859
+ "128232": {
1860
+ "content": "<|reserved_special_token_227|>",
1861
+ "lstrip": false,
1862
+ "normalized": false,
1863
+ "rstrip": false,
1864
+ "single_word": false,
1865
+ "special": true
1866
+ },
1867
+ "128233": {
1868
+ "content": "<|reserved_special_token_228|>",
1869
+ "lstrip": false,
1870
+ "normalized": false,
1871
+ "rstrip": false,
1872
+ "single_word": false,
1873
+ "special": true
1874
+ },
1875
+ "128234": {
1876
+ "content": "<|reserved_special_token_229|>",
1877
+ "lstrip": false,
1878
+ "normalized": false,
1879
+ "rstrip": false,
1880
+ "single_word": false,
1881
+ "special": true
1882
+ },
1883
+ "128235": {
1884
+ "content": "<|reserved_special_token_230|>",
1885
+ "lstrip": false,
1886
+ "normalized": false,
1887
+ "rstrip": false,
1888
+ "single_word": false,
1889
+ "special": true
1890
+ },
1891
+ "128236": {
1892
+ "content": "<|reserved_special_token_231|>",
1893
+ "lstrip": false,
1894
+ "normalized": false,
1895
+ "rstrip": false,
1896
+ "single_word": false,
1897
+ "special": true
1898
+ },
1899
+ "128237": {
1900
+ "content": "<|reserved_special_token_232|>",
1901
+ "lstrip": false,
1902
+ "normalized": false,
1903
+ "rstrip": false,
1904
+ "single_word": false,
1905
+ "special": true
1906
+ },
1907
+ "128238": {
1908
+ "content": "<|reserved_special_token_233|>",
1909
+ "lstrip": false,
1910
+ "normalized": false,
1911
+ "rstrip": false,
1912
+ "single_word": false,
1913
+ "special": true
1914
+ },
1915
+ "128239": {
1916
+ "content": "<|reserved_special_token_234|>",
1917
+ "lstrip": false,
1918
+ "normalized": false,
1919
+ "rstrip": false,
1920
+ "single_word": false,
1921
+ "special": true
1922
+ },
1923
+ "128240": {
1924
+ "content": "<|reserved_special_token_235|>",
1925
+ "lstrip": false,
1926
+ "normalized": false,
1927
+ "rstrip": false,
1928
+ "single_word": false,
1929
+ "special": true
1930
+ },
1931
+ "128241": {
1932
+ "content": "<|reserved_special_token_236|>",
1933
+ "lstrip": false,
1934
+ "normalized": false,
1935
+ "rstrip": false,
1936
+ "single_word": false,
1937
+ "special": true
1938
+ },
1939
+ "128242": {
1940
+ "content": "<|reserved_special_token_237|>",
1941
+ "lstrip": false,
1942
+ "normalized": false,
1943
+ "rstrip": false,
1944
+ "single_word": false,
1945
+ "special": true
1946
+ },
1947
+ "128243": {
1948
+ "content": "<|reserved_special_token_238|>",
1949
+ "lstrip": false,
1950
+ "normalized": false,
1951
+ "rstrip": false,
1952
+ "single_word": false,
1953
+ "special": true
1954
+ },
1955
+ "128244": {
1956
+ "content": "<|reserved_special_token_239|>",
1957
+ "lstrip": false,
1958
+ "normalized": false,
1959
+ "rstrip": false,
1960
+ "single_word": false,
1961
+ "special": true
1962
+ },
1963
+ "128245": {
1964
+ "content": "<|reserved_special_token_240|>",
1965
+ "lstrip": false,
1966
+ "normalized": false,
1967
+ "rstrip": false,
1968
+ "single_word": false,
1969
+ "special": true
1970
+ },
1971
+ "128246": {
1972
+ "content": "<|reserved_special_token_241|>",
1973
+ "lstrip": false,
1974
+ "normalized": false,
1975
+ "rstrip": false,
1976
+ "single_word": false,
1977
+ "special": true
1978
+ },
1979
+ "128247": {
1980
+ "content": "<|reserved_special_token_242|>",
1981
+ "lstrip": false,
1982
+ "normalized": false,
1983
+ "rstrip": false,
1984
+ "single_word": false,
1985
+ "special": true
1986
+ },
1987
+ "128248": {
1988
+ "content": "<|reserved_special_token_243|>",
1989
+ "lstrip": false,
1990
+ "normalized": false,
1991
+ "rstrip": false,
1992
+ "single_word": false,
1993
+ "special": true
1994
+ },
1995
+ "128249": {
1996
+ "content": "<|reserved_special_token_244|>",
1997
+ "lstrip": false,
1998
+ "normalized": false,
1999
+ "rstrip": false,
2000
+ "single_word": false,
2001
+ "special": true
2002
+ },
2003
+ "128250": {
2004
+ "content": "<|reserved_special_token_245|>",
2005
+ "lstrip": false,
2006
+ "normalized": false,
2007
+ "rstrip": false,
2008
+ "single_word": false,
2009
+ "special": true
2010
+ },
2011
+ "128251": {
2012
+ "content": "<|reserved_special_token_246|>",
2013
+ "lstrip": false,
2014
+ "normalized": false,
2015
+ "rstrip": false,
2016
+ "single_word": false,
2017
+ "special": true
2018
+ },
2019
+ "128252": {
2020
+ "content": "<|reserved_special_token_247|>",
2021
+ "lstrip": false,
2022
+ "normalized": false,
2023
+ "rstrip": false,
2024
+ "single_word": false,
2025
+ "special": true
2026
+ },
2027
+ "128253": {
2028
+ "content": "<|reserved_special_token_248|>",
2029
+ "lstrip": false,
2030
+ "normalized": false,
2031
+ "rstrip": false,
2032
+ "single_word": false,
2033
+ "special": true
2034
+ },
2035
+ "128254": {
2036
+ "content": "<|reserved_special_token_249|>",
2037
+ "lstrip": false,
2038
+ "normalized": false,
2039
+ "rstrip": false,
2040
+ "single_word": false,
2041
+ "special": true
2042
+ },
2043
+ "128255": {
2044
+ "content": "<|reserved_special_token_250|>",
2045
+ "lstrip": false,
2046
+ "normalized": false,
2047
+ "rstrip": false,
2048
+ "single_word": false,
2049
+ "special": true
2050
+ },
2051
+ "128256": {
2052
+ "content": "<pad>",
2053
+ "lstrip": false,
2054
+ "normalized": false,
2055
+ "rstrip": false,
2056
+ "single_word": false,
2057
+ "special": true
2058
+ },
2059
+ "128257": {
2060
+ "content": "<image>",
2061
+ "lstrip": false,
2062
+ "normalized": false,
2063
+ "rstrip": false,
2064
+ "single_word": false,
2065
+ "special": true
2066
+ },
2067
+ "128258": {
2068
+ "content": "<action>",
2069
+ "lstrip": false,
2070
+ "normalized": false,
2071
+ "rstrip": false,
2072
+ "single_word": false,
2073
+ "special": true
2074
+ },
2075
+ "128259": {
2076
+ "content": "<image_start>",
2077
+ "lstrip": false,
2078
+ "normalized": false,
2079
+ "rstrip": false,
2080
+ "single_word": false,
2081
+ "special": true
2082
+ },
2083
+ "128260": {
2084
+ "content": "<image_end>",
2085
+ "lstrip": false,
2086
+ "normalized": false,
2087
+ "rstrip": false,
2088
+ "single_word": false,
2089
+ "special": true
2090
+ }
2091
+ },
2092
+ "bos_token": "<|begin_of_text|>",
2093
+ "chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
2094
+ "clean_up_tokenization_spaces": true,
2095
+ "eos_token": "<|eot_id|>",
2096
+ "max_length": null,
2097
+ "model_input_names": [
2098
+ "input_ids",
2099
+ "attention_mask"
2100
+ ],
2101
+ "model_max_length": 3072,
2102
+ "pad_to_multiple_of": null,
2103
+ "pad_token": "<pad>",
2104
+ "pad_token_type_id": 0,
2105
+ "padding_side": "right",
2106
+ "processor_class": "MagmaProcessor",
2107
+ "tokenizer_class": "PreTrainedTokenizerFast"
2108
+ }
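
The tokenizer_config.json added above registers the Llama-3-style special tokens, the extra multimodal markers (`<image_start>`, `<image>`, `<image_end>`, `<action>`), a dedicated `<pad>` token, and the chat template used with `MagmaProcessor`, all on top of `PreTrainedTokenizerFast`. As a minimal sketch of how this configuration is consumed (assuming the Hugging Face repo id `microsoft/Magma-8B`; a local path to a clone of this repository works the same way), one might load the tokenizer and render a conversation with the bundled template:

```python
# Minimal usage sketch -- the repo id "microsoft/Magma-8B" is an assumption;
# point from_pretrained at a local clone of this repository if it differs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)

# The multimodal tokens registered at ids 128257-128260 resolve to single ids.
print(tokenizer.convert_tokens_to_ids(["<image>", "<action>", "<image_start>", "<image_end>"]))
print(tokenizer.pad_token, tokenizer.eos_token)  # "<pad>" and "<|eot_id|>" per the config above

# Example conversation; the system/user strings here are illustrative only.
messages = [
    {"role": "system", "content": "You are an agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is shown in this image?"},
]

# apply_chat_template renders the Jinja template from tokenizer_config.json:
# <|begin_of_text|> followed by <|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|> blocks,
# plus the assistant header when add_generation_prompt=True.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```

Because the image and action markers are registered as special added tokens, they are never split into sub-tokens during encoding and can be stripped on decode with `skip_special_tokens=True`; the sketch above only illustrates the tokenizer-level configuration recorded in this commit.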