|
## Preparing Data for YOLO-World |
|
|
|
### Overview |
|
|
|
For pre-training YOLO-World, we adopt several datasets, as listed in the table below:
|
|
|
| Data | Samples | Type | Boxes |
| :-- | :-----: | :---: | :---: |
| Objects365v1 | 609k | detection | 9,621k |
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |
|
|
|
### Dataset Directory |
|
|
|
We put all the data into the `data` directory, organized as follows:
|
|
|
```bash
├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   └── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   └── images
└── objects365v1
    ├── annotations
    ├── train
    └── val
```
|
**NOTE**: We strongly suggest that you check the directories and paths in the dataset part of the config file, especially the values of `ann_file`, `data_root`, and `data_prefix`.
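
For example, with the layout above, an Objects365v1 training entry in the config would reference these paths roughly as follows. This is a minimal sketch: the variable name and the exact dataset type are illustrative, so check them against the configs shipped with the repo.

```python
# A minimal sketch matching the directory layout above; the variable
# name and dataset type are illustrative, not taken from the docs.
obj365v1_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5Objects365V1Dataset',
        data_root='data/objects365v1/',                # top-level dataset dir
        ann_file='annotations/objects365_train.json',  # relative to data_root
        data_prefix=dict(img='train/'),                # image dir under data_root
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/obj365v1_class_texts.json',
    pipeline=train_pipeline)
```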
|
|
|
We provide the annotations of the pre-training data in the table below:
|
|
|
| Data | Images | Annotation File |
| :--- | :----- | :-------------- |
| Objects365v1 | [`Objects365 train`](https://opendatalab.com/OpenDataLab/Objects365_v1) | [`objects365_train.json`](https://opendatalab.com/OpenDataLab/Objects365_v1) |
| MixedGrounding | [`GQA`](https://nlp.stanford.edu/data/gqa/images.zip) | [`final_mixed_train_no_coco.json`](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations/final_mixed_train_no_coco.json) |
| Flickr30k | [`Flickr30k`](https://shannon.cs.illinois.edu/DenotationGraph/) | [`final_flickr_separateGT_train.json`](https://huggingface.co/GLIPModel/GLIP/tree/main/mdetr_annotations/final_flickr_separateGT_train.json) |
| LVIS-minival | [`COCO val2017`](https://cocodataset.org/) | [`lvis_v1_minival_inserted_image_name.json`](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_v1_minival_inserted_image_name.json) |
|
|
|
**Acknowledgement:** We sincerely thank [GLIP](https://github.com/microsoft/GLIP) and [mdetr](https://github.com/ashkamath/mdetr) for providing the annotation files for pre-training. |
|
|
|
|
|
### Dataset Classes
|
|
|
> For fine-tuning YOLO-World on closed-set object detection, using `MultiModalDataset` is recommended.
|
|
|
#### Setting CLASSES/Categories |
|
|
|
If you use a `COCO-format` custom dataset, you **do not** need to define a new dataset class for custom vocabularies/categories. Simply set the classes in the config file through `metainfo=dict(classes=your_classes)`:
|
|
|
```python
coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        metainfo=dict(classes=your_classes),
        data_root='data/your_data',
        ann_file='annotations/your_annotation.json',
        data_prefix=dict(img='images/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/your_class_texts.json',
    pipeline=train_pipeline)
```
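
Here `data_root` is the root directory of your dataset, `ann_file` and `data_prefix` are resolved relative to it, and `class_text_path` points to a text json (described in the next section) whose entries are expected to follow the same order as `your_classes`.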
|
|
|
|
|
For training YOLO-World, we mainly adopt two kinds of dataset classes:
|
|
|
#### 1. `MultiModalDataset` |
|
|
|
`MultiModalDataset` is a simple wrapper around a pre-defined dataset class, such as `Objects365` or `COCO`, which adds the category texts to the dataset instance for formatting the input texts.
|
|
|
**Text JSON** |
|
|
|
The json file is a list of lists: each inner list contains the text(s) for one category, formatted as follows:
|
|
|
```json
[
    ["A_1", "A_2"],
    ["B"],
    ["C_1", "C_2", "C_3"],
    ...
]
```
|
|
|
We have provided the text json files for [`LVIS`](../data/texts/lvis_v1_class_texts.json), [`COCO`](../data/texts/coco_class_texts.json), and [`Objects365`](../data/texts/obj365v1_class_texts.json).
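
If you need a text json for your own categories, it is easy to generate one from a class list. A minimal sketch (the class names and output path are placeholders):

```python
import json

# Placeholder class names; each category maps to a list of texts,
# so synonyms can be added as extra entries in the inner lists.
your_classes = ['cat', 'dog', 'remote control']
class_texts = [[name] for name in your_classes]

with open('data/texts/your_class_texts.json', 'w') as f:
    json.dump(class_texts, f)
```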
|
|
|
#### 2. `YOLOv5MixedGroundingDataset` |
|
|
|
`YOLOv5MixedGroundingDataset` extends the `COCO` dataset by supporting loading texts/captions from the json file. It is designed for datasets such as `MixedGrounding` or `Flickr30K`, which provide text tokens for each object.
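
A dataset entry for the mixed grounding data could then look like the following sketch, assuming the `mixed_grounding` layout shown earlier (the variable name `mg_train_dataset` is illustrative):

```python
# A sketch assuming the mixed_grounding layout shown earlier.
mg_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/mixed_grounding/',
    ann_file='annotations/final_mixed_train_no_coco.json',
    data_prefix=dict(img='images/'),
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=train_pipeline)
```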
|
|
|
|
|
|
|
### 🔥 Custom Datasets
|
|
|
For custom datasets, we suggest converting the annotation files according to the intended usage. Note that converting the annotations to the **standard COCO format** is required in either case.
|
|
|
1. **Large vocabulary, grounding, referring:** you can follow the annotation format of the `MixedGrounding` dataset, which adds `caption` and `tokens_positive` to assign a text to each object. The text can be a category name or a noun phrase; see the schematic sketch after this list.
|
|
|
2. **Custom vocabulary (fixed):** you can adopt the `MultiModalDataset` wrapper, as with `Objects365`, and create a **text json** for your custom categories.
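
For the first case, each grounding-style record pairs a box with character offsets into the image caption. A schematic sketch with made-up values (the span `[0, 5]` covers the phrase "a dog" in the caption):

```python
# Schematic example of one image and one annotation in the
# MixedGrounding-style COCO format (all values are made up).
image = {
    'id': 1,
    'file_name': '000001.jpg',
    'height': 480,
    'width': 640,
    'caption': 'a dog next to a red bicycle',
}
annotation = {
    'id': 1,
    'image_id': 1,
    'bbox': [10.0, 20.0, 120.0, 90.0],  # COCO xywh box
    'tokens_positive': [[0, 5]],        # character span of "a dog" in caption
}
```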
|
|
|
|
|
### CC3M Pseudo Annotations |
|
|
|
The following annotations were generated with the automatic labeling process described in our paper, and we report results based on these annotations.
|
|
|
To use CC3M annotations, you need to prepare the `CC3M` images first. |
|
|
|
| Data | Images | Boxes | File |
| :--: | :----: | :---: | :---: |
| CC3M-246K | 246,363 | 820,629 | [Download 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/cc3m_pseudo_annotations.json) |
| CC3M-500K | 536,405 | 1,784,405 | [Download 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/cc3m_pseudo_500k_annotations.json) |
| CC3M-750K | 750,000 | 4,504,805 | [Download 🤗](https://huggingface.co/wondervictor/YOLO-World/blob/main/cc3m_pseudo_750k_annotations.json) |
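
Since the pseudo annotations carry captions and per-object text spans, they can presumably be loaded with the grounding dataset class. A hedged sketch, assuming the images are unpacked to `data/cc3m/images` (a path chosen here for illustration, not prescribed by the docs):

```python
# A sketch only: data_root and the image prefix are assumptions,
# not paths prescribed by the docs.
cc3m_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/cc3m/',
    ann_file='annotations/cc3m_pseudo_annotations.json',
    data_prefix=dict(img='images/'),
    filter_cfg=dict(filter_empty_gt=False, min_size=32),
    pipeline=train_pipeline)
```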