# CLIP-RT Pretrained on OXE Data
This is the CLIP-RT model pretrained on Open X-Embodiment (OXE) data. We fine-tuned this model on downstream data, such as robot demonstrations collected in real-world or simulated environments. Please refer to the clip-rt GitHub repository for instructions on fine-tuning this model.
## Hyperparameters
| Category | Details |
|---|---|
| Training hardware | 8 × H100 GPUs (80GB VRAM each) |
| Model size | 1B parameters |
| Loss | Binary cross-entropy |
| Epochs | 20 |
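As a rough illustration of the binary cross-entropy objective listed above, the sketch below scores image–language-action pairs by cosine similarity and applies BCE over the resulting logits. All names, shapes, and the temperature value are illustrative assumptions, not taken from the CLIP-RT codebase.

```python
import torch
import torch.nn.functional as F

def bce_contrastive_loss(image_emb, text_emb, labels, temperature=0.07):
    """Illustrative BCE loss over an image-text similarity matrix.

    image_emb: (B, D) image embeddings
    text_emb:  (B, D) language-action embeddings
    labels:    (B, B) binary matrix; 1 marks a positive pair
    """
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) similarity logits, scaled by a temperature.
    logits = image_emb @ text_emb.T / temperature
    # Each entry is treated as an independent binary classification.
    return F.binary_cross_entropy_with_logits(logits, labels)

if __name__ == "__main__":
    B, D = 4, 8
    img = torch.randn(B, D)
    txt = torch.randn(B, D)
    labels = torch.eye(B)  # diagonal pairs are the positives
    print(bce_contrastive_loss(img, txt, labels).item())
```

Unlike the softmax-based InfoNCE loss of the original CLIP, a BCE formulation treats each pair independently, which allows multiple positives per image.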
## Citation
```bibtex
@article{kang2024cliprt,
  title={CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision},
  author={Kang, Gi-Cheon and Kim, Junghyun and Shim, Kyuhwan and Lee, Jun Ki and Zhang, Byoung-Tak},
  journal={arXiv preprint arXiv:2411.00508},
  year={2024}
}
```