Daily Robotics June #2 - Getting My Hands Dirty with SmolVLA

Community Article · Published June 4, 2025

Tonight was all about diving deeper into SmolVLA, this time with a hands-on approach. After yesterday’s architectural overview and some high-level theorizing, I was eager to actually run the model and test its capabilities—especially its zero-shot performance. And, as expected, things got interesting pretty quickly.

Day 2 – June 4

a) Zero-Shot Tests: Early Friction

My main goal tonight was to evaluate how well SmolVLA performs in a zero-shot setting: no fine-tuning, no task-specific supervision. I set up a simple robotic task (“stack colored cubes”) and prompted the model with a short task instruction to generate plausible actions.
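For the curious, here's roughly what I was trying to run. Treat it as a sketch rather than working code: it assumes LeRobot's usual policy interface (`from_pretrained` / `select_action`), and the observation keys, tensor shapes, and task string are placeholders that would need to match the checkpoint's expected features.

```python
# Rough sketch of the zero-shot attempt: not working code.
# Assumes LeRobot's standard policy interface; observation keys,
# shapes, and the task string are placeholders.
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# One observation: a camera frame plus the arm's joint state, following
# LeRobot's "observation.*" key convention. Values here are dummies.
batch = {
    "observation.images.top": torch.zeros(1, 3, 224, 224),  # placeholder frame
    "observation.state": torch.zeros(1, 6),                 # placeholder joints
    "task": ["stack the green cube on the blue cube"],
}

with torch.no_grad():
    action = policy.select_action(batch)  # next action from the policy
print(action.shape)
```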

Unfortunately, it didn’t go quite as smoothly as I’d hoped.

Currently, the LeRobot library doesn’t natively support zero-shot inference with the pre-trained SmolVLA model. It requires access to the action and feature normalization stats used during training, which aren’t yet exposed in the config. There’s some discussion around enabling this functionality in the future (which I’m very excited about and will definitely keep an eye on!).
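To make the problem concrete: every recorded LeRobot dataset carries per-feature stats that policies use to build their normalization layers, and that's exactly the piece the released checkpoint doesn't ship. Here's a hedged sketch of where those stats would plug in; the dataset is just a stand-in, and whether `from_pretrained` forwards a `dataset_stats` kwarg is an assumption on my part that may not hold on every LeRobot version.

```python
# Where normalization stats normally come from, and where they'd plug in.
# Whether from_pretrained forwards dataset_stats is an assumption.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Every recorded LeRobot dataset carries per-feature stats...
dataset = LeRobotDataset("lerobot/svla_so100_stacking")
print(dataset.meta.stats["action"].keys())  # e.g. min / max / mean / std

# ...and policies consume them to build (un)normalization layers.
# These are stand-in stats: true zero-shot needs the stats from
# SmolVLA's pre-training mixture, which the checkpoint doesn't expose.
policy = SmolVLAPolicy.from_pretrained(
    "lerobot/smolvla_base",
    dataset_stats=dataset.meta.stats,  # assumed kwarg; stand-in stats only
)
```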

Another angle I wanted to explore was data efficiency. LeRobot’s existing policies typically need several dozen trajectories to perform well. But since SmolVLA is pre-trained on a diverse set of tasks, I was curious whether it could generalize better than models trained from scratch—even with minimal data.

To test that, I decided to build a small custom dataset around my target task.

b) From Friction to Fine-Tuning: Building a Tiny Trajectory Dataset

Since zero-shot didn’t pan out, I shifted gears to something equally exciting: fine-tuning with minimal data. How little data is enough for SmolVLA to perform reliably?

To investigate, I created a small dataset with just 10 trajectories of the arm stacking a green cube on top of a blue one.

(And while you're browsing my HF profile, you'll find dozens of datasets I've shared as part of my ongoing experiments. They're not perfect, but they can still be useful for initializing a model.)
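Before training, it's worth loading the dataset back and poking at it as a sanity check. A quick sketch: the repo id is a placeholder for wherever the dataset was pushed, and attribute names follow LeRobot's current dataset format, so they may differ slightly across versions.

```python
# Quick sanity check on the recording. "Beegbrain/stack_green_on_blue"
# is a placeholder repo id; substitute the actual dataset name.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("Beegbrain/stack_green_on_blue")
print(dataset.num_episodes)    # expect 10 trajectories
print(dataset.fps)             # recording frequency
print(list(dataset.features))  # camera streams, state, action
frame = dataset[0]             # one timestep: image, state, action tensors
```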

With the data ready, I’ll be running a few fine-tuning passes over the next couple of days to see if the model can learn consistent behavior from such a small sample.
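The actual runs will go through LeRobot's `train.py` script (something like `--policy.path=lerobot/smolvla_base` plus the dataset repo id), but conceptually it's a plain PyTorch loop. A minimal sketch, glossing over details like configuring the dataset's delta timestamps for action chunking, and assuming `forward` returns the loss first (the exact return signature varies across LeRobot versions):

```python
# Minimal fine-tuning sketch: plain PyTorch over the tiny dataset.
# Assumptions: policy.forward(batch) returns (loss, aux) as in recent
# LeRobot versions (older ones return a dict with a "loss" key), and
# the placeholder repo id from the inspection snippet above.
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

dataset = LeRobotDataset("Beegbrain/stack_green_on_blue")
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.train()
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

for step, batch in enumerate(loader):
    loss, _ = policy.forward(batch)  # assumed return signature
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```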

Wrapping Up

Today wasn’t without its bumps, but that’s part of the fun. I’ll probably be a bit busier in the coming days, so instead of daily posts, I’ll aim to publish updates every 3–4 days—more time for substance, less noise.

Next up: I’ll share results from this fine-tuning run, include some metrics, and highlight interesting failure cases worth unpacking. I’m also planning a deep dive into the model’s attention maps to better understand what it’s “seeing.”

As always, if you’re working on similar projects, feel free to reach out—or fork the dataset and experiment with it yourself. Let’s keep pushing the boundaries of small-scale robotics models.

Until next time 👋 — @Beeg_brain | huggingface.co/Beegbrain
