Discussion about this post

Tim Dingman

I buy that video gen models should be better than VLAs. Having the core of the model be image/video makes more sense than having the core of the model be text.

My question is, what data will improve video generation accuracy? The authors cite that as the primary weakness, not the IDM.

I guess more video in pretraining, but surely there is a more effective post-training recipe or set of ingredients.

Hugo

I think the "auxiliary loss" framing deserves to be pushed further.

Predicting future video isn't just regularization. It forces the model to build something closer to a causal model of physics. That's a far richer signal than action labels alone, which are essentially waypoints with no information about why those waypoints are correct. In a low-data regime like robotics, that difference compounds fast.
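Concretely, I picture something like the sketch below: a shared latent trained with both an action head and a future-video head, so the video term forces the latent to encode dynamics the waypoints alone never would. This is a minimal illustration; the method names (`encode`, `action_head`, `video_head`), batch fields, loss choices, and the `video_weight` are all my assumptions, not the paper's actual recipe.

```python
import torch.nn.functional as F

def combined_loss(model, batch, video_weight=0.1):
    """Hypothetical joint objective: supervised action prediction
    plus an auxiliary future-video prediction term. Field names
    and heads are illustrative, not from the paper."""
    # Shared encoder over past observations.
    latent = model.encode(batch["past_frames"])

    # Action head: the usual VLA-style waypoint supervision.
    pred_actions = model.action_head(latent)
    action_loss = F.mse_loss(pred_actions, batch["actions"])

    # Video head: predict future frames, pushing the latent to
    # carry causal/physical structure beyond the waypoints.
    pred_frames = model.video_head(latent)
    video_loss = F.mse_loss(pred_frames, batch["future_frames"])

    return action_loss + video_weight * video_loss
```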

One thing I've been thinking about: this reframes your closing question. To me, the question might not be "will video generation be distilled away" but "what form of world representation survives the distillation?" You'd distill away pixel-level generation, but the causal reasoning it produced would need to be retained somehow. Whether that's achievable without the video objective as a training scaffold feels like the deeper open question to me.
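Roughly what I mean by distilling the generator away, as a sketch: the student imitates only the teacher's actions, and the pixel-level machinery never appears in the student's objective. Everything here is hypothetical (the `teacher.act` interface, the plain MSE imitation loss); the open question is whether the causal structure the video objective built survives this step.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_actions(teacher, obs):
    # Teacher runs its full video-generation + IDM pipeline; we
    # keep only the resulting actions and discard the pixels.
    return teacher.act(obs)

def distill_step(student, teacher, batch, optimizer):
    """Hypothetical distillation step: the student mimics the
    teacher's actions directly, with no video objective at all."""
    target = teacher_actions(teacher, batch["obs"])
    pred = student(batch["obs"])
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```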

Looking forward to the RoboPapers podcast on DreamZero and your deeper dive into the world models landscape next week.
