I buy that video gen models should be better than VLAs. Having the core of the model be image/video makes more sense than having the core of the model be text.
My question is, what data will improve video generation accuracy? The authors cite that as the primary weakness, not the IDM.
I guess more video in pretraining, but surely there is a more effective post-training recipe/set of ingredients.
I assume more in-domain data earlier will help; video models have gotten very good, but as with everything related to robotics, I think we won't see the best results until we're actually training with the right data.
I think the "auxiliary loss" framing deserves to be pushed further.
Predicting future video isn't just regularization. It forces the model to build something closer to a causal model of physics. That's a far richer signal than action labels alone, which are essentially waypoints with no information about why those waypoints are correct. In a low-data regime like robotics, that difference compounds fast.
One thing I've been thinking about: this reframes your closing question. To me, the question might not be "will video generation be distilled away" but "what form of world representation survives the distillation?" You'd distill away pixel-level generation, but the causal reasoning it produced would need to be retained somehow. Whether that's achievable without the video objective as a training scaffold feels like the deeper open question to me.
Looking forward to the RoboPapers podcast on DreamZero and your deeper dive into the world models landscape next week.
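To make the auxiliary-loss point concrete, here's a toy sketch of what I mean by video prediction as an auxiliary signal next to the action loss. Everything here (the function name, the squared-error losses, the weight `lam`) is my own illustration, not from the paper:

```python
import numpy as np

# Illustrative only: a combined objective where future-frame prediction is
# an auxiliary loss alongside the action loss. The weight lam is assumed.
def combined_loss(pred_frames, true_frames, pred_actions, true_actions, lam=0.1):
    video_loss = np.mean((pred_frames - true_frames) ** 2)
    action_loss = np.mean((pred_actions - true_actions) ** 2)
    # The video term supervises predicted scene dynamics (the "why" behind
    # the waypoints), not just where the arm should go next.
    return action_loss + lam * video_loss

loss = combined_loss(np.zeros((8, 4)), np.ones((8, 4)),
                     np.zeros((8, 2)), np.zeros((8, 2)))
print(loss)  # 0.1 with these all-zero / all-one dummy inputs
```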
Reminds me a lot of predictive processing. Neat.
Kind of wonder if it would perform better if, in addition to predicting video and robot actions, it also predicted a scratchpad or register, with each subsequent prediction conditioning on the scratchpad as well. That would let it store and use internal state independent of the information in the visuals or arm positions.
EDIT: see for example the paper "Vision Transformers Need Registers":
https://arxiv.org/pdf/2309.16588
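In case it helps, here's a minimal sketch of the register idea in the spirit of that paper: append a few extra tokens to the sequence, let them participate in attention as scratch space, and drop them before decoding. This is my own toy numpy illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_registers(patch_tokens, registers):
    # Concatenate register tokens to the patch sequence so the model has
    # scratch slots that aren't tied to any pixel location or arm state.
    x = np.concatenate([patch_tokens, registers], axis=0)  # (P+R, D)
    scores = softmax(x @ x.T / np.sqrt(x.shape[1]))
    x = x + scores @ x  # one residual self-attention step
    # Registers are dropped before decoding; they only carry internal
    # state through attention, never into the frame/action outputs.
    return x[:patch_tokens.shape[0]]

patches = rng.standard_normal((16, 64))   # 16 patch tokens, dim 64
registers = rng.standard_normal((4, 64))  # 4 register/scratchpad tokens
out = attention_with_registers(patches, registers)
print(out.shape)  # (16, 64)
```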
I think it can predict multiple frames and takes ~8 frames of history right now, so there will be implicit memory. But yeah, I think if you want to do longer context, more tricks might be necessary…
The predicted frames act like sub-goals for the action generator. I'd call them the 'thinking tokens' of the world action model.
Ideally this would be the case, but right now I don't think it's doing that.
I believe this article demonstrates some emerging trends. The powerful generalization ability stems from the pretrained video diffusion model. When presented with an unseen task, if the video diffusion model can envision the correct "image trajectory" to achieve the final goal, then the action decoder can simply perform inverse dynamics to realize it.
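That split can be illustrated with a toy example: if the "frames" are feature vectors and the dynamics are linear, an inverse dynamics model fit by least squares can read actions straight off an imagined trajectory. Everything below is my own made-up illustration, not the paper's IDM:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy forward dynamics: the next "frame" feature is the current feature
# plus a linear effect of the action. B is unknown to the IDM.
B = rng.standard_normal((4, 2))  # 2-d action -> 4-d feature change
def step(s, a):
    return s + a @ B.T

# Collect (s_t, s_{t+1}, a_t) transitions, then fit the inverse dynamics
# model a_t ~= W (s_{t+1} - s_t) by least squares.
deltas, actions = [], []
s = rng.standard_normal(4)
for _ in range(200):
    a = rng.standard_normal(2)
    s_next = step(s, a)
    deltas.append(s_next - s)
    actions.append(a)
    s = s_next
W, *_ = np.linalg.lstsq(np.array(deltas), np.array(actions), rcond=None)

# Given an envisioned "image trajectory", the IDM recovers the action.
a_true = np.array([0.5, -1.0])
s0 = rng.standard_normal(4)
a_hat = (step(s0, a_true) - s0) @ W
print(np.allclose(a_hat, a_true, atol=1e-6))  # True
```

The point of the toy: all the hard generalization lives in producing the right trajectory; the decoding step is comparatively simple.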
Why do you think it has not yet achieved this? I would love to hear your thoughts.
Thanks for another insightful post. As the community pays more attention to the new benchmark scores, it might just mean robotic evaluation becomes more standardized, as vision did before. However, from a fair-evaluation standpoint, it is hard to claim this method is more effective than a pure VLA, because the smaller model size degrades performance so significantly. If a pure VLA leveled up its backbone to the same size and architecture, could it perform just as well? What's your guess?