Thank you for the great breakdown on world models! I just wrote about AI agents navigating the world of Minecraft, explaining their training methods, and how these actually mimic the behavior of children. I'd love to have your feedback if you are interested in reading about this: https://substack.com/home/post/p-191864960
Great summary! In terms of the categorization, is it unfair to put the DualWorld thing roughly in the video model + inverse dynamics bucket? It seems similar to me, except that the V-JEPA in DualWorld is just more complex (working over a longer sequence and at a faster rate, potentially). Either way, decoupling the robot form and action head from the big video model is a helpful feature for generalizability in both of those categories, IMO.
Yeah, I think that is reasonable. I think the framing of it as hierarchical planning just with video is interesting, although I can't find an actual paper, so who knows.
You are saying that Dreamer etc. don't have good long-horizon capabilities, but then Dreamer4 actually solves the diamond challenge. Do you have any other evidence?
How are world models used in practice with robot hardware? Are they coupled with hardware or can models be delivered separately from hardware and interface with the robot controller?
Decoupling suggests a huge market for building models, with benefits to data hoarders and so on.
It would be quite ironic if "AI hardware" turned out to be humanoid robots rather than lifestyle pins!
Thank you for the clear summary of these models! It's such a useful article.
When they learn how air traffic controllers think.