At what level of granularity is data, say for humanoids, useful?
I'm trying to think whether a dumb way of scaling this is to just open a large warehouse in a country where labor is cheap, divide it into hundreds of rooms with green screens, have people do different tasks, etc.
If we can infer the required granularity with just cameras, then it wouldn't be that expensive either.
Really good question.
I think *human* video data is inherently limited for other reasons - there's always an environment mismatch, and I think a lot of "learn from human video data" research papers seem to cap out at 80-90% success rates and can't push past that without lots of robot data as well. And that's on easy tasks.
So imagine you instead have this huge warehouse filled with humanoids. Tesla Optimus could definitely do this (and may in fact be doing it). As you say, green screens can give lots of visual diversity for "free."
But you still have the problem of iteration: it takes a lot of time and money to set up that warehouse full of humanoids. Simulation lets you scale this the same way you might scale *software* - not quite as fast, because you still have to close the loop on hardware of course - but it takes a lot of the real-world iteration out, and that's incredibly valuable.
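To make the "scale like software" point concrete, here's a rough sketch of what data collection in sim can look like - the warehouse becomes a loop over randomized rooms. Everything here (`make_env`, `scripted_policy`, `collect_episode`) is a made-up stand-in, not a real simulator API:

```python
import random

# Hypothetical sketch: scaling data collection by looping over randomized
# simulated rooms instead of building physical ones.

def make_env(seed):
    """Pretend factory for one randomized manipulation environment."""
    rng = random.Random(seed)
    return {
        "lighting": rng.uniform(0.2, 1.0),     # visual diversity (what green screens buy you)
        "object_mass": rng.uniform(0.1, 2.0),  # physical diversity (hard to vary in a real warehouse)
        "table_height": rng.uniform(0.6, 0.9),
        "seed": seed,
    }

def scripted_policy(env, step):
    """Stand-in for whatever produces actions (scripted motions, teleop replay, RL...)."""
    return {"joint_targets": [0.0] * 7, "gripper_closed": step % 2 == 0}

def collect_episode(env, horizon=200):
    return [{"obs": env, "action": scripted_policy(env, t)} for t in range(horizon)]

# A thousand "rooms" is just a bigger range(), not more real estate.
dataset = []
for seed in range(1_000):
    dataset.extend(collect_episode(make_env(seed)))

print(f"collected {len(dataset)} transitions from 1,000 randomized environments")
```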
In addition, you'd be surprised how often you end up with subtly correlated features when collecting data. Big learning-from-demonstration efforts often throw out a TON of data. This is a huge issue if you're collecting in the real world.
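As a toy illustration of what "subtly correlated" means (the episode fields and numbers below are invented for the example, not from any real dataset): if, say, almost every successful demo happens to come from one operator or one lighting condition, the policy can latch onto that instead of the task, and you only find out when you audit the data.

```python
from collections import Counter

# Invented episodes for illustration - not real collection logs.
episodes = [
    {"operator": "A", "lighting": "morning", "success": True},
    {"operator": "A", "lighting": "morning", "success": True},
    {"operator": "A", "lighting": "morning", "success": True},
    {"operator": "B", "lighting": "evening", "success": False},
    {"operator": "B", "lighting": "evening", "success": False},
]

def cooccurrence(episodes, feature):
    """Count how often a nuisance feature co-occurs with episode success."""
    return Counter((ep[feature], ep["success"]) for ep in episodes)

print(cooccurrence(episodes, "lighting"))
# If one lighting condition accounts for nearly all the successes, a policy can
# "solve" the task by reading the lighting rather than doing the manipulation -
# and that's the kind of batch that ends up getting thrown out.
```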
Interesting, thanks for the reply!!
Are world models the 3rd way of generating data then?
Lots of companies are doing amazing work with video gen models, and I wonder if you can fine-tune them on videos of your form factor? (unsure how to incorporate tactile information)
I know 1X and comma are working in this direction, but I'm curious what your mental model of this topic is?
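To make my question concrete, the kind of thing I have in mind is roughly: learn an action-conditioned video/world model, then roll it out to generate synthetic robot experience. A toy sketch, with an entirely made-up interface and placeholder sizes (not claiming this is what 1X or comma actually do):

```python
import torch
import torch.nn as nn

# Toy action-conditioned world model used as a data generator.
# Architecture and shapes are placeholders for illustration only.

class TinyWorldModel(nn.Module):
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, 3 * 64 * 64)  # predicts a small RGB frame

    def step(self, latent, action):
        latent = self.dynamics(torch.cat([latent, action], dim=-1), latent)
        frame = self.decoder(latent).view(-1, 3, 64, 64)
        return latent, frame

model = TinyWorldModel()
latent = torch.zeros(1, 64)                      # in practice: encode a real starting frame
propose_action = lambda lat: torch.randn(1, 7)   # stand-in for a policy or action sampler

# "Generating data" = rolling the learned model forward under candidate actions.
synthetic_rollout = []
for t in range(50):
    action = propose_action(latent)
    latent, frame = model.step(latent, action)
    synthetic_rollout.append((frame.detach(), action))

print(f"generated {len(synthetic_rollout)} synthetic (frame, action) pairs")
```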