7 Comments
Nathan Lambert:

Seems like this could be missing RLHF as an example of where we can use RL, which already points to how soft verifiers can be mixed in with verifiable ones. I think the appetite for RL is now higher on the capabilities side, so there's room for optimism.

Of course, I agree with all the limitations when it comes to applying classic robotics/RL-style ideas.

Chris Paxton:

Yeah, I briefly mentioned RLHF I think, but not in any detail. I'd like to write a bit more about how these things relate and what an RLHF for robotics would look like.

Godwyll Aikins:

Amazing blog! I’d love to see a part two that explores the tools we can use to address these RL constraints. No RL algorithm can fully overcome them yet, but as you showed, we have ways to mitigate them, especially in robotics. 🦾

Of the three main constraints, input observation is the toughest, especially with RGB images and generalization. Right now, simulations have to be almost 1:1 with the real world. The main solution seems to be adding more modalities or learning better latent representations.

For validation, I’m really curious to see what people come up with using VLMs as a replacement for RLHF. There’s a lot of potential there.

Exploration is still tricky, but starting with a policy trained via supervised learning or offline RL can help bootstrap the process.

But the biggest challenge, IMO, is that the problem always has to be clearly bounded. RL just can’t handle long-horizon tasks without a sophisticated framework. We definitely need a breakthrough in hierarchical RL!

Chris Paxton:

Glad you liked the blog post! I'm an optimist about RL and do think we can address a lot of these limitations.

I think right now there are a number of ways people are solving exploration, but they mostly amount to one of:

1) reward engineering

2) providing a prior such as human demonstrations, which in some ways amounts to a simplified version of (1); there's a rough sketch of this after the list

3) brute force: what we saw in e.g. the DeepSeek papers, and what we increasingly see in robotics.
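
To make (1) and (2) concrete, here's a minimal, hypothetical sketch (not from the post): pretrain the policy on demonstrations as the prior, then fine-tune it with a simple policy-gradient step against an engineered reward. The dimensions, data, and reward below are all stand-ins, not a real pipeline.

```python
# Hypothetical sketch: demonstrations as a prior (2) plus an engineered reward (1).
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# (2) Behavior cloning: regress demonstrated actions from observations (stand-in data).
demo_obs, demo_act = torch.randn(256, obs_dim), torch.randn(256, act_dim)
for _ in range(200):
    bc_loss = ((policy(demo_obs) - demo_act) ** 2).mean()
    opt.zero_grad()
    bc_loss.backward()
    opt.step()

# (1) RL fine-tuning against a hand-designed (engineered) reward, here a placeholder.
def reward_fn(obs, act):
    return -(act ** 2).sum(-1)  # stand-in shaped reward

for _ in range(200):
    obs = torch.randn(64, obs_dim)  # stand-in for environment observations
    dist = torch.distributions.Normal(policy(obs), 0.1)
    act = dist.sample()
    # REINFORCE-style update: increase log-prob of actions in proportion to reward.
    rl_loss = -(dist.log_prob(act).sum(-1) * reward_fn(obs, act)).mean()
    opt.zero_grad()
    rl_loss.backward()
    opt.step()
```

The demonstration prior narrows exploration to plausible behavior before the reward ever has to do any work, which is why (2) looks like a simplified form of reward engineering.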

Chris Paxton:

VLMs for validation and verification are a really exciting direction. One great hope I have is that validation is almost always easier than generation; hopefully we can bootstrap performance this way, by training weaker models that are capable of verifying the trajectories produced by the learner.
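
A minimal, hypothetical sketch of that bootstrapping loop (every component below is a stand-in, not a real model or API): sample trajectories from the learner, score each with a cheaper verifier, and keep only the accepted ones for further training.

```python
import random

def rollout(policy, horizon=10):
    """Stand-in for running the learner in an environment; returns a 'trajectory'."""
    return [policy() for _ in range(horizon)]

def verify(trajectory, threshold=0.5):
    """Stand-in for a weaker verifier (e.g. a VLM asked whether the task succeeded)."""
    return sum(trajectory) / len(trajectory) > threshold

def collect_verified(policy, n_rollouts=128):
    """Rejection sampling: keep only the trajectories the verifier accepts."""
    return [t for t in (rollout(policy) for _ in range(n_rollouts)) if verify(t)]

# The accepted set would then be used to fine-tune the learner
# (self-training / rejection-sampling fine-tuning on verified data).
verified_data = collect_verified(lambda: random.random())
```

The asymmetry is the whole point: the verifier only has to judge success, not produce it, so it can be a much weaker or frozen model.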

Benjamin Riley:

This is terrific, thank you for this thorough exploration of reinforcement learning and "reasoning" models.

Chris Paxton:

Thanks! Glad you liked it.
