This was initially a tweet thread. I thought it was insightful and long enough that I should (re)polish it into a post for posterity. If you’re coming from Twitter, the main addition is the two paragraphs starting with “Your critic should…”.

I don’t think LLM-based reward models will ever be able to catch (non-obvious) bugs/bad code and/or bad lines of reasoning. And it has nothing to do with training data: it’s because humans can’t do it either — in the same setting. Here’s what I mean: I was thinking about “what makes bad code bad”, and my first thought was “oh, you know it when you see it”. For obvious cases (e.g. a complicated nest of if/elses), I think that is true.

But I think it’s not at all true for much of bad code. People (and models) are very good at theorizing — seeing structure and purpose where neither exists. Bad code will look more or less right and make it past a semi-rushed PR review — or even a proper one. You only realize it’s bad when it breaks and you have trouble understanding why, or how to fix it. That’s because it’s easy to project structure and purpose onto code when you’re brushing over it or thinking a bit abstractly. When it breaks, you have to step through the code with a concrete input and an expected output. That makes every step concrete — there’s no more “this vaguely looks right”: it either produces the expected intermediate/final output, or it doesn’t. This is why writing good unit tests improves code quality in the present: crystallizing inputs and outputs and trying to ensure good coverage forces you to think through every step at a practical level.

In order for LLMs to act as good guides for reasoning training, they must themselves be agentic/autonomous AND function more like a discriminator in a GAN: “Alright, so R1 gives me this code or proof, does it actually work out when I apply it to this input?” This is similar to when you’re in office hours and the TA leads you through an example problem before testing you with a novel input. It’s very easy to convince yourself that you get it from the initial explanation — you have no way of really knowing until you try doing it yourself.
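To make that concrete, here’s a minimal sketch (the function and the test are hypothetical, not taken from any real codebase): a helper that vaguely looks right when skimmed, and a unit test whose concrete input and expected output leave no room for projecting intent onto it.

```python
# Hypothetical example: a helper that "vaguely looks right" when you skim it.
def merge_intervals(intervals):
    """Merge overlapping [start, end] intervals into a minimal set."""
    merged = []
    for start, end in intervals:  # bug: silently assumes the input is already sorted
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

# A concrete input and a concrete expected output make every step checkable.
def test_merge_intervals_unsorted():
    assert merge_intervals([[5, 7], [1, 3], [2, 6]]) == [[1, 7]]

try:
    test_merge_intervals_unsorted()
except AssertionError:
    # The buggy version returns [[5, 7]] instead of [[1, 7]].
    print("caught it:", merge_intervals([[5, 7], [1, 3], [2, 6]]))
```

Skimming the loop, it’s easy to read the intent (“merge overlaps”) into it; the test forces you to trace what it actually does with an unsorted input.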

Your critic should help you catch two big failure cases: a. you have an overfit reasoning chain, and b. the reasoning chain just fudges things. In the former case, your reasoning chain gets to the right answer, but the solution it found is actually wrong — it just happens to work in this case, and it looks right. An example of an overfit reasoning chain is: “the area of a 5 by 5 rectangle is 25, which is equal to 5 squared, so the area of any rectangle is equal to the height squared”. The model has found a reasoning chain that doesn’t generalize, and it would be caught if it simply tried applying the same chain to another input, like a 4 by 10 rectangle. Now you might be thinking: how the fuck would the model come up with another input to use for testing if it can’t solve the problem to begin with? That’s a valid concern, and there are two responses: a. even stepping through the original input would be quite useful, and b. if we’re in the verifiable domain, where problems are curated by humans with clear-cut answers, then the problem set can just come with an extra input for the critic to work through.
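Mechanically, catching the overfit chain looks something like the sketch below (expressing the chain’s “rule” as a function is my framing, purely for illustration):

```python
# Hypothetical sketch: the rule the overfit reasoning chain landed on, as a function.
def proposed_rule(height, width):
    return height ** 2          # the chain's claim: "area equals the height squared"

def ground_truth(height, width):
    return height * width

# It happens to hold on the input the chain was built around...
assert proposed_rule(5, 5) == ground_truth(5, 5) == 25

# ...but one held-out input from the problem curator exposes it.
print(proposed_rule(4, 10), ground_truth(4, 10))  # 16 vs. 40: the chain doesn't generalize
```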

In the second case (fudging), the LLM just cheats/lies to get to the right answer. An example is: “if \(4 \mathbin{@} 4 = 16\), then \(a \mathbin{@} b\) must equal \((a + 1) + (b + 2)\), because \((4 + 1) + (4 + 2) = 16\)”. If you go through and critique this chain of reasoning step by step, you’ll at least have a decent chance of catching the mistake — purely because you haven’t already committed to saying that this string of text is coherent, unlike the generative model that produced the reasoning chain.
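Re-executing the fudged step concretely is all it takes to catch it. A tiny sketch (the variable names are mine):

```python
# Re-run the single arithmetic step the chain asserted.
a, b = 4, 4
claimed = (a + 1) + (b + 2)   # the chain claims this equals 16
print(claimed)                # 11
print(claimed == 16)          # False: the step was asserted, never checked
```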

There’s a natural way to train models to do this: use the reasoning model as its own critic. Have the model produce a reasoning chain and pass that chain to a critic (which is the same model, just in a fresh context window). Ask the critic to validate the reasoning on a variation of the problem, i.e. a new input. The critic then produces its own reasoning chain as it walks through the problem before finally saying “Pass” or “Fail”. Finally, use your automated verifier to check whether the initial reasoning chain’s final answer was correct AND whether the critic’s verdict was correct. Train on both objectives, but your critic is far more important because it is a self-verifier: if your initial reasoning is wrong, you can backtrack and try again, but if your critic is wrong, you can only hope and pray.
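Here’s a rough sketch of that loop in Python. Every interface in it (`model.generate`, `verifier.check_answer`, `trainer.update`, the `fresh_context` flag, the `problem.prompt`/`problem.variant_prompt` fields, the `critic_prompt` helper, the 2x critic weight) is a hypothetical placeholder meant only to pin down the data flow, not a real API.

```python
# Hypothetical sketch of the proposed loop; none of these interfaces are a real API.

def critic_prompt(chain, variant_prompt):
    # Ask the critic to re-apply the chain's approach to a new input and give a verdict.
    return (
        "Here is a proposed reasoning chain:\n" + chain + "\n\n"
        "Apply the same approach, step by step, to this variation of the problem, "
        "then answer 'Pass' if the chain holds up and 'Fail' if it doesn't:\n"
        + variant_prompt
    )

def training_step(model, verifier, trainer, problem):
    # 1. The model produces a reasoning chain and a final answer.
    chain, answer = model.generate(problem.prompt)

    # 2. The same model, in a fresh context window, acts as critic: it walks the chain
    #    through a curated variation of the problem and ends with "Pass" or "Fail".
    critic_chain, verdict = model.generate(
        critic_prompt(chain, problem.variant_prompt), fresh_context=True
    )

    # 3. The automated verifier grades both. One simple reading of "the critic's verdict
    #    was correct": the verdict matches whether the original answer was right.
    answer_correct = verifier.check_answer(problem, answer)
    verdict_correct = (verdict == "Pass") == answer_correct

    # 4. Train on both objectives, weighting the critic more heavily: a wrong first
    #    attempt can be retried, but a wrong critic goes unnoticed.
    trainer.update(chain, reward=float(answer_correct))
    trainer.update(critic_chain, reward=2.0 * float(verdict_correct))  # weight is arbitrary
```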

This seems much closer to how the critic functions in other RL problems where the PPO algorithm is used. It’s dynamic — adjusting as the actor (i.e. the model) learns — and it’s grounded in direct, relatively simple, clear-cut rewards. In this case, the direct reward signal comes from the automated verifier. Contrast this with a static reward model trained beforehand on “human preference”.