Why Reward Shaping Sucks and What You Can Do About It

[Figure: agent navigating obstacles]

In my previous article on reward shaping, I walked through four hard-won lessons about balancing collision penalties in a navigation task. I eventually found the "Goldilocks solution" - a -0.1 penalty that let my agent learn to navigate obstacles. Problem solved, right?

Well, not quite. I kept staring at this chart showing my agent was still colliding with objects way more than I'd like:

[Figure: collision termination chart showing a 90% collision rate]

The bar shows that agents successfully navigated the course only about 10% of the time, with collision being the primary cause of episode termination. My first instinct? Time for another reward shaping parameter sweep! Let's optimize that collision penalty some more...

But then I had a better idea. What if I stopped fighting against reinforcement learning and started leveraging its actual superpower?

The Reward Shaping Trap

Here's the thing about reward shaping: every time you add a shaped reward, you're making an assumption about what good behavior looks like. You're essentially saying "I know better than the Bellman equation." And maybe you do! But you're also:

Adding parameters to tune - More knobs means more ways for things to go wrong
Creating unpredictable side effects - That collision penalty might stop wall-running, but it might also stop the agent from ever getting close to walls, even when that's optimal
Fighting against RL's core strength - The ability to learn implicit objectives through value function approximation

Rich Sutton put it perfectly in "The Bitter Lesson":

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

Sutton's argument is that building in human knowledge of a domain helps in the short term but plateaus in the long run, while general methods that scale with computation keep improving.

Every shaped reward is us trying to inject our own limited domain knowledge instead of letting the algorithm figure it out through computation.

The "Optimize the Thing" Approach

So here's a radical idea: what if we just... optimized the thing we actually want to optimize?

Let's think about it. Collision avoidance isn't really our goal - it's an implicit constraint. Our actual goal is "navigate through the obstacle course successfully." So let's give a reward of +1 when the agent reaches the goal, and 0 otherwise.
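To make the contrast concrete, here's a minimal sketch of the two reward signals. The function names and the reached_goal / collided flags are hypothetical (they are not taken from the actual training code), and the exact form of the shaped version is an assumption; only the -0.1 penalty and the +1 / 0 goal reward come from the text.

```python
def shaped_reward(reached_goal: bool, collided: bool,
                  collision_penalty: float = -0.1) -> float:
    """Assumed shaped variant: goal reward plus a hand-tuned collision penalty."""
    reward = 1.0 if reached_goal else 0.0
    if collided:
        reward += collision_penalty  # the extra knob that has to be tuned
    return reward


def sparse_reward(reached_goal: bool) -> float:
    """Direct objective: +1 for reaching the goal, 0 for everything else."""
    return 1.0 if reached_goal else 0.0


if __name__ == "__main__":
    print(shaped_reward(reached_goal=False, collided=True))
    print(sparse_reward(reached_goal=False))
```

Note that the sparse version has no parameters at all: a crash and a timeout both simply earn nothing.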

"But wait," you might say, "won't the agent just crash into everything?"

Here's where the Bellman equation becomes magical.

The Bellman Equation's Implicit Learning Magic

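The Bellman optimality equation, written here for deterministic transitions (s' is the state the agent lands in after taking action a in state s, and γ is the discount factor), is: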

V(s) = max_a [R(s,a) + γ * V(s')]

This simple equation has a superpower: it propagates value backwards through time. When an agent reaches the goal and gets that +1 reward, the value function learns that all the states leading up to that success were valuable too.

But here's the key insight: episodes where the agent crashes get no reward. So through the magic of temporal difference learning, the value function automatically learns that:

States just before crashes have low value
Actions that lead to crashes have low value
Collision avoidance emerges as an implicit behavior

The agent doesn't need us to explicitly tell it "don't hit walls." It figures out that wall-hitting episodes don't lead to rewards, so it learns to avoid them naturally.
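You can check this claim on a toy problem. The sketch below runs tabular Q-value iteration (the action-value form of the Bellman backup) on a made-up 3x4 grid, not the Madrona environment from this article. Assumptions: moving off the grid or into a "#" cell counts as a crash that ends the episode with 0 reward, reaching "G" ends it with +1, and γ = 0.9. There is no collision penalty anywhere in the code.

```python
GAMMA = 0.9
GRID = ["S..#",   # S = start, . = free cell, # = obstacle, G = goal
        ".#..",
        "...G"]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Return (next_state, reward, done) for a deterministic move."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    in_bounds = 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
    if not in_bounds or GRID[nr][nc] == "#":
        return None, 0.0, True    # crash: episode ends, no reward
    if GRID[nr][nc] == "G":
        return None, 1.0, True    # success: the only reward in the task
    return (nr, nc), 0.0, False


def q_value_iteration(sweeps=50):
    """Repeatedly apply Q(s,a) = R(s,a) + gamma * max_a' Q(s',a')."""
    states = [(r, c) for r, row in enumerate(GRID)
              for c, ch in enumerate(row) if ch in "S."]
    q = {(s, a): 0.0 for s in states for a in ACTIONS}
    for _ in range(sweeps):
        for s in states:
            for a in ACTIONS:
                nxt, reward, done = step(s, a)
                future = 0.0 if done else max(q[(nxt, b)] for b in ACTIONS)
                q[(s, a)] = reward + GAMMA * future
    return q


q = q_value_iteration()
for action in ACTIONS:
    print(action, round(q[((0, 0), action)], 4))
```

At the start cell, the two actions that crash ("up" and "left") end up with a value of exactly 0, while the two safe actions carry the discounted goal reward (0.9^4, since the goal is five moves away). The preference for not crashing falls out of the backup itself.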

This is reinforcement learning at its purest - the algorithm discovering optimal behavior through environmental feedback rather than human-designed reward signals.

The Exploration Challenge (And Its Solution)

"Okay sure," you might think, "but how will the agent ever explore enough to find the goal and get that +1 signal?"

This is the classic sparse reward problem. With a fixed spawn point, the agent would need to randomly stumble to the goal before learning anything useful. That could take... a while.

The solution is beautifully simple: random initialization of agent position.

Let's look at exploration patterns around a fixed spawn point:

[Figure: fixed spawn exploration]

The agent can only explore a tiny bubble around its starting position before crashing. Most episodes end with no reward signal.

Now let's see what happens with random spawn positions:

[Figure: random spawn exploration]

By randomizing the start position across our 4096 parallel simulations (thanks, Madrona engine!), we ensure some agents spawn close to the goal. These lucky agents provide immediate reward signal right from the start of training.

Some agents spawn near the goal and reach the edge of the map → immediate reward signal → learning begins → value function starts propagating back → collision avoidance emerges naturally.
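Here's a rough way to see the effect without a GPU. This is a minimal sketch on a hypothetical 10x10 open grid with a purely random policy, not the actual simulator: stepping off the grid counts as a crash, the goal sits in one corner, and the 4096 episodes are run one after another rather than in parallel. It compares how often an untrained agent sees any reward from a fixed spawn in the far corner versus a uniformly random spawn.

```python
import random

SIZE = 10                       # hypothetical toy grid, SIZE x SIZE
GOAL = (SIZE - 1, SIZE - 1)     # goal in the corner opposite the fixed spawn
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def run_episode(rng, start, max_steps=200):
    """Random-policy rollout: 1.0 if the goal is reached, 0.0 on crash or timeout."""
    r, c = start
    for _ in range(max_steps):
        dr, dc = rng.choice(MOVES)
        r, c = r + dr, c + dc
        if (r, c) == GOAL:
            return 1.0
        if not (0 <= r < SIZE and 0 <= c < SIZE):
            return 0.0          # left the grid: treated as a collision
    return 0.0


def fixed_start(rng):
    """Always spawn in the corner farthest from the goal."""
    return (0, 0)


def random_start(rng):
    """Spawn uniformly on any cell except the goal itself."""
    while True:
        start = (rng.randrange(SIZE), rng.randrange(SIZE))
        if start != GOAL:
            return start


def success_rate(spawn, episodes=4096, seed=0):
    """Fraction of random-policy episodes that collect the +1 reward."""
    rng = random.Random(seed)
    wins = sum(run_episode(rng, spawn(rng)) for _ in range(episodes))
    return wins / episodes


fixed_rate = success_rate(fixed_start)
random_rate = success_rate(random_start)
print(f"fixed spawn success rate:  {fixed_rate:.4f}")
print(f"random spawn success rate: {random_rate:.4f}")
```

The exact numbers depend on the seed and the grid, but the random-spawn batch should contain many more rewarded episodes than the fixed-spawn batch, and those are the episodes the value function starts propagating from.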

The Results: Proof in the Pudding

After 1000 episodes of training with this approach:

[Figure: trained agent successfully navigating]

And here's the real proof - the performance metrics we actually care about:

[Figure: direct optimization performance chart]

Look at that beautiful, steady increase in success rate! No parameter tuning, no reward engineering, no unpredictable side effects. Just clean, direct optimization of the thing we wanted to optimize all along.

When to "Optimize the Thing" Directly

This approach works best when:

You can clearly define success - "Reach the goal" is unambiguous
You have sufficient simulation capacity - Random initialization needs lots of parallel episodes
The implicit constraints are learnable - Collision avoidance can be learned from episode outcomes
You want robust, generalizable behavior - No hand-crafted biases to break in new environments
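To put a rough number on "sufficient simulation capacity": if a single randomly spawned agent has some small probability p of reaching the goal by luck, the chance that at least one of N independent agents does is 1 - (1 - p)^N. The sketch below just evaluates that formula; the value p = 0.001 is made up for illustration and was not measured in this environment.

```python
def prob_any_success(p_single: float, n_envs: int) -> float:
    """Probability that at least one of n_envs independent episodes succeeds."""
    return 1.0 - (1.0 - p_single) ** n_envs


if __name__ == "__main__":
    p = 0.001  # hypothetical per-episode chance of a lucky success
    for n in (1, 64, 4096):
        print(n, round(prob_any_success(p, n), 4))
```

With a per-episode chance that small, a handful of environments will almost never see a reward in a given batch, while a few thousand almost always will.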

When reward shaping might still be necessary:

Safety-critical applications where you can't afford exploration failures
Environments where the implicit constraints are too complex to learn from sparse signals
When you have strong domain knowledge about efficient learning paths

The Bitter Lesson Applied

My navigation problem perfectly illustrates Sutton's bitter lesson. I spent time crafting collision penalties, tuning magnitudes, trying curriculum learning - all attempts to inject my domain knowledge about "good navigation behavior."

The direct optimization approach said: "Forget your assumptions. Let computation and the Bellman equation figure it out."

And you know what? The algorithm found a better solution than my hand-crafted rewards ever did.

Sometimes to Optimize the Thing...

The best approach really is to optimize it.

Next time you catch yourself adding another shaped reward, ask: "Am I helping the algorithm or fighting against it?" Sometimes the most powerful thing you can do is get out of reinforcement learning's way and let it do what it does best - learn optimal behavior from environmental feedback.

Your agent might just surprise you with how good it gets when you stop micromanaging its learning process.

Interested in high-performance RL simulations like the one used in this article? Check out our Reality project - a simulation framework built on the Madrona engine that can run 4096+ parallel environments on a single GPU.
