How wp.ScopedTimer Found My 12x Speedup

I was benchmarking a gridworld RL environment built on NVIDIA Warp. The native Warp version hit 8.4 million world-steps per second on small grids - impressive. But when I wrapped it with JAX for compatibility with standard RL training pipelines, performance dropped to 1.5M steps/sec. Fair enough, FFI overhead is real.

Then I tried larger grids. At 256x256, the JAX-wrapped version crawled at 113K steps/sec, roughly 13x slower than the same wrapper on small grids. Wrapper overhead shouldn't grow with grid size, so something else was wrong.

Enter wp.ScopedTimer

Warp ships with built-in GPU profiling that I'd never used. Turns out, it's exactly what I needed.

import warp as wp

with wp.ScopedTimer("benchmark", cuda_filter=wp.TIMING_KERNEL, synchronize=True):
    for _ in range(n_steps):
        state, obs, reward, done = step(state, actions, config)

That's it. Three arguments matter:

"benchmark" - Label for the timed block; it shows up in the output
cuda_filter=wp.TIMING_KERNEL - Profile kernel launches (you can also use wp.TIMING_ALL for memcpy, memset, and CUDA graphs)
synchronize=True - Force GPU sync for accurate timing

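When the timer exits, it prints a timeline of every kernel that ran inside the block, followed by a per-kernel summary of total time and call count. Turning that summary into time percentages is simple division. No Nsight Systems setup, no external tools, no friction.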

The Smoking Gun

I wrapped my benchmark loop and ran it on a 64x64 grid:

Kernel	% Time
`obs_flat_system`	83.4%
`action_mask_system`	5.4%
`reset_terrain_2d`	3.5%
`transition_system`	2.8%
...	...

Wait, what? The observation system was eating 83% of my frame time? The actual MDP logic - transitions, rewards, resets - was barely 10% combined. I'd assumed the bottleneck would be in the step logic somewhere. I was wrong.

The Bug

I found the culprit in obs_flat_system. This kernel generates a one-hot position encoding for each agent:

@wp.kernel
def obs_flat_system(state: State, config: Config, obs_data: wp.array2d(dtype=wp.float32)):
    i = wp.tid()
    grid_size = config.grid_width * config.grid_height

    # Clear the array first
    for j in range(grid_size):
        obs_data[i, j] = 0.0

    # Set one-hot at current position
    x = state.pos.x[i, 0]
    y = state.pos.y[i, 0]
    idx = y * config.grid_width + x
    obs_data[i, idx] = 1.0

See the problem? That for j in range(grid_size) loop runs inside each GPU thread. For a 64x64 grid, that's 4,096 memory writes per world, per step. With 2,048 parallel worlds, that's 8.4 million memory writes just to zero an array before writing a single 1.0.
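The arithmetic above is easy to check in plain Python. This is only a back-of-the-envelope sketch: `zeroing_writes_per_step` is a hypothetical helper name, and the 2,048-world count is the one quoted above.

```python
def zeroing_writes_per_step(grid_width: int, grid_height: int, num_worlds: int) -> int:
    """Writes spent zeroing obs_data each step: one per cell, per world."""
    return grid_width * grid_height * num_worlds


NUM_WORLDS = 2048  # world count quoted in the text

for side in (8, 64, 128, 256):
    writes = zeroing_writes_per_step(side, side, NUM_WORLDS)
    print(f"{side}x{side}: {writes:,} zeroing writes per step")
```

The count grows with the square of the grid side, which is why the slowdown only becomes obvious on the larger grids.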

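Warp only unrolls loops whose bounds are known at compile time, and only up to a small limit (wp.config.max_unroll, 16 by default). Here grid_size is a runtime value, so we're stuck with O(grid_size) sequential work per thread. You could parallelize the clear with a 2D kernel launch, but that's num_worlds x grid_size threads just to set a one-hot vector. Seems silly when you can do it in O(1) per world.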

The Fix

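The insight: we don't need to zero the whole array. As long as the observation buffer starts out zeroed and we remember where the last 1.0 was written, we just need to clear the previous position and set the new one: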

@wp.kernel
def obs_flat_system(state: State, config: Config, obs_data: wp.array2d(dtype=wp.float32)):
    i = wp.tid()

    # Clear only the previous position - O(1)
    old_idx = state.prev_flat_obs_idx[i, 0]
    obs_data[i, old_idx] = 0.0

    # Set new position - O(1)
    x = state.pos.x[i, 0]
    y = state.pos.y[i, 0]
    idx = y * config.grid_width + x
    obs_data[i, idx] = 1.0

    # Track for next step
    state.prev_flat_obs_idx[i, 0] = idx

Two writes to obs_data instead of 4,097 on a 64x64 grid (plus one read and one write of the cached index). The kernel went from O(grid_size) to O(1) per world.
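To convince yourself that the incremental update yields the same observations as the full clear, here is a small plain-Python model of one world's observation row. It is a sketch, not Warp code: the function names are hypothetical, and it assumes the row starts zeroed and the cached index always holds a valid cell.

```python
import random


def full_clear_update(obs_row, idx):
    """Original approach: zero every cell, then set the one-hot entry."""
    for j in range(len(obs_row)):
        obs_row[j] = 0.0
    obs_row[idx] = 1.0


def incremental_update(obs_row, prev_idx, idx):
    """Fixed approach: clear the old cell, then set the new one."""
    obs_row[prev_idx] = 0.0
    obs_row[idx] = 1.0
    return idx  # becomes prev_idx on the next step


def check_equivalence(grid_size=64, n_steps=1000, seed=0):
    """Apply both updates to the same random positions and compare rows."""
    rng = random.Random(seed)
    reference = [0.0] * grid_size
    fast = [0.0] * grid_size  # must start zeroed
    prev_idx = 0              # any valid index is fine while the row is all zeros
    for _ in range(n_steps):
        idx = rng.randrange(grid_size)
        full_clear_update(reference, idx)
        prev_idx = incremental_update(fast, prev_idx, idx)
        if reference != fast:
            return False
    return True


print(check_equivalence())
```

Note the order inside `incremental_update`: clearing before setting keeps the result correct when the agent stays in place and `prev_idx == idx`.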

The Results

After the fix, I ran profiling again:

Kernel	% Time (before)	% Time (after)
`obs_flat_system`	83.4%	3.2%
`action_mask_system`	5.4%	27.7%
`reset_terrain_2d`	3.5%	19.5%

The observation kernel dropped from dominant bottleneck to negligible. The "real" work - action masking, terrain resets - now shows up properly in the profile.
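The other kernels did not get slower; their shares grew because the total shrank. A rough way to see this is to rescale the "before" percentages, assuming every other kernel's absolute time stayed the same. That assumption and the helper name `rescale_shares` are mine, not something the profiler reports.

```python
def rescale_shares(before, changed, new_share):
    """Predict percentage shares after `changed` drops to `new_share`,
    assuming every other kernel's absolute time stays the same."""
    others_before = 100.0 - before[changed]      # includes kernels not listed
    scale = (100.0 - new_share) / others_before  # growth factor for the others
    predicted = {k: v * scale for k, v in before.items() if k != changed}
    predicted[changed] = new_share
    return predicted


# "Before" percentages from the table above.
before = {
    "obs_flat_system": 83.4,
    "action_mask_system": 5.4,
    "reset_terrain_2d": 3.5,
}
predicted = rescale_shares(before, "obs_flat_system", 3.2)
for name, pct in predicted.items():
    print(f"{name}: {pct:.1f}%")
```

Under that assumption the sketch predicts roughly 31% and 20% for the two remaining kernels, in the same ballpark as the measured 27.7% and 19.5%. The gap presumably comes from run-to-run variation.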

And the benchmark numbers:

Grid Size	Before (steps/sec)	After (steps/sec)	Speedup
8x8	1.5M	1.4M	~same
64x64	1.6M	1.4M	~same
128x128	441K	1.4M	3x
256x256	113K	1.4M	12x

Large grids now run at the same speed as small grids. The O(n) bottleneck is gone.
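The speedup column is just the ratio of the two throughput figures. A short sketch using the rounded numbers from the table above (the dictionary layout and the `speedup` helper are illustrative):

```python
# Throughput in steps/sec, copied from the table above (rounded values).
results = {
    "8x8": (1_500_000, 1_400_000),
    "64x64": (1_600_000, 1_400_000),
    "128x128": (441_000, 1_400_000),
    "256x256": (113_000, 1_400_000),
}


def speedup(before: float, after: float) -> float:
    """Ratio of new to old throughput; values above 1.0 mean faster."""
    return after / before


for grid, (before, after) in results.items():
    print(f"{grid}: {speedup(before, after):.2f}x")
```

Ratios slightly under 1.0 on the two small grids are what the table rounds to "~same".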

The Warp Profiling Toolkit

Here's the full set of profiling tools Warp provides:

Context Manager (easiest)

with wp.ScopedTimer("my_operation", cuda_filter=wp.TIMING_KERNEL, synchronize=True):
    # Your code here
    pass

Filter options:

wp.TIMING_KERNEL - Kernel launches only
wp.TIMING_MEMCPY - Memory copies
wp.TIMING_MEMSET - Memory sets
wp.TIMING_GRAPH - CUDA graph operations
wp.TIMING_ALL - Everything

Manual API (more control)

wp.timing_begin(cuda_filter=wp.TIMING_ALL)
# ... your code ...
results = wp.timing_end()
wp.timing_print(results)
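If you want a percentage breakdown rather than raw times, you can aggregate the results yourself. The sketch below assumes each timing result exposes `name` and `elapsed` (milliseconds) attributes, which is my understanding of Warp's timing results. `Record` is a stand-in type so the example runs without Warp or a GPU, and the timings are made up for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in for the objects returned by wp.timing_end(); assumed to expose
# .name and .elapsed (milliseconds).
Record = namedtuple("Record", ["name", "elapsed"])


def percent_by_name(results):
    """Sum elapsed time per activity name; return (name, percent), largest first."""
    totals = defaultdict(float)
    for r in results:
        totals[r.name] += r.elapsed
    grand_total = sum(totals.values())
    if grand_total == 0.0:
        return []
    shares = [(name, 100.0 * t / grand_total) for name, t in totals.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Made-up timings purely for illustration, not measurements.
fake_results = [
    Record("kernel_a", 3.0),
    Record("kernel_b", 1.0),
    Record("kernel_a", 5.0),
    Record("kernel_c", 1.0),
]
for name, pct in percent_by_name(fake_results):
    print(f"{name:10s} {pct:5.1f}%")
```

With real data you would pass the list from wp.timing_end() in place of `fake_results`.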

NVTX Integration (for Nsight Systems)

with wp.ScopedTimer("simulation", use_nvtx=True, color="yellow"):
    # Shows up as a named range in Nsight Systems timeline
    pass

Lessons Learned

Profile first, always. I would have spent hours optimizing the wrong kernels. The bottleneck was in the most boring-looking code.
Watch for sequential loops in parallel kernels. Any for loop inside a Warp kernel is suspect. Ask: can this be O(1)? Can it be parallelized with a 2D launch? Can it be replaced with a bulk array operation such as array.zero_() or array.fill_()?
Warp's built-in profiling is good enough. I didn't need Nsight Systems or Nsight Compute to find this bug. wp.ScopedTimer gave me exactly what I needed in three lines of code.
Grid-scaling performance problems are a smell. When performance degrades with grid size, there's probably an O(n) loop hiding somewhere. Profile and look for kernels that grow with your problem dimensions.

The fix took 10 minutes. Finding it with wp.ScopedTimer took 5. The profiling tools are there - use them.

If you are interested in high-performance RL simulations, check out our Reality project - a simulation framework built on the Madrona engine.
