How wp.ScopedTimer Found My 12x Speedup


I was benchmarking a gridworld RL environment built on NVIDIA Warp. The native Warp version hit 8.4 million world-steps per second on small grids - impressive. But when I wrapped it with JAX for compatibility with standard RL training pipelines, performance dropped to 1.5M steps/sec. Fair enough, FFI overhead is real.

Then I tried larger grids. At 256x256, the JAX-wrapped version crawled at 113K steps/sec. That's a 13x slowdown from small grids. Something was wrong beyond just wrapper overhead.

Enter wp.ScopedTimer

Warp ships with built-in GPU profiling that I'd never used. Turns out, it's exactly what I needed.

import warp as wp

with wp.ScopedTimer("benchmark", cuda_filter=wp.TIMING_KERNEL, synchronize=True):
    for _ in range(n_steps):
        state, obs, reward, done = step(state, actions, config)

That's it. Three parameters:

  • cuda_filter=wp.TIMING_KERNEL - Profile kernel launches (you can also use wp.TIMING_ALL for memcpy, memset, and CUDA graphs)
  • synchronize=True - Force GPU sync for accurate timing
  • The context manager name shows up in the output

When the timer exits, it prints a breakdown of every kernel that ran inside the block, with time percentages. No Nsight Systems setup, no external tools, no friction.

The Smoking Gun

I wrapped my benchmark loop and ran it on a 64x64 grid:

Kernel                % Time
obs_flat_system       83.4%
action_mask_system     5.4%
reset_terrain_2d       3.5%
transition_system      2.8%
...                    ...

Wait, what? The observation system was eating 83% of my frame time? The actual MDP logic - transitions, rewards, resets - was barely 10% combined. I'd assumed the bottleneck would be in the step logic somewhere. I was wrong.
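
The percentage column is just each kernel's share of the total kernel time. Here is a minimal sketch of that aggregation, assuming each timing record carries a kernel name and an elapsed time in milliseconds. The Record type and the sample numbers are hypothetical stand-ins, not Warp's own objects and not my measurements:

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for one profiled kernel launch."""
    name: str
    elapsed_ms: float


def kernel_shares(records):
    """Return {kernel name: percent of total elapsed time}, largest first."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.name] += rec.elapsed_ms
    grand_total = sum(totals.values())
    if grand_total == 0.0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {name: 100.0 * ms / grand_total for name, ms in ranked}


# Made-up launches: two steps of a three-kernel pipeline.
sample = [
    Record("obs", 4.0), Record("mask", 0.5), Record("step", 0.5),
    Record("obs", 4.0), Record("mask", 0.5), Record("step", 0.5),
]
shares = kernel_shares(sample)
for name, pct in shares.items():
    print(f"{name:<8}{pct:6.1f}%")
```

Because the shares always sum to 100%, one kernel shrinking makes every other kernel's percentage grow even if its absolute time is unchanged.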

The Bug

I found the culprit in obs_flat_system. This kernel generates a one-hot position encoding for each agent:

@wp.kernel
def obs_flat_system(state: State, config: Config, obs_data: wp.array2d(dtype=wp.float32)):
    i = wp.tid()
    grid_size = config.grid_width * config.grid_height

    # Clear the array first
    for j in range(grid_size):
        obs_data[i, j] = 0.0

    # Set one-hot at current position
    x = state.pos.x[i, 0]
    y = state.pos.y[i, 0]
    idx = y * config.grid_width + x
    obs_data[i, idx] = 1.0

See the problem? That for j in range(grid_size) loop runs sequentially inside each GPU thread. For a 64x64 grid, that's 4,096 memory writes per world, per step. With 2,048 parallel worlds, that's 8.4 million memory writes per step just to zero an array before writing a single 1.0 per world.
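
The arithmetic is easy to check. This is a small worked example using the numbers above; the 256x256 line assumes the same 2,048 worlds, which is my assumption for illustration rather than a measured configuration:

```python
# Numbers from the text: 64x64 grid, 2,048 parallel worlds.
grid_width, grid_height = 64, 64
num_worlds = 2048
bytes_per_float32 = 4  # obs_data is declared with dtype=wp.float32

writes_per_thread = grid_width * grid_height      # one loop iteration per cell
writes_per_step = writes_per_thread * num_worlds  # one thread per world

print(f"{writes_per_thread:,} writes per world per step")
print(f"{writes_per_step:,} writes per step across all worlds")

# Same arithmetic at 256x256 (assuming the world count stays at 2,048).
writes_256 = 256 * 256 * num_worlds
mib_256 = writes_256 * bytes_per_float32 / 2**20
print(f"{writes_256:,} writes per step, {mib_256:.0f} MiB zeroed per step")
```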

Warp will unroll small loops with constant bounds (up to wp.config.max_unroll iterations), but for larger grids we're stuck with O(grid_size) work per thread. You could parallelize the clear with a 2D kernel launch, but that's num_worlds x grid_size threads just to set one-hot vectors. Seems silly when you can do it in O(1) per world.

The Fix

The insight: we don't need to zero the whole array. We just need to clear the previous position and set the new one:

@wp.kernel
def obs_flat_system(state: State, config: Config, obs_data: wp.array2d(dtype=wp.float32)):
    i = wp.tid()

    # Clear only the previous position - O(1)
    old_idx = state.prev_flat_obs_idx[i, 0]
    obs_data[i, old_idx] = 0.0

    # Set new position - O(1)
    x = state.pos.x[i, 0]
    y = state.pos.y[i, 0]
    idx = y * config.grid_width + x
    obs_data[i, idx] = 1.0

    # Track for next step
    state.prev_flat_obs_idx[i, 0] = idx

Two observation writes instead of 4,096, plus one index write for bookkeeping. The kernel went from O(grid_size) to O(1) work per thread.
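
The O(1) version leans on two things the full clear did not: the observation buffer has to start out zeroed, and the stored previous index has to be valid from the very first step (and after any reset that touches the buffer). Here is a plain-Python reference model of one world's row that checks the two strategies agree under those assumptions. The function names are made up for this sketch and it is not the Warp kernel itself:

```python
import random


def one_hot_full_clear(obs_row, idx):
    """Original strategy: zero every cell, then set the active one."""
    for j in range(len(obs_row)):
        obs_row[j] = 0.0
    obs_row[idx] = 1.0


def one_hot_incremental(obs_row, prev_idx, idx):
    """Fixed strategy: clear only the previous cell, then set the new one.

    Clearing before setting keeps the result correct when the agent
    stays put (prev_idx == idx). Returns the index to remember.
    """
    obs_row[prev_idx] = 0.0
    obs_row[idx] = 1.0
    return idx


grid_width, grid_height = 8, 8
grid_size = grid_width * grid_height
rng = random.Random(0)

slow = [0.0] * grid_size
fast = [0.0] * grid_size  # must start zeroed
prev_idx = 0              # any valid index works while the row is all zeros

mismatches = 0
for _ in range(1000):
    x = rng.randrange(grid_width)
    y = rng.randrange(grid_height)
    idx = y * grid_width + x
    one_hot_full_clear(slow, idx)
    prev_idx = one_hot_incremental(fast, prev_idx, idx)
    if slow != fast:
        mismatches += 1

print("mismatches:", mismatches)
```

If anything else ever writes into the observation row, the incremental version will not clean it up, so the full clear is still the safer choice wherever it is cheap.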

The Results

After the fix, I ran profiling again:

Kernel                Before   After
obs_flat_system       83.4%     3.2%
action_mask_system     5.4%    27.7%
reset_terrain_2d       3.5%    19.5%

The observation kernel dropped from dominant bottleneck to negligible. The "real" work - action masking, terrain resets - now shows up properly in the profile.

And the benchmark numbers:

Grid Size   Before (steps/sec)   After (steps/sec)   Speedup
8x8         1.5M                 1.4M                ~same
64x64       1.6M                 1.4M                ~same
128x128     441K                 1.4M                3x
256x256     113K                 1.4M                12x

Large grids now run at the same speed as small grids. The O(n) bottleneck is gone.
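
For anyone who wants the speedup column unrounded, it is just after divided by before. A quick sketch using the throughput figures from the table:

```python
# Throughput in world-steps per second, copied from the benchmark table.
before = {"8x8": 1.5e6, "64x64": 1.6e6, "128x128": 441e3, "256x256": 113e3}
after = {"8x8": 1.4e6, "64x64": 1.4e6, "128x128": 1.4e6, "256x256": 1.4e6}

speedup = {grid: after[grid] / before[grid] for grid in before}

for grid, ratio in speedup.items():
    print(f"{grid:>8}: {ratio:.1f}x")
```

The two small-grid rows come out slightly below 1x, which is what the table rounds to "~same".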

The Warp Profiling Toolkit

Here's the full set of profiling tools Warp provides:

Context Manager (easiest)

with wp.ScopedTimer("my_operation", cuda_filter=wp.TIMING_KERNEL, synchronize=True):
    # Your code here
    pass

Filter options:

  • wp.TIMING_KERNEL - Kernel launches only
  • wp.TIMING_MEMCPY - Memory copies
  • wp.TIMING_MEMSET - Memory sets
  • wp.TIMING_GRAPH - CUDA graph operations
  • wp.TIMING_ALL - Everything

Manual API (more control)

wp.timing_begin(cuda_filter=wp.TIMING_ALL)
# ... your code ...
results = wp.timing_end()
wp.timing_print(results)

NVTX Integration (for Nsight Systems)

with wp.ScopedTimer("simulation", use_nvtx=True, color="yellow"):
    # Shows up as a named range in Nsight Systems timeline
    pass

Lessons Learned

  1. Profile first, always. I would have spent hours optimizing the wrong kernels. The bottleneck was in the most boring-looking code.

  2. Watch for sequential loops in parallel kernels. Any for loop inside a Warp kernel is suspect. Ask: can this be O(1)? Can it be parallelized with a 2D launch? Can it be replaced with a bulk clear such as array.zero_()?

  3. Warp's built-in profiling is good enough. I didn't need Nsight Systems or Nsight Compute to find this bug. wp.ScopedTimer gave me exactly what I needed in three lines of code.

  4. Grid-scaling performance problems are a smell. When performance degrades with grid size, there's probably an O(n) loop hiding somewhere. Profile and look for kernels that grow with your problem dimensions.
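
Lesson 4 can be checked with nothing but throughput numbers. Here is a rough sketch that estimates the exponent k in "time per step grows like (cell count)^k" from two measurements. The helper name is made up for this example, and the inputs are the before-fix figures from the benchmark table:

```python
import math


def scaling_exponent(side_a, rate_a, side_b, rate_b):
    """Estimate k in: time per step ~ (cell count) ** k, from two runs."""
    cells_ratio = (side_b * side_b) / (side_a * side_a)
    time_ratio = rate_a / rate_b  # time per step is the inverse of throughput
    return math.log(time_ratio) / math.log(cells_ratio)


# Grid side length -> world-steps per second before the fix.
before = {64: 1.6e6, 128: 441e3, 256: 113e3}

k_mid = scaling_exponent(64, before[64], 128, before[128])
k_big = scaling_exponent(128, before[128], 256, before[256])
print(f"64 -> 128:  k = {k_mid:.2f}")
print(f"128 -> 256: k = {k_big:.2f}")

# After the fix, throughput is flat at 1.4M, so the exponent collapses to 0.
k_after = scaling_exponent(128, 1.4e6, 256, 1.4e6)
print(f"after fix:  k = {k_after:.2f}")
```

An exponent close to 1 means step time is growing linearly with the number of cells, which is the signature of a per-cell loop somewhere in the step.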

The fix took 10 minutes. Finding it with wp.ScopedTimer took 5. The profiling tools are there - use them.


If you are interested in high-performance RL simulations, check out our Reality project - a simulation framework built on the Madrona engine.