How wp.ScopedTimer Found My 12x Speedup
I was benchmarking a gridworld RL environment built on NVIDIA Warp. The native Warp version hit 8.4 million world-steps per second on small grids - impressive. But when I wrapped it with JAX for compatibility with standard RL training pipelines, performance dropped to 1.5M steps/sec. Fair enough, FFI overhead is real.
Then I tried larger grids. At 256x256, the JAX-wrapped version crawled at 113K steps/sec. That's a 13x slowdown from small grids. Something was wrong beyond just wrapper overhead.
Enter wp.ScopedTimer
Warp ships with built-in GPU profiling that I'd never used. Turns out, it's exactly what I needed.
import warp as wp

with wp.ScopedTimer("benchmark", cuda_filter=wp.TIMING_KERNEL, synchronize=True):
    for _ in range(n_steps):
        state, obs, reward, done = step(state, actions, config)
That's it. Three parameters:
- cuda_filter=wp.TIMING_KERNEL - Profile kernel launches (you can also use wp.TIMING_ALL for memcpy, memset, and CUDA graphs)
- synchronize=True - Force GPU sync for accurate timing
- The context manager name shows up in the output
When the timer exits, it prints a breakdown of every kernel that ran inside the block, with time percentages. No Nsight Systems setup, no external tools, no friction.
The Smoking Gun
I wrapped my benchmark loop and ran it on a 64x64 grid:
| Kernel | % Time |
| obs_flat_system | 83.4% |
| action_mask_system | 5.4% |
| reset_terrain_2d | 3.5% |
| transition_system | 2.8% |
| ... | ... |
Wait, what? The observation system was eating 83% of my frame time? The actual MDP logic - transitions, rewards, resets - was barely 10% combined. I'd assumed the bottleneck would be in the step logic somewhere. I was wrong.
The Bug
I found the culprit in obs_flat_system. This kernel generates a one-hot position encoding for each agent:
@wp.kernel
def obs_flat_system(state: State, config: Config, obs_data: wp.array2d(dtype=wp.float32)):
    i = wp.tid()
    grid_size = config.grid_width * config.grid_height
    # Clear the array first
    for j in range(grid_size):
        obs_data[i, j] = 0.0
    # Set one-hot at current position
    x = state.pos.x[i, 0]
    y = state.pos.y[i, 0]
    idx = y * config.grid_width + x
    obs_data[i, idx] = 1.0
See the problem? That for j in range(grid_size) loop runs sequentially inside each GPU thread. For a 64x64 grid, that's 4,096 memory writes per world, per step. With 2,048 parallel worlds, that's 8.4 million memory writes every step just to zero an array before writing a single 1.0.
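Warp only unrolls short loops whose bounds are known at compile time (the limit is set by wp.config.max_unroll). Here the bound is a runtime value, so every thread is stuck with O(grid_size) sequential work. You could parallelize the clear with a 2D kernel launch, but that means num_worlds x grid_size threads just to set a one-hot vector. Seems silly when you can do it in O(1) per world.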
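To see how quickly the zeroing cost grows, here is a small back-of-the-envelope calculation in plain Python. It assumes the same 2,048 parallel worlds at every grid size (the post only states that count for the 64x64 case), and the function name is made up for illustration.

```python
NUM_WORLDS = 2048  # assumption: same world count as the 64x64 example


def zeroing_writes_per_step(grid_side, num_worlds=NUM_WORLDS):
    """Writes spent clearing obs_data in one step: one per cell, per world."""
    return num_worlds * grid_side * grid_side


for side in (8, 64, 128, 256):
    writes = zeroing_writes_per_step(side)
    print(f"{side}x{side}: {writes:,} zeroing writes per step")
```

The write count scales with the number of cells, so going from 64x64 to 256x256 multiplies it by 16.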
The Fix
The insight: we don't need to zero the whole array. We just need to clear the previous position and set the new one:
@wp.kernel
def obs_flat_system(state: State, config: Config, obs_data: wp.array2d(dtype=wp.float32)):
    i = wp.tid()
    # Clear only the previous position - O(1)
    old_idx = state.prev_flat_obs_idx[i, 0]
    obs_data[i, old_idx] = 0.0
    # Set new position - O(1)
    x = state.pos.x[i, 0]
    y = state.pos.y[i, 0]
    idx = y * config.grid_width + x
    obs_data[i, idx] = 1.0
    # Track for next step
    state.prev_flat_obs_idx[i, 0] = idx
Two memory operations instead of 4,096. The kernel went from O(grid_size) to O(1).
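The incremental version relies on two preconditions that are easy to miss: the observation row must start out all-zero, and the stored previous index must always point at the cell that currently holds the 1.0 (or at any valid cell while the row is still all-zero). The sketch below is a plain-Python model of the idea, not Warp code, and its function names are hypothetical. It checks that the O(1) update yields the same row as the full clear-and-set on a random walk.

```python
import random


def full_rebuild(obs_row, idx):
    """Original approach: zero the whole row, then set one cell."""
    for j in range(len(obs_row)):
        obs_row[j] = 0.0
    obs_row[idx] = 1.0


def incremental_update(obs_row, prev_idx, idx):
    """Fixed approach: clear the previous cell, then set the new one."""
    obs_row[prev_idx] = 0.0
    obs_row[idx] = 1.0  # set after clearing, so prev_idx == idx still works
    return idx  # becomes prev_idx for the next step


def simulate(grid_size=16 * 16, steps=500, seed=0):
    """Return True if both approaches agree after every step."""
    rng = random.Random(seed)
    reference = [0.0] * grid_size
    fast = [0.0] * grid_size  # precondition: starts all-zero
    prev_idx = 0              # precondition: any valid index while all-zero
    for _ in range(steps):
        idx = rng.randrange(grid_size)
        full_rebuild(reference, idx)
        prev_idx = incremental_update(fast, prev_idx, idx)
        if fast != reference:
            return False
    return True


print(simulate())
```

One consequence of this design: any other code that writes into the observation buffer, such as an environment reset, has to keep the stored index in sync, or a stale 1.0 will be left behind.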
The Results
After the fix, I ran profiling again:
| Kernel | Before | After |
| obs_flat_system | 83.4% | 3.2% |
| action_mask_system | 5.4% | 27.7% |
| reset_terrain_2d | 3.5% | 19.5% |
The observation kernel dropped from dominant bottleneck to negligible. The "real" work - action masking, terrain resets - now shows up properly in the profile.
And the benchmark numbers:
| Grid Size | Before (steps/sec) | After (steps/sec) | Speedup |
| 8x8 | 1.5M | 1.4M | ~same |
| 64x64 | 1.6M | 1.4M | ~same |
| 128x128 | 441K | 1.4M | 3x |
| 256x256 | 113K | 1.4M | 12x |
Large grids now run at the same speed as small grids. The O(n) bottleneck is gone.
The Warp Profiling Toolkit
Here's the full set of profiling tools Warp provides:
Context Manager (easiest)
with wp.ScopedTimer("my_operation", cuda_filter=wp.TIMING_KERNEL, synchronize=True):
    # Your code here
    pass
Filter options:
- wp.TIMING_KERNEL - Kernel launches only
- wp.TIMING_MEMCPY - Memory copies
- wp.TIMING_MEMSET - Memory sets
- wp.TIMING_GRAPH - CUDA graph operations
- wp.TIMING_ALL - Everything
Manual API (more control)
wp.timing_begin(cuda_filter=wp.TIMING_ALL)
# ... your code ...
results = wp.timing_end()
wp.timing_print(results)
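With the manual API you hold the raw results, so you can build your own percentage breakdown instead of relying on the printed report. The sketch below assumes each entry returned by wp.timing_end() exposes a name and an elapsed attribute (elapsed time in milliseconds); check that against your Warp version. The helper name and the sample records are made up so the example runs without a GPU.

```python
from collections import defaultdict, namedtuple


def kernel_shares(results):
    """Sum elapsed time per name; return (name, percent) pairs, largest first."""
    totals = defaultdict(float)
    for r in results:
        totals[r.name] += r.elapsed
    grand_total = sum(totals.values())
    if grand_total == 0.0:
        return []
    shares = [(name, 100.0 * t / grand_total) for name, t in totals.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Stand-in records with made-up timings, used only to exercise the function.
FakeResult = namedtuple("FakeResult", ["name", "elapsed"])
sample = [
    FakeResult("kernel_a", 3.0),
    FakeResult("kernel_b", 1.0),
    FakeResult("kernel_a", 4.0),
]

for name, pct in kernel_shares(sample):
    print(f"{name}: {pct:.1f}%")
```

In real use you would pass the list from wp.timing_end() in place of sample.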
NVTX Integration (for Nsight Systems)
with wp.ScopedTimer("simulation", use_nvtx=True, color="yellow"):
    # Shows up as a named range in Nsight Systems timeline
    pass
Lessons Learned
- Profile first, always. I would have spent hours optimizing the wrong kernels. The bottleneck was in the most boring-looking code.
- Watch for sequential loops in parallel kernels. Any for loop inside a Warp kernel is suspect. Ask: can this be O(1)? Can it be parallelized with a 2D launch? Can it be replaced with an array-level clear such as array.zero_()?
- Warp's built-in profiling is good enough. I didn't need Nsight Systems or Nsight Compute to find this bug. wp.ScopedTimer gave me exactly what I needed in three lines of code.
- Grid-scaling performance problems are a smell. When performance degrades with grid size, there's probably an O(n) loop hiding somewhere. Profile and look for kernels that grow with your problem dimensions.
The fix took 10 minutes. Finding it with wp.ScopedTimer took 5. The profiling tools are there - use them.
If you are interested in high-performance RL simulations, check out our Reality project - a simulation framework built on the Madrona engine.
