How I Got a 60x Speedup with SoA Megakernels
Lessons learned from building GPU-accelerated RL environments with NVIDIA Warp
When I started building vectorized physics simulations for reinforcement learning, I made all the classic mistakes. Multiple kernel launches. Object-oriented data layouts. Clean abstractions that killed performance.
Then I refactored to Structure of Arrays (SoA) with fused megakernels. The result? 6,920 env steps/sec became 431,000 env steps/sec. A 62x speedup from architecture changes alone.
Here's what I learned.
The Problem: Death by Kernel Launch
My initial bouncing balls simulation looked reasonable. Separate kernels for gravity, integration, wall collisions, ball-ball collisions. Clean separation of concerns:
# The "clean" approach that killed performance
wp.launch(gravity_kernel, dim=num_balls, inputs=[...])
wp.launch(integrate_kernel, dim=num_balls, inputs=[...])
wp.launch(wall_collision_kernel, dim=num_balls, inputs=[...])
self.hash_grid.build(positions) # Sync point
wp.launch(ball_collision_kernel, dim=num_balls, inputs=[...])
Each kernel launch has overhead: ~5-20 microseconds for scheduling, memory barriers, and synchronization. With 4 kernels per step at 60 FPS across 64 parallel worlds, that overhead dominates.
Worse, each kernel reads and writes the same data. The GPU loads positions from global memory, computes gravity, writes back. Then loads positions again, integrates, writes back. The same data touches global memory 8+ times per step.
Result: ~7k env steps/sec. Pathetic for 2048 balls on a GPU that can do trillions of operations per second.
The Solution: SoA + Megakernels
The fix required two architectural changes:
1. Structure of Arrays (SoA)
Instead of an array of ball objects, store each property as its own array:
# Array of Structures (AoS) - cache unfriendly
class Ball:
pos_x: float
pos_y: float
vel_x: float
vel_y: float
radius: float
balls = [Ball() for _ in range(2048)]
# Structure of Arrays (SoA) - GPU friendly
pos_x = wp.zeros(2048, dtype=float)
pos_y = wp.zeros(2048, dtype=float)
vel_x = wp.zeros(2048, dtype=float)
vel_y = wp.zeros(2048, dtype=float)
radius = wp.zeros(2048, dtype=float)
Why does this matter? GPU threads execute in warps of 32. When thread 0 reads pos_x[0], threads 1-31 are reading pos_x[1] through pos_x[31]. With SoA, those 32 reads become a single coalesced memory transaction. With AoS, you get 32 scattered reads to different cache lines.
2. Megakernel Fusion
Combine all physics operations into a single kernel:
@wp.kernel
def physics_megakernel(
# SoA component arrays
pos_x: wp.array(dtype=float),
pos_y: wp.array(dtype=float),
vel_x: wp.array(dtype=float),
vel_y: wp.array(dtype=float),
radius: wp.array(dtype=float),
# Constants
gravity: float,
restitution: float,
bounds_min: float,
bounds_max: float,
dt: float,
):
tid = wp.tid()
# Load once from global memory
px, py = pos_x[tid], pos_y[tid]
vx, vy = vel_x[tid], vel_y[tid]
r = radius[tid]
# All physics in registers
vy = vy + gravity * dt # Gravity
px = px + vx * dt # Integration
py = py + vy * dt
# Wall collisions
if px - r < bounds_min:
px = bounds_min + r
if vx < 0.0: vx = -vx * restitution
# ... more collision logic
# Store once to global memory
pos_x[tid], pos_y[tid] = px, py
vel_x[tid], vel_y[tid] = vx, vy
The key insight: load once, compute everything, store once. All the intermediate physics happens in GPU registers, which are ~100x faster than global memory.
The Benchmark Results
| Architecture | Throughput | Relative |
| Separate kernels | 6,920 steps/sec | 1x |
| Fused megakernel | 472,000 steps/sec | 68x |
| After cleanup | 431,000 steps/sec | 62x |
The slight drop after cleanup was from adding proper abstractions back in - worth it for maintainability while keeping the performance gains.
For context, this is 64 parallel worlds with 32 balls each (2,048 total entities). At 431k env steps/sec, that's 27.6 million ball physics updates per second.
The Pattern: @wp.func for Composition
To keep code maintainable while using megakernels, I use @wp.func for reusable physics logic:
@wp.func
def gravity_system(vel_y: float, gravity: float, dt: float) -> float:
return vel_y + gravity * dt
@wp.func
def integrate_system(pos: float, vel: float, dt: float) -> float:
return pos + vel * dt
@wp.func
def wall_collision_system(
pos: float, vel: float, radius: float,
min_bound: float, max_bound: float, restitution: float
) -> wp.vec2:
if pos - radius < min_bound:
pos = min_bound + radius
if vel < 0.0:
vel = -vel * restitution
elif pos + radius > max_bound:
pos = max_bound - radius
if vel > 0.0:
vel = -vel * restitution
return wp.vec2(pos, vel)
These functions get inlined at compile time - zero overhead, full composability. The megakernel becomes:
@wp.kernel
def physics_megakernel(...):
tid = wp.tid()
# Load
px, py = pos_x[tid], pos_y[tid]
vx, vy = vel_x[tid], vel_y[tid]
r = radius[tid]
# Compose systems (all inlined)
vy = gravity_system(vy, gravity, dt)
px = integrate_system(px, vx, dt)
py = integrate_system(py, vy, dt)
result_x = wall_collision_system(px, vx, r, bounds_min, bounds_max, restitution)
result_y = wall_collision_system(py, vy, r, bounds_min, bounds_max, restitution)
# Store
pos_x[tid], pos_y[tid] = result_x[0], result_y[0]
vel_x[tid], vel_y[tid] = result_x[1], result_y[1]
Clean code, maximum performance.
When You Need Multiple Passes
Not everything can be fused. Ball-ball collision detection requires a spatial data structure (hash grid), which needs synchronized position data. The solution is a two-pass architecture:
def step(self, dt: float):
# Pass 1: Physics megakernel (gravity, integration, wall collision)
self.launch(physics_megakernel, self.total_balls, [...])
# Build spatial structure (sync point)
self.hash_grid.build(self.positions_3d, self.query_radius)
# Pass 2: Collision megakernel (ball-ball with hash grid queries)
self.launch(collision_megakernel, self.total_balls, [...])
Two kernel launches instead of four+. Each pass does maximum work before the sync point.
Key Takeaways
Kernel launch overhead is real. On modern GPUs, you're paying 5-20 microseconds per launch. At high step rates, this dominates.
SoA unlocks coalesced memory access. GPU threads reading adjacent array elements get served in a single memory transaction. This alone can be a 10x+ improvement.
Megakernels keep data in registers. Load once, compute everything, store once. Registers are ~100x faster than global memory.
@wp.func gives you composition without overhead. Warp inlines these at compile time. You get clean, reusable code with zero runtime cost.
Two-pass is often enough. Even algorithms requiring spatial queries can usually be structured as: (1) physics megakernel, (2) build structure, (3) collision megakernel.
The 62x speedup wasn't from a cleverer algorithm. It was from respecting how GPUs actually work: minimize kernel launches, maximize memory coalescing, keep data in registers. The same principles apply whether you're building RL environments, game physics, or any GPU-accelerated simulation.
If you're interested in high-performance RL simulations, check out our ECS Warp project - a GPU-accelerated simulation framework built on NVIDIA Warp.
