Skip to main content

Command Palette

Search for a command to run...

How I Got a 60x Speedup with SoA Megakernels

Published
5 min read

Lessons learned from building GPU-accelerated RL environments with NVIDIA Warp


When I started building vectorized physics simulations for reinforcement learning, I made all the classic mistakes. Multiple kernel launches. Object-oriented data layouts. Clean abstractions that killed performance.

Then I refactored to Structure of Arrays (SoA) with fused megakernels. The result? 6,920 env steps/sec became 431,000 env steps/sec. A 62x speedup from architecture changes alone.

Here's what I learned.

The Problem: Death by Kernel Launch

My initial bouncing balls simulation looked reasonable. Separate kernels for gravity, integration, wall collisions, ball-ball collisions. Clean separation of concerns:

# The "clean" approach that killed performance
wp.launch(gravity_kernel, dim=num_balls, inputs=[...])
wp.launch(integrate_kernel, dim=num_balls, inputs=[...])
wp.launch(wall_collision_kernel, dim=num_balls, inputs=[...])
self.hash_grid.build(positions)  # Sync point
wp.launch(ball_collision_kernel, dim=num_balls, inputs=[...])

Each kernel launch has overhead: ~5-20 microseconds for scheduling, memory barriers, and synchronization. With 4 kernels per step at 60 FPS across 64 parallel worlds, that overhead dominates.

Worse, each kernel reads and writes the same data. The GPU loads positions from global memory, computes gravity, writes back. Then loads positions again, integrates, writes back. The same data touches global memory 8+ times per step.

Result: ~7k env steps/sec. Pathetic for 2048 balls on a GPU that can do trillions of operations per second.

The Solution: SoA + Megakernels

The fix required two architectural changes:

1. Structure of Arrays (SoA)

Instead of an array of ball objects, store each property as its own array:

# Array of Structures (AoS) - cache unfriendly
class Ball:
    pos_x: float
    pos_y: float
    vel_x: float
    vel_y: float
    radius: float

balls = [Ball() for _ in range(2048)]

# Structure of Arrays (SoA) - GPU friendly
pos_x = wp.zeros(2048, dtype=float)
pos_y = wp.zeros(2048, dtype=float)
vel_x = wp.zeros(2048, dtype=float)
vel_y = wp.zeros(2048, dtype=float)
radius = wp.zeros(2048, dtype=float)

Why does this matter? GPU threads execute in warps of 32. When thread 0 reads pos_x[0], threads 1-31 are reading pos_x[1] through pos_x[31]. With SoA, those 32 reads become a single coalesced memory transaction. With AoS, you get 32 scattered reads to different cache lines.

2. Megakernel Fusion

Combine all physics operations into a single kernel:

@wp.kernel
def physics_megakernel(
    # SoA component arrays
    pos_x: wp.array(dtype=float),
    pos_y: wp.array(dtype=float),
    vel_x: wp.array(dtype=float),
    vel_y: wp.array(dtype=float),
    radius: wp.array(dtype=float),
    # Constants
    gravity: float,
    restitution: float,
    bounds_min: float,
    bounds_max: float,
    dt: float,
):
    tid = wp.tid()

    # Load once from global memory
    px, py = pos_x[tid], pos_y[tid]
    vx, vy = vel_x[tid], vel_y[tid]
    r = radius[tid]

    # All physics in registers
    vy = vy + gravity * dt           # Gravity
    px = px + vx * dt                 # Integration
    py = py + vy * dt

    # Wall collisions
    if px - r < bounds_min:
        px = bounds_min + r
        if vx < 0.0: vx = -vx * restitution
    # ... more collision logic

    # Store once to global memory
    pos_x[tid], pos_y[tid] = px, py
    vel_x[tid], vel_y[tid] = vx, vy

The key insight: load once, compute everything, store once. All the intermediate physics happens in GPU registers, which are ~100x faster than global memory.

The Benchmark Results

ArchitectureThroughputRelative
Separate kernels6,920 steps/sec1x
Fused megakernel472,000 steps/sec68x
After cleanup431,000 steps/sec62x

The slight drop after cleanup was from adding proper abstractions back in - worth it for maintainability while keeping the performance gains.

For context, this is 64 parallel worlds with 32 balls each (2,048 total entities). At 431k env steps/sec, that's 27.6 million ball physics updates per second.

The Pattern: @wp.func for Composition

To keep code maintainable while using megakernels, I use @wp.func for reusable physics logic:

@wp.func
def gravity_system(vel_y: float, gravity: float, dt: float) -> float:
    return vel_y + gravity * dt

@wp.func
def integrate_system(pos: float, vel: float, dt: float) -> float:
    return pos + vel * dt

@wp.func
def wall_collision_system(
    pos: float, vel: float, radius: float,
    min_bound: float, max_bound: float, restitution: float
) -> wp.vec2:
    if pos - radius < min_bound:
        pos = min_bound + radius
        if vel < 0.0:
            vel = -vel * restitution
    elif pos + radius > max_bound:
        pos = max_bound - radius
        if vel > 0.0:
            vel = -vel * restitution
    return wp.vec2(pos, vel)

These functions get inlined at compile time - zero overhead, full composability. The megakernel becomes:

@wp.kernel
def physics_megakernel(...):
    tid = wp.tid()

    # Load
    px, py = pos_x[tid], pos_y[tid]
    vx, vy = vel_x[tid], vel_y[tid]
    r = radius[tid]

    # Compose systems (all inlined)
    vy = gravity_system(vy, gravity, dt)
    px = integrate_system(px, vx, dt)
    py = integrate_system(py, vy, dt)

    result_x = wall_collision_system(px, vx, r, bounds_min, bounds_max, restitution)
    result_y = wall_collision_system(py, vy, r, bounds_min, bounds_max, restitution)

    # Store
    pos_x[tid], pos_y[tid] = result_x[0], result_y[0]
    vel_x[tid], vel_y[tid] = result_x[1], result_y[1]

Clean code, maximum performance.

When You Need Multiple Passes

Not everything can be fused. Ball-ball collision detection requires a spatial data structure (hash grid), which needs synchronized position data. The solution is a two-pass architecture:

def step(self, dt: float):
    # Pass 1: Physics megakernel (gravity, integration, wall collision)
    self.launch(physics_megakernel, self.total_balls, [...])

    # Build spatial structure (sync point)
    self.hash_grid.build(self.positions_3d, self.query_radius)

    # Pass 2: Collision megakernel (ball-ball with hash grid queries)
    self.launch(collision_megakernel, self.total_balls, [...])

Two kernel launches instead of four+. Each pass does maximum work before the sync point.

Key Takeaways

  1. Kernel launch overhead is real. On modern GPUs, you're paying 5-20 microseconds per launch. At high step rates, this dominates.

  2. SoA unlocks coalesced memory access. GPU threads reading adjacent array elements get served in a single memory transaction. This alone can be a 10x+ improvement.

  3. Megakernels keep data in registers. Load once, compute everything, store once. Registers are ~100x faster than global memory.

  4. @wp.func gives you composition without overhead. Warp inlines these at compile time. You get clean, reusable code with zero runtime cost.

  5. Two-pass is often enough. Even algorithms requiring spatial queries can usually be structured as: (1) physics megakernel, (2) build structure, (3) collision megakernel.

The 62x speedup wasn't from a cleverer algorithm. It was from respecting how GPUs actually work: minimize kernel launches, maximize memory coalescing, keep data in registers. The same principles apply whether you're building RL environments, game physics, or any GPU-accelerated simulation.


If you're interested in high-performance RL simulations, check out our ECS Warp project - a GPU-accelerated simulation framework built on NVIDIA Warp.

More from this blog

proximal

11 posts