Making cross-platform and cross-API GPU particle systems is straightforward with the Tellusim Core SDK. We designed our API to be very efficient at dispatching compute shaders because Tellusim Engine uses a large number of them. In this example, we will use indirect dispatch, the built-in GPU RadixSort algorithm, and instanced particle rendering (for WebGPU compatibility).
A particle system is just an array of objects with the same behavior. Each particle has parameters like position, velocity, lifetime, and rotation. The particle system simulation has two essential parts: particle emission and particle simulation. Particle emission creates new particles with the specified position and parameters. Particle simulation updates particle position according to velocity, and additional external forces can be applied at this step.
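To make the two parts concrete, here is a minimal CPU-side sketch of emission and simulation. It is not the code used by this sample: the `Particle` layout, the `emit()` and `simulate()` helpers, and the rule that `life <= 0` marks a dead particle are assumptions made for illustration only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU-side particle; the field names are illustrative only.
struct Particle {
    float position[3];
    float velocity[3];
    float life;   // remaining lifetime in seconds; <= 0 means the slot is free
};

// Emission: reuse the first dead slot, or report failure when the pool is full.
bool emit(std::vector<Particle> &pool, const float (&position)[3], const float (&velocity)[3], float life) {
    assert(life > 0.0f);
    for(Particle &p : pool) {
        if(p.life > 0.0f) continue;
        for(int i = 0; i < 3; i++) {
            p.position[i] = position[i];
            p.velocity[i] = velocity[i];
        }
        p.life = life;
        return true;
    }
    return false;
}

// Simulation: apply an external force to the velocity, then move the particle.
// ifps is the frame time in seconds.
void simulate(std::vector<Particle> &pool, const float (&gravity)[3], float ifps) {
    for(Particle &p : pool) {
        if(p.life <= 0.0f) continue;
        for(int i = 0; i < 3; i++) {
            p.velocity[i] += gravity[i] * ifps;
            p.position[i] += p.velocity[i] * ifps;
        }
        p.life -= ifps;
    }
}
```

The GPU version described below follows the same idea, but replaces the linear search for a free slot with an index allocator and runs one thread per particle.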
We will start with Application, Window, and Device initialization. It is also important to verify that compute shaders are supported before going any further:
int32_t main(int32_t argc, char **argv) {
// create app
App app(argc, argv);
if(!app.create()) return 1;
// create window
Window window(app.getPlatform(), app.getDevice());
if(!window || !window.setSize(app.getWidth(), app.getHeight())) return 1;
if(!window.create("02 Hello Compute") || !window.setHidden(false)) return 1;
window.setKeyboardPressedCallback([&](uint32_t key, uint32_t code) {
if(key == Window::KeyEsc) window.stop();
});
// create device
Device device(window);
if(!device) return 1;
// check compute shader support
if(!device.hasShader(Shader::TypeCompute)) {
TS_LOG(Error, "compute shader is not supported\n");
return 0;
}
Several structures must stay synchronized between the C++ and shader files, so we declare them in a separate header and include it wherever it is required. The maximum number of simulated particles is limited to 1M (1024 × 1024):
// declarations
#include "main.h"
// parameters
constexpr uint32_t group_size = 128;
constexpr uint32_t max_emitters = 1024;
constexpr uint32_t max_particles = 1024 * 1024;
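The contents of main.h are not shown here. As a rough sketch of what keeping a structure synchronized involves, the hypothetical `ExampleParameters` struct below mirrors a std140 uniform block on the C++ side and checks its layout at compile time. The struct, its fields, and the `Vec4` helper are assumptions for illustration and are not the declarations used by the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a 4-component float vector; std140 aligns vec4 to 16 bytes.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Hypothetical C++ mirror of a std140 uniform block with the same member order.
struct ExampleParameters {
    Vec4 gravity;          // offset 0
    Vec4 wind;             // offset 16
    uint32_t max_count;    // offset 32
    float wind_force;      // offset 36
    float ifps;            // offset 40
    uint32_t padding;      // offset 44, keeps the size a multiple of 16 bytes
};

// Compile-time checks catch a C++/shader layout mismatch early.
static_assert(offsetof(ExampleParameters, wind) == 16, "vec4 members must be 16-byte aligned");
static_assert(offsetof(ExampleParameters, max_count) == 32, "scalars follow the vec4 members");
static_assert(sizeof(ExampleParameters) % 16 == 0, "std140 block size is a multiple of 16 bytes");
```

Placing the vector members first and padding the tail explicitly keeps the C++ and GLSL views of the same bytes identical.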
Kernel is the class that manages compute shaders. Its configuration is similar to the Pipeline configuration from the previous example, except that there are no shader stage masks. A simple particle simulation requires kernels for particle initialization, spawning, and update. Two more kernels update the geometry and the indirect dispatch/draw buffers:
// create init kernel
Kernel init_kernel = device.createKernel().setUniforms(1).setStorages(3, false);
if(!init_kernel.loadShaderGLSL("main.shader", "INIT_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!init_kernel.create()) return 1;
// create emitter kernel
Kernel emitter_kernel = device.createKernel().setUniforms(1).setStorages(5, false).setStorageDynamic(0, true);
if(!emitter_kernel.loadShaderGLSL("main.shader", "EMITTER_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!emitter_kernel.create()) return 1;
// create dispatch kernel
Kernel dispatch_kernel = device.createKernel().setUniforms(1).setStorages(4, false);
if(!dispatch_kernel.loadShaderGLSL("main.shader", "DISPATCH_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!dispatch_kernel.create()) return 1;
// create update kernel
Kernel update_kernel = device.createKernel().setUniforms(1).setStorages(4, false);
if(!update_kernel.loadShaderGLSL("main.shader", "UPDATE_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!update_kernel.create()) return 1;
// create geometry kernel
Kernel geometry_kernel = device.createKernel().setUniforms(1).setStorages(4, false);
if(!geometry_kernel.loadShaderGLSL("main.shader", "GEOMETRY_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!geometry_kernel.create()) return 1;
All particles must be sorted back to front for correct rendering. We will use the built-in RadixSort algorithm with the indirect dispatch and ordering options. In ordering mode, the sort produces integer indices that map the input keys to their sorted positions:
// create radix sort
RadixSort radix_sort;
PrefixScan prefix_scan;
if(!radix_sort.create(device, RadixSort::FlagSingle | RadixSort::FlagIndirect | RadixSort::FlagOrder, prefix_scan, max_particles)) return 1;
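To illustrate what an ordering sort returns, here is a small CPU sketch of a least-significant-digit radix sort that outputs order indices instead of moving the keys. It is not the Tellusim RadixSort implementation. The `distance_to_key()` and `radix_order()` helpers are hypothetical, and the bit-inversion trick for far-to-near order is an assumption, since the key encoding used by the sample is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Non-negative floats keep their order when their bits are read as uint32_t.
// Inverting the bits makes an ascending sort produce far-to-near order.
uint32_t distance_to_key(float distance) {
    assert(distance >= 0.0f);
    uint32_t bits = 0;
    std::memcpy(&bits, &distance, sizeof(bits));
    return ~bits;
}

// Returns order indices such that keys[order[0]] <= keys[order[1]] <= ...
// The keys themselves are never moved.
std::vector<uint32_t> radix_order(const std::vector<uint32_t> &keys) {
    std::vector<uint32_t> order(keys.size()), next(keys.size());
    for(uint32_t i = 0; i < order.size(); i++) order[i] = i;
    // four stable counting passes over 8-bit digits, lowest digit first
    for(uint32_t shift = 0; shift < 32; shift += 8) {
        std::array<uint32_t, 257> offsets = {};
        for(uint32_t index : order) offsets[((keys[index] >> shift) & 0xffu) + 1]++;
        for(uint32_t i = 0; i < 256; i++) offsets[i + 1] += offsets[i];
        for(uint32_t index : order) next[offsets[(keys[index] >> shift) & 0xffu]++] = index;
        order.swap(next);
    }
    return order;
}
```

With distances { 1.0, 300.0, 0.5, 42.0 }, the resulting order is { 1, 3, 0, 2 }: the farthest particle comes first.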
Now it’s time to create storage buffers that will contain particle parameters:
// create compute state buffer
// contains global particle system state
ComputeState state = {};
Buffer state_buffer = device.createBuffer(Buffer::FlagStorage, &state, sizeof(state));
if(!state_buffer) return 1;
// create emitters state buffer
// contains dynamic emitter parameters
Buffer emitters_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(EmitterState) * max_emitters * group_size);
if(!emitters_buffer) return 1;
// create particles state buffer
// contains per-particle state
Buffer particles_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(ParticleState) * max_particles);
if(!particles_buffer) return 1;
// create particle allocator buffer
// contains new particle indices
Buffer allocator_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(uint32_t) * max_particles);
if(!allocator_buffer) return 1;
// create particle distances buffer
// contains camera to particle distances and order indices
Buffer distances_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(uint32_t) * max_particles * 2);
if(!distances_buffer) return 1;
// create particle vertex buffer
// contains particle position, velocity, and color
Buffer vertex_buffer = device.createBuffer(Buffer::FlagVertex | Buffer::FlagStorage, sizeof(Vertex) * max_particles);
if(!vertex_buffer) return 1;
// create particle indices buffer
const uint16_t indices_data[] = { 0, 1, 2, 2, 3, 0 };
Buffer indices_buffer = device.createBuffer(Buffer::FlagIndex, indices_data, sizeof(indices_data));
if(!indices_buffer) return 1;
// create indirect dispatch buffer
Compute::DispatchIndirect dispatch_data = {};
Buffer dispatch_buffer = device.createBuffer(Buffer::FlagStorage | Buffer::FlagIndirect, &dispatch_data, sizeof(dispatch_data));
if(!dispatch_buffer) return 1;
// create indirect draw buffer
Command::DrawElementsIndirect draw_data = {};
Buffer draw_buffer = device.createBuffer(Buffer::FlagStorage | Buffer::FlagIndirect, &draw_data, sizeof(draw_data));
if(!draw_buffer) return 1;
// create sort parameters buffer
RadixSort::DispatchParameters sort_data = {};
Buffer sort_buffer = device.createBuffer(Buffer::FlagStorage, &sort_data, sizeof(sort_data));
if(!sort_buffer) return 1;
The following Compute dispatch initializes the particles. A barrier is required for every buffer that is written here and read in the following stages:
// compute parameters
ComputeParameters compute_parameters = {};
compute_parameters.max_emitters = max_emitters;
compute_parameters.max_particles = max_particles;
compute_parameters.global_gravity = Vector4f(0.0f, 0.0f, -8.0f, 0.0f);
compute_parameters.wind_velocity = Vector4f(0.0f, 0.0f, 4.0f, 0.0f);
compute_parameters.wind_force = 0.2f;
// initialize buffers
{
Compute compute = device.createCompute();
// init kernel
compute.setKernel(init_kernel);
compute.setUniform(0, compute_parameters);
compute.setStorageBuffers(0, { emitters_buffer, particles_buffer, allocator_buffer });
compute.dispatch(max(max_emitters * group_size, max_particles));
compute.barrier({ emitters_buffer, particles_buffer, allocator_buffer });
}
Compute dispatching is simpler than rendering with Command and Pipeline. The initialization kernel does nothing except write to the storage buffers. The allocator buffer holds the indices available for new particles. They are stored in reverse order, so the last element contains the first particle index (0), which gives better memory access. The GLSL source is automatically converted to the shader language of the target platform, including CUDA and HIP compute:
layout(local_size_x = GROUP_SIZE) in;
layout(std140, binding = 0) uniform ComputeParametersBuffer { ComputeParameters compute; };
layout(std430, binding = 1) writeonly buffer EmitterStateBuffer { EmitterState emitters_buffer[]; };
layout(std430, binding = 2) writeonly buffer ParticleStateBuffer { ParticleState particles_buffer[]; };
layout(std430, binding = 3) writeonly buffer IndicesBuffer { uint allocator_buffer[]; };
void main() {
uint global_id = gl_GlobalInvocationID.x;
// initialize emitters
[[branch]] if(global_id < compute.max_emitters * GROUP_SIZE) {
emitters_buffer[global_id].position = vec4(0.0f);
emitters_buffer[global_id].seed = ivec2(global_id);
emitters_buffer[global_id].spawn = 0.0f;
}
// initialize particles
[[branch]] if(global_id < compute.max_particles) {
particles_buffer[global_id].position = vec4(1e16f);
particles_buffer[global_id].velocity = vec4(0.0f);
particles_buffer[global_id].radius = 0.0f;
particles_buffer[global_id].angle = 0.0f;
particles_buffer[global_id].life = 0.0f;
}
// initialize particle indices
[[branch]] if(global_id < compute.max_particles) {
allocator_buffer[global_id] = compute.max_particles - global_id - 1u;
}
}
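The reverse-ordered allocator buffer can be pictured as a stack of free particle indices. The sketch below is a CPU analogy under that assumption. The `IndexAllocator` type is hypothetical, and the emitter and update kernels, which are not shown here, would need an atomic counter in place of `num_free`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU analogy of the allocator buffer: a stack of free indices.
struct IndexAllocator {
    std::vector<uint32_t> indices;   // free particle indices, top of the stack at the back
    uint32_t num_free;               // number of valid entries in indices

    explicit IndexAllocator(uint32_t max_particles) : indices(max_particles), num_free(max_particles) {
        // same pattern as the init kernel: the last element holds index 0
        for(uint32_t i = 0; i < max_particles; i++) indices[i] = max_particles - i - 1u;
    }

    // Pops a free index; returns false when every particle is in use.
    bool allocate(uint32_t &index) {
        if(num_free == 0) return false;
        index = indices[--num_free];
        return true;
    }

    // Pushes the index of a dead particle back for reuse.
    void release(uint32_t index) {
        assert(num_free < indices.size());
        indices[num_free++] = index;
    }
};
```

Because allocation pops from the back, the first particles spawned receive indices 0, 1, 2 and so on, so live particles stay packed at the start of the particle buffer.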
Now everything is ready for the particle system simulation, which must be executed every frame:
// simulate particles
{
Compute compute = device.createCompute();
// compute parameters
compute_parameters.camera = Matrix4x3f::rotateZ(-time * 8.0f) * Vector4f(32.0f, 0.0f, 32.0f, 0.0f);
compute_parameters.num_emitters = emitters.size();
compute_parameters.ifps = ifps;
// emitter kernel
compute.setKernel(emitter_kernel);
compute.setUniform(0, compute_parameters);
compute.setStorageData(0, emitters.get(), emitters.bytes());
compute.setStorageBuffers(1, { state_buffer, emitters_buffer, particles_buffer, allocator_buffer });
compute.dispatch(emitters.size() * group_size);
compute.barrier({ state_buffer, emitters_buffer, particles_buffer, allocator_buffer });
// dispatch kernel
compute.setKernel(dispatch_kernel);
compute.setStorageBuffers(0, { state_buffer, dispatch_buffer, draw_buffer, sort_buffer });
compute.dispatch(1);
compute.barrier({ dispatch_buffer, draw_buffer, sort_buffer });
// update kernel
compute.setKernel(update_kernel);
compute.setUniform(0, compute_parameters);
compute.setStorageBuffers(0, { state_buffer, particles_buffer, allocator_buffer, distances_buffer });
compute.setIndirectBuffer(dispatch_buffer);
compute.dispatchIndirect();
compute.barrier({ state_buffer, particles_buffer, allocator_buffer, distances_buffer });
// sort particles
if(!radix_sort.dispatchIndirect(compute, distances_buffer, sort_buffer, 0, RadixSort::FlagOrder)) return false;
// geometry kernel
compute.setKernel(geometry_kernel);
compute.setUniform(0, compute_parameters);
compute.setStorageBuffers(0, { state_buffer, particles_buffer, distances_buffer, vertex_buffer });
compute.setIndirectBuffer(dispatch_buffer);
compute.dispatchIndirect();
}
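The dispatch kernel itself is not listed here. As a hedged sketch of the kind of values it has to produce, the code below fills indirect dispatch and draw arguments from a particle count on the CPU. The `DispatchArgs` and `DrawElementsArgs` structs are hypothetical and follow the argument order commonly used by Vulkan-style APIs. They are not the Tellusim `Compute::DispatchIndirect` or `Command::DrawElementsIndirect` definitions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical argument layouts, modeled on Vulkan-style indirect commands.
struct DispatchArgs {
    uint32_t group_x, group_y, group_z;
};

struct DrawElementsArgs {
    uint32_t num_indices;
    uint32_t num_instances;
    uint32_t base_index;
    int32_t base_vertex;
    uint32_t base_instance;
};

// One thread per particle: round the group count up so no particle is skipped.
DispatchArgs make_dispatch_args(uint32_t num_particles, uint32_t group_size) {
    assert(group_size > 0u);
    uint32_t groups = (num_particles + group_size - 1u) / group_size;
    return { groups, 1u, 1u };
}

// One instanced quad (two triangles, six indices) per live particle.
DrawElementsArgs make_draw_args(uint32_t num_particles) {
    return { 6u, num_particles, 0u, 0, 0u };
}
```

Writing these values on the GPU means the CPU never has to read back the live particle count, which is the main benefit of indirect dispatch and indirect draw.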
As a result, the vertex buffer contains particle positions, velocities, and colors, and the indirect buffer contains the rendering parameters. The only rule to follow is to place the required write-read barriers between stages. Our debug runtime reports any invalid kernel arguments. Let’s render our particle system:
// window target
target.begin();
{
// create command list
Command command = device.createCommand(target);
// set common parameters
CommonParameters common_parameters;
common_parameters.projection = Matrix4x4f::perspective(60.0f, (float32_t)window.getWidth() / window.getHeight(), 0.1f, 1000.0f);
common_parameters.modelview = Matrix4x4f::lookAt(compute_parameters.camera.xyz, Vector3f::zero, Vector3f(0.0f, 0.0f, 1.0f));
if(target.isFlipped()) common_parameters.projection = Matrix4x4f::scale(1.0f, -1.0f, 1.0f) * common_parameters.projection;
// set pipeline
command.setPipeline(pipeline);
command.setSampler(0, sampler);
command.setTexture(0, texture);
command.setUniform(0, common_parameters);
command.setVertexBuffer(0, vertex_buffer);
command.setIndexBuffer(FormatRu16, indices_buffer);
// draw particles
command.setIndirectBuffer(draw_buffer);
command.drawElementsIndirect(1);
}
target.end();
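The vertex shader of this sample is not shown. One common way to draw instanced particles, sketched here as an assumption, is to expand each instance into a quad whose four corners are selected by the vertex index. The `quad_corner()` and `signed_area2()` helpers are hypothetical. The second one only verifies that the index order { 0, 1, 2, 2, 3, 0 } yields two triangles with the same winding.

```cpp
#include <cassert>
#include <cstdint>

struct Vec2 {
    float x, y;
};

// Corner offset for a quad vertex, listed counter-clockwise, scaled by the particle radius.
Vec2 quad_corner(uint32_t vertex_index, float radius) {
    assert(vertex_index < 4u);
    static const Vec2 corners[4] = { { -1.0f, -1.0f }, { 1.0f, -1.0f }, { 1.0f, 1.0f }, { -1.0f, 1.0f } };
    return { corners[vertex_index].x * radius, corners[vertex_index].y * radius };
}

// Twice the signed area of a triangle; positive for counter-clockwise winding.
float signed_area2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}
```

In a real shader the 2D offset would be applied in camera space so that every quad faces the viewer.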
This sample runs on Android, iOS, WebGPU, Windows, Linux, and macOS platforms with excellent performance and does not require any source code modification. The left image is a link to the WebGPU application that can run in your browser: