This Week In Veloren 125

This week, we get an in-depth view into what was required for the wgpu
integration. We see some changes to econsim, performance, and caves.

– AngelOnFira, TWiV Editor

Contributor Work

Thanks to this week’s contributors, @YuriMomo, @zesterer, @Slipped, @xMAC94x,
@SWilliamsGames, @Christof, @aweinstock, @Sam, @juliancoffee, @Yusdacra,
@imbris, @DaforLynx, @donovanlank, @AngelOnFira, @Capucho, @Snowram, @Zakru, and
@Kisa!

@Sam worked on a harvester rework. It’ll have 4 unique attacks and all-new AI.
To see it, go visit your local T0 dungeon (after it merges).

This week’s meeting minutes.

Lava and performance by @aweinstock

This week, I added Lava to caves, which glows, and which sets you on fire when
you enter it.

I also improved the performance of entity-terrain collisions (one of the main
parts of physics) by an average of 20 microseconds per frame by adding a check
for a more precise bounding box before looking up voxels in the terrain.

Econsim improvements @Christof

I finally found the mistake in the faster economy simulation code and the code
got merged – server startup should be faster by several seconds.

After that, I started an
RFC
about short and long-term goals. Econsim will soon get more wares and
professions to match the increased number of items, and the much more powerful
crafting recipes. Also, prices will probably get independent of loot
probabilities and become more raw material derived.

All about wgpu

A section by @imbris and @Sharp, edited by @Capucho and @AngelOnFira

We now use the wgpu crate for our graphics!

After a long journey of refactoring our rendering code for wgpu, rebasing,
refactoring new graphics features to use wgpu, updating the code across wgpu
versions, testing on a plethora of platforms, fixing miscellaneous bugs, and
rewriting parts of our CI cache generation with efforts from @Imbris, @Capucho,
@Sharp, and others we have finally
merged the long
awaited 129 commit branch that switches our rendering code to this new graphics
crate!

wgpu is the spiritual successor to the gfx (pre low level) crate we were
using for graphics so it sits at a similar level of abstraction. However, wgpu
provides a cleaner modern API that gives use more control and the ability to
leverage the power of modern backends including vulkan, metal, and dx12. With
gfx we only used the OpenGL backend and now with wgpu we have automatic
support for multiple backends including automatic runtime backend selection.

The currently functioning backends are vulkan, dx12, metal, and dx11 (support
for OpenGL is lost for the moment). We saw improvements in performance
especially for Windows users with decent graphics cards that previously
experienced a significant discrepancy in performance from Linux users with
similar cards.

Now that this period of refactoring has been finished, we’re excited to see the
optimizations and features we can develop on this fresh graphics API.

Much thanks to @kvark and @cwfitzgerald ❤️ for answering the questions we had
about wgpu, resolving bugs we encountered, and being extremely receptive to
issue reports and pull requests!

Thanks to all the testers who helped us track down platform-specific bugs and
make sure everything was mostly working everywhere!

Notable previous mentions of the wgpu refactor are in
devblog-64 and
devblog-100

Presentation modes option

We added a new option in the graphics menu with the three presentation modes
exposed by wgpu: Immediate, Mailbox, and Fifo. Selecting Fifo or Mailbox can
be used to avoid screen tearing. Fifo corresponds to VSync and as such caps
the frame rate at the display refresh rate. Immediate and Mailbox both fallback
to Fifo within wgpu if that option isn’t available on the current platform.

GPU timing with `wgpu-profiler`

While working on this refactor I discovered, via lurking on the Rust gamedev
Discord and the wgpu matrix channels, that wpgu provided an API for collecting
timestamps
between sections of work on the GPU. This piqued my interest as I’ve previously
had much success identifying and optimizing our CPU bottlenecks with the use of
flamegraph and
Tracy while our GPU side performance was
mostly opaque to me.

Thus, I started working on writing an abstraction to make it easier to
instrument our rendering code with timestamps. While I was in the process of
this, Wumpf released wgpu-profiler. I discovered it had everything I needed
and nicely solved the complications of identifying timestamps which I was too
focused on over-optimizing. It even included the ability to write out to a
chrome trace which was pretty neat!

So I switched to wgpu-profiler, made a
PR adding drop based scope
wrappers around wgpu types, and deleted all the extra code I had for this in the
Veloren codebase which was pretty satisfying. I then added scopes around all the
major portions of our rendering (essentially one for each render pass and one
for each pipeline).

To make it easy to access this new information I added a new option in the
graphics menu to enable GPU timing. The timing values can be viewed in the HUD
debug info (F3) and will be saved as chrome trace files in the working directory
when taking a screenshot. It’s mostly just supported on the Vulkan backend at
the moment so if you don’t see anything new in the debug info then the feature
isn’t available on your current backend/platform combination. I intend to
eventually make the checkbox gray out in this case instead of having no
indication. It would be neat to plot these values over time in the future.

We can also now use tooling for profiling vulkan and dx12 such as this profiler
from AMD: https://gpuopen.com/rgp/

Example RGP profile on Ryzen 3 3200U integrated graphics

Sprite optimization

High sprite view distances previously had a really large impact on the CPU. We
used instancing for sprites to draw multiple ones with the same model within a
single draw call. However, within a chunk there several different sprite kinds
(some with multiple variations in their models) leading to a large number of
draw calls and significant CPU overhead when trying to draw more sprites farther
away.

To reduce the number of draw calls we need to make I combined all the sprite
models into a single buffer of vertices and then used a technique called vertex
pulling to fetch the vertex data from this buffer within the vertex shader by
determining the location in the buffer from the index of the current vertex
combined with the mesh offset for the current instance.

This moved the bottleneck for sprites into the vertex shader both because this
introduced a bit more work in the shader and because sprites simply have a lot
of vertices and were no longer limited by the CPU. To help alleviate this I
converted all our quad drawing (i.e. what all our terrain, figures, and sprites
are made of) to use an index buffer to reduce the number of vertices that need
to be processed per quad from 6 to 4 (as the GPU should cache vertices with the
same index).

This helped and also led to small gains for shadow rendering times, presumably
terrain and figures also benefited but the cost of the fragment shader for those
is far greater in typical scenes. I think we can squeeze more performance out of
sprites by optimizing the vertex shader further and reducing the amount of
per-instance data. Another area where we can tackle this is reducing the
vertices used for sprites that are far off and take up only a few pixels. We
already have different LOD levels that reduce the number of vertices but we may
be able to go further and simply render 2D billboard-style sprites for these.

Reversing the depth buffer

One of the bigger wins from switching to wgpu is that it lets us abandon the
OpenGL depth buffer. To make a long story short, the standard method for
calculating depth buffers is the worst possible way to do
it. They should be computed
“backwards” from how people normally set them up due to this taking far better
advantage of the way floating point precision works (in fact, it is nearly
optimal for floating point)!

Unfortunately, since it takes some fairly involved math to realize this, the
OpenGL spec was published in a way that makes it impossible to realize the
benefits of reversed depth buffers, due to setting depth planes to go from -1 to
1 instead of 0 to 1; the conversion to 0 to 1 must then be done in the drivers
by computing depth * 0.5 + 0.5, which is pretty much the worst possible thing
you can do for floating point accuracy and negates the reversal effect. wgpu
follows the far more sensible Vulkan, Metal, and DirectX option, and sets depth
buffers from 0 to 1, enabling us to realize the effect.

The practical upshot of this is that we can set the near plane much, much closer
than before, while extending the far plane to (effectively) infinity, while
retaining better precision than before at literally every point in the whole
range (except maybe within a very tiny fraction of a block in front of the near
plane). This should help us improve clipping since it solves the most annoying
aspect of it (near plane) and also lets us render much bigger maps without
noticeable artifacts (unless you zoom all the way out).

Organizing bind group and pipeline switching

To avoid redundant state changes (setting pipelines, bind groups, vertex
buffers) as much as possible. We created a series of “Drawer” types that wrap a
reference to the render pass and guarantee for instance that a certain pipeline
is in use. We have a top level Drawer that exists transiently during the
recording of the frame:

pub struct Drawer<'frame> {
    encoder: Option<ManualOwningScope<'frame, wgpu::CommandEncoder>>,
    borrow: RendererBorrow<'frame>,
    swap_tex: wgpu::SwapChainTexture,
    globals: &'frame GlobalsBindGroup,
    // Texture and other info for taking a screenshot
    // Writes to this instead in the third pass if it is present
    taking_screenshot: Option<super::screenshot::TakeScreenshot>,
}

This hold references to bits of the render state that we need and owns the
command encoder. Then for example, there is a method to start the first render
pass that returns a FirstPassDrawer:

    /// Returns None if all the pipelines are not available
    pub fn first_pass(&mut self) -> Option<FirstPassDrawer> {
        let pipelines = self.borrow.pipelines.all()?;
        // Note: this becomes Some once pipeline creation is complete even if shadows
        // are not enabled
        let shadow = self.borrow.shadow?;

        let encoder = self.encoder.as_mut().unwrap();
        let device = self.borrow.device;
        let mut render_pass = encoder.scoped_render_pass("first_pass", /* snip */ });

        render_pass.set_bind_group(0, &self.globals.bind_group, &[]);
        render_pass.set_bind_group(1, &shadow.bind.bind_group, &[]);

        Some(FirstPassDrawer {
            render_pass,
            borrow: &self.borrow,
            pipelines,
            globals: self.globals,
            shadows: &shadow.bind,
        })
    }

But before returning it two bind groups are set:

        render_pass.set_bind_group(0, &self.globals.bind_group, &[]);
        render_pass.set_bind_group(1, &shadow.bind.bind_group, &[]);

In subsequent drawing, we can then rely on these bind groups being set (as long
as we maintain that they aren’t overridden which we accomplish by keeping all
the logic which sets bind groups (and other things) in this one module). Then
e.g. to start drawing terrain we create a TerrainDrawer setting the pipeline
and the index buffer in the process:

    pub fn draw_terrain(&mut self) -> TerrainDrawer<'_, 'pass> {
        let mut render_pass = self.render_pass.scope("terrain", self.borrow.device);

        render_pass.set_pipeline(&self.pipelines.terrain.pipeline);
        set_quad_index_buffer::<terrain::Vertex>(&mut render_pass, &self.borrow);

        TerrainDrawer {
            render_pass,
            col_lights: None,
        }
    }

Such that when we draw all the chunks we only need to set the additional bind
groups and vertex buffer specific to each chunk:

impl<'pass_ref, 'pass: 'pass_ref> TerrainDrawer<'pass_ref, 'pass> {
    pub fn draw<'data: 'pass>(
        &mut self,
        model: &'data Model<terrain::Vertex>,
        col_lights: &'data Arc<ColLights<terrain::Locals>>,
        locals: &'data terrain::BoundLocals,
    ) {
        if self.col_lights
            // Check if we are still using the same atlas texture as the previous drawn chunk
            .filter(|current_col_lights| Arc::ptr_eq(current_col_lights, col_lights))
            .is_none()
        {
            self.render_pass
                .set_bind_group(2, &col_lights.bind_group, &[]);
            self.col_lights = Some(col_lights);
        };

        self.render_pass.set_bind_group(3, &locals.bind_group, &[]);
        self.render_pass.set_vertex_buffer(0, model.buf().slice(..));
        self.render_pass
            .draw_indexed(0..model.len() as u32 / 4 * 6, 0, 0..1);
    }
}

Future of OpenGL support

With this transition, we gain access to a lot of new graphics backends, but we
now no longer have OpenGL support. This mainly affects Linux users as support
for older Windows machines is provided through dx11. Nevertheless, we may regain
some form of OpenGL support on Linux in the future. The backend abstraction
layer for wgpu was recently rewritten and moved into the main wgpu
repo removing a lot of the barriers
to getting an OpenGL backend working.

Shaderc adventures

To use our glsl shaders with wgpu we have to compile them to SPIRV. Currently,
the only viable option for this is using shaderc. However, shaderc is a C++
dependency and thus has the potential to bring complications to the build
process which it does. Building shaderc requires git, cmake, and python to
be installed and also ninja on Windows. This adds extra steps for new
contributors looking to compile Veloren.

Additionally, shaderc introduced complications into our cross compiled CI builds
(for non Linux platforms). Cross compiling to Windows, I ran into this
issue which I was only able to
solve by using the posix threads version of MinGW which necessitated
distributing some DLL files.

Cross compiling to macOS also had various issues with compiling and/or linking
shaderc. After several attempts including trying to use prebuilt versions of
shaderc, we eventually switched to natively building on a macOS runner which was
admittedly a somewhat useful change outside of getting shaderc working as we
have had other
issues with cross
compiling to macOS. Shaderc is also pretty massive and appears to add at least
~10 mb to our binary size. Hence, we are really looking forward to when
naga, a shader translator library written in
pure Rust, will be able to replace shaderc for our purposes.

CPU profiling

Since we had a major refactor I thought it would be neat to examine the CPU
profiles of the work being done during rendering. Most of the CPU side time for
rendering appears to be spent in run_render_pass:

Tracy profile showing ~2 ms to queue rendering of everything for a
frame

Flamegraph snippet showing where `run_render_pass` spends time

Future directions

Our future with wgpu is bright. Because its API is much closer to that of modern
graphics drivers, it should be much easier to incorporate newer hardware
features (at least if they have a little bit of cross-platform support). We also
have a great opportunity to help influence the shape of not only wgpu, but the
WebGPU specification itself, to make it better for games in general.

Finally, having made this switch lets us focus on other rendering improvements
we’d been putting off in order not to block wgpu work (or which were blocked by
lack of wgpu), such as:

Separate the voxygen renderer module into its own crate (e.g. we can use it in
a voxel editor)
Create a small renderer engine test/demo for rapid iteration and testing.
Rendering on multiple threads (impossible before wgpu, currently waiting on
removal of locks in wgpu).
Deferred rendering (making large numbers of unshaded point lights cheap).
Move completely to naga once it has glsl support and can handle our shaders.
SMAA
Improved LOD rendering (including things like trees, sites, and maybe even
some form of sprites).
Smarter and more accurate shadows with occlusion queries (useful for many
other purposes as well) and Z-partitioning.
Faster and smarter chunk updates, both for initial rendering and updates to
things like sprites, to reduce apparent load times.
More physically correct HDR, allowing us to remove various lighting hacks
(should especially help at night).
Take advantage of compute shaders and GPU-side buffer updates to speed up
CPU-limited operations.
(Maybe) explicit tiled rendering, tesselation shaders, and multiple queues,
if/when supported!
And much, much more!

Out for an early morning dip. See you next week!

Support our devs!

Veloren Open Collective

Continue Reading →

Voxel

This Week In Veloren 130

This Week In Veloren 129

This Week In Veloren 129

This Week In Veloren 128

This Week In Veloren 128

This Week In Veloren 127

This Week In Veloren 127

This Week In Veloren 126

This Week In Veloren 126

This Week In Veloren 125

Contributor Work

Lava and performance by @aweinstock

Econsim improvements @Christof

All about wgpu

Presentation modes option

GPU timing with `wgpu-profiler`

Sprite optimization

Reversing the depth buffer

Organizing bind group and pipeline switching

Future of OpenGL support

Shaderc adventures

CPU profiling

Future directions

Contributor Work

Lava and performance by @aweinstock

Econsim improvements @Christof

All about wgpu

Presentation modes option

GPU timing with wgpu-profiler

Sprite optimization

Reversing the depth buffer

Organizing bind group and pipeline switching

Future of OpenGL support

Shaderc adventures

CPU profiling

Future directions

GPU timing with `wgpu-profiler`