Adaptive virtual terrain texturing
Today I would like to talk about a new algorithm implemented in Nebula, and this one is for making beautiful terrain at scale! Our implementation of this algorithm is extracted from the little concrete information provided by Ka Chen in his great presentation in 2015 at GDC, where he presented a novel way of doing virtual texturing in order to increase resolution dynamically, instead of the otherwise static resolution of ordinary virtual texturing.
Motivation
Before we get down to brass tacks, a little motivation as to why on earth we would go through this work and not just do ordinary virtual texturing using the GPU sparse binding API (which Nebula also supports). The answer is resolution. And if your question is why we can’t do the texture splatting (sampling the materials) at runtime instead of storing the result in some texture cache, the answer is that in order to add small details like decals to the terrain, we would have to render a great many decal boxes every frame, which would be expensive. This method instead caches that result so it can simply be sampled every frame.
In our solution, which follows the one from FarCry 4, we want a SubTexture with a maximum theoretical resolution of 64k (65536 pixels per side), which over an area of 64×64 meters yields 1024 pixels per meter. We use a fixed tile size of 256×256 pixels, which means the previous statement can also be reformulated as a resolution per tile: one tile covers 0.25×0.25 meters at the highest resolution, and the whole SubTexture, meaning 64×64 meters, at the lowest resolution.
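The arithmetic above can be checked with a few lines of C++. This is only a sketch: the constant and function names are made up for illustration and are not Nebula identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the text; the names are hypothetical.
constexpr uint32_t kSubTextureWorldSize = 64;   // meters per SubTexture side
constexpr uint32_t kMaxResolution = 65536;      // theoretical pixels per side ("64k")
constexpr uint32_t kTileSize = 256;             // pixels per tile side

// Pixels per meter at the maximum theoretical resolution.
inline uint32_t TexelsPerMeter()
{
    return kMaxResolution / kSubTextureWorldSize;
}

// Side length in meters of the world area one tile covers when the
// SubTexture is split into tilesPerSide x tilesPerSide tiles.
inline float MetersPerTile(uint32_t tilesPerSide)
{
    assert(tilesPerSide > 0);
    return float(kSubTextureWorldSize) / float(tilesPerSide);
}
```

With 65536 / 256 = 256 tiles per side, MetersPerTile gives 0.25 meters, and with a single tile it gives the full 64 meters.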
Overview
The algorithm can be summarized in a few abstract steps. We will then go through each step and have a look at the implementation. These steps include both CPU and GPU work, but they are presented in the order in which a programmer would approach them.
Setup
The first thing we need to do is set everything up. This step consists of the following: divide the world into 64×64 meter regions, each of which gets a SubTexture assigned to it. We will discuss the details of the SubTexture later, but for now, let’s just say it requires a world position in the range [-half world size, half world size] if oriented around 0, an indirection texture offset which can be initialized to 0xFFFFFFFF, a maximum LOD level float and an unsigned number of tiles, which can also be initialized to 0xFFFFFFFF.
We also need an indirection texture, which in our case is 2048×2048 pixels with a mip chain. The mip chain should meet the requirements for the maximum number of tiles in a SubTexture, such that at the highest mip, a SubTexture represents only a single indirection pixel. So, for a SubTexture of up to 256×256 tiles, it would mean the number of mips is 8. We also have our physical texture caches, one each for albedo, material and normals, at 8192×8192 standard size, or 8448×8448 padded size (explained later). The physical caches are not mipped.
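As a quick sanity check of these sizes, here is a small C++ sketch. The names are hypothetical, and the padded tile size of 264 pixels is the one from the padding discussion at the end of this post.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPhysicalCacheSize = 8192; // unpadded pixels per side
constexpr uint32_t kTileSize = 256;           // unpadded tile side in pixels
constexpr uint32_t kPaddedTileSize = 264;     // tile side including padding

// Number of tiles that fit along one side of a physical cache.
inline uint32_t TilesPerCacheSide()
{
    return kPhysicalCacheSize / kTileSize;
}

// Side of the physical cache once every tile carries its padding.
inline uint32_t PaddedCacheSize()
{
    return TilesPerCacheSide() * kPaddedTileSize;
}

// Mip index at which a SubTexture of tilesPerSide x tilesPerSide tiles
// has shrunk to a single indirection pixel (log2 of tilesPerSide).
inline uint32_t MipOfSinglePixel(uint32_t tilesPerSide)
{
    assert(tilesPerSide != 0 && (tilesPerSide & (tilesPerSide - 1)) == 0);
    uint32_t mip = 0;
    while ((tilesPerSide >> mip) > 1)
        ++mip;
    return mip;
}
```

Each cache holds 32×32 tiles, which is where the 8448 padded size comes from, and a 256-tile SubTexture reaches a single indirection pixel at mip index 8.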
We also need to render the lowres fallback. This is a texture used to shade pixels that no SubTexture covers because they are too far away. This is done with a simple screen space pass and a shader which is more or less equivalent to the tile update shader.
Our solution differs from Ka Chen’s solution and the follow-up solution from Ghost Recon Wildlands in how we output our pages. In the first solution, a buffer was attached and written to; in Ghost Recon Wildlands they used a 3D texture (which sounds a bit wasteful) to map page coordinate XY to the pixel in a texture plane, with the mip as the texture plane selector. In our solution, we use a buffer of page statuses, one per indirection pixel, and use atomic operations to decide whether or not a page has already been produced; if it hasn’t, we output it directly to our page buffer.
uint index = mipOffset + pageCoord.x + pageCoord.y * mipSize;
uint status = atomicExchange(PageStatuses[index], 1u);
if (status == 0x0)
{
    uvec4 entry = PackPageDataEntry(1u, subTextureIndex, lowerMip, pageCoord.x, pageCoord.y, subTextureTile.x, subTextureTile.y);
    uint entryIndex = atomicAdd(PageList.NumEntries, 1u);
    PageList.Entry[entryIndex] = entry;
}
Here we calculate the buffer index (which is mipped, thus the offset and size modifications to the index calculation), we use atomic exchange with the page status, and if 0, we add it to the page output! This way, we don’t need to clear the buffer, nor do we have to implement a second pass to extract the data from this pass!
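To make the deduplication trick concrete on the CPU, here is a minimal C++ sketch that uses std::atomic in place of the GLSL atomics. PageBinner, Request and the flat page index are hypothetical names for illustration only; the real work happens in the shader shown above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU stand-in for the page status buffer plus the page output list.
struct PageBinner
{
    std::vector<std::atomic<uint32_t>> statuses; // one flag per indirection pixel
    std::vector<uint32_t> entries;               // output list of requested pages
    std::atomic<uint32_t> numEntries{0};

    PageBinner(size_t numPages, size_t maxEntries)
        : statuses(numPages), entries(maxEntries, 0u)
    {
        for (auto& s : statuses)
            s.store(0u);
    }

    // Returns true only for the first request of a page; later requests for
    // the same page see status 1 and do not touch the output list.
    bool Request(uint32_t pageIndex)
    {
        assert(pageIndex < statuses.size());
        if (statuses[pageIndex].exchange(1u) != 0u)
            return false;
        const uint32_t slot = numEntries.fetch_add(1u);
        assert(slot < entries.size());
        entries[slot] = pageIndex;
        return true;
    }
};
```

Many threads can call Request for the same page, but the exchange guarantees that exactly one of them sees the old value 0 and appends the entry.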
We need a buffer to store the page statuses used in the Terrain Prepass, which should map to a texture. Therefore, we have to make a buffer big enough to capture all mips, and also generate a list of mip sizes and mip buffer offsets such that we can emulate this being a mipped texture on the GPU. We avoid any indirection data on the CPU side, because of reasons related to copying and rearranging subtexture regions.
uint offset = 0;
terrainVirtualTileState.indirectionMipOffsets.Resize(IndirectionNumMips);
terrainVirtualTileState.indirectionMipSizes.Resize(IndirectionNumMips);
for (uint i = 0; i < IndirectionNumMips; i++)
{
    uint width = IndirectionTextureSize >> i;
    uint height = IndirectionTextureSize >> i;
    terrainVirtualTileState.indirection[i].Fill(IndirectionEntry{ 0xF, 0xFFFFFFFF, 0xFFFFFFFF });
    terrainVirtualTileState.indirectionMipSizes[i] = width;
    terrainVirtualTileState.indirectionMipOffsets[i] = offset;
    offset += width * height;
}
Because of alignment wasting memory if we use arrays of single floats or integers for constant buffers in GLSL, we opted to store these values in ivec4s using the following method:
for (SizeT j = 0; j < terrainVirtualTileState.indirectionMipOffsets.Size(); j++)
{
    uniforms.VirtualPageBufferMipOffsets[j / 4][j % 4] = terrainVirtualTileState.indirectionMipOffsets[j];
    uniforms.VirtualPageBufferMipSizes[j / 4][j % 4] = terrainVirtualTileState.indirectionMipSizes[j];
}
Later in the GPU code, you will see how we get these values back with a similar operation.
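Putting the two setup loops together, the following self-contained C++ sketch builds the mip offset and size tables, packs them four to a vector, and reads them back with the same [j / 4][j % 4] indexing. IVec4, MipTables, BuildMipTables and PageIndex are hypothetical names used only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using IVec4 = std::array<int32_t, 4>; // stand-in for a GLSL ivec4

struct MipTables
{
    std::vector<IVec4> offsets; // start of each mip in the flat buffer
    std::vector<IVec4> sizes;   // width (== height) of each mip
};

inline MipTables BuildMipTables(uint32_t textureSize, uint32_t numMips)
{
    assert(numMips > 0 && (textureSize >> (numMips - 1)) > 0);
    MipTables t;
    t.offsets.assign((numMips + 3) / 4, IVec4{{0, 0, 0, 0}});
    t.sizes.assign((numMips + 3) / 4, IVec4{{0, 0, 0, 0}});
    uint32_t offset = 0;
    for (uint32_t i = 0; i < numMips; i++)
    {
        const uint32_t width = textureSize >> i;
        t.sizes[i / 4][i % 4] = int32_t(width);
        t.offsets[i / 4][i % 4] = int32_t(offset);
        offset += width * width;
    }
    return t;
}

// 2D page coordinate at a given mip to a 1D index in the flat buffer.
inline uint32_t PageIndex(const MipTables& t, uint32_t mip, uint32_t x, uint32_t y)
{
    assert(mip / 4 < t.offsets.size());
    const uint32_t mipOffset = uint32_t(t.offsets[mip / 4][mip % 4]);
    const uint32_t mipSize = uint32_t(t.sizes[mip / 4][mip % 4]);
    assert(x < mipSize && y < mipSize);
    return mipOffset + x + y * mipSize;
}
```

For a 2048×2048 texture, mip 1 starts at offset 2048 * 2048 = 4194304, so page (3, 2) at mip 1 lands at 4194304 + 3 + 2 * 1024.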
Runtime
1. Read back data from Terrain Prepass – CPU
   - Setup tile update jobs for tiles that should be deleted
   - Setup tile update jobs for tiles that should be rendered
   - Prepare indirection buffer updates
2. Update SubTexture buffer – CPU
   - Based on camera distance to the world space region occupied by a SubTexture, calculate the resolution it should have (explained in detail later)
   - Calculate number of tiles from resolution, each tile is 256 pixels in size
   - If SubTexture changed resolution, allocate a new indirection texture region for the new resolution, and deallocate the old
   - If SubTexture increased in size, copy the whole region of indirection values to the new region, and shift the mip down such that we only write to mips 1..x and avoid 0
   - If SubTexture decreased in size, copy whole region from mips 0..x-1 to new region
3. Copy buffers – CPU/GPU Transfer
   - Copy SubTextures from old regions to new regions if changed
   - Copy indirection pixels from buffer to texture
   - Copy from staging subtexture buffer to GPU buffer
4. Clear entries – GPU – Compute
5. Render Terrain Prepass – GPU/Render
   - Updates page entry buffer
6. Copy page entries to CPU-readable buffer to be used in 1. – GPU/Transfer
7. Render page tiles – GPU/Render
SubTexture
Given the overview above, it is time to go through what a SubTexture is. Imagine that we want our 64k resolution for a 64×64 meter area. For a world of size, let’s say, 8192 meters, we would need a texture 8Mi (8,388,608) pixels wide in total! Even with sparse resources, the biggest possible resource is 16k on Nvidia, and 32k on AMD, so that is immediately out of the question. However, we will talk about usage of sparse resources a bit later…
So think of it like this: we pretend that we have an 8Mi texture, but only on the CPU side. We already have our world split into 64×64 meter regions, so we can think of each 64×64 meter region as a smaller virtual texture within our 8Mi texture. This is what we refer to as a SubTexture; it is a virtual texture within the virtual texture :D. Think of the maximum resolution, like 64k for a SubTexture at the highest resolution, as the theoretical size of a texture, which you will see won’t be fully utilized.
Based on the camera distance to that region, we decide how many actual physical texture tiles should be represented by that region. In the highest resolution case, where we have a 64k resolution, we have 256×256 tiles. These tiles can be found in two places: first as a 256×256 block of pixels in the indirection texture, and second as scattered 256×256 pixel tiles in the three physical caches. So each tile occupies 1 pixel in the indirection texture, and a 256×256 tile in the physical caches. SubTextures closer to the camera use more indirection pixels, and therefore more texture tiles, and hence can provide a higher resolution for nearby pixels.
Now you might think: hold on, so if we have 256×256 indirection pixels for the highest resolution SubTexture, and each tile is 256×256 pixels, that would mean a (256*256)×(256*256) texture, which is astronomical! Well, on top of all this complexity, we will also introduce mipmaps! Each SubTexture occupies a block of indirection pixels, but as we mentioned before, the indirection texture is actually mipmapped, meaning we actually have one block for each mip level, all the way down to where a SubTexture only represents a single indirection pixel. This also means that each tile is rendered to represent a certain mip, so at mip 0 a 256×256 pixel tile covers 0.25 meters of the world, but at mip 8 it covers a whopping 64 meters. A SubTexture covering only 1×1 tiles has only a single mip.
Calculating LODs
You might have gathered by now that one of the things we have to do manually in this algorithm is to calculate the LODs. We need to calculate LODs ourselves for two things: first to determine how many tiles a SubTexture should use, and then, when rendering the terrain on the GPU, to decide which mip a pixel needs so we can mark the page at the proper mip as being resident. Our formula is rather straightforward:
// control the maximum resolution as such, to get 10.24 texels/cm, we need to have 65536 pixels (theoretical) for a 64 meter region
const uint maxResolution = SubTextureTileWorldSize * 1024;
// base distance (in meters) for switching lods; the lod goes up by one each time distance / switchDistance doubles
const float switchDistance = 2.0f;
// mask out y coordinate by multiplying result with, 1, 0 ,1
Math::vec4 min = Math::vec4(subTex.worldCoordinate[0], 0, subTex.worldCoordinate[1], 0);
Math::vec4 max = min + Math::vec4(64.0f, 0.0f, 64.0f, 0.0f);
Math::vec4 cameraXZ = Math::ceil(cameraTransform.position * Math::vec4(1, 0, 1, 0));
Math::vec4 nearestPoint = Math::minimize(Math::maximize(cameraXZ, min), max);
float distance = length(nearestPoint - cameraXZ);
// if we are outside the virtual area, just default the resolution to 0
uint resolution = 0;
if (distance <= 300)
{
    // at every regular distance interval, increase t
    uint t = Math::n_max(1.0f, (distance / switchDistance));
    // calculate lod logarithmically, such that it goes geometrically slower to progress to higher lods
    uint lod = Math::n_min((uint)Math::n_log2(t), (IndirectionNumMips - 1));
    // calculate the resolution by offsetting the max resolution with the lod
    resolution = maxResolution >> lod;
}
// calculate the amount of tiles, which is the final lodded resolution divided by the size of a tile
// the max being maxResolution / PhysicalTextureTileSize, and 0 when outside the virtual area
uint tiles = resolution / PhysicalTextureTileSize;
Two constants to keep in mind here: SubTextureTileWorldSize is set to 64 and represents the world size a SubTexture covers, and IndirectionNumMips represents the number of mips in the indirection texture, which we earlier figured out was 8. What this code does is calculate the resolution of the SubTexture and then convert it to a number of tiles. This is then used to determine the resolution for this SubTexture during this frame.
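To get a feel for the numbers, here is a standalone C++ sketch of the same calculation using the standard library instead of Nebula's Math types. It assumes the constants quoted in the text (64 meter regions, 256 pixel tiles, 8 mips, a switch distance of 2 and a cutoff of 300) and skips the rounding of the camera position; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr uint32_t kSubTextureWorldSize = 64;
constexpr uint32_t kTileSize = 256;
constexpr uint32_t kIndirectionNumMips = 8;
constexpr float kSwitchDistance = 2.0f;
constexpr float kMaxDistance = 300.0f;

// Distance in the XZ plane from the camera to the nearest point of the
// 64x64 meter region whose minimum corner is (minX, minZ).
inline float DistanceToRegion(float camX, float camZ, float minX, float minZ)
{
    const float size = float(kSubTextureWorldSize);
    const float nx = std::min(std::max(camX, minX), minX + size);
    const float nz = std::min(std::max(camZ, minZ), minZ + size);
    return std::hypot(nx - camX, nz - camZ);
}

// Number of tiles per side a SubTexture gets at the given distance.
inline uint32_t TilesForDistance(float distance)
{
    assert(distance >= 0.0f);
    if (distance > kMaxDistance)
        return 0;
    const uint32_t maxResolution = kSubTextureWorldSize * 1024;
    const uint32_t t = uint32_t(std::max(1.0f, distance / kSwitchDistance));
    const uint32_t lod = std::min(uint32_t(std::log2(float(t))), kIndirectionNumMips - 1);
    return (maxResolution >> lod) / kTileSize;
}
```

With these constants a SubTexture under the camera gets 256 tiles per side, one 20 meters away gets 32, and one at the 300 meter cutoff gets 2.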
Let’s move on to the first GPU loop! Yes, there are two loops…
Clearing
The clear pass is rather intuitive, but for the sake of it, let’s explain exactly what it is doing.
[local_size_x] = 1
shader
void
csTerrainPageClearUpdateBuffer()
{
    PageList.NumEntries = 0u;
}
It just resets the number of entries in the entry output list to zero.
Mip
Remember the LOD calculation for the SubTexture? On the GPU it is slightly different. There, we do the following:
// convert world space to positive integer interval [0..WorldSize]
vec2 worldSize = vec2(WorldSizeX, WorldSizeZ);
vec2 unsignedPos = worldPos.xz + worldSize * 0.5f;
uvec2 subTextureCoord = uvec2(unsignedPos / VirtualTerrainSubTextureSize);
// calculate subtexture index
uint subTextureIndex = subTextureCoord.x + subTextureCoord.y * VirtualTerrainNumSubTextures.x;
if (subTextureIndex >= VirtualTerrainNumSubTextures.x * VirtualTerrainNumSubTextures.y)
    return;
TerrainSubTexture subTexture = SubTextures[subTextureIndex];
// if this subtexture is bound on the CPU side, use it
if (subTexture.tiles != 0xFFFFFFFF)
{
    // calculate LOD
    const float lodScale = 4 * subTexture.tiles;
    vec2 dy = dFdy(worldPos.xz * lodScale);
    vec2 dx = dFdx(worldPos.xz * lodScale);
    float d = max(1.0f, max(dot(dx, dx), dot(dy, dy)));
    d = clamp(sqrt(d), 1.0f, pow(2, subTexture.maxLod));
    float lod = log2(d);
    ...
}
...
Let’s walk through what is happening… The first thing we do is convert the pixel’s world position to the [0..WorldSize] range. Using that, we select which SubTexture (remember, 64×64 meter area) this pixel is in. We check if the SubTexture is valid by checking if it has tiles. Then, we take the screen-space partial derivatives in x and y of the world position, scaled by the number of tiles the SubTexture has (the range of these values is, like we said before, [1..256]) multiplied by a constant factor of 4 which just seemed to work well. The length of the larger derivative is used as a measure of how much of the world one screen pixel covers, which is then clamped to not exceed the maximum LOD for this SubTexture. The max LOD is determined by the number of tiles in the SubTexture, so a SubTexture of 256 tiles has 8 as the max LOD, while a SubTexture of 1 tile has 0 as the max LOD.
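The same LOD math can be mirrored on the CPU to experiment with it. The C++ sketch below takes the world-space derivatives as plain numbers instead of using dFdx/dFdy; PageLod and its parameters are hypothetical names. One observation, offered as an aside: 4 equals 256 / 64 (tile size over SubTexture world size), so 4 * tiles works out to texels per meter at mip 0 with the constants used in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// dx* is the change in world XZ position per screen pixel in x,
// dy* the change per screen pixel in y (what dFdx/dFdy would return).
inline float PageLod(float dxX, float dxZ, float dyX, float dyZ,
                     uint32_t tiles, float maxLod)
{
    assert(tiles > 0 && maxLod >= 0.0f);
    const float lodScale = 4.0f * float(tiles);
    const float scale2 = lodScale * lodScale;
    const float lenDx2 = (dxX * dxX + dxZ * dxZ) * scale2;
    const float lenDy2 = (dyX * dyX + dyZ * dyZ) * scale2;
    float d = std::max(1.0f, std::max(lenDx2, lenDy2));
    // clamp(sqrt(d), 1, 2^maxLod)
    d = std::min(std::max(std::sqrt(d), 1.0f), std::exp2(maxLod));
    return std::log2(d);
}
```

For a 256-tile SubTexture, a screen pixel that spans 1/256 of a meter in world space gives a scaled derivative of 4 and therefore a LOD of about 2.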
Binning the pages
The next stage of the shader is to bin the pages. Following from the code before inside the inner condition, we do the following:
// calculate pixel position relative to the world coordinate for the subtexture
vec2 relativePos = worldPos.xz - subTexture.worldCoordinate;
// the mip levels would be those rounded up, and down from the lod value we receive
uint upperMip = uint(ceil(lod));
uint lowerMip = uint(floor(lod));
// calculate tile coords
uvec2 subTextureTile;
uvec2 pageCoord;
vec2 dummy;
CalculateTileCoords(lowerMip, subTexture.tiles, relativePos, subTexture.indirectionOffset, pageCoord, subTextureTile, dummy);
// since we have a buffer, we must find the appropriate offset and size into the buffer for this mip
uint mipOffset = VirtualPageBufferMipOffsets[lowerMip / 4][lowerMip % 4];
uint mipSize = VirtualPageBufferMipSizes[lowerMip / 4][lowerMip % 4];
uint index = mipOffset + pageCoord.x + pageCoord.y * mipSize;
uint status = atomicExchange(PageStatuses[index], 1u);
if (status == 0x0)
{
    uvec4 entry = PackPageDataEntry(1u, subTextureIndex, lowerMip, pageCoord.x, pageCoord.y, subTextureTile.x, subTextureTile.y);
    uint entryIndex = atomicAdd(PageList.NumEntries, 1u);
    PageList.Entry[entryIndex] = entry;
}
// if the mips are not identical, we need to repeat this process for the upper mip
if (upperMip != lowerMip)
{
    ...
}
...
Here, we first calculate the position of this pixel relative to the SubTexture’s world space origin. This is used to determine which tile within the SubTexture this pixel will be binned into. We also calculate an upper mip and lower mip, such that we capture both mips that will be used when sampling. The CalculateTileCoords function is implemented as such:
void
CalculateTileCoords(in uint mip, in uint maxTiles, in vec2 relativePos, in uvec2 subTextureIndirectionOffset, out uvec2 pageCoord, out uvec2 subTextureTile, out vec2 tilePosFract)
{
    // calculate the amount of meters a single tile is, this is adjusted by the mip and the number of tiles at max lod
    vec2 metersPerTile = VirtualTerrainSubTextureSize / float(maxTiles >> mip);
    // calculate subtexture tile index by dividing the relative position by the amount of meters there are per tile
    vec2 tilePos = relativePos / metersPerTile;
    tilePosFract = fract(tilePos);
    subTextureTile = uvec2(tilePos);
    // the actual page within that tile is the indirection offset of the whole
    // subtexture, plus the sub texture tile index
    pageCoord = (subTextureIndirectionOffset >> mip) + subTextureTile;
}
This function takes a position relative to the SubTexture in world space; that is, if a SubTexture is at coordinates (-256, -256) in world space, then a pixel at (-256, -256) would have relativePos (0, 0). The argument mip is the mip provided by the calculation above, maxTiles is the number of tiles the SubTexture provides at mip 0, and subTextureIndirectionOffset is the offset in the indirection texture where this SubTexture resides. The output is pageCoord, which is the indirection texture space coordinate for this page. It is accompanied by subTextureTile, which is the tile index for this pixel inside the SubTexture. Going back to the calling function, we use these values to calculate the index of the tile we are updating, using the mip offsets and mip sizes from the Setup phase. The calculation is a simple 2D to 1D index conversion that tells us which indirection pixel is affected, and therefore which page should be updated. As mentioned before, we use atomics to synchronize the writes to the PageEntry list, which has a cost but is much cheaper than having a clear and extraction pass.
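For a worked example, here is a C++ port of CalculateTileCoords, written as a sketch with a hard-coded SubTexture size of 64 meters. UVec2 and TileCoords are hypothetical helper types standing in for the GLSL vectors and out parameters.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct UVec2 { uint32_t x, y; };

struct TileCoords
{
    UVec2 pageCoord;      // coordinate in the indirection texture at this mip
    UVec2 subTextureTile; // tile index inside the SubTexture at this mip
    float fractX, fractZ; // position inside the tile, in [0, 1)
};

inline TileCoords CalculateTileCoords(uint32_t mip, uint32_t maxTiles,
                                      float relX, float relZ,
                                      UVec2 indirectionOffset)
{
    assert((maxTiles >> mip) > 0);
    assert(relX >= 0.0f && relZ >= 0.0f);
    // meters covered by one tile at this mip (SubTexture is 64 meters wide)
    const float metersPerTile = 64.0f / float(maxTiles >> mip);
    const float tileX = relX / metersPerTile;
    const float tileZ = relZ / metersPerTile;
    TileCoords out;
    out.subTextureTile = { uint32_t(tileX), uint32_t(tileZ) };
    out.fractX = tileX - std::floor(tileX);
    out.fractZ = tileZ - std::floor(tileZ);
    out.pageCoord = { (indirectionOffset.x >> mip) + out.subTextureTile.x,
                      (indirectionOffset.y >> mip) + out.subTextureTile.y };
    return out;
}
```

Take a 256-tile SubTexture at (-256, -256) with an indirection offset of (512, 256), chosen arbitrarily for the example. A pixel at world position (-239.875, -224) has relativePos (16.125, 32), which at mip 0 falls in tile (64, 128), halfway across that tile in x, and maps to page (576, 384). At mip 2 the same pixel falls in tile (16, 32) and page (144, 96).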
Read back
Okay, so now we are done with the heavy lifting on the GPU side for this frame, but we will be generating work from a previous frame to update tiles on this frame! Sounds confusing? Well because we are buffering frames, the read back for this frame is at least N buffers old, so any tile updates we do now might actually already be out of view, but because of the nature of buffering there really is no way around this, unfortunately.
When reading back the data, we get what we produced from the shader and, as you may have noticed, the data on the GPU is packed, so we first need to unpack it. The packing is a way to reduce the amount of memory being passed over the PCI bus when reading back, so it’s important to keep it as small as possible. It could be shrunk further later, for example by removing the pageCoords and calculating them on the CPU side.
uint status, subTextureIndex, mip, pageCoordX, pageCoordY, subTextureTileX, subTextureTileY;
UnpackPageDataEntry(updateList[0].Entry[i], status, subTextureIndex, mip, pageCoordX, pageCoordY, subTextureTileX, subTextureTileY);
// the update state is either 1 if the page is allocated, or 2 if it used to be allocated but has since been deallocated
uint updateState = status;
uint index = pageCoordX + pageCoordY * (dims.width >> mip);
IndirectionEntry& entry = terrainVirtualTileState.indirection[mip][index];
We get the indirection entry by using the page coords and mip produced by the GPU. This is actually all the info we need, now all we have to do is to allocate pages or deallocate pages from the physical texture caches, produce tile update jobs and we are done!
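The bit layout used by PackPageDataEntry and UnpackPageDataEntry is not shown in this post, so the following C++ sketch uses a purely hypothetical layout to illustrate the idea of squeezing the seven fields into a four-component unsigned vector. PackedEntry, PageEntry, PackEntry and UnpackEntry are made-up names and the bit widths are assumptions.

```cpp
#include <cassert>
#include <cstdint>

struct PackedEntry { uint32_t x, y, z, w; }; // stand-in for a GLSL uvec4

struct PageEntry
{
    uint32_t status, subTextureIndex, mip;
    uint32_t pageCoordX, pageCoordY;
    uint32_t subTextureTileX, subTextureTileY;
};

// Assumed layout: x = status (2 bits) | mip (4 bits) | subTextureIndex (26 bits),
// y = page coordinate (16 + 16 bits), z = SubTexture tile (16 + 16 bits), w unused.
inline PackedEntry PackEntry(const PageEntry& e)
{
    assert(e.status < 4u && e.mip < 16u && e.subTextureIndex < (1u << 26));
    assert(e.pageCoordX < 65536u && e.pageCoordY < 65536u);
    assert(e.subTextureTileX < 65536u && e.subTextureTileY < 65536u);
    PackedEntry p;
    p.x = e.status | (e.mip << 2) | (e.subTextureIndex << 6);
    p.y = e.pageCoordX | (e.pageCoordY << 16);
    p.z = e.subTextureTileX | (e.subTextureTileY << 16);
    p.w = 0u;
    return p;
}

inline PageEntry UnpackEntry(const PackedEntry& p)
{
    PageEntry e;
    e.status = p.x & 0x3u;
    e.mip = (p.x >> 2) & 0xFu;
    e.subTextureIndex = p.x >> 6;
    e.pageCoordX = p.y & 0xFFFFu;
    e.pageCoordY = p.y >> 16;
    e.subTextureTileX = p.z & 0xFFFFu;
    e.subTextureTileY = p.z >> 16;
    return e;
}
```

Under this assumed layout one component is left unused, which hints at how much room there might be for shrinking the entry further.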
One important thing to note here is that there is a code path for the read back information which does the following:
TerrainSubTexture& subTexture = subTextures[subTextureIndex];
float metersPerTile = SubTextureTileWorldSize / float(subTexture.tiles >> entry.mip);
This uses the SubTexture buffer which the GPU assumed was resident at the time of these page updates, and as per subTextureIndex, it becomes clear we need to use the SubTextures that were in use when the GPU was producing its pages. Therefore, the SubTextures CPU buffer is also N buffered. Actually, one has to N buffer a bunch of things for this algorithm to work, but this problem in particular took a lot of time to figure out.
Texture space allocation
As previously mentioned, we allocate indirection pages when we update our SubTextures, and physical texture cache pixels when we get the read back from the GPU.
We implement two different ways to allocate texture memory: one for the indirection texture, which is done with a quadtree, and one for the physical texture caches, which is done with an LRU cache.
Indirection allocation
Indirection regions are allocated with a quadtree, because it is effective both at searching for a free region of a specific size and at finding a previously allocated indirection region. A quadtree divides a space into four equally sized squares, and each such square is in turn another quadtree. However, a quadtree doesn’t necessarily allocate all of these levels up-front, but creates them dynamically when needed. Because every step down the tree reduces the search area by a factor of 4, it has a log4 search time, which is really good for us since we need to allocate regions of different sizes quickly. Indirection pixels are allocated based on SubTexture size, and when a SubTexture changes size, it allocates a new area, queues a copy from the old to the new area, and then deletes the old area for the next frame, such that another SubTexture can fill the space.
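A minimal sketch of such a quadtree region allocator in C++ could look like the following. It is not Nebula's implementation: QuadTreeAllocator and Region are hypothetical names, and the search is a plain depth-first walk without the bookkeeping a production allocator would use to skip full subtrees.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <memory>

struct Region { uint32_t x, y, size; bool valid; };

class QuadTreeAllocator
{
public:
    explicit QuadTreeAllocator(uint32_t size) : root(new Node(0, 0, size))
    {
        assert(size != 0 && (size & (size - 1)) == 0); // power of two
    }
    // Returns a free size x size region, or a Region with valid == false.
    Region Allocate(uint32_t size)
    {
        assert(size != 0 && (size & (size - 1)) == 0);
        return AllocateIn(*root, size);
    }
    // Frees a region returned by Allocate; returns false if it is unknown.
    bool Free(const Region& r) { return r.valid && FreeIn(*root, r); }

private:
    struct Node
    {
        Node(uint32_t x, uint32_t y, uint32_t size) : x(x), y(y), size(size), occupied(false) {}
        uint32_t x, y, size;
        bool occupied;
        std::array<std::unique_ptr<Node>, 4> children; // created on demand
        bool IsLeaf() const { return children[0] == nullptr; }
    };

    static Region AllocateIn(Node& node, uint32_t size)
    {
        const Region none = { 0, 0, 0, false };
        if (node.occupied || node.size < size)
            return none;
        if (node.size == size)
        {
            if (!node.IsLeaf())
                return none; // something smaller already lives in here
            node.occupied = true;
            return Region{ node.x, node.y, node.size, true };
        }
        if (node.IsLeaf())
        {
            const uint32_t half = node.size / 2;
            for (uint32_t i = 0; i < 4; i++)
                node.children[i].reset(new Node(node.x + (i & 1) * half, node.y + (i >> 1) * half, half));
        }
        for (auto& child : node.children)
        {
            Region r = AllocateIn(*child, size);
            if (r.valid)
                return r;
        }
        return none;
    }

    static bool FreeIn(Node& node, const Region& r)
    {
        if (node.x == r.x && node.y == r.y && node.size == r.size)
        {
            const bool wasOccupied = node.occupied;
            node.occupied = false;
            return wasOccupied;
        }
        if (node.IsLeaf() || node.size <= r.size)
            return false;
        const uint32_t half = node.size / 2;
        const uint32_t index = (r.x >= node.x + half ? 1 : 0) + (r.y >= node.y + half ? 2 : 0);
        const bool freed = FreeIn(*node.children[index], r);
        bool allFree = true;
        for (auto& child : node.children)
            allFree = allFree && child->IsLeaf() && !child->occupied;
        if (allFree)
            for (auto& child : node.children)
                child.reset(); // collapse so the parent can be handed out whole again
        return freed;
    }

    std::unique_ptr<Node> root;
};
```

Child nodes are only created when a request smaller than a node arrives, and they are collapsed again once all four are free, which mirrors the dynamic allocation of levels described above.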
Physical texture page tile allocation
With physical pages it is a little bit more sensitive. Since we have no mechanism to determine whether or not a page is not used anymore, we need a smarter approach to decide whether or not we should reuse a physical texture page. Another solution would be to implement a mechanism where we clear the page entries and determine which pages are not seen, and deallocate those pages, but if we do, it means we will lose the tiles that we just recently saw! This means that if we turn around, the same pages we just saw will be updated again, even if they never changed. The same happens when we resize SubTextures – we copy the old area to the new according to how many mips we can fill such that we reduce the amount of popping when moving between SubTextures.
Instead, we implement an LRU (least recently used) cache, which consists of a dictionary used as a cache lookup and a doubly linked list. We use the doubly linked list to keep track of the most recently used item and the least recently used item. When an item is found in the lookup, we move its linked list node to the head of the list. If an item is not found in the lookup, we allocate a new node and put it at the head and in the lookup. If the item is not found but the list is full, we remove the least recently used item from the tail of the list, which evicts that tile. Implementation wise, we allocate one node per tile in the physical caches, and reuse these nodes to avoid memory allocation when we free and obtain a new item. Every node is associated with a tile coordinate, so this is how we determine which tile should be updated.
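The scheme described above can be sketched in C++ with std::list and std::unordered_map. This is an illustration under assumptions, not the Nebula code: TileLru, Touch and the 64-bit page key are hypothetical, and the tile is a plain index rather than a 2D tile coordinate.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

class TileLru
{
public:
    struct Result { uint32_t tile; bool hit; bool evicted; uint64_t evictedKey; };

    // One node per physical tile is created up front and recycled afterwards.
    explicit TileLru(uint32_t numTiles)
    {
        assert(numTiles > 0);
        for (uint32_t i = 0; i < numTiles; i++)
            freeTiles.push_back(Node{ 0, i });
    }

    // Marks pageKey as used and returns the physical tile backing it.
    Result Touch(uint64_t pageKey)
    {
        auto it = lookup.find(pageKey);
        if (it != lookup.end())
        {
            used.splice(used.begin(), used, it->second); // move to head
            return Result{ it->second->tile, true, false, 0 };
        }
        Result r{ 0, false, false, 0 };
        if (!freeTiles.empty())
        {
            used.splice(used.begin(), freeTiles, freeTiles.begin());
        }
        else
        {
            auto last = std::prev(used.end()); // least recently used node
            r.evicted = true;
            r.evictedKey = last->key;
            lookup.erase(last->key);
            used.splice(used.begin(), used, last);
        }
        used.front().key = pageKey;
        lookup[pageKey] = used.begin();
        r.tile = used.front().tile;
        return r;
    }

private:
    struct Node { uint64_t key; uint32_t tile; };
    std::list<Node> used;      // head = most recently used
    std::list<Node> freeTiles; // nodes not yet assigned to a page
    std::unordered_map<uint64_t, std::list<Node>::iterator> lookup;
};
```

Because splice only relinks nodes, no memory is allocated or freed after construction, and the iterators stored in the lookup stay valid when nodes move.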
Copying between mips
I briefly mentioned that when a SubTexture gets resized, we perform a copy between the mips in order to fill as much as we can of the new region. Another way to conceptualize this is that we have two ‘layers’ of detail, the size of a SubTexture in tiles, and the mips in the indirection texture for that SubTexture. A SubTexture will most likely only have a few pages used in just some of the mips, meaning we might have a few pages in mip 0, more in mip 1, and the rest in mip 2. If you remember, this is because the readback will tell us for which SubTexture and in which mip a page is currently visible, and it will trigger a tile update.
But can we preserve this data when a SubTexture is scaled up or down? Just like in Ka Chen’s presentation, yes of course! When a SubTexture is scaled down, we simply need to copy from the current region to the new, cutting off some of those top mips. If we scale up, we make sure to copy as many mips as we have resident into the new region. As an example, let’s say a region goes from 256 tiles to 128. This means that mips 0-8 of the old region contain valid information, but the new region can really only fit mips 1-8, so we copy those into mips 0-7 of the new region. If we scale up, let’s say from 128 to 256, then we are missing mip 0, but the rest of the mip chain is already there, so we simply copy from mips 0-7 of the old region (128) to mips 1-8 of the new region (256). Because the size of the region follows the same binary series as the mips, a 256 tile SubTexture is equivalent to a 128 tile one if we offset the mip by 1!
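The mip mapping for a resize can be written down as a small planning function. The C++ sketch below only computes which source mip goes to which destination mip; MipCopy, MaxMip and PlanMipCopies are hypothetical names, and the actual region copies are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MipCopy { uint32_t srcMip, dstMip; };

// Highest mip index of a SubTexture with tiles x tiles tiles (log2 of tiles).
inline uint32_t MaxMip(uint32_t tiles)
{
    assert(tiles != 0 && (tiles & (tiles - 1)) == 0);
    uint32_t mip = 0;
    while ((tiles >> mip) > 1)
        ++mip;
    return mip;
}

// Lists the (old mip -> new mip) copies needed when a SubTexture is resized.
inline std::vector<MipCopy> PlanMipCopies(uint32_t oldTiles, uint32_t newTiles)
{
    std::vector<MipCopy> copies;
    const uint32_t oldMax = MaxMip(oldTiles);
    const uint32_t newMax = MaxMip(newTiles);
    if (newTiles < oldTiles)
    {
        // shrinking: the finest mips of the old region no longer fit
        const uint32_t shift = oldMax - newMax;
        for (uint32_t m = 0; m <= newMax; m++)
            copies.push_back({ m + shift, m });
    }
    else
    {
        // growing: the finest mips of the new region start out empty
        const uint32_t shift = newMax - oldMax;
        for (uint32_t m = 0; m <= oldMax; m++)
            copies.push_back({ m, m + shift });
    }
    return copies;
}
```

Going from 256 to 128 tiles yields eight copies, old mips 1-8 to new mips 0-7, and going from 128 to 256 yields old mips 0-7 to new mips 1-8, matching the example above.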
Anisotropy
Alright, the last detail to think about is the anisotropy issue. Why would we have such an issue, you might ask? Well, we have a texture cache with tiles tightly packed next to each other, which means that if we sample along the border of such a tile, we will surely get texture information from another tile, and that will produce ugly artifacts. To combat this, we simply add some padding to the tiles, so a 256×256 tile is now 264×264 pixels, leaving a 4 pixel border on each side (8 pixels in total per axis). In the sampling shader, we simply offset by 4 pixels and shrink the sampled area by 4 pixels on each side, leaving us with a 256×256 pixel sample space again.
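As a small illustration of the padded addressing, this C++ sketch maps a tile index and a fractional position inside the tile to a pixel coordinate along one axis of the padded physical cache. It assumes the 4 pixel border on each side described above; PhysicalPixel is a hypothetical helper, not the sampling shader.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kTileSize = 256;       // usable pixels per tile side
constexpr uint32_t kBorder = 4;           // padding pixels on each side
constexpr uint32_t kPaddedTileSize = 264; // kTileSize + 2 * kBorder

// Pixel coordinate (one axis) in the padded cache for a position
// tileFract in [0, 1] inside the tile with index tileIndex.
inline float PhysicalPixel(uint32_t tileIndex, float tileFract)
{
    assert(tileFract >= 0.0f && tileFract <= 1.0f);
    return float(tileIndex * kPaddedTileSize + kBorder) + tileFract * float(kTileSize);
}
```

The usable range of tile 0 is pixels 4 to 260, so filtering near the edge of a tile reads the border pixels of that same tile rather than its neighbor.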