This program compresses a texture into the BC7 format on the GPU and compares the result with the original image. You can optionally write out a decompressed version of the compressed texture to inspect the results. It only supports TGA images and is intentionally bare bones, serving mainly to demonstrate how to use the code.
usage: bc7_gpu image.tga [output.tga]
There is an OpenCL version and a CUDA version, selected with the #defines in "bc7_gpu.h". Hopefully it is fairly straightforward to incorporate the code into another tool. You would use the following files:
    ./bc7_gpu.h
    ./bc7_compressed_block.h
    ./bc7_decompress.h
    ./bc7_decompress.cpp
    ./CUDA/bc7_cuda.h
    ./CUDA/bc7_cuda.cpp
    ./CUDA/BC7.cu
    ./OpenCL/bc7_opencl.h
    ./OpenCL/bc7_opencl.cpp
    ./OpenCL/BC7.opencl
This is a Visual Studio 2010 solution and it depends on the CUDA SDK to build (though that dependency should be easy to remove). The OpenCL version of the program also works on AMD cards. For sufficiently large images the program will probably trip Windows "Timeout Detection and Recovery" (TDR). You can either disable the timeout in the registry or dispatch the image in smaller portions in a loop, as sketched below.
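For the looping approach, a minimal CUDA host-side sketch might look like the following. The kernel name and parameters (bc7_kernel, d_pixels, d_output) are illustrative, not the actual signature in BC7.cu; the point is simply that each launch covers only a slice of the 4x4 blocks, so no single launch runs long enough to trigger a driver reset:

    #include <algorithm>
    #include <cuda_runtime.h>

    // Hypothetical kernel: one thread compresses one 4x4 block.
    __global__ void bc7_kernel(const unsigned* pixels, unsigned long long* out,
                               unsigned first_block, unsigned num_blocks);

    void compress_in_chunks(const unsigned* d_pixels, unsigned long long* d_output,
                            unsigned total_blocks)
    {
        const unsigned threads_per_group = 64;
        const unsigned blocks_per_chunk  = 16 * 1024;  // tune to stay under the TDR limit

        for (unsigned first = 0; first < total_blocks; first += blocks_per_chunk)
        {
            unsigned count  = std::min(blocks_per_chunk, total_blocks - first);
            unsigned groups = (count + threads_per_group - 1) / threads_per_group;
            bc7_kernel<<<groups, threads_per_group>>>(d_pixels, d_output, first, count);
            cudaDeviceSynchronize();  // finish each chunk before launching the next
        }
    }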
BC7 is a block compression scheme that compresses a 4×4 block of 24-bit or 32-bit pixels into 16 bytes. Two colors in RGB or RGBA space are used as endpoints, and the rest of the block's pixels are reconstructed by interpolating between them. BC7 has up to 3 sets of endpoints, whereas DXT1-5 have just one. Since a GPU runs hundreds or thousands of threads, each thread runs the compression algorithm on one 4×4 block of pixels. BC7 has a large search space: there are 8 different modes, up to 64 ways to partition the 16 pixels (called a "shape"), channel swapping, etc. The GPU code iterates over the 8 modes, optionally estimates which "shapes" are worth refining, and refines them, choosing the combination of mode, shape, etc. with the lowest error.
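The interpolation itself is fixed by the BC7 specification: each index selects a 6-bit weight from a small per-index-width table, and every channel is blended in integer arithmetic:

    // BC7 weight tables and interpolation from the D3D11 spec.
    static const int kWeights2[4]  = { 0, 21, 43, 64 };
    static const int kWeights3[8]  = { 0, 9, 18, 27, 37, 46, 55, 64 };
    static const int kWeights4[16] = { 0, 4, 9, 13, 17, 21, 26, 30,
                                       34, 38, 43, 47, 51, 55, 60, 64 };

    // Blend one channel of the two endpoints e0 and e1 with a table weight.
    static inline int bc7_interpolate(int e0, int e1, int weight)
    {
        return (e0 * (64 - weight) + e1 * weight + 32) >> 6;
    }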
There is a define called __CULL_SHAPES that will choose the best shapes to refine using a linearity measure of the set of pixels. The more linear a set of pixels is, the better it will fit a line segment. I have found that simply testing all the shapes gives higher quality at about the same speed when using fewer gradient-descent iterations.
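For reference, one plausible linearity measure (a sketch, not necessarily the exact metric behind __CULL_SHAPES) is the fraction of the pixel set's total variance explained by its principal axis; a value near 1 means the pixels lie close to a line:

    #include <cmath>

    // Fraction of total variance along the dominant axis of n RGB pixels.
    float linearity(const float (*rgb)[3], int n)
    {
        float mean[3] = { 0, 0, 0 };
        for (int i = 0; i < n; ++i)
            for (int c = 0; c < 3; ++c) mean[c] += rgb[i][c];
        for (int c = 0; c < 3; ++c) mean[c] /= n;

        float cov[3][3] = {};
        for (int i = 0; i < n; ++i)
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c)
                    cov[r][c] += (rgb[i][r] - mean[r]) * (rgb[i][c] - mean[c]);

        // Dominant eigenvector via a few power-iteration steps.
        float v[3] = { 1, 1, 1 };
        for (int it = 0; it < 8; ++it) {
            float w[3] = {};
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c) w[r] += cov[r][c] * v[c];
            float len = std::sqrt(w[0]*w[0] + w[1]*w[1] + w[2]*w[2]);
            if (len == 0.0f) return 0.0f;  // all pixels identical
            for (int r = 0; r < 3; ++r) v[r] = w[r] / len;
        }

        // lambda = v^T * cov * v for the (unit) dominant eigenvector.
        float lambda = 0.0f;
        for (int r = 0; r < 3; ++r) {
            float w = 0.0f;
            for (int c = 0; c < 3; ++c) w += cov[r][c] * v[c];
            lambda += v[r] * w;
        }
        float trace = cov[0][0] + cov[1][1] + cov[2][2];
        return trace > 0.0f ? lambda / trace : 1.0f;
    }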
Once the shapes to refine are chosen, a bounding box is found for each set of pixels. The per-channel minimum and maximum are used as the initial endpoints of the line segment. Gradient descent then runs for several iterations, adjusting the endpoints in floating-point precision to minimize the error. Once that is finished, the endpoints are quantized to the mode's precision and each pixel is assigned an index into the quantized palette.
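A minimal sketch of the endpoint seeding and index-assignment steps described above, with illustrative names:

    #include <cfloat>

    // Seed the line segment with the per-channel bounding box of the subset.
    void seed_endpoints(const float (*rgb)[3], int n, float lo[3], float hi[3])
    {
        for (int c = 0; c < 3; ++c) { lo[c] = FLT_MAX; hi[c] = -FLT_MAX; }
        for (int i = 0; i < n; ++i)
            for (int c = 0; c < 3; ++c) {
                if (rgb[i][c] < lo[c]) lo[c] = rgb[i][c];
                if (rgb[i][c] > hi[c]) hi[c] = rgb[i][c];
            }
    }

    // After quantization, pick the palette entry with the least squared error.
    int best_index(const float p[3], const float (*palette)[3], int palette_size)
    {
        int best = 0;
        float best_err = FLT_MAX;
        for (int i = 0; i < palette_size; ++i) {
            float err = 0.0f;
            for (int c = 0; c < 3; ++c) {
                float d = p[c] - palette[i][c];
                err += d * d;
            }
            if (err < best_err) { best_err = err; best = i; }
        }
        return best;
    }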
According to the author's benchmarks, the CUDA version is noticeably faster (roughly 30-40%) even though the two versions are conceptually identical.