This is a first pass at what I believe to be a not-too-terrible
implementation of a cooperative thread-based compressor. The idea is
simple: if a compressor is invoked with the same parameters on multiple
threads, then the threads cooperate via an atomic counter to compress the
texture. Each thread can take as long as it needs until the texture is finished.
If a caller invokes a compression routine with different parameters, then
it will help the current compression finish before starting on its own. In this
way, we can split the textures up among the threads and guarantee that we maximize
resource usage across them. I.e. this becomes more efficient:
Thread 1:    Thread 2:    Thread N:
tex0         texN         tex(N-1)N
tex1         texN+1       tex(N-1)N+1
..           ..           ..
texN-1       tex2N-1      texN*N-1
I have not tested this for bugs, so I'm still not completely convinced that it is
deadlock-free, although it should be...
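
For illustration only, the atomic-counter cooperation described above might look
roughly like the sketch below. It is not the code in this commit: the names
CompressionJob, CompressBlock, HelpCompress and CompressCooperatively, the
16-byte block size, and the byte-inverting "compressor" are all placeholders,
and it assumes that blocks can be compressed independently of one another.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr size_t kBlockSize = 16;  // hypothetical bytes per block

// Hypothetical shared job: one texture split into fixed-size blocks.
struct CompressionJob {
  const uint8_t *in = nullptr;  // source data, numBlocks * kBlockSize bytes
  uint8_t *out = nullptr;       // destination, same size in this sketch
  uint32_t numBlocks = 0;
  std::atomic<uint32_t> nextBlock{0};   // index of the next unclaimed block
  std::atomic<uint32_t> blocksDone{0};  // number of blocks fully written
};

// Stand-in for a real per-block compressor: it only inverts each byte.
inline void CompressBlock(const uint8_t *in, uint8_t *out) {
  for (size_t i = 0; i < kBlockSize; ++i) {
    out[i] = static_cast<uint8_t>(~in[i]);
  }
}

// Every cooperating thread claims blocks until none are left.
// Returns how many blocks this particular thread compressed.
inline uint32_t HelpCompress(CompressionJob &job) {
  uint32_t mine = 0;
  for (;;) {
    // fetch_add gives each caller a unique index, so no block is claimed twice.
    const uint32_t idx = job.nextBlock.fetch_add(1, std::memory_order_relaxed);
    if (idx >= job.numBlocks) break;
    CompressBlock(job.in + idx * kBlockSize, job.out + idx * kBlockSize);
    job.blocksDone.fetch_add(1, std::memory_order_release);
    ++mine;
  }
  return mine;
}

// Runs one job on numThreads threads, counting the calling thread as one.
inline void CompressCooperatively(CompressionJob &job, unsigned numThreads) {
  assert(numThreads > 0);
  std::vector<std::thread> workers;
  for (unsigned t = 1; t < numThreads; ++t) {
    workers.emplace_back([&job] { HelpCompress(job); });
  }
  HelpCompress(job);  // the caller helps instead of idling
  for (auto &w : workers) w.join();
}
```

In this sketch a thread that arrives with a different job would first call
HelpCompress on the job in flight and only then start its own, so no thread
sits idle while blocks remain. How the real code detects "same parameters" and
hands off between jobs is not shown here.
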
Changed the function prototype to match that of the typedef in the rest of the library, and fixed a bug where we would iterate too far with the initial buffer.