Gamma-correct rendering, part 2

In part 1, I showed how to implement gamma-correct rendering using the gamma_lut class provided with AGG. The gamma_lut class works by creating two lookup tables - one to convert the non-linear source colour space (usually sRGB) to linear RGB, and another to convert in the reverse direction. Due to the properties of the gamma power curve, we would generally like the linear RGB space to be 16-bit in order to avoid loss of precision. This implies a reverse lookup table with 65536 entries - probably not a big deal with the multi-megabyte caches of today's desktop processors, but it could still be an issue with embedded processors.

An alternative strategy is to use an optimised binary search to perform the reverse transform. I'd love to be able to take credit for this idea, but in fact I got it from Jim Blinn's excellent book Dirty Pixels. My idea, which I do claim credit for, is to provide a generic implementation of the search algorithm that works for any bit depth, then provide a hand-optimised version via template specialisation for the common 8-bit case.

First, we define the generic version of the binary search. This has to be a static member of a class template, because C++ does not (currently) support partial specialisation of functions.

template<class LoResT, class HiResT, unsigned GammaShift>

struct gamma_bsearch

{

enum { gamma_half = 1 << (GammaShift - 1) };

static LoResT find(HiResT const * table, HiResT v)

{

// Binary search for any bit depth.

LoResT x = 0;

unsigned y = gamma_half;

if (v > table[y]) x = y;

y >>= 1;

while (y > 0)

{

if (v > table[x + y]) x += y;

y >>= 1;

}

return x;

}

};

We can now define a hand-optimised specialisation for the 8-bit case:

template<class LoResT, class HiResT>

struct gamma_bsearch<LoResT, HiResT, 8>

{

static LoResT find(HiResT const * table, HiResT v)

{

// Unrolled binary search.

LoResT x = 0;

if (v > table[128]) x = 128;

if (v > table[x + 64]) x += 64;

if (v > table[x + 32]) x += 32;

if (v > table[x + 16]) x += 16;

if (v > table[x + 8]) x += 8;

if (v > table[x + 4]) x += 4;

if (v > table[x + 2]) x += 2;

if (v > table[x + 1]) x += 1;

return x;

}

};

Hmmm, 8 comparisons and up to 15 additions per channel per pixel - could be worse, I suppose. As Jim Blinn mentions in his book, the binary search algorithm automatically clamps the output value between 0 and 255, so there is no need to check for overflow.

The agg::gamma_lut replacement looks like this:

template<class LoResT=int8u,

class HiResT=int8u,

unsigned GammaShift=8,

unsigned HiResShift=8> class gamma_lut_bsearch

{

public:

typedef gamma_lut_bsearch<LoResT, HiResT, GammaShift, HiResShift> self_type;

enum gamma_scale_e

{

gamma_shift = GammaShift,

gamma_size = 1 << gamma_shift,

gamma_mask = gamma_size - 1,

};

enum hi_res_scale_e

{

hi_res_shift = HiResShift,

hi_res_size = 1 << hi_res_shift,

hi_res_mask = hi_res_size - 1

};

gamma_lut_bsearch(double g) :

m_gamma(1.0)

{

gamma(g);

}

void gamma(double g)

{

m_gamma = g;

double s1 = double(gamma_mask);

double s2 = double(hi_res_mask);

// Generate lookup tables.

const unsigned max = gamma_size - 1;

m_dir_gamma[0] = 0;

m_trans_table[0] = 0;

double t1 = 0; // previous value

for (unsigned i = 1; i < max; i++)

{

double t2 = pow(i / s1, m_gamma) * s2;

m_dir_gamma[i] = HiResT(agg::uround(t2));

m_trans_table[i] = HiResT(agg::uround((t1 + t2) / 2));

t1 = t2;

}

m_dir_gamma[max] = hi_res_mask;

m_trans_table[max] = hi_res_mask;

}

double gamma() const

{

return m_gamma;

}

HiResT dir(LoResT v) const

{

return m_dir_gamma[unsigned(v)];

}

LoResT inv(HiResT v) const

{

return gamma_bsearch<LoResT, HiResT, GammaShift>::find(&m_trans_table[0], v);

}

private:

gamma_lut_bsearch(const self_type&);

const self_type& operator = (const self_type&);

double m_gamma;

HiResT m_dir_gamma[gamma_size];

HiResT m_trans_table[gamma_size];

};

Finally, we can plug the binary search into the aa_demo.cpp sample by simply replacing gamma_lut with gamma_lut_bsearch.

No need for screen shots - it looks about the same as in part 1. As before, I've attached the source files and a compiled Win32 binary.

So, is gamma_lut_bsearch any faster than gamma_lut? On a desktop system - probably not, as a straight lookup table is pretty hard to beat. However, on embedded processors with limited cache memory, the binary search could be a win, especially when the full 16-bit precision is required.

In part 3, I'll discuss gamma-correct rendering in RGBA pixel formats.

Posted by Jim Barry, 2010-06-21.