After my previous post, which showed a tremendous increase in speed from choosing float-specific math functions (e.g., "log10f") over their generic counterparts (e.g., "log10"), I still wasn't satisfied with the speed of my audio processing on the Teensy 3.6.  I had some calculations that went into and out of dB space, which meant lots of log10f(x) and pow(10,x) function calls.  I needed them to be faster.  Here's how I did it...

They Should be Faster:  The figure above shows the number of CPU cycles that I measured for each math function.  The red bars show the standard math functions.  What caught my eye was that the log10f() function is slower than the logf() function.  They should differ by only one multiplication (that is, 1-2 cycles).  Similarly, I was surprised that powf(10.0,x) was so much slower than expf(x).  It seemed like I ought to be able to accelerate these functions.

Faster powf():  I started with powf() because it was most in need of acceleration.  Because I did not need a general powf() that could handle any base, I could optimize specifically for powf(10.0,x).  Using the rules of exponentials and logs, I wrote my own macro that computes the power of 10 in terms of exp:

//powf(10.f,x) is exactly exp(log(10.0f)*x)
#define pow10f(x) expf(2.302585092994046f*(x))  //parentheses keep pow10f(a+b) correct

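Note that the reformulation above is not a numerical approximation.  The 2.302... constant is the natural log of 10, and with all of its digits this would be an exact substitution for pow(10,x).  In single precision, the only differences from powf(10.0f,x) come from rounding that constant and its product with x.  Yet, as you'll see in a moment, it is 3 times faster.  Way faster with essentially no loss in accuracy.  Wow!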

Faster log10f():  Similarly, for log10f(), I started by reformulating log10 in terms of log.  While that was effective, it gave only a modest increase in speed.  I wanted it even faster.  So, in a post in the ARM community forum (here), I found that someone (thanks Dr. Beckmann!) had reformulated log10f using log base 2 and then further accelerated it using an approximation for log base 2 that exploits the way that single-precision numbers are represented in memory.  It's a pretty neat solution.

So, using Dr. Beckmann's approximation, my complete substitution for log10f() is shown below.  The log2 approximation function is at the end of this post.

//log10f is exactly log2(x)/log2(10.0f)
#define log10f_fast(x)  (log2f_approx(x)*0.3010299956639812f)

Because this reformulation uses an approximation for log2(), it is not an exact substitution.  But, over the range of input values that I explored, the resulting error seemed to be less than 0.05%, which is good enough for my needs.

Faster by 3x!  In looking at the speed of these two reformulated functions, I saw that my log10f approximation was 2.8x faster than the standard log10f().  Similarly, I found that my pow10f function was 3.0x faster than the standard powf(10,x) function call.  That's a pretty nice acceleration!  I'm pleased.

As promised, here's the fast log2 approximation (source):

// This is a fast approximation to log2()
// Y = C[0]*F*F*F + C[1]*F*F + C[2]*F + C[3] + E;
float log2f_approx(float X) {
  float Y, F;
  int E;

  F = frexpf(fabsf(X), &E);  // |X| = F * 2^E, with F in [0.5, 1)
  Y = 1.23149591368684f;
  Y *= F;
  Y += -4.11852516267426f;
  Y *= F;
  Y += 6.02197014179219f;
  Y *= F;
  Y += -3.13396450166353f;
  Y += E;