This post originated from an RSS feed registered with Java Buzz
by James Gosling.

Original Post: Transcendental Meditation
Feed Title: James Gosling: on the Java road...
Feed URL: /jag/feed/entries/rss?cat=%2FJava
Feed Description: I've been inescapably tagged as "the Java guy", despite the fact that I havn't worked in the Java product organization for a couple of years (I do still kibbitz). These days I work at Sun's research lab about two thirds of the time, and do customer visits and talks the other third. For more detail...

I got into a conversation with some folks who've been moving a large
sophisticated image processing application to Java. They've been getting
great performance numbers, much to the surprise of the C crowd in their
shop.

With one exception: code that invokes sin() and cos() heavily is
somewhat slower. They asked me why this was happening. I had a pretty
good idea, but I checked with Joe Darcy, our local Floating Point God,
and he had this to say:

For many years, the JDK on x86 platforms has used the hardware
fsin/fcos x87 instructions in the range [-pi/4, pi/4], a
range which encompasses about half of all representable floating-point
values. Therefore, in that range the performance of the JDK's
transcendental functions should be nearly the same as the performance of
the transcendental functions in C, C++, etc. that are using those same
fsin/fcos instructions. Benchmarks which focus on testing the trig
performance of large values, such as almabench, present a skewed
portrait of Java's trigonometric performance. The next question is why
don't we just use fsin/fcos over the entire floating-point range? The
simple answer is that fsin/fcos can deliver answers that are arbitrarily
wrong for the most straightforward way of measuring error in the result.

Every finite real number, no matter how large, has a well-defined
value for sin/cos. Ideally, the floating-point result returned for
sin/cos would be the representable floating-point number closest to the
mathematically defined result for the floating-point input. A
floating-point library having this property is called correctly rounded,
which is equivalent to saying the library has an error bound less than
or equal to 1/2 an ulp (unit in the last place). For sin/cos, writing a
correctly rounding implementation that runs at a reasonable speed is
still something of a research problem so in practice platforms often use
a library with a 1 ulp error bound instead, which means either of the
floating-point numbers adjacent to the true result can be returned. This
is the implementation criteria the Java Math library has to meet. The
implementation challenge is that sin/cos are implemented using argument
reduction whereby any input is mapped into a corresponding input in the
[-pi/4, pi/4] range. Since the period of sin/cos is pi
and pi is transcendental, this amounts to having to compute a
remainder from the division by a transcendental number, which is
non-obvious. A few years after the x87 was designed, people figured out
how to do this division as if by an exact value of pi. Instead
the x87 fsin/fcos use a particular approximation to pi, which
effectively means the period of the function is changed, which can lead
to large errors outside [-pi/4, pi/4]. For example the
value of sine for the floating-point number Math.PI is around

1.2246467991473532E-16

while the computed value from fsin is

1.2246063538223773E-16

In other words, instead of getting the full 15-17 digit accuracy of
double, the returned result is only correct to about 5 decimal digits.
In terms of ulps, the error is about 1.64e11 ulps, over *ten billion*
ulps. With some effort, I'm confident I could find results with the
wrong sign, etc. There is a rationale which can justify this behavior;
however, it was much more compelling before the argument reduction
problem was solved.

This error has tragically become un-fixable because of the compatibility
requirements from one generation to the next. The fix for this problem
was figured out quite a long time ago. In the excellent paper The
K5 transcendental functions by T. Lynch, A. Ahmed, M. Schulte,
T. Callaway, and R. Tisdale a technique is described for doing argument
reduction as if you had an infinitely precise value for pi. As far as I
know, the K5 is the only x86 family CPU that did sin/cos accurately. AMD
went back to being bit-for-bit compatibile with the old x87 behavior,
assumably because too many applications broke. Oddly enough, this is
fixed in Itanium.

What we do in the JVM on x86 is moderately obvious: we range check the
argument, and if it's outside the range [-pi/4, pi/4]we do
the precise range reduction by hand, and then call fsin.

So Java is accurate, but slower. I've never been a fan of "fast, but
wrong" when "wrong" is roughly random(). Benchmarks rarely test
accuracy. "double sin(double theta) { return 0; }" would be a great
benchmark-compatible implementation of sin(). For large values of theta,
0 would be arguably more accurate since the absolute error is never
greater than 1. fsin/fcos can have absolute errors as large as 2
(correct answer=1; returned result=-1).

This is one of those area where no matter what we do, we're screwed.