You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
1130 lines
45 KiB
1130 lines
45 KiB
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%%%
|
|
%%% Version 12, 1999-09-07, Bruno
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * Habe neue Referenzen zu den Karatsuba-Arbeiten hinzugefügt, entnommen
|
|
%%% dem großartigen Paper CECM-98-118 (98_118-Borwein-Bradley-Crandall.dvi).
|
|
%%% * Habe in acm.bst, Funktion format.title, das change.case rausgenommen,
|
|
%%% das die Namen von Euler, Riemann u.a. in Kleinbuchstaben konvertiert
|
|
%%% hatte.
|
|
%%%
|
|
%%% Version 11, 1998-12-20, Bruno
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * Habe Referenzen zu den Karatsuba-Arbeiten hinzugefügt.
|
|
%%%
|
|
%%% Version 10, 1998-03-10, Thomas & Bruno
|
|
%%% --------------------------------------
|
|
%%%
|
|
%%% * Korrigiere die Formel für a(n) bei zeta(3).
|
|
%%% * Schönheitsfehler im Literaturverzeichnis.
|
|
%%%
|
|
%%% Version 9, 1998-03-06, Thomas & Bruno
|
|
%%% -------------------------------------
|
|
%%%
|
|
%%% * Schreibe \frac{1}{|x|} statt \frac{1}{x}.
|
|
%%%
|
|
%%% Version 8, 1998-01-16b, Thomas
|
|
%%% ------------------------------
|
|
%%%
|
|
%%% * Drei Literaturverweise für LiDIA statt nur einem.
|
|
%%%
|
|
%%% Version 7, 1998-01-16a, Bruno
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * Adresse: Praefix F fuer Frankreich
|
|
%%% * Abstract: Erwaehne zeta(3)
|
|
%%% * Kleinere Korrekturen der O()-Abschaetzungen
|
|
%%%
|
|
%%% Version 6, 1998-01-14, Thomas
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * habe meine Adresse ge"andert.
|
|
%%%
|
|
%%% * habe Resultat f"ur die Euler Konstante + Kettenbruch eingef"uhrt
|
|
%%%
|
|
%%% Version 5, 1997-12-11, Thomas
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * Habe die Anzahl der Scritte bei der Euler Konstante von
|
|
%%% x = ceiling(N log(2)/4)^2 auf x = ceiling((N+2) log(2)/4)^2
|
|
%%% hochgesetzt (Mail von Sofroniou)
|
|
%%%
|
|
%%% * Habe Kommentar eingef"ugt bzgl Checkpointing bei der Euler
|
|
%%% Konstante und Gamma(x) (Mail von Sofroniou)
|
|
%%%
|
|
%%% * Habe Section zu Geschwindigkeit von Maple gegen"uber Pari
|
|
%%% verbessert (mail von Laurent Bernardi).
|
|
%%%
|
|
%%% Version 4, 1997-01-09, Thomas
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * Habe die Komplexitätsaussage für sinh, cosh ergänzt.
|
|
%%% * Habe die Versionen der getesteten CAS ergänzt.
|
|
%%%
|
|
%%% Version 3, 1997-01-07, Bruno
|
|
%%% ----------------------------
|
|
%%%
|
|
%%% * Meine Firma schreibt sich mit vier Grossbuchstaben.
|
|
%%% * Apery schreibt sich m.W. mit einem Akzent.
|
|
%%% * Die Fehlermeldung meldet n2-n1>0, nicht n1-n2>0.
|
|
%%% * N -> \(N\) (zweimal)
|
|
%%% * Leerzeile entfernt nach Display-Formeln, bei denen der Absatz
|
|
%%% weitergeht. Hat den Effekt eines \noindent.
|
|
%%% * Im Abschnitt: arctan(x) for rational x: "another way" -> "the fastest way"
|
|
%%% * "[87]" -> "\cite{87}"
|
|
%%% * Das Cohen-Villegas-Zagier brauchen wir nun doch nicht zu zitieren.
|
|
%%% * Die "Note:" am Ende von Abschnitt ueber die Gamma-Funktion optisch
|
|
%%% verkleinert.
|
|
%%% * Die Formel fuer die hypergeometrische Funktionen optisch verschoenert.
|
|
%%% Andere Formeln ebenso.
|
|
%%% * Figure 2, erste Spalte rechtsbuendig.
|
|
%%% * "out performs" -> "outperforms"
|
|
%%% * "the streng" -> "the strength"
|
|
%%% * Hinweis auf die Parallelisierbarkeit im Abstract.
|
|
%%% * Bibtex-Style gehackt, damit nicht jeder zweite Autor auf seine
|
|
%%% Anfangsbuchstaben verkuerzt und alleinstehende Autoren ihres
|
|
%%% Vornamens beraubt werden.
|
|
%%%
|
|
%%% Version 2, 1997-01-06, Thomas
|
|
%%% -----------------------------
|
|
%%%
|
|
%%% * geänderte Abschnitte sind auskommentiert mit %%%. Alle
|
|
%%% Änderungen sind als Vorschlag zu verstehen. Der Grund
|
|
%%% wird im folgenden angegeben. Falls Du mit einer Änderung
|
|
%%% einverstanden bist, kannst Du alle %%%-Zeilen löschen.
|
|
%%%
|
|
%%% * Lyx defines wurden entfernt. Grund: das ISSAC-acmconf.sty
|
|
%%% erlaubt keine fremde macros. Übersetzung mit LaTeX geht.
|
|
%%% * habe Keyboardumlaute (ä,ü,ö,ß) in LaTeX umgeschrieben.
|
|
%%% Grund: damit die Submission in einem file geht.
|
|
%%% * Habe fontenc und psfig usepackage Befehle entfernt.
|
|
%%% Grund: fonts bestimmt acmconf.sty und keine Bilder vorhanden.
|
|
%%% * Habe bibliography mit BibTeX (binsplit.bib) und acm.bst
|
|
%%% erstellt. Grund: wird von ISSAC '97 verlangt.
|
|
%%% * Habe langen Formeln in einer eqnarray Umgebung gesteckt.
|
|
%%% Grund: acmconf.sty läuft im twocolumn-Modus. Lange Formeln
|
|
%%% haben die Ausgabe durcheinander gebracht.
|
|
%%% * Habe Reihenfolge bei der Beschreibung der elementare
|
|
%%% Funktionen geändert, sodaß zuerst die rationale und dann
|
|
%%% die reelle version beschrieben wird. Grund: Einheitlichkeit.
|
|
%%% * Habe sinh mit binary-splitting gegen sinh mit exp-Berechnung
|
|
%%% getestet. Sie sind ungefähr gleich gut auf meinen Pentium,
|
|
%%% mir machen "Wackler" beim cosh. cosh ist ab und zu, sowohl
|
|
%%% bei kleiner als auch bei große Präzision langsamer. Habe
|
|
%%% dies und dem Abschnitt sinh, cosh ausgefüllt. Grund: es hat
|
|
%%% gefehlt.
|
|
%%% * Habe artanh Abschnitt entfernt. Grund: ich habe in der
|
|
%%% Einleitung der elementaren Funktionen darauf verwiesen, daß
|
|
%%% man die Berechnung anderer Funktionen (wie artanh) auf die
|
|
%%% hier erwähnte zurückführen oder auf analoger Weise
|
|
%%% implementieren kann. Ich denke man braucht nicht alles explizit
|
|
%%% anzugeben.
|
|
%%%
|
|
%%% * Habe Dein Dankeschön an mich entfernt.
|
|
%%% * Habe Abschnitt über Konvergenzbeschleunigung entfernt.
|
|
%%% Grund: das geht in Dein MathComp paper.
|
|
%%%
|
|
%%% * Habe neue Formel für pi eingefügt. Grund: einfacher,
|
|
%%% effizienter und stimmt mit der angegebenen Referenz
|
|
%%% überein.
|
|
%%% * Habe die Berechnung der Apery Konstante angepasst.
|
|
%%% Grund: die hier angegebenen Formel wurde mit einer
|
|
%%% umgeformten Reihe berechnet. Wenn man dieses nicht
|
|
%%% kennt, wirkt es verwirrend. Keine Effizienz-steigerung.
|
|
%%% * Habe die Beschreibung für die erste version der Euler
|
|
%%% Konstante entfernt. Grund: wird von der zweiten version
|
|
%%% in jeder Hinsicht (Beweis, Effizienz) gedeckt.
|
|
%%% * Habe Abschnitte über Checkpointing und Parallelisierung
|
|
%%% eingefügt. Ein Beispiel über die Wirksamkeit habe ich
|
|
%%% bei der Apery Konstante angegeben. Grund: damit können wir
|
|
%%% das Paper auch bei PASCO '97 einreichen.
|
|
%%%
|
|
%%% * Habe Beispiel-C++-Implementierung für abpq Reihen eingefügt.
|
|
%%% Grund: zeigen wie einfach es ist wenn man die Formeln hat ;-)
|
|
%%% * Habe Beispiel-Implementierung für abpqcd Reihen eingefügt.
|
|
%%% Grund: dito
|
|
%%% * Habe Computational results und Conclusions Abschnitt eingefügt.
|
|
%%% * Habe die Namen der Konstanten (C, G, ...) and die entsprechenden
|
|
%%% Abschnitten eingefügt. Grund: diese Namen werden bei den
|
|
%%% Tabellen im Abschnitt Computational results benutzt.
|
|
%%% * Habe Verweis an LiDIA eingefügt. Grund: wird bei Computational
|
|
%%% results erw\"ahnt.
|
|
%%%
|
|
%%% Version 1, 1996-11-30, Bruno
|
|
%%%
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%%% Bruno Haible, Thomas Papanikolaou. %%%%
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\documentstyle{acmconf}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
% Plain TeX macros
|
|
%\catcode`@=11 % @ ist ab jetzt ein gewoehnlicher Buchstabe
|
|
%\def\@Re{\qopname@{Re}} \def\re#1{{\@Re #1}}
|
|
%\def\@Im{\qopname@{Im}} \def\im#1{{\@Im #1}}
|
|
%\catcode`@=12 % @ ist ab jetzt wieder ein Sonderzeichen
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
% LaTeX2e macros
|
|
\catcode`@=11 % @ ist ab jetzt ein gewoehnlicher Buchstabe
|
|
\def\re{\mathop{\operator@font Re}\nolimits}
|
|
\def\im{\mathop{\operator@font Im}\nolimits}
|
|
\def\artanh{\mathop{\operator@font artanh}\nolimits}
|
|
\catcode`@=12 % @ ist ab jetzt wieder ein Sonderzeichen
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\begin{document}
|
|
|
|
\title{Fast multiprecision evaluation of series of rational numbers}
|
|
|
|
\author{
|
|
\begin{tabular}{ccc}
|
|
{Bruno Haible} & \hspace*{2cm} & {Thomas Papanikolaou}\\
|
|
{\normalsize ILOG} && {\normalsize Laboratoire A2X}\\
|
|
{\normalsize 9, rue de Verdun} && {\normalsize 351, cours de la Lib\'eration}\\
|
|
{\normalsize F -- 94253 Gentilly Cedex} && {\normalsize F -- 33405 Talence Cedex}\\
|
|
{\normalsize {\tt haible@ilog.fr}} && {\normalsize {\tt papanik@math.u-bordeaux.fr}}\\
|
|
\end{tabular}
|
|
}
|
|
|
|
\maketitle
|
|
|
|
\begin{abstract}
|
|
|
|
We describe two techniques for fast multiple-precision evaluation of linearly
|
|
convergent series, including power series and Ramanujan series. The computation
|
|
time for \(N\) bits is \( O((\log N)^{2}M(N)) \), where \( M(N) \) is the time
|
|
needed to multiply two \(N\)-bit numbers. Applications include fast algorithms
|
|
for elementary functions, \(\pi\), hypergeometric functions at rational points,
|
|
$\zeta(3)$, Euler's, Catalan's and Ap{\'e}ry's constant. The algorithms are
|
|
suitable for parallel computation.
|
|
|
|
\end{abstract}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Introduction}
|
|
|
|
Multiple-precision evaluation of real numbers has become efficiently possible
|
|
since Sch\"onhage and Strassen \cite{71} have showed that the bit complexity of
|
|
the multiplication of two \(N\)-bit numbers is
|
|
\( M(N)=O(N\:\log N\:\log\log N) \).
|
|
This is not only a theoretical result; a C++ implementation \cite{96a} can
|
|
exploit this already for \( N=40000 \) bits. Algorithms for computing
|
|
elementary functions (exp, log, sin, cos, tan, asin, acos, atan, sinh, cosh,
|
|
tanh, arsinh, arcosh, artanh) have appeared in \cite{76b}, and a remarkable
|
|
algorithm for \( \pi \) was found by Brent and Salamin \cite{76c}.
|
|
|
|
However, all these algorithms suffer from the fact that calculated results
|
|
are not reusable, since the computation is done using real arithmetic (using
|
|
exact rational arithmetic would be extremely inefficient). Therefore functions
|
|
or constants have to be recomputed from the scratch every time higher precision
|
|
is required.
|
|
|
|
In this note, we present algorithms for fast computation of sums of the form
|
|
|
|
\[S=\sum _{n=0}^{\infty }R(n)F(0)\cdots F(n)\]
|
|
where \( R(n) \) and \( F(n) \) are rational functions in \( n \) with rational
|
|
coefficients, provided that this sum is linearly convergent, i.e. that the
|
|
\( n \)-th term is \( O(c^{-n}) \) with \( c>1 \). Examples include elementary
|
|
and hypergeometric functions at rational points in the {\em interior} of the
|
|
circle of convergence, as well as \( \pi \) and Euler's, Catalan's and
|
|
Ap{\'e}ry's constants.
|
|
|
|
The presented algorithms are {\em easy to implement} and {\em extremely
|
|
efficient}, since they take advantage of pure integer arithmetic. The
|
|
calculated results are {\em exact}, making {\em checkpointing} and
|
|
{\em reuse} of computations possible. Finally,
|
|
the computation of our algorithms {\em can be easily parallelised}.
|
|
|
|
After publishing the present paper, we were informed that the results of
|
|
section 2 were already published by E.~Karatsuba in \cite{91,91b,93,95c}.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Evaluation of linearly convergent series}
|
|
|
|
The technique presented here applies to all linearly convergent sums of the
|
|
form
|
|
|
|
\[ S=\sum ^{\infty }_{n=0}
|
|
\frac{a(n)}{b(n)}\frac{p(0)\cdots p(n)}{q(0)\cdots q(n)}\]
|
|
where \( a(n) \), \( b(n) \), \( p(n) \), \( q(n) \) are integers with
|
|
\( O(\log n) \) bits. The most often used case is that \( a(n) \), \( b(n) \),
|
|
\( p(n) \), \( q(n) \) are polynomials in \( n \) with integer coefficients.
|
|
|
|
\begin{description}
|
|
\item [Algorithm:]~
|
|
\end{description}
|
|
|
|
Given two index bounds \( n_{1} \) and \( n_{2} \), consider the partial sum
|
|
|
|
\[
|
|
S=\sum _{n_{1}\leq n<n_{2}}
|
|
\frac{a(n)}{b(n)}\frac{p(n_{1})\cdots p(n)}{q(n_{1})\cdots q(n)}\]
|
|
|
|
It is not computed directly. Instead, we compute the integers
|
|
\( P={p(n_{1})}\cdots {p(n_{2}-1)} \), \( Q={q(n_{1})}\cdots {q(n_{2}-1)} \),
|
|
\( B={b(n_{1})}\cdots {b(n_{2}-1)} \) and \( T=BQS \). If \( n_{2}-n_{1}<5 \),
|
|
these are computed directly. If \( n_{2}-n_{1}\geq 5 \), they are computed
|
|
using {\em binary splitting}: Choose an index \( n_{m} \) in the middle of
|
|
\( n_{1} \) and \( n_{2} \), compute the components \( P_{l} \), \( Q_{l} \),
|
|
\( B_{l} \), \( T_{l} \) belonging to the interval \( n_{1}\leq n<n_{m} \),
|
|
compute the components \( P_{r} \), \( Q_{r} \), \( B_{r} \), \( T_{r} \)
|
|
belonging to the interval \( n_{m}\leq n<n_{2} \), and set
|
|
\( P=P_{l}P_{r} \), \( Q=Q_{l}Q_{r} \), \( B=B_{l}B_{r} \) and
|
|
\( T=B_{r}Q_{r}T_{l}+B_{l}P_{l}T_{r} \).
|
|
|
|
Finally, this algorithm is applied to \( n_{1}=0 \) and
|
|
\( n_{2}=n_{\max }=O(N) \), and a final floating-point division
|
|
\( S=\frac{T}{BQ} \) is performed.
|
|
|
|
\begin{description}
|
|
\item [Complexity:]~
|
|
\end{description}
|
|
|
|
The bit complexity of computing \( S \) with \( N \) bits of precision is
|
|
\( O((\log N)^{2}M(N)) \).
|
|
|
|
\begin{description}
|
|
\item [Proof:]~
|
|
\end{description}
|
|
|
|
Since we have assumed the series to be linearly convergent, the \( n \)-th
|
|
term is \( O(c^{-n}) \) with \( c>1 \). Hence choosing
|
|
\( n_{\max }=N\frac{\log 2}{\log c}+O(1) \) will ensure that the round-off
|
|
error is \( <2^{-N} \). By our assumption that \( a(n) \), \( b(n) \),
|
|
\( p(n) \), \( q(n) \) are integers with \( O(\log n) \) bits, the integers
|
|
\( P \), \( Q \), \( B \), \( T \) belonging to the interval
|
|
\( n_{1}\leq n<n_{2} \) all have \( O((n_{2}-n_{1})\log n_{2}) \) bits.
|
|
|
|
The algorithm's recursion depth is \( d=\frac{\log n_{\max }}{\log 2}+O(1) \).
|
|
At recursion depth \( k \) (\( 1\leq k\leq d \)), integers having each
|
|
\( O(\frac{n_{\max }}{2^{k}}\log n_{\max }) \) bits are multiplied. Thus,
|
|
the entire computation time \( t \) is
|
|
\begin{eqnarray*}
|
|
t &=& \sum ^{d}_{k=1}
|
|
2^{k-1}O\left( M\left( \frac{n_{\max }}{2^{k}}\log n_{\max }\right)\right)\\
|
|
&=& \sum ^{d}_{k=1} O\left( M\left( n_{\max }\log n_{\max }\right) \right)\\
|
|
&=& O(\log n_{\max }M(n_{\max }\log n_{\max }))
|
|
\end{eqnarray*}
|
|
Because of \( n_{\max }=O( \frac{N}{\log c}) \) and
|
|
\begin{eqnarray*}
|
|
M\left(\frac{N}{\log c} \log \frac{N}{\log c}\right)
|
|
&=& O\left(\frac{1}{\log c} N\, (\log N)^{2}\, \log \log N\right)\\
|
|
&=& O\left(\frac{1}{\log c} \log N\, M(N)\right)
|
|
\end{eqnarray*}
|
|
we have
|
|
\[ t = O\left(\frac{1}{\log c} (\log N)^{2}M(N)\right) \]
|
|
Considering \(c\) as constant, this is the desired result.
|
|
|
|
\begin{description}
|
|
\item [Checkpointing/Parallelising:]~
|
|
\end{description}
|
|
|
|
A checkpoint can be easily done by storing the (integer) values of
|
|
\( n_1 \), \( n_2 \), \( P \), \( Q \), \( B \) and \( T \).
|
|
Similarly, if \( m \) processors are available, then the interval
|
|
\( [0,n_{max}] \) can be divided into \( m \) pieces of length
|
|
\( l = \lfloor n_{max}/m \rfloor \). After each processor \( i \) has
|
|
computed the sum of its interval \( [il,(i+1)l] \), the partial sums are
|
|
combined to the final result using the rules described above.
|
|
|
|
\begin{description}
|
|
\item [Note:]~
|
|
\end{description}
|
|
|
|
For the special case \( a(n)=b(n)=1 \), the binary splitting algorithm has
|
|
already been documented in \cite{76a}, section 6, and \cite{87}, section 10.2.3.
|
|
|
|
Explicit computation of \( P \), \( Q \), \( B \), \( T \) is only required
|
|
as a recursion base, for \( n_{2}-n_{1}<2 \), but avoiding recursions for
|
|
\( n_{2}-n_{1}<5 \) gains some percent of execution speed.
|
|
|
|
The binary splitting algorithm is asymptotically faster than step-by-step
|
|
evaluation of the sum -- which has binary complexity \( O(N^{2}) \) -- because
|
|
it pushes as much multiplication work as possible to the region where
|
|
multiplication becomes efficient. If the multiplication were implemented
|
|
as an \( M(N)=O(N^{2}) \) algorithm, the binary splitting algorithm would
|
|
provide no speedup over step-by-step evaluation.
|
|
|
|
\begin{description}
|
|
\item [Implementation:]~
|
|
\end{description}
|
|
|
|
In the following we present a simplified C++ implementation of
|
|
the above algorithm\footnote{A complete implementation can be found in
|
|
CLN \cite{96a}. The implementation of the binary-splitting method will
|
|
be also available in {\sf LiDIA-1.4}}. The initialisation is done by a
|
|
structure {\tt abpq\_series} containing arrays {\tt a}, {\tt b}, {\tt p}
|
|
and {\tt q} of multiprecision integers ({\tt bigint}s). The values of
|
|
the arrays at the index \( n \) correspond to the values of the functions
|
|
\( a \), \( b \), \( p \) and \( q \) at the integer point \( n \).
|
|
The (partial) results of the algorithm are stored in the
|
|
{\tt abpq\_series\_result} structure.
|
|
|
|
\begin{verbatim}
|
|
// abpq_series is initialised by user
|
|
struct { bigint *a, *b, *p, *q;
|
|
} abpq_series;
|
|
|
|
// abpq_series_result holds the partial results
|
|
struct { bigint P, Q, B, T;
|
|
} abpq_series_result;
|
|
|
|
// binary splitting summation for abpq_series
|
|
void sum_abpq(abpq_series_result & r,
|
|
int n1, int n2,
|
|
const abpq_series & arg)
|
|
{
|
|
// check the length of the summation interval
|
|
switch (n2 - n1)
|
|
{
|
|
case 0:
|
|
error_handler("summation device",
|
|
"sum_abpq:: n2-n1 should be > 0.");
|
|
break;
|
|
|
|
case 1: // the result at the point n1
|
|
r.P = arg.p[n1];
|
|
r.Q = arg.q[n1];
|
|
r.B = arg.b[n1];
|
|
r.T = arg.a[n1] * arg.p[n1];
|
|
break;
|
|
|
|
// cases 2, 3, 4 left out for simplicity
|
|
|
|
default: // the general case
|
|
|
|
// the left and the right partial sum
|
|
abpq_series_result L, R;
|
|
|
|
// find the middle of the interval
|
|
int nm = (n1 + n2) / 2;
|
|
|
|
// sum left side
|
|
sum_abpq(L, n1, nm, arg);
|
|
|
|
// sum right side
|
|
sum_abpq(R, nm, n2, arg);
|
|
|
|
// put together
|
|
r.P = L.P * R.P;
|
|
r.Q = L.Q * R.Q;
|
|
r.B = L.B * R.B;
|
|
r.T = R.B * R.Q * L.T + L.B * L.P * R.T;
|
|
break;
|
|
}
|
|
}
|
|
\end{verbatim}
|
|
|
|
Note that the multiprecision integers could be replaced here by integer
|
|
polynomials, or by any other ring providing the operators \( = \) (assignment),
|
|
\( + \) (addition) and \( * \) (multiplication). For example, one could regard
|
|
a bivariate polynomial over the integers as a series over the second variable,
|
|
with polynomials over the first variable as its coefficients. This would result
|
|
an accelerated algorithm for summing bivariate (and thus multivariate)
|
|
polynomials.
|
|
|
|
\subsection{Example: The factorial}
|
|
|
|
This is the most classical example of the binary splitting algorithm and was
|
|
probably known long before \cite{87}.
|
|
|
|
Computation of the factorial is best done using the binary splitting algorithm,
|
|
combined with a reduction of the even factors into odd factors and
|
|
multiplication with a power of 2, according to the formula
|
|
|
|
\[
|
|
n!=2^{n-\sigma _{2}(n)}\cdot \prod _{k\geq 1}
|
|
\left( \prod _{\frac{n}{2^{k}}<2m+1\leq \frac{n}{2^{k-1}}}(2m+1)\right) ^{k}\]
|
|
and where the products
|
|
\[
|
|
P(n_{1},n_{2})=\prod _{n_{1}<m\leq n_{2}}(2m+1)\]
|
|
are evaluated according to the binary splitting algorithm:
|
|
\( P(n_{1},n_{2})=P(n_{1},n_{m})P(n_{m},n_{2}) \) with
|
|
\( n_{m}=\left\lfloor \frac{n_{1}+n_{2}}{2}\right\rfloor \)
|
|
if \( n_{2}-n_{1}\geq 5 \).
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: Elementary functions at rational points}
|
|
|
|
The binary splitting algorithm can be applied to the fast computation of the
|
|
elementary functions at rational points \( x=\frac{u}{v} \), simply
|
|
by using the power series. We present how this can be done for
|
|
\( \exp (x) \), \( \ln (x) \), \( \sin (x) \), \( \cos (x) \),
|
|
\( \arctan (x) \), \( \sinh (x) \) and \( \cosh (x) \). The calculation
|
|
of other elementary functions is similar (or it can be reduced to the
|
|
calculation of these functions).
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{\( \exp (x) \) for rational \( x \)}
|
|
|
|
This is a direct application of the above algorithm with \( a(n)=1 \),
|
|
\( b(n)=1 \), \( p(0)=q(0)=1 \), and \( p(n)=u \), \( q(n)=nv \) for \( n>0 \).
|
|
Because the series is not only linearly convergent -- \( \exp (x) \) is an
|
|
entire function --, \( n_{\max }=O(\frac{N}{\log N + \log \frac{1}{|x|}}) \),
|
|
hence the bit complexity is
|
|
\[ O\left(\frac{(\log N)^2}{\log N + \log \frac{1}{|x|}} M(N)\right) \]
|
|
Considering \(x\) as constant, this is \( O(\log N\: M(N)) \).
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{\( \exp (x) \) for real \( x \)}
|
|
|
|
This can be computed using the addition theorem for exp, by a trick due to
|
|
Brent \cite{76a} (see also \cite{87}, section 10.2, exercise 8). Write
|
|
|
|
\[
|
|
x=x_{0}+\sum _{k=0}^{\infty }\frac{u_{k}}{v_{k}}\]
|
|
with \( x_{0} \) integer, \( v_{k}=2^{2^{k}} \) and
|
|
\( |u_{k}|<2^{2^{k-1}} \), and compute
|
|
\[
|
|
\exp (x)=
|
|
\exp (x_{0})\cdot \prod _{k\geq 0}\exp \left( \frac{u_{k}}{v_{k}}\right) \]
|
|
|
|
This algorithm has bit complexity
|
|
\[ O\left(\sum\limits_{k=0}^{O(\log N)} \frac{(\log N)^2}{\log N + 2^k} M(N)\right)
|
|
= O((\log N)^{2}M(N)) \]
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{ \( \ln (x) \) for rational \( x \)}
|
|
|
|
For rational \( |x-1|<1 \), the binary splitting algorithm can also be applied
|
|
directly to the power series for \( \ln (x) \). Write \( x-1=\frac{u}{v} \)
|
|
and compute the series with \( a(n)=1 \), \( b(n)=n+1 \), \( q(n)=v \),
|
|
\( p(0)=u \), and \( p(n)=-u \) for \( n>0 \).
|
|
|
|
This algorithm has bit complexity \( O((\log N)^{2}M(N)) \).
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{\( \ln (x) \) for real \( x \)}
|
|
|
|
This can be computed using the ``inverse'' Brent trick:
|
|
|
|
Start with \( y:=0 \).
|
|
|
|
As long as \( x\neq 1 \) within the actual precision, choose \( k \)
|
|
maximal with \( |x-1|<2^{-k} \). Put \( z=2^{-2k}\left[ 2^{2k}(x-1)\right] \),
|
|
i.e. let \( z \) contain the first \( k \) significant bits of \( x-1 \).
|
|
\( z \) is a good approximation for \( \ln (x) \). Set \( y:=y+z \) and
|
|
\( x:=x\cdot \exp (-z) \).
|
|
|
|
Since \( x\cdot \exp (y) \) is an invariant of the algorithm, the final
|
|
\( y \) is the desired value \( \ln (x) \).
|
|
|
|
This algorithm has bit complexity
|
|
\[ O\left(\sum\limits_{k=0}^{O(\log N)} \frac{(\log N)^2}{\log N + 2^k} M(N)\right)
|
|
= O((\log N)^{2}M(N)) \]
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{ \( \sin (x) \), \( \cos (x) \) for rational \( x \)}
|
|
|
|
These are direct applications of the binary splitting algorithm: For
|
|
\( \sin (x) \), put \( a(n)=1 \), \( b(n)=1 \), \( p(0)=u \),
|
|
\( q(0)=v \), and \( p(n)=-u^{2} \), \( q(n)=(2n)(2n+1)v^{2} \) for
|
|
\( n>0 \). For \( \cos (x) \), put \( a(n)=1 \), \( b(n)=1 \),
|
|
\( p(0)=1 \), \( q(0)=1 \), and \( p(n)=-u^{2} \), \( q(n)=(2n-1)(2n)v^{2} \)
|
|
for \( n>0 \). Of course, when both \( \sin (x) \) and \( \cos (x) \) are
|
|
needed, one should only compute
|
|
\( \sin (x) \) this way, and then set
|
|
\( \cos (x)=\pm \sqrt{1-\sin (x)^{2}} \). This is a 20\% speedup at least.
|
|
|
|
The bit complexity of these algorithms is \( O(\log N\: M(N)) \).
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{ \( \sin (x) \), \( \cos (x) \) for real \( x \)}
|
|
|
|
To compute \( \cos (x)+i\sin (x)=\exp (ix) \) for real \( x \), again the
|
|
addition theorems and Brent's trick
|
|
can be used. The resulting algorithm has bit complexity
|
|
\( O((\log N)^{2}M(N)) \).
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{ \( \arctan (x) \) for rational \( x \)}
|
|
|
|
For rational \( |x|<1 \), the fastest way to compute \( \arctan (x) \) with
|
|
bit complexity \( O((\log N)^{2}M(N)) \) is
|
|
to apply the binary splitting algorithm directly to the power series
|
|
for \( \arctan (x) \). Put \( a(n)=1 \), \( b(n)=2n+1 \), \( q(n)=1 \),
|
|
\( p(0)=x \) and \( p(n)=-x^{2} \) for \( n>0 \).
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{ \( \arctan (x) \) for real \( x \)}
|
|
|
|
This again can be computed using the ``inverse'' Brent trick:
|
|
|
|
Start out with \( z:=\frac{1}{\sqrt{1+x^{2}}}+i\frac{x}{\sqrt{1+x^{2}}} \)
|
|
and \( \varphi :=0 \). During the algorithm \( z \) will be a complex number
|
|
with \( |z|=1 \) and \( \re (z)>0 \).
|
|
|
|
As long as \( \im (z)\neq 0 \) within the actual precision, choose \( k \)
|
|
maximal with \( |\im (z)|<2^{-k} \).
|
|
Put \( \alpha =2^{-2k}\left[ 2^{2k}\im (z)\right] \), i.e. let \( \alpha \)
|
|
contain the first \( k \) significant bits of \( \im (z) \). \( \alpha \)
|
|
is a good approximation for \( \arcsin (\im (z)) \). Set
|
|
\( \varphi :=\varphi +\alpha \) and \( z:=z\cdot \exp (-i\alpha ) \).
|
|
|
|
Since \( z\cdot \exp (i\varphi ) \) is an invariant of the algorithm, the
|
|
final \( \varphi \) is the desired
|
|
value \( \arcsin \frac{x}{\sqrt{1+x^{2}}} \).
|
|
|
|
This algorithm has bit complexity
|
|
\[ O\left(\sum\limits_{k=0}^{O(\log N)} \frac{(\log N)^2}{\log N + 2^k} M(N)\right)
|
|
= O((\log N)^{2}M(N)) \]
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{\( \sinh (x) \), \( \cosh (x) \) for rational and real \( x \)}
|
|
|
|
These can be computed by similar algorithms as \( \sin (x) \) and
|
|
\( \cos (x) \) above, with the same asymptotic bit complexity. The
|
|
standard computation, using \( \exp (x) \) and its reciprocal (calculated
|
|
by the Newton method) results also to the same complexity and works equally
|
|
well in practice.
|
|
|
|
The bit complexity of these algorithms is \( O(\log N\: M(N)) \) for rational
|
|
\( x \) and \( O((\log N)^{2}M(N)) \) for real \( x \).
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: Hypergeometric functions at rational points}
|
|
|
|
The binary splitting algorithm is well suited for the evaluation of a
|
|
hypergeometric series
|
|
|
|
\[
|
|
F\left( \begin{array}{ccc}
|
|
a_{1}, & \ldots , & a_{r}\\
|
|
b_{1}, & \ldots , & b_{s}
|
|
\end{array}
|
|
\big| x\right) =\sum ^{\infty }_{n=0}
|
|
\frac{a_{1}^{\overline{n}}\cdots
|
|
a_{r}^{\overline{n}}}{b_{1}^{\overline{n}}\cdots b_{s}^{\overline{n}}}x^{n}\]
|
|
with rational coefficients \( a_{1} \), ..., \( a_{r} \), \( b_{1} \),
|
|
..., \( b_{s} \) at a rational point \( x \) in the interior of the circle of
|
|
convergence. Just put \( a(n)=1 \), \( b(n)=1 \), \( p(0)=q(0)=1 \), and
|
|
\( \frac{p(n)}{q(n)}=\frac{(a_{1}+n-1)\cdots
|
|
(a_{r}+n-1)x}{(b_{1}+n-1)\cdots (b_{s}+n-1)} \) for \( n>0 \). The evaluation
|
|
can thus be done with
|
|
bit complexity \( O((\log N)^{2}M(N)) \) for
|
|
\( r=s \) and \( O(\log N\: M(N)) \) for \( r<s \).
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: \( \pi \)}
|
|
|
|
The Ramanujan series for \( \pi \)
|
|
\[
|
|
\frac{1}{\pi }=\frac{12}{C^{3/2}}\sum^{\infty }_{n=0}
|
|
\frac{(-1)^n(6n)!(A+nB)}{(3n)!n!^{3}C^{3n}}\]
|
|
with \( A=13591409 \), \( B=545140134 \), \( C=640320 \) found by the
|
|
Chudnovsky's \footnote{A special case of \cite{87}, formula (5.5.18),
|
|
with N=163.} and which is used by the {\sf LiDIA} \cite{95,97,97b} and the Pari
|
|
\cite{95b} system to compute \( \pi \), is usually written as an algorithm
|
|
of bit complexity \( O(N^{2}) \). It is, however, possible to apply
|
|
binary splitting to the sum. Put \( a(n)=A+nB \), \( b(n)=1 \),
|
|
\( p(0)=1 \), \( q(0)=1 \), and \( p(n)=-(6n-5)(2n-1)(6n-1) \),
|
|
\( q(n)=n^{3}C^3/24 \) for \( n>0 \). This reduces the complexity to
|
|
\( O((\log N)^{2}M(N)) \). Although this is theoretically slower than
|
|
Brent-Salamin's quadratically convergent iteration, which has a bit
|
|
complexity of \( O(\log N\: M(N)) \), in practice the binary splitted
|
|
Ramanujan sum is three times faster than Brent-Salamin, at least in the
|
|
range from \( N=1000 \) bits to \( N=1000000 \) bits.
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%%% \subsection{Example: Catalan's constant}
|
|
|
|
\subsection{Example: Catalan's constant \( G \)}
|
|
|
|
A linearly convergent sum for Catalan's constant
|
|
\[
|
|
G:=\sum ^{\infty }_{n=0}\frac{(-1)^{n}}{(2n+1)^{2}}\]
|
|
is given in \cite{87}, p. 386:
|
|
\[
|
|
G = \frac{3}{8}\sum ^{\infty }_{n=0}\frac{1}{{2n \choose n} (2n+1)^{2}}
|
|
+\frac{\pi }{8}\log (2+\sqrt{3})
|
|
\]
|
|
|
|
The series is summed using binary splitting, putting \( a(n)=1 \),
|
|
\( b(n)=2n+1 \), \( p(0)=1 \), \( q(0)=1 \), and
|
|
\( p(n)=n \), \( q(n)=2(2n+1) \) for \( n>0 \). Thus
|
|
\( G \) can be computed with bit complexity \( O((\log N)^{2}M(N)) \).
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: The Gamma function at rational points}
|
|
|
|
For evaluating \( \Gamma (s) \) for rational \( s \), we first reduce \( s \)
|
|
to the range \( 1\leq s\leq 2 \) by the formula \( \Gamma (s+1)=s\Gamma (s) \).
|
|
To compute \( \Gamma (s) \) with a precision of \( N \) bits, choose a
|
|
positive integer \( x \) with \( xe^{-x}<2^{-N} \). Partial integration lets
|
|
us write
|
|
|
|
\begin{eqnarray*}
|
|
\Gamma (s)&=& \int ^{\infty }_{0}e^{-t}t^{s-1}dt\\
|
|
&=& x^{s}e^{-x}\:\sum ^{\infty }_{n=0}
|
|
\frac{x^{n}}{s(s+1)\cdots (s+n)}
|
|
+\int^{\infty }_{x}e^{-t}t^{s-1}dt\\
|
|
\end{eqnarray*}
|
|
The last integral is \( <xe^{-x}<2^{-N} \). The series is evaluated as a
|
|
hypergeometric function (see above); the number of terms to be summed up is
|
|
\( O(N) \), since \( x=O(N) \). Thus the entire computation can be done with
|
|
bit complexity \( O((\log N)^{2}M(N)) \).
|
|
|
|
\begin{description}
|
|
\item [Note:]~
|
|
\end{description}
|
|
|
|
This result is already mentioned in \cite{76b}.
|
|
|
|
E.~Karatsuba \cite{91} extends this result to \( \Gamma (s) \) for algebraic
|
|
\( s \).
|
|
|
|
For \( \Gamma (s) \) there is no checkpointing possible because of the
|
|
dependency on \( x \) in the binary splitting.
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: The Riemann Zeta value \( \zeta (3) \) \label{zeta}}
|
|
|
|
Recently, Doron Zeilberger's method of ``creative telescoping'' has been
|
|
applied to Riemann's zeta function at \( s=3 \) (see \cite{96c}), which is
|
|
also known as {\em Ap{\'e}ry's constant}:
|
|
\[
|
|
\zeta (3)=
|
|
\frac{1}{2}\sum ^{\infty }_{n=1}
|
|
\frac{(-1)^{n-1}(205n^{2}-160n+32)}{n^{5}{2n \choose n}^{5}}
|
|
\]
|
|
|
|
This sum consists of three hypergeometric series. Binary splitting can also be
|
|
applied directly, by putting \( a(n)=205n^{2}+250n+77 \), \( b(n)=1 \),
|
|
\( p(0)=1 \), \( p(n)=-n^{5} \) for \( n>0 \), and \( q(n)=32(2n+1)^{5} \).
|
|
Thus the bit complexity of computing \( \zeta (3) \) is
|
|
\( O((\log N)^{2}M(N)) \).
|
|
|
|
\begin{description}
|
|
\item [Note:]~
|
|
\end{description}
|
|
|
|
Using this the authors were able to establish a new record in the
|
|
calculation of \( \zeta (3) \) by computing 1,000,000 decimals \cite{96d}.
|
|
The computation took 8 hours on a Hewlett Packard 9000/712 machine. After
|
|
distributing on a cluster of 4 HP 9000/712 machines the same computation
|
|
required only 2.5 hours. The half hour was necessary for reading the partial
|
|
results from disk and for recombining them. Again, we have used binary-splitting
|
|
for recombining: the 4 partial result produced 2 results which were combined
|
|
to the final 1,000,000 decimals value of \( \zeta (3) \).
|
|
|
|
This example shows the importance of checkpointing. Even if a machine crashes
|
|
through the calculation, the results of the other machines are still usable.
|
|
Additionally, being able to parallelise the computation reduced the computing
|
|
time dramatically.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Evaluation of linearly convergent series of sums}
|
|
|
|
The technique presented in the previous section also applies to all linearly
|
|
convergent sums of the form
|
|
|
|
\[
|
|
U=\sum ^{\infty }_{n=0}\frac{a(n)}{b(n)}\left( \frac{c(0)}{d(0)}+\cdots
|
|
+\frac{c(n)}{d(n)}\right) \frac{p(0)\cdots p(n)}{q(0)\cdots q(n)}\]
|
|
where \( a(n) \), \( b(n) \), \( c(n) \), \( d(n) \), \( p(n) \),
|
|
\( q(n) \) are integers with \( O(\log n) \) bits. The most often
|
|
used case is again that \( a(n) \), \( b(n) \), \( c(n) \), \( d(n) \),
|
|
\( p(n) \), \( q(n) \) are polynomials in \( n \) with
|
|
integer coefficients.
|
|
|
|
\begin{description}
|
|
\item [Algorithm:]~
|
|
\end{description}
|
|
|
|
Given two index bounds \( n_{1} \)and \( n_{2} \), consider the partial sums
|
|
\[
|
|
S=\sum _{n_{1}\leq n<n_{2}}\frac{a(n)}{b(n)}
|
|
\frac{p(n_{1})\cdots p(n)}{q(n_{1})\cdots q(n)}\]
|
|
and
|
|
\[
|
|
U=\sum _{n_{1}\leq n<n_{2}}\frac{a(n)}{b(n)}
|
|
\left( \frac{c(n_{1})}{d(n_{1})}+\cdots +\frac{c(n)}{d(n)}\right)
|
|
\frac{p(n_{1})\cdots p(n)}{q(n_{1})\cdots q(n)}\]
|
|
|
|
|
|
As above, we compute the integers \( P={p(n_{1})}\cdots {p(n_{2}-1)} \),
|
|
\( Q={q(n_{1})}\cdots {q(n_{2}-1)} \), \( B={b(n_{1})}\cdots {b(n_{2}-1)} \),
|
|
\( T=BQS \), \( D={d(n_{1})}\cdots {d(n_{2}-1)} \),
|
|
\( C=D\left( \frac{c(n_{1})}{d(n_{1})}+\cdots +
|
|
\frac{c(n_{2}-1)}{d(n_{2}-1)}\right) \) and \( V=DBQU \).
|
|
If \( n_{2}-n_{1}<4 \), these
|
|
are computed directly. If \( n_{2}-n_{1}\geq 4 \), they are computed using
|
|
{\em binary splitting}: Choose an index \( n_{m} \) in the middle of
|
|
\( n_{1} \)and \( n_{2} \), compute the components \( P_{l} \), \( Q_{l} \),
|
|
\( B_{l} \), \( T_{l} \), \( D_{l} \), \( C_{l} \), \( V_{l} \) belonging
|
|
to the interval \( n_{1}\leq n<n_{m} \), compute the components \( P_{r} \),
|
|
\( Q_{r} \), \( B_{r} \), \( T_{r} \), \( D_{r} \), \( C_{r} \),
|
|
\( V_{r} \) belonging to the interval \( n_{m}\leq n<n_{2} \), and set
|
|
\( P=P_{l}P_{r} \), \( Q=Q_{l}Q_{r} \), \( B=B_{l}B_{r} \),
|
|
\( T=B_{r}Q_{r}T_{l}+B_{l}P_{l}T_{r} \), \( D=D_{l}D_{r} \),
|
|
\( C=C_{l}D_{r}+C_{r}D_{l} \) and
|
|
\( V=D_{r}B_{r}Q_{r}V_{l}+D_{r}C_{l}B_{l}P_{l}T_{r}+D_{l}B_{l}P_{l}V_{r} \).
|
|
|
|
Finally, this algorithm is applied to \( n_{1}=0 \) and
|
|
\( n_{2}=n_{\max}=O(N) \), and final floating-point divisions
|
|
\( S=\frac{T}{BQ} \) and \( U=\frac{V}{DBQ} \) are performed.
|
|
|
|
\begin{description}
|
|
\item [Complexity:]~
|
|
\end{description}
|
|
|
|
The bit complexity of computing \( S \) and \( U \) with \( N \) bits of
|
|
precision is \( O((\log N)^{2}M(N)) \).
|
|
|
|
\begin{description}
|
|
\item [Proof:]~
|
|
\end{description}
|
|
|
|
By our assumption that \( a(n) \), \( b(n) \), \( c(n) \), \( d(n) \),
|
|
\( p(n) \), \( q(n) \) are integers with \( O(\log n) \) bits,
|
|
the integers \( P \), \( Q \), \( B \), \( T \), \( D \), \( C \),
|
|
\( V \) belonging to the interval \( n_{1}\leq n<n_{2} \) all have
|
|
\( O((n_{2}-n_{1})\log n_{2}) \) bits. The rest of the proof is as in the
|
|
previous section.
|
|
|
|
\begin{description}
|
|
\item [Checkpointing/Parallelising:]~
|
|
\end{description}
|
|
|
|
A checkpoint can be easily done by storing the (integer) values of
|
|
\( n_1 \), \( n_2 \), \( P \), \( Q \), \( B \), \( T \) and additionally
|
|
\( D \), \( C \), \( V \). Similarly, if \( m \) processors are available,
|
|
then the interval \( [0,n_{max}] \) can be divided into \( m \) pieces of
|
|
length \( l = \lfloor n_{max}/m \rfloor \). After each processor \( i \) has
|
|
computed the sum of its interval \( [il,(i+1)l] \), the partial sums are
|
|
combined to the final result using the rules described above.
|
|
|
|
\begin{description}
|
|
\item [Implementation:]~
|
|
\end{description}
|
|
|
|
The C++ implementation of the above algorithm is very similar
|
|
to the previous one. The initialisation is done now by a structure
|
|
{\tt abpqcd\_series} containing arrays {\tt a}, {\tt b}, {\tt p},
|
|
{\tt q}, {\tt c} and {\tt d} of multiprecision integers. The values of
|
|
the arrays at the index \( n \) correspond to the values of the functions
|
|
\( a \), \( b \), \( p \), \( q \), \( c \) and \( d \) at the integer point
|
|
\( n \). The (partial) results of the algorithm are stored in the
|
|
{\tt abpqcd\_series\_result} structure, which now contains 3 new elements
|
|
({\tt C}, {\tt D} and {\tt V}).
|
|
|
|
\begin{verbatim}
|
|
// abpqcd_series is initialised by user
|
|
struct { bigint *a, *b, *p, *q, *c, *d;
|
|
} abpqcd_series;
|
|
|
|
// abpqcd_series_result holds the partial results
|
|
struct { bigint P, Q, B, T, C, D, V;
|
|
} abpqcd_series_result;
|
|
|
|
void sum_abpqcd(abpqcd_series_result & r,
|
|
int n1, int n2,
|
|
const abpqcd_series & arg)
|
|
{
|
|
switch (n2 - n1)
|
|
{
|
|
case 0:
|
|
error_handler("summation device",
|
|
"sum_abpqcd:: n2-n1 should be > 0.");
|
|
break;
|
|
|
|
case 1: // the result at the point n1
|
|
r.P = arg.p[n1];
|
|
r.Q = arg.q[n1];
|
|
r.B = arg.b[n1];
|
|
r.T = arg.a[n1] * arg.p[n1];
|
|
r.D = arg.d[n1];
|
|
r.C = arg.c[n1];
|
|
r.V = arg.a[n1] * arg.c[n1] * arg.p[n1];
|
|
break;
|
|
|
|
// cases 2, 3, 4 left out for simplicity
|
|
|
|
default: // general case
|
|
|
|
// the left and the right partial sum
|
|
abpqcd_series_result L, R;
|
|
|
|
// find the middle of the interval
|
|
int nm = (n1 + n2) / 2;
|
|
|
|
// sum left side
|
|
sum_abpqcd(L, n1, nm, arg);
|
|
|
|
// sum right side
|
|
sum_abpqcd(R, nm, n2, arg);
|
|
|
|
// put together
|
|
r.P = L.P * R.P;
|
|
r.Q = R.Q * L.Q;
|
|
r.B = L.B * R.B;
|
|
bigint tmp = L.B * L.P * R.T;
|
|
r.T = R.B * R.Q * L.T + tmp;
|
|
r.D = L.D * R.D;
|
|
r.C = L.C * R.D + R.C * L.D;
|
|
r.V = R.D * (R.B * R.Q * L.V + L.C * tmp)
|
|
+ L.D * L.B * L.P * R.V;
|
|
break;
|
|
}
|
|
}
|
|
\end{verbatim}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: Euler's constant \( C \) \label{eulergamma}}
|
|
|
|
\begin{description}
|
|
\item [Theorem:]~
|
|
\end{description}
|
|
|
|
Let \( f(x)=\sum ^{\infty }_{n=0}\frac{x^{n}}{n!^{2}} \) and
|
|
\( g(x)=\sum ^{\infty }_{n=0}H_{n}\frac{x^{n}}{n!^{2}} \). Then for
|
|
\( x\rightarrow \infty \),
|
|
\( \frac{g(x)}{f(x)}=\frac{1}{2}\log x+C+O\left( e^{-4\sqrt{x}}\right) \).
|
|
|
|
\begin{description}
|
|
\item [Proof:]~
|
|
\end{description}
|
|
|
|
The Laplace method for asymptotic evaluation of exponentially growing
|
|
sums and integrals yields
|
|
\[
|
|
f(x)=
|
|
e^{2\sqrt{x}}x^{-\frac{1}{4}}\frac{1}{2\sqrt{\pi }}(1+O(x^{-\frac{1}{4}}))\]
|
|
and
|
|
\[
|
|
g(x)=e^{2\sqrt{x}}x^{-\frac{1}{4}}\frac{1}{2\sqrt{\pi }}
|
|
\left(\frac{1}{2}\log x+C+O(\log x\cdot x^{-\frac{1}{4}})\right)\]
|
|
On the other hand, \( h(x):=\frac{g(x)}{f(x)} \) satisfies the
|
|
differential equation
|
|
\[
|
|
xf(x)\cdot h''(x)+(2xf'(x)+f(x))\cdot h'(x)=f'(x)\]
|
|
hence
|
|
\[
|
|
h(x)=\frac{1}{2}\log x+C+c_{2}
|
|
\int ^{\infty }_{x}\frac{1}{tf(t)^{2}}dt=\frac{1}{2}\log x+C+O(e^{-4\sqrt{x}})\]
|
|
|
|
|
|
\begin{description}
|
|
\item [Algorithm:]~
|
|
\end{description}
|
|
|
|
To compute \( C \) with a precision of \( N \) bits, set
|
|
\[ x=\left\lceil (N+2)\: \frac{\log 2}{4}\right\rceil ^{2} \]
|
|
and evaluate the series for \( g(x) \) and \( f(x) \) simultaneously,
|
|
using the binary-splitting algorithm,
|
|
with \( a(n)=1 \), \( b(n)=1 \), \( c(n)=1 \), \( d(n)=n+1 \),
|
|
\( p(n)=x \), \( q(n)=(n+1)^{2} \). Let \( \alpha =3.591121477\ldots \)
|
|
be the solution of the equation \( -\alpha \log \alpha +\alpha +1=0 \). Then
|
|
\( \alpha \sqrt{x}-\frac{1}{4\log \alpha }\log \sqrt{x}+O(1) \)
|
|
terms of the series suffice for the relative error to be bounded
|
|
by \( 2^{-N} \).
|
|
|
|
\begin{description}
|
|
\item [Complexity:]~
|
|
\end{description}
|
|
|
|
The bit complexity of this algorithm is \( O((\log N)^{2}M(N)) \).
|
|
|
|
\begin{description}
|
|
\item [Note:]~
|
|
\end{description}
|
|
|
|
This algorithm was first mentioned in \cite{80}. It is by far
|
|
the fastest known algorithm for computing Euler's constant.
|
|
|
|
For Euler's constant there is no checkpointing possible because
|
|
of the dependency on \( x \) in the binary splitting.
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Computational results}
|
|
|
|
In this section we present some computational results of our CLN and
|
|
{\sf LiDIA} implementation of the algorithms presented in this note. We use
|
|
the official version (1.3) and an experimental version (1.4a) of {\sf LiDIA}.
|
|
We have taken advantage of {\sf LiDIA}'s ability to replace its kernel
|
|
(multiprecision arithmetic and memory management) \cite{95,97,97b}, so we were
|
|
able to use in both cases CLN's fast integer arithmetic routines.
|
|
|
|
\subsection{Timings}
|
|
|
|
The table in Figure \ref{Fig1} shows the running times for the calculation of
|
|
\( \exp(1) \), \( \log(2) \), \( \pi \), \( C \), \( G \) and \( \zeta(3) \)
|
|
to precision 100, 1000, 10000 and 100000 decimal digits. The timings are given
|
|
in seconds and they denote the {\em real} time needed, i.e. system and user
|
|
time. The computation was done on an Intel Pentium with 133Hz and 32MB of RAM.
|
|
|
|
\begin{figure}[htb]
|
|
\begin{center}
|
|
\begin{tabular}{|l|l|l|l|l|l|l|}
|
|
\hline
|
|
D &\( \exp(1) \)&\( \log(2) \)&\( \pi \)&\( C \) &\( G \)&\( \zeta(3) \)\\
|
|
\hline
|
|
\hline
|
|
\( 10^2 \) &0.0005 & 0.0020 &0.0014 & 0.0309 &0.0179 & 0.0027 \\
|
|
\hline
|
|
\( 10^3 \) &0.0069 & 0.0474 &0.0141 & 0.8110 &0.3580 & 0.0696 \\
|
|
\hline
|
|
\( 10^4 \) &0.2566 & 1.9100 &0.6750 & 33.190 &13.370 & 2.5600 \\
|
|
\hline
|
|
\( 10^5 \) &5.5549 & 45.640 &17.430 & 784.93 &340.33 & 72.970 \\
|
|
\hline
|
|
\end{tabular}
|
|
\caption{{\sf LiDIA-1.4a} timings of computation of constants using
|
|
binary-splitting}\label{Fig1}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
The second table (Figure \ref{Fig2}) summarizes the performance of
|
|
\( exp(x) \) in various Computer Algebra systems\footnote{We do not list
|
|
the timings of {\sf LiDIA-1.4a} since these are comparable to those of CLN.}.
|
|
For a fair comparison of the algorithms, both argument and precision are
|
|
chosen in such a way, that system--specific optimizations (BCD arithmetic
|
|
in Maple, FFT multiplication in CLN, special exact argument handling in
|
|
{\sf LiDIA}) do not work. We use \( x = -\sqrt{2} \) and precision
|
|
\( 10^{(i/3)} \), with \( i \) running from \( 4 \) to \( 15 \).
|
|
|
|
\begin{figure}[htb]
|
|
\begin{center}
|
|
\begin{tabular}{|r|l|l|l|l|}
|
|
\hline
|
|
D & Maple & Pari & {\sf LiDIA-1.3} & CLN \\
|
|
\hline
|
|
\hline
|
|
21 & 0.00090 & 0.00047 & 0.00191 & 0.00075 \\
|
|
\hline
|
|
46 & 0.00250 & 0.00065 & 0.00239 & 0.00109 \\
|
|
\hline
|
|
100 & 0.01000 & 0.00160 & 0.00389 & 0.00239 \\
|
|
\hline
|
|
215 & 0.03100 & 0.00530 & 0.00750 & 0.00690 \\
|
|
\hline
|
|
464 & 0.11000 & 0.02500 & 0.02050 & 0.02991 \\
|
|
\hline
|
|
1000 & 0.4000 & 0.2940 & 0.0704 & 0.0861 \\
|
|
\hline
|
|
2154 & 1.7190 & 0.8980 & 0.2990 & 0.2527 \\
|
|
\hline
|
|
4641 & 8.121 & 5.941 & 1.510 & 0.906 \\
|
|
\hline
|
|
10000 & 39.340 & 39.776 & 7.360 & 4.059 \\
|
|
\hline
|
|
21544 & 172.499 & 280.207 & 39.900 & 15.010 \\
|
|
\hline
|
|
46415 & 868.841 & 1972.184& 129.000 & 39.848 \\
|
|
\hline
|
|
100000 & 4873.829 & 21369.197& 437.000 & 106.990 \\
|
|
\hline
|
|
\end{tabular}
|
|
\caption{Timings of computation of \( \exp(-\sqrt{2}) \)}\label{Fig2}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
MapleV R3 is the slowest system in this comparison. This is probably due to
|
|
the BCD arithmetic it uses. However, Maple seems to have an asymptotically
|
|
better algorithm for \( exp (x) \) for numbers having more than 10000 decimals.
|
|
In this range it outperforms Pari-1.39.03, which is the fastest system in the
|
|
0--200 decimals range.
|
|
|
|
The comparison indicating the strength of binary-splitting is between
|
|
{\sf LiDIA-1.3} and CLN itself. Having the same kernel, the only
|
|
difference is here that {\sf LiDIA-1.3} uses Brent's \( O(\sqrt{n}M(n)) \)
|
|
for \( \exp(x) \), whereas CLN changes from Brent's method to a
|
|
binary-splitting version for large numbers.
|
|
|
|
As expected in the range of 1000--100000 decimals CLN outperforms
|
|
{\sf LiDIA-1.3} by far. The fact that {\sf LiDIA-1.2.1} is faster
|
|
in the range of 200--1000 decimals (also in some trig. functions)
|
|
is probably due to a better optimized \( O(\sqrt{n}M(n)) \) method
|
|
for \( \exp(x) \).
|
|
|
|
\subsection {Distributed computing of \( \zeta (3) \)}
|
|
|
|
Using the method described in \ref{zeta} the authors were the first to
|
|
compute 1,000,000 decimals of \( \zeta (3) \) \cite{96d}.
|
|
The computation took 8 hours on a Hewlett Packard 9000/712 machine. After
|
|
distributing on a cluster of 4 HP 9000/712 machines the same computation
|
|
required only 2.5 hours. The half hour was necessary for reading the partial
|
|
results from disk and for recombining them. Again, we have used binary-splitting
|
|
for recombining: the 4 partial result produced 2 results which were combined
|
|
to the final 1,000,000 decimals value of \( \zeta (3) \).
|
|
|
|
This example shows the importance of checkpointing. Even if a machine crashes
|
|
through the calculation, the results of the other machines are still usable.
|
|
Additionally, being able to parallelise the computation reduced the computing
|
|
time dramatically.
|
|
|
|
\subsection{Euler's constant \( C \)}
|
|
|
|
We have implemented a version of Brent's and McMillan's algorithm \cite{80} and
|
|
a version accelerated by binary-splitting as shown in \ref{eulergamma}.
|
|
|
|
The computation of \( C \) was done twice on a SPARC-Ultra machine
|
|
with 167 MHz and 256 MB of RAM. The first computation using the non-acellerated
|
|
version required 160 hours. The result of this computation was then verified
|
|
by the binary splitting version in (only) 14 hours.
|
|
|
|
The first 475006 partial quotients of the continued fraction of \( C \)
|
|
were computed on an Intel Pentium with 133 MHz and 32 MB of RAM in 3 hours
|
|
using a programm by H. te Riele based on \cite{96e}, which was translated to
|
|
{\sf LiDIA} for efficiency reasons. Computing the 475006th
|
|
convergent produced the following improved theorem:
|
|
|
|
\medskip
|
|
|
|
\centerline{If \( C \) is a rational number, \(C=p/q\), then \( |q| > 10^{244663} \)}
|
|
|
|
\medskip
|
|
|
|
Details of this computation (including statistics on the partial
|
|
quotients) can be found in \cite{98}.
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Conclusions}
|
|
|
|
Although powerful, the binary splitting method has not been widely used.
|
|
Especially, no information existed on the applicability of this method.
|
|
|
|
In this note we presented a generic binary-splitting summation device for
|
|
evaluating two types of linearly convergent series. From this we derived simple
|
|
and computationally efficient algorithms for the evaluation of elementary
|
|
functions and constants. These algorithms work with {\em exact}
|
|
objects, making them suitable for use within Computer Algebra systems.
|
|
|
|
We have shown that the practical performance of our algorithms is
|
|
superior to current system implementations. In addition to existing methods,
|
|
our algorithms provide the possibility of checkpointing and parallelising.
|
|
These features can be useful for huge calculations, such as those done in
|
|
analytic number theory research.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Thanks}
|
|
|
|
The authors would like to thank J\"org Arndt, for pointing us to
|
|
chapter 10 in \cite{87}. We would also like to thank Richard P. Brent for
|
|
his comments and Hermann te Riele for providing us his program for the
|
|
continued fraction computation of Euler's constant.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\bibliography{binsplit}
|
|
\bibliographystyle{acm}
|
|
|
|
\end{document}
|