doc/ratseries/paper/binsplit.tex

   1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2 %%%
   3 %%% Version 12, 1999-09-07, Bruno
   4 %%% -----------------------------
   5 %%%
   6 %%% * Habe neue Referenzen zu den Karatsuba-Arbeiten hinzugefügt, entnommen
   7 %%%   dem großartigen Paper CECM-98-118 (98_118-Borwein-Bradley-Crandall.dvi).
   8 %%% * Habe in acm.bst, Funktion format.title, das change.case rausgenommen,
   9 %%%   das die Namen von Euler, Riemann u.a. in Kleinbuchstaben konvertiert
  10 %%%   hatte.
  11 %%%
  12 %%% Version 11, 1998-12-20, Bruno
  13 %%% -----------------------------
  14 %%%
  15 %%% * Habe Referenzen zu den Karatsuba-Arbeiten hinzugefügt.
  16 %%%
  17 %%% Version 10, 1998-03-10, Thomas & Bruno
  18 %%% --------------------------------------
  19 %%%
  20 %%% * Korrigiere die Formel für a(n) bei zeta(3).
  21 %%% * Schönheitsfehler im Literaturverzeichnis.
  22 %%%
  23 %%% Version 9, 1998-03-06, Thomas & Bruno
  24 %%% -------------------------------------
  25 %%%
  26 %%% * Schreibe \frac{1}{|x|} statt \frac{1}{x}.
  27 %%%
  28 %%% Version 8, 1998-01-16b, Thomas
  29 %%% ------------------------------
  30 %%%
  31 %%% * Drei Literaturverweise für LiDIA statt nur einem.
  32 %%%
  33 %%% Version 7, 1998-01-16a, Bruno
  34 %%% -----------------------------
  35 %%%
  36 %%% * Adresse: Praefix F fuer Frankreich
  37 %%% * Abstract: Erwaehne zeta(3)
  38 %%% * Kleinere Korrekturen der O()-Abschaetzungen
  39 %%%
  40 %%% Version 6, 1998-01-14, Thomas
  41 %%% -----------------------------
  42 %%%
  43 %%% * habe meine Adresse ge"andert.
  44 %%%
  45 %%% * habe Resultat f"ur die Euler Konstante + Kettenbruch eingef"uhrt
  46 %%%
  47 %%% Version 5, 1997-12-11, Thomas
  48 %%% -----------------------------
  49 %%%
  50 %%% * Habe die Anzahl der Scritte bei der Euler Konstante von
  51 %%%   x = ceiling(N log(2)/4)^2 auf x = ceiling((N+2) log(2)/4)^2
  52 %%%   hochgesetzt (Mail von Sofroniou)
  53 %%%
  54 %%% * Habe Kommentar eingef"ugt bzgl Checkpointing bei der Euler
  55 %%%   Konstante und Gamma(x) (Mail von Sofroniou)
  56 %%%
  57 %%% * Habe Section zu Geschwindigkeit von Maple gegen"uber Pari
  58 %%%   verbessert (mail von Laurent Bernardi).
  59 %%%
  60 %%% Version 4, 1997-01-09, Thomas
  61 %%% -----------------------------
  62 %%%
  63 %%% * Habe die Komplexitätsaussage für sinh, cosh ergänzt.
  64 %%% * Habe die Versionen der getesteten CAS ergänzt.
  65 %%%
  66 %%% Version 3, 1997-01-07, Bruno
  67 %%% ----------------------------
  68 %%%
  69 %%% * Meine Firma schreibt sich mit vier Grossbuchstaben.
  70 %%% * Apery schreibt sich m.W. mit einem Akzent.
  71 %%% * Die Fehlermeldung meldet n2-n1>0, nicht n1-n2>0.
  72 %%% * N -> \(N\) (zweimal)
  73 %%% * Leerzeile entfernt nach Display-Formeln, bei denen der Absatz
  74 %%%   weitergeht. Hat den Effekt eines \noindent.
  75 %%% * Im Abschnitt: arctan(x) for rational x: "another way" -> "the fastest way"
  76 %%% * "[87]" -> "\cite{87}"
  77 %%% * Das Cohen-Villegas-Zagier brauchen wir nun doch nicht zu zitieren.
  78 %%% * Die "Note:" am Ende von Abschnitt ueber die Gamma-Funktion optisch
  79 %%%   verkleinert.
  80 %%% * Die Formel fuer die hypergeometrische Funktionen optisch verschoenert.
  81 %%%   Andere Formeln ebenso.
  82 %%% * Figure 2, erste Spalte rechtsbuendig.
  83 %%% * "out performs" -> "outperforms"
  84 %%% * "the streng" -> "the strength"
  85 %%% * Hinweis auf die Parallelisierbarkeit im Abstract.
  86 %%% * Bibtex-Style gehackt, damit nicht jeder zweite Autor auf seine
  87 %%%   Anfangsbuchstaben verkuerzt und alleinstehende Autoren ihres
  88 %%%   Vornamens beraubt werden.
  89 %%%
  90 %%% Version 2, 1997-01-06, Thomas
  91 %%% -----------------------------
  92 %%%
  93 %%% * geänderte Abschnitte sind auskommentiert mit %%%. Alle
  94 %%%   Änderungen sind als Vorschlag zu verstehen. Der Grund
  95 %%%   wird im folgenden angegeben. Falls Du mit einer Änderung
  96 %%%   einverstanden bist, kannst Du alle %%%-Zeilen löschen.
  97 %%%
  98 %%% * Lyx defines wurden entfernt. Grund: das ISSAC-acmconf.sty
  99 %%%   erlaubt keine fremde macros. Übersetzung mit LaTeX geht.
 100 %%% * habe Keyboardumlaute (ä,ü,ö,ß) in LaTeX umgeschrieben.
 101 %%%   Grund: damit die Submission in einem file geht.
 102 %%% * Habe fontenc und psfig usepackage Befehle entfernt.
 103 %%%   Grund: fonts bestimmt acmconf.sty und keine Bilder vorhanden.
 104 %%% * Habe bibliography mit BibTeX (binsplit.bib) und acm.bst
 105 %%%   erstellt. Grund: wird von ISSAC '97 verlangt.
 106 %%% * Habe langen Formeln in einer eqnarray Umgebung gesteckt.
 107 %%%   Grund: acmconf.sty läuft im twocolumn-Modus. Lange Formeln
 108 %%%   haben die Ausgabe durcheinander gebracht.
 109 %%% * Habe Reihenfolge bei der Beschreibung der elementare
 110 %%%   Funktionen geändert, sodaß zuerst die rationale und dann
 111 %%%   die reelle version beschrieben wird. Grund: Einheitlichkeit.
 112 %%% * Habe sinh mit binary-splitting gegen sinh mit exp-Berechnung
 113 %%%   getestet. Sie sind ungefähr gleich gut auf meinen Pentium,
 114 %%%   mir machen "Wackler" beim cosh. cosh ist ab und zu, sowohl
 115 %%%   bei kleiner als auch bei große Präzision langsamer. Habe
 116 %%%   dies und dem Abschnitt sinh, cosh ausgefüllt. Grund: es hat
 117 %%%   gefehlt.
 118 %%% * Habe artanh Abschnitt entfernt. Grund: ich habe in der
 119 %%%   Einleitung der elementaren Funktionen darauf verwiesen, daß
 120 %%%   man die Berechnung anderer Funktionen (wie artanh) auf die
 121 %%%   hier erwähnte zurückführen oder auf analoger Weise
 122 %%%   implementieren kann. Ich denke man braucht nicht alles explizit
 123 %%%   anzugeben.
 124 %%%
 125 %%% * Habe Dein Dankeschön an mich entfernt.
 126 %%% * Habe Abschnitt über Konvergenzbeschleunigung entfernt.
 127 %%%   Grund: das geht in Dein MathComp paper.
 128 %%%
 129 %%% * Habe neue Formel für pi eingefügt. Grund: einfacher,
 130 %%%   effizienter und stimmt mit der angegebenen Referenz
 131 %%%   überein.
 132 %%% * Habe die Berechnung der Apery Konstante angepasst.
 133 %%%   Grund: die hier angegebenen Formel wurde mit einer
 134 %%%   umgeformten Reihe berechnet. Wenn man dieses nicht
 135 %%%   kennt, wirkt es verwirrend. Keine Effizienz-steigerung.
 136 %%% * Habe die Beschreibung für die erste version der Euler
 137 %%%   Konstante entfernt. Grund: wird von der zweiten version
 138 %%%   in jeder Hinsicht (Beweis, Effizienz) gedeckt.
 139 %%% * Habe Abschnitte über Checkpointing und Parallelisierung
 140 %%%   eingefügt. Ein Beispiel über die Wirksamkeit habe ich
 141 %%%   bei der Apery Konstante angegeben. Grund: damit können wir
 142 %%%   das Paper auch bei PASCO '97 einreichen.
 143 %%%
 144 %%% * Habe Beispiel-C++-Implementierung für abpq Reihen eingefügt.
 145 %%%   Grund: zeigen wie einfach es ist wenn man die Formeln hat ;-)
 146 %%% * Habe Beispiel-Implementierung für abpqcd Reihen eingefügt.
 147 %%%   Grund: dito
 148 %%% * Habe Computational results und Conclusions Abschnitt eingefügt.
 149 %%% * Habe die Namen der Konstanten (C, G, ...) and die entsprechenden
 150 %%%   Abschnitten eingefügt. Grund: diese Namen werden bei den
 151 %%%   Tabellen im Abschnitt Computational results benutzt.
 152 %%% * Habe Verweis an LiDIA eingefügt. Grund: wird bei Computational
 153 %%%   results erw\"ahnt.
 154 %%%
 155 %%% Version 1, 1996-11-30, Bruno
 156 %%%
 157 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 158
 159 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 160 %%% Bruno Haible, Thomas Papanikolaou.                                     %%%%
 161 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 162
 163 \documentstyle{acmconf}
 164
 165 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 166 % Plain TeX macros
 167 %\catcode`@=11 % @ ist ab jetzt ein gewoehnlicher Buchstabe
 168 %\def\@Re{\qopname@{Re}}      \def\re#1{{\@Re #1}}
 169 %\def\@Im{\qopname@{Im}}      \def\im#1{{\@Im #1}}
 170 %\catcode`@=12 % @ ist ab jetzt wieder ein Sonderzeichen
 171
 172 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 173 % LaTeX2e macros
 174 \catcode`@=11 % @ ist ab jetzt ein gewoehnlicher Buchstabe
 175 \def\re{\mathop{\operator@font Re}\nolimits}
 176 \def\im{\mathop{\operator@font Im}\nolimits}
 177 \def\artanh{\mathop{\operator@font artanh}\nolimits}
 178 \catcode`@=12 % @ ist ab jetzt wieder ein Sonderzeichen
 179
 180 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 181 \begin{document}
 182
 183 \title{Fast multiprecision evaluation of series of rational numbers}
 184
 185 \author{
 186 \begin{tabular}{ccc}
 187 {Bruno Haible} & \hspace*{2cm} & {Thomas Papanikolaou}\\
 188 {\normalsize ILOG} && {\normalsize Laboratoire A2X}\\
 189 {\normalsize 9, rue de Verdun} && {\normalsize 351, cours de la Lib\'eration}\\
 190 {\normalsize F -- 94253 Gentilly Cedex} && {\normalsize F -- 33405 Talence Cedex}\\
 191 {\normalsize {\tt haible@ilog.fr}} && {\normalsize {\tt papanik@math.u-bordeaux.fr}}\\
 192 \end{tabular}
 193 }
 194
 195 \maketitle
 196
 197 \begin{abstract}
 198
 199 We describe two techniques for fast multiple-precision evaluation of linearly
 200 convergent series, including power series and Ramanujan series. The computation
 201 time for \(N\) bits is  \( O((\log N)^{2}M(N)) \), where \( M(N) \) is the time
 202 needed to multiply two \(N\)-bit numbers. Applications include fast algorithms
 203 for elementary functions,  \(\pi\), hypergeometric functions at rational points,
 204 $\zeta(3)$, Euler's, Catalan's and Ap{\'e}ry's constant. The algorithms are
 205 suitable for parallel computation.
 206
 207 \end{abstract}
 208
 209 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 210 \section{Introduction}
 211
 212 Multiple-precision evaluation of real numbers has become efficiently possible
 213 since Sch\"onhage and Strassen \cite{71} have showed that the bit complexity of
 214 the multiplication of two \(N\)-bit numbers is
 215 \( M(N)=O(N\:\log N\:\log\log N) \).
 216 This is not only a theoretical result; a C++ implementation \cite{96a} can
 217 exploit this already for  \( N=40000 \) bits. Algorithms for computing
 218 elementary functions (exp, log, sin, cos, tan, asin, acos, atan, sinh, cosh,
 219 tanh, arsinh, arcosh, artanh) have appeared in \cite{76b}, and a remarkable
 220 algorithm for \( \pi \) was found by Brent and Salamin \cite{76c}.
 221
 222 However, all these algorithms suffer from the fact that calculated results
 223 are not reusable, since the computation is done using real arithmetic (using
 224 exact rational arithmetic would be extremely inefficient). Therefore functions
 225 or constants have to be recomputed from the scratch every time higher precision
 226 is required.
 227
 228 In this note, we present algorithms for fast computation of sums of the form
 229
 230 \[S=\sum _{n=0}^{\infty }R(n)F(0)\cdots F(n)\]
 231 where \( R(n) \) and \( F(n) \) are rational functions in \( n \) with rational
 232 coefficients, provided that this sum is linearly convergent, i.e. that the
 233 \( n \)-th term is \( O(c^{-n}) \) with  \( c>1 \). Examples include elementary
 234 and hypergeometric functions at rational points in the {\em interior} of the
 235 circle of convergence, as well as \( \pi  \) and Euler's, Catalan's and
 236 Ap{\'e}ry's constants.
 237
 238 The presented algorithms are {\em easy to implement} and {\em extremely
 239 efficient}, since they take advantage of pure integer arithmetic. The
 240 calculated results are {\em exact}, making {\em checkpointing} and
 241 {\em reuse} of computations possible. Finally,
 242 the computation of our algorithms {\em can be easily parallelised}.
 243
 244 After publishing the present paper, we were informed that the results of
 245 section 2 were already published by E.~Karatsuba in \cite{91,91b,93,95c}.
 246
 247 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 248 \section{Evaluation of linearly convergent series}
 249
 250 The technique presented here applies to all linearly convergent sums of the
 251 form
 252
 253 \[ S=\sum ^{\infty }_{n=0}
 254 \frac{a(n)}{b(n)}\frac{p(0)\cdots p(n)}{q(0)\cdots q(n)}\]
 255 where  \( a(n) \),  \( b(n) \),  \( p(n) \),  \( q(n) \) are integers with
 256 \( O(\log n) \) bits. The most often used case is that  \( a(n) \), \( b(n) \),
 257 \( p(n) \),  \( q(n) \) are polynomials in  \( n \) with integer coefficients.
 258
 259 \begin{description}
 260 \item [Algorithm:]~
 261 \end{description}
 262
 263 Given two index bounds  \( n_{1} \) and  \( n_{2} \), consider the partial sum
 264
 265 \[
 266 S=\sum _{n_{1}\leq n<n_{2}}
 267 \frac{a(n)}{b(n)}\frac{p(n_{1})\cdots p(n)}{q(n_{1})\cdots q(n)}\]
 268
 269 It is not computed directly. Instead, we compute the integers
 270 \( P={p(n_{1})}\cdots {p(n_{2}-1)} \),  \( Q={q(n_{1})}\cdots {q(n_{2}-1)} \),
 271 \( B={b(n_{1})}\cdots {b(n_{2}-1)} \) and  \( T=BQS \). If  \( n_{2}-n_{1}<5 \),
 272 these are computed directly.  If  \( n_{2}-n_{1}\geq 5 \), they are computed
 273 using {\em binary splitting}: Choose an index  \( n_{m} \) in the middle of
 274 \( n_{1} \) and \( n_{2} \), compute the components \( P_{l} \), \( Q_{l} \),
 275 \( B_{l} \), \( T_{l} \) belonging to the interval \( n_{1}\leq n<n_{m} \),
 276 compute the components  \( P_{r} \),  \( Q_{r} \),  \( B_{r} \),  \( T_{r} \)
 277 belonging to the interval  \( n_{m}\leq n<n_{2} \), and set
 278 \( P=P_{l}P_{r} \),  \( Q=Q_{l}Q_{r} \),  \( B=B_{l}B_{r} \) and
 279 \( T=B_{r}Q_{r}T_{l}+B_{l}P_{l}T_{r} \).
 280
 281 Finally, this algorithm is applied to  \( n_{1}=0 \) and
 282 \( n_{2}=n_{\max }=O(N) \), and a final floating-point division
 283 \( S=\frac{T}{BQ} \) is performed.
 284
 285 \begin{description}
 286 \item [Complexity:]~
 287 \end{description}
 288
 289 The bit complexity of computing  \( S \) with  \( N \) bits of precision is
 290 \( O((\log N)^{2}M(N)) \).
 291
 292 \begin{description}
 293 \item [Proof:]~
 294 \end{description}
 295
 296 Since we have assumed the series to be linearly convergent, the  \( n \)-th
 297 term is  \( O(c^{-n}) \) with  \( c>1 \). Hence choosing
 298 \( n_{\max }=N\frac{\log 2}{\log c}+O(1) \) will ensure that the round-off
 299 error is  \( <2^{-N} \). By our assumption that  \( a(n) \),  \( b(n) \),
 300 \( p(n) \),  \( q(n) \) are integers with  \( O(\log n) \) bits, the integers
 301 \( P \),  \( Q \),  \( B \),  \( T \) belonging to the interval
 302 \( n_{1}\leq n<n_{2} \) all have \( O((n_{2}-n_{1})\log n_{2}) \) bits.
 303
 304 The algorithm's recursion depth is  \( d=\frac{\log n_{\max }}{\log 2}+O(1) \).
 305 At recursion depth  \( k \) (\( 1\leq k\leq d \)), integers having each
 306 \( O(\frac{n_{\max }}{2^{k}}\log n_{\max }) \) bits are multiplied. Thus,
 307 the entire computation time \( t \) is
 308 \begin{eqnarray*}
 309 t &=& \sum ^{d}_{k=1}
 310 2^{k-1}O\left( M\left( \frac{n_{\max }}{2^{k}}\log n_{\max }\right)\right)\\
 311 &=& \sum ^{d}_{k=1} O\left( M\left( n_{\max }\log n_{\max }\right) \right)\\
 312 &=& O(\log n_{\max }M(n_{\max }\log n_{\max }))
 313 \end{eqnarray*}
 314 Because of  \( n_{\max }=O( \frac{N}{\log c}) \) and
 315 \begin{eqnarray*}
 316 M\left(\frac{N}{\log c} \log \frac{N}{\log c}\right)
 317 &=& O\left(\frac{1}{\log c} N\, (\log N)^{2}\, \log \log N\right)\\
 318 &=& O\left(\frac{1}{\log c} \log N\, M(N)\right)
 319 \end{eqnarray*}
 320 we have
 321 \[ t = O\left(\frac{1}{\log c} (\log N)^{2}M(N)\right) \]
 322 Considering \(c\) as constant, this is the desired result.
 323
 324 \begin{description}
 325 \item [Checkpointing/Parallelising:]~
 326 \end{description}
 327
 328 A checkpoint can be easily done by storing the (integer) values of
 329 \( n_1 \),  \( n_2 \), \( P \),  \( Q \),  \( B \) and \( T \).
 330 Similarly, if \( m \) processors are available, then the interval
 331 \( [0,n_{max}] \) can be divided into \( m \) pieces of length
 332 \( l = \lfloor n_{max}/m \rfloor \). After each processor \( i \) has
 333 computed the sum of its interval \( [il,(i+1)l] \), the partial sums are
 334 combined to the final result using the rules described above.
 335
 336 \begin{description}
 337 \item [Note:]~
 338 \end{description}
 339
 340 For the special case  \( a(n)=b(n)=1 \), the binary splitting algorithm has
 341 already been documented in \cite{76a}, section 6, and \cite{87}, section 10.2.3.
 342
 343 Explicit computation of  \( P \),  \( Q \),  \( B \),  \( T \) is only required
 344 as a recursion base, for  \( n_{2}-n_{1}<2 \), but avoiding recursions for
 345 \( n_{2}-n_{1}<5 \) gains some percent of execution speed.
 346
 347 The binary splitting algorithm is asymptotically faster than step-by-step
 348 evaluation of the sum -- which has binary complexity  \( O(N^{2}) \) -- because
 349 it pushes as much multiplication work as possible to the region where
 350 multiplication becomes efficient. If the multiplication were implemented
 351 as an  \( M(N)=O(N^{2}) \) algorithm, the binary splitting algorithm would
 352 provide no speedup over step-by-step evaluation.
 353
 354 \begin{description}
 355 \item [Implementation:]~
 356 \end{description}
 357
 358 In the following we present a simplified C++ implementation of
 359 the above algorithm\footnote{A complete implementation can be found in
 360 CLN \cite{96a}. The implementation of the binary-splitting method will
 361 be also available in {\sf LiDIA-1.4}}. The initialisation is done by a
 362 structure {\tt abpq\_series} containing arrays {\tt a}, {\tt b}, {\tt p}
 363 and {\tt q} of multiprecision integers ({\tt bigint}s). The values of
 364 the arrays at the index \( n \) correspond to the values of the functions
 365 \( a \), \( b \), \( p \) and \( q \) at the integer point \( n \).
 366 The (partial) results of the algorithm are stored in the
 367 {\tt abpq\_series\_result} structure.
 368
 369 \begin{verbatim}
 370 // abpq_series is initialised by user
 371 struct { bigint *a, *b, *p, *q;
 372        } abpq_series;
 373
 374 // abpq_series_result holds the partial results
 375 struct { bigint P, Q, B, T;
 376        } abpq_series_result;
 377
 378 // binary splitting summation for abpq_series
 379 void sum_abpq(abpq_series_result & r,
 380               int n1, int n2,
 381               const abpq_series & arg)
 382 {
 383   // check the length of the summation interval
 384   switch (n2 - n1)
 385   {
 386     case 0:
 387       error_handler("summation device",
 388                "sum_abpq:: n2-n1 should be > 0.");
 389       break;
 390
 391     case 1: // the result at the point n1
 392       r.P = arg.p[n1];
 393       r.Q = arg.q[n1];
 394       r.B = arg.b[n1];
 395       r.T = arg.a[n1] * arg.p[n1];
 396       break;
 397
 398     // cases 2, 3, 4 left out for simplicity
 399
 400     default: // the general case
 401
 402       // the left and the right partial sum
 403       abpq_series_result L, R;
 404
 405       // find the middle of the interval
 406       int nm = (n1 + n2) / 2;
 407
 408       // sum left side
 409       sum_abpq(L, n1, nm, arg);
 410
 411       // sum right side
 412       sum_abpq(R, nm, n2, arg);
 413
 414       // put together
 415       r.P = L.P * R.P;
 416       r.Q = L.Q * R.Q;
 417       r.B = L.B * R.B;
 418       r.T = R.B * R.Q * L.T + L.B * L.P * R.T;
 419       break;
 420   }
 421 }
 422 \end{verbatim}
 423
 424 Note that the multiprecision integers could be replaced here by integer
 425 polynomials, or by any other ring providing the operators \( = \) (assignment),
 426 \( + \) (addition) and \( * \) (multiplication). For example, one could regard
 427 a bivariate polynomial over the integers as a series over the second variable,
 428 with polynomials over the first variable as its coefficients. This would result
 429 an accelerated algorithm for summing bivariate (and thus multivariate)
 430 polynomials.
 431
 432 \subsection{Example: The factorial}
 433
 434 This is the most classical example of the binary splitting algorithm and was
 435 probably known long before \cite{87}.
 436
 437 Computation of the factorial is best done using the binary splitting algorithm,
 438 combined with a reduction of the even factors into odd factors and
 439 multiplication with a power of 2, according to the formula
 440
 441 \[
 442 n!=2^{n-\sigma _{2}(n)}\cdot \prod _{k\geq 1}
 443 \left( \prod _{\frac{n}{2^{k}}<2m+1\leq \frac{n}{2^{k-1}}}(2m+1)\right) ^{k}\]
 444 and where the products
 445 \[
 446 P(n_{1},n_{2})=\prod _{n_{1}<m\leq n_{2}}(2m+1)\]
 447 are evaluated according to the binary splitting algorithm:
 448 \( P(n_{1},n_{2})=P(n_{1},n_{m})P(n_{m},n_{2}) \) with
 449 \( n_{m}=\left\lfloor \frac{n_{1}+n_{2}}{2}\right\rfloor  \)
 450 if  \( n_{2}-n_{1}\geq 5 \).
 451
 452 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 453 \subsection{Example: Elementary functions at rational points}
 454
 455 The binary splitting algorithm can be applied to the fast computation of the
 456 elementary functions at rational points  \( x=\frac{u}{v} \), simply
 457 by using the power series. We present how this can be done for
 458 \( \exp (x) \), \( \ln (x) \), \( \sin (x) \), \( \cos (x) \),
 459 \( \arctan (x) \), \( \sinh (x) \) and \( \cosh (x) \). The calculation
 460 of other elementary functions is similar (or it can be reduced to the
 461 calculation of these functions).
 462
 463
 464 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 465 \subsubsection{\( \exp (x) \) for rational \( x \)}
 466
 467 This is a direct application of the above algorithm with  \( a(n)=1 \),
 468 \( b(n)=1 \), \( p(0)=q(0)=1 \), and \( p(n)=u \), \( q(n)=nv \) for \( n>0 \).
 469 Because the series is not only linearly convergent -- \( \exp (x) \) is an
 470 entire function --,  \( n_{\max }=O(\frac{N}{\log N + \log \frac{1}{|x|}}) \),
 471 hence the bit complexity is
 472 \[ O\left(\frac{(\log N)^2}{\log N + \log \frac{1}{|x|}} M(N)\right) \]
 473 Considering \(x\) as constant, this is  \( O(\log N\: M(N)) \).
 474
 475
 476 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 477 \subsubsection{\( \exp (x) \) for real \( x \)}
 478
 479 This can be computed using the addition theorem for exp, by a trick due to
 480 Brent \cite{76a} (see also \cite{87}, section 10.2, exercise 8). Write
 481
 482 \[
 483 x=x_{0}+\sum _{k=0}^{\infty }\frac{u_{k}}{v_{k}}\]
 484 with  \( x_{0} \) integer,  \( v_{k}=2^{2^{k}} \) and
 485 \( |u_{k}|<2^{2^{k-1}} \), and compute
 486 \[
 487 \exp (x)=
 488 \exp (x_{0})\cdot \prod _{k\geq 0}\exp \left( \frac{u_{k}}{v_{k}}\right) \]
 489
 490 This algorithm has bit complexity
 491 \[ O\left(\sum\limits_{k=0}^{O(\log N)} \frac{(\log N)^2}{\log N + 2^k} M(N)\right)
 492   = O((\log N)^{2}M(N)) \]
 493
 494
 495 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 496 \subsubsection{ \( \ln (x) \) for rational  \( x \)}
 497
 498 For rational  \( |x-1|<1 \), the binary splitting algorithm can also be applied
 499 directly to the power series for  \( \ln (x) \). Write  \( x-1=\frac{u}{v} \)
 500 and compute the series with  \( a(n)=1 \),  \( b(n)=n+1 \),  \( q(n)=v \),
 501 \( p(0)=u \), and  \( p(n)=-u \) for  \( n>0 \).
 502
 503 This algorithm has bit complexity  \( O((\log N)^{2}M(N)) \).
 504
 505
 506 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 507 \subsubsection{\( \ln (x) \) for real \( x \)}
 508
 509 This can be computed using the ``inverse'' Brent trick:
 510
 511 Start with  \( y:=0 \).
 512
 513 As long as  \( x\neq 1 \) within the actual precision, choose  \( k \)
 514 maximal with  \( |x-1|<2^{-k} \). Put \( z=2^{-2k}\left[ 2^{2k}(x-1)\right]  \),
 515 i.e. let  \( z \) contain the first  \( k \) significant bits of  \( x-1 \).
 516 \( z \) is a good approximation for  \( \ln (x) \). Set  \( y:=y+z \) and
 517 \( x:=x\cdot \exp (-z) \).
 518
 519 Since  \( x\cdot \exp (y) \) is an invariant of the algorithm, the final
 520 \( y \) is the desired value  \( \ln (x) \).
 521
 522 This algorithm has bit complexity
 523 \[ O\left(\sum\limits_{k=0}^{O(\log N)} \frac{(\log N)^2}{\log N + 2^k} M(N)\right)
 524   = O((\log N)^{2}M(N)) \]
 525
 526
 527 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 528 \subsubsection{ \( \sin (x) \),  \( \cos (x) \) for rational  \( x \)}
 529
 530 These are direct applications of the binary splitting algorithm: For
 531 \( \sin (x) \), put  \( a(n)=1 \),  \( b(n)=1 \),  \( p(0)=u \),
 532 \( q(0)=v \), and  \( p(n)=-u^{2} \),  \( q(n)=(2n)(2n+1)v^{2} \) for
 533 \( n>0 \). For  \( \cos (x) \), put  \( a(n)=1 \),  \( b(n)=1 \),
 534 \( p(0)=1 \),  \( q(0)=1 \), and  \( p(n)=-u^{2} \),  \( q(n)=(2n-1)(2n)v^{2} \)
 535 for  \( n>0 \). Of course, when both  \( \sin (x) \) and  \( \cos (x) \) are
 536 needed, one should only compute
 537  \( \sin (x) \) this way, and then set
 538 \( \cos (x)=\pm \sqrt{1-\sin (x)^{2}} \). This is a 20\% speedup at least.
 539
 540 The bit complexity of these algorithms is  \( O(\log N\: M(N)) \).
 541
 542
 543 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 544 \subsubsection{ \( \sin (x) \),  \( \cos (x) \) for real  \( x \)}
 545
 546 To compute  \( \cos (x)+i\sin (x)=\exp (ix) \) for real  \( x \), again the
 547 addition theorems and Brent's trick
 548 can be used. The resulting algorithm has bit complexity
 549 \( O((\log N)^{2}M(N)) \).
 550
 551
 552 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 553 \subsubsection{ \( \arctan (x) \) for rational  \( x \)}
 554
 555 For rational  \( |x|<1 \), the fastest way to compute  \( \arctan (x) \) with
 556 bit complexity  \( O((\log N)^{2}M(N)) \) is
 557 to apply the binary splitting algorithm directly to the power series
 558 for  \( \arctan (x) \). Put  \( a(n)=1 \),  \( b(n)=2n+1 \),  \( q(n)=1 \),
 559 \( p(0)=x \) and  \( p(n)=-x^{2} \) for  \( n>0 \).
 560
 561
 562 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 563 \subsubsection{ \( \arctan (x) \) for real  \( x \)}
 564
 565 This again can be computed using the ``inverse'' Brent trick:
 566
 567 Start out with  \( z:=\frac{1}{\sqrt{1+x^{2}}}+i\frac{x}{\sqrt{1+x^{2}}} \)
 568 and  \( \varphi :=0 \). During the algorithm  \( z \) will be a complex number
 569 with  \( |z|=1 \) and  \( \re (z)>0 \).
 570
 571 As long as  \( \im (z)\neq 0 \) within the actual precision, choose  \( k \)
 572 maximal with  \( |\im (z)|<2^{-k} \).
 573 Put  \( \alpha =2^{-2k}\left[ 2^{2k}\im (z)\right]  \), i.e. let  \( \alpha  \)
 574 contain the first  \( k \) significant bits of  \( \im (z) \).  \( \alpha  \)
 575 is a good approximation for  \( \arcsin (\im (z)) \). Set
 576 \( \varphi :=\varphi +\alpha  \) and  \( z:=z\cdot \exp (-i\alpha ) \).
 577
 578 Since  \( z\cdot \exp (i\varphi ) \) is an invariant of the algorithm, the
 579 final  \( \varphi  \) is the desired
 580 value  \( \arcsin \frac{x}{\sqrt{1+x^{2}}} \).
 581
 582 This algorithm has bit complexity
 583 \[ O\left(\sum\limits_{k=0}^{O(\log N)} \frac{(\log N)^2}{\log N + 2^k} M(N)\right)
 584   = O((\log N)^{2}M(N)) \]
 585
 586
 587 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 588 \subsubsection{\( \sinh (x) \), \( \cosh (x) \) for rational and real \( x \)}
 589
 590 These can be computed by similar algorithms as  \( \sin (x) \) and
 591 \( \cos (x) \) above, with the same asymptotic bit complexity. The
 592 standard computation, using \( \exp (x) \) and its reciprocal (calculated
 593 by the Newton method) results also to the same complexity and works equally
 594 well in practice.
 595
 596 The bit complexity of these algorithms is  \( O(\log N\: M(N)) \) for rational
 597 \( x \) and \( O((\log N)^{2}M(N)) \) for real \( x \).
 598
 599 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 600 \subsection{Example: Hypergeometric functions at rational points}
 601
 602 The binary splitting algorithm is well suited for the evaluation of a
 603 hypergeometric series
 604
 605 \[
 606 F\left( \begin{array}{ccc}
 607 a_{1}, & \ldots , & a_{r}\\
 608 b_{1}, & \ldots , & b_{s}
 609 \end{array}
 610 \big| x\right) =\sum ^{\infty }_{n=0}
 611 \frac{a_{1}^{\overline{n}}\cdots
 612 a_{r}^{\overline{n}}}{b_{1}^{\overline{n}}\cdots b_{s}^{\overline{n}}}x^{n}\]
 613 with rational coefficients  \( a_{1} \), ...,  \( a_{r} \),  \( b_{1} \),
 614 ...,  \( b_{s} \) at a rational point  \( x \) in the interior of the circle of
 615 convergence. Just put  \( a(n)=1 \),  \( b(n)=1 \),  \( p(0)=q(0)=1 \), and
 616 \( \frac{p(n)}{q(n)}=\frac{(a_{1}+n-1)\cdots
 617 (a_{r}+n-1)x}{(b_{1}+n-1)\cdots (b_{s}+n-1)} \) for  \( n>0 \). The evaluation
 618 can thus be done with
 619 bit complexity  \( O((\log N)^{2}M(N)) \) for
 620 \( r=s \) and  \( O(\log N\: M(N)) \) for  \( r<s \).
 621
 622 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 623 \subsection{Example:  \( \pi  \)}
 624
 625 The Ramanujan series for  \( \pi  \)
 626 \[
 627 \frac{1}{\pi }=\frac{12}{C^{3/2}}\sum^{\infty }_{n=0}
 628 \frac{(-1)^n(6n)!(A+nB)}{(3n)!n!^{3}C^{3n}}\]
 629 with  \( A=13591409 \),  \( B=545140134 \), \( C=640320 \) found by the
 630 Chudnovsky's \footnote{A special case of \cite{87}, formula (5.5.18),
 631 with N=163.} and which is used by the {\sf LiDIA} \cite{95,97,97b} and the Pari
 632 \cite{95b} system to compute \( \pi \), is usually written as an algorithm
 633 of bit complexity  \( O(N^{2}) \). It is, however, possible to apply
 634 binary splitting to the sum. Put \( a(n)=A+nB \),  \( b(n)=1 \),
 635 \( p(0)=1 \),  \( q(0)=1 \), and \( p(n)=-(6n-5)(2n-1)(6n-1) \),
 636 \( q(n)=n^{3}C^3/24 \) for  \( n>0 \).  This reduces the complexity to
 637 \( O((\log N)^{2}M(N)) \). Although this is theoretically slower than
 638 Brent-Salamin's quadratically convergent iteration, which has a bit
 639 complexity of  \( O(\log N\: M(N)) \), in practice the binary splitted
 640 Ramanujan sum is three times faster than Brent-Salamin, at least in the
 641 range from  \( N=1000 \) bits to \( N=1000000 \) bits.
 642
 643
 644 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 645 %%% \subsection{Example: Catalan's constant}
 646
 647 \subsection{Example: Catalan's constant \( G \)}
 648
 649 A linearly convergent sum for Catalan's constant
 650 \[
 651 G:=\sum ^{\infty }_{n=0}\frac{(-1)^{n}}{(2n+1)^{2}}\]
 652 is given in \cite{87}, p. 386:
 653 \[
 654 G = \frac{3}{8}\sum ^{\infty }_{n=0}\frac{1}{{2n \choose n} (2n+1)^{2}}
 655     +\frac{\pi }{8}\log (2+\sqrt{3})
 656 \]
 657
 658 The series is summed using binary splitting, putting \( a(n)=1 \),
 659 \( b(n)=2n+1 \),  \( p(0)=1 \),  \( q(0)=1 \), and
 660 \( p(n)=n \),  \( q(n)=2(2n+1) \) for  \( n>0 \). Thus
 661 \( G \) can be computed with bit complexity  \( O((\log N)^{2}M(N)) \).
 662
 663 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 664 \subsection{Example: The Gamma function at rational points}
 665
 666 For evaluating  \( \Gamma (s) \) for rational  \( s \), we first reduce \( s \)
 667 to the range \( 1\leq s\leq 2 \) by the formula \( \Gamma (s+1)=s\Gamma (s) \).
 668 To compute  \( \Gamma (s) \) with a precision of  \( N \) bits, choose a
 669 positive integer  \( x \) with  \( xe^{-x}<2^{-N} \). Partial integration lets
 670 us write
 671
 672 \begin{eqnarray*}
 673 \Gamma (s)&=& \int ^{\infty }_{0}e^{-t}t^{s-1}dt\\
 674           &=& x^{s}e^{-x}\:\sum ^{\infty }_{n=0}
 675              \frac{x^{n}}{s(s+1)\cdots (s+n)}
 676              +\int^{\infty }_{x}e^{-t}t^{s-1}dt\\
 677 \end{eqnarray*}
 678 The last integral is  \( <xe^{-x}<2^{-N} \). The series is evaluated as a
 679 hypergeometric function (see above); the number of terms to be summed up is
 680 \( O(N) \), since \( x=O(N) \). Thus the entire computation can be done with
 681 bit complexity  \( O((\log N)^{2}M(N)) \).
 682
 683 \begin{description}
 684 \item [Note:]~
 685 \end{description}
 686
 687 This result is already mentioned in \cite{76b}.
 688
 689 E.~Karatsuba \cite{91} extends this result to \( \Gamma (s) \) for algebraic
 690 \( s \).
 691
 692 For \( \Gamma (s) \) there is no checkpointing possible because of the
 693 dependency on \( x \) in the binary splitting.
 694
 695
 696 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 697 \subsection{Example: The Riemann Zeta value \( \zeta (3) \) \label{zeta}}
 698
 699 Recently, Doron Zeilberger's method of ``creative telescoping'' has been
 700 applied to Riemann's zeta function at  \( s=3 \) (see \cite{96c}), which is
 701 also known as {\em Ap{\'e}ry's constant}:
 702 \[
 703 \zeta (3)=
 704 \frac{1}{2}\sum ^{\infty }_{n=1}
 705                \frac{(-1)^{n-1}(205n^{2}-160n+32)}{n^{5}{2n \choose n}^{5}}
 706 \]
 707
 708 This sum consists of three hypergeometric series. Binary splitting can also be
 709 applied directly, by putting  \( a(n)=205n^{2}+250n+77 \), \( b(n)=1 \),
 710 \( p(0)=1 \), \( p(n)=-n^{5} \) for  \( n>0 \), and \( q(n)=32(2n+1)^{5} \).
 711 Thus the bit complexity of computing \( \zeta (3) \) is
 712 \( O((\log N)^{2}M(N)) \).
 713
 714 \begin{description}
 715 \item [Note:]~
 716 \end{description}
 717
 718 Using this the authors were able to establish a new record in the
 719 calculation of \( \zeta (3) \) by computing 1,000,000 decimals \cite{96d}.
 720 The computation took 8 hours on a Hewlett Packard 9000/712 machine. After
 721 distributing on a cluster of 4 HP 9000/712 machines the same computation
 722 required only 2.5 hours. The half hour was necessary for reading the partial
 723 results from disk and for recombining them. Again, we have used binary-splitting
 724 for recombining: the 4 partial result produced 2 results which were combined
 725 to the final 1,000,000 decimals value of \( \zeta (3) \).
 726
 727 This example shows the importance of checkpointing. Even if a machine crashes
 728 through the calculation, the results of the other machines are still usable.
 729 Additionally, being able to parallelise the computation reduced the computing
 730 time dramatically.
 731
 732 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 733 \section{Evaluation of linearly convergent series of sums}
 734
 735 The technique presented in the previous section also applies to all linearly
 736 convergent sums of the form
 737
 738 \[
 739 U=\sum ^{\infty }_{n=0}\frac{a(n)}{b(n)}\left( \frac{c(0)}{d(0)}+\cdots
 740 +\frac{c(n)}{d(n)}\right) \frac{p(0)\cdots p(n)}{q(0)\cdots q(n)}\]
 741  where  \( a(n) \),  \( b(n) \),  \( c(n) \),  \( d(n) \),  \( p(n) \),
 742 \( q(n) \) are integers with  \( O(\log n) \) bits. The most often
 743 used case is again that  \( a(n) \),  \( b(n) \),  \( c(n) \),  \( d(n) \),
 744 \( p(n) \),  \( q(n) \) are polynomials in  \( n \) with
 745 integer coefficients.
 746
 747 \begin{description}
 748 \item [Algorithm:]~
 749 \end{description}
 750
 751 Given two index bounds  \( n_{1} \)and  \( n_{2} \), consider the partial sums
 752 \[
 753 S=\sum _{n_{1}\leq n<n_{2}}\frac{a(n)}{b(n)}
 754 \frac{p(n_{1})\cdots p(n)}{q(n_{1})\cdots q(n)}\]
 755 and
 756 \[
 757 U=\sum _{n_{1}\leq n<n_{2}}\frac{a(n)}{b(n)}
 758 \left( \frac{c(n_{1})}{d(n_{1})}+\cdots +\frac{c(n)}{d(n)}\right)
 759 \frac{p(n_{1})\cdots p(n)}{q(n_{1})\cdots q(n)}\]
 760
 761
 762 As above, we compute the integers  \( P={p(n_{1})}\cdots {p(n_{2}-1)} \),
 763 \( Q={q(n_{1})}\cdots {q(n_{2}-1)} \),  \( B={b(n_{1})}\cdots {b(n_{2}-1)} \),
 764 \( T=BQS \),  \( D={d(n_{1})}\cdots {d(n_{2}-1)} \),
 765 \( C=D\left( \frac{c(n_{1})}{d(n_{1})}+\cdots +
 766 \frac{c(n_{2}-1)}{d(n_{2}-1)}\right)  \) and  \( V=DBQU \).
 767 If  \( n_{2}-n_{1}<4 \), these
 768 are computed directly. If  \( n_{2}-n_{1}\geq 4 \), they are computed using
 769 {\em binary splitting}: Choose an index  \( n_{m} \) in the middle of
 770 \( n_{1} \)and  \( n_{2} \), compute the components \( P_{l} \),  \( Q_{l} \),
 771 \( B_{l} \),  \( T_{l} \),  \( D_{l} \),  \( C_{l} \),  \( V_{l} \) belonging
 772 to the interval  \( n_{1}\leq n<n_{m} \), compute the components \( P_{r} \),
 773 \( Q_{r} \),  \( B_{r} \),  \( T_{r} \),  \( D_{r} \),  \( C_{r} \),
 774 \( V_{r} \) belonging to the interval  \( n_{m}\leq n<n_{2} \), and set
 775 \( P=P_{l}P_{r} \),  \( Q=Q_{l}Q_{r} \),  \( B=B_{l}B_{r} \),
 776 \( T=B_{r}Q_{r}T_{l}+B_{l}P_{l}T_{r} \),  \( D=D_{l}D_{r} \),
 777 \( C=C_{l}D_{r}+C_{r}D_{l} \) and
 778 \( V=D_{r}B_{r}Q_{r}V_{l}+D_{r}C_{l}B_{l}P_{l}T_{r}+D_{l}B_{l}P_{l}V_{r} \).
 779
 780 Finally, this algorithm is applied to \( n_{1}=0 \) and
 781 \( n_{2}=n_{\max}=O(N) \), and final floating-point divisions
 782 \( S=\frac{T}{BQ} \) and  \( U=\frac{V}{DBQ} \) are performed.
 783
 784 \begin{description}
 785 \item [Complexity:]~
 786 \end{description}
 787
 788 The bit complexity of computing  \( S \) and  \( U \) with  \( N \) bits of
 789 precision is \( O((\log N)^{2}M(N)) \).
 790
 791 \begin{description}
 792 \item [Proof:]~
 793 \end{description}
 794
 795 By our assumption that  \( a(n) \),  \( b(n) \),  \( c(n) \),  \( d(n) \),
 796 \( p(n) \),  \( q(n) \) are integers with  \( O(\log n) \) bits,
 797 the integers  \( P \),  \( Q \),  \( B \),  \( T \),  \( D \),  \( C \),
 798 \( V \) belonging to the interval  \( n_{1}\leq n<n_{2} \) all have
 799  \( O((n_{2}-n_{1})\log n_{2}) \) bits. The rest of the proof is as in the
 800 previous section.
 801
 802 \begin{description}
 803 \item [Checkpointing/Parallelising:]~
 804 \end{description}
 805
 806 A checkpoint can be easily done by storing the (integer) values of
 807 \( n_1 \),  \( n_2 \), \( P \),  \( Q \),  \( B \), \( T \) and additionally
 808 \( D \),  \( C \),  \( V \). Similarly, if \( m \) processors are available,
 809 then the interval \( [0,n_{max}] \) can be divided into \( m \) pieces of
 810 length \( l = \lfloor n_{max}/m \rfloor \). After each processor \( i \) has
 811 computed the sum of its interval \( [il,(i+1)l] \), the partial sums are
 812 combined to the final result using the rules described above.
 813
 814 \begin{description}
 815 \item [Implementation:]~
 816 \end{description}
 817
 818 The C++ implementation of the above algorithm is very similar
 819 to the previous one. The initialisation is done now by a structure
 820 {\tt abpqcd\_series} containing arrays {\tt a}, {\tt b}, {\tt p},
 821 {\tt q}, {\tt c} and {\tt d} of multiprecision integers. The values of
 822 the arrays at the index \( n \) correspond to the values of the functions
 823 \( a \), \( b \), \( p \), \( q \), \( c \) and \( d \) at the integer point
 824 \( n \). The (partial) results of the algorithm are stored in the
 825 {\tt abpqcd\_series\_result} structure, which now contains 3 new elements
 826 ({\tt C}, {\tt D} and {\tt V}).
 827
 828 \begin{verbatim}
 829 // abpqcd_series is initialised by user
 830 struct { bigint *a, *b, *p, *q, *c, *d;
 831        } abpqcd_series;
 832
 833 // abpqcd_series_result holds the partial results
 834 struct { bigint P, Q, B, T, C, D, V;
 835        } abpqcd_series_result;
 836
 837 void sum_abpqcd(abpqcd_series_result & r,
 838                 int n1, int n2,
 839                 const abpqcd_series & arg)
 840 {
 841   switch (n2 - n1)
 842   {
 843     case 0:
 844       error_handler("summation device",
 845             "sum_abpqcd:: n2-n1 should be > 0.");
 846       break;
 847
 848     case 1: // the result at the point n1
 849       r.P = arg.p[n1];
 850       r.Q = arg.q[n1];
 851       r.B = arg.b[n1];
 852       r.T = arg.a[n1] * arg.p[n1];
 853       r.D = arg.d[n1];
 854       r.C = arg.c[n1];
 855       r.V = arg.a[n1] * arg.c[n1] * arg.p[n1];
 856       break;
 857
 858     // cases 2, 3, 4 left out for simplicity
 859
 860     default: // general case
 861
 862       // the left and the right partial sum
 863       abpqcd_series_result L, R;
 864
 865       // find the middle of the interval
 866       int nm = (n1 + n2) / 2;
 867
 868       // sum left side
 869       sum_abpqcd(L, n1, nm, arg);
 870
 871       // sum right side
 872       sum_abpqcd(R, nm, n2, arg);
 873
 874       // put together
 875       r.P = L.P * R.P;
 876       r.Q = R.Q * L.Q;
 877       r.B = L.B * R.B;
 878       bigint tmp = L.B * L.P * R.T;
 879       r.T = R.B * R.Q * L.T + tmp;
 880       r.D = L.D * R.D;
 881       r.C = L.C * R.D + R.C * L.D;
 882       r.V = R.D * (R.B * R.Q * L.V + L.C * tmp)
 883             + L.D * L.B * L.P * R.V;
 884       break;
 885   }
 886 }
 887 \end{verbatim}
 888
 889 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 890 \subsection{Example: Euler's constant \( C \) \label{eulergamma}}
 891
 892 \begin{description}
 893 \item [Theorem:]~
 894 \end{description}
 895
 896 Let  \( f(x)=\sum ^{\infty }_{n=0}\frac{x^{n}}{n!^{2}} \) and
 897 \( g(x)=\sum ^{\infty }_{n=0}H_{n}\frac{x^{n}}{n!^{2}} \). Then for
 898 \( x\rightarrow \infty  \),
 899 \( \frac{g(x)}{f(x)}=\frac{1}{2}\log x+C+O\left( e^{-4\sqrt{x}}\right)  \).
 900
 901 \begin{description}
 902 \item [Proof:]~
 903 \end{description}
 904
 905 The Laplace method for asymptotic evaluation of exponentially growing
 906 sums and integrals yields
 907 \[
 908 f(x)=
 909 e^{2\sqrt{x}}x^{-\frac{1}{4}}\frac{1}{2\sqrt{\pi }}(1+O(x^{-\frac{1}{4}}))\]
 910  and
 911 \[
 912 g(x)=e^{2\sqrt{x}}x^{-\frac{1}{4}}\frac{1}{2\sqrt{\pi }}
 913 \left(\frac{1}{2}\log x+C+O(\log x\cdot x^{-\frac{1}{4}})\right)\]
 914 On the other hand,  \( h(x):=\frac{g(x)}{f(x)} \) satisfies the
 915 differential equation
 916 \[
 917 xf(x)\cdot h''(x)+(2xf'(x)+f(x))\cdot h'(x)=f'(x)\]
 918 hence
 919 \[
 920 h(x)=\frac{1}{2}\log x+C+c_{2}
 921 \int ^{\infty }_{x}\frac{1}{tf(t)^{2}}dt=\frac{1}{2}\log x+C+O(e^{-4\sqrt{x}})\]
 922
 923
 924 \begin{description}
 925 \item [Algorithm:]~
 926 \end{description}
 927
 928 To compute  \( C \) with a precision of  \( N \) bits, set
 929 \[ x=\left\lceil (N+2)\: \frac{\log 2}{4}\right\rceil ^{2} \]
 930 and evaluate the series for  \( g(x) \) and  \( f(x) \) simultaneously,
 931 using the binary-splitting algorithm,
 932 with  \( a(n)=1 \),  \( b(n)=1 \),  \( c(n)=1 \),  \( d(n)=n+1 \),
 933 \( p(n)=x \),  \( q(n)=(n+1)^{2} \). Let  \( \alpha =3.591121477\ldots  \)
 934 be the solution of the equation  \( -\alpha \log \alpha +\alpha +1=0 \). Then
 935  \( \alpha \sqrt{x}-\frac{1}{4\log \alpha }\log \sqrt{x}+O(1) \)
 936 terms of the series suffice for the relative error to be bounded
 937 by  \( 2^{-N} \).
 938
 939 \begin{description}
 940 \item [Complexity:]~
 941 \end{description}
 942
 943 The bit complexity of this algorithm is  \( O((\log N)^{2}M(N)) \).
 944
 945 \begin{description}
 946 \item [Note:]~
 947 \end{description}
 948
 949 This algorithm was first mentioned in \cite{80}. It is by far
 950 the fastest known algorithm for computing Euler's constant.
 951
 952 For Euler's constant there is no checkpointing possible because
 953 of the dependency on \( x \) in the binary splitting.
 954
 955
 956 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 957 \section{Computational results}
 958
 959 In this section we present some computational results of our CLN and
 960 {\sf LiDIA} implementation of the algorithms presented in this note. We use
 961 the official version (1.3) and an experimental version (1.4a) of {\sf LiDIA}.
 962 We have taken advantage of {\sf LiDIA}'s ability to replace its kernel
 963 (multiprecision arithmetic and memory management) \cite{95,97,97b}, so we were
 964 able to use in both cases CLN's fast integer arithmetic routines.
 965
 966 \subsection{Timings}
 967
 968 The table in Figure \ref{Fig1} shows the running times for the calculation of
 969 \( \exp(1) \), \( \log(2) \), \( \pi \), \( C \), \( G \) and \( \zeta(3) \)
 970 to precision 100, 1000, 10000 and 100000 decimal digits. The timings are given
 971 in seconds and they denote the {\em real} time needed, i.e. system and user
 972 time. The computation was done on an Intel Pentium with 133Hz and 32MB of RAM.
 973
 974 \begin{figure}[htb]
 975 \begin{center}
 976 \begin{tabular}{|l|l|l|l|l|l|l|}
 977 \hline
 978 D      &\( \exp(1) \)&\( \log(2) \)&\( \pi \)&\( C \)  &\( G \)&\( \zeta(3) \)\\
 979 \hline
 980 \hline
 981 \( 10^2 \) &0.0005   & 0.0020      &0.0014   & 0.0309  &0.0179 & 0.0027 \\
 982 \hline
 983 \( 10^3 \) &0.0069   & 0.0474      &0.0141   & 0.8110  &0.3580 & 0.0696 \\
 984 \hline
 985 \( 10^4 \) &0.2566   & 1.9100      &0.6750   & 33.190  &13.370 & 2.5600 \\
 986 \hline
 987 \( 10^5 \) &5.5549   & 45.640      &17.430   & 784.93  &340.33 & 72.970 \\
 988 \hline
 989 \end{tabular}
 990 \caption{{\sf LiDIA-1.4a} timings of computation of constants using
 991 binary-splitting}\label{Fig1}
 992 \end{center}
 993 \end{figure}
 994
 995 The second table (Figure \ref{Fig2}) summarizes the performance of
 996 \( exp(x) \) in various Computer Algebra systems\footnote{We do not list
 997 the timings of {\sf LiDIA-1.4a} since these are comparable to those of CLN.}.
 998 For a fair comparison of the algorithms, both argument and precision are
 999 chosen in such a way, that system--specific optimizations (BCD arithmetic
1000 in Maple, FFT multiplication in CLN, special exact argument handling in
1001 {\sf LiDIA}) do not work. We use \( x = -\sqrt{2} \) and precision
1002 \( 10^{(i/3)} \), with \( i \) running from \( 4 \) to \( 15 \).
1003
1004 \begin{figure}[htb]
1005 \begin{center}
1006 \begin{tabular}{|r|l|l|l|l|}
1007 \hline
1008 D           &  Maple & Pari & {\sf LiDIA-1.3} & CLN \\
1009 \hline
1010 \hline
1011 21          & 0.00090   & 0.00047 & 0.00191           & 0.00075 \\
1012 \hline
1013 46          & 0.00250   & 0.00065 & 0.00239           & 0.00109 \\
1014 \hline
1015 100         & 0.01000   & 0.00160 & 0.00389           & 0.00239 \\
1016 \hline
1017 215         & 0.03100   & 0.00530 & 0.00750           & 0.00690 \\
1018 \hline
1019 464         & 0.11000   & 0.02500 & 0.02050           & 0.02991 \\
1020 \hline
1021 1000        & 0.4000    & 0.2940  & 0.0704            & 0.0861 \\
1022 \hline
1023 2154        & 1.7190    & 0.8980  & 0.2990            & 0.2527 \\
1024 \hline
1025 4641        & 8.121     & 5.941   & 1.510             & 0.906 \\
1026 \hline
1027 10000       & 39.340    & 39.776  & 7.360             & 4.059 \\
1028 \hline
1029 21544       & 172.499   & 280.207 & 39.900            & 15.010 \\
1030 \hline
1031 46415       & 868.841   & 1972.184& 129.000           & 39.848 \\
1032 \hline
1033 100000      & 4873.829  & 21369.197& 437.000           & 106.990 \\
1034 \hline
1035 \end{tabular}
1036 \caption{Timings of computation of \( \exp(-\sqrt{2}) \)}\label{Fig2}
1037 \end{center}
1038 \end{figure}
1039
1040 MapleV R3 is the slowest system in this comparison. This is probably due to
1041 the BCD arithmetic it uses. However, Maple seems to have an asymptotically
1042 better algorithm for \( exp (x) \) for numbers having more than 10000 decimals.
1043 In this range it outperforms Pari-1.39.03, which is the fastest system in the
1044 0--200 decimals range.
1045
1046 The comparison indicating the strength of binary-splitting is between
1047 {\sf LiDIA-1.3} and CLN itself. Having the same kernel, the only
1048 difference is here that {\sf LiDIA-1.3} uses Brent's \( O(\sqrt{n}M(n)) \)
1049 for \( \exp(x) \), whereas CLN changes from Brent's method to a
1050 binary-splitting version for large numbers.
1051
1052 As expected in the range of 1000--100000 decimals CLN outperforms
1053 {\sf LiDIA-1.3} by far. The fact that {\sf LiDIA-1.2.1} is faster
1054 in the range of 200--1000 decimals (also in some trig. functions)
1055 is probably due to a better optimized \( O(\sqrt{n}M(n)) \) method
1056 for \( \exp(x) \).
1057
1058 \subsection {Distributed computing of \( \zeta (3) \)}
1059
1060 Using the method described in \ref{zeta} the authors were the first to
1061 compute 1,000,000 decimals of \( \zeta (3) \) \cite{96d}.
1062 The computation took 8 hours on a Hewlett Packard 9000/712 machine. After
1063 distributing on a cluster of 4 HP 9000/712 machines the same computation
1064 required only 2.5 hours. The half hour was necessary for reading the partial
1065 results from disk and for recombining them. Again, we have used binary-splitting
1066 for recombining: the 4 partial result produced 2 results which were combined
1067 to the final 1,000,000 decimals value of \( \zeta (3) \).
1068
1069 This example shows the importance of checkpointing. Even if a machine crashes
1070 through the calculation, the results of the other machines are still usable.
1071 Additionally, being able to parallelise the computation reduced the computing
1072 time dramatically.
1073
1074 \subsection{Euler's constant \( C \)}
1075
1076 We have implemented a version of Brent's and McMillan's algorithm \cite{80} and
1077 a version accelerated by binary-splitting as shown in \ref{eulergamma}.
1078
1079 The computation of \( C \) was done twice on a SPARC-Ultra machine
1080 with 167 MHz and 256 MB of RAM. The first computation using the non-acellerated
1081 version required 160 hours. The result of this computation was then verified
1082 by the binary splitting version in (only) 14 hours.
1083
1084 The first 475006 partial quotients of the continued fraction of \( C \)
1085 were computed on an Intel Pentium with 133 MHz and 32 MB of RAM in 3 hours
1086 using a programm by H. te Riele based on \cite{96e}, which was translated to
1087 {\sf LiDIA} for efficiency reasons. Computing the 475006th
1088 convergent produced the following improved theorem:
1089
1090 \medskip
1091
1092 \centerline{If \( C \) is a rational number, \(C=p/q\), then \( |q| > 10^{244663} \)}
1093
1094 \medskip
1095
1096 Details of this computation (including statistics on the partial
1097 quotients) can be found in \cite{98}.
1098
1099
1100 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1101 \section{Conclusions}
1102
1103 Although powerful, the binary splitting method has not been widely used.
1104 Especially, no information existed on the applicability of this method.
1105
1106 In this note we presented a generic binary-splitting summation device for
1107 evaluating two types of linearly convergent series. From this we derived simple
1108 and computationally efficient algorithms for the evaluation of elementary
1109 functions and constants. These algorithms work with {\em exact}
1110 objects, making them suitable for use within Computer Algebra systems.
1111
1112 We have shown that the practical performance of our algorithms is
1113 superior to current system implementations. In addition to existing methods,
1114 our algorithms provide the possibility of checkpointing and parallelising.
1115 These features can be useful for huge calculations, such as those done in
1116 analytic number theory research.
1117
1118 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1119 \section{Thanks}
1120
1121 The authors would like to thank J\"org Arndt, for pointing us to
1122 chapter 10 in \cite{87}. We would also like to thank Richard P. Brent for
1123 his comments and Hermann te Riele for providing us his program for the
1124 continued fraction computation of Euler's constant.
1125
1126 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1127 \bibliography{binsplit}
1128 \bibliographystyle{acm}
1129
1130 \end{document}