UNIVERSITY OF CALIFORNIA

Los Angeles

An FFT Extension of the Elliptic Curve Metho d of Factorization

A dissertation submitted in partial satisfaction of the

requirements for the degree Do ctor of Philosophy

in Mathematics

by

Peter Lawrence Montgomery

c

Copyright by

Peter Lawrence Montgomery

The dissertation of Peter Lawrence Montgomery is approved

Bryan C Ellickson

Milos D Ercegovac

Basil Gordon

Murray M Schacher

David G Cantor Committee Chair

University of California Los Angeles

ii

TABLE OF CONTENTS

Introduction

Summary of main results

Acronyms and notations

Elliptic Curve Metho d and its History

Lenstras original algorithm

Step Brent and Montgomery improvements

Weierstrass and Montgomery parameterizations

Fast Polynomial Arithmetic

Minimal time for p olynomial multiplication

Circular convolutions

FFT for p olynomial multiplication

Circular convolutions over ZN Z

Polynomial recipro cals and division

Constructing a monic p olynomial from its ro ots

Evaluating a p olynomial at many p oints

Polynomial GCDs over a eld

Complexity analysis of fast GCD algorithm

Connection with p olynomial resultants

Polynomial GCDs over ZN Z

Opp ortunities for optimization and parallelization

Application to ECM

Checking two lists for matches mo dulo p where pjN

Use of fast p olynomial evaluation

Construction of p olynomials

Selection and Generation of Multiples of Q

Use of k th p owers or Dickson p olynomials

Polynomial divisors of g X g Y

k k

Prime divisors of g X g Y

k k

Comparative computational costs

k k

Using p owers of and

Achieving parallelism iii

Separate arithmetic progressions

Use of arithmetic progressions and Dickson p olynomials

Evaluation of fm Q g where m is a p olynomial function

i x i

Implementation choices

Selection of Curve

Torsion subgroup of order and p ositive rank over Q

Torsion subgroup of order and p ositive rank over Q

Numerical comparison of torsion subgroups of orders and

Selection of Search Limits B d d

Dickmans function

Estimated success probability p er curve

Estimated time p er curve

Estimated optimal parameters

MultiplePrecision and Mo dular Arithmetic

Arithmetic mo dulo N

Doublelength multiplication

Arithmetic mo dulo small primes

Finding remainders mo dulo p

i

Reconstructing remainder mo dulo N

Vectorized carry propagation

Results

Timing

Performance on RSA Factoring Challenge

Additional ndings

Bibliography iv

LIST OF FIGURES

Group law on y x x

Algorithm for p olynomial recipro cals

Algorithm HALFGCD

Algorithm for p olynomial GCDs

Computing GX mo d F X in pieces of degree d

Dep endencies using two arithmetic progressions and doubling

Finite dierences of p olynomial function P X X

Dep endencies when up dating an upward diagonal

Dep endencies when up dating a downward diagonal

Dep endencies when using geometric progression

Pro cedure for multiplication mo dulo N

Straightforward multipleprecision addition and subtraction

Vectorized carry propagation v

LIST OF TABLES

Largest ECM factors found

Acronyms

Notations part I

Notations part I I

Notations part I I I

Trials needed to reach condence level

Some ways to ensure that curves group order is divisible by

Power of dividing group order

Power of dividing group order when torsion subgroup has order

Power of dividing group order when torsion subgroup has order

Estimated time p er curve milliseconds

Estimated optimal parameters

Times for p olynomial op erations on RS seconds

Comparative Alliant FX times with one and ve pro cessors

Some factors of digit partition numbers

How often factors were found by ECM with FFT

Actual and exp ected numbers of prime factors by size

Additional factors found vi

ACKNOWLEDGEMENTS

This work was sp onsored in part by US Army fellowship DAALG

Thanks to Unisys and UCLA for letting me use their facilities

during this work Thanks to Professors David Cantor Basil Gordon and Samuel

S Wagsta Jr for reviewing early drafts hereof Thanks to Professor Alfred W

Hales for b eing an alternate on my committee even though his services were not

needed vii

VITA

September Born San Francisco California

December Ranked among top ve participants in

William Lowell Putnam Mathematical Comp etition

June AB with honors Mathematics

University of California Berkeley

September MA Mathematics University of California Berkeley

Scientic and system programming at

System Development Corp oration SDC

Huntsville Alabama

Transferred to SDC later renamed Unisys

in Santa Monica California

Did simulations and formal verication

Graduate student in PhD program

Department of Mathematics

University of California Los Angeles

Teaching Asso ciate Summer

Coached UCLA Putnam team

Joined Department of Mathematics

Oregon State University Corvallis Oregon

PUBLICATIONS

John Brillhart Peter L Montgomery Rob ert D Silverman January

Tables of Fib onacci and Lucas factorizations Mathematics of Computation

SS

P Erdos R L Graham P Montgomery B L Rothschild J Sp encer E G

Straus May Euclidean Ramsey Theorems I Journal of Combinato

rial Theory Series A viii

Euclidean Ramsey Theorems II In Collo quia Mathematica So ci

etatis JanosBolyai Innite and Finite Sets vol I Edited by A Ha jnal

R Rado Vera T Sos NorthHolland AmsterdamLondon pp

Euclidean Ramsey Theorems III In Collo quia Mathematica So ci

etatis Janos Bolyai Innite and Finite Sets vol I Edited by A Ha jnal

R Rado Vera T Sos NorthHolland AmsterdamLondon pp

Ronald Evans Peter Montgomery May Problem American Math

ematical Monthly

Peter L Montgomery December Problem E American Mathematical

Monthly

December Evaluation of b o olean expressions on ones complement

machines SIGPLAN Notices

September Letter to the Editor SIGPLAN Notices

November Prop osal Mathematics Magazine

April Mo dular multiplication without trial division Mathematics of

Computation

January Sp eeding the Pollard and elliptic curve metho ds of factoriza

tion Mathematics of Computation

August Design of an FFT continuation to the ECM metho d of factor

ization Abstract AMS Abstracts

p

to app ear July New solutions of a mo d p Mathematics of

Computation

Peter L Montgomery Rob ert D Silverman April An FFT extension to

the P factoring algorithm Mathematics of Computation

ix

ABSTRACT OF THE DISSERTATION

An FFT Extension of the Elliptic Curve Metho d of Factorization

by

Peter Lawrence Montgomery

Do ctor of Philosophy in Mathematics

University of California Los Angeles

Professor David G Cantor Chair

Factorization of arbitrary integers is b elieved to b e a hard problem The Elliptic

Curve Metho d ECM discovered by Jr in is the b est

known metho d for nding to digit factors of a large integer N The ECM

algorithm has two main steps It computes a large multiple of an element P of an

elliptic curve group mo dulo N obtaining another element Q Step succeeds if we

strike the identity element of the group mo dulo p for some prime divisor p of N If

Step is unsuccessful then Step compares multiples of Q lo oking for a duplicate

mo dulo some pjN This thesis describ es how to apply convolutions mo dulo N and

fast p olynomial arithmetic algorithms in the search This eectively increases the

range of ECM by a factor of with ab out twice the combined Step Step

execution time previously required but more memory The revised algorithm

was tested by trying it on the RSA Factoring Challenge list

We discuss many architectural considerations relating to the implementation

such as the identifying which p ortions of the computation can b e vectorized or

parallelized We also discuss the algorithms for computer arithmetic

We give a detailed analysis of intermediate results of the fast p olynomial GCD

algorithm

We give a family of elliptic curves with torsion group of order and p ositive

rank over Q and compare the smo othness of their orders to the smo othness of

curves with torsion group of order x

CHAPTER

Introduction

An integer N is called comp osite if is a pro duct of two smaller p ositive

integers otherwise it is called prime The Fundamental Theorem of Arithmetic

p states that any p ositive integer has a unique factorization into primes except

for order But its pro of gives no clue ab out how to nd these prime factors

This problem has interested mathematicians for centuries In his Disquitiones

Arithmeticae for example Gauss p wrote

The problem of distinguishing prime numbers from comp osites and of

resolving comp osite numbers into their prime factors is one of the most

useful in all of arithmetic The dignity of science seems to demand that

every aid to the solution of such an elegant and celebrated problem b e

zealously cultivated

As Gauss observes there are two problems here checking whether a number

is prime and factoring a number into primes if it is not itself prime Fermats

Theorem states that if p is prime and a is an integer not divisible by p then

p

a mo d p An integer p which fails this test for some a cannot

b e prime and must therefore b e comp osite M O Rabin used a variation of

this observation p to give a probabilistic primality test whose running

time is p olynomial in log p Rabins algorithm is termed probabilistic b ecause

it may err wrongly pro claiming that a number is prime when that number is

really comp osite but never the reverse error the failure probability can b e made

arbitrarily small through rep eated execution of the algorithm with dierent random

a There has b een considerable recent progress in algorithms which rigorously prove

that a number p which passes Rabins test is truly prime Although none of the

latter algorithms run in p olynomial time the state of the art is only slightly worse

For example Morain p used elliptic curves to prove that

d e p

where is Eulers constant and p is an digit prime In April Morain

proved that the digit partition number p is prime using four ma

chine years of time on a network of SUN s

The second problem that of splitting an arbitrary comp osite integer N into

its prime factors is b elieved to b e much harder The problem was once primarily

of academic interest but has gained attention since the publication of a

publickey cryptosystem whose strength dep ends on the diculty of factorization

Three ma jor factorization algorithms were introduced b etween and

Quadratic Sieve Elliptic Curve Metho d Number Field Sieve

The Quadratic Sieve algorithm and the Number Field Sieve algorithm

take time dep endent on the size of the integer N b eing factored Each

algorithm nds many congruences x y mo d N where x and y are either

i i i i

squares or pro ducts of elements from a factor base After enough such congru

ences have b een found selected congruences are multiplied together to get another

congruence X Y mo d N in which all elements of the factor base o ccur to even

exp onents b oth in X and in Y this selection is done using linear algebra mo dulo

p p

Then X and Y are integers which can b e computed mo dulo N Any prime

p p p p

divisor of N must divide either gcd X Y N or gcd X Y N if either

GCD is nontrivial then one has a factor of N The two metho ds dier in how

the congruences x y are found The Quadratic Sieve algorithm has b een used

i i

for N as large as decimal digits using a worldwide network of computers

That network also factored the digit Fermat number using Number

Field Sieve

The time for Elliptic Curve Metho d ECM dep ends primarily on the size of

the prime factor p of N The ECM algorithm requires considerable arithmetic

mo dulo N if p is a prime divisor of N then ECM is really doing arithmetic

mo dulo p since there is a natural ring homomorphism from ZN Z to ZpZ

GFp Over the eld GFp the algorithm op erates in an ab elian group whose

order is p O p If this order is suciently smo oth ie if the group order

has no large prime divisors then ECM usually nds p When the algorithm is

unsuccessful it can b e rep eated using a presumably dierent group The group

order seems more likely to b e smo oth when the prime p and hence the order is

small in which case fewer trials are needed to nd p but see Section The

time for ECM also grows with the time to do arithmetic mo dulo N but this growth

is less severe

The Quadratic Sieve algorithm and improved computer hardware were ma jor

ingredients in causing the smallest comp osite cofactor in App endix C of a b o ok

n

listing known factors of b for selected b and n to leap from digits in its

edition to digits in its edition Between those years the ECM algorithm

found a thousand previously unknown prime factors for these tables Even when

ECM found only some factors of a number the cofactor was sometimes suciently

small for Quadratic Sieve to complete the work quickly

As of April seven calendar years and hundreds of machine years after

ECMs introduction three digit factors one digit factor and one digit

factor had b een found these app ear in Table Here F denotes the nth

n

Fib onacci number Lenstra found his digit factor of an digit cofactor of

the partition number p using ECM on a MasPar with pro cessors and

allo cating pro cessors p er curve Rusin found the digit factor of the Cun

ningham number on a SPARC using the program in with limits

B and B

Factor Of Discoverers

Arjen K Lenstra

Mark S Manasse

F Rob ert D Silverman

F Peter L Montgomery

p Arjen K Lenstra

Dave Rusin

Table Largest ECM factors found

For a history of factorization and primality testing algorithms see and

Summary of main results

This work implements the algorithm summarized in extending the work

of With optimal parameters the metho d describ ed in this thesis b eats its

predecessor when searching for prime factors over digits The ma jor factors

found during this work app ear in Tables and The largest factor found

was a digit factor of the partition number p A digit factor and a

digit factor were found during Step but no other factor over digits was

found during six months of runs primarily on a DEC

Acronyms and notations

Table lists acronyms used in this rep ort Tables and

list some notations used Nonstandard acronyms and notations are dened where

rst used

The notation a b mo d c means that a b is divisible by c The notation

b mo d c without parentheses denotes the nonnegative remainder up on dividing

the integer b by the p ositive integer c it satises

b mo d c b cbbcc and b mo d c c

For p olynomials B X mo d C X denotes the remainder up on dividing B X by

C X it satises

B X

B X mo d C X B X C X

C X

deg B X mo d C X deg C X

AIX Op erating system on IBM RS

AMS American Mathematical So ciety

DEC Digital Equipment Corp oration computer manufacturer

DP Desired prop erty Chapter

ECM Elliptic Curve Metho d

ECMFFT Elliptic Curve Metho d with Fast Fourier Transform extension

FFT Fast Fourier Transform

GCD Greatest common divisor

HALFGCD Polynomial halfGCD algorithm Figure

HG HALFGCD input or output prop erty Section

IBM International Business Machines computer manufacturer

LIFO Last in first out

MIMD Multiple instruction multiple data parallel architecture

MIPS Computer manufacturer

MODMULN Mo dular multiplication algorithm Figure

NA Not applicable

RECIP Polynomial recipro cal Section

POLYEVAL Polynomial evaluation algorithm Section

POLYGCD Polynomial GCD algorithm Figure

RSA Rivest Shamir and Adleman publickey cryptosystem

SIMD Single instruction multiple data parallel architecture

UCLA University of California Los Angeles

Table Acronyms

A Co ecient of X Z in homogeneous form of elliptic curve

B Co ecient of Y Z in homogeneous form of elliptic curve

B Upp er b ound for prime p owers during Step

C Field of complex numbers

Cost Cost p er elliptic curve see

d d Degree of F X resp GX in Section

dn Number of divisors of n including and n

E Reduction of elliptic curve E mo dulo a prime p

p

EORx y Bitwise exclusive OR two or more arguments

F X Polynomial mo dulo which computations are done in Section

F nth Fib onacci number F F F F F if n

n n n n

GX Latest p olynomial remainder in Section

g X Dickson p olynomial Section

k

gcd m n Greatest common divisor of m and n

k

GFp Field of p elements p a prime k usually k

k k

GFp Nonzero elements of GFp

H X Polynomial multiplied by GX mo dulo F X in Section

I n n identity matrix

n

p

ln x ln ln x dened for x Lx exp

M n Time to multiply two p olynomials each of degree at most n

in a xed ring R Chapter

m Multiples of Q used to compute xco ordinates for F in Section

i

Table Notations part I

N The integer b eing factored or mo dulo which arithmetic is done

n Multiples of Q used to compute xco ordinates for H in Section

j

m n zero matrix

mn

O Point at innity on elliptic curve group identity element

O f n Any function g n such that g nf n is b ounded as n

of n Any function g n such that lim g nf n

n

P X Polynomial used to select multiples of Q in Chapter

Pr Success probability p er curve see

succ

pn Number of unordered partitions of an integer n

Q Field of rational numbers

Q Point on elliptic curve group as output from Step

m Q xco ordinate of m Q in elliptic curve group

x

Q Reduction of Q mo dulo a prime p

p

q Order of Q mo dulo some prime p

q Upp er b ound for q in Chapter

max

R Commutative ring within Chapter

ResF G Resultant of two p olynomials see

T F m n Toeplitz matrix from co ecients of p olynomial F

mnk

t Time for addition subtraction or negation in ring R

add

t Time for nding a multiplicative inverse when it exists in ring R

inv

t Time for multiplication in ring R

mul

Z Ring of integers

ZN Z Ring of integers mo dulo N

Table Notations part II

p

8

An th ro ot of mo dulo q

if n has rep eated factors

n Mobiusfunction

d

if n has d distinct prime factors

n Eulers totient function

Q

nd

d d

X Y X Y nth cyclotomic p olynomial in two variables

djn n

Dickman function see Section

P

s

n s Zeta function

n

bxc Largest integer n such that n x x real

dxe Least integer n such that n x x real

F X

Polynomial quotient remainder discarded

GX

F G p olynomials in X

x y

y x Both mean x is much less than y

x y

x y Both mean x is approximately y is stronger than

xjy x divides y x y integers

xy x do es not divide y x y integers

Table Notations part III

CHAPTER

Elliptic Curve Metho d and its History

An elliptic curve E over a eld K is dened by a cubic equation

Y a XY a Y b X b X b X b

with co ecients in K where b and where the discriminant do es not vanish

ie the right side of must not have rep eated ro ots after completing the

square in Y on the left We assume henceforth that charK Then the

linear change of variable

y b Y a X a x b X b a

converts to the Weierstrass form

y x ax b

where a b K and a b The curve consists of all p oints x y satisfying

together with a point at innity denoted by O

The p oints on an elliptic curve form an ab elian group with identity element O

if we dene the group law suitably pp Supp ose E is given by

The negative of O is O O the negative of any other p oint P x y on E

is its reection with resp ect to the xaxis P x y The sum P P of

two p oints P P E is dened by pp

i If P O then P P P

ii If P O then P P P

iii If P P then P P O

iv If none of i ii iii holds then P P x y where

y y

if x x

x x

x a

if x x

y

x x x

y x x y

The parameter in is chosen so that the line y x x y passes

through P and P if P P and is tangent to E at P if P P This line

intersects the curve at a third p oint x y P which may equal P or P

The group law is dened so that if a straight line passes through three p oints of

E then those three p oints sum to O Figure illustrates the group law by

computing P Q in two ways for P and Q on the curve

y x x

When K GFp is the eld of p elements the group E has nite order Hasse

p proved that the order of this group satises

p

p p jE j

Hence the group order is numerically very close to p The actual group order varies

with the choice of curve

Lenstras original algorithm

Lenstra announced the ECM algorithm in February and published it in

It is mo deled after Pollards P algorithm pp With prop er

choice of parameters the exp ected amount of arithmetic mo dulo N required to

o

nd a prime factor p of N by ECM is Lp p where

q

ln p ln ln p Lp exp

When p N this complexity matches that of the Quadratic Sieve algorithm

mentioned in Chapter but with higher constants For p O N where

the ECM algorithm is asymptotically faster than Quadratic Sieve as N

Lenstra pp generalizes the denition of elliptic curve to work over a

ring ZN Zwith gcd N as well as over a eld He denes a pseudoaddition

r

P

Q

r

P

r r r

P Q P Q

A

A

A

A

A

A

r r r

P P Q P Q

A

A

A

r

A

Q

A

A

A

A

A

A

A

A

A

A

A

A

A

A

r A

P

P

P Q

P Q

P P Q

Figure Group law on y x x

on these curves by emulating the algebraic rules for adding two p oints over a eld

except when might encounter a nonzero noninvertible denominator Any

such denominator must b e a zero divisor mo dulo N which is fortuitous rather

than disastrous since the greatest common divisor algorithm will reveal a non

trivial factor of N From this pseudoaddition Lenstra denes a multiplication

which is algebraically equivalent to ordinary group multiplication unless a zero

divisor of N is encountered

Lenstra selects a random curve E over ZN Z with known initial p oint P

Compute Q R P where R is a p ositive integer divisible by all prime p owers

b elow some b ound B This computation can fail only if some denominator is a

zero divisor in which case a factor is found For each prime factor p of N let

E denote the reduction of E mo dulo p see Chapter for denition If

p

any order jE j divides R then the reduction Q of Q in E is the identity

p p p

by Lagranges Theorem ab out subgroup orders When the reduction of Q is the

identity in E for some but not all primes p dividing N then Qs denominator

p

is divisible by some but not all prime factors of N triggering the aforementioned

failure

Step Brent and Montgomery improvements

Lenstras algorithm multiplies an initial p oint P on a curve E by an integer R

obtaining Q R P a pro cess called Step Step fails if no zero divisor is

encountered It succeeds if the group order jE j divides R for some prime pjN

p

Supp ose that jE j R but jE j j Rq for some prime q with q not to o large

p p

and pjN Then

q Q q R P q R P O

in E So the reduction Q has order dividing q and must either b e the identity

p p

or have order q

Both Brent and Montgomery observed early that Lenstras algorithm

can b e mo died to take advantage of such o ccurrences Each suggests computing

several multiples n Q for selected n If some n n mo d q then the p oints

i i i j

n Q and n Q agree up on reduction mo dulo p unless n n these are unlikely to

i j i j

match mo dulo other primes dividing N so p can b e found by comparing n Q and

i

n Q The match can b e done by testing the dierence of their xco ordinates then

j

the match also succeeds if q jn n This entire pro cess is nicknamed Step

i j

Brent and Montgomery dier on how to select the n Montgomery suggests

i

comparing multiples of Q along an arithmetic progression to a few xed multiples

of Q Brent suggests the use of semirandom multiples This question is discussed

in detail in Chapter

Weierstrass and Montgomery parameterizations

When P P uses one inversion and two multiplications one a squar

ing it uses another squaring if P P The inversion may b e avoided by using

homogeneous co ordinates in which each co ordinate is represented as a quotient

of two elements of K and using rational arithmetic in Straightforward

attempts whether using a common denominator or separate denominators for x

and y use ab out multiplications in place of the inversion in

Montgomery pp suggested an alternate parameterization which

allows the use of homogeneous co ordinates ie no inversions while keeping the

number of multiplications small To motivate it consider the ane curve

y b x b x b x b b

If P x y and P x y are two p oints on with x x with sum

P x y and dierence P x y then

y y

b

x x

x x x

b

b x x x x b x x b x x b y y

b x x

y y

b

x x

x x x

b

b x x x x b x x b x x b y y

b x x

A straightforward calculation gives

b x x b b b x b x b

x x

b x x

If we require b b and b then this simplies to

x x

x x

x x

This allows one to compute the xco ordinate x of P P from those of P P

and P P with a few multiplications inversions and squarings

By itself costs more that but imp oses little overhead

when switching to homogeneous co ordinates Up on setting b b b B

b AB in and putting y Y Z x XZ we obtain

BY Z X X AX Z Z

This is an elliptic curve if and only if B and A

Equation is said to b e in homogeneous or projective form b ecause it is

invariant up on replacing X Y Z by k X k Y k Z for any k Its co ordinates

are often written X Y Z rather than X Y Z The group identity element

is O The negative of P X Y Z is P X Y Z

If P X Y Z and P X Y Z are two p oints on with

distinct XZ ratios hence not equal or negatives of each other and with dierence

P P X Y Z then their sum P P X Y Z satises

Z X X Z Z X

Z X X Z Z X

Z X Z X Z X Z X Z

X Z X Z X Z X Z X

by Montgomery also gives a doubling rule if P X Y Z then

P X Y Z is given by

Z X X

AX Z Z Z X Z X

X Z X Z

X Z X Z A X Z

where X Z X Z X Z

He prop oses computing only the X Z ratios By and one can

add two p oints on with six multiplications if their dierence is known and

one can double a p oint with ve multiplications if A is known He computes

large multiples of a p oint using the metho ds in

Not every elliptic curve of the form can b e linearly transformed to form

unless the base eld K is algebraically closed or has other nice prop erties

But ECM lets its user choose the curve and there is no prohibition against selecting

one of form The curve selection strategy is discussed in detail in Chapter

CHAPTER

Fast Polynomial Arithmetic

Our algorithms will require fast p olynomial arithmetic over the ring ZN Z

including multiplication remaindering and greatest common divisor This chapter

summarizes some of the fast algorithms found in the literature emphasizing how

they apply to this ring It includes a detailed analysis of intermediate results of

the fast GCD algorithm

All rings in this chapter are assumed commutative Polynomials are univariate

unless otherwise indicated

Minimal time for p olynomial multiplication

Let M n b e the time required to multiply two p olynomials F X and GX

of degrees at most n over a given ring R ie to compute the co ecients of the

pro duct given the co ecients of the inputs We assume that a ring addition or

subtraction can b e computed in time t a ring multiplication in time t and

add mul

a multiplicative inversion when it exists in time t These times are assumed

inv

constant for any given ring R but may dep end on R

Since any multiplication algorithm must read all n input co ecients we have

the trivial lower b ound M n O n The straightforward algorithm uses n

multiplications and n additions in the ring giving the upp er b ound

M n n t n t

mul add

Karatsuba pp pp observed that we can do b etter by

divide and conquer Supp ose that n m is even Write

m m

F X F X F X X and GX G X G X X

where deg F deg F degG deg G m Then

m m

FG F F X G G X

m m

F G F G F G X F G X

m m

X F G X F G F F G G F G F G

The p olynomials F G F G F F and G G all have degrees less than

m n This technique gives

M n M m m m t M n n t

add add

Together with M t the solution of this recurrence is

mul

n n n n

M t t n

mul add

so

log

2

M n O n t t O n t t

mul add mul add

If the original b ound n on the degrees is not a p ower of then we can increase

the b ound and still achieve This metho d works over any ring

We will assume in subsequent analyses that the M n function satises

aM n M an a M n

for a p

Circular convolutions

We start by dening a circular convolution of two vectors called cyclic convo

lution in p and positive wrapped convolution in p

T T

Denition Let f f f f f and g g g g g

n n

be two nvectors over a ring R Their circular convolution of length n f g is

dened to be the nvector h h h h h where

n

X

h f g k n

k i j

ij k mo d n

Circular convolutions of length n are really p olynomial multiplications mo d

n

ulo X If f g and h are as in Denition and if

n n n

X X X

i i i

h X g X H X f X GX F X

i i i

i i i

n

then h f g is equivalent to H X F X GX mo d X For example

if h f g then

n n

X X X

k k

H X h X f g X

k i j

k k

ij k mo d n

n n n

X X X X

ij ij

f g X f g X

i j i j

j i

k

ij k mo d n

n

F X GX mo d X

If we can p erform fast circular convolutions then we have an algorithm for

fast p olynomial multiplication Given two p olynomials F and G choose n

deg F deg G Use a circular convolution of length n after padding with

n

leading zeros to form F X GX mo d X this must equal F X GX

since deg FG n We can also use n deg F degG if we compute the

leading or constant co ecient of the pro duct directly

FFT for p olynomial multiplication

The Fast Fourier Transform FFT algorithm pp is the basis of the

fastest known p olynomial multiplication algorithms If the base ring R satises cer

tain conditions then the FFT algorithm executes a circular convolution of length n

using O n log n op erations additions and multiplications in the base ring This

is asymptotically b etter than the O n b ound for M n from Section

Sp ecically when the length n is a p ower of and n the FFT algorithm

requires i that have a multiplicative inverse in R and ii that there exist a

n

known or easily computable element such that When R C is

in

the eld of complex numbers these are satised for e The conditions are

also satised when R GFp is the eld of p elements if p mo d n and is

the p nth p ower of a primitive element of the eld

P P

n n

i i

Given two p olynomials F X f X and GX g X of degree

i i

i i

at most n the FFT algorithm constructs the nvectors

T

n

f F F F F

T

n

g G G G G

T T

from the co ecient vectors f f f f and g g g g

n n

These vectors f and g are the discrete Fourier transforms DFT of f and g A

straightforward computation p shows that the p ointwise pro duct

T

n n

F G F G F G F G

of f and g is the DFT of the circular convolution h f g From h an inverse

discrete Fourier transform ie an interpolation pro duces h and hence the co e

n

cients of H X F X GX mo d X The b eauty of the FFT is that the

evaluation and interpolation stages take only

n log n n t n log nt

mul add

plus an extra n multiplications by n during the interpolation This assumes

that all ro ots of unity are known and do es not count trivial multiplies by

Adding the time for yields a b ound of

n log n n t n log nt

mul add

n

for multiplication mo dulo X Replacing n by n gives

M n n log nt t

mul add

if the ring satises the requirements and n is a p ower of See pp for

details

Circular convolutions over ZN Z

We would like to apply the metho ds of Section to the ring R ZN Z where

N is the integer we are attempting to factor The FFT has two requirements i

must have a multiplicative inverse in R and ii there must b e a known element

n

with Requirement i is satised since our N will b e o dd Requirement

ii will usually not b e satised

Montgomery and Silverman pp p erform circular convolutions

mo dulo N by executing several such convolutions mo dulo small primes and using

the Chinese Remainder Theorem to get a result mo dulo N The small primes are

selected so that the requirements in Section are satised

Let n b e a p ower of Supp ose that we want a p olynomial pro duct F X GX

n

mo d N X where deg F n and deg G n and where all co ecients

K

of F and G are in the interval N Select distinct primes fp g such that

i

i

each p mo d n and

i

K

Y

p nN P

i

i

Here dep ends on the precision of oating p oint arithmetic Perform a single

precision circular convolution mo dulo each p using the metho ds of Section

i

after reducing the co ecients of F and G mo dulo p The next step is to construct

i

n

X

j n

h X mo d X F X GX

j

j

over Z from the known pro ducts

n

X

j n

h X mo d p X F X GX

ij i

j

Since h h mo d p the Chinese Remainder Theorem shows that

j ij i

K

X

P

h y mo d P

j ij

p

i

i

where

y P p h mo d p

ij i ij i

The b ounds h nN P lead to the explicit formula equation

j

K K

X X

y P

ij

y P h

ij j

p p

i i

i i

We can get h mo d N directly from A more computationally convenient

j

formula for h mo d N is

j

K K

X X

P y

ij

h mo d N y P mo d N mo d N

j ij

p p

i i

i i

The co ecients P p mo d p in can b e precomputed and stored in

i i

tables So can P p mo d N and the th through K st multiples of

i

K

P mo d N in With these tables computation of an h from fh g

j ij

i

requires at most

i K mo dular multiplications to get the y with y p

ij ij i

K

X

y

ij

ii O K oating p oint op erations to evaluate

p

i

i

iii K multiplications of y by residues in N

ij

iv K additions of numbers in N

h i

P

K

v One reduction mo dulo N of a integer in N K p

i

i

The cost is O K for i and ii O K log N for iii and iv and O log N log K

for v Since there are n dierent co ecients h the cost of the Chinese Remainder

j

Theorem is O nK log N Reducing the n input co ecients of F X and GX

mo dulo the K primes p also takes time O nK log N A circular convolution of

i

length n mo dulo p takes O n log n op erations in GFp each of assumed cost

i i

O Hence the total time for a circular convolution mo dulo N is

O nK log N O K n log n O nK log nN

Since K O log nN by this time b ound is

O n log nN log nN O nlog n log N

David Cantor mentioned another algorithm for p olynomial pro ducts mo dulo N

P

n

i

due to David Robbins Supp ose we are given F X f X and GX

i

i

P

n

i

g X where f g N Select a radix R and write

i i i

i

X X

j j

g R f R g f

ij ij i i

j j

The co ecients of F X GX and hence those of F X GX mo d N can b e found

from those of the p olynomial pro duct

n n

X X X X

i j i j

A A

g X Y f X Y

ij ij

j j i i

up on substituting Y R This p olynomial pro duct in can b e found using

metho ds for univariate p olynomial multiplication after setting Y X This

converts the original problem from a convolution of length n with co ecients at

most N to a convolution of length ab out d n log N with co ecients

R

at most R The co ecients of the latter pro duct are b ounded by dR

The latter convolution can b e carried out by mo dular arithmetic it can also b e

done by oating p oint arithmetic if R is suciently small This allows the use of

commercial FFT routines which are often optimized for a particular architecture

For example when bit hardware oating p oint arithmetic with mantissa circa

bits b ecomes available then R should b e feasible even if n degree

and digit integers

The estimated time for Robbinss algorithm is

O d log d log dR

O n log N log n log log N

if R is kept xed and we approximate the time for oating p oint arithmetic linearly

in the number of bits of precision required This is as go o d as except

p ossibly for the constant factor as n with N xed but b etter as N

with n xed Either algorithm might b e b etter in practice

Polynomial recipro cals and division

Algorithms for fast p olynomial division use Newtons metho d to compute a

recipro cal followed by a multiplication to compute the desired quotient The

remainder if desired can b e found after another p olynomial multiplication and

subtraction

If F X is a p olynomial of degree n whose leading co ecient is invertible

then its reciprocal p is dened to b e the p olynomial quotient

n

X

RECIPF

F X

n

Observe that F X X is a nite Laurent series with invertible constant term

n n

and RECIPF X has precisely the terms through X in the Laurent series

n

for The rst term of the latter recipro cal can b e found directly F X X

we can then apply Newtons metho d at any time to double the accuracy See

p When n is a p ower of and F has degree n the algorithm in Figure

gives all n co ecients of RECIPF

P

n

j

pro cedure RECIP f X

j

j

Cmt Assume that n is a p ower of and f is invertible

n

R X f

n

e f f

n n

for k n do

P P

k k

k j j

f X h X R X let

nj j k

j j

P

k

j

k

h X R X R X X

j k k k

j

h f f f Use h if k e e

k n nk n k k

k

end for

return XR X e f

n n n

end RECIP

Figure Algorithm for p olynomial recipro cals

It is straightforward to check that R X has degree k It is more tedious

k

to verify the inductive assertion

nk

n deg XR X e f F X X

k k n

for k n One pro of denes R X XR X e f the algorithm

k k n

k

in Figure is equivalent to

f X

n

R X

f f

n

n

k j

nk

F X X R X

k

k

R X X R X k

k k

k

X

When deg F X is not a p ower of it can b e scaled up to the next p ower

of by multiplying by a p ower of X and adjusting the recipro cal accordingly

The computations of e in Figure can then b e suppressed since the constant

k

co ecient of the output is not used in this case

To compute a quotient bGX F X c where deg GX n and where

F X has degree n with invertible leading co ecient one can use the identity

k j

n

RECIPF X GX X

GX

n

F X X

To prove let H X b e its right side We must show that deg GX

F X H X n Each summand on the right side of

GX

n n n

GX F X H X X GX X X

n

X

GX

n

F X RECIPF X X

n

X

GX

n

F X H X X RECIPF X

n

X

is a pro duct of p olynomials of degrees at most n and n so its left side has

degree at most n as desired

Assuming the times for p olynomial recipro cal and division are b ounded

by t O M n t t log n p

inv mul add

Constructing a monic p olynomial from its ro ots

Let R b e a ring and n b e a p ower of Given a R for i n we can

i

compute the co ecients of

n

Y

X a F X

i

i

in time

M n nt log n nt

add add

by rep eatedly invoking the pro cedure for multiplication

The idea is to multiply two factors of equal degree at each stage Dene

d

Y

X a F X

ij id

j

whenever djn and i is a multiple of d with i n d We are given the a and

i

can compute the F X X a with n negations Construct

i i

F X F X F X i d d n d

id id idd

from fF X g for i d d n d by pairwise multiplying nd pairs of

id

monic p olynomials each of degree d Rep eat this pro cedure log n times to get

F X F X

n

We can multiply two monic p olynomials of degree d in time M d dt by

add

dropping the leading co ecients b efore doing the multiply Doing this nd times

as while replacing d by d costs ndM d nt M n nt by

add add

Rep eating this log d times for d n gives the b ound

This algorithm needs temp orary storage for at most O n ring elements since

we can store the n co ecients of fF X g excluding the leading s atop the n

id

co ecients of fF X g

id

The b ound O M n nt log n applies even if n is not a p ower of since

add

we can app end some zeros to fa g b eforehand and divide by a p ower of X at the

i

end

Evaluating a p olynomial at many p oints

Let GX b e a p olynomial of degree at most n over a ring R where n is a

p ower of Given a R for i n we claim that we can evaluate all Ga in

i i

total time

M n nt log n nt O M n

add add

using fast p olynomial techniques pp We call the resulting algorithm

POLYEVAL

Q

n

First we form the fF X g in obtaining F X X a Invert

id i

i

F X to get RECIPF X RECIPF X as in Section We are given

n

GX mo d F X since we assume deg GX n

n

POLYEVAL pro ceeds recursively in reverse order to that used during the con

struction of F X Given GX mo d F X and RECIPF X compute

id id

j k

d

F X RECIPF X X

idd id

RECIPF X

id

d

X

and RECIPF X similarly as justied by Next compute

idd

GX mo d F X GX mo d F X mo d F X

id id id

GX mo d F X GX mo d F X mo d F X

idd id idd

using the recipro cals from and the metho ds of

Each step of this backwards recursion uses six pro ducts of p olynomials which

either are monic of degree d or have degree at most d for a combined time of

M d An extra d additions are used p er recipro cal in and p er remainder

d

in to multiply by the X terms There are also d subtractions required

p er remainder Therefore the time for and is b ounded by M d

dt These op erations are rep eated nd times when replacing d by d so the

add

net time is b ounded by ndM d nt M n nt Summing this

add add

over all log n levels of the recursion and adding the time for constructing F X

and its recipro cal we get the b ound

d d

If p olynomial multiplication is done mo dulo X or X using an FFT

algorithm then there are many rep eated op erations in this calculation due to mul

tiplying two p olynomials by the same p olynomial While using to replace d

by d and later using and to replace d by d it suces to compute

six forward transforms of length d for the p olynomials

F X F X

id idd

GX mo d F X RECIPF X

id id

d d

X X

RECIPF X RECIPF X

id idd

and four forward transforms of length d for the p olynomials

F X F X

id idd

GX mo d F X GX mo d F X

id id

F X F X

id idd

Since the remainders in have degree at most d it suces to compute

d

them mo dulo X The transforms of F X and F X of length d can

id idd

b e obtained from the corresp onding transforms of length d by extracting every

second element since the transform evaluates the p olynomial at the dth ro ots of

unity which are a subset of the dth ro ots of unity We also need ve p ointwise

pro ducts and reverse transforms ie interpolations of length d to get

F X

id

RECIPF X RECIPF X

id idd

GX mo d F X GX mo d F X

id id

F X F X

id idd

and two of these of length d for computing the nal remainders If we

approximate the cost of a forward transform or of a p ointwise pro duct and inter

p olation of length d as as M d then the leading co ecient of our estimated

time is

M dd M d M d

allowing us to replace the M n log n in by M n log n

The temp orary storage requirements of this algorithm are O n log n ring el

ements for storing all co ecients of fF X g That is intermediate storage re

id

quirements are O log n times the input size Since the F X p olynomials are

id

reused only once these p olynomials or their forward transforms if using an FFT

can b e saved in external storage rather than main memory if the storage system

supp orts LIFO access

If the a have a sp ecial pattern then we may b e able to evaluate all Ga in

i i

less time than If the a form a geometric progression then all Ga can

i i

b e found using one convolution of length n and O n extra multiplications

exercise p If instead the a form an arithmetic progression

i

then we can do it with deg G additions p er Ga after suitable initialization cf

i

Section

Polynomial GCDs over a eld

The standard Euclidean algorithm for p olynomial greatest common divisors

GCDs pp takes O n op erations when applied to two p olynomials

of degree at most n Mo enck found an asymptotically faster algorithm using

fast multiplication and division algorithms

To simplify the presentation this section assumes that the base ring R is a

eld Section presents the changes required when R ZN Z

Denition Let M be a matrix of polynomials over a eld Then the degree

of M written deg M is the maximum of the degrees of the entries of M

Denition A matrix M of polynomials over a eld is lopsided of

degree n if n and M I or if n and

B C

m m

B C

M B C

A

m m

where

deg m n deg m n deg m n deg m n

For purposes of this denition the zero polynomial has degree

Lemma If M is lopsided of degree n and M is lopsided of degree n then

M M is lopsided of degree n n

Proof Straightforward Since all computations are over a eld the pro duct of

two p olynomials of degrees n and n has degree n n 

Figure describ es recursive pro cedure HALFGCD which is an optimized

version of pro cedure HGCD in p HALFGCD has two p olynomial inputs

U and V and an integer input d telling how much to reduce one of the degrees

red

b efore exiting it has two p olynomial outputs U and V and a matrix

out out

output M HALFGCD has two input requirements

out

HG U and V are p olynomials in X over a eld R with deg U deg V

HG d is a p ositive integer with d ddeg U e

red red

Theorem up coming shows that HALFGCDs outputs satisfy

HG M is a lopsided matrix of degree deg U deg U and determinant

out out

B C C B

U U

out

B C C B

HG B C C M B

out

A A

V V

out

pro cedure HALFGCDU V d M U V

red out out out

Cmt Given U V d satisfying HG and HG

red

Cmt nds M U V satisfying HG HG HG

out out out

if deg V deg U d then Cmt Degree already small enough

red

U U

out

M and

out

V V

out

else Cmt Make two recursive calls to HALFGCD

Cmt Cho ose n with n deg U d as large as convenient

red

0 0

Cmt Cho ose d with d d as close to dd e as convenient

red red

U V

0 0 0 0

HALFGCD d M U V

n n

X X

0 0 0 0

Cmt deg V deg U n d deg U deg U n deg M

00 0

d deg V deg U n d

red

00

if d then Cmt One recursive call is enough

0

n

U U mo d X U

out

0 n

M M and X M

out out

0

n

V V mo d X V

out

else

0

U

0 0 0 0 0 0

and W U QV Cmt W U mo d V Q

0

V

0 0 00 00 00 00

HALFGCD V W d M V W

00 0

M M M

out

Q

00

n

U V U mo d X U

out

n

X M M

out out

00

n

V W V mo d X V

out

0 0

Cmt deg V deg U d deg U deg V n deg M

out red out

end if

end if

end HALFGCD

Figure Algorithm HALFGCD

HG deg V deg U d deg U

out red out

A corollary of HG and HG is deg M d this is useful when

out red

deciding how much storage to allo cate for the matrix A consequence of HG

and HG is gcdU V gcdU V

out out

Algorithm POLYGCD of Figure computes an arbitrary p olynomial GCD

by rep eatedly invoking HALFGCD Each iteration of POLYGCDs main lo op calls

HALFGCD with a pair fU V g such that gcdU V gcdU V where U

and V are POLYGCDs original inputs The output of HALFGCD is another pair

fU V g with gcdU V gcdU V and where

deg V deg U ddegU e bdeg U c

Unless V b oth the new U V and the new V U mo d V have degrees

at most half the degree of the old U so maxdegU deg V drops quickly

pro cedure POLYGCDU V

Cmt Assume that deg U deg V

U U V V

while V do

Cmt By induction deg U deg V

HALFGCDU V ddeg U e M U V

out

if V then

U U V V

else

U V V U mo d V

end if

end while

return U X normalized to a monic p olynomial

Figure Algorithm for p olynomial GCDs

Theorem Procedure HALFGCD always terminates after being invoked with

inputs satisfying HG and HG Its outputs M U and V satisfy HG

out out out

HG HG

Proof long We pro ceed by induction on d

red

The simplest case is when deg V deg U d Then M I is

red out

the identity matrix while U U and V V Since deg M and

out out out

detM conditions HG to HG are trivially satised

out

Otherwise the algorithm selects two integers n and d such that

n deg U d and d d

red red

The range for n is nonempty since d deg U by HG The range for

red

d is nonempty since failure of the if means that d deg U degV which is

red

p ositive by HG A corollary of is

n d degU

n

Next the pro cedure calls itself with U replaced by bU X c with V replaced by

n

bV X c and with d replaced by d Requirement HG is satised since

red

V U

deg U n deg V n deg deg

n n

X X

by HG Requirement HG is also satised since

k j

n

deg U X

d deg U n

d

by Since d d by we may assume by induction that the outputs

red

from the rst recursive call satisfy

degM deg U n deg U

detM

n

C B C B

bU X c U

C B C B

C B C M B

A A

n

bV X c V

deg V deg U n d deg U

with M lopsided

The algorithm denes d by

d deg V degU n d

red

This satises

d d d d

red red

by and If the second if in HALFGCD succeeds ie if d

then the pro cedure exits with M M which has determinant by

out

It returns

n

C B C B C B

U mo d X U U

out

C B C B C B

n

C B C M B C X B

A A A

n

V mo d X V V

out

n

n

B C C B

U mo d X bU X c

B C C B

n

B C C M B X M

A A

n

n

V mo d X bV X c

n n n

C C B B

U X bU X c U mo d X

C C B B

C C M B B M

out

A A

n n n

V X bV X c V mo d X

this proves HG

We claim that

degU n deg U

out

n

The term X U of U in the rst line of has degree exactly n deg U

out

n

B C

U mo d X

B C

while the contribution from M B C has degree at most

A

n

V mo d X

deg M n deg U deg U n d

degU d n degU

by twice and This proves and also HG since

degM degM deg U n deg U deg U deg U

out out

by

The second inequality in HG follows from

deg U n deg U deg U d degU d

out red

by and The rst inequality in HG follows from

since

deg V maxn deg V deg M n

out

maxd deg U d deg U degU

red

maxdegU d n d

red

maxdegU d degU d degU d

red red red

by and recall that d

The nal sub case o ccurs when d The pro cedure computes the quotient

Q bU V c and remainder W U QV so that

deg Q deg U deg V

deg W degV

Then it calls itself recursively with U V d replaced by V W d resp ectively

red

From and it follows that

d deg V deg U n d d

red red

deg V n deg U d

red

deg V

Hence d ddeg V e fullling requirement HG for the recursive call Since

d d by we may assume by induction that the outputs M V and

red

W satisfy

deg M deg V deg V

detM

C B C B

V V

C B C B

C B C M B

A A

W W

deg W deg V d deg V

with M lopsided We substitute the denition of d from to deduce

deg W deg U n d deg V

red

The pro cedure exits with

B C

B C

B M M C M

out

A

Q

Its determinant is detM detM by and Lemma

together with and shows that M is lopsided of degree

out

deg M deg M deg Q deg M

out

deg V degV deg U deg V

degU n deg U

deg U n degV

The pro cedure also exits with

n

C B C C B B

U mo d X V U

out

C B C C B B

n

C M B C C X B B

out

A A A

n

V mo d X W V

out

n

C B B C

U mo d X V

C B B C

n

C M B B X M C

out

A A

n

W V mo d X

n

C B C B C B

U mo d X U

C B C B C B

n

C M B C B C B X M

out

A A A

n

V mo d X V Q

n

n

C B B C C B

U mo d X bU X c

C B B C C B

n

C M B B C C M B X M

out

A A A

n

n

bV X c V mo d X Q

n n n

C C B B

U X bU X c U mo d X

C C B B

C C M B M B

out out

A A

n n n

V X bV X c V mo d X

fullling HG

We claim that

deg U n deg V

out

n

B C

U mo d X

B C

n

B The contributions to U in comes from X V and M C

out out

A

n

V mo d X

the latter has degree at most

deg M n deg U deg V n d

out red

degU d degV n

red

by twice and

The formula for deg M in HG follows from and The

out

second inequality of HG follows from and The rst inequality

of HG follows from

deg V maxn degW deg M n

out out

maxdegU d deg U d deg U d

red red red

by and 

Complexity analysis of fast GCD algorithm

Aho et al p show that HALFGCDs time is O M deg U log deg U

for deg V deg U if n and d are selected optimally within the algorithm

Algorithm POLYGCD also has this time b ound In order to provide a more pre

cise b ound with explicit constants we make the following simplifying assumptions

ab out HALFGCDs b ehavior

a d is a p ower of

red

b deg U mo d d and deg U deg V

red

c When d the algorithm chooses n deg U d and d d

red red red

d The outputs at each level of recursion satisfy

deg M d

out red

deg U deg U deg M

out out

degV deg U

out out

The division of U by V yields a quotient Q of degree and a remainder W

of degree deg V deg U

Assumptions a and c can b e programmed Assumptions b and d say that

the leading co ecients of certain inputs and intermediate results never vanish

these are often valid when applying the algorithm with random inputs if the ring

is large p but are not guaranteed and indeed are not satised when the

eventual p olynomial GCD has p ositive degree

Given these assumptions we claim that the time for HALFGCD is b ounded by

degU d M d d t d t

red red red add red inv

d log d t d log d t

red red mul red red add

M d log d

red red

for d There is no cost for d

red red

Our assumptions mean that HALFGCD always takes b oth else branches in

Figure unless d Hence it calls itself twice recursively The rst

red

recursive call has d d with inputs of degrees d d and

red red

d d Its output matrix M is lopsided of degree d while U and

red

V have degrees d and d resp ectively The algorithm next computes

d d The quotient Q is linear and W has degree d The second

recursive call returns a lopsided matrix M of degree d and outputs V and

W of degrees d and d resp ectively The algorithm then constructs the

outputs M U and V in this simplied scenario

out out out

Excluding the recursive calls the work in HALFGCD consists of computing

Q and W computing M and computing U and V Using the

out out out

classical algorithm for p olynomial division since Q is assumed linear the cost of

is

t t deg V t t

inv mul mul add

t t d t t

inv mul mul add

When computing M b oth M and M are lopsided of p ositive degree d

out

while Q is linear Using classical metho ds to multiply M takes time

Q

d t t

mul add

It remains to multiply two lopsided matrices of p olynomials of degrees d and

d the degrees of their entries are

B C B C

d d d d

B C B C

B C B C

A A

d d d d

We can compute M using eight multiplications of p olynomials of degree at most

out

d with d additions to combine these pro ducts followed by d mul

tiplications and additions to multiply by the leading co ecient of the p olynomial

of degree d This shows that the time for computing M is b ounded by

out

d t t M d d t d t t

mul add add mul add

M d d t d t

mul add

To construct U and V in we multiply the lopsided matrix M of degree

out out out

d by a vector with two comp onents of degree at most n Assumptions b

red

and c ensure that n mo d d so we can split the computation into nd

red red

matrixvector pro ducts where all degrees are b ounded by d Each such

red

pro duct can b e computed in time M d d t The upp er half of

red red add

each vector pro duct must b e added to another vector giving a total time b ounded

by

n

M d d t d t

red red add red add

d

red

nM d d nt

red red add

Adding and while using d d and n deg U

red

d gives a cost excluding recursive calls of

red

t t d t t

inv mul mul add

M d d t d t

mul add

nd M d nt

red red add

nM d d t M d

red red add red

t d t d t

inv mul add

degU d M d d t

red red red add

M d t d t d t

red inv red mul red add

Both recursive calls replace d by d d one replaces deg U by d

red red

and one bydegU Assuming by induction that the time b ound applies

to these calls their combined time is

d d M d d t

add

d t d log d t

inv mul

d log d t M d log d

add

d t d log d t

inv mul

d log d t M d log d d t

add add

d t d log d t

red inv red red mul

d log d t M d log d

red red add red red

When d the recursive cost in simplies to zero which is correct

red

since the recursive calls for d and and d are free Adding and

gives a cost of

deg U d M d d t

red red red add

M d t d t d t

red inv red mul red add

d t d log d t

red inv red red mul

d log d t log d M d

red red add red red

deg U d M d d t d t

red red red add red inv

d log d t d log d t

red red mul red red add

M d log d

red red

in agreement with This completes the induction

To estimate the time of Algorithm POLYGCD Figure we use the leading

terms

deg U d M d d t d t

red red red add red inv

d t t log d M d log d

red mul add red red red

from our estimate The calls from POLYGCD to HALFGCD always have

deg U d allowing us to neglect the rst term of the congruence

red

assumption b need not hold for the call from POLYGCD to HALFGCD but

b do es hold for the recursive calls from HALFGCD to itself If the original

degrees passed to POLYGCD are at most d then HALFGCD is called successively

with d d d d The p olynomial divisions within POLYGCD have

red

negligible cost unless the quotients have large degree Its total estimated time is

dt t M d log d

mul add

This is ab out times as long as the estimated time for constructing a

p olynomial of degree d from its ro ots Tables and suggest that the

actual time ratio is closer to

The storage costs of HALFGCD and POLYGCD are prop ortional to the size

of the input data if d is chosen so that the degrees of the p olynomials are ap

proximately halved at successive levels of the recursion The temp orary storage

requirements within HALFGCD are O deg U ring elements Summing over all

levels of recursion the combined temp orary storage requirement is

O deg U O deg U O deg U O deg U

The implied constant is rather large since we need storage for the matrices of

p olynomials and for the transforms used during the convolutions

Connection with p olynomial resultants

P

d

j

If F X f X is a p olynomial of degree d and m n k are integers let

j

j

T F denote the m n Toeplitzlike matrix ft g in which t f

mnk ij ij ij nmk

where f is interpreted as if i or i d Each row has some p olynomial

i

co ecients which shift to the right by one column as we go down one row The

k

entry in its lower right corner is the co ecient of X Pictorially

C B

f f f f f

k nm k nm k nm k m k m

C B

C B

C B

C B

C B

f f f f f

k nm k nm k nm k m k m

C B

C B

C B

C B

T F

mnk

C B

C B

C B

C B

C B

f f f f f

C B

k n k n k n k k

C B

C B

A

f f f f f

k n k n k n k k

If F X and GX are p olynomials in X of degrees m and n resp ectively then

their resultant is the m n m n determinant

T F

nmn

ResF G

T G

mmn

The resultant of F and G vanishes if and only if F and G share a common p oly

nomial factor p

Output condition HG can b e expressed in terms of p olynomial resultants

Supp ose Algorithm HALFGCD of Figure is called with two p olynomials U

and V where d deg U deg V Supp ose the outputs are U V and

out out

m X m X

If detM m then M

out out

m X m X

deg m deg m deg m deg m m deg V degU d m

out out

by HG and HG We also know that

m deg M d dde d

out red

and hence d m The matrixvector equation HG implies the following

equation where I denotes the m m identity matrix and denotes the m n

m mn

zero matrix

C B C B

I T U

md m mdm mdm mdm

C B C B

C B C B

C B C B

C B C B

C B C B C B

T m T m T U T U

dmd dmd dd dmd out

C B C B C B

C B C B C B

C B C B

A

C B C B

I T V T V

C B C B

md m mdm dd mdm mdm

C B C B

C B C B

A A

T m T V T m

dmd dmd out dmd

All three matrices in are d d The rst determinant

dm dm

m m m m detM

out

After removing the rows and columns with the I blo cks the remaining matrix has

m

four d m d m lower triangular blo cks whose diagonals are m m

m and m The second determinant is ResU V times the degU

deg V th p ower of the leading co ecient of U Hence the third determinant in

vanishes if and only if ResU V Because deg U d m and

out

deg V d m the T U and T V blo cks b egin with m

out dmd out dmd out

columns of zeros The third determinant in can b e therefore decomp osed

into the pro duct of the m m and d m d m determinants

T U T U

mmdm dmdm out

det det

T V T V

mmdm dmdm out

The right determinant in is ResU V times the deg U

out out out

deg V th p ower of the leading co ecient of U From HG and HG

out out

we know that the common factors of U and V are precisely the common fac

out out

tors of U and V Therefore it is plausible that the rst determinant in

do es not vanish such is indeed the case This is clear when m since the empty

matrix has determinant For m the rst determinant vanishes only if its

rows are linearly dep endent which is equivalent to the existence of p olynomials

F X and GX of degree at most m and not b oth zero with

deg F X U X GX V X d m

But U X m X U X m X V X has degree exactly d m while

out

deg m m and deg m m since M is lopsided Comparing

out

degrees on b oth sides of the identity

U X m X F X m X GX

m X F X U X GX V X GX m X U X m X V X

gives the contradiction d deg U m d m unless b oth sides of

the identity vanish in which case m X F X m X GX Since m X

and m X have no common factor by HG there must exist QX such that

F X QX m X and GX QX m X Plugging these into

shows that deg QX U X d m a contradiction

out

Knuth pp discusses the intermediate p olynomials which o ccur in

the Euclidean GCD algorithm in terms of resultants

Schwartz pp describ es how to compute the resultant of two p oly

nomials using Mo encks fast GCD algorithm

Polynomial GCDs over ZN Z

The concept of a p olynomial GCD over ZN Z is not welldened when N is

comp osite this will b e turned to our advantage in Chapter For example supp ose

we try to compute

gcdX X X mo d

using the Euclidean algorithm We start by dividing X X by X

getting a quotient of X and a remainder of mo dulo Next we try

to divide X by but is not invertible Instead we discover the factor

gcd of The explanation is that

gcdX X X mo d

but gcdX X X X mo d

The GCD has degree mo dulo but degree mo dulo No monic p olynomial

can meet b oth requirements

Nonetheless we try to follow Algorithm POLYGCD in Figure using arith

metic in ZN Z instead of arithmetic over a eld We assume that the original

p olynomials U and V satisfy deg U deg V and that the leading co ecient

of U is invertible mo dulo N indeed U will b e monic in Section The latter

condition ensures that deg U deg V which is requirement HG preceding

Theorem remains true even after the co ecients of U and V are reduced

mo dulo p for some pjN

Theorem Suppose Algorithm POLYGCD is applied to two polynomials

U and V over ZN Z where degU degV and the leading coecient of U is

invertible modulo N Then either i the algorithm nds a monic GCD of U and V

over ZN Z or ii some polynomial division has a nonzero noninvertible leading

coecient When ii occurs and a factor p is found either i or ii remains true

for the ring ZN pZ

Proof Consider the situation just b efore case ii rst o ccurs if ever

We claim that every previous call to HALFGCD had an invertible leading

co ecient for U Every such co ecient was one of

a The leading co ecient of the original U passed to POLYGCD in Figure

b The leading co ecient of a p olynomial passed to HALFGCD for U at a

higher level in the rst recursive call to HALFGCD in Figure

c The leading co ecient of a p olynomial used for a division in the second

recursive call to HALFGCD in Figure and in calls from POLYGCD to

HALFGCD after the rst iteration of the while lo op in Figure

In each case the assertion follows by induction

m m

Letting M we now use HG to obtain

out

m m

C B B C B B C C

U U U m m

out out

C B B C B B C C

C M B B C B B C C

out

A A A A

detM

out

V V V m m

out out

Hence U m U m V Since M is lopsided and degV deg U

out out out out out

by HG the leading co ecient of U is up to sign the pro duct of the leading

co ecients of m and of U Consequently b oth of the latter are invertible

out

mo dulo N

When ii rst o ccurs during computation of bU V c in HALFGCD or of

U mo d V in POLYGCD we nd a prop er but not necessarily prime factor p

of N All previous input and output assertions for HALFGCD and POLYGCD

remain true over the ring ZN pZ the degrees of V and V may drop but

out

those of U U and M remain unchanged and M remains lopsided The

out out out

only way these assertions might fail is if mo d N p in such case the leading

co ecient of V or V was zero mo dulo N not merely a zero divisor of ZN Z

When we nish POLYGCD and are working mo dulo some divisor N of N

U

there will exist a matrix M of determinant over ZN Z such that

V

U

This together with the exit condition V implies that U generates M

V

the same ideal as U and V in the p olynomial ring ZN ZX Since the leading

co ecient of U is invertible mo dulo N this generator can b e normalized to a

monic p olynomial 

Remark Caution Although the degree of U does not change under re

duction modulo N p the degree of V may drop if N is not squarefree For example

this occurs while attempting to divide X by X modulo

Opp ortunities for optimization and parallelization

The classical FFT algorithm over the complex numbers has many opp ortunities

for parallelization Chapter The algorithm in Section shares many of these

opp ortunities working mo dulo the p instead of over C The convolution is done

i

separately for each p these computations can b e arranged in vector or SIMD

i

fashion If one chooses to vectorize over the p then one can ensure unit strides

i

everywhere by storing residues mo dulo dierent primes contiguously

This convolution algorithm reduces each input co ecient mo dulo all the primes

p These reductions can pro ceed concurrently

i

After the individual convolutions mo dulo p are complete the Chinese Re

i

mainder Theorem is used to construct the outputs h in Computations for

j

dierent j are indep endent and can b e parallelized

Algorithm HALFGCD multiplies two matrices or a matrix and a vector When

using FFTlike multiplication one can do forward transforms on all elements of

each matrix p erform the matrix multiplication directly on the Fourier transforms

of the input data and take one inverse Fourier transform p er element of the output

matrix cf Section When multiplying two matrices this trick cuts the

work almost in half When using the algorithm in Section for convolutions over

ZN Z the lower b ound for P in should b e adjusted to reect the largest

p ossible co ecient which might app ear in the matrix pro duct b efore reduction

mo dulo N

Some cost estimates in this chapter are p essimistic For example when mul

tiplying two p olynomials b oth of degree d with d a p ower of we multiplied by

the leading co ecients separately If our convolution algorithm pro duces a pro d

d

uct mo dulo X then the leading or constant co ecient of the pro duct can

d

b e calculated directly and the prop er multiple of X added on Similarly a

p olynomial remainder known to have degree at most d is uniquely determined

d

if we compute it mo dulo X

Both POLYEVAL and the algorithm for constructing a p olynomial from its

ro ots pro ceed by divide and conquer Each divides the problem into two pieces

which may b e attacked indep endently For example we may compute all F

id

for xed d in concurrently The same applies to and For

small d we can compute F for several dierent i at once if the original degree

id

is suciently large On the other hand if d is large then there may b e only a few

values of i but parallelism can b e employed within the convolutions themselves

The p olynomial GCD algorithm also uses divide and conquer but its second

recursive call cannot b egin until the rst is complete severely restricting its p oten

tial parallelism There is some parallelism during the larger convolutions but little

during the highly nested stages of the recursion Whenever Algorithm HALFGCD

computes a quotient Q it inverts the leading co ecient of the denominator no

go o d parallelizable algorithm for mo dular inversion is known This distinction is

reected in Table which shows sp eedups approaching using ve pro ces

sors to construct a p olynomial from its ro ots but at most for the p olynomial

GCD problem

CHAPTER

Application to ECM

Step of ECM multiplies an initial p oint on an elliptic curve by a large integer

obtaining another p oint Q on the curve Step assumes that the reduction Q

p

of Q over GFp has small p ositive order q for some pjN It constructs several

multiples of Q and lo oks at their xco ordinates hoping for a match mo dulo p The

birthday paradox describ ed by Brent pp predicts that if we randomly

p

pick O q integers m then some m m mo d q for i j This leads to a

i i j

match among the xco ordinates of m Q and m Q This chapter describ es

i p j p

how to use fast p olynomial techniques when testing for a match mo dulo p

Checking two lists for matches mo dulo p where pjN

d d

1 2

Supp ose that pjN and we are given two lists fa g and fb g of values

i j

i j

mo dulo N where a N and b N How can we check whether there

i j

exist i and j such that

a b mo d p i d j d

i j

A related question asks for duplicates within either list for example do there exist

solutions of

a a mo d p where i j d

i j

If p were known then we could reduce all a and b mo dulo p sort b oth lists

i j

and lo ok for duplicates This takes time O d d log N log p for the reductions

mo dulo p using classical division algorithms and O d log d d log d log p for

the sorting since the entries have size O log p If d d N then these pro cessing

times are only O log p times the input size O d d log N Another approach

also for known p uses a hash table pp

Lubiw and Racz obtain a lower b ound of n log n op erations when check

ing whether n elements are distinct

For the intended application p is unknown The real question is whether there

exists a factor p of N such that some a b mo d p or a a if so we

i j i j

want to nd p Sorting seems inapplicable since no comparison function is known

Hashing seems inapplicable since we do not know how to hash in such a way that

two numbers congruent mo dulo p will hash to the same value

We can form the p olynomial

d

1

Y

F X X a mo d N

i

i

in time M d d t log d d t as shown in Section Problem

add add

asks whether F has multiple ro ots mo dulo p for some pjN Any such multiple ro ot

of F mo dulo p must b e a ro ot of its formal derivative F mo dulo p We therefore

attempt the computation

gcdF X GX mo d N

where GX F X Problem asks whether F b mo d p for some

j

Q

d

2

j and some pjN Here we attempt with GX X b mo d N

j

j

By Theorem Algorithm POLYGCD in Figure pro duces a monic

p olynomial g X which divides b oth F X and GX or nds a factor p of N not

necessarily prime If we are lucky it may nd several factors of N When a monic

GCD is found either it has degree zero implying there were no solutions of

for any pjN or it has p ositive degree The latter case seems unlikely when N is

not a prime p ower unless either a a for some i j resp a b for some i

i j i j

and j or N has only small prime factors as it means that holds for all pjN

though not necessarily with the same i and j

Use of fast p olynomial evaluation

Brent p suggests another way to do the match After constructing F X

and GX as ab ove check whether Ga mo d p for some pjN and some i

i

with i d That is check whether

n

Y

Ga gcd N

i

i

Algorithm POLYEVAL of Section requires that deg GX deg F X

this can b e ensured by taking a remainder initially Since POLYEVAL computes

RECIPF X the additional overhead for the p olynomial division is small

We saw earlier that estimates time M n log n for POLYEVAL

whereas estimates time M n log n for POLYGCD These suggest that

POLYEVAL is faster than POLYGCD while also b eing more parallelizable Sec

tion

POLYEVAL seems simpler to program that POLYGCD b ecause its recursion

is readily replaced by iteration POLYEVAL is also easy to test since one can

indep endently evaluate a few Ga the long way to check results While co ding

i

POLYGCD one must worry ab out nding a factor p of N midway through the

algorithm with all future calculations done mo dulo N p POLYEVAL do es not

require any mo dular inversions and do es not attempt a GCD with N until all

Ga have b een computed avoiding this problem but there is still the prosp ect

i

of encountering a zero divisor while computing the a themselves such as while

i

computing a slop e in

However POLYGCD retains one tangible advantage over POLYEVAL less

storage is required for large d We observed near the end of Section that algo

rithms HALFGCD and POLYGCD use O n ring elements for intermediate storage

when taking the GCD of two p olynomials of degree at most n POLYEVALs stor

age requirements are O n log n ring elements when evaluating a p olynomial of

degree at most n at n p oints as observed in Section This consideration

may diminish in imp ortance as memories grow and parallelism b ecomes more im

p ortant esp since the storage required by POLYGCD and HALFGCD has

a large prop ortionality constant

Table suggests values of d and d when searching for factors of various

sizes The suggested d is b elow until one searches for factors of digits

so the extra storage may b e aordable But the data in Table assume that

POLYGCD is used switching to POLYEVAL aects the cost equation and

the optimal parameters

Construction of p olynomials

Assume that we are using ECM to factor an integer N as in Section Let

Q b e the output of Step of ECM after selecting an initial p oint P on a curve

and computing a large multiple of P If pjN then the reduction Q of Q in E

p p

has nite order say q We hop e that q is not to o large say q

d d

1 2

Select two disjoint sets fm g and fn g of large p ositive integers we will

i j

i j

assume later that d is a p ower of and d jd The selection pro cess is discussed

in detail in Chapter that pro cess need not concern us here Form the two sets

d d

1 2

fa g and fb g where

i j

i j

a m Q i d

i i x

b n Q j d

j j x

are xco ordinates of selected multiples of Q If we are suciently lucky in this

selection then

m n mo d q

i j

for some i and j the corresp onding a and b in satisfy

i j

Next compute the p olynomials

d d

2 1

Y Y

X b mo d N X a mo d N GX F X

j i

j i

These have a common ro ot mo dulo p if holds for some i and j Either

POLYGCD or POLYEVAL can b e used to check for a match

If d d then F X and GX in can b e constructed directly as in

Section then POLYGCD or POLYEVAL can b e applied to F and to G mo d F

reducing G mo dulo F ensures that the constraint deg U deg V in Figure

is satised

This test has approximately

deg F degG d d

opp ortunities for a match if the sets fm g and fn g are chosen carefully there will

i j

b e a match in for most q d d If we want to nd p whenever q q

max

for some preselected q then we need

max

d d C q

max

for some p ositive constant C

If were the only constraint on d and d then we would choose d d

so as to minimize d d and hence the total time for constructing the fa g and

i

fb g in Memory requirements imp ose another constraint If q

j max

and C for example then suggests taking d d For

N each residue mo dulo N o ccupies ab out bytes so a p olynomial of

degree o ccupies megabytes The input to POLYGCD then requires

megabytes and the storage of the output matrix M is also approximately this

out

large if HALFGCD is called with d ddegU e as in Figure More

red

U

out

space is used by the convolutions such as when constructing M and at

out

V

out

the end of Figure This work was run on systems shared by other users and

it is unfriendly to hog the memory even if it is available So one should keep d

small say d

One remedy selects d d with d jd Compute F X as in but mo d

ify the construction of GX as in Figure b elow This keeps gcdF X GX

GX

for j from to d d do

j d

1

Y

H X X b mo d N

j i

ij d

1

GX GX H X mo d F X N

j

end for

Figure Computing GX mo d F X in pieces of degree d

unchanged but leaves deg GX b ounded by d rather than by d If d d

then this can save memory though at a cost of more computation in

Each iteration of the inner lo op in Figure divides a p olynomial of degree

at most d by F X of degree d The discrete forward transform dened in

Section of the recipro cal RECIPF needs to b e computed only once as do es

the discrete forward transform of F itself

On the rst iteration of the inner lo op in Figure the computation of GX

reduces to GX H X F X mo d N since H and F are monic of degree

d A variation initializes

GX F X

instead of GX this variation nds a match if a a mo d p for some

i i

1 2

i i and hence if two of the m or their negatives agree mo dulo q

i

Since the FFT algorithms in Sections and are designed for length a

p ower of it is convenient to choose d as a p ower of POLYEVAL and the

p olynomial construction algorithm in Section also work well with this choice

CHAPTER

Selection and Generation of Multiples of Q

In Section we chose integers d and d with d a p ower of and d jd and

d d Let Q b e the output of Step of ECM and supp ose that the reduction

Q has prime order q for some prime pjN We hop e that q is not to o large say

p

q The algorithm uses xco ordinates m Q for i d and n Q

i x j x

for j d as p olynomial ro ots usually nding p if q jm n for some i j

i j

with

i d and j d

for some i i with m or in the variation to Figure if q jm

i i

2 1

i i d and i i

The algorithm do es not sp ecify how to select the sequences fm g and fn g

i j

The next sections describ e the strategy used and some motivation b ehind it

Some desired prop erties DPs of these sequences are

satisfying m DP Most small and mo derate primes q should divide some m

i i

2 1

or some m n satisfying It is acceptable if q instead divides

i j

some m or n since the computation of that m Q or n Q reveals the

i j i j

factor p while inverting a denominator

If this prop erty holds for all primes q d d say then we can honestly claim

to have B d d in the notation of

DP The sums and dierences m m and m n should have many prime

i i i j

1 2

divisors not just the ones ensured by DP However no such sum or

dierence should b e identically zero No m or n should b e identically zero

i j

The n should b e distinct

j

DP The average time for computing an xco ordinate m Q or n Q should

i x j x

not exceed multiplications mo dulo N This gure was selected by

equating the time for computation of the ro ots of H in to the combined

time for building H and up dating G GH mo d F in Figure the

gure is implementationdependent

DP The ab ove computations of the m Q and n Q should b e amenable to

i x j x

parallel computation on a machine with to parallel pro cessors and shared

memory such as the Alliant FX in UCLAs Department of Mathematics

whose cluster had six pro cessors during the early parts of this study but ve

at the end

DP If there can b e many duplicate n s mo dulo q we try to ensure a match

j

m m mo d q satisfying or m n mo d q satisfying

i i i j

1 2

Ideally either all n are distinct mo dulo q so that we have the

j

full d d opp ortunities for a congruence m n or we have a guaranteed

i j

match

DP Minimize the number of instances where two pairs m n and m n

i j i j

1 1 2 2

share large factors or one divides the other due to algebraic identities

Section tries to achieve these ob jectives while letting each m and n b e a

i j

k th p ower or a value attained by a Dickson p olynomial Some advantages and

disadvantages of each choice are presented Section describ es how to evaluate

successive m Q quickly when m is a p olynomial function of i Section

i i

describ es the choices made in the implementation

Use of k th p owers or Dickson p olynomials

We attempt to satisfy DP by letting m P M and n P N for

i i j j

selected integers fM g and fN g where P is a p olynomial such that P X P Y

i j

k

together have many p olynomial divisors One such p olynomial is P X X

k k

where k is highly comp osite since P X P Y P X P Y X Y has

dk irreducible p olynomial factors p Here dk denotes the number

of divisors of k including and k

Perhaps surprisingly there are other monic p olynomials P of degree k for which

P X P Y have a total of dk irreducible p olynomial factors

Denition Let k be a positive integer For xed dene the Dickson

polynomial g pp by the formal identity

k

k

k

X g X

k

k

X X

It is easy to verify that g is a monic p olynomial of degree k b ecause

k

g X X

g X X

g X X g X g X k

k k k

k

If and k then g X X When the Dickson p olynomials

k

are related to the Chebyshev p olynomials chapter since

k

g cos cos k

k

Polynomial divisors of g X g Y

k k

When k we claim that g X g Y have a total of dk irreducible

k k

factors over Z just as when We illustrate this b elow for k

g X X X X

g X g Y X Y X Y

X XY Y X XY Y

g X g Y X Y X X Y Y X Y

Theorem Let k be a positive integer and x Z Then the polynomial

g X g Y has exactly dk irreducible factors over Z

k k

Proof The case is covered by the theory of cyclotomic p olynomials When

we show that there are at least dk irreducible factors and at most dk

such factors The upp er b ound is clear and applies to any p olynomial of degree k

k k

not just g since the highest total degree terms of g X g Y are X Y

k k k

which has only dk irreducible factors over Z

For existence recall that

Y

k k

X Y X Y

d

djk

where the cyclotomic p olynomial is homogeneous of degree d and satises

d

X Y Y X the minus sign is needed only if d Hence

d d

k k

k k

g U U g V V U V

k k

k k

U V

k k k k k

U V U V

Y

U V U V

d d

djk

We claim that each of the dk factors on the right is a nonconstant p olynomial

function of U U and V V over Q This will show that g X g Y

k k

has at least dk irreducible factors

To prove this claim x djk and dene

g U V U V U V

d d

By homogeneity and the symmetry of X Y in X and Y

d

g U V U V U V

d d

d d

V U V V V U

d d

U V U V

d d

g U V

Therefore g is symmetric in U and U Similarly

g U V U V V U

d d

d d

U U V U U V

d d

g U V

so g is symmetric in V and V Since g is by denition a p olynomial function of

U V U and V over Z g must b e a p olynomial in the symmetric p olynomials

U U U U V V and V V over Q Moreover g cannot

b e constant b ecause U V is homogeneous in U and V but U V is not

d d

homogeneous for 

Corollary If k is a positive integer and Z then g X g Y has

k k

exactly dk dk irreducible factors over Z

Proof From

k k k

g X X X X

k

k k k k

X X

k

g X X

k

k

it follows that g g By Theorem

k

k

g X g Y

k k

g X g Y

k k

g X g Y

k k

has exactly dk dk irreducible factors 

Prime divisors of g X g Y

k k

According to solution to exercise if uX is an irreducible p oly

nomial over Z then the average number of linear factors of uX mo dulo q tends

to as the prime q If uX Y is an irreducible factor of P X P Y

over Z then uX Y remains irreducible over Z as a p olynomial in X after al

most all substitutions of an integer for Y by Hilb erts Irreducibility Theorem

Chapter So for almost all xed Y Z the average number of linear factors

of uX Y mo dulo q tends to as q If P X g X where is a

k

k

xed integer allowing the case where and P X X then for almost

all Y Z the average total number of linear factors of P X P Y mo dulo q

tends to dk as q by Theorem and Corollary When X and Y

are indep endently and randomly selected integers mo dulo a large q then a crude

estimate for the probability that P X P Y mo d q is dk q

For xed q the actual probability dep ends on the residue class of q mo d

ulo k Theorem asserts that this probability is gcdk q q O q

k

when P X X and is gcdk q gcdk q q O q when

P X g X with mo d q Lemma says that the average value

k

of gcdk q or of gcd k q gcdk q as q ranges over the

k residue classes relatively prime to k is dk conrming that the average

probability is dk q O q if the residue class of q mo dulo k is randomly

chosen

Lemma If k then

X

gcdi k k dk

ik

gcdi k

e e

1 2

e

n

b e the prime factorization of k with each e Proof Let k p p p

j

n

e

j

Each i with i k is uniquely identied by its remainders r mo d p for

j

j

Q

e

j

n

gcdr p j n We can compute gcdi k for any such i and

j

j

j

we can compute gcdi k similarly The desired sum is

n n

Y X X Y X

e e

j j

gcdr p gcdr p gcd i k

j j

j j

e

r r r

j n

1 2

j j

ik

r p

e

j

j

j

gcdi k

j r p

j

j

gcdr p

j j

e

j

j gcdr p

j

j

e e

j j

p values of r incongruent to or mo dulo p This inner gcd is for the p

j j

j j

e e

j j

is p for p p values of r etc So the inner sum is

j j

j j

X

e

j

gcdr p

j

j

e

j

r p

j

j

gcdr p

j j

e e e e e c e

j j j j j j j

p p p p p p p p p

j

j j j j j j j

j

e e e e e

j j j j j

p p e p p p

j

j j j j j

e e e

j j j

p p e p p e

j j j

j j j

Q Q

e

j

n n

The claim follows from k p p and dk e 

j j

j

j j

Theorem Let q be a prime not dividing k and Z As X and Y range

over the interval q the number of cases where g X g Y mo d q

k k

is

q gcd k q if q j

gcdk q gcdk q

q O if q

The O term may depend on k but not q

Lemma Let G be a nite cyclic group If g G and k then the

k

equation x g has either exactly gcd k jGj solutions with x G or no such

solution

Proof We may assume that G is the additive group of integers mo dulo jGj

k

since the assertion is invariant under group isomorphism Then x g translates

into k x g mo d jGj This congruence has exactly gcdk jGj solutions when

gcdk jGj divides g and no solution otherwise 

Proof of Theorem We may regard X and Y as elements of GFq

k k

If then g X g Y simplies to X Y For Y the only

k k

solution is X For each Y the equation has a known solution X Y and

must have exactly gcdk q solutions X GFq by Lemma

Supp ose instead that After p ossibly adjusting the O we may assume

that q is o dd Given z GFq we count the solutions of g X z with

k

X GFq We claim that this equation has

k

At most k solutions when z

k

Either no solution or gcdk q solutions when z is a quadratic

residue For any such solution X is a quadratic residue Throughout

this pro of interpret quadratic residue as quadratic residue mo dulo q and

likewise for quadratic nonresidue

k

Either no solution or gcdk q solutions when z is a quadratic

nonresidue For any such solution X is a quadratic nonresidue

Supp ose that we have proved this claim There are at most k values of X for

k

which g X For any such X the p olynomial equation g Y

k k

g X has at most k solutions Y so there are at most k total solutions X Y

k

of this type When we exclude these k values of X there remain q O values

of X for which X is a quadratic residue and another q O values of

X for which X is a quadratic nonresidue For each X in the rst category

there are gcdk q values of Y satisfying g Y g X according to the

k k

claim For the second category this count is gcdk q Hence there are

q O gcd k q q O gcd k q O

total solutions of g X g Y when as asserted by Theorem

k k

k

It remains to prove the claim The case z is easy since the p olynomial

equation g X z of degree k in X can have at most k ro ots for X

k

k

We therefore concentrate on the case where z Supp ose that z

g X with X GFq Write X U U with U GFq Then z

k

k k k k k k k

U U and the assumption z b ecomes U U In particular

U U

Under these conditions we claim that the following are equivalent

i X is a quadratic residue

ii U GF q

iii U U GFq

k k k

iv U U GFq

k

v z is a quadratic residue

The equivalence of i and iii is evident since X U U and the

latter is nonzero Similarly for iv and v The equivalence of ii and iii is

also immediate since X U U GFq and q is assumed o dd For iii and

k k k

iv we observe that the quotient U U U U is in GFq This can

b e shown directly using symmetric functions since the quotient is symmetric in U

and U Or we can pro ceed by induction on k using the identity

k k k k k k k k k

U U U U U U

U U

U U U U U U

k

If z is a quadratic residue then U GFq must satisfy

p

k

z z

k

U

For each choice of Lemma gives us either no solution U GFq or

gcdk q such solutions If there are any solutions for one selection of then

there are also solutions for the other choice since is equivalent to

p

k

z z

k

U

This leads to gcd k q solutions for U and to gcd k q solutions for X

k

when z is a quadratic residue as claimed

k

If instead z is a quadratic nonresidue then U GFq n GFq set

q

q

dierence Fix u GFq such that u this is p ossible since

2

q

and the p olynomial u splits completely over GFq Since the conjugate

q

q q

q

of U is U as well as U we require U Q so U u Hence U u

k k k

is a q st ro ot of unity in GFq The equation z U U implies that

p

k

k

z z U

k

u u

For each choice of sign this has either gcdk q solutions U u in the group of

q st ro ots of unity or no such solution by Lemma As ab ove there are

either zero or gcdk q total solutions leading to gcdk q dierent values

q

of X U U satisfying g X z The equation U u implies that

k

q q

U U so any such X U U U U GFq 

T

and p is the probability If we select T indep endent random pairs fX Y g

q t t

t

that P X P Y mo d q for randomly chosen X and Y then our estimated

T

probability of a match P X P Y for some t b ecomes p When

t t q

T p

q

By Theorem p o we can approximate this probability by e

q

this p dep ends on the residue class of q mo dulo k When the residue class of

q

q is unknown we should average this success probability over all p ossible residue

classes The result is

X

T

gcdk q if exp

k q

q

0

Pr

match

X

gcdk q gcdk q T

if exp

k q

q

0

where q runs over the residue classes mo dulo k which are relatively prime to k

If we expand as a Laurent series in q then the leading term is dk T q

for any choice of by Lemma The averaged success probability is

never larger for than for xed nonzero b ecause for any pair of residues q

and q

T T

exp gcdk q exp gcdk q

q q

T gcd k q gcdk q

exp

q

gcdk q gcdk q T

exp

q

T gcd k q gcd k q

exp

q

by the arithmeticgeometric mean inequality

The dierence in these success probabilities can b ecome signicant for large k

Table gives the number of trials T needed for the probability to reach

and for selected k The values of k selected are those

for which dk exceeds all earlier values

k dk Polynomial

X g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

X q q q q q

g X q q q q q

Table Trials needed to reach condence level

Comparative computational costs

If minimizing the estimated number of trials were the only concern then we

would prefer m g M and n g N with However other desired

i k i j k j

prop erties given early in this chapter imp ose additional requirements such as ease

of computation

Supp ose that we select fM g and fN g randomly sub ject to M N B

i j i j

where the b ound B is preselected Since g is monic of degree k the m and

k i

k

n are b ounded approximately by B Any individual m Q or n Q can b e

j i j

k

computed with O log B group op erations using the same algorithm as used

during Step for a total cost of O d d k log B group op erations But there

is considerable redundancy in this computation if we rep eat it for every m and

i

k

n We can do b etter by viewing the m n B as integers in a radix R

j i j

k j

Approximately R log B group op erations suce to build a table of r R Q

R

j k

sub ject to r R and r R B Each m Q or n Q can then b e computed

i j

k

with another log m log B group op erations Weierstrass co ordinates for a

i

R R

total cost of

k

R d d log B

R

group op erations This cost is minimized when R ln R R d d Using

R d d ln d d reduces the total cost to ab out

d d

k

ln B

lnd d

k

ln B

Weierstrass group op erations and the average cost p er m Q or n Q to

i j

ln d d

k

such op erations for arbitrarily chosen m n B

i j

Since we require B d d to ensure distinctness of the fm g and fn g the

i j

average cost of this algorithm exceeds k group op erations Section shows how to

compute these m Q with an overhead of k Weierstrass group op erations apiece

i

plus initialization costs if the fM g form an arithmetic progression regardless of

i

k

likewise for the fN g If we restrict so that P X X and if jk then

j

we can reduce the overhead to k homogeneous group op erations p er m Q

i x

or n Q by using a dierent sequence fM g as seen b elow

j x i

Lemma If q is an odd prime then is an th power modulo q

p

Proof If is a quadratic residue mo dulo q then

p

If is a quadratic residue mo dulo q then

p

If is a quadratic residue mo dulo q then

Since at least one of these is a quadratic residue the pro of is complete 

p

8

and treat it as an integer Remark We subsequently refer to this root as

p

8

in the analysis Algorithms use only powers of Since Q

x

Q the sign ambiguity wil l not matter

x

Corollary Let k be a positive integer divisible by and let q be an odd

k

prime Then at least one of is a k th power modulo q

p

k

8

k k

k

mo d q then mo d q Conse Proof If

k

k

quently mo d q and one of these is a k th p ower mo dulo q 

If our k is divisible by then we can multiply any previously created k th

k

p ower or negative thereof by to create another such p ower From a p oint P

k

we can construct P using k doubling op erations in the group If we are

x

allowed multiplications mo dulo N p er p oint then we require k or

k since each application of the homogeneous doubling rule uses ve

such multiplications

This trick should b e employed on only one of the sequences fm g and fn g If

i j

k

it is used on b oth sequences then there are many cases where m m and

i i

2 1

k

n n violating DP Since we are assuming that d d it is probably

j j

2 1

cheaper to employ this trick during generation of the fn g than during the fm g

j i

k k

Using p owers of and

A straightforward scheme based on this idea lets

ik

m i d

i

d k j k

1

n j d

j

where jk One can compute the xco ordinates m Q using d k succes

i x

sive cubings and hence d k group op erations which can b e done using

homogeneous co ordinates Another d k doubling steps suce to compute the

d k

1

n Q starting from the known Q If d d then the average cost

j x x

p er xco ordinate is slightly ab ove k group op erations

Prop erty DP is achieved if Q has o dd order since a match n n with

i j

i j d implies that

d k ik d k j k

1 1

d k d k j ik

1 1

m n

d j i

1

ensuring that a match exists in

This scheme also satises DP since any prime q with

q d d gcdq k

ik j k

divides some m m or some m n To prove this consider f g for

i i i j

1 2

i d and j d All of these are nonzero k th p owers mo dulo q

since jk By the Pigeonhole principle there must b e a duplicate amongst these

ik j k

d d values This leads to a congruence where jij d and j

d with i j not b oth zero If j then m m If instead j then

jij

n m

j d i

1

This scheme fails to b e parallel as in DP but such is not imp ortant when

implementing ECM on a sequential architecture This scheme also has many cases

ik j k

where one divides multiple m n some with gcd i d j

i j

1 1

in p ossible violation of DP

Achieving parallelism

The last scheme failed to achieve parallelism in part b ecause each p oint was

k

multiplied by to get the next p oint When using k th p owers ie it

k

would b e desirable to instead multiply several p oints by concurrently

Supp ose that our parallelism requirements dictate multiplying dierent p oints

k

by at once might b e the number of pro cessors available If we have selected

k

the n for j by some alternate means then we can let n n for

j j j

j d Assuming that jd jd our algorithm can resemble

Cho ose k divisible by with dk as large as feasible Construct m Q

i x

k

for some fM g Use these xco ordinates for i d with each m M

i i

i

for ro ots of F X in Section Initialize GX

k

for some fN g Construct n Q for j with each n N

j j x j

j

Perform the next two steps for d in this order

k

If then set n n for j Each corre

j j

sp onding n Q can b e computed from n Q using k doubling

j x j x

steps values for dierent j can b e computed in parallel

If d j ie when d new values of n Q have b een found then

j x

form another H p olynomial and reduce GX H X mo dulo F X as

in Figure

k

For the rest of this section we assume that d and that n m for

j j

j d Then prop erty DP is automatically satised if our p oint Q has

o dd order which can b e ensured by doing enough doublings during Step More

precisely supp ose that some n n where i j d If i d and j d

i j

k k

then m m mo d q implying that m m If i d but j d

i j i j

then m n If i d and j d then n n and we pro ceed by

i j d id j d

1 1 1

induction on j

It remains to select the fm g We should satisfy the cost requirement DP

i

although we may b e able to aord more overhead p er m than p er n since we will

i j

not b e needing as many

There should b e few if any multiplicative relations amongst the fm g If

i

k

for example m m m m then m m shares many factors with

i i i i i i

1 2 3 4 1 2

k

m m for each violating DP This precludes for example dening

i i

3 4

i k ik

M and m M for i d even though such values can b e succes

i i

i

sively computed with k nonparallel group op erations apiece This duplication

k

was not a problem in Section where all p owers of were multiplied by the

same m namely by m

i d

1 1

If we use random values for fM g then the only apparent violations of DP

i

k

o ccur for those constructed from the same M since for example divides

i

all n m That is the only apparent redundant m n o ccur when i j

id i i j

1

mo d d If d then this redundancy o ccurs for under of the m n

i j

an allowance which app ears negligible

But large numbers of random k th p owers may not b e easy to compute e

ciently as desired in DP despite the metho d presented early in Section One

computationally feasible prop osal uses arithmetic progressions for fM g If a and

i

k

b are xed then the values of m ai b are successive values of a p olynomial

i

of degree k and successive m Q can b e computed with k group op erations apiece

i

after suitable initialization see Section That algorithm allows parallelism on

up to k pro cessors but uses Weierstrass rather than homogeneous co ordinates

and hence requires mo dular inversions Its initialization cost dep ends primarily on

k

the magnitude of the largest ai b for i k

However the use of arithmetic progressions for these M may lead to decreased

i

overall eectiveness Prop erty DP requires that most or all small primes q divide

sub ject to or some m n sub ject to The primes m some m

i j i i

2 1

least likely to have this prop erty are those where gcdq k since any

k k

such q divides an M M only if it divides M M Supp ose that we have used

i j

i j

arithmetic progressions for fM g say M ai b Then M M ai j b

i i i j

and M M ai j There are only d distinct sums ai j b for

i j

d

1

i j d and i j compared to the desired d d such sums

There are also very few distinct dierences ai j For small o dd the problem

p ersists since

k k k

n n ai b aj b

id j

1

is divisible by

ai b aj b a i j b

and there are many duplicate sums and dierences on the right There are few

p

8

is duplications of this type when is even and the numerical value of

presumably large mo dulo most primes q but many of the p otential opp ortunities

for a match mo dulo q are b eing wasted when is o dd violating DP and p ossibly

DP

One workaround uses multiple arithmetic progressions for fM g If we use

i

d such progressions each of length then the fM g are essentially random If

i

we use eight or sixteen such progressions then the ab ove troubles o ccur only for

M M where M and M come from the same progression and hence for one

i j i j

eighth or one sixteenth of such pairs which may b e an acceptable tolerance

Separate arithmetic progressions

k

If we abandon our convention that n m for j d but retain

j j

k

n n if j d then we must select the early n as well as the m

j j j i

Unless we restrict fm g prop erty DP need no longer hold

i

One approach uses one arithmetic progression for the fM g and another for

i

p

8

times earlier values That is Subsequent values of N are fN g

j j

j

M a i b i d

i

and

a j b if j

p

8

if j a j b

N

j

a j b if j

We can achieve DP by choosing the progressions so that most small primes have

the form M N For example let M i for i d and N d j

i j i j

p

8

for j with N N for j d This particular example has

j j

many duplicate n since for example

j

k k k k k k

n n N N d N n

Such overlap is rectied by letting N d j take on only o dd multiples

j

of d for j even multiples app ear in later rows of Figure ie H

j

p

8

We should also shift the values in for larger j due to the multiplications by

the rst row say to M i d to increase the number of distinct ratios

i

M N with i d and j d For example if d then M to

i j

M are and whereas the early N app ear in Figure

j

H ro ots N N N N

p p p p

8 8 8 8

H ro ots N N N N

H ro ots N N N N

p p p p

8 8 8 8

H ro ots N N N N

H ro ots N N N N

Figure Dep endencies using two arithmetic progressions and doubling

Use of arithmetic progressions and Dickson p olynomials

k

The trick of multiplying a p oint by a k th p ower or by to get another

k k k

p oint when used the identity XY X Y so that

k k k k

g XY Q XY Q Y X Q Y g X Q

k k

No such identity relates g XY to g X when

k k

The data in Table suggest that Dickson p olynomials g X with

k

k

p erform b etter than X when selecting the m and n if our sole ob jective is

i j

maximize the probability that a random prime q divides some m m satisfying

i i

1 2

or some m n satisfying Of the metho ds of generation considered

i j

in Section with m g M for some fM g resp n g N for some

i k i i j k j

fN g the use of random M requires over k group op erations p er m while the use

j i i

of arithmetic progressions for fM g reduces this cost to just k such op erations after

i

suitable initialization Since theory suggests that all p erform comparably

it is simplest to use or to force nonnegative co ecients We can

let for example

M i i d

i

N d j j d

j

Evaluation of fm Q g where m is a p olynomial function

i x i

Let P X b e a p olynomial in X of degree k Some of the ab ove schemes require

evaluating P i Q for several successive integers i This can b e done using a

x

straightforward mo dication to the technique for evaluating a p olynomial along an

arithmetic progression p at a cost of O k Weierstrass group op erations

p er evaluation after suitable initialization The computations can b e arranged so

that these O k op erations can b e done in parallel

For example consider P i Q for successive i where P X X and k

Tabulate P to P k and take nite dierences as in the left of Figure

Figure Finite dierences of p olynomial function P X X

Because deg P the fourth row of nite dierences is constant The

vector on the rst full upward diagonal of Figure

can b e computed from the top row using k k integer subtractions and the

results used to evaluate Q Q Q Q Q as in Section Each

subsequent upward diagonal such as Q Q Q Q Q can b e

computed from the previous such diagonal with k group op erations by following

the arrows in Figure Once an entire diagonal vector is known a value of

P i Q can b e extracted from its last comp onent

Q Q Q Q Q

Q Q Q Q Q

Figure Dep endencies when up dating an upward diagonal

A problem with this scheme is that the comp onents of the vector cannot b e

up dated in parallel b ecause each comp onent is dep endent on the previous comp o

nent as evidenced by the horizontal arrows in Figure

Fortunately there is an easy remedy If we use downward diagonals rather

than upward diagonals in Figure then we can pro ceed very similarly but all

comp onents can b e up dated in parallel Figure illustrates how the downward

diagonal Q Q Q Q Q can b e calculated from the previous downward

diagonal Q Q Q Q Q using four parallel group op erations

Q Q Q Q Q

Q Q Q Q Q

Figure Dep endencies when up dating a downward diagonal

Whether we use upward or downward diagonals this computation requires

Weierstrass co ordinates rather than homogeneous co ordinates b e

cause Q Q is usually not known when we need to compute Q Q from two

x

known multiples Q and Q of Q Hence each group op eration needs a mo dular

division when computing the slop e in Downward diagonals are preferable

to upward diagonals even on a sequential architecture b ecause the mo dular inver

sions in these divisions are indep endent All but one inversion can b e exchanged

for three mo dular multiplications by rep eatedly using the identities p

x y xy and y xxy

For an arbitrary p olynomial P X of degree k supp ose that we want to enu

merate fP ig for i Dene

P X P X P X P X P X j k

j j j

Each P is a p olynomial of degree k j in particular P is constant We evaluate

j k

P to P k numerically and use those to evaluate P i for j k and

j

i k j Hence we can determine the k vector P P P

k

Each P Q can b e computed as describ ed for random multiples of Q early

j

in Section This gives us the vector v where

v i P i Q P i Q P i Q

k

Next for i we can compute v i from v i using k parallel group

op erations The rst comp onent of v i is the desired P i Q P i Q

Implementation choices

It was decided to implement two schemes Of the values of k app earing in

k

Table cost requirement DP restricts us to k when using P X X

where jk and to k when using P X g X where

k

k

For P X X with k a variation of the scheme in Section was used

The main problem was a lack of parallelism when using only p owers of and If

there are under pro cessors and d then we can let

ik j k

m i j d and j

ij

ik j k

n i j d and j

ij

j k j k

The values of Q and Q for j can b e computed sequentially

x x

using k k nonparallel applications of or

If this is an unacceptable amount of sequential calculation then one can use

additional primes b esides and Once these have b een built k applications of

and k of suce to get each new m Q cost k k

ij x

mo dular multiplications apiece Only k applications of cost k

mo dular multiplications apiece are needed p er new n Q For d d

ij x

the average overall cost drops b elow the multiplications allowed by DP

Figure illustrates the dep endency picture when building the m Q

ij x

The computations along any row except the rst can pro ceed in parallel with

this pro cess rep eated until d values are available

This construction app ears to fail prop erty DP by not ensuring that small

primes divide some m m or m n though they divide with high probability

i i i j

1 2

k k k

m m m m

k k k k k k k

m m m m

k k k k k k k

m m m m

Figure Dep endencies when using geometric progression

by The only apparent algebraic divisibility relation amongst the m n

i j

i k j k i k j k

1 1 2 2

where i and j satisfy o ccurs when the exp onents in

satisfy gcd i j i j for four random integers this event has probability

which app ears acceptably small for DP The

scheme almost satises DP if

i k j k i k j k

1 1 2 2

mo d q

ik j k

then there is a congruence mo d q with j and jij d

and i j not b oth zero If i then n m If we redene

ij

k

m instead of the value in then we have a similar result with

d

1

k

i since m m Hence we have a match in

ij d

1

j k

except p ossibly when i meaning mo d q for some j with j

We can check these cases separately

The other scheme implemented used P X g X with and k

k

The sequences were chosen by m g M and n g N where

i i j j

M i i d

i

N d j j d

j

We claim that this satises DP by ensuring that all small primes q divide

some M N and hence some m n If gcdq and d q d d

i j i j

then we can write q d j i where i d and j d For

q d and q mo d we can represent q instead of q

Selecting N d j d rather than N d j would enable one to

j j

extend the upp er b ound to d d d alb eit at the cost of computing higher

multiples of Q when initializing the algorithm in Section

This scheme seems to b e remarkably free of algebraic identities where one

P X P Y divides another and so seems to satisfy DP

However this scheme may fail DP by having duplicate n mo d q without

j

ever satisfying or

k

Richard Crandall has also used P X X but without the multiplies

k

by allowed by Corollary His preliminary estimates suggest k or

k when B and k or k when B Crandall did not

use an FFT so his asymptotic cost p er test of m n is much higher The data

i j

k

in Chapter suggest that k is suciently high when using P X X

CHAPTER

Selection of Curve

So far we have not sp ecied which elliptic curve to use except that it should

have the form

B y x Ax x

for some A and B with gcdA B N is the ane equivalent of

We also require a known initial p oint Suyama proved that any curve of

the form has order divisible by when reduced mo dulo any prime p since

at least one of B A B A A is a quadratic residue mo dulo p

p

We show how to select curves of form whose torsion group over Q

has order or with a known rational nontorsion p oint hence p ositive rank

over Q Then we present numerical data comparing the actual exp onents of and

which divide the orders of these curves when they are reduced mo dulo a prime

Torsion subgroup of order and p ositive rank over Q

Montgomery pp showed how to select a curve with known initial

p oint and with torsion subgroup ZZ over Q If

a t a a

B where a A

a a t

then has the following rational torsion p oints

O P

t t t t

t t P P

t t t t

t t t t

P P a a

t t a a t t

t t

P P

t t

t t t t

a a P P

t t t t a a

t t t t

P t P t

t t t t

The p oint

p

a a

x y

a a

is on if a t t is a rational square We can achieve

this by letting t u u where u u is a rational square

Torsion subgroup of order and p ositive rank over Q

Bremner discovered that the curve

y xx x

has a torsion subgroup of order and p ositive rank over Q Replacing Bremners

x by x and his y by y gives the curve

x x x x y x x

which has the form More generally if a b c then the curve

a b a b

y x x x x x x

b a a b

has the following rational torsion p oints where P has order and Q has order

b

O Q

a

a b c a bc a bc a b c

P P Q

a b c a b c a b c a b c

c a b

P P Q

ab ab

a b c a bc a bc a b c

P P Q

a b c a b c a b c a b c

a

P P Q

b

a b c a b c a bc a bc

P P Q

a b c a b c a b c a b c

c a b

P P Q

ab ab

a bc a bc a b c a b c

P Q P

a b c a b c a b c a b c

These p oints are distinct for all but nitely many ratios a b c and give a

torsion subgroup of order Mazur showed that this is the largest p ossible torsion

subgroup for an elliptic curve over Q p

Theorem Any el liptic curve with torsion group ZZ ZZ over Q is

equivalent to one of form with ab

Proof Kub ert p gives the parameterization

Y cXY bY X bX

where

d d

d b d d c

d

Here Q and

dd d d d

Completing the square in gives

cX b

Y

c b bc b

X X X

d d d d d

X X d d X

d d

The linear change of variables

Y d y cX b

X d x d d

converts this to the form with B and

d d d d d

A

d d dd

This has the form with

a b c

Restriction b ecomes

this simplies to and hence ab since is rational 

Bremners curve with a b c has rational nontorsion

p oints and hence p ositive rank Sample xco ordinates are

x

and their recipro cals A search found rational nontorsion p oints on for some

other Pythagorean ratios a b c These solutions were analyzed for patterns

The solutions x when a b c x when

a b c and x when a b c all have the

form

x c ac a or x c bc b

This p oint is a torsion p oint when a b c but not in the other two

cases Substituting x c ac a and b c a in gives

c c a c ac a

y

a c ac a

Since c ac a b is assumed to b e a p erfect square the computed y is a

p erfect square whenever c ac a is a p erfect square

We can get innitely many solutions to by setting a t b t

and c t where t is a rational number to b e determined We require

that t t b e a rational square Letting u t we want u and

u u to b e squares By selecting an arbitrary p oint u v u v

on the elliptic curve

v uu u

and doubling it we accomplish our ob jective since the uco ordinate of the doubled

p oint is always a p erfect square Sp ecically we can set

u

t

v

Our arbitrary p oint can b e a random multiple of the known p oint u v

or on

This and other ways to get an initial p oint for are listed in Table

Each metho d requires some homogeneous quadratic p olynomial in a b and c to

b e a p erfect square while also requiring that a b c Up on parameterizing

a t b t and c t each entry leads to a fourthdegree p olynomial in t

which must b e a p erfect square its solutions if any lie on an elliptic curve Table

entries were found in an ad hoc manner so I make no claim of completeness The

last entry is due to Atkin and Morain who also show how to construct curves

of p ositive rank with other torsion groups over Q Elkies p describ es how

to add p oints on an elliptic curve Y quartict with known rational p oints this

can b e used to nd more ratios for the last column of Table sometimes using

a trivial rational p oint where ab

Numerical comparison of torsion subgroups of orders and

If a curve E has a torsion subgroup of order over Q and p is a prime which

do es not divide the denominator of any co ecient of E or of its torsion p oints

then its reduction E mo dulo p has order divisible by unless E is singular

p p

mo dulo p or two p oints in the torsion subgroup agree mo dulo p and hence for all

x Required square Example ratios a b c

a

a ab b

b

b

a bc

b c

c a

c ac a

c a

a

a ba b

a b

a c

ab ac bc

b c

b c

c bc b None found

b c

bc ac ab

bc ac ab

b c a

Table Some ways to ensure that curves group order is divisible by

but nitely many primes p The ECM algorithm succeeds if jE j is suciently

p

smo oth this quotient is approximately p Likewise if E has a torsion subgroup

of order over Q then the ECM algorithm succeeds if jE j is suciently

p

smo oth this quotient is approximately p Intuitively since p p the

former seems more likely to b e smo oth so curves with torsion subgroup of order

are b etter

A numerical exp eriment was conducted to check this hypothesis It used ve

curves of form with u It also used

ve curves of form as describ ed in the paragraph near with

p

u

The orders of all ten curves plus one other curve with A and B

were computed mo dulo each of primes from to those primes

for which a denominator vanished or for which one of the curves was singular ie

A were excluded

When the torsion subgroup had order over Q the prime divided the group

order jE j an average of times and the prime divided the order an average

p

of times eectively subtracting an average of

ln ln

from the natural logarithm of the order When the torsion subgroup had order

over Q the prime divided the order an average of times and the prime

divided the order an average of times eectively subtracting an average of

ln ln

from the natural logarithm of the order Both averages are considerably larger than

the ln or ln which one would exp ect given a

random multiple of resp but the dierence b etween the two exp ectations

seems slight As exp ected the curve with A and B torsion group

of order p erformed much worse than the others with app earing to the

p ower and to the p ower on average

In all cases the prime divided the order average of times while divided

the order an average of times

The data was subsequently analyzed for patterns dep ending on the residue class

of p Primes p mo d fared b etter when the torsion subgroup had order

but primes p mo d fared b etter when the torsion subgroup had order

as seen in Table The statistics app eared not to dep end on the residue class

of p mo dulo once the residue class of p mo dulo is xed

p mo d

Torsion subgroup order

Curves tried

Average exp onent of

Average exp onent of

Average ln jE j reduction

p

Power of

or more

Table Power of dividing group order

Tables and have data ab out the exp onent of dividing jE j and

p

p These data suggest Conjectures and

p mo d

Torsion subgroup order

Curves tried

Average exp onent of

Average exp onent of

Average ln jE j reduction

p

Power of

or more

Table Power of dividing group order when torsion subgroup has order

Conjecture Let E be an el liptic curve with torsion subgroup ZZ over

Q For any prime p let jE j denote the order of its reduction E modulo p

p p

Then

a As p ranges through the primes congruent to modulo the largest power

of dividing jE j is with probability for each

p

b As p ranges through the primes congruent to modulo the largest power

of dividing jE j is with probability if and probability

p

if

c As p ranges through the primes congruent to modulo the largest power

of dividing jE j is with probability if or probability

p

if and probability if

Conjecture Let E be an el liptic curve with torsion subgroup ZZ ZZ

over Q For any prime p let jE j denote the order of its reduction E modulo p

p p

Then

p mo d

Torsion subgroup order

Curves tried

Average exp onent of

Average exp onent of

Average ln jE j reduction

p

Power of

or more

Table Power of dividing group order when torsion subgroup has order

a As p ranges through the primes congruent to modulo the largest power

of dividing jE j is with probability if and probability if

p

b As p ranges through the primes congruent to modulo the largest power

of dividing jE j is with probability for each

p

c As p ranges through the primes congruent to modulo the largest power

of dividing jE j is with probability if and probability

p

if

According to Conjectures and the average exp onent of dividing

the order jE j is resp when the torsion group has order resp

p

over Q and p mo d The average exp onent of dividing jE j is resp

p

if p mo d and resp if p mo d

The conjectures make no prediction for the exp onent of when p mo d

or the exp onent of when p mo d The fractions

and in Table are approximately and resp ectively

but the other entries seem hard to guess

The evidence for these conjectures is weak even if it is correct for curves gener

ated using or using and b ecause another metho d of selecting

the curves may give dierent statistics The program was rerun using random

curves with torsion group ZZnot necessarily with p ositive rank and another

curves with torsion group ZZ ZZ for primes in

curves with each torsion group The results resembled those in Tables

and

If p is a prime and E has a subgroup isomorphic to ZnZ ZnZ for some

p

integer n then p mo d n p For example if p mo d then

E cannot have a subgroup isomorphic to ZZ ZZ so its Sylow subgroup

p

k

must b e cyclic Hence a p ower cannot divide E unless E has a p oint of

p p

k

order When instead p mo d the Sylow subgroup of E need not b e

p

cyclic This rationalizes why the exp onent of dividing the order in Table

varies with p mo d and why the exp onent of in Tables and varies

with the exp onent of dividing p but do es not predict the actual distributions

CHAPTER

Selection of Search Limits B d d

The ECM algorithm usually nds the prime pjN if the order jE j has all but

p

p ossibly one prime divisor b elow a b ound B and its remaining prime divisor q

if any divides m n for some i and j satisfying We want to estimate

i j

the exp ected cost required to nd p in terms of the size of p and the search

parameters B d d Here d d are as in Chapter To achieve this we

estimate Pr B d d the probability of nding p using parameters B d

succ

and d with a random curve We also estimate Cost B d d the cost of running

one curve with these parameters Then we attempt to minimize the exp ected total

cost

Cost B d d

Pr B d d

succ

The numerator estimate dep ends on the magnitude of N arithmetic is cheaper

if N is smaller The denominator estimate dep ends on the prime p larger proba

bilities for smaller p The minimization pro cess treats B d and d as continuous

real variables while xing N and p This pro cess ignores the requirements that d

b e a p ower of and that d jd Fortunately the exp ected cost is very at near its

minimum cf the last two columns of Table so imp osing these restrictions

later do es not signicantly aect the estimated total cost

The analysis assumes use of curves with torsion group of order over Q

Section It also assumes that Step uses P X X as in Section

Dickmans function

Dickmans function pp p estimates the probability that

a large integer x has all its prime factors b elow x It satises the functional

equation

the ranges for in are incorrect An asymptotic formula is

ln ln ln ln o

Estimated success probability p er curve

The curve has b een chosen to have order divisible by its order is approxi

mately p In terms of Dickmans function the estimated success probability

during Step is the probability that an integer near p has all its prime factors

b elow B namely

lnp

ln B

This estimate is slightly p essimistic since the data in Section suggest that the

average p owers of and dividing jE j are larger than those dividing random

p

integers

Step succeeds if there exists a prime q B such that

i q divides the group order

ii all prime factors of jE jq are b elow B

p

iii two of the multiples of Q constructed during Step satisfy

Item i has estimated probability q for q p and probability otherwise

The estimated probability of ii is given by Dickmans function By the

estimated probability of iii is

X

d d

Pr d d q gcd q exp

match

q

q mo d

0

gcdq

0

expd d q expd d q

expd d q expd d q

expd d q expd d q

expd d q expd d q

Since at most one q B can satisfy b oth i and ii we can sum the pro duct of

these probabilities over all prime q B leading to an estimated Step success

probability of

X

lnpq Pr d d q

match

q ln B

B q p

1

q prime

Z

p

Pr d d q lnpq dq

match

q ln B ln q

B

1

Substituting q expq and adding gives a total success probability of

Z

lnp

lnp q dq

Pr d d expq Pr B d d

match succ

ln B q

ln B

1

lnp

ln B

Estimated time p er curve

The rows of Table summarize the ma jor actions during the ECM algo

rithm Actions taking negligible time such as selection of the curve itself are

omitted Each row has four entries

i A brief description of the action Actions app ear in the order in which they

are rst executed

ii The number of times the action is executed p er curve

iii Its asymptotic cost p er execution for large d d and B but xed N This

column assumes that M d O d log d in Section

iv An estimated cost using the rst term of the asymptotic cost with con

stants chosen to match actual run times in milliseconds on a DEC

for a digit N Sp ecically that run attempted to factor the cofactor

of p listed in Table with B and d and

d

Times Asymptotic cost Fitted cost for

Action

executed p er execution digit N msec

Step O B B

Ro ots of F O d d

Construct F X O d log d d log d

Construct RECIPF X O d log d d log d

Ro ots of H d d O d d

Construct H X d d O d log d d log d

GX GX H X

d d O d log d d log d

mo d F X

gcdF X GX O d log d d log d

Table Estimated time p er curve milliseconds

Table predicts a total time p er curve of

Cost B d d

B d d log d d log d

d

d d log d d log d d log d d log d

d

B d d d d log d d d log d

The precise constants in this estimate dep end on the implementation and the

hardware available Using more precise asymptotic costs also aects

Estimated optimal parameters

We want to minimize The constants in its numerator cost estimate

were derived assuming that N has digits If all costs of the compu

tation grow in equal prop ortions N increases then the lo cation of the minimum

do es not dep end on N However suggests this prop ortionality assump

tion is incorrect indeed the data in Table show that op erations mo dulo a

digit N take times as long as those mo dulo a digit N whereas

the multiplication mo dulo N during Step presently uses an O log N algo

rithm The denominator success probability estimate dep ends heavily

on the magnitude of p

Table gives estimates of the optimal parameters ie those minimizing

for various sizes of p It includes estimated run times in hours for the

environment of the last column Table The minimization was done numeri

d

cally To approximate a prime of d decimal digits we put p in

The integral in was approximated by Simpsons rule Dickmans function

was approximated using interpolation in a table

The third and fourth columns of Table suggest that one should use

d d The exp ected cost in the sixth column increases by for each each

additional digit in p With optimal parameters approximately twothirds of the

run time is in Step this p ercentage is when p has

digits The conditional probability that a success o ccurs during Step

rather than Step while using optimal parameters drops from to as p

increases from to digits

In practice the size of p is usually unknown The last column of Table

gives the estimated times to nd p of various sizes using parameters optimized

for a p of digits The estimated times using these parameters are at most

twice the corresp onding optimal times if p has digits and within of the

corresp onding optimal times if p has digits If N has digits then each

curve takes ab out hours on a DEC using these parameters

The program used to generate Table was rerun using P X X rather

than P X X The estimated run times were smaller than the

corresp onding times in Table suggesting that the exp onent is to o large

The corresp onding table for P X X has values of B ab out smaller

than those in Table its values of d are larger while its values of d

are larger suggesting d d or d d These dierences are

more signicant for smaller p Both P X X and P X X have smaller

exp ected times than P X X

On a machine with parallel pro cessors each as fast as a DEC or a

network of these Table predicts that one can nd all factors up to digits

of a digit N within an hour up to digits within a day and digits within

a month assuming p erfect parallelism Such systems may b e widely available in

ten years allowing Rusins digit ECM record in Table to b e b eaten many

times Brent p predicts that factors up to digits can b e found this way

Exp ected Exp ected Time with

number time hrs B Digits

B d d

of for d in p

curves digit N d

Table Estimated optimal parameters

CHAPTER

MultiplePrecision and Mo dular Arithmetic

Timing runs revealed that this programs time was concentrated on four activ

ities

i Arithmetic mo dulo N esp multiplication This is the ma jor activity during

Step it is also used during Step when calculating the xco ordinates for

use in

ii Arithmetic mo dulo all primes p during a convolution

i

iii Finding remainders mo dulo all p given a value mo dulo N Section

i

iv Reconstructing remainder mo dulo N given remainders mo dulo several p

i

equation

The program has b een designed to allow parallelism but each of the ab ove

steps was treated to b e an indivisible op eration and assumed to b e completed on

a single pro cessor For example the design allows several indep endent multiplica

tions mo dulo N to pro ceed at once but any such multiplication is completed by a

single pro cessor

The only parallel architecture used during the study was an Alliant FX a

MIMD architecture which also supp orts vectorization If a convolution required

K prime mo duli then the data was structured so K remainders mo dulo dierent

primes were stored in adjacent lo cations allowing vectorization with unit stride

over the primes this contrasts with Silvermans implementation which tried

to assign each prime mo dulus to a separate pro cessor during the convolutions A

typical primitive op eration used by the FFT is

a b a b a b mo d p i K

i i i i i i i i i

which has one multiplication one addition and one subtraction mo dulo each p

i

all p otentially vectorizable

Arithmetic mo dulo N

Arithmetic mo dulo N used the algorithm in Supp ose that N can b e

represented using digits in radix R where R is a p ower of Let A B N

with

X X X

i i i

n R b R N a R B A

i i i

i i i

and a b n R for all i The algorithm requires a constant N such that

i i i

N N mo d R

such exists since N is o dd

Pro cedure MODMULN in Figure is based on the classical not highsp eed

multiplication techniques ie its time is O It returns AB R mo d N rather

than AB mo d N To comp ensate all residues mo dulo N should b e scaled by R

b eforehand Dene

A AR mo d N

for any integer A mo dulo N Then

A B A B R AR BR A B mo d N

AB AB R AR BR R A BR MODMULNA B mo d N

A then using MODMULN on two internal If A mo d N is represented internally by

representations gives the internal representation of their pro duct mo dulo N The

addition and subtraction algorithms are unchanged Algebraically the mapping

A is an isomorphism from the ring ZN Z with conventional arithmetic to A

that set with ordinary addition but with multiplication dened by MODMULN

but the multiplicative identity b ecomes The additive identity remains

Given two p olynomials F X and GX the algorithms of Section return

F X GX mo d N Supp ose we invoke it on

X X

i j

F X f X and G X g X

i j

i i

Each output co ecient has the form

X X

f g f g R mo d N

i j i j

ij ij

P

rather than the desired f g That is they are to o large by a factor of

i j ij

R mo d N This correction is easily incorp orated into the convolution algorithm

by scaling the precomputed co ecients in

pro cedure MODMULNA B

C

for j from to do

P

i

c R Cmt Supp ose C

i

i

dtemp c a b

j

q dtemp N mo d R

j

Cmt Compute C C c a B b q N dtempR

j j

car r y dtemp q n R

j

for i from to do

Cmt a q b n R

j j i i

Cmt car r y R

Cmt c R if i

i

Cmt dtemp R R if i

Cmt c R

Cmt dtemp R if i

dtemp c car r y a b q n

i j i j i

c dtemp mo d R

i

car r y bdtempRc

end for

c car r y

j

Cmt C R a a a B q q q N

j j R j j R

end for

if C N then C C N

return C

end MODMULN

Figure Pro cedure for multiplication mo dulo N

Doublelength multiplication

The p ortion of Algorithm MODMULN of Figure preceding the nal C

N test requires the following primitive arithmetic op erations

i Addition of nonnegative integers with sum not exceeding R

ii Multiplication of two nonnegative integers b elow R with pro duct as large as

R

iii Integer quotient and remainder when dividing a nonnegative integer b elow

R by R On a binary machine this is equivalent to extracting the least

signicant log R bits or the next log R bits of the input

The length of the op erands is dlog N e to keep and the number of lo op

R

iterations small the the radix R b e as large as convenient while tting into a

machine word

I co ded MODMULN in four essentially dierent ways some aimed for p orta

bility and some for high p erformance on certain architectures The subalgorithm

is chosen at compile time Dierent subalgorithms imp ose dierent b ounds on the

radix R

Direct This subalgorithm requires the hardware and language to supp ort integers

as large as R On a bit machine we can restrict R while using

ordinary integers or R while using bit results

The Fortran Fortran and ANSI C language standards do not sp ecify a

long integer data type though the kind parameter of Fortran p ermits it

The GNU C compiler do es supp ort a long long data type but UCLAs cur

rent version do es not inline the co de op erating on such data making it

needlessly slow though Version has improved supp ort This subalgorithm

was used primarily in some assembly language versions of MODMULN

Halnt This requires that R b e an even p ower of To multiply two op erands

p p

R and b b b R a and b where a b R one writes a a a

p p

where a a b b R Then ab a b a b a b R a b R

all pro ducts are less than R We can use three multiplications rather than

four by utilizing the Karatsuba identity

p p

R is suciently small then the multiplications of integers b elow R can If

b e done by table lo okup using one of the identities

a b a b

ab

aa bb a b a b

ab

The use of a table of squares requires only two table lo okups p er

multiplication versus three when using triangular numbers Equation

p p

uses a table indexed from from R to R whereas

p

uses indices only from to R A test showed outp erforming

on a program where the most cache data would b e from this table

but b oth table lo okup metho ds did p o orly

Double If the double precision oating p oint data type can represent integers as

large as R exactly and if all arithmetic is exact when the inputs and

outputs are integers and outputs b elow this b ound then all arithmetic can

b e done using oating p oint and converted back afterwards On a machine

whose double precision mantissa has bits as in the IEEE standard this

allows R On a SUN an optimized Fortran implementation of

MODMULN using this subalgorithm p erformed almost as well as an assembly

subprogram using the direct subalgorithm and mulscc instructions

TwosComp This subalgorithm assumes that all integer addition subtraction

b

and multiplication is done mo dulo some with overow ignored where

b

R often b That is all integer arithmetic really op erates in

b

Z Z this is the same as twos complement arithmetic on bbit integers

This subalgorithm also requires that the oating p oint data type hold items

a few bits larger than required for R

Supp ose we want the upp er and lower halves of a a a a where a R

i

b

Compute t a a a a mo d mo d R using integer arithmetic with

overow ignored its remainder mo dulo R is in the lower bits The top half

a a a a t R of this result is known to b e an integer less than R it

can b e estimated using oating p oint arithmetic whose mantissa is suciently

wide with the result rounded to the nearest integer

The twoscomp subalgorithm was originally developed for the Alliant which

can return a doublelength bit integer pro duct in scalar mo de but not

in vector mo de It is also the b estp erforming nonassembly subalgo

rithm on the DEC and IBM RS whose integer and oating p oint

calculations of a a a a can pro ceed concurrently in separate functional

units The RS compilation used the unsupp orted qxaghst of the

xlf Fortran compiler to suppress overow checking when converting oating

p oint to integer b oth here and while preparing Table Without this

compilation option halnt was the fastest subalgorithm on the RS

Arithmetic mo dulo small primes

The primes p selected in should b e large enough that only a few are

i

needed but small enough that arithmetic mo dulo the p can b e done easily In

i

order that enough p exist satisfying the congruence condition for primitive ro ots

i

their minimum size is ab out bits

A compilation option allows elements of Zp Zto b e represented as op erands in

i

the interval p p signed op erands or in p nonnegative op erands The

i i i

typical op eration is the comp osition of two simpler op erations replacing

b b mo d p followed by a b a b a b mo d p

i i i i i i i i i i i

given fa g fb g fp g Since the p rarely change we also p ermit auxiliary constants

i i i i

which dep end only on the p and can b e precomputed We want to implement

i

these op erations in a vectorizable fashion

If a and b are nonnegative ie in p then one can implement the

i i i

second op eration in by computing a p b and a b then adding

i i i i i

p to either result if negative If instead a and b are signed op erands then

i i i

one can rst adjust a and b to the interval p if they are negative followed

i i i

by outputting a p b and a b In each case the outputs follow the same

i i i i i

conventions as the inputs at a cost of ve vectorizable addssubtracts two of them

under conditional mask plus two vector compares and some loads and stores The

selection of signed or unsigned data dep ends primarily on the subalgorithm for

multiplication mo dulo p

i

Supp ose we want a pro duct b mo d p Analogous to the direct metho d in

i i i

Section one can compute the doublelength pro duct b and divide by p

i i i

if the machine and language supp ort doublelength multiplication and division

These op erations are most simply done on nonnegative op erands Division can b e

avoided if one uses a metho d analogous to MODMULN which is allowed to return

0 0

b mo d p where is a constant and p This variation nds c

i i i i i

0 0

such that b c p is a multiple of returning either b c p

i i i i i i i i

0

or b c p p

i i i i i

Subalgorithm halnt can also b e used here by breaking each input into two

pieces half as long and using an even exp onent It resembles MODMULN in

0

Figure with N replaced by p replaced by and R replaced by This

i

subalgorithm also prefers nonnegative op erands

Subalgorithm double nd approximate quotients q such that jq b p j

i i i i i

by computing b p with roundo error at most and rounding that result

i i i

to the nearest integer The outputs b p q can then b e computed exactly if

i i i i

the oating p oint mantissa is suciently wide eg p when using bit

i

mantissa This subalgorithm prefers signed data If the nearest integer function

Fortran NINT is slow but truncation towards zero is fast then approximate

p b

i i i

b

i i

p p

i i

Estimate the right side using oating p oint arithmetic Except when b

i i

the approximation is to o large in absolute value but its relative error is less than

p if oating precision is suciently large Up on converting the right side to an

i

integer while truncating towards zero the estimated quotients q are either correct

i

or one to o large and the computed remainders are in the interval p p as

i i

desired

Subalgorithm twoscomp also generalizes As in double estimate a quotient

q b p via oating p oint arithmetic Then compute the remainder b p q

i i i i i i i i

using integer arithmetic mo dulo a p ower of exceeding p We know that the

i

computed remainder has absolute value at most p so the computed result must

i

b e correct despite intermediate integer overow This subalgorithm prefers signed

data

0

mo d p for some rather than Some of these subalgorithms nd b

i i i

b mo d p One can comp ensate by scaling all inputs including primitive ro ots

i i i

0

by mo d p much as used by inputs to MODMULN Since we must divide by

i

0

mo d p after the convolutions mo dulo p the constant P p in should

i i i

not b e prescaled that multiply removes any bias

Finding remainders mo dulo p

i

The convolution algorithm in Section reduces all input p olynomial co ef

0

cients mo dulo p It may also need to scale these residues by mo d p if the

i i

multiplication subalgorithm of Section so requires Let a sample input b e

P

j

A a R as in Section

j

j

The direct algorithm b egins with the leading co ecient a which it reduces

mo dulo p Then it reduces a R a mo dulo p and so on

i i

0

j

To reduce the number of divisions we can precompute all R mo d p Now

i

the computation reduces to an inner pro duct

X

0 0

j

mo d p R mo d p a A

i i j

j

with each summand b ounded by R p Analogues of the multiplication

i

subalgorithms in Section handle this inner pro duct For direct and halnt it is

0

convenient to instead reduce A R mo dulo p getting a remainder in Rp

i i

0

j

and dividing the latter by R mo d p this is accomplished by replacing R by

i

0

j

R in and with an additional entry p er p in the precomputed table

i

Reconstructing remainder mo dulo N

The outputs h of a convolution are determined by an inner pro duct in

j

which one op erand is a residue mo dulo p and the other is a residue mo dulo N

i

the inner pro duct is calculated mo dulo N It resembles

K

X

a B mo d N h

i i

i

where a is a singleprecision integer and B N the integers B are pre

i i i

computed

P

j

b R mo d N b ecomes Writing B R

ij i

j

K K K

X X X X X

j j

a b mo d N R b R a a RB hR

i ij ij i i i

i j j i i

The inner sums are evaluated as tripleprecision baseR integers one p er j From

the last digit of this calculated pro duct we can determine a multiple of N which

when added to the pro duct gives a multiple of R That multiple is added while

doing carry propagation to get a result in radix R which is less than R

P

K

a times N Dividing this sum by R by eliminating its last digit gives a

i

i

remainder which is at most a small multiple of N at most O KN if the p and

i

R have the same magnitude the nal quotient and remainder h are determined

using the binary search algorithm on a table of small multiples of N followed by a

multipleprecision subtraction

Vectorized carry propagation

Multipleprecision addition and subtraction are much cheaper than multiplica

tion but are nonetheless used heavily by ECM They can b e done by starting with

the least signicant digit and working up p For example during mo dular

addition one may need to compute C A B N where A B and N are given

by and C has a like form Assuming that the radix R is suciently small

that R ts in a word one straightforward algorithm app ears in Figure

This algorithm do es not vectorize well Although the computations of a b n

i i i

can b e vectorized each carry p otentially dep ends on all earlier carries

Cmt Compute C A B N plus carry out

car r y

for i from to do

car r y car r y a b n

i i i

c car r y mo d R

i

car r y bcar r y Rc

end for

Figure Straightforward multipleprecision addition and subtraction

Bailey p vectorizes this op eration in almost all cases by assuming that

the carries propagate at most three places to the left When this assumption fails

Bailey resorts to scalar op erations

Full vectorization is p ossible if one can p erform integer arithmetic shifts adds

bitwise op erations on vector masks Initialize c a b n for i

i i i i

These satisfy R c R for all i Then apply the algorithm in Figure

i

to the fc g vector

i

Each iteration of the inner lo op in Figure do es carry propagation across a

vector of length l now ensuring that all but p ossibly the most signicant

lnow

are assumed element of the output vector are in range The values of fv g

i

i

stored in a vector register The rst two wheres in the inner lo op subtract R

from any elements which previously were R or larger while adding to their

P

i

left neighbors This do es not alter the numerical value of c R but avoids

i

i

digits larger than R Negative digits however may still remain All remaining

carries are or whereas previously they might have b een or

The lo cations where a carry in is are those where i the right neighbor is

negative or ii the right neighbor is zero and that neighbor has a carry in of

Dene m if v is p ositive m if v and m if v Form the

i i i i i i

P

lnow

i

binary sum m msk msk using the integer add instruction on

i

i

vector masks There are binary carries from all columns where m ie those

i

where v and from those with m ie v which receive a carry

i i i

in The algorithm sets bits in msk corresp onding to columns receiving a carry

in msk has bits corresp onding to carry outs These are precisely the lo cations

where the v need to b e adjusted for carry ins and carry outs in radix R

i

The algorithm in Figure was implemented in assembly language b ecause

the Alliant Fortran compiler do es not supp ort arithmetic op erations on vector

masks Caveat the bits in the Alliants vector masks are numbered with bit

b eing most signicant and bit b eing least signicant In order to get the bits in

Cmt Input digits of C assumed in R R

Cmt except c is allowed to b e R or R

Cmt On exit all but p ossibly c are in R

l ef t

while l ef t do

l now min l ef t Maximum vector length

Set v c for i l now

i lef ti

msk mask where v R and i l now

i

where msk do v v R

i i

where msk do v v

i i

Cmt Now all but p ossibly v and v are in R R

lnow

Cmt v may equal R

msk mask where v and i l now

i

msk mask where v and i l now

i

msk EORmsk msk msk msk EOR exclusive OR

Cmt msk identies which entries are to o small

where msk do v v

i i

where msk do v v R

i i

Set c v for i l now

lef ti i

l ef t l ef t l now

end while

Figure Vectorized carry propagation

the prop er order for computing msk the vector loads and stores used stride

rather than That is v really held c rather than c

i lef tlnow i lef ti

A timing run on the Alliant FX using one pro cessor showed the algorithm

in Figure to b e ab out faster than a partially vectorized implementation

of Figure on long vectors but worse on short vectors the crossover p oint is

ab out Using an if to bypass the wheres when a mask is identically zero

gives an additional improvement

CHAPTER

Results

Timing

Table gives timing information for various p olynomial op erations on an

IBM RS under AIX Computations were done mo dulo N d e

N d e N d e N d e using random op erands

These four values of N are probable primes of and digits The

convolution algorithm of Section used resp ectively and primes

just b elow for the convolutions mo dulo p All times are given in hundredths

i

of seconds though digits b eyond the two most signicant are probably noise

Table compares the corresp onding times on an Alliant FX for the

digit N once running on a single pro cessor and once running with ve MIMD

multipleinstruction multipledata pro cessors The timing runs were made dur

ing a p erio d of little other system activity As on the RS convolutions

mo dulo this N used small primes near

The Alliant times can b e contrasted with those in Table Silverman to ok

seconds to construct a p olynomial of degree from

its ro ots when N on an Alliant FX using four pro cessor Our estimated

time to construct such a p olynomial of degree is seconds

using one pro cessor and seconds using ve pro cessors According

to the Alliant architecture manual App endices F and G the FX and FX

b oth have cycle times of nanoseconds but the FX often requires fewer

cycles p er instruction For example converting a bit integer vector in memory

with stride to a double precision vector in a register vmoveld instruction takes

VL cycles on the FX but dV Le cycles on the FX where VL is the

vector length VL Some ma jor dierence in our implementations are

I vectorized over the primes p while Silverman assigned each p to a separate

i i

pro cessor

Silverman did not reduce the co ecients in mo dulo N is advance

instead constructing a result approximately N and reducing that mo dulo N

Table compares Step and Step times on a DEC using eleven or

twelve values of N ranging from to digits The IBM RS is ab out

faster than the DEC for this application

Op eration n Digits in N

Polynomial

pro duct

degrees

n and n

Polynomial

recipro cal

degree n

Construct

p olynomial

of degree n

from ro ots

Polynomial

GCD

degrees

n and n

Table Times for p olynomial op erations on RS seconds

Op eration n Times for digit N seconds

pro cessor pro cessors Sp eedup

Polynomial

pro duct

degrees

n and n

Polynomial

recipro cal

degree n

Construct

p olynomial

of degree n

from ro ots

Polynomial

GCD

degrees

n and n

Table Comparative Alliant FX times with one and ve pro cessors

Performance on RSA Factoring Challenge

RSA Data Security Inc announced the RSA Factoring Challenge in March

The comp etition features two lists of numbers which contestants attempt to

factor Prizes are awarded quarterly for complete though not for partial factor

izations

One list the RSA Challenge List has p otential keys for the RSA public key

cryptosystem This lists smallest entry has decimal digits its next has

digits then up to digits Presumably most or all entries in this

list have two prime factors of comparable sizes As of April thirteen months

after the contest b egan the rst two entries of this list had b een factored

digits by Arjen Lenstra and Mark Manasse digits by Arjen Lenstra each

using the Quadratic Sieve algorithm I did not attempt any entries on the RSA

Challenge list

The other list the Partition Challenge List has partition numbers pn the

number of partitions of n into integer summands without regard to order For

example p since

A table of pn values can b e computed using the generating function p

Y X

n

pnX

n

X

X

2

n n

n n n

X

n

To reduce the number of table entries the comp etition uses only prime values

of n This table starts at p the rst such value b eyond Partition

numbers app ear to have many small prime factors allowing them to b e attacked

protably by ECM Ab out of the original entries in the Partition Chal

lenge List were factored completely by some contestant during the rst week of

the comp etition and half had b een nished by the end of its second quarter

During Fall third quarter of the comp etition the upp er b ound of the

partition numbers b eing accepted by the comp etition was digits The numbers

pn for prime n had app eared in the lists for previous quarters but

eighteen new entries app eared as listed the rst column of Table I used a

DEC at UCLA luna to nd as many factors of these as I could b etween

Octob er and January After one day my old program found the

Easily found factors in Table This completely factored six of the eighteen

entries those whose Status column b egins with a P excluding p

Then I tried ECM with FFT on the twelve comp osite cofactors whose sizes ranges

from to decimal digits Factors found by the new algorithm are listed

separately each is given a lab el which identies its size and which is also used in

Table The Status column identies the number of decimal digits in the

cofactor with P for probable prime and C for comp osite

The FFT runs were done in three rounds with increasing limits as identied

by the nal three columns of Table Each round had multiple runs with the

same input data but dierent random number seed Each run used two curves

p er input number b oth chosen to have a torsion group ZZ ZZ over Q see

Chapter After running b oth curves with with Step limit B Step used

P X X for one curve and the Dickson p olynomial P X g X for the

other curve see Section the choice of which p olynomial to use rst was

decided pseudorandomly Any one prime factor might b e found during Step or

Step but would b e discovered only once p er run even if b oth group orders were

smo oth The search limits approximate those in Table when searching for a

digit digit or digit prime factor

Table summarizes how often each factor was found in each way After

the rst round of runs the two smallest factors found of and digits

were removed from the input numbers reducing the input from comp osite

numbers to but the larger factors were retained to see how often they would

b e rediscovered

Excluding the and digit factors the data shows ten rediscoveries using

P X g X nine using P X X and one in Step near The

exp erimental data suggest that the two choices for P X are approximately equally

eective and one should select the whichever is faster or easier to implement In

my implementation the overall Step time was ab out faster using P X X

As noted in Section that choice also parallelizes well since up to d doublings

can b e done at once while only k pro cessors can co op erate to evaluate the next

g N Q in Section

k j

Given a large random integer N the estimated number of prime factors of N

with n to n digits inclusive is

n

2

Z

n

2

n dp

lnln p lnn ln lnn ln ln

n 1

1

p ln p n

n 1

1

Since there are numbers in this study the actual number of prime factors in

this range should b e approximately Poisson distributed with mean and variance

ln n n

For the ranges and digits Table

compares the total number of factors found to the exp ected count The number

found is within one standard deviation of the exp ected count through digits

Subsequently the number found is more than one standard deviation to o small

with only one factor found despite an exp ected six from digits

pn Easily found factors Factors found by ECM with FFT Status

p C

pa

p C

pb

p p C

p C

p C

p C

p C

p P

p p C

p P

p p P

p P

p

p C

p

p C

p P

p P

p C

p P

Table Some factors of digit partition numbers

B

d

d

d d

Number of runs made

Step time curves hr hr hr

Step time P X X hr hr hr

Step time Dickson P X hr hr hr

Total CPU timerun hr hr hr

Maximum memory megabytes megabytes megabytes

p p mo d

Step

p NA NA

Dickson

X

Step

p NA NA

Dickson

X

Dickson Dickson

p

X

X X

Dickson

p

X

Dickson

pa

X

X

Step

pb

Dickson

X

p

Dickson

Table How often factors were found by ECM with FFT

Exp ected count

Found

Digits Factors

n

p

ln

n n found

n

Table Actual and exp ected numbers of prime factors by size

These runs used ab out hours according

to Table After removing known factors b elow digits eleven comp osite

cofactors remained with sizes averaging digits Approximately hours

were sp ent p er cofactor According to Table this is enough time to nd

factors up to ab out digits if optimal parameters are used The parameters for

the middle column of Table are close to the optimum but those in the latter

columns are much ab ove the optimum values Although nothing sp ectacular was

found the number of ndings is only slightly b elow what might b e exp ected in

terms of the eort exp ended

Additional ndings

The program was also run on some other parts of the partition table and on

the Fib onacci table Table lists some additional ndings

N B Factor found

F

F

S

F

S

p

S

p

p

p

p

p

p

p

p

S

Identies factors found during Step

Table Additional factors found

Bibliography

Milton Abramowitz and Irene A Stegun Handbook of mathematical functions

with formulas graphs and mathematical tables Dover Publications Inc New

York NY

Alfred V Aho John E Hop croft and Jerey D Ullman The design and

analysis of computer algorithms AddisonWesley Reading MA

Selim G Aki The design and analysis of parallel algorithms Prentice Hall

Englewoo d Clis NJ

Alliant Computer Systems Corp oration Littleton MA FXseries architec

ture manual April Part number C

A O L Atkin and F Morain Finding suitable curves for the el liptic curve

method of factorization Draft of January submitted to Mathematics

of Computation

David H Bailey The computation of to decimal digits using

Borweins quartically convergent algorithm Mathematics of Computation

no

I Borosh C J Moreno and H Porta El liptic curves over nite elds II

Mathematics of Computation no

Murray Bremner El liptic curves with torsion points and positive rank

AMS Abstracts no Abstract T

Richard P Brent Some algorithms using el liptic curves

Research Rep ort CMAR Centre for Mathematical Analysis The Aus

tralian National University GPO Box Canberra ACT Australia

September

David M Bressoud Factorization and primality testing SpringerVerlag New

York NY Undergraduate Texts in Mathematics

John Brillhart D H Lehmer J L Selfridge Bryant Tuckerman and S S

n

Wagsta Jr Factorization of b b up to high

powers nd ed Contemporary Mathematics vol American Mathematical

So ciety Providence RI

John Brillhart Peter L Montgomery and Rob ert D Silverman Tables of

Fibonacci and Lucas factorizations Mathematics of Computation

no SS

J P Buhler H W Lenstra Jr and Carl Pomerance Factoring integers with

the number eld sieve Version

k

Richard Crandall Implementation of the n method Electronic mail message

January

Noam D Elkies On A B C D Mathematics of Computation

no

Dale Husemoller El liptic curves Graduate Texts in Mathematics vol

SpringerVerlag New York

Donald E Knuth Seminumerical algorithms nd ed The Art of Computer

Programming vol AddisonWesley Reading MA

Daniel Sion Kub ert Universal bounds on the torsion of el liptic curves Pro c

London Math So c Third series

Serge Lang Fundamentals of Diophantine geometry SpringerVerlag New

York

Algebra second ed AddisonWesley Menlo Park CA

A K Lenstra H W Lenstra Jr M S Manasse and J M Pollard The

number eld sieve Pro ceedings of the Twenty Second Annual ACM Sym

p osium on Theory of Computing Baltimore May New York

ACM pp

The factorization of the ninth Fermat number February

Arjen K Lenstra and Mark S Manasse Factoring by electronic mail Ad

vances in Cryptology EUROCRYPT JJ Quisquater and J Vandewalle

eds Lecture Notes in Computer Science vol SpringerVerlag

pp

H W Lenstra Jr Factoring integers with el liptic curves Annals of Mathe

matics no Second series

Rudolf Lidl and Harald Niederreiter Finite elds Encyclop edia of Mathe

matics and its Applications vol AddisonWesley Reading MA

Anna Lubiw and Andras Racz A lower bound for the integer element dis

tinctness problem Information and Computation no

R T Mo enck Fast computation of GCDs Pro ceedings Fifth Annual ACM

Symp osium on Theory of Computing Austin TX pp

Peter L Montgomery Modular multiplication without trial division Mathe

matics of Computation no

Speeding the Pol lard and el liptic curve methods of factorization

Mathematics of Computation no

Design of an FFT continuation to the ECM method of factorization

AMS Abstracts no Abstract

Evaluating recurrences of form x f x x x via Lucas

mn m n mn

chains To b e submitted to Fib onacci Quarterly January

Peter L Montgomery and Rob ert D Silverman An FFT extension to the

P factoring algorithm Mathematics of Computation no

Francois Morain Courbes el liptiques et tests de primalite PhD thesis

LUniversiteClaude Bernard Lyon I September Introduction in French

b o dy in English

Carl Pomerance The quadratic sieve factoring algorithm Advances in Cryp

tology Pro ceedings of EUROCRYPT New York T Beth N Cot and

I Ingemarsson eds Lecture Notes in Computer Science vol Springer

Verlag pp

R L Rivest A Shamir and L Adleman A method for obtaining digital

signatures and publickey cryptosystems CACM no

RSA Challenge Administrator Information about RSA Factoring Chal lenge

March Send electronic mail to challengeinfocom

J T Schwartz Fast probabilistic algorithms for verication of polynomial

identities JACM no

Joseph H Silverman The arithmetic of el liptic curves Graduate Texts in

Mathematics vol SpringerVerlag New York

Rob ert D Silverman The multiple polynomial quadratic sieve Mathematics

of Computation no