Elliptic Curve Point Multiplication by Generalized Mersenne Numbers

Tao Wu and Li-Tian Liu

Abstract—Montgomery modular multiplication in the residue number system (RNS) can be applied for elliptic curve cryptography. In this work, unified modular multipliers over generalized Mersenne numbers are proposed for RNS Montgomery modular multiplication, which enables efficient elliptic curve point multiplication (ECPM). Meanwhile, the elliptic curve arithmetic with ECPM is performed by mixed coordinates and adjusted for hardware implementation. In addition, the conversion between RNS and the binary number system is also discussed. Compared with the results in the literature, our hardware architecture for ECPM demonstrates high performance. A 256-bit ECPM in Xilinx XC2VP100 field programmable gate array device (FPGA) can be performed in 1.44 ms, costing 22147 slices, 45 dedicated multipliers, and 8.25K bits of random access memories (RAMs).

Index Terms—Elliptic curve cryptography, generalized Mersenne numbers, modular multiplier, residue number system.

1. Introduction

Public key cryptography plays a significant role in information security that it provides convenient digital signatures and exchange of private keys[1]. Among kinds of public key cryptography schemes, RSA cryptography[2], Diffie-Hellman key exchange protocol[3], and elliptic curve cryptography (ECC)[4] are the most frequently used. Especially, ECC utilizes much shorter keys to reach the same security level as that of RSA cryptography. For example, the ECC with a 256-bit key provides the same security level as that of an RSA cryptosystem with a 3072-bit key[5].

In the above cryptosystems, the fundamental operation refers to multi-precision modular arithmetic. In RSA cryptography and the Diffie-Hellman key exchange protocol, modular exponentiation is the dominant operation in the whole computation. Within ECC the main computation is also modular multiplication in finite field. For instance, in the ECC digital signature protocol, there is 1 elliptic curve point multiplication (ECPM) to generate an ECC digital signature, and 2 ECPMs to authenticate a digital signature. Meanwhile, 1 ECPM include about 10N to 30N modular multiplication, where N is the field bit length.

As long as ECPM over prime fields is concerned, several designs have been implemented by setting up a long-precision modular multiplier. With the large modular multiplier, all modular multiplication in ECC are then processed sequentially. In order to decrease the complexity in arithmetic and accelerate the computation, architectures with superior parallelism are effective and welcome. A large Montgomery modular multiplier has been developed with a quotient pipeline architecture[6], and has also been devised by a systolic array architecture[7].

Meanwhile, in a residue number system (RNS)[8][9] multiprecision multiplication and addition are partitioned into a number of single-precision operations. Such architectures rely on the RNS Montgomery modular multiplication algorithm[10][17]. A Cox-Row architecture is built up[12] to perform modular exponentiation in RSA cryptography, and is applied for ECPM[13]. A very large scale integration circuit (VLSI) architecture for RNS Montgomery modular multiplication is also proposed[18], in which the conversion between the binary number system and RNS is emphasized and improved. Implementation over other platforms like microprocessors and graphic processing units (GPUs) have also been reported[19][20].

In this paper, based on RNS Montgomery modular multiplication and generalized Mersenne numbers, an optimized ECPM architecture is proposed for hardware implementation. Firstly, a unified modular multiplier over generalized Mersenne numbers is devised, which performs modular multiplication with respect to two conjugated moduli. A group of such modular multipliers then cover two groups of RNS moduli, so it can be applied to RNS Montgomery modular multiplication. Besides, the binary-to-RNS and RNS-to-binary conversions can also be
carried out by configuring the RNS Montgomery modular multiplier in different modes. Together with lots of precomputation for RNS Montgomery modular multiplication and mixed coordinates for ECPM, high-performance ECPM can be carried out by efficient modular arithmetic over generalized Mersenne numbers.

The remaining parts of this paper are organized as follows. Section 2 introduces the background for the RNS Montgomery modular multiplication algorithm and elliptic curve arithmetic with ECC. Then a unified modular multiplier is proposed for RNS Montgomery modular multiplication in Section 3, in which the residue arithmetic by generalized Mersenne numbers is developed. Section 4 discusses ECPM based on such RNS arithmetic, where the RNS-to-binary conversion and its inverse operation are also considered. In Section 5, we give the results with hardware implementation of ECPM over prime field. The last Section concludes this paper.

2. Background for RNS and ECC

2.1 Residue Number System

A set of pairwise prime numbers define a residue number system, which is called RNS moduli. Suppose RNS \( \Omega \) is defined by the moduli \( \{m_1, m_2, \ldots, m_n\} \), then the dynamic range of \( \Omega \) is \( M=\prod_{i=1}^{n} m_i \). It is well-known that RNS can be used to accelerate long-precision addition, subtraction, and multiplication. In fact, long-precision modular addition, modular subtraction, and modular multiplication can also be performed in RNS.

Assume \( A \) and \( B \) are two integers represented in RNS \( \Omega \), i.e., \( A = (a_1, a_2, \ldots, a_n) \) and \( B = (b_1, b_2, \ldots, b_n) \), while the moduli is \( \{m_1, m_2, \ldots, m_n\} \) and the RNS dynamic range is \( M \). Suppose we have to compute the modular addition \( C = A + B \) (mod \( P \)) in \( \Omega \), then a pure addition \( C = A + B \) is performed instead of \( C \) as long as \( A + B < M \).

However, the RNS modular subtraction and RNS modular multiplication need more considerations, which are respectively discussed below and in Algorithm 1. Usually, we perform finite arithmetic in the domain of positive integers, and in order to avoid overflow a correction step should follow the RNS modular subtraction, which just adds the modulus. For instance, when we compute \((A-B)\) mod \( P \) in RNS with \( 0 \leq A < P \) and \( 0 \leq B < P \), there are two possibilities about the result:

1) \((a_i - b_i) \) mod \( m_i \), for \( 1 \leq i \leq n \);
2) \((a_i - b_i + p_i) \) mod \( m_i \), for \( 1 \leq i \leq n \).

The first case refers to \( A \geq B \), while the second case is with \( A < B \). Since we have to ensure that the result of a RNS subtraction is still positive, RNS tuple with the modulus should be added to the RNS result \( [10] \). In general, if \( A \leq D \), and \( B \geq E \), then in RNS arithmetic one should replace

\[
\alpha = \text{an integer between } [0, n-1].
\]

Denote \( \sigma_i = \left|\frac{a_i}{m_i} M_i^{-1}\right| \mod M \), then

\[
R = \sum_{i=1}^{n} M_i \sigma_i - \alpha M \tag{3}
\]

which can be rewritten as \([11] \)

\[
\sum_{i=1}^{n} \sigma_i (m_i) = \alpha + R/M \tag{4}
\]

Because 

\[
0 \leq R = M < 1,
\]

there is

\[
\alpha \leq \sum_{i=1}^{n} \left( \sigma_i/m_i \right) < \alpha + 1.
\]

i.e.,

\[
\alpha = \left[ \sum_{i=1}^{n} \left( \sigma_i/m_i \right) \right].
\]

As long as the integer \( \alpha \) is estimated from (5), \( R \) can be obtained from (2) within the range \([0, 2^{L}]\). Kawamura’s method replaces \( \sigma_i/m_i \) by \( \sigma_i/2^L \) and truncates certain digits \([11, 12]\).

For example, let \( \{m_i\} = \{2, 3, 5\} \), then \( M = 2 \times 3 \times 5 = 30 \), \( < M > = (15, 10, 6) \), \( < M_i > = m_i \). Suppose \( x = 7 = (1, 1, 2) \), then \( < \sigma_1 > = (1, 1, 2) \), \( \alpha = \left[ \frac{1}{2} + \frac{1}{3} + \frac{2}{5} \right] = 1 \), thus from (3) we have \( x = 1 \times 15 + 1 \times 10 + 2 \times 6 - 1 \times 30 = 7 \), which is just the original number.

2.2 Improved Chinese Remainder Theorem

The Chinese remainder theorem is a mathematical connection between the binary expression and the RNS form of a number. Suppose \( R = \sum_{i=1}^{n} r_i \) in \( \Omega \); \( \{m_1, m_2, \ldots, m_n\} \), with \( M = \prod m_i \), and \( M_i = M/m_i \), for \( i = 1, 2, \ldots, n \). Then the Chinese remainder theorem declares that

\[
R = \sum_{i=1}^{n} M_i \left[ \frac{r_i}{M_i} \right] \mod M, \tag{1}
\]

The improved Chinese remainder theorem replaces the last modular reduction by subtraction \([21]\), i.e.,

\[
R = \sum_{i=1}^{n} M_i \left[ \frac{r_i}{M_i} \right] - \alpha M \tag{2}
\]
\((a_0 - b_0) \mod m_0\) by \((a_0 - b_0 + ep_i) \mod m_i\).

2.3 RNS Montgomery Modular Multiplication

**Algorithm 1.** RNS Montgomery modular multiplication

Input: RNS \(\Omega\) is defined by moduli \(\{m_1, m_2, \ldots, m_n\}\), \(M = \prod m_i, M_i = m_i\), for \(i = 1, 2, \ldots, n\). RNS \(\Gamma\) is defined by moduli \(\{m_{n+1}, m_{n+2}, \ldots, m_{2n}\}\), \(N = \prod m_{n+i}, N_i = m_{n+i}\), for \(i = 1, 2, \ldots, n\). Meanwhile, \(X = (x_1, x_2, \ldots, x_{n}, \ldots, x_{2n})\), \(Y = (y_1, y_2, \ldots, y_{n}, \ldots, y_{2n})\), \(P = (p_1, p_2, \ldots, p_n, \ldots, p_{2n})\). \(X\cdot Y < P\cdot M, X < M, M < N\).

Precomputation: for \(i = 1, 2, \ldots, n, j = n + 1, n + 2, \ldots, 2n:\)

\(< p_j >= P \mod m_j >, < r_j >= M^{-1} \mod m_j >, < x_j >= M^{-1} \mod m_j >, < \omega_j >= N_j \mod m_j >, < \zeta_j >= N_j^{-1} \mod m_j >, < \phi_j >= N_j^{-1} \mod m_j >, < \theta_j >= N_j \mod m_j >, < \lambda_j >= N \mod m_j >, < M_i^{(j)} >= M_i \mod m_j >, < N_i^{(j)} >= N_i \mod m_j >, < \theta_i >= N_i \mod m_j >, < \eta_i >= < -p_i^{-1} >, < \psi_i >= < -p_i^{-1} >, < \tau_i >= < -p_i^{-1} >, < \xi_i > = < u_j > >, < \nu_i > = < u_j > >, < \nu_i > = < u_j > >, < \nu_i > = < u_j > >, < \nu_i > = < u_j > >, < \nu_i > = < u_j > >, where the tickets \(< >\) denotes a set of modular operations.

Output: \(R <= r_j = (r_1, r_2, \ldots, r_n) \rightarrow r_j = (r_{n+1}, r_{n+2}, \ldots, r_{2n}) = X \cdot Y \cdot M^{-1} \mod P, \) with \(0 \leq R < 3P\).

1: \(< z_j = x_j >, < y_j > >;\)
2: \(< z_j = x_j >, < y_j > >;\)
3: \(< \sigma_j = z_j >, < \theta_j > >;\)
4: \(< u_j = z_j >, < \psi_j > >;\)
5: \(\alpha = \frac{\sum_{i=1}^{2n} \text{trunc}(\sigma_j)}{2^\ell};\)
6: \(< \xi_j >= u_j >;\)
7: \(\text{for } i = 1 \text{ to } n:\)
8: \(< \xi_j >= u_j > + \sigma_i ;\)
9: \(\text{end for};\)
10: \(< v_j = a < \phi_j > ;\)
11: \(< \xi_j >= v_j > - < v_j > ;\)
12: \(< r_j >= 0 >;\)
13: \(\beta = \frac{\sum_{i=1}^{2n} \text{trunc}(\xi_i)}{2^\ell} + 0.5 ;\)
14: \(\text{for } i = 1 \text{ to } n:\)
15: \(< r_j >= < r_j > + < \xi_j > + N_i^{(j)} ;\)
16: \(\text{end for};\)
17: \(< r_j >= < r_j > - \beta < \lambda_i > ;\)
18: \(< r_j >= < \xi_j > + \omega_j >;\)
19: \(\text{return } \{ < r_j >, < r_j > \}.\)

In Algorithm 1, the main computation is two base conversions between RNS \(\Omega\) and \(\Gamma\). The colon before “\(=\)” emphasizes that it is an assignment to the original variable, \(\text{trunc}(x)\) sets the lowest \((L-\ell)\) bits of “\(x\)” as zeros and “\(\cdot\)” means modular multiplication. The value of \(l\) can be determined by RNS moduli\(^{[1]}\).

According to Fermat’s little theorem, if \(p\) is a prime number and \(\text{GCD}(a, p) = 1\), then \(a^{-1} = a^{p-2} \mod p\). Therefore, RNS modular inverse over prime fields can be calculated by a series of RNS modular multiplication.

2.4 ECPM

An elliptic curve over the prime field \(F_p\) can be defined by the elliptic curve equation\(^{[5]}\):

\[y^2 = x^3 + ax + b\]

where \(4a^3 + 27b^2 \neq 0 \mod p\), and \(p\) is a prime number with \(p > 3\). In the elliptic curve field, point addition, point doubling, and point multiplication are basic operations.

In fact, an ECPM can be separated into a series of elliptic curve point addition and point doubling. Set \(Q_0 = (x_0, y_0)\) as a starting point on the elliptic curve, and choose \(k\) as a binary number with \(k = (k_{i-1}, \ldots, k_{0})_2\). As it is shown in Algorithm 2, the ECPM \(kQ_0\) can be performed by the binary left-to-right method\(^{[5]}\).

**Algorithm 2.** ECPM by point addition and point doubling

Input: \(Q_0\) is a point in an elliptic curve over \(F_p\), and \(k = (k_{i-1}, \ldots, k_{0})_2\) is a binary number.

Output: \(S_{i+1} = kQ_0\).

1: \(S_0 := Q_0 ;\)
2: \(\text{for } i = 1 \text{ to } l-1:\)
3: \(S_i := 2S_{i-1} ;\)
4: \(\text{if } k_{i-1} = 1 \text{ then}\)
5: \(S_i := S_i + Q_0 ;\)
6: \(\text{end if}\)
7: \(\text{end for}\)
8: \(\text{return } S_{i+1} .\)

In the above algorithm, the addition “\(+\)” denotes point addition, while the multiplication of 2 denotes point doubling.

2.5 Elliptic Curve Point Doubling and Point Addition

Assuming \(Q_0\) is the starting point with Jacobian coordinate \((X_0, Y_0, Z_0)\), then its affine coordinate reads\(^{[5]}\) \(x_0Y_0Z_0^2\) and \(y_0 = Y_0/Z_0^3\). Suppose \(Q_1\) is the doubling point of \(Q_0\), and its Jacobian coordinate is \((X_1, Y_1, Z_1)\). Substituting \(x_0, y_0\) by \(X_0, Y_0, Z_0\), there are

\[
\begin{align*}
X_1 &= (3X_0^2 + aZ_0^2)Y_0^2 - 8X_0Y_0^3; \\
Y_1 &= (3X_0^2 + aZ_0^2)(4X_0Y_0^2 - X_1) - 8Y_0^4; \\
Z_1 &= 2Y_0Z_0. 
\end{align*}
\]

In the above equations, the parameter \(a\) is just the
coefficients within the elliptic curve equation:  

\[ y^2 = x^3 + ax + b. \]

For point addition of two points \( P_0 \) and \( P_1 \), mixed coordinates can be applied[9]. One point \( P_0 \) keeps the affine coordinate \((x_0, y_0)\), and the other point is denoted by the Jacobian coordinate \((X_1, Y_1, Z_1)\). Suppose \( P_2 = P_0 + P_1 \), and it has the Jacobian coordinate \((X_2, Y_2, Z_2)\), then \( P_2 \) is:

\[
\begin{align*}
X_2 &= \left( y_0 Z_1^2 - Y_1 \right) (X_1 + x_0 Z_1^2) \quad - \quad (y_0 Z_1^2 - X_1) Y_1 (X_1 + x_0 Z_1^2) \\
Y_2 &= \left( y_0 Z_1^2 - Y_1 \right) \left( X_1 (y_0 Z_1^2 - X_1)^2 - X_2 \right) \\
Z_2 &= (x_0 Z_1^2 - X_1) Z_1.
\end{align*}
\]  

(8)

3. Proposed Modular Multiplier

Generalized Mersenne numbers are widely known to facilitate modular arithmetic[22]-[24], and in this section unified modular arithmetic over generalized Mersenne numbers \( 2^k - 2^i \pm 1 \) is developed. The main idea is to find out the most common parts between two modular multipliers respectively over \( 2^k - 2^i - 1 \) and \( 2^k - 2^i + 1 \), which results in a area-efficient modular multiplier architecture for hardware implementation.

3.1 New Modular Arithmetic over Generalized Mersenne Numbers

Suppose that \( A \) and \( B \) are two binary numbers, with \( T = A \times B = 2^L \times T_{h} + T_{l} \). Here \( T_{h} \) and \( T_{l} \) are the higher and lower halves of \( T \). In this work, we will process the two parts of \( T \) in parallel, in order to avoid the final full addition before modular reduction. As a result, a carry out appears from the lower part, which is denoted as \( c_f \) and processed in together. With \( T = T_{h} + T_{l} + c_f \cdot 2^l \), there is

\[
T = T_{h} \cdot 2^k + T_{l} + c_f \cdot 2^l.
\]  

(9)

As it is shown in Fig. 1, a 54-bit multiplier can be achieved by 18-bit multipliers, adders, and carry save logic. Similary, other multi-precision multiplications can also be performed by single-precision multiplications and separated into high and low parts.

Although the generalized Mersenne numbers can be divided into two kinds: \( m = 2^k - 2^i - 1 \) and \( m = 2^k - 2^i + 1 \). At first, we can separetely simplify the modular arithmetic with each case.

A. \( m = 2^k - 2^i - 1 \)

Set \( T_{h1} = (t_{L-1} \cdots t_{L-i} t_{L-1})_2 < 2^k \), \( T_{h2} = (t_{L-1} \cdots t_{L-1})_2 < 2^k \), then \( T_h = 2^L - T_{h1} + T_{h2} \), and

\[
C = T \mod m = (2^k + 1)T_h + T_l \cdot c_f \cdot (2^k + 1) =
\]

\[
(2^k + 2^L - T_{h1} + T_{h2}) + T_l + c_f \cdot (2^k + 1) =
\]

\[
(2^k + 2^L - 2^k - 1)T_{h1} + (2^k + 1)T_{h2} + T_l + c_f \cdot (2^k + 1).
\]

(10)

Let

\[ C' = (2^k + 2^L - 1)T_{h1} + (2^k + 1)T_{h2} + T_l + c_f \cdot (2^k + 1) \]  

(11)

then \( C' = C \mod m \).

In (11), \( c_f \cdot (2^k + 1) \) is a \((k+1)\)-bit number. With \( L-k \rightarrow k+1 \), it can be combined with \( 2^L - T_{h1} \), i.e., \( 2^L - T_{h1} + c_f \cdot (2^k + 1) = (t_{L-1} \cdots t_{L-1})_2 \) \((L-k)\) \medtext{bits} \end{sageblock}

Therefore, only 6 partial products need to be accumulated, all of which are below \( 2^k \). For example, \( 2^L T_{h1} < 2^L \), \( 2^L \mod 2^L < 2^L \). In the accumulation, before final carry ripple addition, we can employ three carry save addition to reduce the six parts to two \((L+2)\)-bit unsigned numbers.

Next, the bound of \( C' \) should be determined by the modulus. It is self-evident that \( C' \geq 0 \). Meanwhile,

\[
C' \leq (2^k + 2^L - 1) + (2^k - 1) + (2^k + 1) + 2^k - 1 + 2^k + 1 =
\]

\[
3(2^L - 2^k - 1) + 2^k + 2^L + 1 <
\]

\[
3m + 2^k + 2^L + 1 <
\]

\[
3m + 2^k + 2^L + 1.
\]

With \( m = 2^k - 1 \), \( 2^k - 2^k = \sum_{i=0}^{L-1} 2^i = \sum_{i=0}^{L-1} 2^i \), \( k < L/2 - 1 \), i.e., \( L > 2k + 2 \), we have

- If \( k = 1 \), then \( L > 2 \) \medtext{and} \end{sageblock}

Thus there is still \( m > 2^k + 2^L + 1 \)

Finally, there is always \( 0 \leq C' < 4m \).

B. \( m = 2^k - 2^i + 1 \)

Set \( T_{h1} = (t_{L-1} \cdots t_{L-i} t_{L-1})_2 < 2^k \) and \( T_{h2} = (t_{L-1} \cdots \)
\[ t_{b_0} < 2^{L-2}, \text{ then } T_b = 2^{L-2}T_{a_1} + T_{b_2}. \] Thus,

\[ C = T \mod m = (2^{L-2} - 1)T_{a_1} + T_b + c_{j'} \cdot 2^{L} = (2^{L-2} - 1)(2^{L-2}T_{a_1} + T_{a_2}) + T_b + c_{j'} \cdot 2^{L-1} = (2^{L-2} - 2^{L-2} + 1)T_{a_1} + (2^{L-2} - 1)T_{a_2} + T_b + c_{j'} \cdot 2^{L-1} = (2^{L-2} - 2^{L-2} + 1)T_{a_1} + (2^{L-2} - 1)T_{a_2} + T_b + c_{j'} \cdot 2^{L-1} \] (mod m).

Also, set

\[ C^* = (2^{L-2} - 1)T_{a_1} + (2^{L-2} - 1)T_{a_2} + T_b + c_{j'} \cdot 2^{L-1} \] (14)

then \( C^* = C \mod m \).

In (14), the partial product \( c_{j'}(2^{L-2} - 1) \) has only \( k \) bits, and can be incorporated into \( 2^{L-2}T_{a_2} \) as

\[ 2^{L-2}T_{a_2} + c_{j'}(2^{L-2} - 1) = (t_{L_1 - 4 \cdot 1} \cdots t_{L_0} c_{j'} \cdots c_{j'}) t_2. \]

Thus, there are also 6 partial products in accumulation, and carry save addition can be employed to compress it to 2 parts. The upper bound of \( C^* \) can be determined as

\[ C^* = T_b + (2^{L-2} - 1)T_{a_1} + c_{j'}(2^{L-2} - 1) - (2^{L-2} - 1)T_{a_1} \leq 2^{L-2} - 1 + (2^{L-2} - 1)2^{L-2} - 2^{L-2} - 1 = m + m \cdot 2^{L-2} - 2^{L-2} - 2 = 2m. \]

As \( 0 \leq T_{a_1} \leq 2^{L-1} - 1 \), the lower bound yields

\[ C^* \geq -(2^{L-2} - 2^{L-2} + 1)T_{a_1} \geq -(2^{L-2} - 2^{L-2} + 1)2^{L-2} = (2^{L-2} - 2^{L-2} + 1)2^{L-1} \geq 2^{L-2} + 2^{L-1} < 2^{L-2} + 2^{L} > m. \] (15)

Finally, we have \(-m < C^* < 2m\). Therefore, the six parts with \( C^* \) can be accumulated as six (\( L+2 \)-bit) signed numbers.

3.2 Unified Modular Multiplier over a Pair of Generalized Mersenne Numbers

The unified modular multiplier here refers to a modular multiplier that works for two conjugated moduli \( m = 2^{L-2} - 2^{L-1} \) or \( m = 2^{L-2} + 1 \).

Let us define a sign function at first, i.e.,

\[ \text{sign} = \begin{cases} +1, & \text{if } m = 2^{L-2} - 2^{L-1} \\ -1, & \text{if } m = 2^{L-2} + 2^{L-1} \end{cases}. \] (17)

Then, set \( S_b = 2^{L}T_{a_1}, S_2 = \text{sign}2^{L-2}T_{a_1}, S_3 = \text{sign}T_{a_1}, S_4 = 2^{L}T_{a_2}, S_5 = \text{sign}T_{a_2}, S_6 = T_b, \text{ and } S = \sum_{i=1}^{6} S_i. \)

These parts are compressed into two (\( L+2 \)-bit) numbers by carry save addition, e.g., \( C_{a_2}, S_a \), with \( C_{m} + S_m = S \).

According to the analysis, \( S \) is also an \( (L+2) \)-bit number. Assume that the redundant result has been obtained, then up to three times of addition/subtraction is enough to reduce it down to \([0, m]\) by Algorithm 3.

**Algorithm 3.** Modular reduction in unified modular multiplier

Input: \( C_m \) and \( S_m \) are both \((L+2)\)-bit numbers, with \( C_m + S_m = T \mod m \) and \( m = 2^{L} - (2^{L} + 1) \).

Output: \( S = T \mod m, 0 \leq S < m \).

1: \( X_0 := (C_m + S_m) \mod 2^{L+2} \); 2: \( X_1 := X_0 - m \); 3: if \( \text{sign} = +1 \) then \( X_2 := C_m + S_m - 2m \); 4: else \( X_2 := C_m + S_m + m \); 5: end if 6: \( X_3 := C_m + S_m - 3m \); 7: if \( \text{sign} = +1 \) then 8: \( \text{if } X_{2, L+1} = 1 \) then 9: \( \text{if } X_{1, L+1} = 1 \) then \( S := X_{0, L+1} - 0 \); 10: else \( S := X_{1, L+1} - 0 \); 11: end if 12: else 13: \( \text{if } X_{1, L+1} = 1 \) then \( S := X_{2, L+1} - 0 \); 14: else \( S := X_{3, L+1} - 0 \); 15: end if 16: end if 17: else \( // \text{sign} = -1 \); 18: \( \text{if } X_{0, L+1} = 1 \) then \( S := X_{2, L+1} - 0 \); 19: else if \( X_{1, L+1} = 1 \) then \( S := X_{0, L+1} - 0 \); 20: else \( S := X_{1, L+1} - 0 \); 21: end if 22: end if 23: return \( S \).

In the above algorithm, in Line 1 if \( \text{sign} = +1 \), then \( X_0 \in [0, 4m] \); else \( X_0 \in [-m, 2m] \). Line 2 to Line 6 can be computed in parallel with carry save addition. In detail, \( C_m \) and \( S_m \) produce a "generate" vector and a "propagate" vector, which then take part in the carry save addition at the same time. Also, \( X_{1, L+1} (i=1, 2, 3) \) means the lowest \( L \) bits of \( X_m \) and \( X_{1, L+1} \) means the \((L+1)\)th bit of \( X \).

The hardware architecture of a unified modular multiplier is shown in Fig. 2, consisting of two stages. The first stage performs multiplication or concatenation of \( A \) and \( B \), while the second stage performs modular reduction of Algorithm 3. In Fig. 2 the REG denotes registers for buffering. Comb means the combination of six parts according to different inputs, CSA is carry save addition, and MUX is the final multiplexer.

Although the architecture supports modular arithmetic over two conjugated and generalized Mersenne numbers, it only deals with one modulus at one time. Meanwhile, as shown in Algorithm 1, the computational steps in an RNS Montgomery modular multiplication take place in sequence from one RNS to another RNS, for which this modular multiplier can be applied.
Fig. 2. Unified modular multiplier.

Fig. 3. ECPM based on RNS Montgomery modular multiplier.

4. ECPM by RNS Arithmetic

By RNS Montgomery modular multiplication, ECC can be performed through RNS arithmetic. Obviously, there are two levels of operations: 1) the modulo \( p \) operation at a high level and 2) the modulo \( m_i \) operation at a low level. The integer \( p \) is the characteristic of the elliptic curve field, while \( \{ m_1, m_2, \ldots, m_n \} \) are RNS moduli. In fact, the implementation of ECC in RNS is just to replace modular arithmetic over \( \mathbb{Z} \) with modular arithmetic over \( \{ m_1, m_2, \ldots, m_n \} \). It may be doubted that there is a conflict between the RNS Montgomery algorithm and conversion of \( \bar{X} = X \cdot M \cdot M^{-1} \mod p \), because \( M \) is represented as zero in RNS. Such a problem can be avoided by setting \( \bar{X} = X \cdot (M \mod p) \cdot M^{-1} \mod p \).

An ECPM based on RNS Montgomery modular multiplication is illustrated in Fig. 3. In addition to the RNS Montgomery modular multiplier, two groups of modular adders/subtractors are employed to deal with other operations in ECPM. The conversion from the binary number system to RNS and the inverse operation are considered in Section 4.1.

In particular, we have applied the left-to-right signed-digit coding technique\(^{[25]}\) in Algorithm 2.

4.1 Conversion between Binary Number System and RNS

In this work, to perform binary-to-RNS conversion, e.g., \( A \mod m_i \) for \( i=1, 2, \ldots, n \), a serial modular reduction method is developed, as shown in Algorithm 4. It requires \((n-1)\) sequential steps to reduce an \( n \)-word integer.

Algorithm 4. Serial modular reduction

Input: \( A = (A^{(n-1)} A^{(n-2)} \cdots A^{(0)})_2, 0 \leq A^{(0)} < 2^n, 0 \leq m_i < 2^n \).

Output: \( S_{n-1} = A \mod m_i \).

1: \( S_0 := (A^{(n-1)} A^{(n-2)} \cdots A^{(0)})_2 \mod m_i ; \)
2: for \( i=1 \) to \( n-2 \)
3: \( S_i = [2^{i} \cdot S_{i-1} + A^{(n-2-i)}] \mod m_i ; \)
4: end for
5: return \( S_{n-2} \).

By concatenating \( S_i \) to \( A^{(n-2-i)} \) in Fig. 2, the above algorithm can be easily implemented, no extra multipliers or memories are required.

The RNS-to-binary conversion is completed by the improved Chinese remainder theorem and \( \text{Guillermin}^{[12]} \). Suppose that the final result \( R \) is represented by an \( n \)-tuple \((r_1, r_2, \ldots, r_n)\), and it reads \((D_{n-1} \cdots D_1 D_0)\) in the binary number system, where \( 0 \leq D \leq 2^L \). Also, set \( R_0 = R \) and precompute \( <\gamma_j> = 2^{-j} \mod <m_j> \).

It is obvious that \( D = r_i \mod 2^L \), and the following process can be done by similar steps. Assume \( D_{j-1} \) to \( D_0 \) have been obtained, and \( R_0 \) has become \( R_{j-1} = (r_{j-1}^{(i-1)}, r_{j-1}^{(j-1)}, \ldots, r_{j-1}^{(i-1)}) \), then

\[
R_i := <r_i^{(j)}> = (<r_i^{(j-1)} - D_{j-1} > \cdot <\gamma_j> \mod <m_j> >
\]

where \( j=1, 2, \ldots, n \). Obviously, \( R_i / M \cdot 2^L \leq 1 \). Then, by the improved Chinese remainder theorem we have

\[
R_i = \sum_{j=1}^{n} M_j \left[ r_i^{(j)} M_j^{-1} \right]_{m_j} - \alpha^{(j)} M
\]

where \( \alpha^{(j)} = \left[ \sum_{j=1}^{n} \frac{1}{2^i} \text{trunc} \left( r_i^{(j)} M_j^{-1} \right)_{m_j} + 0.5 \right] \). Thus, with the precomputation of \( M_j^{-1} \mod m_j \), \( M_i \mod 2^L \), and \( -M \mod 2^L \), (19) can be used to obtain \( D_i = R_i \mod 2^L \). The accumulation in the equation can be performed by \( n \) times of full \( L \)-bit addition. The above processes can be carried out until \( R_{n-1} \) and \( D_{n-1} \) are obtained.
4.2 Moduli Selection

In the RNS Montgomery modular multiplication algorithm, 2n RNS moduli are required by Algorithm 1. Suppose the moduli are denoted as \( m_i = f_i(k) = 2^k - 2^i - 1 \) and \( m_{n+i} = f_2(k) = 2^k - 2^i + 1 \), with \( k \in [1, L/2) \) being a set of numbers \{\( k_1, k_2, \ldots, k_n \)\}. It is necessary that \( k_i \) (\( i=1, 2, \ldots, n \)) make all \( \{m_i\} \) and \( \{m_{n+i}\} \) pairwise prime. Except the generalized Mersenne numbers, two special moduli \( 2^L \) and \( 2^L - 1 \) are included as RNS moduli.

For FPGA implementation, the binary length \( L \) should be sufficiently large for coprime moduli and the RNS dynamic range. And it is also set to be around multiples of 18, so as to use the dedicated multipliers in FPGA devices.

The seeking of moduli can be done with Wolfram Mathematica. In this work, the RNS moduli for 192-bit, 256-bit, and 384-bit ECC over prime fields are demonstrated in Table 1.

In Table 1, RNS 1 is the original RNS, and RNS 2 is the auxiliary RNS. Modular multiplication over the two moduli \( m \) and \( m_{n+i} \) (\( n=5, 6, 8 \)) are incorporated into one unified modular multiplier.

Table 1: RNS moduli for prime ECC

<table>
<thead>
<tr>
<th>Design</th>
<th>RNS</th>
<th>Moduli</th>
</tr>
</thead>
<tbody>
<tr>
<td>192-bit ECC</td>
<td>( m_i = 2^i - 1 )</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
</tr>
<tr>
<td>RNS 1</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
</tr>
<tr>
<td>RNS 2</td>
<td>( m_i = 2^i - 2^i + 1 )</td>
<td>( m_i = 2^i - 2^i + 1 )</td>
</tr>
<tr>
<td>256-bit ECC</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
</tr>
<tr>
<td>RNS 1</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
</tr>
<tr>
<td>RNS 2</td>
<td>( m_i = 2^i - 2^i + 1 )</td>
<td>( m_i = 2^i - 2^i + 1 )</td>
</tr>
<tr>
<td>384-bit ECC</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
</tr>
<tr>
<td>RNS 1</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
<td>( m_i = 2^i - 2^i - 1 )</td>
</tr>
<tr>
<td>RNS 2</td>
<td>( m_i = 2^i - 2^i + 1 )</td>
<td>( m_i = 2^i - 2^i + 1 )</td>
</tr>
</tbody>
</table>

4.3 Optimized Elliptic Curve Arithmetic

The point addition and doubling in Section 2 have been reorganized to facilitate hardware implementation[26], which are shown in Algorithm 5 and Algorithm 6. The basic idea is to reduce both the computational steps and buffering registers. The time complexity of one ECPM in this work is \( T_{\text{total}} = h \cdot c \cdot N \cdot T_{\text{gmult}} \), where \( h \) is a factor determined by ECC arithmetic, \( c \) is the number of sequential steps to finish one RNS Montgomery modular multiplication, \( N \) is the binary length of ECC field bits, and \( T_{\text{gmult}} \) is the time of one modular multiplication over generalized Mersenne numbers. In this work, \( h=15 \) is for the average case, and \( h=11 \) is for the best case. \( T_{\text{gmult}} \) is just the critical path of the second stage in Fig. 2.

In fact, the lazy RNS modular reduction[27]-[29] can be applied to accelerate the ECPM to some extent[13],[29]. On average, to process one ECC bit, the average number of RNS modular reductions in this work is \( 11+12 \times 1/3=15 \), where the factor 1/3 is due to signed-digit coding[25]. By contrast, only 13 sequential RNS modular reductions are needed by the lazy RNS reduction[13],[29]. However, this selection of ECC arithmetic steps does not affect the efficiency of unified modular multiplier.

Algorithm 5. Optimized point doubling

Input: Point \( Q_0 \) with Jacobian coordinate \((X_0, Y_0, Z_0)\) lies at an elliptic curve, which is defined by \( y^2 = x^3 + ax + b \).

Output: \( Q_t = 2Q_0 \) with \( Q_t = (X_t, Y_t, Z_t) \).

1: \( V_1 = X_0^2, F = Y_0 + Z_0 \);
2: \( V_2 = Z_0^2, G_1 = 3 \cdot V_1 + V_1 + V_1 \);
3: \( V_3 = V_2^2 \);
4: \( V_4 = aV_2 \);
5: \( E = G_1 + V_4, G_2 = F^2 \);
6: \( G_3 = G_2 \cdot X_0 \);
7: \( V_5 = G_3 + G_1, V_6 = E^2 \);
8: \( X_1 = V_6 - V_5, V_7 = F \cdot G_2 \);
9: \( V_8 = G_3 - X_1, G_4 = V_7 \cdot Y_0 \);
10: \( V_9 = E \cdot V_8 \);
11: \( Y_1 = V_9 - G_4, Z_1 = Z_0 \cdot F \);
12: return \((X_1, Y_1, Z_1)\).

At Line 4 of Algorithm 5, the constant \( a \) has been represented in RNS as \((a \cdot M) \mod p\) by precomputation, where \( M \) is the RNS dynamic range.

Algorithm 6. Optimized point addition

Input: Point \( P_0 \) is denoted by the affine coordinate \((x_0, y_0)\), and the other point \( P_1 \) is denoted as a Jacobian coordinate \((X_1, Y_1, Z_1)\).

Output: \( P_2 = P_0 + P_1 \) with a Jacobian coordinate \((X_2, Y_2, Z_2)\).
Table 2: Hardware implementation of $F_p$ ECPM

<table>
<thead>
<tr>
<th>References</th>
<th>Prime Field</th>
<th>Technology</th>
<th>Max. frequency (MHz)</th>
<th>Area (Slices)</th>
<th>ROM+RAM (bit)</th>
<th>Multipliers/DSP</th>
<th>ECPM time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>This work</td>
<td>192-bit $F_p$</td>
<td>Xilinx XC2VP100</td>
<td>79.3</td>
<td>17747</td>
<td>32+192</td>
<td>24</td>
<td>1.13</td>
</tr>
<tr>
<td>This work</td>
<td>256-bit $F_p$</td>
<td>Xilinx XC2VP100</td>
<td>73.5</td>
<td>22147</td>
<td>(32+2)×256</td>
<td>45</td>
<td>1.44</td>
</tr>
<tr>
<td>This work</td>
<td>256-bit $F_p$</td>
<td>Xilinx XC4VLX80</td>
<td>100</td>
<td>21572</td>
<td>(32+2)×256</td>
<td>45</td>
<td>1.06</td>
</tr>
<tr>
<td>This work</td>
<td>384-bit $F_p$</td>
<td>Xilinx XC4VLX160</td>
<td>99.0</td>
<td>35109</td>
<td>(32+2)×384</td>
<td>72</td>
<td>2.03</td>
</tr>
<tr>
<td>This work</td>
<td>256-bit $F_p$</td>
<td>TSMC CMOS 0.18 um process</td>
<td>123.46</td>
<td>221K gates</td>
<td>(32+2)×256</td>
<td>0</td>
<td>0.857</td>
</tr>
<tr>
<td>This work</td>
<td>384-bit $F_p$</td>
<td>TSMC CMOS 0.18 um process</td>
<td>123.46</td>
<td>324K gates</td>
<td>(32+2)×384</td>
<td>0</td>
<td>1.63</td>
</tr>
<tr>
<td>Ref. [30]</td>
<td>192-bit $F_p$</td>
<td>Xilinx XCVL000E</td>
<td>52.9</td>
<td>25012 LUTs</td>
<td>0</td>
<td>0</td>
<td>2.97</td>
</tr>
<tr>
<td>Ref. [30]</td>
<td>256-bit $F_p$</td>
<td>Xilinx XCVL000E</td>
<td>39.7</td>
<td>32716 LUTs</td>
<td>0</td>
<td>0</td>
<td>3.95</td>
</tr>
<tr>
<td>Ref. [13]</td>
<td>192-bit $F_p$</td>
<td>Altera EP1S30F780C5</td>
<td>89.6</td>
<td>12480 LE</td>
<td>Unspecified</td>
<td>80 DSP</td>
<td>0.72</td>
</tr>
<tr>
<td>Ref. [13]</td>
<td>256-bit $F_p$</td>
<td>Altera EP1S60F780C5</td>
<td>90.7</td>
<td>16200 LE</td>
<td>Unspecified</td>
<td>125 DSP</td>
<td>1.17</td>
</tr>
<tr>
<td>Ref. [13]</td>
<td>256-bit $F_p$</td>
<td>Altera EP2S30F484C3</td>
<td>157.2</td>
<td>9177 ALM</td>
<td>Unspecified</td>
<td>96 DSP</td>
<td>0.68</td>
</tr>
<tr>
<td>Ref. [13]</td>
<td>384-bit $F_p$</td>
<td>Altera EP2S30F484C3</td>
<td>150.9</td>
<td>12958 ALM</td>
<td>Unspecified</td>
<td>177 DSP</td>
<td>1.35</td>
</tr>
<tr>
<td>Ref. [6]</td>
<td>192-bit $F_p$</td>
<td>Xilinx XCVL000E</td>
<td>40</td>
<td>11416 LUTs</td>
<td>35 BRAMs</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>Ref. [31]</td>
<td>NIST-256 $F_p$</td>
<td>Xilinx XC4VSX55</td>
<td>375</td>
<td>24574</td>
<td>176 BRAMs</td>
<td>512 DSP</td>
<td>0.0405</td>
</tr>
<tr>
<td>Ref. [31]</td>
<td>NIST-256 $F_p$</td>
<td>Xilinx XC4VSX55</td>
<td>490</td>
<td>1715</td>
<td>176 BRAMs</td>
<td>32 DSP</td>
<td>0.495</td>
</tr>
<tr>
<td>Ref. [32]</td>
<td>256-bit $F_p$</td>
<td>Xilinx XC2VP125</td>
<td>39.46</td>
<td>15755</td>
<td>72</td>
<td>2.03</td>
<td></td>
</tr>
</tbody>
</table>

1: $H_1 = Z_1^3$;
2: $H_2 = x_0 \cdot H_1$;
3: $S = H_3 - X_1$, $U_1 = H_1 \cdot Z_1$;
4: $W = H_2 + X_1$, $U_2 = y_0 \cdot U_1$;
5: $T = U_3 - Y_1$, $H_3 = S^2$;
6: $H_4 = H_3 \cdot W$;
7: $U_3 = T^2$;
8: $X_2 = U_3 - H_4$, $U_4 = X_1 \cdot H_3$;
9: $U_5 = U_3 - X_2$, $H_5 = H_1 \cdot S$;
10: $H_6 = T \cdot U_5$;
11: $U_6 = Y_1 \cdot H_3$;
12: $Y_2 = H_6 - U_6$, $Z_2 = S \cdot Z_1$;
13: return $(X_2, Y_2, Z_2)$.

In Algorithm 5 and Algorithm 6, $U_i (i=1, 2, \ldots, 6)$ and $V_i (i=1, 2, \ldots, 9)$ can be buffered by output registers of RNS processing units, while the other variables need extra registers for buffering.

5. Hardware Implementation

In this work, ECPM has been described by verilog hardware description language (VHDL) and then synthesized in Xilinx ISE foundation by Synplify 9.6.2. The place and route process is completed in Xilinx ISE 10.1. The target devices are Xilinx XC2VP100-6 FF1696 FPGA (130 nm CMOS node) and XC4VLX80-12 FF1148 FPGA (90 nm CMOS node), which can be compared with Altera Stratix and Stratix II devices. This work is aiming at ECPM over general prime fields. The hardware implementation results are shown in Table 2. In order to measure the total area complexity, we have also synthesized the 256-bit and 384-bit $F_p$ ECPM by TSMC 0.18 μm CMOS process, which respectively costs an area of 221K gates (NAND2) and 324 K gates at 123.46 MHz.

The number of sequential steps $c$ in one RNS Montgomery modular multiplication is interfered with hardware implementation. In this work, $c=26$ for 192-bit ECC, $c=24$ for 256-bit ECC, and $c=30$ for 384-bit ECC. Correspondingly, the number of clock cycles for each ECPM is respectively 89984, 105785, and 201143.

Reference [30] is a pioneer for implementing $F_p$ ECPM by RNS, in which the RNS parallelism of addition and multiplication were exploited. However, it deals with modular arithmetic out of RNS and therefore conversions into and out of the binary system become a bottleneck. In order to improve the performance, a pipelined conversion logic unit is designed, and the multiplication in their architecture appears in port conversions between RNS and the binary number system. Compared with their work, the proposed work obtains higher performance by more hardware resources.

In Ref. [13], an efficient RNS architecture with a deep pipeline was proposed for ECPM. It also depends on the RNS Montgomery modular multiplication algorithm, and the moduli are just generalized Mersenne numbers of $2^{2k}-δ$. Its critical path is reduced owing to deep pipeline stages. In Table 2, the Altera logic elements (LE) are equivalent to the look-up table (LUT) of the Xilinx Virtex II-pro. Meanwhile, one Altera DSP (digital signal processing)
block accounts for a 9×9-bit multiplier, while one DSP block in Xilinx FPGA includes an 18×18-bit multiplier. In general, this work is inferior to [13] in performance due to only 2 pipeline stages and the relative deficiency of elliptic curve arithmetic with ECPM.

Reference [6] is the initial work for implementing ECPM in FPGA, in which a high-radix Montgomery modular multiplier was used to perform ECPM, which is very area-efficient but relatively slow compared with other designs.

In [31], ECPM architectures for NIST (American National Institute of Standards and Technology) prime fields were implemented in FPGA. It makes best use of FPGA DSP resources to achieve a very high frequency, which is quite fast compared with both hardware and software implementation in the literature. However, the high frequency may be unavailable in many applications. Also it depends heavily on the special DSP units and is only applicable to NIST primes.

The work in [32] focused on the improvement of modular inverse in the elliptic curve field, which brings out good performance compared with related designs. In detail, in Xilinx XC2VP125-9 FPGA it requires 15755 slices and 256 18×18-bit multipliers for a 256-bit \( F_p \) ECPM, and computes one ECPM by 3.86 ms. By contrast, our work in Xilinx XC2VP125-9 FPGA it requires 15755 slices and 45 18×18-bit multipliers, and completes one ECPM in 1.44 ms. Apparently, the proposed design enjoys better performance than [32].

6. Conclusions

In this work, we have improved the modular arithmetic over generalized Mersenne numbers \( m=2^k-2^\pm 1 \), by which a unified modular multiplier working over this pair of numbers can be obtained. Modular reductions over such generalized Mersenne numbers can be performed by a little addition and a few multiplexors.

This modular multiplier is then applied for ECPM by the RNS Montgomery modular multiplication algorithm. Together with the partially optimized point addition and doubling for elliptic curve arithmetic, the proposed ECPM demonstrates good performance for hardware implementation in FPGA and application specific integrated circuits (ASIC).

Acknowledgment

The authors would like to thank the comments from anonymous reviewers. Discussion with Prof. Shuguo Li is also acknowledged.

References


Tao Wu was born in 1981 in Hubei Province. He received the B.S. degree in electronic science and technology from Wuhan University, Wuhan in 2003 and the M.S. degree from Tsinghua University, Beijing in 2006. From September 2006 to April 2007, he served as a temporary assistant with the Device and System Laboratory, the Institute of Micro-electronics, Tsinghua University. Then he worked with the Department of Physics and Electronic Engineering, Guangxi Normal University from July 2007 to July 2008. Since September 2008 he has been pursuing the Ph.D. degree with Tsinghua University. His current research refers to circuits and computer arithmetic about public-key cryptography.

Li-Tian Liu was born in 1947 in Jiangxi Province. He received the B.S. degree in electronic engineering from Tsinghua University, Beijing in 1970. He is currently a full professor with the Institute of Microelectronics, Tsinghua University. His research interests include the development of semiconductor devices and integrated circuits.