Always end BN_mod_exp_mont_consttime with normal Montgomery reduction.

This partially fixes a bug where, on x86_64, BN_mod_exp_mont_consttime
would sometimes return m, the modulus, when it should have returned
zero. Thanks to Guido Vranken for reporting it. It is only a partial fix
because the same bug also exists in the "rsaz" codepath. That will be
fixed in the subsequent CL. (See the commented out test.)

The bug only affects zero outputs (with non-zero inputs), so we believe
it has no security impact on our cryptographic functions. BoringSSL
calls BN_mod_exp_mont_consttime in the following cases:

- RSA private key operations
- Primality testing, raising the witness to the odd part of p-1
- DSA keygen and key import, pub = g^priv (mod p)
- DSA signing, r = g^k (mod p)
- DH keygen, pub = g^priv (mod p)
- Diffie-Hellman, secret = peer^priv (mod p)

It is not possible in the RSA private key operation, provided p and q
are primes. If using CRT, we are working modulo a prime, so zero output
with non-zero input is impossible. If not using CRT, we work mod n.
While there are nilpotent values mod n, none of them hit zero by
exponentiating. (Both p and q would need to divide the input, which
means n divides the input.)

In primality testing, this can only be hit when the input was composite.
But as the rest of the loop cannot then hit 1, we'll correctly report it
as composite anyway.

DSA and DH work modulo a prime, where this case cannot happen.

Analysis:

This bug is the result of sloppiness with the looser bounds from "almost
Montgomery multiplication", described in
https://eprint.iacr.org/2011/239. Prior to upstream's
ec9cc70f72454b8d4a84247c86159613cee83b81, I believe x86_64-mont5.pl
implemented standard Montgomery reduction (the left half of figure 3 in
the paper).

Though it did not document this, ec9cc70f7245 changed it to implement
the "almost" variant (the right half of the figure.) The difference is
that, rather than subtracting if T >= m, it subtracts if T >= R. In
code, it is the difference between something like our bn_reduce_once,
vs. subtracting based only on T's carry bit. (Interestingly, the
.Lmul_enter branch of bn_mul_mont_gather5 seems to still implement
normal reduction, but the .Lmul4x_enter branch is an almost reduction.)

That means none of the intermediate values here are bounded by m. They
are only bounded by R. Accordingly, Figure 2 in the paper ends with
step 10: REDUCE h modulo m. BN_mod_exp_mont_consttime is missing this
step. The bn_from_montgomery call only implements step 9, AMM(h, 1).
(x86_64-mont5.pl's bn_from_montgomery only implements an almost
reduction.)

The impact depends on how unreduced AMM(h, 1) can be. Remark 1 of the
paper discusses this, but is ambiguous about the scope of its 2^(n-1) <
m < 2^n precondition. The m+1 bound appears to be unconditional:

Montgomery reduction ultimately adds some 0 <= Y < m*R to T, to get a
multiple of R, and then divides by R. The output, pre-subtraction, is
thus less than m + T/R. MM works because T < mR => T' < m + mR/R = 2m.
A single subtraction of m if T' >= m gives T'' < m. AMM works because
T < R^2 => T' < m + R^2/R = m + R. A single subtraction of m if T' >= R
gives T'' < R. See also Lemma 1, Section 3 and Section 4 of the paper,
though their formulation is more complicated to capture the word-by-word
algorithm. It's ultimately the same adjustment to T.

But in AMM(h, 1), T = h*1 = h < R, so AMM(h, 1) < m + R/R = m + 1. That
is, AMM(h, 1) <= m. So the only case when AMM(h, 1) isn't fully reduced
is if it outputs m. Thus, our limited impact. Indeed, Remark 1 mentions
step 10 isn't necessary because m is a prime and the inputs are
non-zero. But that doesn't apply here because BN_mod_exp_mont_consttime
may be called elsewhere.

Fix:

To fix this, we could add the missing step 10, but a full division would
not be constant-time. The analysis above says it could be a single
subtraction, bn_reduce_once, but then we could integrate it into
the subtraction already in plain Montgomery reduction, implemented by
uppercase BN_from_montgomery. h*1 = h < R <= m*R, so we are within
bounds.

Thus, we delete lowercase bn_from_montgomery altogether, and have the
mont5 path use the same BN_from_montgomery ending as the non-mont5 path.
This only impacts the final step of the whole exponentiation and has no
measurable perf impact.

In doing so, add comments describing these looser bounds.  This includes
one subtlety that BN_mod_exp_mont_consttime actually mixes bn_mul_mont
(MM) with bn_mul_mont_gather5/bn_power5 (AMM). But this is fine because
MM is AMM-compatible; when passed AMM's looser inputs, it will still
produce a correct looser output.

Ideally we'd drop the "almost" reduction and stick to the more
straightforward bounds. As this only impacts the final subtraction in
each reduction, I would be surprised if it actually had a real
performance impact. But this would involve deeper change to
x86_64-mont5.pl, so I haven't tried this yet.

I believe this is basically the same bug as
https://github.com/golang/go/issues/13907 from Go.

Change-Id: I06f879777bb2ef181e9da7632ec858582e2afa38
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52825
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
chromium-5359
David Benjamin 3 years ago committed by Boringssl LUCI CQ
parent a8b1633d1c
commit 13c9d5c69d
  1. 197
      crypto/fipsmodule/bn/asm/x86_64-mont5.pl
  2. 9
      crypto/fipsmodule/bn/bn_test.cc
  3. 70
      crypto/fipsmodule/bn/bn_tests.txt
  4. 74
      crypto/fipsmodule/bn/exponentiation.c
  5. 19
      crypto/fipsmodule/bn/internal.h

@ -2088,194 +2088,6 @@ __bn_post4x_internal:
.size __bn_post4x_internal,.-__bn_post4x_internal .size __bn_post4x_internal,.-__bn_post4x_internal
___ ___
} }
{
$code.=<<___;
.globl bn_from_montgomery
.type bn_from_montgomery,\@abi-omnipotent
.align 32
bn_from_montgomery:
.cfi_startproc
testl \$7,`($win64?"48(%rsp)":"%r9d")`
jz bn_from_mont8x
xor %eax,%eax
ret
.cfi_endproc
.size bn_from_montgomery,.-bn_from_montgomery
.type bn_from_mont8x,\@function,6
.align 32
bn_from_mont8x:
.cfi_startproc
.byte 0x67
mov %rsp,%rax
.cfi_def_cfa_register %rax
push %rbx
.cfi_push %rbx
push %rbp
.cfi_push %rbp
push %r12
.cfi_push %r12
push %r13
.cfi_push %r13
push %r14
.cfi_push %r14
push %r15
.cfi_push %r15
.Lfrom_prologue:
shl \$3,${num}d # convert $num to bytes
lea ($num,$num,2),%r10 # 3*$num in bytes
neg $num
mov ($n0),$n0 # *n0
##############################################################
# Ensure that stack frame doesn't alias with $rptr+3*$num
# modulo 4096, which covers ret[num], am[num] and n[num]
# (see bn_exp.c). The stack is allocated to aligned with
# bn_power5's frame, and as bn_from_montgomery happens to be
# last operation, we use the opportunity to cleanse it.
#
lea -320(%rsp,$num,2),%r11
mov %rsp,%rbp
sub $rptr,%r11
and \$4095,%r11
cmp %r11,%r10
jb .Lfrom_sp_alt
sub %r11,%rbp # align with $aptr
lea -320(%rbp,$num,2),%rbp # future alloca(frame+2*$num*8+256)
jmp .Lfrom_sp_done
.align 32
.Lfrom_sp_alt:
lea 4096-320(,$num,2),%r10
lea -320(%rbp,$num,2),%rbp # future alloca(frame+2*$num*8+256)
sub %r10,%r11
mov \$0,%r10
cmovc %r10,%r11
sub %r11,%rbp
.Lfrom_sp_done:
and \$-64,%rbp
mov %rsp,%r11
sub %rbp,%r11
and \$-4096,%r11
lea (%rbp,%r11),%rsp
mov (%rsp),%r10
cmp %rbp,%rsp
ja .Lfrom_page_walk
jmp .Lfrom_page_walk_done
.Lfrom_page_walk:
lea -4096(%rsp),%rsp
mov (%rsp),%r10
cmp %rbp,%rsp
ja .Lfrom_page_walk
.Lfrom_page_walk_done:
mov $num,%r10
neg $num
##############################################################
# Stack layout
#
# +0 saved $num, used in reduction section
# +8 &t[2*$num], used in reduction section
# +32 saved *n0
# +40 saved %rsp
# +48 t[2*$num]
#
mov $n0, 32(%rsp)
mov %rax, 40(%rsp) # save original %rsp
.cfi_cfa_expression %rsp+40,deref,+8
.Lfrom_body:
mov $num,%r11
lea 48(%rsp),%rax
pxor %xmm0,%xmm0
jmp .Lmul_by_1
.align 32
.Lmul_by_1:
movdqu ($aptr),%xmm1
movdqu 16($aptr),%xmm2
movdqu 32($aptr),%xmm3
movdqa %xmm0,(%rax,$num)
movdqu 48($aptr),%xmm4
movdqa %xmm0,16(%rax,$num)
.byte 0x48,0x8d,0xb6,0x40,0x00,0x00,0x00 # lea 64($aptr),$aptr
movdqa %xmm1,(%rax)
movdqa %xmm0,32(%rax,$num)
movdqa %xmm2,16(%rax)
movdqa %xmm0,48(%rax,$num)
movdqa %xmm3,32(%rax)
movdqa %xmm4,48(%rax)
lea 64(%rax),%rax
sub \$64,%r11
jnz .Lmul_by_1
movq $rptr,%xmm1
movq $nptr,%xmm2
.byte 0x67
mov $nptr,%rbp
movq %r10, %xmm3 # -num
___
$code.=<<___ if ($addx);
leaq OPENSSL_ia32cap_P(%rip),%r11
mov 8(%r11),%r11d
and \$0x80108,%r11d
cmp \$0x80108,%r11d # check for AD*X+BMI2+BMI1
jne .Lfrom_mont_nox
lea (%rax,$num),$rptr
call __bn_sqrx8x_reduction
call __bn_postx4x_internal
pxor %xmm0,%xmm0
lea 48(%rsp),%rax
jmp .Lfrom_mont_zero
.align 32
.Lfrom_mont_nox:
___
$code.=<<___;
call __bn_sqr8x_reduction
call __bn_post4x_internal
pxor %xmm0,%xmm0
lea 48(%rsp),%rax
jmp .Lfrom_mont_zero
.align 32
.Lfrom_mont_zero:
mov 40(%rsp),%rsi # restore %rsp
.cfi_def_cfa %rsi,8
movdqa %xmm0,16*0(%rax)
movdqa %xmm0,16*1(%rax)
movdqa %xmm0,16*2(%rax)
movdqa %xmm0,16*3(%rax)
lea 16*4(%rax),%rax
sub \$32,$num
jnz .Lfrom_mont_zero
mov \$1,%rax
mov -48(%rsi),%r15
.cfi_restore %r15
mov -40(%rsi),%r14
.cfi_restore %r14
mov -32(%rsi),%r13
.cfi_restore %r13
mov -24(%rsi),%r12
.cfi_restore %r12
mov -16(%rsi),%rbp
.cfi_restore %rbp
mov -8(%rsi),%rbx
.cfi_restore %rbx
lea (%rsi),%rsp
.cfi_def_cfa_register %rsp
.Lfrom_epilogue:
ret
.cfi_endproc
.size bn_from_mont8x,.-bn_from_mont8x
___
}
}}} }}}
if ($addx) {{{ if ($addx) {{{
@ -3864,10 +3676,6 @@ mul_handler:
.rva .LSEH_begin_bn_power5 .rva .LSEH_begin_bn_power5
.rva .LSEH_end_bn_power5 .rva .LSEH_end_bn_power5
.rva .LSEH_info_bn_power5 .rva .LSEH_info_bn_power5
.rva .LSEH_begin_bn_from_mont8x
.rva .LSEH_end_bn_from_mont8x
.rva .LSEH_info_bn_from_mont8x
___ ___
$code.=<<___ if ($addx); $code.=<<___ if ($addx);
.rva .LSEH_begin_bn_mulx4x_mont_gather5 .rva .LSEH_begin_bn_mulx4x_mont_gather5
@ -3899,11 +3707,6 @@ $code.=<<___;
.byte 9,0,0,0 .byte 9,0,0,0
.rva mul_handler .rva mul_handler
.rva .Lpower5_prologue,.Lpower5_body,.Lpower5_epilogue # HandlerData[] .rva .Lpower5_prologue,.Lpower5_body,.Lpower5_epilogue # HandlerData[]
.align 8
.LSEH_info_bn_from_mont8x:
.byte 9,0,0,0
.rva mul_handler
.rva .Lfrom_prologue,.Lfrom_body,.Lfrom_epilogue # HandlerData[]
___ ___
$code.=<<___ if ($addx); $code.=<<___ if ($addx);
.align 8 .align 8

@ -2788,15 +2788,6 @@ TEST_F(BNTest, BNMulMont5ABI) {
words, 13); words, 13);
CHECK_ABI(bn_power5, r.data(), a.data(), table.data(), m->d, mont->n0, CHECK_ABI(bn_power5, r.data(), a.data(), table.data(), m->d, mont->n0,
words, 13); words, 13);
EXPECT_EQ(1, CHECK_ABI(bn_from_montgomery, r.data(), r.data(), nullptr,
m->d, mont->n0, words));
EXPECT_EQ(1, CHECK_ABI(bn_from_montgomery, r.data(), a.data(), nullptr,
m->d, mont->n0, words));
} else {
EXPECT_EQ(0, CHECK_ABI(bn_from_montgomery, r.data(), r.data(), nullptr,
m->d, mont->n0, words));
EXPECT_EQ(0, CHECK_ABI(bn_from_montgomery, r.data(), a.data(), nullptr,
m->d, mont->n0, words));
} }
} }
} }

@ -10604,6 +10604,76 @@ A = 95564994a96c45954227b845a1e99cb939d5a1da99ee91acc962396ae999a9ee38603790448f
E = f95dc0f980fbd22e90caa5a387cc4a369f3f830d50dd321c40db8c09a7e1a241a536e096622d3280c0c1ba849c1f4a79bf490f60006d081e8cf69960189f0d312cd9e17073a3fba7881b21474a13b334116cb2f5dbf3189a6de3515d0840f053c776d3982d391b6d04d642dda5cc6d1640174c09875addb70595658f89efb439dc6fbd55f903aadd307982d3f659207f265e1ec6271b274521b7a5e28e8fd7a55df089292820477802a43cf5b6b94e999e8c9944ddebb0d0e95a60f88cb7e813ba110d20e1024774107dd02949031864923b3cb8c3f7250d6d1287b0a40db6a47bd5a469518eb65aa207ddc47d8c6e5fc8e0c105be8fc1d4b57b2e27540471d5 E = f95dc0f980fbd22e90caa5a387cc4a369f3f830d50dd321c40db8c09a7e1a241a536e096622d3280c0c1ba849c1f4a79bf490f60006d081e8cf69960189f0d312cd9e17073a3fba7881b21474a13b334116cb2f5dbf3189a6de3515d0840f053c776d3982d391b6d04d642dda5cc6d1640174c09875addb70595658f89efb439dc6fbd55f903aadd307982d3f659207f265e1ec6271b274521b7a5e28e8fd7a55df089292820477802a43cf5b6b94e999e8c9944ddebb0d0e95a60f88cb7e813ba110d20e1024774107dd02949031864923b3cb8c3f7250d6d1287b0a40db6a47bd5a469518eb65aa207ddc47d8c6e5fc8e0c105be8fc1d4b57b2e27540471d5
M = fef15d5ce4625f1bccfbba49fc8439c72bf8202af039a2259678941b60bb4a8f2987e965d58fd8cf86a856674d519763d0e1211cc9f8596971050d56d9b35db3785866cfbca17cfdbed6060be3629d894f924a89fdc1efc624f80d41a22f19009503fcc3824ef62ccb9208430c26f2d8ceb2c63488ec4c07437aa4c96c43dd8b9289ed00a712ff66ee195dc71f5e4ead02172b63c543d69baf495f5fd63ba7bcc633bd309c016e37736da92129d0b053d4ab28d21ad7d8b6fab2a8bbdc8ee647d2fbcf2cf426cf892e6f5639e0252993965dfb73ccd277407014ea784aaa280cb7b03972bc8b0baa72360bdb44b82415b86b2f260f877791cd33ba8f2d65229b M = fef15d5ce4625f1bccfbba49fc8439c72bf8202af039a2259678941b60bb4a8f2987e965d58fd8cf86a856674d519763d0e1211cc9f8596971050d56d9b35db3785866cfbca17cfdbed6060be3629d894f924a89fdc1efc624f80d41a22f19009503fcc3824ef62ccb9208430c26f2d8ceb2c63488ec4c07437aa4c96c43dd8b9289ed00a712ff66ee195dc71f5e4ead02172b63c543d69baf495f5fd63ba7bcc633bd309c016e37736da92129d0b053d4ab28d21ad7d8b6fab2a8bbdc8ee647d2fbcf2cf426cf892e6f5639e0252993965dfb73ccd277407014ea784aaa280cb7b03972bc8b0baa72360bdb44b82415b86b2f260f877791cd33ba8f2d65229b
# The following inputs trigger an edge case between Montgomery reduction and the
# "almost" reduction variant from https://eprint.iacr.org/2011/239
ModExp = 00
A = 19c7bc9b97c6083cd7b8d1cd001452c9b67983247169c6532047eb7fc8933014dbf69fee7a358769f1429802c8ea89d4f9ca6ba6f368fbdb1fa5717b4a00
E = bbc7e09147408571050e8d0c634682c5863b7e8a573626648902cff12e590c74f5a23ecce39732266bc15b8afbd6c48a48c83fbdc33947515cc0b6e4fb98ae2cd730e58f951fec8be7e2e3c74f4506c7fd7e29bdb28675fe8a59789ab1148e931a2ebd2d36f78bc241682a3d8083d8ff538858cd240c5a693936e5a391dc9d77118062a3f868c058440a4192267faaaba91112f45eee5842060febbf9353a6d3e7f7996573209136a5506062ea23d74067f08c613f3ff74bade25f8c3368e6dba84eae672eac11be1137fc514924fcab8c82e46d092bd047dcbadaa48c67a096ec1a04f392a8511e6acbad9954949b703e71ff837337b594055ae6f3c0fc154447a687c9ac8a2cdfd64a2e680c6ff21254735af7f5eb6b43f0bce86bda55a04143a991711081435ed4f4a89b23fc3a588022b7a8543db4bf5c8ac93603367c750ff2191f59a716340fab49bb7544759c8d846465eec1438e76395f73e7b5e945f31f1b87fefa854a0d208846eaab5fa27144fd039911608bab0eaee80f1d3553dfa2d9ba95268479b97a059613660df5ad79796e0b272244aca90ccc13449ec15c206eeed7b60405a4c5cfdf5da5d136c27fa9385d810ad198dfe794ffce9955e10520efea1e2eb794e379401b9affd863b9566ce941c4726755574a1b1946acf0090bfb93f37dd55f524485bbba7fa84b53addfde01ae1de9c57fe50d4b708dd0fa45d02af398b3d05c6d17f84c11e9aacdbe0b146cad6ddbd877731e26a17f3ebed459560d12ed7a6abc2ea6fe922e69d2622ef11b6b245b9ba8f0940faaa671a4beb727be5393a94dafaeff7221b29183e7418f4c5bb95a6a586c93dbc8ce0236d9dbe26c40513611b4141fed66599adbfb20fc30e09a4815e4159f65a6708f34584a7a77b3843941cd61a6917dcc3d07a3dfb5a2cb108bacea7e782f2111b4d22ecaaeff469ecd0da371df1ac5e9bf6df6ccba2d3a9f393d597499eaca2c206bfb81c3426c5fe45bcf16e38aecd246a319a1f37041c638b75a4839517e43a6d01bee7d85eaeedbce13cd15699d3ee42c7414cfed576590e4fb6ddb6edd3e1957efaf039bfe8b9dc75869b1f93abff15cae8b234161070fa3542303c2ed35ca66083d0ac299b81182317a2a3985269602b1fa1e822fcbda48e686d80b273f06b0a702ca7f42cbbbd2fc2b3601422c8bff6302eda3c61b293049636002649b16f3c1f0be2b6599d66493a4497cd795b10a2ab8220fafad24fa90e1bfcf39ecce337e705695c7a224bf9f445a287d6aab221341659ca4be7861f6ac4c9d33dac811e6
M = 519b6e57781d40d897ec0c1b648d195526726b295438c9a70928ac25979563d72db91c8c42298a33b572edecdf40904c68a23337aa5341b56e92b0da5041
# To fully exercise BN_mod_exp_mont_consttime codepaths, we generate inputs at
# different bitwidths. rsaz-avx2.pl only runs at 1024-bit moduli, and
# x86_64-mont5.pl unrolls 8 64-bit words at a time, so we want to capture both
# multiples of 512- and non-multiples. Also include moduli that are not quite a
# full word.
# 512-bit
ModExp = 00
A = 8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000e
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = 8f42c9e9e351ba9b32ab0cf69da43f4acf7028d19cff6e5059ea0e3fcc97c97f36a31470044737d4c0c933ac441ecb29e32c81401523afdac7de9c3fd8493c97
# 1024-bit
# TODO(davidben): This test breaks the RSAZ implementation. Fix it and enable
# this test.
# ModExp = 00
# A = 800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002f
# E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
# M = 9da8dc26fdf4d2e49833b240ee552beb7a6e251caa91bfb5d6cafaf8ed9461877fda8f6ac299036d35806bc1ae7872e54eaac1ec6bee6d02c6621a9cf8883b3abc33c49b3e601203e0e86ef8f0562412cc689ee2670704583909ca6d7774c9f9f9f4d77d37fedef9cb51d207cb629ec02fa03b526fd6594bfa8f2da71238a0b7
# 1025-bit
ModExp = 00
A = 010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = 010223abfdda02e84e11cec8ee7fc784fa135733935f7b9054bb70f1f06d234d76dcf3beed55c7f39e955dc1fef2b65009240fd02f7a1b27a78fc2867144bf666efb929856db9f671c356c4c67a068a70fe83c52eebda03668872fd270d0794f0771d217fb6b93b12529a944f7f0496a9158757c55b8ee14f803f1d2d887e2f561
# 1088-bit
ModExp = 00
A = 8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003d
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = e91f6d748773cb212a23aa348125615123b1800c9ea222c9374c757702ae4140fa333790ed8f6bf60a1d7dda65c2767cc5f33e32e333d19fbfb5a2b85795757c9ca070268763a618e9d33873d28a89bf88acd209efbb15b80cd33b92a6b3a682e1c91782fc24fb86ddff4f809219c977b54b99359094bbcc51dfe17b992ab24b74a17950ad754281
# 1472-bit
ModExp = 00
A = 8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001d
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = a8770362f4bfe4fc1ab0e52705c11a9b6ba235d5a5f22197c2d68e27ed18426ede3316af706aa79bcf943dbd51459eb15ae1f9386216b3f3a847f94440a65b97659bc5ba2adb67173714ecaa886c0b926d7a64ea45576f9d2171784ce7e801724d5b0abfd93357d538ea7ad3ad89a74f4660bdb66dfb5f684dcf00402e3cdf0ab58afd867c943c8f47b80268a789456aa7c50a619dd2f9f5e3f74b5d810f0f8dadbf4ad5b917cdcb156c4c132611c8b3b035118a9e03551f
# 1536-bit
ModExp = 00
A = 800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = 878cd000778f927b2f1a4b8bac86efd282079a7ac0d25e09ffd2f72fbc282e65e233929d2457c7b1d63c56fb706cdfa04fb87e654c578c98d7cf59c2293dc5641086b68db4867105981daaf147a0ee91f6932ef064deae4142c19e58d50c0686f0eaf778be72450f89a98b4680bbc5ffab942195e44dd20616150fd1deca058068ca31ab2f861e99082588f17a2025bf5e536150142fca3187a259c791fc721430f24d7e338f8dc02e693a7e694d42775e80f7f7c03600b6ae86b4aba2b0e991
# 2048-bit
ModExp = 00
A = 8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000f
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = 9f40a7535c561208ecb38e17c9336d9bc8484d335901b2cd42759cf03689227f6992f10cb6b586d767fbcdf30e9d82a0eda60d2694ccd0194fa96b50b56e0cdeec1951ea9e58b07e334a7f108841a0ab28256917fecea561388807ed124a17386a7a7b501f9cbf3404247a76948d0561e48137d3f9669e36f175731796aeaf78851f7d866917f661422186a4814aa35c066b5a90b9cfc918af769a9f0bb30c12581027df64ac328a0f07dbd20adb704479f6d0f233a131828c71bab19c3c34795ea4fb68aa632c6f688e5b3b84413c9031d8dc251003a590dec0dd09bfa6109ed4570701439b6f265b84ac2170c317357b5fbe5535e2bbdd93c1aacfdaa28c85
# 3072-bit
ModExp = 00
A = 80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001d
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = c23dfd244a58a668d514498a705c8f8f548311b24f0f98b023d2d33632534c2ae948d6641d41fd7a29fbbd594bfc7fdd6e8162cbb3056af3075347b6fc8876458d33a9d0ffdbcdf482de0c73d1310fd8fa8f9f92dd0dbb0e2034e98a30f6c11b482f7476c5b593f673a322b1130daa4314e9074270dce1076436f0d56cf196afcbb235a9a7b3ac85b9062e85fc0e63a12c468c787019f6805f9faab64fc6a0babc80785d88740243f11366bffb40ccbe8b2bb7a99a2c8238a6f656bb0117d7b2602aa400f4d77de5f93c673f13264ca70de949454e3e3f261993c1aa427e8ef4f507af744f71f3b4aaf3c981d44cc1bfb1eb1151168762b242b740573df698e500d99612e17dc760f7b3bf7c235e39e81ad7edbe6c07dbb8b139745bb394d61cb799bcafec5de074932b0b2d74797e779ac8d81f63a2b2e9baa229dfaa7f90f34ffade1d2ad022a3407d35eb2d7477c6ae8ad100f6e95c05b4f947c1fabfb11a17add384e6b4cd3a02fd9b43f46805c6c74e366b74aa3b766be7a5fbbd67fa81
# 4096-bit
ModExp = 00
A = 8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
E = ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
M = 8030411ecbddcb0fe4e76fd6b5bf542e8b015d1610cf96130ded12ba2cda0641bd9692080f218ea8b0d751845b519d95b843542ec8d2a07f1f93afe3189b69a4f35c983011c7f7928c3df458cc3eae85c36e6934a4b1bc0a67c8a521de336642c49e10a7ffa8d0af911aacc19e3900449161940f139220e099a150dcaf0ff96ffff6e726c1ac139969103cf6a828ac3adf0301506aa02787b4f570d5dde53a34acab8fec6fa94760abf16ee99954371ad65a6e899daab87b95811d069404991de9abe064ebbddf886e970f10d260c899dda940191a82d4c8bd36651363aff5493f4f59e700007dcadf37ebea7fcfd7600d16617ffea0d9ae659446d851d93c564e50e558f734c894d735fa273770703dab62844d9f01badf632f3d14a00f739c022c9be95f54e9cea46ec6da7cb11f4602e06962951c48204726b7f120ddbd0eb3566dc8d1e6f195a9196e96db33322d088b43aecffe9b4df182dd016aca0bd14f1c56cd1a18b89165c027029862b09ffd78e92ab614349c4fd67f49cb12cd33d0728930d0538bda57acef1365a73cc8fbac7d463b9e3c3bae0bb6224b080cdb8b5cd47d546d53111fdc22b7ff679bcfe27192920ee163b2be337d8cccc93b4de7d2d31934b9c0e97af291dcc1135b4a473bd37114eec3ba75c411887b57799d3188e7353f33a4d31735ebfc9fcfc044985148dd96da3876a5ab7ea7a404b411
# RSAZ 512-bit. # RSAZ 512-bit.
# #

@ -979,13 +979,13 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
#if defined(OPENSSL_BN_ASM_MONT5) #if defined(OPENSSL_BN_ASM_MONT5)
if (window >= 5) { if (window >= 5) {
window = 5; // ~5% improvement for RSA2048 sign, and even for RSA4096 window = 5; // ~5% improvement for RSA2048 sign, and even for RSA4096
// reserve space for mont->N.d[] copy // Reserve space for the |mont->N| copy.
powerbufLen += top * sizeof(mont->N.d[0]); powerbufLen += top * sizeof(mont->N.d[0]);
} }
#endif #endif
// Allocate a buffer large enough to hold all of the pre-computed // Allocate a buffer large enough to hold all of the pre-computed
// powers of am, am itself and tmp. // powers of |am|, |am| itself, and |tmp|.
numPowers = 1 << window; numPowers = 1 << window;
powerbufLen += powerbufLen +=
sizeof(m->d[0]) * sizeof(m->d[0]) *
@ -1008,7 +1008,7 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
} }
OPENSSL_memset(powerbuf, 0, powerbufLen); OPENSSL_memset(powerbuf, 0, powerbufLen);
// lay down tmp and am right after powers table // Place |tmp| and |am| right after powers table.
tmp.d = powerbuf + top * numPowers; tmp.d = powerbuf + top * numPowers;
am.d = tmp.d + top; am.d = tmp.d + top;
tmp.width = am.width = 0; tmp.width = am.width = 0;
@ -1028,18 +1028,26 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
} }
#if defined(OPENSSL_BN_ASM_MONT5) #if defined(OPENSSL_BN_ASM_MONT5)
// This optimization uses ideas from http://eprint.iacr.org/2011/239, // This optimization uses ideas from https://eprint.iacr.org/2011/239,
// specifically optimization of cache-timing attack countermeasures // specifically optimization of cache-timing attack countermeasures,
// and pre-computation optimization. // pre-computation optimization, and Almost Montgomery Multiplication.
//
// Dedicated window==4 case improves 512-bit RSA sign by ~15%, but as // The paper discusses a 4-bit window to optimize 512-bit modular
// 512-bit RSA is hardly relevant, we omit it to spare size... // exponentiation, used in RSA-1024 with CRT, but RSA-1024 is no longer
// important.
//
// |bn_mul_mont_gather5| and |bn_power5| implement the "almost" reduction
// variant, so the values here may not be fully reduced. They are bounded by R
// (i.e. they fit in |top| words), not |m|. Additionally, we pass these
// "almost" reduced inputs into |bn_mul_mont|, which implements the normal
// reduction variant. Given those inputs, |bn_mul_mont| may not give reduced
// output, but it will still produce "almost" reduced output.
//
// TODO(davidben): Using "almost" reduction complicates analysis of this code,
// and its interaction with other parts of the project. Determine whether this
// is actually necessary for performance.
if (window == 5 && top > 1) { if (window == 5 && top > 1) {
const BN_ULONG *n0 = mont->n0; // Ensure |am| and |tmp| are padded to the right width.
BN_ULONG *np;
// BN_to_montgomery can contaminate words above .top
// [in BN_DEBUG[_DEBUG] build]...
for (i = am.width; i < top; i++) { for (i = am.width; i < top; i++) {
am.d[i] = 0; am.d[i] = 0;
} }
@ -1047,30 +1055,34 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
tmp.d[i] = 0; tmp.d[i] = 0;
} }
// copy mont->N.d[] to improve cache locality // Copy |mont->N| to improve cache locality.
for (np = am.d + top, i = 0; i < top; i++) { BN_ULONG *np = am.d + top;
for (i = 0; i < top; i++) {
np[i] = mont->N.d[i]; np[i] = mont->N.d[i];
} }
// Fill |powerbuf| with the first 32 powers of |am|.
const BN_ULONG *n0 = mont->n0;
bn_scatter5(tmp.d, top, powerbuf, 0); bn_scatter5(tmp.d, top, powerbuf, 0);
bn_scatter5(am.d, am.width, powerbuf, 1); bn_scatter5(am.d, am.width, powerbuf, 1);
bn_mul_mont(tmp.d, am.d, am.d, np, n0, top); bn_mul_mont(tmp.d, am.d, am.d, np, n0, top);
bn_scatter5(tmp.d, top, powerbuf, 2); bn_scatter5(tmp.d, top, powerbuf, 2);
// same as above, but uses squaring for 1/2 of operations // Square to compute powers of two.
for (i = 4; i < 32; i *= 2) { for (i = 4; i < 32; i *= 2) {
bn_mul_mont(tmp.d, tmp.d, tmp.d, np, n0, top); bn_mul_mont(tmp.d, tmp.d, tmp.d, np, n0, top);
bn_scatter5(tmp.d, top, powerbuf, i); bn_scatter5(tmp.d, top, powerbuf, i);
} }
// Compute odd powers |i| based on |i - 1|, then all powers |i * 2^j|.
for (i = 3; i < 8; i += 2) { for (i = 3; i < 8; i += 2) {
int j;
bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1); bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1);
bn_scatter5(tmp.d, top, powerbuf, i); bn_scatter5(tmp.d, top, powerbuf, i);
for (j = 2 * i; j < 32; j *= 2) { for (int j = 2 * i; j < 32; j *= 2) {
bn_mul_mont(tmp.d, tmp.d, tmp.d, np, n0, top); bn_mul_mont(tmp.d, tmp.d, tmp.d, np, n0, top);
bn_scatter5(tmp.d, top, powerbuf, j); bn_scatter5(tmp.d, top, powerbuf, j);
} }
} }
// These two loops are the above with the inner loop unrolled.
for (; i < 16; i += 2) { for (; i < 16; i += 2) {
bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1); bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1);
bn_scatter5(tmp.d, top, powerbuf, i); bn_scatter5(tmp.d, top, powerbuf, i);
@ -1137,15 +1149,15 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
bn_power5(tmp.d, tmp.d, powerbuf, np, n0, top, val); bn_power5(tmp.d, tmp.d, powerbuf, np, n0, top, val);
} }
} }
// The result is now in |tmp| in Montgomery form, but it may not be fully
ret = bn_from_montgomery(tmp.d, tmp.d, NULL, np, n0, top); // reduced. This is within bounds for |BN_from_montgomery| (tmp < R <= m*R)
tmp.width = top; // so it will, when converting from Montgomery form, produce a fully reduced
if (ret) { // result.
if (!BN_copy(rr, &tmp)) { //
ret = 0; // This differs from Figure 2 of the paper, which uses AMM(h, 1) to convert
} // from Montgomery form with unreduced output, followed by an extra
goto err; // non-zero ret means it's not error // reduction step. In the paper's terminology, we replace steps 9 and 10
} // with MM(h, 1).
} else } else
#endif #endif
{ {
@ -1206,7 +1218,11 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
} }
} }
// Convert the final result from montgomery to standard format // Convert the final result from Montgomery to standard format. If we used the
// |OPENSSL_BN_ASM_MONT5| codepath, |tmp| may not be fully reduced. It is only
// bounded by R rather than |m|. However, that is still within bounds for
// |BN_from_montgomery|, which implements full Montgomery reduction, not
// "almost" Montgomery reduction.
if (!BN_from_montgomery(rr, &tmp, mont, ctx)) { if (!BN_from_montgomery(rr, &tmp, mont, ctx)) {
goto err; goto err;
} }

@ -353,6 +353,9 @@ int bn_rand_secret_range(BIGNUM *r, int *out_is_uniform, BN_ULONG min_inclusive,
// corresponding field in |BN_MONT_CTX|. It returns one if |bn_mul_mont| handles // corresponding field in |BN_MONT_CTX|. It returns one if |bn_mul_mont| handles
// inputs of this size and zero otherwise. // inputs of this size and zero otherwise.
// //
// If at least one of |ap| or |bp| is fully reduced, |rp| will be fully reduced.
// If neither is fully-reduced, the output may not be either.
//
// TODO(davidben): The x86_64 implementation expects a 32-bit input and masks // TODO(davidben): The x86_64 implementation expects a 32-bit input and masks
// off upper bits. The aarch64 implementation expects a 64-bit input and does // off upper bits. The aarch64 implementation expects a 64-bit input and does
// not. |size_t| is the safer option but not strictly correct for x86_64. But // not. |size_t| is the safer option but not strictly correct for x86_64. But
@ -372,6 +375,10 @@ int bn_mul_mont(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *bp,
// by |ap| modulo |np|, and stores the result in |rp|. The values are |num| // by |ap| modulo |np|, and stores the result in |rp|. The values are |num|
// words long and represented in Montgomery form. |n0| is a pointer to the // words long and represented in Montgomery form. |n0| is a pointer to the
// corresponding field in |BN_MONT_CTX|. // corresponding field in |BN_MONT_CTX|.
//
// WARNING: This function implements Almost Montgomery Multiplication from
// https://eprint.iacr.org/2011/239. The inputs do not need to be fully reduced.
// However, even if they are fully reduced, the output may not be.
void bn_mul_mont_gather5(BN_ULONG *rp, const BN_ULONG *ap, void bn_mul_mont_gather5(BN_ULONG *rp, const BN_ULONG *ap,
const BN_ULONG *table, const BN_ULONG *np, const BN_ULONG *table, const BN_ULONG *np,
const BN_ULONG *n0, int num, int power); const BN_ULONG *n0, int num, int power);
@ -391,16 +398,12 @@ void bn_gather5(BN_ULONG *out, size_t num, BN_ULONG *table, size_t power);
// values are |num| words long and represented in Montgomery form. |n0| is a // values are |num| words long and represented in Montgomery form. |n0| is a
// pointer to the corresponding field in |BN_MONT_CTX|. |num| must be divisible // pointer to the corresponding field in |BN_MONT_CTX|. |num| must be divisible
// by 8. // by 8.
//
// WARNING: This function implements Almost Montgomery Multiplication from
// https://eprint.iacr.org/2011/239. The inputs do not need to be fully reduced.
// However, even if they are fully reduced, the output may not be.
void bn_power5(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *table, void bn_power5(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *table,
const BN_ULONG *np, const BN_ULONG *n0, int num, int power); const BN_ULONG *np, const BN_ULONG *n0, int num, int power);
// bn_from_montgomery converts |ap| from Montgomery form modulo |np| and writes
// the result in |rp|, each of which is |num| words long. It returns one on
// success and zero if it cannot handle inputs of length |num|. |n0| is a
// pointer to the corresponding field in |BN_MONT_CTX|.
int bn_from_montgomery(BN_ULONG *rp, const BN_ULONG *ap,
const BN_ULONG *not_used, const BN_ULONG *np,
const BN_ULONG *n0, int num);
#endif // !OPENSSL_NO_ASM && OPENSSL_X86_64 #endif // !OPENSSL_NO_ASM && OPENSSL_X86_64
uint64_t bn_mont_n0(const BIGNUM *n); uint64_t bn_mont_n0(const BIGNUM *n);

Loading…
Cancel
Save