Fix allocation size in BN_mod_exp_mont_consttime.

powerbuf's layout is:
- num_powers values mod m, stored transposed
- one value mod m, tmp
- one value mod m, am
- (mont5-only) an extra copy of m

powerbuf_len broadly computed this, but where tmp + am would be
sizeof(BN_ULONG) * top * 2, it used
sizeof(BN_ULONG) * max(top * 2, num_powers).

(In the actual code it's written as a ternary op and some
multiplications are factored out.)

That is, it allocated enough room for tmp + am OR an extra row in the
num_powers table, as if each entry were top + 1 words long instead of
top, with the space overlapping. This expression dates to upstream's
361512da0d,
though the exact layout has shifted over the years as mont5 evolved.
(Originally, it only contained one extra value mod m.)

At the time, this was necessary because bn_mul_mont_gather5 actually
overreads the table by one row! Although it only uses top * 32 words, it
requires the table to have (top + 1) * 32 words. This is because the
computation was scheduled so that the .Louter4x loop would read and mask
off the next iteration's value while incorporating the previous
iteration:

There were masked reads from $bp into XMM registers at the start of the
loop:
361512da0d/crypto/bn/asm/x86_64-mont5.pl (L541)

The XMM logic is interleaved throughout and does not move to a
general-purpose register, $m0, until much later. $m is not read again
until after the jump.
361512da0d/crypto/bn/asm/x86_64-mont5.pl (L700)

Meanwhile, the loop is already reading $m0 at the start of the
iteration.
361512da0d/crypto/bn/asm/x86_64-mont5.pl (L551)

The whole thing is bootstrapped by similar code just above it:
361512da0d/crypto/bn/asm/x86_64-mont5.pl (L531)

In the final iteration, we read one extra row into $m0 but never use it.
That is the overread.

I also confirmed this by rewinding our x86_64-mont5.pl to this state,
hacking things up until it built, and then hacking up
BN_mod_exp_mont_consttime to place the table in its own allocation, with
no extra slop using C11 aligned_alloc. This was so valgrind could
accurately instrument the bounds. When I did that, valgrind was clean if
I allocated (top + 1) * num_powers, but flagged an out-of-bounds read at
top * num_powers.

This no longer applies. After
25d14c6c29,
bn_mul_mont_gather5's scheduling is far less complicated. .Louter4x now
begins with a masked read, setting up $m0, and then it incorporates $m0
into the product. The same valgrind strategy confirmed this. Thus, I
don't believe this extra row is needed and we can allocate the buffer
straightforwardly.

Change-Id: I6c1ee8d5ebdb66eb4e5fec63d2140814c13ae146
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/55231
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
fips-20230428
David Benjamin 2 years ago committed by Boringssl LUCI CQ
parent 3ae0778f3d
commit de434576d7
  1. 4
      crypto/fipsmodule/bn/exponentiation.c

@ -961,9 +961,7 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
// Allocate a buffer large enough to hold all of the pre-computed
// powers of |am|, |am| itself, and |tmp|.
int num_powers = 1 << window;
powerbuf_len +=
sizeof(m->d[0]) *
(top * num_powers + ((2 * top) > num_powers ? (2 * top) : num_powers));
powerbuf_len += sizeof(m->d[0]) * top * (num_powers + 2);
#if defined(OPENSSL_BN_ASM_MONT5)
if (powerbuf_len <= sizeof(storage)) {

Loading…
Cancel
Save