powerbuf's layout is: - num_powers values mod m, stored transposed - one value mod m, tmp - one value mod m, am - (mont5-only) an extra copy of m powerbuf_len broadly computed this, but where tmp + am would be sizeof(BN_ULONG) * top * 2, it used sizeof(BN_ULONG) * max(top * 2, num_powers). (In the actual code it's written as a ternary op and some multiplications are factored out.) That is, it allocated enough room for tmp + am OR an extra row in the num_powers table, as if each entry were top + 1 words long instead of top, with the space overlapping. This expression dates to upstream'sfips-20230428361512da0d
, though the exact layout has shifted over the years as mont5 evolved. (Originally, it only contained one extra value mod m.) At the time, this was necessary because bn_mul_mont_gather5 actually overreads the table by one row! Although it only uses top * 32 words, it requires the table to have (top + 1) * 32 words. This is because the computation was scheduled so that the .Louter4x loop would read and mask off the next iteration's value while incorporating the previous iteration: There were masked reads from $bp into XMM registers at the start of the loop:361512da0d/crypto/bn/asm/x86_64-mont5.pl (L541)
The XMM logic is interleaved throughout and does not move to a general-purpose register, $m0, until much later. $m is not read again until after the jump.361512da0d/crypto/bn/asm/x86_64-mont5.pl (L700)
Meanwhile, the loop is already reading $m0 at the start of the iteration.361512da0d/crypto/bn/asm/x86_64-mont5.pl (L551)
The whole thing is bootstrapped by similar code just above it:361512da0d/crypto/bn/asm/x86_64-mont5.pl (L531)
In the final iteration, we read one extra row into $m0 but never use it. That is the overread. I also confirmed this by rewinding our x86_64-mont5.pl to this state, hacking things up until it built, and then hacking up BN_mod_exp_mont_consttime to place the table in its own allocation, with no extra slop using C11 aligned_alloc. This was so valgrind could accurately instrument the bounds. When I did that, valgrind was clean if I allocated (top + 1) * num_powers, but flagged an out-of-bounds read at top * num_powers. This no longer applies. After25d14c6c29
, bn_mul_mont_gather5's scheduling is far less complicated. .Louter4x now begins with a masked read, setting up $m0, and then it incorporates $m0 into the product. The same valgrind strategy confirmed this. Thus, I don't believe this extra row is needed and we can allocate the buffer straightforwardly. Change-Id: I6c1ee8d5ebdb66eb4e5fec63d2140814c13ae146 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/55231 Reviewed-by: Bob Beck <bbe@google.com> Commit-Queue: Bob Beck <bbe@google.com>
parent
3ae0778f3d
commit
de434576d7
1 changed files with 1 additions and 3 deletions
Loading…
Reference in new issue