This work continues on top of the CL opened by Vlad Krasnov
(https://boringssl-review.googlesource.com/c/boringssl/+/44364). The
CL was thoroughly reviewed by David Benjamin but not merged due to
some outstanding comments which this work addresses:
- The flag check when doing the final reduction in poly1305 was
changed from `eq` to `cs`
- The CFI prologues and epilogues of open/seal were modified as
recommended by David.
- Added Pointer Authentication instruction to the functions that are
exported from the assembly code as pointed out by David.
Testing:
- The current tests against ChaCha20-Poly1305 continue to pass.
- More test vectors were produced using a Python script to try and
prove that having `eq` instead of `cs` was a bug. They passed as
well, but didn't result in the most significant word being
non-zero after the reduction, which would have highlighted the
bug. An argument about why it's unlikely to find the vector is
detailed below.
- `objdump -W|Wf|WF` was used to confirm the value of the CFA and the
locations of the registers relative to the CFA were as expected. See
https://www.imperialviolet.org/2017/01/18/cfi.html.
Performance:
| Size | Before (MB/s) | After (MB/s) | Improvement |
| 16 bytes | 30.5 | 43.3 | 1.42x |
| 256 bytes | 220.7 | 361.5 | 1.64x |
| 1350 bytes | 285.9 | 639.4 | 2.24x |
| 8192 bytes | 329.6 | 798.3 | 2.42x |
| 16384 bytes | 331.9 | 814.9 | 2.46x |
Explanation of the unlikelihood of finding a test vector:
* the modulus is in t2:t1:t0 = 3 : FF..FF : FF..FB, each being a 64 bit
word; i.e. t2 = 3, t1 = all 1s.
* acc2 <= 4 after the previous reduction.
* It is highly likely to have borrow = 1 from acc1 - t1 since t1 is
all FFs.
* So for almost all test vectors we have acc2 <= 4 and borrow = 1,
thus (t2 = acc2 - t2 - borrow) will be 0 whenever acc >
modulus. **It would be highly unlikely to find such a test vector
with t2 > 0 after that final reduction:** Trying to craft that
vector requires having acc and r of high values before their
multiplication, yet ensuring that after the reduction (see Note) of
their product, the resulting value of the accumulator has t2 = 4,
all 1s in t1 and most of t0 so that no borrow occurs from acc1:acc0
- t1:t0.
* Note: the reduction is basically carried by folding over the top
64+62 bits once, then folding them again shifted left by 2,
resulting in adding 5 times those bits.
Change-Id: If7d86b7a9b74ec3615ac2d7a97f80100dbfaee7f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51885
Reviewed-by: Adam Langley <alangley@gmail.com>
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>