Enable SHA-NI optimizations for SHA-256.

While our CI machines don't have these instructions, Intel SDE covers
them. Benchmarks on an AMD EPYC machine (VM on Google Compute Engine):

Before:
Did 13619000 SHA-256 (16 bytes) operations in 3000147us (72.6 MB/sec)
Did 3728000 SHA-256 (256 bytes) operations in 3000566us (318.1 MB/sec)
Did 920000 SHA-256 (1350 bytes) operations in 3002829us (413.6 MB/sec)
Did 161000 SHA-256 (8192 bytes) operations in 3017473us (437.1 MB/sec)
Did 81000 SHA-256 (16384 bytes) operations in 3029284us (438.1 MB/sec)

After:
Did 25442000 SHA-256 (16 bytes) operations in 3000010us (135.7 MB/sec) [+86.8%]
Did 10706000 SHA-256 (256 bytes) operations in 3000171us (913.5 MB/sec) [+187.2%]
Did 3119000 SHA-256 (1350 bytes) operations in 3000470us (1403.3 MB/sec) [+239.3%]
Did 572000 SHA-256 (8192 bytes) operations in 3001226us (1561.3 MB/sec) [+257.2%]
Did 289000 SHA-256 (16384 bytes) operations in 3006936us (1574.7 MB/sec) [+259.4%]
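
The MB/sec and percentage figures above follow from the operation counts and elapsed times. A minimal Python sketch of that arithmetic, assuming 1 MB = 10^6 bytes (which reproduces the reported numbers). The function names are illustrative and not part of BoringSSL:

```python
import re

# Matches lines such as:
#   Did 13619000 SHA-256 (16 bytes) operations in 3000147us (72.6 MB/sec)
LINE_RE = re.compile(r"Did (\d+) SHA-256 \((\d+) bytes\) operations in (\d+)us")


def throughput_mb_per_sec(line):
    """Return throughput in MB/sec (1 MB = 10**6 bytes) for one benchmark line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognized benchmark line: %r" % line)
    ops, size, micros = (int(group) for group in match.groups())
    # Bytes per microsecond is numerically equal to 10**6 bytes per second.
    return ops * size / micros


def speedup_percent(before_line, after_line):
    """Return the percentage improvement of after_line over before_line."""
    before = throughput_mb_per_sec(before_line)
    after = throughput_mb_per_sec(after_line)
    return (after / before - 1.0) * 100.0


before = "Did 81000 SHA-256 (16384 bytes) operations in 3029284us (438.1 MB/sec)"
after = "Did 289000 SHA-256 (16384 bytes) operations in 3006936us (1574.7 MB/sec)"
print("%.1f MB/sec -> %.1f MB/sec (+%.1f%%)" % (
    throughput_mb_per_sec(before),
    throughput_mb_per_sec(after),
    speedup_percent(before, after),
))
```

Computing the speedup from the raw counts rather than from the rounded MB/sec values explains small differences such as +86.8% for the 16-byte case.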

Although we don't currently have unwind tests in CI, I ran the unwind
tests manually on the same VM. They pass, after adding in the missing
.cfi_startproc and .cfi_endproc lines.
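
A missing .cfi_startproc/.cfi_endproc pair can also be spotted statically in the generated assembly. The sketch below is a hypothetical lint, not part of BoringSSL's tooling, and it does not replace the unwind tests. It assumes ELF-style output in which each function is bracketed by `.type name,@function` and `.size name,...`:

```python
import re


def functions_missing_cfi(asm_text):
    """Return names of functions that lack .cfi_startproc or .cfi_endproc."""
    missing = []
    current = None  # name of the function being scanned, if any
    seen = set()    # CFI directives seen since the last .type directive
    for raw in asm_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing '#' comments
        type_match = re.match(r"\.type\s+(\w+),\s*@function", line)
        if type_match:
            current, seen = type_match.group(1), set()
        elif line in (".cfi_startproc", ".cfi_endproc"):
            seen.add(line)
        elif current and re.match(r"\.size\s+%s\s*," % re.escape(current), line):
            if seen != {".cfi_startproc", ".cfi_endproc"}:
                missing.append(current)
            current = None
    return missing


# Hypothetical input: one annotated function and one without CFI directives.
sample = """\
.type with_cfi,@function
with_cfi:
.cfi_startproc
	ret
.cfi_endproc
.size with_cfi,.-with_cfi
.type without_cfi,@function
without_cfi:
	ret
.size without_cfi,.-without_cfi
"""
print(functions_missing_cfi(sample))
```

This only checks that the directives are present. Whether the CFI state is correct at every instruction still has to be verified by running the unwind tests.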

Change-Id: I45b91819e7dcc31e63813843129afa146d0c9d47
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51546
Reviewed-by: Adam Langley <agl@google.com>
fips-20220613
David Benjamin 3 years ago committed by Adam Langley
parent ec85d0ddbc
commit 17c8c81104
crypto/fipsmodule/sha/asm/sha512-x86_64.pl (19 lines changed)

@@ -126,15 +126,12 @@ die "can't locate x86_64-xlate.pl";
 # versions, but BoringSSL is intended to be used with pre-generated perlasm
 # output, so this isn't useful anyway.
 #
-# TODO(davidben): Enable AVX2 code after testing by setting $avx to 2. Is it
-# necessary to disable AVX2 code when SHA Extensions code is disabled? Upstream
-# did not tie them together until after $shaext was added.
+# This file also has an AVX2 implementation, controlled by setting $avx to 2.
+# For now, we intentionally disable it. While it gives a 13-16% perf boost, the
+# CFI annotations are wrong. It allocates stack in a loop and should be
+# rewritten to avoid this.
 $avx = 1;
-# TODO(davidben): Consider enabling the Intel SHA Extensions code once it's
-# been tested.
-$shaext=0; ### set to zero if compiling for 1.0.1
-$avx=1 if (!$shaext && $avx);
+$shaext = 1;
 open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
 *STDOUT=*OUT;
@@ -275,7 +272,7 @@ $code.=<<___ if ($SZ==4 || $avx);
 ___
 $code.=<<___ if ($SZ==4 && $shaext);
 test \$`1<<29`,%r11d # check for SHA
-jnz _shaext_shortcut
+jnz .Lshaext_shortcut
 ___
 # XOP codepath removed.
 $code.=<<___ if ($avx>1);
@@ -559,7 +556,8 @@ $code.=<<___;
 .type sha256_block_data_order_shaext,\@function,3
 .align 64
 sha256_block_data_order_shaext:
-_shaext_shortcut:
+.cfi_startproc
+.Lshaext_shortcut:
 ___
 $code.=<<___ if ($win64);
 lea `-8-5*16`(%rsp),%rsp
@@ -703,6 +701,7 @@ $code.=<<___ if ($win64);
 ___
 $code.=<<___;
 ret
+.cfi_endproc
 .size sha256_block_data_order_shaext,.-sha256_block_data_order_shaext
 ___
 }}}
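
The `test $1<<29, %r11d` check in the second hunk is what selects the SHA-NI path at runtime. Bit 29 of EBX from CPUID leaf 7 (subleaf 0) is the SHA extensions feature bit. The sketch below assumes %r11d holds that word at this point, as the "check for SHA" comment suggests. The helper names are hypothetical. The second helper reads the same capability from the `sha_ni` flag that Linux reports in /proc/cpuinfo, which is handy for confirming that a benchmark machine will take the new path:

```python
SHA_BIT = 1 << 29  # CPUID.(EAX=7,ECX=0):EBX bit 29 reports the SHA extensions


def has_sha_extensions(leaf7_ebx):
    """Mirror `test $1<<29, %r11d`: True when the SHA feature bit is set."""
    return (leaf7_ebx & SHA_BIT) != 0


def cpuinfo_has_sha_ni(cpuinfo_text):
    """Return True if the first `flags` line of /proc/cpuinfo lists sha_ni."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return "sha_ni" in value.split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            print("sha_ni:", cpuinfo_has_sha_ni(cpuinfo.read()))
    except OSError:
        print("no readable /proc/cpuinfo on this platform")
```

On machines without the flag, such as the CI machines mentioned in the commit message, the `jnz` is not taken and the existing code paths run instead.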
