BoringSSL can currently be built in C11 or pre-C11 mode in MSVC. They're
broadly the same, but do use completely different implementations of
alignas and alignof. Now that every build configuration I'm aware of has
been moved to the C11 mode, we don't even test the pre-C11 mode anymore.
Start requiring the C11 mode.
Update-Note: If building with MSVC, BoringSSL now requires building with
/std:c11 or later. (On non-MSVC compilers, we have required C11 for a
while now.)
Fixed: 624
Change-Id: Ie9f66eee0bebac8143c23a7229c6854afaefea6e
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/63065
Commit-Queue: Adam Langley <agl@google.com>
Auto-Submit: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
https://boringssl-review.googlesource.com/c/boringssl/+/62585 made the
compiler emit multiple CRYPTO_library_init calls in functions which
dispatch between a tower of alternatives. Ideally, the compiler would
know that at most one call suffices.
There doesn't seem to be such an attribute, but we can get the same
effect with pure or const attributes. We tie init with returning the
capability vector. On Intel, because the vector is so large, we have to
go with a weaker version. Somewhat annoyingly, the getter must be
out-of-line, because otherwise the compiler inlines first and loses the
attribute.
I went with pure because we allow our unit tests to mutate
OPENSSL_armcap_P, which means the Arm one is, strictly speaking, pure,
not const. This slightly reduces optimization potential, but should
still allow deduping in most places. Confirmed that aes_init_key
now only calls a helper function once.
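Roughly, the Arm getter ends up shaped like this (a sketch with a
hypothetical helper name; the attribute is the important part):
    #include <stdint.h>
    extern uint32_t OPENSSL_armcap_P;
    void CRYPTO_library_init(void);
    /* Out-of-line on purpose: if the compiler inlines the body first, the
     * attribute is lost. Pure lets repeated calls within one function be
     * folded into one call, so a tower of CRYPTO_is_*_capable checks runs
     * init only once. */
    #if defined(__GNUC__) || defined(__clang__)
    __attribute__((pure))
    #endif
    uint32_t crypto_get_armcap(void);
    uint32_t crypto_get_armcap(void) {
      CRYPTO_library_init();
      return OPENSSL_armcap_P;
    }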
See discussion in
https://boringssl-review.googlesource.com/c/boringssl/+/62585/comment/26083b88_b3db2b75/
Bug: 35
Change-Id: I9bc464f0e5a0ed9601017a5037028f906693a137
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/62985
Auto-Submit: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
All the C accesses have been sufficiently abstracted that this is pretty
easy to handle automatically.
We still have accesses from assembly, so we're not quite
initializationless yet. But this does get us most of the way there. I'm
thinking what's next is:
- Make a list of asm symbols that touch armcap or ia32cap
- For each, figure out the place(s) in the calling code where we need to
init manually and/or pull the dispatch up into C
One interesting subtlety with how this CL does it: although this CL
means you can freely call, say, CRYPTO_is_SSSE3_capable without
CRYPTO_library_init, you cannot *quite* assume that CRYPTO_library_init
has been called after you call CRYPTO_is_SSSE3_capable. It is possible
that the build defined __SSSE3__, in which case CRYPTO_is_SSSE3_capable
does nothing. This does complicate resolving the asm cases above.
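For example, the SSSE3 check is roughly of this shape (a sketch; the bit
is CPUID.1:ECX bit 9):
    #include <stdint.h>
    const uint32_t *OPENSSL_ia32cap_get(void);
    static inline int CRYPTO_is_SSSE3_capable(void) {
    #if defined(__SSSE3__)
      /* Statically known from the target flags; OPENSSL_ia32cap_P is never
       * read, so nothing here forces CRYPTO_library_init to have run. */
      return 1;
    #else
      /* The getter performs initialization before returning the vector. */
      return (OPENSSL_ia32cap_get()[1] & (1u << 9)) != 0;
    #endif
    }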
Bug: 35
Change-Id: Ie52c74e4a59a7019c3af0526dbb35950604ada66
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/62585
Auto-Submit: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
I think this dates to when CRYPTO_is_*_capable were inline functions in
public headers, so they couldn't access OPENSSL_armcap_P directly. Now
they can.
Change-Id: Ic06fffa7f5056401118b62d690dfe6b21bc30f86
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/62345
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
cpu_arm_openbsd.c is the same as OPENSSL_STATIC_ARMCAP.
cpu_aarch64_freebsd.c is the same as cpu_aarch64_sysreg.c. (The FreeBSD
one was using the macros in their headers, but those macros expand to
the same inline assembly.)
Also send ANDROID_BAREMETAL + 32-bit Arm to OPENSSL_STATIC_ARMCAP. This
way we can remove OPENSSL_STATIC_ARMCAP from the Android baremetal build
without first having to constrain it to aarch64. See
b/291106677#comment6
Update-Note: This is a slight change to the OpenBSD build. Previously,
we assumed OpenBSD on 32-bit Arm implies NEON. Now, we pick it up from
the __ARM_NEON define from ACLE, i.e. whether the compiler has been told
that NEON is available. (This comes from -march or other options.) Doing
that is desirable anyway: if NEON is in your baseline, you should tell
your compiler so it can vectorize loops.
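Concretely, the static-capability path now keys NEON off the ACLE define,
along the lines of (sketch):
    /* Sketch: with OPENSSL_STATIC_ARMCAP, NEON comes from the compiler's
     * ACLE define rather than being assumed per-OS. */
    #if defined(OPENSSL_STATIC_ARMCAP) && defined(__ARM_NEON)
    #define OPENSSL_STATIC_ARMCAP_NEON
    #endif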
Change-Id: Icd43a2b56bb6e3f04f0fed996ae750fba65e3312
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/62066
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
nanolibc is an embedded platform with no threads. To start unforking
that build, generalize some of the OPENSSL_TRUSTY defines. OpenSSL has
OPENSSL_NO_SOCK if you don't have sockets and OPENSSL_NO_POSIX_IO if you
don't have file descriptors. Those names are fine enough, so I've
borrowed them here too.
There's more to be done here, but this will clear out some of it.
Change-Id: Iaba1fafdebb46ebb8f68b7956535dd0ccaaa832f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60890
Auto-Submit: David Benjamin <davidben@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
We no longer need to define CRYPTO_MUTEX in public headers. This
simplifies a pile of things. First, we can now use pthread_rwlock_t
without any fuss, rather than trying to guess the size on glibc.
As a result, CRYPTO_MUTEX and CRYPTO_STATIC_MUTEX can be merged into one
type. We can almost do this to CRYPTO_refcount_t too. BIO is the one
straggler remaining.
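With the type out of public headers, the definition can be roughly this
(a sketch; the no-threads arm is illustrative):
    #if defined(OPENSSL_PTHREADS)
    #include <pthread.h>
    typedef pthread_rwlock_t CRYPTO_MUTEX;
    #elif defined(OPENSSL_WINDOWS_THREADS)
    #include <windows.h>
    typedef SRWLOCK CRYPTO_MUTEX;
    #else
    typedef struct crypto_mutex_st {
      char padding;  /* no-threads build: nothing to lock */
    } CRYPTO_MUTEX;
    #endif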
Fixed: 325
Change-Id: Ie93c9f553c0f02ce594b959c041b00fc15ba51d2
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60611
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
Did 59000 Ed25519 key generation operations in 1004188us (58753.9 ops/sec) [+8.3%]
Did 57000 Ed25519 signing operations in 1005649us (56679.8 ops/sec) [+7.9%]
Did 19000 Ed25519 verify operations in 1054380us (18020.1 ops/sec) [-2.0%]
Did 61000 Curve25519 base-point multiplication operations in 1007401us (60551.9 ops/sec) [+8.3%]
Did 22000 Curve25519 arbitrary point multiplication operations in 1022882us (21507.9 ops/sec) [+0.5%]
Change-Id: I14668f658b1ae99850cb0f8938f90f988d0edd0b
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60107
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Every time we free a type with ex_data (RSA, EC_KEY, DSA, SSL_CTX, SSL,
SSL_SESSION, X509, X509_STORE), we allocate and take a read lock. The
allocation means, if we believe in malloc failures, it is possible to
leak memory on malloc failure. The read lock causes an unnecessary bit
of contention writing to the cache line.
Instead, since we never remove ex_data entries, just thread them in a
singly-linked list. This way we only need to synchronize when to stop
iterating. Add a counter to synchronize that. (Or we could make each
'next' pointer atomic, but this seemed more straightforward.)
(I suspect this doesn't matter much, but it was shorter and we were
already allocating the funcs structures anyway.)
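A sketch of the idea (illustrative names, not the exact internals):
    #include <stdatomic.h>
    #include <stddef.h>
    typedef struct ex_data_funcs_st {
      struct ex_data_funcs_st *next;  /* entries are never removed */
      int index;
    } EX_DATA_FUNCS;
    typedef struct {
      EX_DATA_FUNCS *head;       /* appended to, under a write lock */
      _Atomic size_t num_funcs;  /* incremented after the entry is linked */
    } EX_DATA_CLASS;
    /* Readers walk the list without a lock; the counter bounds how far. */
    static void for_each_funcs(EX_DATA_CLASS *cls,
                               void (*cb)(EX_DATA_FUNCS *, void *),
                               void *arg) {
      size_t n = atomic_load(&cls->num_funcs);
      for (EX_DATA_FUNCS *f = cls->head; n > 0; f = f->next, n--) {
        cb(f, arg);
      }
    }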
Bug: 570
Change-Id: Ie7ba5cc44f2b71ebd79c8971e784912d53af7f5c
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/60025
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
Auto-Submit: David Benjamin <davidben@google.com>
OPENSSL_C11_ATOMIC is both computed in crypto/internal.h and also
defined externally. This is a remnant of when C11 atomics were an opt-in
feature.
If defined externally, this means OPENSSL_C11_ATOMIC might be defined
when built as C++. That, in turn, causes <stdatomic.h> to be included in
C++ mode. At least one of our users' toolchains has a <stdatomic.h> that
is incompatible with C++. We don't get anything out of including it, so
just gate the include on !defined(__cplusplus) for now.
Things to look into as follow-up:
- Fix build files to stop defining OPENSSL_C11_ATOMIC. Prior to
https://boringssl-review.googlesource.com/c/boringssl/+/59847, it was
still serving a purpose: in server builds, if autodetection fails, we
would rather fail to build than accidentally fall back to locks.
There is no lock fallback anymore.
- Fix that toolchain so their <stdatomic.h> is C++-compatible. It's
certainly not C++23-conformant. I suspect it's also not
C++11-conformant, but I'm not positive.
Change-Id: I13bcd8380efeb87b9f9cc439fe24a743e48aec60
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59985
Auto-Submit: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
We don't take write locks in the PRNG, steady state, but we do take some
read locks: computing fork generation, reading the fork-unsafe buffering
flag, and a FIPS-only artifact of some global state clearing mess. That
last one is completely useless, but it's a consequence of FIPS's
understanding of process exit being comically inconsistent with reality.
Taking read locks is, in principle, parallel, but the cacheline write
causes some contention, even in newer glibcs with faster read locks. Fix
these:
- Use atomic reads to check the fork generation. We only need to lock
when we observe a fork.
- Replace the fork-unsafe buffering flag with an atomic altogether.
- Split state_clear_all_lock into a per-rand_thread_state lock. We still
need a read lock, but a completely uncontended one until process exit.
With many threads, this gives a significant perf boost.
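For example, the fork-unsafe buffering flag (second bullet above) becomes
just an atomic (sketch):
    #include <stdatomic.h>
    static _Atomic int g_fork_unsafe_buffering;
    void RAND_enable_fork_unsafe_buffering(int fd) {
      (void)fd;  /* historical parameter */
      atomic_store_explicit(&g_fork_unsafe_buffering, 1,
                            memory_order_relaxed);
    }
    static int rand_fork_unsafe_buffering_enabled(void) {
      /* A relaxed load: no lock, and no write to the cacheline. */
      return atomic_load_explicit(&g_fork_unsafe_buffering,
                                  memory_order_relaxed);
    }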
x86_64, non-FIPS, Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 30 threads:
Before:
Did 45131875 RNG (16 bytes) operations in 300039649us (150419.7 ops/sec): 2.4 MB/s
Did 44089000 RNG (32 bytes) operations in 300053237us (146937.3 ops/sec): 4.7 MB/s
Did 43328000 RNG (256 bytes) operations in 300058423us (144398.5 ops/sec): 37.0 MB/s
Did 45857000 RNG (1350 bytes) operations in 300095943us (152807.8 ops/sec): 206.3 MB/s
Did 43249000 RNG (8192 bytes) operations in 300102698us (144114.0 ops/sec): 1180.6 MB/s
After:
Did 296204000 RNG (16 bytes) operations in 300009524us (987315.3 ops/sec): 15.8 MB/s
Did 311347000 RNG (32 bytes) operations in 300014396us (1037773.5 ops/sec): 33.2 MB/s
Did 295104000 RNG (256 bytes) operations in 300012657us (983638.5 ops/sec): 251.8 MB/s
Did 255721000 RNG (1350 bytes) operations in 300016481us (852356.5 ops/sec): 1150.7 MB/s
Did 103339000 RNG (8192 bytes) operations in 300040059us (344417.3 ops/sec): 2821.5 MB/s
(Smaller PRNG draws are more impacted because they spend less time in the
DRBG. But they're also more likely because you rarely need to pull 8K of
data out at once.)
x86_64, FIPS, Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 30 threads:
Before:
Did 29060000 RNG (16 bytes) operations in 300081190us (96840.5 ops/sec): 1.5 MB/s
Did 31882000 RNG (32 bytes) operations in 300118031us (106231.5 ops/sec): 3.4 MB/s
Did 30925000 RNG (256 bytes) operations in 300113646us (103044.3 ops/sec): 26.4 MB/s
Did 31969000 RNG (1350 bytes) operations in 300096688us (106529.0 ops/sec): 143.8 MB/s
Did 33434000 RNG (8192 bytes) operations in 300093240us (111412.0 ops/sec): 912.7 MB/s
After:
Did 299013000 RNG (16 bytes) operations in 300012167us (996669.6 ops/sec): 15.9 MB/s
Did 289788000 RNG (32 bytes) operations in 300014611us (965913.0 ops/sec): 30.9 MB/s
Did 298699000 RNG (256 bytes) operations in 300013443us (995618.7 ops/sec): 254.9 MB/s
Did 247061000 RNG (1350 bytes) operations in 300018215us (823486.7 ops/sec): 1111.7 MB/s
Did 100479000 RNG (8192 bytes) operations in 300037708us (334887.9 ops/sec): 2743.4 MB/s
On an M1 Pro, it's mostly a wash by default (fewer threads because this chip has fewer cores).
aarch64, M1 Pro, 8 threads:
Before:
Did 23218000 RNG (16 bytes) operations in 80009131us (290191.9 ops/sec): 4.6 MB/s
Did 23021000 RNG (256 bytes) operations in 80007544us (287735.4 ops/sec): 73.7 MB/s
Did 22853000 RNG (1350 bytes) operations in 80013184us (285615.4 ops/sec): 385.6 MB/s
Did 25407000 RNG (8192 bytes) operations in 80008371us (317554.3 ops/sec): 2601.4 MB/s
Did 22128000 RNG (16384 bytes) operations in 80013269us (276554.1 ops/sec): 4531.1 MB/s
After:
Did 23303000 RNG (16 bytes) operations in 80011433us (291245.9 ops/sec): 4.7 MB/s
Did 23072000 RNG (256 bytes) operations in 80008755us (288368.4 ops/sec): 73.8 MB/s
Did 22807000 RNG (1350 bytes) operations in 80013355us (285039.9 ops/sec): 384.8 MB/s
Did 23759000 RNG (8192 bytes) operations in 80010212us (296949.6 ops/sec): 2432.6 MB/s
Did 23193000 RNG (16384 bytes) operations in 80011537us (289870.7 ops/sec): 4749.2 MB/s
This is likely because, without RDRAND or MADV_WIPEONFORK, we draw from
the OS on every call. We're likely bottlenecked by getentropy, whether
it's some internal synchronization or syscall overhead. With
fork-unsafe buffering enabled, this change shows even more significant
wins on the M1 Pro.
aarch64, fork-unsafe buffering, M1 Pro, 8 threads:
Before:
Did 25727000 RNG (16 bytes) operations in 80010579us (321545.0 ops/sec): 5.1 MB/s
Did 25776000 RNG (32 bytes) operations in 80008587us (322165.4 ops/sec): 10.3 MB/s
Did 25780000 RNG (256 bytes) operations in 80006127us (322225.3 ops/sec): 82.5 MB/s
Did 33171250 RNG (1350 bytes) operations in 80002532us (414627.5 ops/sec): 559.7 MB/s
Did 54784000 RNG (8192 bytes) operations in 80005706us (684751.2 ops/sec): 5609.5 MB/s
After:
Did 573826000 RNG (16 bytes) operations in 80000668us (7172765.1 ops/sec): 114.8 MB/s
Did 571329000 RNG (32 bytes) operations in 80000423us (7141574.7 ops/sec): 228.5 MB/s
Did 435043750 RNG (256 bytes) operations in 80000214us (5438032.3 ops/sec): 1392.1 MB/s
Did 229536000 RNG (1350 bytes) operations in 80001888us (2869132.3 ops/sec): 3873.3 MB/s
Did 57253000 RNG (8192 bytes) operations in 80004974us (715618.0 ops/sec): 5862.3 MB/s
Note that, on hardware with RDRAND, the read lock in
rand_fork_unsafe_buffering_enabled() doesn't do much. But without
RDRAND, we hit that on every RAND_bytes call. More importantly, the
subsequent CL will fix a bug that will require us to hit it more
frequently.
I've removed the volatile on g_fork_detect_addr because I think we
didn't need it and this avoids thinking about the interaction between
volatile and atomics. The pointer is passed into madvise, so the
compiler knows the pointer escapes. For it to be invalid, the compiler
would need to go out of its way to model madvise as not remembering the
pointer, which would be incorrect for MADV_WIPEONFORK.
Bug: 570
Cq-Include-Trybots: luci.boringssl.try:linux_clang_rel_tsan
Change-Id: Ie6977acd1b8e7639aaa419cf6f4f5f0645bde9d1
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59849
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
refcount.c is now a single, generic file that calls into C11-atomic-like
APIs. Behind the scenes, this selects one of C11 atomics, Windows
interlocked APIs, or unsynchronized reads/writes (in the no-threads
build).
This frees us up to use atomics elsewhere in the library. For now, this
only binds sequentially consistent atomics, but we can add other memory
orders if needed. In particular, I believe up_ref only needs relaxed
atomics. Some of the later changes, I think, only need acquire and release,
but I'm not positive.
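A sketch of the generic implementation over C11-style atomics (the real
code routes through the compatibility layer so Windows can sit on
Interlocked* underneath):
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdlib.h>
    typedef _Atomic uint32_t CRYPTO_refcount_t;
    void CRYPTO_refcount_inc(CRYPTO_refcount_t *count) {
      uint32_t expected = atomic_load(count);
      while (expected != UINT32_MAX) {  /* saturate rather than overflow */
        if (atomic_compare_exchange_weak(count, &expected, expected + 1)) {
          return;
        }
      }
    }
    int CRYPTO_refcount_dec_and_test_zero(CRYPTO_refcount_t *count) {
      uint32_t expected = atomic_load(count);
      for (;;) {
        if (expected == 0) {
          abort();  /* underflow indicates a bug */
        }
        if (expected == UINT32_MAX) {
          return 0;  /* saturated objects are never freed */
        }
        if (atomic_compare_exchange_weak(count, &expected, expected - 1)) {
          return expected == 1;
        }
      }
    }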
Bug: 570
Cq-Include-Trybots: luci.boringssl.try:linux_clang_rel_tsan
Change-Id: Ifcd7357611bb7a8cd14b82c23ad080d1a2df1386
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59848
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
On Windows, we can rely on Interlocked APIs. On non-Windows builds, we
currently require C11 but permit C11 atomics to be missing, via
__STDC_NO_ATOMICS__. This CL tightens this so C11 atomics are required
on non-MSVC builds.
My hope is that, now that we require C11 on non-Windows, this is a
fairly safe requirement. We already require pthreads on any platform
where this might apply, and it's hard to imagine someone has C11,
pthreads, but not C11 atomics.
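The requirement amounts to a check of roughly this shape (sketch):
    #if !defined(__cplusplus) && !defined(_MSC_VER)
    #if defined(__STDC_NO_ATOMICS__)
    #error "BoringSSL requires C11 atomics on non-MSVC builds"
    #endif
    #include <stdatomic.h>
    #endif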
This change means that, in later work, we can refactor the refcount
logic to instead be a <stdatomic.h> compatibility layer, and then an
atomics-targeting CRYPTO_refcount_t implementation. With a
<stdatomic.h> compatibility layer, we can use atomics in more places,
notably where our uses of read locks are causing cacheline contention.
The platform restriction isn't *strictly* necessary. We could, like with
refcounts, emulate <stdatomic.h> with a single, global lock. Indeed any
platforms in this situation have already been living with that lock for
refcounts without noticing. But then later work to add "atomics" to read
locks would regress contention for those platforms. So I'm starting by
rejecting this, so if any such platform exists, we can understand their
performance needs before doing that.
Update-Note: On non-Windows platforms, we now require C11 atomics
support. Note we already require C11 itself. If this affects your build,
get in touch with BoringSSL maintainers.
Bug: 570
Cq-Include-Trybots: luci.boringssl.try:linux_clang_rel_tsan
Change-Id: I868fa4ba87ed73dfc9d52e80d46853ef56715a5f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59847
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Right now, MSVC has to fall back to refcount_lock.c, which uses a single,
global lock for all refcount operations. Instead, use the Interlocked*
APIs to implement them.
The motivation is two-fold. First, this removes a performance cliff when
building for Windows on a non-Clang compiler. (Although I've not been
able to measure it in an end-to-end EVP benchmark, only a synthetic
refcount-only benchmark.)
More importantly, it gets us closer to assuming atomics support on all
non-NO_THREADS configurations. (The next CL will clear through that.)
That, in turn, will make it easier to add atomics-like abstractions
to some of our hotter synchronization points. (Even in newer glibc, with
its better rwlock, read locks fundamentally need to write to memory, so
we have some cacheline contention on shared locks.)
Annoyingly, the Windows atomic_load replacement is not quite right. I've
used a "no-op" InterlockedCompareExchange(p, 0, 0) which, empirically,
still results in a write. But a write to the refcount cacheline is
surely better than taking a global exclusive lock. See comments in file
for details. OpenSSL uses InterlockedOr(p, 0), but that actually results
in even worse code. (InterlockedOr needs a retry loop when the
underlying cmpxchg fails, whereas InterlockedCompareExchange is a single
cmpxchg.)
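The load replacement is roughly (sketch):
    #include <windows.h>
    #include <stdint.h>
    /* A "no-op" compare-exchange: returns the current value; it only
     * stores when *count == 0, and even then stores the same value back.
     * The lock cmpxchg still dirties the cacheline, but that beats a
     * global exclusive lock. */
    static uint32_t refcount_load(volatile LONG *count) {
      return (uint32_t)InterlockedCompareExchange(count, 0, 0);
    }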
Hopefully, in the future (perhaps when we require VS 2022's successor,
based on [1]), this can be removed in favor of C11 atomics everywhere.
[1] https://devblogs.microsoft.com/cppblog/c11-atomics-in-visual-studio-2022-version-17-5-preview-2/
Bug: 570
Cq-Include-Trybots: luci.boringssl.try:linux_clang_rel_tsan
Change-Id: I125da139e2fd3ae51e54309309fda16ba97ccf20
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/59846
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Remove all the other ERR_R_MALLOC_FAILUREs from the codebase.
Also change CBB to push to the error stack, so CBB failures are
reported correctly instead of all being reported as malloc failures.
Previously, every CBB failure was turned into a malloc failure.
Bug: 564
Change-Id: Ic13208bf9d9aaa470e83b2f15782fc94946bbc7b
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/57046
Auto-Submit: Bob Beck <bbe@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: David Benjamin <davidben@google.com>
This includes an internal version with a flag to select between the
system malloc and OPENSSL_malloc. That, in turn, lets us use this
function in the ERR family of functions without ERR calling
OPENSSL_malloc and creating a circular dependency.
Bug: 564
Change-Id: Ifd02d062fda9695cddbb0dbef2e1c1db0802a486
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/57005
Auto-Submit: Bob Beck <bbe@google.com>
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
Malloc failure testing is quadratic in the number of allocations. To
test a failure at allocation N, we must first run the previous N-1
allocations. Now that we have combined GTest binaries, this does not
work very well.
Use the test listener to reset the counter across independent tests. We
assume failures in a previous test won't interfere with the next one and
run each test's counter in parallel.
The assumption isn't *quite* true because we have a lot of internal
init-once machinery that is reused across otherwise "independent" tests,
but it's close enough that I was able to find some bugs, fixed in the
next commit. That said, the tests still take too long to run to
completion.
Bug: 127
Change-Id: I6836793448fbdc740a8cc424361e6b3dd66fb8a6
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/56926
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
We check OAEP padding in constant time, but once the padding is
determined to be valid (or not), this fact and, if valid, the output
length are public.
Change-Id: I2aa6a707ca9a91761776746264416736c820977c
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/56845
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
Newer versions of clang figure out that copy_from_prebuf (used in builds
that aren't x86_64 with assembly optimizations) has a bunch of no-op
iterations and insert a branch. Add a value barrier to stop it. This was
caught by our valgrind-based constant-time validation.
As part of this, I noticed that OPENSSL_NO_ASM builds turn off value
barriers. This is because the value barriers use an empty inline asm
block. While this is technically correct, it's probably unnecessary.
The clang|gcc check means we know GCC-style inline assembly is
supported. Disabling inline asm is used by sanitizers to shut off
uninstrumentable code, but there's no uninstrumentable code in the empty
string. It's also used by consumers who haven't figured out how to
integrate an assembler into their build system, but that also doesn't
apply. So just remove the condition on the value barriers so
OPENSSL_NO_ASM builds also get mitigations.
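For reference, the barrier itself is just an empty asm block (sketch):
    #include <stdint.h>
    /* The compiler must assume the asm may modify |v|, so it cannot prove
     * the value is constant and fold it into a branch. The empty string
     * contains nothing to instrument or assemble, so this is fine even in
     * OPENSSL_NO_ASM and sanitizer builds. */
    static inline uint64_t value_barrier_u64(uint64_t v) {
    #if defined(__GNUC__) || defined(__clang__)
      __asm__("" : "+r"(v));
    #endif
      return v;
    }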
Update-Note: It is possible the above is wrong and some OPENSSL_NO_ASM
build relied on value barriers being disabled. If so, this will break that
build and we'll need to reconsider.
Change-Id: I6e3ea3ee705bef3afcf42d3532b17aaabbbcc60b
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/56827
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
This silences a few false positives in the valgrind-based constant-time
validation.
First, there are a few precondition checks that are publicly true, but
valgrind doesn't know that. I've added a constant_time_declassify_int
function and stuck those in there, since the existing macro is mostly
suited for macros. It also adds a value barrier in production code (see
comment for why). If we more thoroughly decoupled RSA from BIGNUM, we
could probably avoid this, since a lot of it comes from going through
public BIGNUM APIs.
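The helper is roughly of this shape (a sketch; the macro and the
value-barrier helper are assumed names):
    /* In the valgrind-based validation build, mark the value as no longer
     * secret; in all builds, a value barrier keeps the compiler from
     * exploiting the publicly-true precondition. */
    static inline int constant_time_declassify_int(int v) {
    #if defined(BORINGSSL_CONSTANT_TIME_VALIDATION)
      CONSTTIME_DECLASSIFY(&v, sizeof(v));
    #endif
      return (int)value_barrier_u32((uint32_t)v);
    }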
Next, our BIGNUM strategy is such that bounds on bignums are sometimes
computed pessimally, and then clamped down later. Modular arithmetic is
trivially bounded and avoids that, but RSA CRT involves some non-modular
computations. As a result, we actually compute a few more words than
necessary in the RSA result, and then bn_resize_words down.
bn_resize_words also has a precondition check, which checks that all
discarded words are zero. They are, but valgrind does not know that.
Similarly, the BN_bn2bin_padded call at the end checks for discarded
non-zero bytes, but valgrind does not know the discarded bytes are zero
(they are, because the output is bounded by n).
I've added a bn_assert_fits_in_bytes to clear this. It's an assert in
debug mode and a declassification in constant-time validation.
I suspect a different secret integer design would avoid needing this. I
think this comes from a combination of non-modular arithmetic, not
having callers pass explicit width, and tracking public widths at the
word granularity, rather than byte or bit. (Bit would actually be most
ideal.) Maybe worth a ponder sometime.
Change-Id: I1bc9443d571d2881e2d857c70be913074deac156
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/56825
Commit-Queue: Bob Beck <bbe@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
We no longer have a need to support ppc64le, nor do we have any testing
story for the assembly we previously had. Remove all ppc64le-specific
assembly.
This CL stops short of removing it from base.h. That'll be done in a
follow-up CL, just to separate which removals are for the assembly and
which removals remove all support.
Update-Note: After this change, ppc64le builds drop assembly
optimizations and will fall back to a generic C-based AES implementation.
Change-Id: Ic8075638085761d66cebc276eb16c4770ce03920
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/56388
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Starting with version 2019 16.8 (released November 2020), MSVC actually
implements parts of C11, though disabled by default. You have to pass
/std:c11, and then alignas, alignof, and noreturn all work.
When built that way, it's better to use the real ones, so check for
__STDC_VERSION__ first. It would be nice to mandate that so we can
remove the polyfill, but for now just opportunistically use it when we
can.
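The check ends up looking something like this (sketch):
    /* Prefer the real C11 keywords when the compiler advertises them;
     * only fall back to the MSVC-specific spellings otherwise. */
    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    #include <stdalign.h>
    #elif defined(_MSC_VER) && !defined(__cplusplus)
    #define alignas(x) __declspec(align(x))
    #define alignof __alignof
    #endif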
Sadly, even in this mode, they still don't implement C11 atomics, so the
refcounting implementation will still be slow.
Change-Id: I28dab4a339c368f7d8f8da5aa7aee1cb344803d3
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/53006
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
I don't think these are all UB by C's rules, but it's easier not to
think about the pointers. Still more to go, but these were some easy
ones.
Bug: 301
Change-Id: Icdcb7fb40f85983cbf566786c5f7dbfd7bb06571
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52905
Reviewed-by: Bob Beck <bbe@google.com>
Commit-Queue: Bob Beck <bbe@google.com>
This is cribbed, with permission, from AWS-LC. The FIPS service
indicator[1] signals when an approved service has been completed.
[1] FIPS 140-3 IG 2.4.C
Change-Id: Ib40210d69b3823f4d2a500b23a1606f8d6942f81
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52568
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>
When introducing EC_SCALAR and EC_FELEM, I used unions as a convenience
for converting to and from the byte representation. However,
type-punning with unions is not allowed in C++ and hard to use correctly
in C. As I understand the rules, they are:
- The abstract machine knows what member of union was last written to.
- In C, reading from an inactive member is defined to type-pun. In C++,
it is UB though some compilers promise the C behavior anyway.
- However, if you read or write from a *pointer* to a union member, the
strict aliasing rule applies. (A function passed two pointers of
different types otherwise needs to pessimally assume they came from
the same union.)
That last rule means the type-punning allowance doesn't apply if you
take a pointer to an inactive member, and it's common to abstract
otherwise direct accesses of members via pointers.
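An illustration of the trap (hypothetical types, not the actual EC_SCALAR
definition):
    #include <stdint.h>
    typedef union {
      uint64_t words[4];
      uint8_t bytes[32];
    } SCALAR;
    void fill_bytes(SCALAR *s, const uint8_t in[32]) {
      for (int i = 0; i < 32; i++) {
        s->bytes[i] = in[i];  /* direct write: |bytes| is now the active member */
      }
    }
    uint64_t first_word(const SCALAR *s) {
      /* A pointer-indirected read of the inactive |words| member. Under a
       * strict reading of the rules, this is no longer covered by the
       * union type-punning allowance and falls under strict aliasing. */
      const uint64_t *w = s->words;
      return w[0];
    }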
https://github.com/openssl/openssl/issues/18225 is an example where
similar union tricks have caused problems for OpenSSL. While we don't
have that code, EC_SCALAR and EC_FELEM play similar tricks.
We do get a second lifeline because our alternate view is a uint8_t,
which we require to be unsigned char. Strict aliasing always allows the
pointer type to be a character type, so pointer-indirected accesses of
EC_SCALAR.bytes aren't necessarily UB. But if we ever write to
EC_SCALAR.bytes directly (and we do), we'll switch the active arm and
then pointers to EC_SCALAR.words become strict aliasing violations!
This is all far too complicated to deal with. Ideally everyone would
build with -fno-strict-aliasing because no real C code actually follows
these rules. But we don't always control our downstream consumers'
CFLAGS, so let's just avoid the union. This also avoids a pitfall if we
ever move libcrypto to C++.
For p224-64.c, I just converted the representations directly, which
avoids worrying about the top 32 bits in p224_felem_to_generic. Most of
the rest was words vs. bytes conversions and boils down to a cast (we're
still dealing with a character type, at the end of the day). But I took
the opportunity to extract some more "words"-based helper functions out
of BIGNUM, so the casts would only be in one place. That too saves us
from the top bits problem in the bytes-to-words direction.
Bug: 301
Change-Id: I3285a86441daaf824a4f6862e825d463a669efdb
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52505
Commit-Queue: Bob Beck <bbe@google.com>
Reviewed-by: Bob Beck <bbe@google.com>
MSVC is a little behind, but otherwise we should be able to assume C11
support in all our compilers. The only C99 builds should just be stale
build files. Such consumers are leaving performance on the table, by
using the worse refcounting implementation.
For now, don't require it in public headers. Android's build is still
defaulting to C99, which means requiring C11 will be disruptive. We can
try the public headers after that's fixed.
Update-Note: If the build fails with an error about C11, remove -std=c99
or -std=gnu99 from your build. Refcounting will get faster.
Change-Id: I2ec6f7d7acc026a451851d0c38f60c14bae6b00f
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52247
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
On Arm, our CRYPTO_is_*_capable functions check the corresponding
preprocessor symbol. This allows us to automatically drop dynamic checks
and fallback code when some capability is always available.
This CL does the same on x86, as well as consolidates our
OPENSSL_ia32cap_P checks in one place. Since this abstraction is
incompatible with some optimizations we do around OPENSSL_ia32cap_get()
in the FIPS module, I've marked the symbol __attribute__((const)), which
is enough to make GCC and Clang do the optimizations for us. (We already
do the same to DEFINE_BSS_GET.)
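The effect is roughly this (a sketch; the AES-NI bit is CPUID.1:ECX bit 25):
    #include <stdint.h>
    /* const tells the compiler repeated calls return the same pointer, so
     * it can hoist and dedupe them, much like DEFINE_BSS_GET. */
    #if defined(__GNUC__) || defined(__clang__)
    __attribute__((const))
    #endif
    const uint32_t *OPENSSL_ia32cap_get(void);
    static inline int CRYPTO_is_AESNI_capable(void) {
    #if defined(__AES__)
      return 1;  /* statically known from -maes / -march */
    #else
      return (OPENSSL_ia32cap_get()[1] & (1u << 25)) != 0;
    #endif
    }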
Most x86 platforms support a much wider range of capabilities, so this
is usually a no-op. But, notably, all x86_64 Mac hardware has SSSE3
available, so this allows us to statically drop an AES implementation.
(On macOS with -Wl,-dead_strip, this seems to trim 35080 bytes from the
bssl binary.) Configs like -march=native can also drop a bunch of code.
Update-Note: This CL may break build environments that incorrectly mark
some instruction as statically available. This is unlikely to happen
with vector instructions like AVX, where the compiler could freely emit
them anyway. However, instructions like AES-NI might be set incorrectly.
Change-Id: I44fd715c9887d3fda7cb4519c03bee4d4f2c7ea6
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51548
Reviewed-by: Adam Langley <agl@google.com>
fips_break_test.h is a bad name because generate_build_files.py thinks
that it's a test file, which it is, but one that's needed in the main
build. Thanks to Svilen Kanev for noting this.
That header doesn't particularly carry its weight. The idea was that
rebuilding the break test wouldn't need to rebuild everything if that
logic was isolated in its own header. But we only have to rebuild once
now, so whatever. There's already a block of crypto/internal.h with very
similar stuff; it can go there.
Change-Id: Ifb479eafd4df9a7aac4804cae06ba87257c77fc3
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51485
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
AS10.20 requires that the self-test for the integrity algorithm pass
before the integrity check itself. IG 10.3.A requires an HMAC self-test
now. Therefore run these tests before the integrity check.
Since we also need the ability to run all self-tests, both SHA
self-tests and the HMAC test are run again when running self-tests.
I'm assuming that they're so fast that it doesn't matter.
Change-Id: I6b23b6fd3cb6107edd7420bc8680780719bd41d2
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51328
Reviewed-by: David Benjamin <davidben@google.com>
The provision of FIPS that allowed the tests to be skipped based on a
flag-file has been removed in 140-3. Therefore we expect to run the fast
KATs on start-up, but to defer the slower ones until the functionality in
question is first used. So this change splits off the fast KATs and
removes support for skipping KATs based on a flag-file.
Change-Id: Ib24cb1739cfef93e4a1349d786a0257ee1083cfb
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/51326
Reviewed-by: David Benjamin <davidben@google.com>
The latest version of ACLE splits __ARM_FEATURE_CRYPTO into two defines
to reflect that, starting with ARMv8.2, the cryptography extension can
include {AES,PMULL} and {SHA1,SHA256} separately.
Also standardize on __ARM_NEON, which is the recommended symbol from
ACLE, and the only one defined on non-Apple aarch64 targets. Digging
through GCC history, __ARM_NEON__ is a bit older. __ARM_NEON was added
in GCC's 9e94a7fc5ab770928b9e6a2b74e292d35b4c94da from 2012, part of GCC
4.8.0.
I suspect we can stop paying attention to __ARM_NEON__ at this point,
but I've left both working for now. __ARM_FEATURE_{AES,SHA2} is definitely
too new to fully replace __ARM_FEATURE_CRYPTO.
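So the static-capability defines now accept either spelling, roughly
(sketch):
    /* Accept the old combined define or the new split ACLE defines when
     * deciding the crypto extensions are statically known. */
    #if defined(__ARM_FEATURE_CRYPTO) || defined(__ARM_FEATURE_AES)
    #define OPENSSL_STATIC_ARMCAP_AES
    #define OPENSSL_STATIC_ARMCAP_PMULL
    #endif
    #if defined(__ARM_FEATURE_CRYPTO) || defined(__ARM_FEATURE_SHA2)
    #define OPENSSL_STATIC_ARMCAP_SHA1
    #define OPENSSL_STATIC_ARMCAP_SHA256
    #endif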
Tested on Linux that -march=armv8-a+aes now also drops the fallback AES
code. Previously, we would pick up -march=armv8-a+crypto, but not
-march=armv8-a+aes. Also tested that, on an OPENSSL_STATIC_ARMCAP build,
-march=armv8-a+sha2 sets the SHA-1 and SHA-256 features.
Change-Id: I749bdbc501ba2da23177ddb823547efcd77e5c98
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/50847
Reviewed-by: Adam Langley <agl@google.com>
These symbols were not marked OPENSSL_EXPORT, so they weren't really
usable externally anyway. They're also very sensitive to various build
configuration toggles, which don't always get reflected into projects
that include our headers. Move them to crypto/internal.h.
Change-Id: I79a1fcf0b24e398d75a9cc6473bae28ec85cb835
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/50846
Reviewed-by: Adam Langley <agl@google.com>
We have a ton of per-file rotation functions, often with generic names
that do not tell you whether they are uint32_t vs uint64_t, or rotl vs
rotr.
Additionally, (x >> r) | (x << (32 - r)) is UB at r = 0.
(x >> r) | (x << ((-r) & 31)) works for 0 <= r < 32, which is what
cast.c does. GCC and Clang recognize this pattern as a rotate, but MSVC
doesn't. MSVC does, however, provide functions for this.
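The shared helper is roughly (sketch):
    #include <stdint.h>
    #if defined(_MSC_VER)
    #include <stdlib.h>  /* _rotr */
    #endif
    static inline uint32_t CRYPTO_rotr_u32(uint32_t value, int shift) {
    #if defined(_MSC_VER)
      return _rotr(value, shift);
    #else
      /* Avoids the r == 0 UB; GCC and Clang recognize this as a rotate. */
      return (value >> shift) | (value << ((-shift) & 31));
    #endif
    }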
We usually rotate by a non-zero constant, which makes this moot, but
rotation comes up often enough that it's worth extracting out. Some
particular changes to call out:
- I've switched sha256.c from rotl to rotr. There was a comment
explaining why it differed from the specification. Now that we have
both functions, it's simpler to just match the specification.
- I've dropped all the inline assembly from sha512.c. Compilers should
be able to recognize rotations in 2021.
Change-Id: Ia1030e8bfe94dad92514ed1c28777447c48b82f9
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/49765
Reviewed-by: Adam Langley <agl@google.com>
In configurations without threads, we're not thread-safe anyway. Instead
use the refcount_lock.c implementation which, in turn, calls into
thread_none.c, so this turns into a plain refcount.
This avoids a build issue on platforms which define NO_THREADS, use C11,
lack C11 atomics, and are missing a __STDC_NO_ATOMICS__ definition. Those
platforms ought to define __STDC_NO_ATOMICS__ or implement them, but
atomics are also unnecessary overhead in NO_THREADS configurations
anyway.
Change-Id: I927e1825dd6474d95226b93dad704594f120450a
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/48565
Commit-Queue: David Benjamin <davidben@google.com>
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: Adam Langley <agl@google.com>
Also use a slightly more conservative pattern. Instead of aligning the
pointer as a uintptr_t and casting back, compute the offset and advance
in pointer space. C guarantees that casting from pointer to uintptr_t
and back gives the same pointer, but general integer-to-pointer
conversions are implementation-defined. GCC does define them in the
useful way, but this approach relies on fewer guarantees.
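Concretely, the helper looks something like this (sketch):
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    static inline void *align_pointer(void *ptr, size_t alignment) {
      assert((alignment & (alignment - 1)) == 0);  /* power of two */
      /* Compute the offset from the integer value, but advance the
       * original pointer rather than casting the integer back. */
      uintptr_t offset = (0u - (uintptr_t)ptr) & (alignment - 1);
      ptr = (char *)ptr + offset;
      assert(((uintptr_t)ptr & (alignment - 1)) == 0);
      return ptr;
    }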
Change-Id: I70c7af735e892fe7a8333b78b39d7b1f3f1cdbef
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/48405
Reviewed-by: Adam Langley <alangley@gmail.com>
We have loads of variations of these. Align them in one set. This avoids
the HOST_* macros defined by md32_common.h, so it'll be a little easier
to make it a more conventional header.
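One representative helper (a sketch, assuming a little-endian host; a
big-endian build would byte-swap the result):
    #include <stdint.h>
    #include <string.h>
    /* An unaligned little-endian load via memcpy; compilers lower this to
     * a plain load where the target permits it. */
    static inline uint32_t CRYPTO_load_u32_le(const void *in) {
      uint32_t v;
      memcpy(&v, in, sizeof(v));
      return v;
    }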
Change-Id: Id47fe7b51a8f961bd87839f8146d8a5aa8027aa6
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/46425
Reviewed-by: Adam Langley <agl@google.com>
It's now 2021. Hopefully we can at least assume anyone building with
-std=c11 also has a corresponding set of headers. Plus, even if you
don't, Clang seems to provide a header. (So C11 atomics work in
clang-cl.) Also apparently atomics are optional, so this checks
__STDC_NO_ATOMICS__.
This does *not* set C11 as the minimum version. If you build with
-std=c99, we'll silently use the non-atomics implementation. That's a
little magical, so I've kept OPENSSL_C11_ATOMIC as a way to assert that
you really want C11 atomics. Mostly it turns into a -std=c11 && !MSVC
self-assert.
Update-Note: If something fails to compile, we'll revert this and adjust
the check, or add an opt-out, or give up. Also, if building with
-std=c99, consider -std=c11.
Change-Id: I1a8074c367a765c5a0f087db8c250e050df2dde8
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/46344
Reviewed-by: Adam Langley <agl@google.com>
In order to provide evidence to auditors that high-level functions end
up calling into the FIPS module, provide counters that allow for such
monitoring.
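A sketch of the mechanism (illustrative names; only one counter shown):
    #include <stdatomic.h>
    #include <stddef.h>
    enum fips_counter_t {
      fips_counter_evp_aes_128_gcm,
      fips_counter_max,
    };
    static _Atomic size_t g_counters[fips_counter_max + 1];
    /* Called from inside the module at the points being monitored. */
    void boringssl_fips_inc_counter(enum fips_counter_t counter) {
      atomic_fetch_add(&g_counters[counter], 1);
    }
    /* Callers read the counter before and after a high-level operation to
     * confirm the call reached the module. */
    size_t FIPS_read_counter(enum fips_counter_t counter) {
      return atomic_load(&g_counters[counter]);
    }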
Change-Id: I55d45299f3050bf58077715ffa280210db156116
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/46124
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: David Benjamin <davidben@google.com>