This PR implements a work-stealing thread pool for use inside
EventEngine implementations. Because of historical risks here, I've
guarded the new implementation behind an experiment flag:
`GRPC_EXPERIMENTS=work_stealing`. Current default behavior is the
original thread pool implementation.
Benchmarks look very promising:
```
bazel test \
--test_timeout=300 \
--config=opt -c opt \
--test_output=streamed \
--test_arg='--benchmark_format=csv' \
--test_arg='--benchmark_min_time=0.15' \
--test_arg='--benchmark_filter=_FanOut' \
--test_arg='--benchmark_repetitions=15' \
--test_arg='--benchmark_report_aggregates_only=true' \
test/cpp/microbenchmarks:bm_thread_pool
```
2023-05-04: `bm_thread_pool` benchmark results on my local machine (64
core ThreadRipper PRO 3995WX, 256GB memory), comparing this PR to
master:
![image](https://user-images.githubusercontent.com/295906/236315252-35ed237e-7626-486c-acfa-71a36f783d22.png)
2023-05-04: `bm_thread_pool` benchmark results in the Linux RBE
environment (unsure of machine configuration, likely small), comparing
this PR to master.
![image](https://user-images.githubusercontent.com/295906/236317164-2c5acbeb-fdac-4737-9b2d-4df9c41cb825.png)
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
This allows us to replace `absl::optional<TaskHandle>` with checks
against the invalid handle.
This PR also replaces the differently-named invalid handle instances
with a uniform way of accessing static invalid instances across all
handle types, which aids a bit in testing.
* rename task-handle mutex
* rename TimerClosure
* tmp commit, building, not tested
* add test for client connection to a non-listening port
* fix posix EE tests
* re-fix windows test suite after posix compatibility
* (unfinished, backup): passing the suite's NonExistingListener client test
* remove redundant windows client test
* integrate IOCP worker loop
* initial commit of echo test tool; fixes
* move echo client to test_suite/tools; I do not yet like the setup, it's about time for a macro that generates all useful test/tool targets
* cleanup
* add --target arg to echo_client. requires URI
* Automated change: Fix sanity tests
* build fixes
* fix
* fix
* reviewer feedback
* warning fix
* delete logic on cancellation
* fix
* cancel connect deadlock; improved template code
* fix build failure with linux environments building windows targets
* fix
* fix
* no ++ for atomics
* remove the test changes, to be landed separately
* Automated change: Fix sanity tests
* remnants
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
* Reland x2: Make GetDefaultEventEngine return a shared_ptr
* remove thread leak from NativeDNSResolver
This is not going to work for resolvers that support cancellation.
* give resolvers bounded lifetimes
Some resolver own EventEngines. EventEngines cannot run off the end of
the process since they have unjoined threads (problematic in a small set
of environments). This gives resolvers bounded lifetimes, and allows
replacement of resolvers without ASAN issues of deleting resolvers in
active use (occurs in tests).
* fix
* fix windows
* fix surface init test
* fix
* sanitize
* use after move
* the test must wait for the callback to be destroyed
* windows fix: delete the resolver on iomgr shutdown, not before
* Make TimerManager threads non-joinable
On gRPC shutdown, any unjoined TimerManager threads will cause TSAN to
detect thread leaks. This fix resolves issues I saw in end2end test
shutdown in another PR, where a single timer manager thread was always
alive after the test ended.
The long-term solution is to integrate the new ThreadPool here, but this
unblocks me for now.
* backport fix
* fix
* shared_ptr<EventEngine> in EventEngine benchmarks
* Revert "Revert "[event_engine] Thread pool that can handle deletion in a callback (#30763)" (#30972)"
This reverts commit ccc787a020.
* Update thread_pool.cc
* Reland: "Make GetDefaultEventEngine return a shared_ptr (#30280)"
This reverts commit 45959e7cc1.
* Attempted fix with NoDestruct
* Not a process-wide singleton for the type. Just a NonDestruct
* fix
This works around valgrind memory leaks by giving EventEngines a fixed
lifetime. We eventually want ref-counted EventEngines internally, so this is
a step in the right direction as well.
A (currently) pthread_atfork-based fork support mechanism, allowing EventEngines - or any other object that wants to implement the Forkable interface - respond to forks.
* Refactor end2end tests to exercise each EventEngine
* fix incorrect bazel_only exclusions
* Automated change: Fix sanity tests
* microbenchmark fix
* sanitize, fix iOS flub
* Automated change: Fix sanity tests
* iOS fix
* reviewer feedback
* first pass at excluding EventEngine test expansion
Also caught a few cases where we should not test pollers, but should
test all engines. And two cases where we likely shouldn't be testing
either product.
* end2end fuzzers to be fuzzed differently via EventEngine.
* sanitize
* reviewer feedback
* remove misleading comment
* reviewer feedback: comments
* EE test_init needs to play with our build system
* fix golden file test runner
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
This allows implementers to select which subset of the conformance test
suite they wish to exercise with their implementation. This was a
request from the fuchsia team, and may be useful for partial
implementations that are composed into a complete EventEngine solution.
This avoids having to do a cherry-pick import, and is harmless since
there are no dependencies yet on the EventEngine. This test will be
re-enabled shortly after both the import and related changes are
finished.
* Reintroduce the EventEngine default factory
An application can provide an EventEngine factory function that allows
gRPC internals to create EventEngines as needed. This factory would be
used when no EventEngine is provided for some given channel or server,
and where an EventEngine otherwise could not be provided by the
application. Note that there currently is no API to provide an
EventEngine per channel or per server.
I've also deleted some previous iterations on global EventEngine and
EventEngine factory ideas. This new code lives in a public API, and
coexists with iomgr instead of being isolated to an EventEngine-specific
iomgr implementation.
* add proper namespaces, and fix description
* put factory functions in their own file (for replaceability)
* add synchronization
* generate_projects.sh
* extract event_engine_base and event_engine_factory targets
Also separate iomgr/event_engine files in the BUILD, with comments
* gpr_platform
* move all EE factory declarations to event_engine_base
Makes internal hackery easier.
* add missing deps
* reorder dep alphabetically
* comment style change
A reusable test suite for EventEngine implementations.
To exercise a custom EventEngine, simply link against :event_engine_test_suite
and provide a testing main function that sets a custom EventEngine factory:
```
#include "path/to/my_custom_event_engine.h"
#include "src/core/event_engine/test_suite/event_engine_test.h"
int main(int argc, char** argv) {
::testing::InitGoogleTest(&argc, argv);
SetEventEngineFactory(
[]() { return absl::make_unique<MyCustomEventEngine>(); });
auto result = RUN_ALL_TESTS();
return result;
}
```
This code adds an iomgr implementation that's backed by an EventEngine. This uses the EventEngine API alone, and separate work will introduce an EventEngine prototype to plug into it.
See also drfloob#1: @nicolasnoble has a pull request against this branch, implementing the libuv-based EventEngine. One goal here is to implement the iomgr code such that it can be merged independently without affecting normal builds.
This implementation can be built using bazel build --cxxopt='-DGRPC_USE_EVENT_ENGINE' :all
Some shortcuts are being taken to get a working, testable version of the engine. EventEngines are not pluggable, for example.