More changes as part of the dualstack design:
- Change resolver and LB policy APIs to support multiple addresses per
endpoint. Specifically, replace `ServerAddress` with
`EndpointAddresses`, which encodes more than one address. Per-address
channel args are retained at the same level, so they are now
per-endpoint. For now, `EndpointAddress` provides a single-address ctor
and a single-address accessor for backward compatibility, so
`ServerAdress` is an alias for `EndpointAddresses`; eventually, this
alias and the single-address methods will be removed.
- Add an `EndpointAddressSet` class, which represents an unordered set
of addresses to be used as a map key. This will be used in a number of
LB policies that need to store per-endpoint state.
- Change the LB policy API's `ChannelControlHelper::CreateSubchannel()`
method to take the address and per-endpoint channel args as separate
parameters, so that we don't need to construct a legacy `ServerAddress`
object as we create a new subchannel for each address in the endpoint.
- Change pick_first to flatten the address list.
- Change ring_hash to use `EndpointAddressSet` as the key for its
endpoint map, and to use the first address of the endpoint as the hash
key.
- Change WRR to use `EndpointAddressSet` as the key for its endpoint
weight map.
Note that support for multiple addresses per endpoint is guarded in RR
by the existing `round_robin_delegate_to_pick_fist` experiment and in
WRR by the existing `wrr_delegate_to_pick_first` experiment.
This PR does *not* include support for multiple addresses per endpoint
for the outlier_detection or xds_override_host LB policies; those will
come in subsequent PRs.
We have a bunch of experiments testing against core e2e - and this is
good for robustness, bad for CI times.
We also have a bunch of marginal but overall necessary fixtures in the
e2e suites - again good for robustness, bad for CI times.
We can eliminate some of the cross product though, and I think safely:
run experiments on a broad range of suites, but not *ALL* the suites,
and get a bunch of our CI time back.
Here I introduce an environment variable: `GRPC_CI_EXPERIMENTS` that's
set when running bazel @experiment= configs, cleared otherwise (so we
can still execute those tests directly when necessary). When that env
var is set we filter out a bunch of suites from the test configurations.
The previous approach of generating strings was not converging well.
Instead, load a bitfield from the protobuf and use the bits to select
experiments. The fuzzers can explore this space swiftly.
Downside is that as experiments rotate in/out the corpus gets a bit
messed up, but I'm reasonably confident we'll recover quickly.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: Mark D. Roth <roth@google.com>
Co-authored-by: markdroth <markdroth@users.noreply.github.com>
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Update from gtcooke94:
This PR adds support to build gRPC and it's tests with OpenSSL3. There were some
hiccups with tests as the tests with openssl haven't been built or exercised in a
few months, so they needed some work to fix.
Right now I expect all test files to pass except the following:
- h2_ssl_cert_test
- ssl_transport_security_utils_test
I confirmed locally that these tests fail with OpenSSL 1.1.1 as well,
thus we are at least not introducing regressions. Thus, I've added compiler directives around these tests so they only build when using BoringSSL.
---------
Co-authored-by: Gregory Cooke <gregorycooke@google.com>
Co-authored-by: Esun Kim <veblush@google.com>
Fix sticky-TF behavior such that once we enter TRANSIENT_FAILURE, we do
not leave that state if we get a new address list.
Also, fix handling of subchannels in state TRANSIENT_FAILURE.
Previously, if a subchannel was already in state TRANSIENT_FAILURE when
we wanted to start a connection attempt on it (e.g., because the
subchannel already existed from a different channel, or because it
already existed in the previous subchannel list), we would wait for it
to report IDLE before attempting to connect. This PR changes pick_first
to instead immediately skip the subchannel and move on to the next one.
Now, the only time we wait for a subchannel in TRANSIENT_FAILURE is when
we wrap back around to the first subchannel in the list.
This doesn't matter for gtest (it does its own sorting later), but for
the fuzzers this will probably save a bit of churn in the corpus and
lead to faster discovery of interesting cases.
CNR a WindowsEventEngine listener flake in:
* 10k local Windows development machine runs
* 50k Windows RBE runs
* 10k Windows VM runs
It fails ~5 times per day on the master CI jobs.
This PR adds some logging to try to see if an edge is missed, and
switches the thread pool implementation to see if that makes the flake
go away. If the flakes disappear, I'll try removing one or the other to
see if either independently fix the problem (hopefully not logging).
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
Why: Cleanup for chttp2_transport ahead of promise conversion - lots of
logic has become interleaved throughout chttp2, so some effort to
isolate logic out is warranted ahead of that conversion.
What: Split configuration and policy tracking for each of ping rate
throttling and abuse detection into their own modules. Add tests for
them.
Incidentally: Split channel args into their own header so that we can
split the policy stuff into separate build targets.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
I'd been adding the following stanza regularly to debug flakes/fuzz
failures:
```
Expect(1, CoreEnd2endTest::MaybePerformAction{[&](bool success) {
Crash(absl::StrCat(
"Unexpected completion of client side call: success=",
success ? "true" : "false", " status=", server_status.ToString(),
" initial_md=", server_initial_metadata.ToString()));
}});
```
it was helpful because it indicated why a call batch finished
successfully and helped quickly identify next steps.
It occurred to me however that this would better be done inside of the
framework, and for *all* ops that have outputs, so this PR does just
that. Any time a batch with an op that outputs information finishes
successfully but unexpectedly we now display those outputs in human
readable form in the error message.
Sample output:
```
[ RUN ] CorpusExamples/FuzzerCorpusTest.RunOneExample/0
RUN TEST: Http2SingleHopTest.SimpleDelayedRequestShort/Chttp2SimpleSslFullstack
E0101 00:00:05.000000000 396633 simple_delayed_request.cc:37] Create client side call
E0101 00:00:05.000000000 396633 simple_delayed_request.cc:41] Start initial batch
E0101 00:00:05.000000000 396633 simple_delayed_request.cc:47] Start server
E0101 00:00:05.000000000 396633 cq_verifier.cc:364] Verify tag(101)-✅ for 600000ms
test/core/end2end/cq_verifier.cc:316: Unexpected event: OP_COMPLETE: tag:0x1 OK
with:
incoming_metadata: {}
status_on_client: status=4 msg=Deadline Exceeded trailing_metadata={}
checked @ test/core/end2end/tests/simple_delayed_request.cc:51
expected:
test/core/end2end/tests/simple_delayed_request.cc:50: tag(101) success=true
```
Noticed this failing on an internal cl due to deadline exceeded errors.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
- Change the `ResolverFactory::GetDefaultAuthority()` method to %-encode
the authority by default, so individual resolver impls don't need to
remember to do this.
- Remove the hack in the xds resolver for setting the authority to
everything after the last `/` character.
- Change the `unix`, `unix-abstract`, and `vsock` resolvers to use a
real authority instead of hard-coding to "localhost".
In chttp2: a pending but not yet sent goaway should block incoming
requests just like a sent one (we will sent that data momentarily!)
In the test:
- handle the case of the connection idle timeout happening before the
request arrives at the server
- disable retries, as these cause the request to get stuck (as we don't
have an additional server to retry on)
Fix b/287897932
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
We want writes to participate in event re-ordering, but it's unlikely
that we can sustain one byte per 500ms on all tests and keep them
passing (which is the degenerate case right now).
Tune write delays down to 50ms for the moment, though I expect we'll
want to talk about going lower.
omgwtfbbq
This test relies on WAIT_FOR_READY semantics, but we don't do that in
the proxy, so it got assigned the wrong suite.
Fix the suite, fix the flakes.
Also add some handy dandy logging to help figure this stuff out in the
future.
I can still make the old algorithm break and assign duplicate names on
my machine... make it a little more robust.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>