Relands #34785, which was reverted in #34818.
The first commit is the revert. The second commit removes the gtest
dependency from the xds_server library, which should address the
testonly problem internally.
More changes as part of the dualstack design:
- Change resolver and LB policy APIs to support multiple addresses per
endpoint. Specifically, replace `ServerAddress` with
`EndpointAddresses`, which encodes more than one address. Per-address
channel args are retained at the same level, so they are now
per-endpoint. For now, `EndpointAddress` provides a single-address ctor
and a single-address accessor for backward compatibility, so
`ServerAdress` is an alias for `EndpointAddresses`; eventually, this
alias and the single-address methods will be removed.
- Add an `EndpointAddressSet` class, which represents an unordered set
of addresses to be used as a map key. This will be used in a number of
LB policies that need to store per-endpoint state.
- Change the LB policy API's `ChannelControlHelper::CreateSubchannel()`
method to take the address and per-endpoint channel args as separate
parameters, so that we don't need to construct a legacy `ServerAddress`
object as we create a new subchannel for each address in the endpoint.
- Change pick_first to flatten the address list.
- Change ring_hash to use `EndpointAddressSet` as the key for its
endpoint map, and to use the first address of the endpoint as the hash
key.
- Change WRR to use `EndpointAddressSet` as the key for its endpoint
weight map.
Note that support for multiple addresses per endpoint is guarded in RR
by the existing `round_robin_delegate_to_pick_fist` experiment and in
WRR by the existing `wrr_delegate_to_pick_first` experiment.
This PR does *not* include support for multiple addresses per endpoint
for the outlier_detection or xds_override_host LB policies; those will
come in subsequent PRs.
This should address one of the failures we're seeing in #34224.
The test failure is caused by the changes in timing triggering a race
condition. In the code at head, we delay sending out the subscription
for the first CDS watch until we've already seen the other two CDS
watches, because the previous send_message op has not yet completed, and
by the time it does, we've seen all 3 watches, so we can send a
subscription for all 3 at the same time. With the WorkSerializer change,
the send_message op is complete by the time we see the first CDS watch,
so we subscribe to only that resource, and then later add the other two.
The result is that we'll NACK twice with two different messages, the
first one including only the error about the first resource, and the
second one including all three.
I suspect this same race condition would have been triggered eventually
by the EventEngine migration anyway; the current test basically depends
on the single-thread timing of the iomgr approach. So I'm addressing it
by replacing the e2e test with a unit test that covers the same cases
without the timing issue.
Old message:
new_cluster_1: UNAVAILABLE: errors validating xds_cluster_resolver LB
policy config: [field:discoveryMechanisms error:must be non-empty]
New message:
new_cluster_1: FAILED_PRECONDITION: aggregate cluster graph has no leaf
clusters
De-experiment pick first since we have both affinity and randomness E2E
test running successfully.
---------
Co-authored-by: Yash Tibrewal <yashkt@google.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
We decided to not populate `policy_name` with the HTTP filter name in
xDS case. So removing it from `GenerateServiceConfig`. This will be
consistent across languages. The gRFC
[PR](https://github.com/grpc/proposal/pull/346) has been updated.
This reverts the following PRs: #32692#33087#33093#33427#33568
These changes seem to have introduced some flaky crashes. Reverting
while I investigate.
More work on the dualstack backend design:
- Change ring_hash policy to delegate to pick_first instead of creating
subchannels directly.
- Note that, as mentioned in the WIP gRFC, because we lazily create the
pick_first child policies, so there's no need to swap over to a new list
as an atomic whole. As a result, we don't use the endpoint_list library
in this policy; instead, we just update a map in-place.
- Remove now-unused subchannel_list library.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Added tests involve:
1. Checking the # of logger invocations with multiple RBACs in the
chain.
2. Verifying content in audit context with action and audit condition
permutations.
3. Confirm custom logger and built-in logger configurations are working.
4. Confirm the feature is protected by the environment variable.
---------
Co-authored-by: rockspore <rockspore@users.noreply.github.com>
- switch to json_object_loader for config parsing
- use `absl::string_view` instead of `const char*` for cert provider
names
- change cert provider registry to use a map instead of a vector
- remove unused mesh_ca cert provider factory
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This bug occurred when the same xDS server was configured twice in the
same bootstrap config, once in an authority and again as the top-level
server. In that case, we were incorrectly failing to de-dup them and
were creating a separate channel for the LRS stream than the one that
already existed for the ADS stream. We fix this by canonicalizing the
server keys the same way in both cases.
As a separate follow-up item, I will work on trying to find a better way
to key these maps that does not suffer from this kind of fragility.
Expand server promises to run with C++ end2end tests.
Across connected_channel/call/batch_builder/pipe/transport:
- fix a bug where read errors weren't propagated from transport to call
so that we can populate failed_before_recv_message for the c++ bindings
- ensure those errors are not, however, used to populate the returned
call status
Add a new latch call arg to lazily propagate the bound CQ for a server
call (and client call, but here it's used degenerately - it's always
populated). This allows server calls to be properly bound to
pollsets.(1)/(2)
In call.cc:
- move some profiling code from FilterStackCall to Call, and then use it
in PromiseBasedCall (this should be cleaned up with tracing work)
- implement GetServerAuthority
In server.cc:
- use an RAII pattern on `MatchResult` to avoid a bug whereby a tag
could be dropped if we cancel a request after it's been matched but
before it's published
- fix deadline export to ServerContext
In resource_quota_server.cc:
- fix some long standing flakes (that were finally obvious with the new
test code) - it's legal here to have client calls not arrive at the
server due to resource starvation, work through that (includes adding
expectations during a `Step` call, which required some small tweaks to
cq_verifier)
In the C++ end2end_test.cc:
- strengthen a flaky test so it passes consistently (it's likely we'll
revisit this with the fuzzing efforts to strengthen it into an actually
robust test)
(1) It's time to remove this concept
(2) Surprisingly the only test that *reliably* demonstrates this not
being done is time_change_test
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This reverts commit 4b46dbc19e.
Reason: this seems to be breaking load reports in certain cases,
b/276944116
Let's revert so this doesn't accidentally get released.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->