The pick_first policy creates a list of subchannels for each resolver update and then iterates over the list, attempting to connect to each subchannel in turn, until one of them succeeds. However, once a subchannel does succeed, the policy unrefs the other subchannels but still retains a bunch of now-unnecessary state in the subchannel list itself. This wastes a bunch of memory, especially now that petiole policies are delegating to pick_first. This PR contains a new pick_first implementation that stops retaining that state, which significantly reduces per-channel memory.
There is one behavior change here, which is that if we have a connected subchannel and we get a resolver update that no longer includes that address, we now go IDLE instead of proactively trying to connect to the new addresses.
Closes#34766
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/34766 from markdroth:pick_first_free_memory_after_connecting 7236b4321f
PiperOrigin-RevId: 623887639
Previously we wrote `DualRefCounted : public Orphanable`, but this is wrong: `Orphan`-ing is a private implementation detail to `DualRefCounted`, but a public interface to `Orphanable`.
This bug means that it's possible to write `OrphanablePtr<T>` when `T` is derived from `DualRefCounted`, leading to hard to diagnose bugs - especially when moving a previously `Orphanable` type to be `DualRefCounted`.
This change removes the inheritance from `Orphanable`, and instead adds an overridable method `Orphaned` that implementors of `DualRefCounted` can implement.
In this way we get:
* compiler errors if someone chooses to write `OrphanablePtr` for one of these types
* compiler errors if someone implements `Orphan()` instead of `Orphaned()` in the wrong place (or vice versa)
Closes#36194
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/36194 from ctiller:orf b96b831a96
PiperOrigin-RevId: 620916632
<!--
If you know who should review your pull request, please assign it to that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the appropriate
lang label.
-->
Closes#36070
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/36070 from yijiem:grpc-metrics 72653727b1
PiperOrigin-RevId: 618529035
PanCakes to the rescue!
We noticed that our 'sanity' test was going to fail, but we think we can fix that automatically, so we put together this PR to do just that!
If you'd like to opt-out of these PR's, add yourself to NO_AUTOFIX_USERS in .github/workflows/pr-auto-fix.yaml
Closes#36059
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/36059 from grpc:create-pull-request/patch-02276f4 fb93164443
PiperOrigin-RevId: 613642099
This breaks the following pieces out of the `grpc_client_channel` BUILD target:
- backend_metric_parser
- oob_backend_metric
- child_policy_handler
- backup_poller
- service_config_channel_arg_filter
- client_channel_channelz
- client_channel_internal_header
- subchannel_connector
- subchannel_pool_interface
- config_selector
- client_channel_service_config_parser
- retry_service_config_parser
- retry_throttle
The code left in the `grpc_client_channel` target will need more work to pull apart.
Closes#35879
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35879 from markdroth:client_channel_build_split f388a37edc
PiperOrigin-RevId: 608806548
This change is needed in order to get certain LB-policy-unit-test-framework-based tests to pass, which involve updates on PF policies.
Without this change, in such tests, health watchers on the original subchannel can be left hanging around. The underlying reason is related to 842057d8d5/test/core/client_channel/lb_policy/lb_policy_test_lib.h (L203).
Related: cl/563857636
Closes#35865
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35865 from apolcyn:update_pf_Test 1a5e6b5591
PiperOrigin-RevId: 607844309
As title. Pulling these additions out from a larger change.
Related: cl/563857636
Closes#35860
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35860 from apolcyn:test_lib_changes 40b6455638
PiperOrigin-RevId: 606708025
This new directory combines code from the following locations:
- src/core/ext/filters/client_channel/resolver
- src/core/lib/resolver
Closes#35804
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35804 from markdroth:client_channel_resolver_reorg2 30660e6b00
PiperOrigin-RevId: 604665835
This new directory combines code from the following locations:
- src/core/ext/filters/client_channel/lb_policy
- src/core/lib/load_balancing
Closes#35786
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35786 from markdroth:client_channel_resolver_reorg 98554efb98
PiperOrigin-RevId: 604351832
<!--
If you know who should review your pull request, please assign it to that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the appropriate
lang label.
-->
Closes#35210
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35210 from yijiem:csm-service-label 6a6a7d1774
PiperOrigin-RevId: 597641393
I realized that this field wasn't actually necessary, since the string is already present in the map key.
Closes#35503
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35503 from markdroth:xds_config_remove_cluster_name 94d5edc133
PiperOrigin-RevId: 597375018
Previously, `RefCountedPtr<>` and `WeakRefCountedPtr<>` incorrectly allowed
implicit casting of any type to any other type. This hadn't caused a
problem until recently, but now that it has, we need to fix it. I have
fixed this by changing these smart pointer types to allow type
conversions only when the type used is convertible to the type of the
smart pointer. This means that if `Subclass` inherits from `Base`, then
we can set a `RefCountedPtr<BaseClass>` to a value of type
`RefCountedPtr<Subclass>`, but we cannot do the reverse.
We had been (ab)using this bug to make it more convenient to deal with
down-casting in subclasses of ref-counted types. For example, because
`Resolver` inherits from `InternallyRefCounted<Resolver>`, calling
`Ref()` on a subclass of `Resolver` will return `RefCountedPtr<Resolver>`
rather than returning the subclass's type. The ability to implicitly
convert to the subclass type made this a bit easier to deal with. Now
that that ability is gone, we need a different way of dealing with that
problem.
I considered several ways of dealing with this, but none of them are
quite as ergonomic as I would ideally like. For now, I've settled on
requiring callers to explicitly down-cast as needed, although I have
provided some utility functions to make this slightly easier:
- `RefCounted<>`, `InternallyRefCounted<>`, and `DualRefCounted<>` all
provide a templated `RefAsSubclass<>()` method that will return a new
ref as a subclass. The type used with `RefAsSubclass()` must be a
subclass of the type passed to `RefCounted<>`, `InternallyRefCounted<>`,
or `DualRefCounted<>`.
- In addition, `DualRefCounted<>` provides a templated `WeakRefAsSubclass<T>()`
method. This is the same as `RefAsSubclass()`, except that it returns
a weak ref instead of a strong ref.
- In `RefCountedPtr<>`, I have added a new `Ref()` method that takes
debug tracing parameters. This can be used instead of calling `Ref()`
on the underlying object in cases where the caller already has a
`RefCountedPtr<>` and is calling `Ref()` only to specify the debug
tracing parameters. Using this method on `RefCountedPtr<>` is more
ergonomic, because the smart pointer is already using the right
subclass, so no down-casting is needed.
- In `WeakRefCountedPtr<>`, I have added a new `WeakRef()` method that
takes debug tracing parameters. This is the same as the new `Ref()`
method on `RefCountedPtr<>`.
- In both `RefCountedPtr<>` and `WeakRefCountedPtr<>`, I have added a
templated `TakeAsSubclass<>()` method that takes the ref out of the
smart pointer and returns a new smart pointer of the down-casted type.
Just as with the `RefAsSubclass()` method above, the type used with
`TakeAsSubclass()` must be a subclass of the type passed to
`RefCountedPtr<>` or `WeakRefCountedPtr<>`.
Note that I have *not* provided an `AsSubclass<>()` variant of the
`RefIfNonZero()` methods. Those methods are used relatively rarely, so
it's not as important for them to be quite so ergonomic. Callers of
these methods that need to down-cast can use
`RefIfNonZero().TakeAsSubclass<>()`.
PiperOrigin-RevId: 592327447
<!--
If you know who should review your pull request, please assign it to that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the appropriate
lang label.
-->
Closes#35251
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/35251 from yijiem:fix-dns-resolver-cooldown-test 857835200a
PiperOrigin-RevId: 589159895
This avoids storing unnecessary copies of the address list in each node of the LB policy tree.
Closes#34753
COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/34753 from markdroth:lb_address_list_iterator 1d39465fbc
PiperOrigin-RevId: 582891475
Changes to fake resolver:
- Add `WaitForReresolutionRequest()` method to fake resolver response
generator to allow tests to tell when re-resolution has been requested.
- Change fake resolver response generator API to have only one mechanism
for injecting results, regardless of whether the result is an error or
whether it's triggered by a re-resolution.
Changes to grpclb_end2end_test:
- Change balancer interface such that instead of setting a list of
responses with fixed delays, the test can control exactly when each
response is set.
- Change balancer impl to always send the initial LB response, as
expected by the grpclb protocol.
- Change balancer impl to always read load reports, even if load
reporting is not expected to be enabled. (The latter case will still
cause the test to fail.) Reads are done in a different thread than
writes.
- Allow each test to directly control how many backends and balancers
are started and the client load reporting interval, so that (a) we don't
waste resources starting servers we don't need and (b) there is no need
to arbitrarily split tests across different test classes.
- Add timeouts to `WaitForAllBackends()` functionality, so that tests
will fail with a useful error rather than timing out.
- Improved ergonomics of various helper functions in the test framework.
In the process of making these changes, I found a couple of bugs:
- A bug in pick_first, which I fixed in #34885.
- A bug in grpclb, in which we were using the wrong condition to decide
whether to propagate a re-resolution request from the child policy,
which I've fixed in this PR. (This bug probably originated way back in
#18344.)
This should address a lot of the flakes seen in grpclb_e2e_test
recently.
This fixes a bug accidentally introduced in #33753. The symptom is that
if we exit idle and then get a new address list before any of the
subchannels in the old list can report their initial connectivity state,
we will incorrectly ignore the new address list.
The original logic from #34615 was incorrect in cases where one address
family has a different number of addresses than the other(s).
Fixes b/307937051.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
- Fixes support for the same address being present more than once in the
address list, which was accidentally broken in #34244.
- Change the call attribute to encode the hash as an integer instead of
a string.
More changes as part of the dualstack design:
- Change resolver and LB policy APIs to support multiple addresses per
endpoint. Specifically, replace `ServerAddress` with
`EndpointAddresses`, which encodes more than one address. Per-address
channel args are retained at the same level, so they are now
per-endpoint. For now, `EndpointAddress` provides a single-address ctor
and a single-address accessor for backward compatibility, so
`ServerAdress` is an alias for `EndpointAddresses`; eventually, this
alias and the single-address methods will be removed.
- Add an `EndpointAddressSet` class, which represents an unordered set
of addresses to be used as a map key. This will be used in a number of
LB policies that need to store per-endpoint state.
- Change the LB policy API's `ChannelControlHelper::CreateSubchannel()`
method to take the address and per-endpoint channel args as separate
parameters, so that we don't need to construct a legacy `ServerAddress`
object as we create a new subchannel for each address in the endpoint.
- Change pick_first to flatten the address list.
- Change ring_hash to use `EndpointAddressSet` as the key for its
endpoint map, and to use the first address of the endpoint as the hash
key.
- Change WRR to use `EndpointAddressSet` as the key for its endpoint
weight map.
Note that support for multiple addresses per endpoint is guarded in RR
by the existing `round_robin_delegate_to_pick_fist` experiment and in
WRR by the existing `wrr_delegate_to_pick_first` experiment.
This PR does *not* include support for multiple addresses per endpoint
for the outlier_detection or xds_override_host LB policies; those will
come in subsequent PRs.
Add some basic metrics to work serializer, keep them process wide for
now (though it may be interesting to get these into channelz in the
future).
Collected are:
- time spent running a work serializer when it starts
- time spent actually executing work when the work serializer runs
- number of items executed each run
A high disparity between the first two indicates our dispatching
mechanism is adding large amounts of latency (perhaps due to thread
starvation like effects).
A high value for any of these indicate contention on the serializer.
It's likely a future iteration on these will select different metrics -
I'm not *entirely* sure which will be useful in production analysis yet.
I'm using `std::chrono::steady_clock` here for precision (nanoseconds)
with a compact representation (better than timespec) and a robust &
portable api - I think it's appropriate for metrics, but wouldn't use it
much beyond that at this point.
Most recent attempt was #34320, reverted in #34335.
The first commit here is a pure revert. The second commit fixes the
outlier_detection unit test to pass both with and without the
experiment.