More work on the dualstack backend design:
- Change round_robin to delegate to pick_first instead of creating
subchannels directly.
- Change pick_first such that when it is the child of a petiole policy,
it will unconditionally start a health watch.
- Change the client-side health checking code such that if client-side
health checking is not enabled, it will return the subchannel's raw
connectivity state.
- As part of this, we introduce a new endpoint_list library to be used
by petiole policies, which is intended to replace the existing
subchannel_list library. The only policy that will still directly
interact with subchannels is pick_first, so the relevant parts of the
subchannel_list functionality have been copied directly into that
policy. The subchannel_list library will be removed after all petiole
policies are updated to delegate to pick_first.
I'm fixing the ALTS/Envoy transport socket extension (which is currently
broken). Along the way, I'm trying to remove as many uses of gRPC
internals as possible (with the eventual goal of only relying on public
gRPC APIs and the alts_zero_copy_grpc_protector). To this end, I need to
remove the ExecCtx check in the alts_zero_copy_grpc_protector_create
function, so that Envoy can call into this function without needing to
create an ExecCtx.
Currently, the outlier_detection policy reports ejection by intercepting
a subchannel's raw connectivity state watch. In the dualstack backend
design, we will instead want to report ejection via the health watch.
This PR is a first step toward that goal.
Specific changes in this PR:
- Add `type()` method to `InternalSubchannelDataWatcherInterface`, to
make it possible for LB policies to intercept data watchers.
- Use that mechanism in the outlier_detection policy to report ejection
both via raw connectivity state watches and via health watches. The hack
to prevent outlier_detection from working with pick_first from #33336
has been changed to affect only the raw connectivity watch, not the
health watch.
- Change health check client to fall back to reporting the raw
connectivity state if client-side health checking is not enabled. This
will allow pick_first to unconditionally start a health watch when it is
running under a petiole policy, which will be a no-op if neither health
checking nor outlier detection are configured.
Once we are done changing all of the petiole policies to delegate to
pick_first, we will remove the code that allows outlier_detection to
work via the raw connectivity state, so it will work only via the health
watch.
The address attribute interface was intended to provide a mechanism to
pass attributes separately from channel args, for values that do not
affect subchannel behavior and therefore do not need to be present in
the subchannel key, which does include channel args. However, the
mechanism as currently designed is fairly clunky and is probably not the
direction we will want to go in the long term.
Eventually, we will want some mechanism for registering channel args,
which would provide a cleaner way to indicate that a given channel arg
should not be used in the subchannel key, so that we don't need a
completely different mechanism. For now, this PR is just doing an
interim step, which is to establish a special channel arg key prefix to
indicate that an arg is not needed in the subchannel key.
Add -weak_framework CoreFoundation to the ruby extension link line on
arm64-darwin, to address "undefined symbols" issues from #33483
This is a variation on #33513 in response to feedback and so I've made
sure @alto-ruby is credited as co-author.
Closes#33483
Supersedes #33513
cc @apolcyn
Co-authored-by: alto-ruby <altorubys@gmail.com>
This change simplifies `EventEngine::DNSResolver`'s API based on the
proposal:
[go/event-engine-dns-resolver-api-changes](http://go/event-engine-dns-resolver-api-changes).
Note that this API change + the implementation described in
[go/event-engine-dns-resolver-implementation](http://go/event-engine-dns-resolver-implementation)
has already been tested against our main test suites and are passing
them.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Fixes#24206 by ensuring that IOCP/socket errors in the iomgr on_read
callback are properly annotated with the gRPC Unavailable status. The
WindowsEventEngine was already doing this correctly (try running the
client with `$env:GRPC_EXPERIMENTS="event_engine_client"`).
This also adds two small cleanups:
* Cleanly prints statuses with their child statuses in a few spots
within the chttp2 transport logging (previously, child messages were
printed with garbled bits)
* Adds friendly names to a subset of WSA errors that we're likely to see
from common operations. The top-level status message will no longer just
say "WSA Error" in many cases.
CC @Hamza-Q
This fixes a dumb bug from #33359, where I used `==` instead of `<` in
the comparator functor.
It also avoids unnecessary down-casting of the `unique_ptr<>`.
This reverts commit e107ff5e99.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Noticed this in a nested forking test in
https://github.com/grpc/grpc/pull/33430 (by nested fork - forked child
process forks again).
Before this change, the EventEngine and non-EventEngine pollers were
competing with each other in their calls to
`Fork::SetResetChildPollingEngineFunc`. In the first child's after-fork
handler, the EE engine would [clear
out](123da4a866/src/core/lib/event_engine/posix_engine/ev_epoll1_linux.cc (L260))
the post-fork handlers of the non-EE poller, leaving the grandchild
without the right post-fork cleanup method.
New versions of Clang (and GCC) warn about this
grpc_authorization_policy_provider.cc:124:17: error: format specifies
type 'int' but the argument has type 'absl::StatusCode'
[-Werror,-Wformat]
124 | "authorization policy reload status. code=%d error_details=%s",
| ~~
125 | status.code(), std::string(status.message()).c_str());
In chttp2: a pending but not yet sent goaway should block incoming
requests just like a sent one (we will sent that data momentarily!)
In the test:
- handle the case of the connection idle timeout happening before the
request arrives at the server
- disable retries, as these cause the request to get stuck (as we don't
have an additional server to retry on)
Fix b/287897932
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Noticed some inconsistencies in our keepalive configuration -
* Earlier, even if keepalive pings were disabled, we would be scheduling
keepalive pings at an interval of INT_MAX ms.
* We were not using `g_default_client_keepalive_permit_without_calls` /
`g_default_server_keepalive_permit_without_calls`. They are both false
by default but they can be overridden in
`grpc_chttp2_config_default_keepalive_args`.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Child classes of polling resolvers might have use cases where they would want to request a resolution explicitly which honor the min time between requests. As RequestReresolutionLocked needs to be called in from work_serializer which is usually owned by polling resolver completely, allow passing the work serializer back to the child class.
In rare cases across gRPC's end2end tests (only in Windows RBE with
debug builds, interestingly enough), the endpoint may be destroyed
before the final IOCP read callback has run. This edge case is only
triggered when all the following are true:
* the previous read operation received data (not an error, not
0-length), so more data is expected in the stream.
* the socket has not yet shut down
* the application destroyed its endpoint before (or during) the IOCP
callback execution.
* the Read operation has not yet called the client's on_read callback.
This is a valid scenario, and it is expected that the engine
implementation should call the application callbacks with error statuses
when this occurs. This PR fixes two associated bugs.
Previously we were running these at call termination, but internally we
have things that want to access finalizer cleaned up data after that
point.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Fix fuzzer found bug b/286716972
Follows up on https://github.com/grpc/grpc/pull/33266 but gets the edge
case right of when there's a read queued before the peer closes - in
that case we weren't waking up the read.
PanCakes to the rescue!
We noticed that our 'sanity' test was going to fail, but we think we can
fix that automatically, so we put together this PR to do just that!
If you'd like to opt-out of these PR's, add yourself to NO_AUTOFIX_USERS
in .github/workflows/pr-auto-fix.yaml
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Otherwise when we fill in peer in `SetCommonEntryFields` it will be
empty/invalid.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
The following bugs are fixed:
* Missing ExecCtx in event engine endpoints and listeners
* Ref counting issue with iomgr endpoint which causes crashes in
overloaded situations
The PR includes a test which triggers these bugs by simulating an
overloaded system.
With some delay, this is a PR for
https://github.com/grpc/grpc/issues/32564 (and previously
https://github.com/grpc/grpc/pull/31791).
I looked into adding a regular `py_test` for this change [as
suggested](https://github.com/grpc/grpc/pull/31791#issuecomment-1423245116)
but I am not aware of any effect that the presence of a .pyi stub file
would have at runtime and where some sort of type-checking in a .py
script would be affected. Stub files are only for use by type checkers &
IDE's. I mean, something like this would work:
```
import helloworld_pb2
py_file = helloworld_pb2.__file__
pyi_file = py_file + 'i’
self.assertTrue(os.path.exists(pyi_file))
```
But that seems really hacky to me. Instead I created a simple rule test
for `py_proto_library` with Bazel Skylib which tests the declared
outputs for an example `py_proto_library` target. Indirectly, this also
tests that the declared output files are actually generated. Please let
me know if this is sufficient.