- Increase kubernetes library default for urllib3 retries to 10
- Add custom retry logic to all API calls made by framework.k8s
Custom retry logic handles various errors we've experienced over
two years, based on ~140 failure reports:
1. Errors returned by the k8s API server itself:
- 401 Unauthorized
- 409 Conflict
- 429 Too Many Requests
- 500 Internal Server Error
2. Connection errors that might indicate k8s API server is temporarily
unavailable (such as a restart, upgrade, etc):
- All `NewConnectionError`s, e.g. "Connection timed out",
"Connection refused"
- All "connection aborted" `ProtocolError`s, e.g. "Remote end
closed connection without response", "Connection reset by peer"
ref b/178378578, b/258546394
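The retry policy described above can be sketched roughly as follows. This is a minimal illustration, not the framework's code: `ApiError`, `should_retry`, and `call_with_retries` are hypothetical names, and the standard library's `ConnectionError` and `TimeoutError` stand in for the urllib3 exceptions named in the list.

```python
import time

# Status codes listed above as retryable API server responses.
RETRYABLE_STATUSES = frozenset({401, 409, 429, 500})


class ApiError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def should_retry(exc):
    """Return True for errors the description above treats as transient."""
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUSES
    # Stdlib ConnectionError covers refused, reset, and aborted connections.
    return isinstance(exc, (ConnectionError, TimeoutError))


def call_with_retries(fn, attempts=10, delay_sec=0.0, sleep=time.sleep):
    """Call fn(), retrying transient failures up to `attempts` total calls."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on a non-transient error.
            if attempt == attempts or not should_retry(exc):
                raise
            sleep(delay_sec)
```

Keeping the classification in one predicate makes it easy to extend the list when a new failure report shows another transient error.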
Built on https://github.com/grpc/grpc/pull/32560
When calling EventEngine::Read, if a synchronous WSARecv call completes
successfully and 1) the read buffer is not full, and 2) the stream
remains open, then the endpoint will now chain execution of more
synchronous WSARecvs. The chain is broken and the on_read callback is
called when either there are errors, the next call would block, the
buffer is full, or the stream is closed.
Something like this is helpful to prevent excessive read callback
execution under a flood of tiny payloads, presuming messages are not
being combined as one would usually expect (see
`//test/core/iomgr:endpoint_pair_test`, and Nagle's algorithm).
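The chaining behavior described above can be illustrated with a small language-neutral sketch. It is written in Python rather than against the Windows API, and `recv` is a hypothetical callable standing in for a synchronous WSARecv: it returns up to `n` bytes, returns `b""` when the stream is closed, and raises `BlockingIOError` when the next read would block.

```python
def chained_read(recv, buffer_size):
    """Read synchronously until a stop condition is hit.

    Returns (data, reason), where reason names the condition that broke
    the chain: "would_block", "error", "closed", or "buffer_full".
    """
    buf = bytearray()
    while len(buf) < buffer_size:
        try:
            chunk = recv(buffer_size - len(buf))
        except BlockingIOError:
            # The next call would block: stop and hand over what we have.
            return bytes(buf), "would_block"
        except OSError:
            return bytes(buf), "error"
        if not chunk:
            return bytes(buf), "closed"
        buf.extend(chunk)
    return bytes(buf), "buffer_full"
```

In this sketch the caller would invoke the read callback once per `chained_read` result instead of once per tiny payload.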
Note that there is no behavior change associated with this PR. In other
words, folks that use `GRPC_ARG_ENABLE_PER_MESSAGE_DECOMPRESSION` and
`GRPC_ARG_ENABLE_PER_MESSAGE_COMPRESSION` will still see the same
behavior as before.
The actual change - The compression filter will always be added to the
filter stack for HTTP transports even if it is a no-op due to the above
channel args.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
`bazel query deps(//src/proto/...)` seems unnecessary (regenerated
projects are identical) and causes trouble with protobuf 22.x (since it
basically breaks `tools/buildgen/generate_projects.sh` run and that
makes upgrade experiments painful).
It is reported in https://github.com/grpc/grpc/issues/32356 that there
is a race on the vptr of `UnimplementedAsyncRequest`, which can crash a
multi-threaded server if clients send unimplemented RPC requests to it.
The cause is that the server requests a call for
`UnimplementedAsyncRequest` from the constructor of its base class
`GenericAsyncRequest`, when the `vptr` still points to the base class's
`vtable`. If the call comes in and another server thread picks up the
tag before the `vptr` points to the derived class's `vtable`, that
thread calls the wrong virtual function; this is also a data race. The
fix requests the call inside the derived class's constructor instead.
PSM Interop: Local dev various improvements
- Clean up resources on ctrl+c
- Add startup probes to address the issue with port forwarding starting
before the workload listens on a port
- Remove misleading restartPolicy: it's silently ignored by k8s
- Extra debug message with port-forwarding command
- sort source files to ensure stable ordering
- generate one source file per line
Together, this should produce diffs that are much more readable by
humans when sources get added to or removed from protobuf (and
make_grpcio_tools.py is used to regenerate).
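A minimal sketch of the two formatting rules above. `format_source_list` and the variable name passed to it are hypothetical and not taken from make_grpcio_tools.py:

```python
def format_source_list(name, paths):
    """Render `paths` as a Python list literal, sorted, one entry per line."""
    lines = [f"{name} = ["]
    # Sorting gives a stable order regardless of how the paths were discovered.
    lines.extend(f"    {path!r}," for path in sorted(paths))
    lines.append("]")
    return "\n".join(lines) + "\n"
```

With one entry per line, adding or removing a source file shows up as a single-line diff.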
First step in the modernization of our RBE stack (see
go/rbe-tech-debt-notes).
- Get rid of the deprecated rbe_autoconfig and start using
[rbe_configs_gen](https://github.com/bazelbuild/bazel-toolchains#rbe_configs_gen---cli-tool-to-generate-configs)
+ check in the generated toolchain configs.
- Switch from marketplace.gcr.io/google/rbe-ubuntu16-04 to
marketplace.gcr.io/google/rbe-ubuntu18-04 (this image is still not owned
by us, but at least it's newer and demonstrates how a switch to a newer
docker image is done).
- Provide a script for generating the Linux RBE toolchain configs.
- Clean up the RBE configuration in the bazelrc files used for remote builds.
This check only works if all handshake RPCs have an OK status, and it's
racy, e.g. if the client is cancelling handshake RPCs (when an RPC is
cancelled, termination of the RPC at the client is asynchronous from
termination at the server, so the client can resume the queue before the
server RPC completes).
With iomgr, this test is effectively rate limited by ExecCtx and the
single thread running pollset_work, which results in thousands of tiny
writes happening before every read. A small set of _synchronous_ 8k
reads then dominate the read-side of the test. This is an efficient
balance.
With the Windows EventEngine, the fully asynchronous, multi-threaded
reads and writes end up alternating roughly 1:1, meaning that a read
callback is executed for every tiny handful of bytes, tens of thousands
of times. Compared to the Posix EventEngine, without things like TCP_INQ
and/or recvmsg's timeout, I don't know of any great signal for how much
data can safely be received in a batch (e.g., we don't want to wait for
data that will never come, and we don't want to run callbacks for 2
bytes over and over again if we have KB in the pipe).
I believe the Windows EventEngine is WAI. I can significantly improve
this test performance by artificially slowing the reader down (adding a
>= 1ms sleep), but I believe that improves this use case to the
detriment of all others.
This fixes a bug where connections cannot be made in IPv4-only
environments. To test, hard-code `IsIpv6LoopbackAvailable` to return
false.
Example Error:
`
D0309 00:29:49.514359445 235 tcp_client.cc:67] (event_engine)
EventEngine::Connect Status: INTERNAL: socket: Address family not
supported by protocol
`
This can also be reproduced in gRPC's benchmark environment, which does
not have IPv6 enabled.
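For illustration only, here is a Python sketch of an IPv6 loopback probe with an IPv4 fallback. It does not reproduce gRPC's `IsIpv6LoopbackAvailable` or the actual fix; both function names below are hypothetical.

```python
import socket


def is_ipv6_loopback_available():
    """Return True if an IPv6 socket can be bound to the loopback address."""
    if not socket.has_ipv6:
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::1", 0))
        return True
    except OSError:
        # Includes "Address family not supported by protocol".
        return False


def pick_loopback(ipv6_available):
    """Choose the loopback host and address family to use."""
    if ipv6_available:
        return "::1", socket.AF_INET6
    return "127.0.0.1", socket.AF_INET
```

Passing `False` to `pick_loopback` mimics the hard-coded test setup mentioned above.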
To support the TPC feature for BYOID (3PI), we need to remove the
pattern validation of impersonation endpoints, STS endpoints, and token
info endpoints, since they are different in TPC regions.
A security review has already passed at b/261634871.
When the handshaker_service_url is in "host:port" format such as it
normally is when using ALTS in GCE (in which case it comes from then
this makes no difference as the authority and the URL are the same. But
when different URLs are used, the correct authority to use is not always
the same as the URL. For example if the URL is unix:///some/path then
the correct authority is "localhost". This is correctly computed by
grpc_core::UnixResolverFactory and stored as the channel's default
authority, but we throw that away when we override the authority for
individual RPCs.
Note indeed that the majority of other callers of grpc_channel_create_*
pass nullptr for the host/authority argument.
It looks like nobody ever created ALTS credentials from Python with a
list of accepted service accounts before.
Simple reproduction:
```
import grpc
grpc.alts_channel_credentials(None) # works
grpc.alts_channel_credentials(['foo']) # fails
```
Without this change, generates this error:
```
[...]
File "src/python/grpcio/grpc/_cython/_cygrpc/credentials.pyx.pxi", line 414, in _cython.cygrpc.channel_credentials_alts
File "src/python/grpcio/grpc/_cython/_cygrpc/credentials.pyx.pxi", line 403, in _cython.cygrpc.ALTSChannelCredentials.__cinit__
TypeError: expected bytes, str found
```
(And the error cannot be worked around by the caller by passing a bytes
object from the Python side: you still get the same error.)
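The `TypeError` above is the usual str-versus-bytes mismatch at a C boundary. As a rough illustration of the kind of conversion involved, and not the actual change to `credentials.pyx.pxi`, a hypothetical helper could encode each account name before it is handed to the C layer:

```python
def to_bytes_list(service_accounts):
    """Encode each service account name as UTF-8 bytes; None yields []."""
    result = []
    for account in service_accounts or []:
        if isinstance(account, str):
            # The C API expects bytes, so encode str values explicitly.
            account = account.encode("utf-8")
        result.append(account)
    return result
```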
PR #32215 added the verified root cert subject to the lower level
`tsi_peer`. This PR is a companion to that and completes the feature by
bubbling the information up to the `TsiCustomVerificationCheckRequest`
which is part of the user facing API for implementing custom
verification callbacks.
Discovered via `bazel test
--test_env=GRPC_EXPERIMENTS=event_engine_client
//test/core/iomgr:endpoint_pair_test`. CI experiments can be enabled
generally on Windows once a few fixes and improvements are completed.
The `method_exists` function requires a fully qualified class name to be
sent to check if a method exists. The current class was missing the
namespace, which means the function always returns `false`. In our
application this caused the credentials to be loaded many times over,
which ate up some CPU. This bug fix ensures that this is only run once
per request.
This prevents deadlocks involving the wire writer.
Currently, some `transport_stream_receiver_` callbacks triggered by NDK
binder may acquire `WireReaderImpl::mu_` first and then
`WireWriterImpl::write_mu_`. We want to avoid this lock ordering.
We hit this problem because some clients and servers run in the same
process. NDK binder seems to behave more aggressively when the Tx and Rx
sides are in the same process.
Follow-up to https://github.com/grpc/grpc/pull/32229.
https://github.com/grpc/grpc/pull/32229 incremented the `ExecCtx` count
unconditionally. It was previously impossible for a thread to exit
`IncExecCtxCount` while `fork_complete_` was `false`. These same threads
then went on to _decrement_ `count_` while the fork was still in
progress, putting `count_` well below its expected range ([0, 1] while
blocking and [2, inf) while not blocking). This resulted in cases where
`count_` would be stuck at a negative number with a thread infinitely
looping through `IncExecCtxCount`.
This PR instead opts EE threads out of ExecCtx counting entirely. They
handle clean-up of their threads separately through a separate set of
handlers registered by an entirely separate invocation of
`pthread_atfork`. This resolves the issue pointed out in [this
comment](https://github.com/grpc/grpc/issues/31885#issuecomment-1426445192).
There are potentially surprising deployment bugs that can cause `EMFILE`
to be hit. For example, file descriptor limits can be easily reached if
- the round robin LB policy is used
- the load balancer hands out an assignment with a lot of backends
- Debian's default limit of 1024 file descriptors is in effect.
To make such problems more apparent, we can pay special attention to
this error and log ERROR when it happens.
Related: b/265199104
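The idea of singling out `EMFILE` can be sketched in Python as follows. This is an illustration of the approach, not the C++ change itself; the logger name and `log_if_fd_exhausted` are hypothetical.

```python
import errno
import logging

logger = logging.getLogger("fd_limit_example")


def log_if_fd_exhausted(exc):
    """Log at ERROR level when `exc` is an EMFILE failure; return True if so."""
    if isinstance(exc, OSError) and exc.errno == errno.EMFILE:
        logger.error("File descriptor limit reached (EMFILE): %s", exc)
        return True
    return False
```

A caller would invoke this from the `except OSError` branch around socket creation, so that descriptor exhaustion stands out from other connection failures.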
Third try for #32466.
This adds an interop client / server for GCP Observability integration
testing.
Everything is new here with no refactor. Plan is to get this in first
before trying to refactor out the flags.
Avoids some compilation problems on older MSVC versions and opens the
door for some future optimizations.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>