Noticed that our behavior conflicts with what's mandated:
https://www.rfc-editor.org/rfc/rfc9113.html#section-5.1.2
So here's a quick fix - I like it because it's not a disconnection,
which always feels like one outage avoided.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Instead of fixing a target size for writes, try to adapt it a little to
observed bandwidth.
The initial algorithm tries to get large writes within 100-1000ms
maximum delay - this range probably wants to be tuned, but let's see.
The hope here is that on slow connections we can not back buffer so much
and so when we need to send a ping-ack it's possible without great
delay.
Experiment 1: On RST_STREAM: reduce MAX_CONCURRENT_STREAMS for one round
trip.
Experiment 2: If a settings frame is outstanding with a lower
MAX_CONCURRENT_STREAMS than is configured, and we receive a new incoming
stream that would exceed the new cap, randomly reject it.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Cap requests per read, rst_stream handled per read.
If these caps are exceeded, offload processing of the connection to a
backing thread pool, and allow other connections to make progress.
Previously chttp2 would allow infinite requests prior to a settings ack
- as the agreed upon limit for requests in that state is infinite.
Instead, after MAX_CONCURRENT_STREAMS requests have been attempted,
start blanket cancelling requests until the settings ack is received.
This can be done efficiently without allocating request state
structures.
We originally wanted a common format with internal OWNERS files on the
thought that we'd use them and import them and oh gosh that was the
wrong thing to think.
With upcoming changes to tree management having them here is going to
cause problems, so lets just update the CODEOWNERS files directly.
Isolate ping callback tracking to its own file.
Also takes the opportunity to simplify keepalive code by applying the
ping timeout to all pings.
Adds an experiment to allow multiple pings outstanding too (this was
originally an accidental behavior change of the work, but one that I
think may be useful going forward).
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
We disabled this a little while ago for lack of CI bandwidth, but #34404
ought to have freed up enough capacity that we can keep running this.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Summary -
On the server-side, we are changing the point at which we decide whether
a method is registered or not from the surface to the transport at the
point where we are done receiving initial metadata and before we invoke
the recv_initial_metadata_ready closures from the filters. The main
motivation for this is to allow filters to check whether the incoming
method is a registered or not. The exact use-case is for observability
where we only want to record the method if it is registered. We store
the information about the registered method in the initial metadata.
On the client-side, we also set information about whether the method is
registered or not in the outgoing initial metadata.
Since we are effectively changing the lookup point of the registered
method, there are slight concerns of this being a potentially breaking
change, so we are guarding this with an experiment to be safe.
Changes -
* Transport API changes -
* Along with `accept_stream_fn`, a new callback
`registered_method_matcher_cb` will be sent down as a transport op on
the server side. When initial metadata is received on the server side,
this callback is invoked. This happens before invoking the
`recv_initial_metadata_ready` closure.
* Metadata changes -
* We add a new non-serializable metadata trait `GrpcRegisteredMethod()`.
On the client-side, the value is a uintptr_t with a value of 1 if the
call has a registered/known method, or 0, if it's not known. On the
server side, the value is a (ChannelRegisteredMethod*). This metadata
information can be used throughout the stack to check whether a call is
registered or not.
* Server Changes -
* When a new transport connection is accepted, the server sets
`registered_method_matcher_cb` along with `accept_stream_fn`. This
function checks whether the method is registered or not and sets the
RegisteredMethod matcher in the metadata for use later.
* Client Changes -
* Set the metadata on call creation on whether the method is registered
or not.
Working towards testing against CSM Observability. Added ability to
register a prometheus exporter with our Opentelemetry plugin. This will
allow our metrics to be available at the standard prometheus port
`:9464`.
We have a bunch of experiments testing against core e2e - and this is
good for robustness, bad for CI times.
We also have a bunch of marginal but overall necessary fixtures in the
e2e suites - again good for robustness, bad for CI times.
We can eliminate some of the cross product though, and I think safely:
run experiments on a broad range of suites, but not *ALL* the suites,
and get a bunch of our CI time back.
Here I introduce an environment variable: `GRPC_CI_EXPERIMENTS` that's
set when running bazel @experiment= configs, cleared otherwise (so we
can still execute those tests directly when necessary). When that env
var is set we filter out a bunch of suites from the test configurations.
Original PR was #34307, reverted in #34318 due to internal test
failures.
The first commit is a revert of the revert. The second commit contains
the fix.
The original idea here was that `SubchannelWrapper::Orphan()`, which is
called when the strong refcount reaches 0, would take a new weak ref and
then hop into the `WorkSerializer` before dropping that weak ref, thus
ensuring that the `SubchannelWrapper` is destroyed inside the
`WorkSerializer` (which is needed because the `SubchannelWrapper` dtor
cleans up some state in the channel related to the subchannel). The
problem is that `DualRefCounted<>::Unref()` itself actually increments
the weak ref count before calling `Orphan()` and then decrements it
afterwards. So in the case where the `SubchannelWrapper` is unreffed
outside of the `WorkSerializer` and no other thread happens to be
holding the `WorkSerializer`, the weak ref that we were taking in
`Orphan()` was unreffed inline, which meant that it wasn't actually the
last weak ref -- the last weak ref was the one taken by
`DualRefCounted<>::Unref()`, and it wasn't released until after the
`WorkSerializer` was released.
To this this problem, we move the code from the `SubchannelWrapper` dtor
that cleans up the channel's state into the `WorkSerializer` callback
that is scheduled in `Orphan()`. Thus, regardless of whether or not the
last weak ref is released inside of the `WorkSerializer`, we are
definitely doing that cleanup inside the `WorkSerializer`, which is what
we actually care about.
Also adds an experiment to guard this behavior.
Most recent attempt was #34320, reverted in #34335.
The first commit here is a pure revert. The second commit fixes the
outlier_detection unit test to pass both with and without the
experiment.
In certain situations the current flow control algorithm can result in
sending one flow control update write for every write sent (known
situation: rollout of promise based server calls with qps_test).
Fix things up so that the updates are only sent when truly needed, and
then fix the fallout (turns out our fuzzer had some bugs)
I've placed actual logic changes behind an experiment so that it can be
incrementally & safely rolled out.
This API was [removed in Python
3.12](https://github.com/python/cpython/issues/98040).
Fixes Python 3.12 support in `grpcio` tests.
This is relevant to https://github.com/grpc/grpc/issues/33063.
See also https://github.com/grpc/grpc/pull/33492.
----
I have actually only tested this in a form backported to grpc 1.48.4,
and I am not able to test the change to `bazel/_gevent_test_main.py`
directly. However, the backported form allows me to build grpc 1.48.4
for Fedora Rawhide with Python 3.12, and I believe the version in this
PR to be correct—especially, if CI passes for Python 3.11, I believe
this part of the test code will continue to work in Python 3.12.
We added this as an exploratory measure for a customer that thought they
were using open census (this turned out to be emphatically false).
Remove it since it's probably not how we ultimately want to do this, and
wait for something better to come along.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
- Upgrade bazel
- Reduce the number of places where bazel version needs to be upgraded
in future.
- also make sure the list of bazel versions to test by bazelified tests
is loaded from supported_versions.txt (it was hardcoded before).
- ~~Try upgrading windows RBE build to bazel 6.3.2 as well.~~
The core idea:
- the source of truth for supported bazel versions is in
`bazel/supported_versions.txt`
- the first version listed in `bazel/supported_versions.txt` is
considered to be the "primary" bazel version and is going to be used in
most places thoroughout the repo.
- use templates to include the primary bazel version in testing
dockerfiles and in a newly introduced `.bazelversion` files (which gets
loaded by our existing `tools/bazel` wrapper).
~~Supersedes https://github.com/grpc/grpc/pull/33880~~
The current `py_grpc_library` results in the wrong grpc proto python
code path when grpc is a third-party source code in a Bazel project.
This PR should fix it.
fixes#31011
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: Richard Belleville <rbellevi@google.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Not adding CMake support yet
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
The previous version (`3.12`) is 7 years old and does not support the
newest Python 3 versions. This causes issues to move certain test
targets (which depends on `pyyaml`) to Python 3 when some CI environment
(e.g. `arm64v8/debian:11`) does not have Python 2 installed. And in
general, we should move away from Python 2. Thus, updated `pyyaml` to
the latest version.
This hopefully should also fix the
`prod:grpc/core/master/linux/arm64/grpc_bazel_test_c_cpp` job breakage.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Over the past 5 days, this experiment has not introduced any new flakes,
nor increased any flake rates. Let's enable it for debug builds. To
prevent issues over the weekend, I plan to merge it next week, July 31st
(with announcement).