This pull request adds another hook service on the maintenance server.
This will enable clients to gradually migrate from the standalone hook
server.
Changes:
1. Hook service can now be used separately.
2. Copied latest protos and updated the hook service to new API.
3. Added the hook service to the maintenance server.
Working towards testing against CSM Observability. Added ability to
register a prometheus exporter with our Opentelemetry plugin. This will
allow our metrics to be available at the standard prometheus port
`:9464`.
Since many tests now run reliably as bazelified tests on RBE, we can
remove them from presubmit runs
to speedup testing of PRs.
(for now, these jobs will still run on master, they can be removed from
master as a followup).
- linux/grpc_distribtests_standalone is now fully covered by bazel test
suite
a3b4c797a7/tools/bazelify_tests/test/BUILD (L202),
setting them to `presubmit=False` will stop tests from running on PRs.
- stop running tests from grpc_bazel_distribtest on PR, instead rely on
bazel distribtests running as bazelified tests.
We have a bunch of experiments testing against core e2e - and this is
good for robustness, bad for CI times.
We also have a bunch of marginal but overall necessary fixtures in the
e2e suites - again good for robustness, bad for CI times.
We can eliminate some of the cross product though, and I think safely:
run experiments on a broad range of suites, but not *ALL* the suites,
and get a bunch of our CI time back.
Here I introduce an environment variable: `GRPC_CI_EXPERIMENTS` that's
set when running bazel @experiment= configs, cleared otherwise (so we
can still execute those tests directly when necessary). When that env
var is set we filter out a bunch of suites from the test configurations.
Add some basic metrics to work serializer, keep them process wide for
now (though it may be interesting to get these into channelz in the
future).
Collected are:
- time spent running a work serializer when it starts
- time spent actually executing work when the work serializer runs
- number of items executed each run
A high disparity between the first two indicates our dispatching
mechanism is adding large amounts of latency (perhaps due to thread
starvation like effects).
A high value for any of these indicate contention on the serializer.
It's likely a future iteration on these will select different metrics -
I'm not *entirely* sure which will be useful in production analysis yet.
I'm using `std::chrono::steady_clock` here for precision (nanoseconds)
with a compact representation (better than timespec) and a robust &
portable api - I think it's appropriate for metrics, but wouldn't use it
much beyond that at this point.
This test assumed synchronous work serializer execution (or at least
faster async than we always get)... make a trivial change to keep the
test semantics but allow for the implementation to be more async.
This has been stable for a bit, everywhere that the EventEngine is
enabled. Going forward, I think the event_engine_{client|listener}
experiments can probably be used to regulate thread-pool-specific
issues.
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Move the SSL_CTX to the level of the credentials rather than the
subchannel.
The SSL_CTX should only get created once per credential rather than once
per subchannel.
We should observe no behavior change with this PR, only efficiency
gains.
Most recent attempt was #34320, reverted in #34335.
The first commit here is a pure revert. The second commit fixes the
outlier_detection unit test to pass both with and without the
experiment.
To fix the following build error with the head of abseil
```
/var/local/git/grpc/test/core/tsi/ssl_transport_security_utils_test.cc:231:42: error: no member named 'StrCat' in namespace 'absl'
return absl::InternalError(absl::StrCat("Client error:", client_err));
~~~~~~^
/var/local/git/grpc/test/core/tsi/ssl_transport_security_utils_test.cc:238:42: error: no member named 'StrCat' in namespace 'absl'
return absl::InternalError(absl::StrCat("Server error:", server_err));
~~~~~~^
```
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
The previous approach of generating strings was not converging well.
Instead, load a bitfield from the protobuf and use the bits to select
experiments. The fuzzers can explore this space swiftly.
Downside is that as experiments rotate in/out the corpus gets a bit
messed up, but I'm reasonably confident we'll recover quickly.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
@ctiller PTAL for core configuration changes. I converted the type from
std::function to absl::AnyInvocable. Do you think the functor in
RegisteredBuilder should be callable just once or multiple times?
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Splitting off from https://github.com/grpc/grpc/pull/34273
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
In certain situations the current flow control algorithm can result in
sending one flow control update write for every write sent (known
situation: rollout of promise based server calls with qps_test).
Fix things up so that the updates are only sent when truly needed, and
then fix the fallout (turns out our fuzzer had some bugs)
I've placed actual logic changes behind an experiment so that it can be
incrementally & safely rolled out.
Changes -
1) Change local mesh labels to not be reported on 'started' metrics at
all (even those that we know about) to be consistent. (Since xDS labels
atleast on the server side would not be available on started metric.)
2) Add mesh_id as a local label that is populated by reading the xDS
bootstrap. As part of this, also added a minimal xds bootstrap parsing
logic.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
### Background
* `distutils` is deprecated with removal planned for Python 3.12
([pep-0632](https://peps.python.org/pep-0632/)), thus we're trying to
replace all distutils usage with setuptools.
* Please note that user still have access to `distutils` if setuptools
is installed and `SETUPTOOLS_USE_DISTUTILS` is set to `local` (The
default in setuptools, more details can be found [in this
discussion](https://github.com/pypa/setuptools/issues/2806#issuecomment-1193336591)).
### How we decide the replacement
* We're following setuptools [Porting from Distutils
guide](https://setuptools.pypa.io/en/latest/deprecated/distutils-legacy.html#porting-from-distutils)
when deciding the replacement.
#### Replacement not mentioned in the guide
* Replaced `distutils.utils.get_platform()` with
`sysconfig.get_platform()`.
* Based on the [answer
here](https://stackoverflow.com/questions/71664875/what-is-the-replacement-for-distutils-util-get-platform),
and also checked the document that `sysconfig.get_platform()` is good
enough for our use cases.
* Replaced `DistutilsOptionError` with `OptionError`.
* `setuptools.error` is exporting it as `OptionError` [in the
code](https://github.com/pypa/setuptools/blob/v59.6.0/setuptools/errors.py).
* Upgrade `setuptools` in `test_packages.sh` and changed the version
ping to `59.6.0` in `build_artifact_python.bat`.
* `distutils.errors.*` is not fully re-exported until `59.0.0` (See
[this issue](https://github.com/pypa/setuptools/issues/2698) for more
details).
### Changes not included in this PR
* We're patching some compiler related functions provided by distutils
in our code
([example](ee4efc31c1/src/python/grpcio/_spawn_patch.py (L30))),
but since `setuptools` doesn't have similar interface (See [this issue
for more details](https://github.com/pypa/setuptools/issues/2806)), we
don't have a clear path to replace them yet.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This should address one of the failures we're seeing in #34224.
The test failure is caused by the changes in timing triggering a race
condition. In the code at head, we delay sending out the subscription
for the first CDS watch until we've already seen the other two CDS
watches, because the previous send_message op has not yet completed, and
by the time it does, we've seen all 3 watches, so we can send a
subscription for all 3 at the same time. With the WorkSerializer change,
the send_message op is complete by the time we see the first CDS watch,
so we subscribe to only that resource, and then later add the other two.
The result is that we'll NACK twice with two different messages, the
first one including only the error about the first resource, and the
second one including all three.
I suspect this same race condition would have been triggered eventually
by the EventEngine migration anyway; the current test basically depends
on the single-thread timing of the iomgr approach. So I'm addressing it
by replacing the e2e test with a unit test that covers the same cases
without the timing issue.