<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
@sampajano
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This change allows metadata keys to set the appropriate compression
algorithm in the hpack encoder, without needing to change the source
text of the hpack encoder.
We'll leverage this with #32650 to allow some important but
Google-internal metadata to be compressed appropriately.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This bug occurred when the same xDS server was configured twice in the
same bootstrap config, once in an authority and again as the top-level
server. In that case, we were incorrectly failing to de-dup them and
were creating a separate channel for the LRS stream than the one that
already existed for the ADS stream. We fix this by canonicalizing the
server keys the same way in both cases.
As a separate follow-up item, I will work on trying to find a better way
to key these maps that does not suffer from this kind of fragility.
Previously, we didn't configure the failureThreshold, so it used its
default value. The final `startupProbe` looked like this:
```json
{
"startupProbe": {
"failureThreshold": 3,
"periodSeconds": 3,
"successThreshold": 1,
"tcpSocket": {
"port": 8081
},
"timeoutSeconds": 1
}
```
Because of it, the total time before k8s killed the container was 3
times `failureThreshold` * 3 seconds wait between probes `periodSeconds`
= 9 seconds total (±3 seconds waiting for the probe response).
This greatly affected PSM Security test server, some implementations of
which waited for the ADS stream to be configured before starting
listening on the maintenance port. This lead for the server container
being killed for ~7 times before a successful startup:
```
15:55:08.875586 "Killing container with a grace period"
15:53:38.875812 "Killing container with a grace period"
15:52:47.875752 "Killing container with a grace period"
15:52:38.874696 "Killing container with a grace period"
15:52:14.874491 "Killing container with a grace period"
15:52:05.875400 "Killing container with a grace period"
15:51:56.876138 "Killing container with a grace period"
```
These extra delays lead to PSM security tests timing out.
ref b/277336725
This maybe used to quickly verify the code coverage of a modified test
locally (e.g. fuzzer).
Example:
```
# Build and run target; the raw profile will be written to $LLVM_PROFILE_FILE when the program exits
$ bazel build --config=dbg --config=fuzzer_asan --config=coverage //test/core/end2end/fuzzers:api_fuzzer
$ LLVM_PROFILE_FILE="api_fuzzer.profraw" bazel-bin/test/core/end2end/fuzzers/api_fuzzer test/core/end2end/fuzzers/api_fuzzer_corpus/*
# Create coverage report
$ llvm-profdata-14 merge -sparse api_fuzzer.profraw -o api_fuzzer.profdata
$ llvm-cov-14 report ./bazel-bin/test/core/end2end/fuzzers/api_fuzzer --instr-profile=api_fuzzer.profdata
```
Sample report:
f94e444f25/gistfile1.txt
One trick is that the binary needs to be statically linked, e.g. specify
`linkstatic = 1` on the BUILD target.
See https://clang.llvm.org/docs/SourceBasedCodeCoverage.html for more
info.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
We shouldn't depend on how much the compression algorithm compresses the
bytes to. This is causing flakiness internally.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
It already started hitting the limit resulting in continuous failure.
https://github.com/grpc/grpc/pull/32603 is believed to contribute to
this time increase but let's bump it first and visit this issue later.
The very non-trivial upgrade of third_party/protobuf to 22.x
This PR strives to be as small as possible and many changes that were
compatible with protobuf 21.x and didn't have to be merged atomically
with the upgrade were already merged.
Due to the complexity of the upgrade, this PR wasn't created
automatically by a tool, but manually. Subsequent upgraded of
third_party/protobuf with our OSS release script should work again once
this change is merged.
This is best reviewed commit-by-commit, I tried to group changes in
logical areas.
Notable changes:
- the upgrade of third_party/protobuf submodule, the bazel protobuf
dependency itself
- upgrade of UPB dependency to 22.x (in the past, we used to always
upgrade upb to "main", but upb now has release branch as well). UPB
needs to be upgraded atomically with protobuf since there's a de-facto
circular dependency (new protobuf depends on new upb, which depends on
new protobuf for codegen).
- some protobuf and upb bazel rules are now aliases, so `
extract_metadata_from_bazel_xml.py` and `gen_upb_api_from_bazel_xml.py`
had to be modified to be able to follow aliases and reach the actual
aliased targets.
- some protobuf public headers were renamed, so especially
`src/compiler` needed to be updated to use the new headers.
- protobuf and upb now both depend on utf8_range project, so since we
bundle upb with grpc in some languages, we now have to bundle utf8_range
as well (hence changes in build for python, PHP, objC, cmake etc).
- protoc now depends on absl and utf8_range (previously protobuf had
absl dependency, but not for the codegen part), so python's
make_grpcio_tools.py required partial rewrite to be able to handle those
dependencies in the grpcio_tools build.
- many updates and fixes required for C++ distribtests (currently they
all pass, but we'll probably need to follow up, make protobuf's and
grpc's handling of dependencies more aligned and revisit the
distribtests)
- bunch of other changes mostly due to overhaul of protobuf's and upb's
internal build layout.
TODOs:
- [DONE] make sure IWYU and clang_tidy_code pass
- create a list of followups (e.g. work to reenable the few tests I had
to disable and to remove workaround I had to use)
- [DONE in cl/523706129] figure out problem(s) with internal import
---------
Co-authored-by: Craig Tiller <ctiller@google.com>
This enables the EventEngine experiments in end2end tests, excluding the
ResourceQuota tests which have known failures.
Some Windows tests are hanging, so they will be enabled later.
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
With this change, gpr_now(GPR_CLOCK_MONOTONIC) adds 5 seconds of padding
to the returned timestamp on Linux as well as iOS. This makes gRPC start
quickly on recently started machines (e.g. freshly booted micro-VMs).
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
If an engine is created, it should be fully functional regardless of
whether gRPC-core experiments are on or off. The trade-off for now is
that when the core experiments are not enabled, the engine will be
slowly polling with nothing to do.
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
Since we are adding an annotation for experiments internally, we need to
ensure the length of the annotation does not exceed 2 KB (used slightly
lower number in the check). Added a check to gen_experiments.py for
this.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This changes the `JSON` class to use `absl::variant<>` internally for
storing the type and value.
The only API-visible change here is that now the accessor methods will
crash if you use them on the wrong type, courtesy of `absl::get<>`. This
does not appear to break any tests, so it appears that we have already
cleaned up all of the cases we had in the past where we might have done
something like that.
Expand server promises to run with C++ end2end tests.
Across connected_channel/call/batch_builder/pipe/transport:
- fix a bug where read errors weren't propagated from transport to call
so that we can populate failed_before_recv_message for the c++ bindings
- ensure those errors are not, however, used to populate the returned
call status
Add a new latch call arg to lazily propagate the bound CQ for a server
call (and client call, but here it's used degenerately - it's always
populated). This allows server calls to be properly bound to
pollsets.(1)/(2)
In call.cc:
- move some profiling code from FilterStackCall to Call, and then use it
in PromiseBasedCall (this should be cleaned up with tracing work)
- implement GetServerAuthority
In server.cc:
- use an RAII pattern on `MatchResult` to avoid a bug whereby a tag
could be dropped if we cancel a request after it's been matched but
before it's published
- fix deadline export to ServerContext
In resource_quota_server.cc:
- fix some long standing flakes (that were finally obvious with the new
test code) - it's legal here to have client calls not arrive at the
server due to resource starvation, work through that (includes adding
expectations during a `Step` call, which required some small tweaks to
cq_verifier)
In the C++ end2end_test.cc:
- strengthen a flaky test so it passes consistently (it's likely we'll
revisit this with the fuzzing efforts to strengthen it into an actually
robust test)
(1) It's time to remove this concept
(2) Surprisingly the only test that *reliably* demonstrates this not
being done is time_change_test
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Various fixes that can be merged in advance, to simplify the 22.x
upgrade.
- Use NOMINMAX for grpcio_tools build to avoid build failure when absl
is added as a dependency. Fixes
https://github.com/grpc/grpc/issues/32779
- in spawn_patch command file, put arguments on multiple lines to avoid
lines going over the limit for large link commands: Fixes
https://github.com/grpc/grpc/issues/32795
- Use `python/generator.h` instead of the deprecated
`python/python_generator.h` in grpcio_tools.
These need to run at about the same times, and each runs quickly.
Doesn't seem worthwhile to fire up a VM for each of them individually.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This error can trigger for either initial or trailing metadata (and
we've had outages where the latter was the cause).
I don't think we know at this layer if we're parsing initial or trailing
- though it'd be a good exercise to plumb that through.
For now remove the word initial because it's better to give less
information than wrong information.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This reverts commit 4b46dbc19e.
Reason: this seems to be breaking load reports in certain cases,
b/276944116
Let's revert so this doesn't accidentally get released.
I added the `CallDispatchController` interface back in #26200 in
preparation for supporting the retry circuit breaker in xDS. However, we
subsequently decided not to support that feature, so we no longer need
this interface. And switching back to a simple on-commit callback should
make it easier to implement stateful session affinity for
WeightedClusters.
### Description
When invoking a RPC, sometimes operations in core will fail which result
in an `ExecuteBatchError`, currently we [simply raise this
error](https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi#L705),
and left this Exception unhandled in event loop with error message `Task
exception was never retrieved`.
### Changes included
This PR changes the behavior to log the `ExecuteBatchError` instead of
raising it.
This shouldn't change existing behavior because even without this
change, the exception will simply left on event loop.
### Testing
Tested by manually `set_expection` in [this
future](https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi#L84),
verified that error was logged correctly and no exception was left in
event loop, sample log:
```
ERROR:grpc._cython.cygrpc:ExecuteBatchError raised in core by servicer method [/grpc.testing.TestService/UnaryCall]
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 682, in _cython.cygrpc._handle_exceptions
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 802, in _handle_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 547, in _handle_unary_unary_rpc
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 452, in _finish_handler_with_unary_response
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 106, in execute_wrong_batch
_cython.cygrpc.ExecuteBatchError: Failed "execute_batch_including_SendInitialMetadataOperation": (<_cython.cygrpc.SendInitialMetadataOperation object at 0x7fa86d936ca0>, <_cython.cygrpc.SendMessageOperation object at 0x7fa86cf48cb0>, <_cython.cygrpc.SendStatusFromServerOperation object at 0x7fa86cf4c110>)
```
This PR aims to de-experimentalize the APIs for GCP Observability.
We would have ideally wanted public feedback before declaring the APIs
stable, but we need stable APIs for GA.
Changes made after API review with @markdroth@veblush, @ctiller and the
entire Core/C++ team -
* The old experimental APIs `grpc::experimental::GcpObservabilityInit`
and `grpc::experimental::GcpObservabilityClose` are now deprecated and
will be deleted after v.1.55 release.
* The new API gets rid of the Close method and follows the RAII idiom
with a single `grpc::GcpObservability::Init()` call that returns an
`GcpObservability` object, the lifetime of which controls when
observability data is flushed.
* The `GcpObservability` class could in the future add more methods. For
example, a debug method that shows the current configuration.
* Document that GcpObservability initialization and flushing (on
`GcpObservability` destruction) are blocking calls.
* Document that gRPC is still usable if GcpObservability initialization
failed. (Added a test to prove the same).
* Since we don't have a good way to flush stats and tracing with
OpenCensus, the examples required users to sleep for 25 seconds. This
sleep is now part of `GcpObservability` destruction.
Additional Implementation details -
* `GcpObservability::Init` is now marked with `GRPC_MUST_USE_RESULT` to
make sure that the results are used. We ideally want users to store it,
but this is better than nothing.
* Added a note on GCP Observability lifetime guarantees.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Fix python type hints for async iteration on `UnaryStreamCall` and
`StreamStreamCall`. Those classes _are_ `AsyncIterable`s and `__aiter__`
returns an `AsyncIterator`.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Followup for https://github.com/grpc/grpc/pull/32649 (which disabled the
tests mentioned below).
Also sets correct path for tests build by ninja on windows, so that they
don't get skipped.
Once merged, I'll backport to 1.54.x and 1.53.x
The original issue with tests being skipped.
```
+ python3 workspace_c_windows_dbg_native/tools/run_tests/run_tests.py -t -j 8 -x run_tests/c_windows_dbg_native/sponge_log.xml --report_suite_name c_windows_dbg_native -l c -c dbg --iomgr_platform native --bq_result_table aggregate_results --measure_cpu_costs
2023-03-20 07:56:53,523 START: tools\run_tests\helper_scripts\build_cxx.bat
2023-03-20 08:04:51,388 PASSED: tools\run_tests\helper_scripts\build_cxx.bat [time=477.9sec, retries=0:0; cpu_cost=0.0; estimated=1.0]
2023-03-20 08:04:52,434 detected port server running version 21
2023-03-20 08:04:52,672 my port server is version 21
2023-03-20 08:04:52,703 SUCCESS: All tests passed
WARNING: binary not found, skipping cmake/build/Debug/bad_server_response_test.exe
WARNING: binary not found, skipping cmake/build/Debug/connection_refused_test.exe
WARNING: binary not found, skipping cmake/build/Debug/goaway_server_test.exe
WARNING: binary not found, skipping cmake/build/Debug/invalid_call_argument_test.exe
WARNING: binary not found, skipping cmake/build/Debug/multiple_server_queues_test.exe
WARNING: binary not found, skipping cmake/build/Debug/no_server_test.exe
WARNING: binary not found, skipping cmake/build/Debug/pollset_windows_starvation_test.exe
WARNING: binary not found, skipping cmake/build/Debug/public_headers_must_be_c89.exe
```
This is a prerequisite for upgrading to protobuf 22.x
(upb and protobuf now depend on utf8_range)
Currently utf8_range isn't referenced by anything, but it's better to
bring the subtree in advance to make the protobuf upgrade PR smaller.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Notes:
- `+trace` fixtures haven't run since 2016, so they're disabled for now
(7ad2d0b463 (diff-780fce7267c34170c1d0ea15cc9f65a7f4b79fefe955d185c44e8b3251cf9e38R76))
- all current fixtures define `FEATURE_MASK_SUPPORTS_AUTHORITY_HEADER`
and hence `authority_not_supported` has not been run in years - deleted
- bad_hostname similarly hasn't been triggered in a long while, so
deleted
- load_reporting_hook has never been enabled, so deleted
(f23fb4cf31/test/core/end2end/generate_tests.bzl (L145-L148))
- filter_latency & filter_status_code rely on global variables and so
don't convert particularly cleanly - and their value seems marginal, so
deleted
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
When using this internally, we noticed that it's impossible to use
custom_metadata.h without creating a dependency cycle between
:custom_metadata and :grpc_base.
A full build refactoring is too large right now, so merge that header
into :grpc_base for the time being.
Also, separate `SimpleSliceBasedMetadata` into its own file, so that it
can be reused in custom_metadata.h.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>