`tools/run_tests/sanity/check_absl_mutex.sh` was broken, a missing paren
crashed the script if run locally. It's unclear yet how our sanity
checks were not complaining about this, `run_tests.py` does not save the
log.
I've noticed we add the cleanup hook after setting up the
infrastructure. Thus, if infra setup failed, the cleanup won't work.
This fixes it, and adds extra checks to not call
`cls.test_client_runner` if it's not set.
Fail test if client or server pods restarted during test.
#### Testing
Tested locally, test will fail with message similar to:
```
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/google/home/xuanwn/workspace/xds/grpc/tools/run_tests/xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 501, in tearDown
))
AssertionError: 5 != 0 : Server pods unexpectedly restarted {sever_restarts} times during test.
----------------------------------------------------------------------
Ran 1 test in 886.867s
```
Better logging for `assertRpcStatusCodes`.
(got tired of looking up the status names)
#### Unexpected status found
Before:
```
AssertionError: AssertionError: Expected only status 15 but found status 0 for method UNARY_CALL:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
}
}
```
After:
```
AssertionError: Expected only status (15, DATA_LOSS), but found status (0, OK) for method UNARY_CALL:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
}
}
```
#### No traffic with expected status
Before:
```
AssertionError: 0 not greater than 0
```
After:
```
AssertionError: 0 not greater than 0 : Expected non-zero RPCs with status (15, DATA_LOSS) for method UNARY_CALL, got:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
result {
key: 15
value: 0
}
}
}
```
Before this change, `Found subchannel in state READY` and `Channel to
xds:///psm-grpc-server:61404 transitioned to state ` would dump the full
channel/subchannel, in some implementations that expose
ChannelData.trace (f.e. go) would add 300 extra lines of log.
Now we print a brief repr-like chanel/subchannel info:
```
Found subchannel in state READY: <Subchannel subchannel_id=9 target=10.110.1.44:8080 state=READY>
Channel to xds:///psm-grpc-server:61404 transitioned to state READY: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=READY>
```
Also while waiting for the channel, we log channel_id now too:
```
Waiting to report a READY channel to xds:///psm-grpc-server:61404
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=READY>
```
Similar to what we already do in other test suites:
- Try cleaning up resources three times.
- If unsuccessful, don't fail the test and just log the error. The
cleanup script should be the one to deal with this.
ref b/282081851
Add a new binary that runs all core end2end tests in fuzzing mode.
In this mode FuzzingEventEngine is substituted for the default event
engine. This means that time is simulated, as is IO. The FEE gets
control of callback delays also.
In our tests the `Step()` function becomes, instead of a single call to
`completion_queue_next`, a series of calls to that function and
`FuzzingEventEngine::Tick`, driving forward the event loop until
progress can be made.
PR guide:
---
**New binaries**
`core_end2end_test_fuzzer` - the new fuzzer itself
`seed_end2end_corpus` - a tool that produces an interesting seed corpus
**Config changes for safe fuzzing**
The implementation tries to use the config fuzzing work we've previously
deployed in api_fuzzer to fuzz across experiments. Since some
experiments are far too experimental to be safe in such fuzzing (and
this will always be the case):
- a new flag is added to experiments to opt-out of this fuzzing
- a new hook is added to the config system to allow variables to
re-write their inputs before setting them during the fuzz
**Event manager/IO changes**
Changes are made to the event engine shims so that tcp_server_posix can
run with a non-FD carrying EventEngine. These are in my mind a bit
clunky, but they work and they're in code that we expect to delete in
the medium term, so I think overall the approach is good.
**Changes to time**
A small tweak is made to fix a bug initializing time for fuzzers in
time.cc - we were previously failing to initialize
`g_process_epoch_cycles`
**Changes to `Crash`**
A version that prints to stdio is added so that we can reliably print a
crash from the fuzzer.
**Changes to CqVerifier**
Hooks are added to allow the top level loop to hook the verification
functions with a function that steps time between CQ polls.
**Changes to end2end fixtures**
State machinery moves from the fixture to the test infra, to keep the
customizations for fuzzing or not in one place. This means that fixtures
are now just client/server factories, which is overall nice.
It did necessitate moving some bespoke machinery into
h2_ssl_cert_test.cc - this file is beginning to be problematic in
borrowing parts but not all of the e2e test machinery. Some future PR
needs to solve this.
A cq arg is added to the Make functions since the cq is now owned by the
test and not the fixture.
**Changes to test registration**
`TEST_P` is replaced by `CORE_END2END_TEST` and our own test registry is
used as a first depot for test information.
The gtest version of these tests: queries that registry to manually
register tests with gtest. This ultimately changes the name of our tests
again (I think for the last time) - the new names are shorter and more
readable, so I don't count this as a regression.
The fuzzer version of these tests: constructs a database of fuzzable
tests that it can consult to look up a particular suite/test/config
combination specified by the fuzzer to fuzz against. This gives us a
single fuzzer that can test all 3k-ish fuzzing ready tests and cross
polinate configuration between them.
**Changes to test config**
The zero size registry stuff was causing some problems with the event
engine feature macros, so instead I've removed those and used GTEST_SKIP
in the problematic tests. I think that's the approach we move towards in
the future.
**Which tests are included**
Configs that are compatible - those that do not do fd manipulation
directly (these are incompatible with FuzzingEventEngine), and those
that do not join threads on their shutdown path (as these are
incompatible with our cq wait methodology). Each we can talk about in
the future - fd manipulation would be a significant expansion of
FuzzingEventEngine, and is probably not worth it, however many uses of
background threads now should probably evolve to be EventEngine::Run
calls in the future, and then would be trivially enabled in the fuzzers.
Some tests currently fail in the fuzzing environment, a
`SKIP_IF_FUZZING` macro is used for these few to disable them if in the
fuzzing environment. We'll burn these down in the future.
**Changes to fuzzing_event_engine**
Changes are made to time: an exponential sweep forward is used now -
this catches small time precision things early, but makes decade long
timers (we have them) able to be used right now. In the future we'll
just skip time forward to the next scheduled timer, but that approach
doesn't yet work due to legacy timer system interactions.
Changes to port assignment: we ensure that ports are legal numbers
before assigning them via `grpc_pick_port_or_die`.
A race condition between time checking and io is fixed.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Resolve `TESTING_VERSION` to `dev-VERSION` when the job is initiated by
a user, and not the CI. Override this behavior with setting
`FORCE_TESTING_VERSION`.
This solves the problem with the manual job runs executed against a WIP
branch (f.e. a PR) overriding the tag of the CI-built image we use for
daily testing.
The `dev` and `dev-VERSION` "magic" values supported by the
`--testing_version` flag:
- `dev` and `dev-master` and treated as `master`: all
`config.version_gte` checks resolve to `True`.
- `dev-VERSION` is treated as `VERSION`: `dev-v1.55.x` is treated as
simply `v1.55.x`. We do this so that when manually running jobs for old
branches the feature skip check still works, and unsupported tests are
skipped.
This changes will take care of all langs/branches, no backports needed.
ref b/256845629
Previously the error message didn't provide much context, example:
```py
Traceback (most recent call last):
File "/tmpfs/tmp/tmp.BqlenMyXyk/grpc/tools/run_tests/xds_k8s_test_driver/tests/affinity_test.py", line 127, in test_affinity
self.assertLen(
AssertionError: [] has length of 0, expected 1.
```
ref b/279990584.
This PR implements a work-stealing thread pool for use inside
EventEngine implementations. Because of historical risks here, I've
guarded the new implementation behind an experiment flag:
`GRPC_EXPERIMENTS=work_stealing`. Current default behavior is the
original thread pool implementation.
Benchmarks look very promising:
```
bazel test \
--test_timeout=300 \
--config=opt -c opt \
--test_output=streamed \
--test_arg='--benchmark_format=csv' \
--test_arg='--benchmark_min_time=0.15' \
--test_arg='--benchmark_filter=_FanOut' \
--test_arg='--benchmark_repetitions=15' \
--test_arg='--benchmark_report_aggregates_only=true' \
test/cpp/microbenchmarks:bm_thread_pool
```
2023-05-04: `bm_thread_pool` benchmark results on my local machine (64
core ThreadRipper PRO 3995WX, 256GB memory), comparing this PR to
master:
![image](https://user-images.githubusercontent.com/295906/236315252-35ed237e-7626-486c-acfa-71a36f783d22.png)
2023-05-04: `bm_thread_pool` benchmark results in the Linux RBE
environment (unsure of machine configuration, likely small), comparing
this PR to master.
![image](https://user-images.githubusercontent.com/295906/236317164-2c5acbeb-fdac-4737-9b2d-4df9c41cb825.png)
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: Sergii Tkachenko <hi@sergii.org>
Reverts grpc/grpc#32924. This breaks the build again, unfortunately.
From `test/core/event_engine/cf:cf_engine_test`:
```
error: module .../grpc/test/core/event_engine/cf:cf_engine_test does not depend on a module exporting 'grpc/support/port_platform.h'
```
@sampajano I recommend looking into CI tests to catch iOS problems
before merging. We can enable EventEngine experiments in the CI
generally once this PR lands, but this broken test is not one of those
experiments. A normal build should have caught this.
cc @HannahShiSFB
Makes some awkward fixes to compression filter, call, connected channel
to hold the semantics we have upheld now in tests.
Once the fixes described here
https://github.com/grpc/grpc/blob/master/src/core/lib/channel/connected_channel.cc#L636
are in this gets a lot less ad-hoc, but that's likely going to be
post-landing promises client & server side.
We specifically need special handling for server side cancellation in
response to reads wrt the inproc transport - which doesn't track
cancellation thoroughly enough itself.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
- Fix broken `bin/run_channelz.py` helper
- Create `bin/run_ping_pong.py` helper that runs the baseline (aka
"ping_pong") test against preconfigured infra
- Setup automatic port forwarding when running `bin/run_channelz.py` and
`bin/run_ping_pong.py`
- Create `bin/cleanup_cluster.sh` helper to wipe xds out resources based
namespaces present on the cluster
Note: this involves a small change to the non-helper code, but it's just
moving a the part that makes XdsTestServer/XdsTestClient instance for a
given pod.
Audit logging APIs for both built-in loggers and third-party logger
implementations.
C++ uses using decls referring to C-Core APIs.
---------
Co-authored-by: rockspore <rockspore@users.noreply.github.com>
Third-party loggers will be added in subsequent PRs once the logger
factory APIs are available to validate the configs here.
This registry is used in `xds_http_rbac_filter.cc` to generate service
config json.
Fix at-head tests (this is a missing piece of
https://github.com/grpc/grpc/pull/32905) with the following error;
```
/var/local/git/grpc/tools/run_tests/helper_scripts/build_python.sh: line 126: python3.8: command not found
```
While a proper fix is on the way, this mitigates the number of
duplicated container logs in the xds test server/client pod logs.
The issue is that we only wait between stream restarts when an exception
is caught, which isn't always the reason the stream gets broken. Another
reason is the main container being shut down by k8s. In this situation,
we essentially do
```py
while True:
try:
restart_stream()
read_all_logs_from_pod_start()
except Exception:
logger.warning('error')
wait_seconds(1)
```
This PR makes it
```py
while True:
try:
restart_stream()
read_all_logs_from_pod_start()
except Exception:
logger.warning('error')
finally:
wait_seconds(5)
```
`tearDownClass` is not executed when `setUpClass` failed. In URL Map
test suite, this leads to a test client that failed to start not being
cleaned up.
This PR change the URL Map test suite to register a custom
`addClassCleanup` callback, instead of relying on the `tearDownClass`.
Unlike `tearDownClass`, cleanup callbacks are executed when the
`setUpClass` failed.
ref b/276761453
The PR also creates a separate BUILD target for:
- chttp2 context list
- iomgr buffer_list
- iomgr internal errqueue
This would allow the context list to be included as standalone
dependencies for EventEngine implementations.
As Protobuf is going to support Cord to reduce memory copy when
[de]serializing Cord fields, gRPC is going to leverage it. This
implementation is based on the internal one but it's slightly modified
to use the public APIs of Cord. only
Followup for https://github.com/grpc/grpc/pull/31141.
IWYU and clang-tidy have been "moved" to a separate kokoro job, but as
it turns out the sanity job still runs all of `[sanity, clang-tidy,
iwyu]`, which makes the grpc_sanity jobs very slow.
The issue is that grpc_sanity selects tasks that have "sanity" label on
them and as of now, clang-tidy and iwyu still do.
It can be verified by:
```
tools/run_tests/run_tests_matrix.py -f sanity --dry_run
Will run these tests:
run_tests_sanity_linux_dbg_native: "python3 tools/run_tests/run_tests.py --use_docker -t -j 2 -x run_tests/sanity_linux_dbg_native/sponge_log.xml --report_suite_name sanity_linux_dbg_native -l sanity -c dbg --iomgr_platform native --report_multi_target"
run_tests_clang-tidy_linux_dbg_native: "python3 tools/run_tests/run_tests.py --use_docker -t -j 2 -x run_tests/clang-tidy_linux_dbg_native/sponge_log.xml --report_suite_name clang-tidy_linux_dbg_native -l clang-tidy -c dbg --iomgr_platform native --report_multi_target"
run_tests_iwyu_linux_dbg_native: "python3 tools/run_tests/run_tests.py --use_docker -t -j 2 -x run_tests/iwyu_linux_dbg_native/sponge_log.xml --report_suite_name iwyu_linux_dbg_native -l iwyu -c dbg --iomgr_platform native --report_multi_target"
```
This PR should fix this (be removing the umbrella "sanity" label from
clang-tidy and iwyu)
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
@sampajano
Previously, we didn't configure the failureThreshold, so it used its
default value. The final `startupProbe` looked like this:
```json
{
"startupProbe": {
"failureThreshold": 3,
"periodSeconds": 3,
"successThreshold": 1,
"tcpSocket": {
"port": 8081
},
"timeoutSeconds": 1
}
```
Because of it, the total time before k8s killed the container was 3
times `failureThreshold` * 3 seconds wait between probes `periodSeconds`
= 9 seconds total (±3 seconds waiting for the probe response).
This greatly affected PSM Security test server, some implementations of
which waited for the ADS stream to be configured before starting
listening on the maintenance port. This lead for the server container
being killed for ~7 times before a successful startup:
```
15:55:08.875586 "Killing container with a grace period"
15:53:38.875812 "Killing container with a grace period"
15:52:47.875752 "Killing container with a grace period"
15:52:38.874696 "Killing container with a grace period"
15:52:14.874491 "Killing container with a grace period"
15:52:05.875400 "Killing container with a grace period"
15:51:56.876138 "Killing container with a grace period"
```
These extra delays lead to PSM security tests timing out.
ref b/277336725
The very non-trivial upgrade of third_party/protobuf to 22.x
This PR strives to be as small as possible and many changes that were
compatible with protobuf 21.x and didn't have to be merged atomically
with the upgrade were already merged.
Due to the complexity of the upgrade, this PR wasn't created
automatically by a tool, but manually. Subsequent upgraded of
third_party/protobuf with our OSS release script should work again once
this change is merged.
This is best reviewed commit-by-commit, I tried to group changes in
logical areas.
Notable changes:
- the upgrade of third_party/protobuf submodule, the bazel protobuf
dependency itself
- upgrade of UPB dependency to 22.x (in the past, we used to always
upgrade upb to "main", but upb now has release branch as well). UPB
needs to be upgraded atomically with protobuf since there's a de-facto
circular dependency (new protobuf depends on new upb, which depends on
new protobuf for codegen).
- some protobuf and upb bazel rules are now aliases, so `
extract_metadata_from_bazel_xml.py` and `gen_upb_api_from_bazel_xml.py`
had to be modified to be able to follow aliases and reach the actual
aliased targets.
- some protobuf public headers were renamed, so especially
`src/compiler` needed to be updated to use the new headers.
- protobuf and upb now both depend on utf8_range project, so since we
bundle upb with grpc in some languages, we now have to bundle utf8_range
as well (hence changes in build for python, PHP, objC, cmake etc).
- protoc now depends on absl and utf8_range (previously protobuf had
absl dependency, but not for the codegen part), so python's
make_grpcio_tools.py required partial rewrite to be able to handle those
dependencies in the grpcio_tools build.
- many updates and fixes required for C++ distribtests (currently they
all pass, but we'll probably need to follow up, make protobuf's and
grpc's handling of dependencies more aligned and revisit the
distribtests)
- bunch of other changes mostly due to overhaul of protobuf's and upb's
internal build layout.
TODOs:
- [DONE] make sure IWYU and clang_tidy_code pass
- create a list of followups (e.g. work to reenable the few tests I had
to disable and to remove workaround I had to use)
- [DONE in cl/523706129] figure out problem(s) with internal import
---------
Co-authored-by: Craig Tiller <ctiller@google.com>
Followup for https://github.com/grpc/grpc/pull/32649 (which disabled the
tests mentioned below).
Also sets correct path for tests build by ninja on windows, so that they
don't get skipped.
Once merged, I'll backport to 1.54.x and 1.53.x
The original issue with tests being skipped.
```
+ python3 workspace_c_windows_dbg_native/tools/run_tests/run_tests.py -t -j 8 -x run_tests/c_windows_dbg_native/sponge_log.xml --report_suite_name c_windows_dbg_native -l c -c dbg --iomgr_platform native --bq_result_table aggregate_results --measure_cpu_costs
2023-03-20 07:56:53,523 START: tools\run_tests\helper_scripts\build_cxx.bat
2023-03-20 08:04:51,388 PASSED: tools\run_tests\helper_scripts\build_cxx.bat [time=477.9sec, retries=0:0; cpu_cost=0.0; estimated=1.0]
2023-03-20 08:04:52,434 detected port server running version 21
2023-03-20 08:04:52,672 my port server is version 21
2023-03-20 08:04:52,703 SUCCESS: All tests passed
WARNING: binary not found, skipping cmake/build/Debug/bad_server_response_test.exe
WARNING: binary not found, skipping cmake/build/Debug/connection_refused_test.exe
WARNING: binary not found, skipping cmake/build/Debug/goaway_server_test.exe
WARNING: binary not found, skipping cmake/build/Debug/invalid_call_argument_test.exe
WARNING: binary not found, skipping cmake/build/Debug/multiple_server_queues_test.exe
WARNING: binary not found, skipping cmake/build/Debug/no_server_test.exe
WARNING: binary not found, skipping cmake/build/Debug/pollset_windows_starvation_test.exe
WARNING: binary not found, skipping cmake/build/Debug/public_headers_must_be_c89.exe
```
Notes:
- `+trace` fixtures haven't run since 2016, so they're disabled for now
(7ad2d0b463 (diff-780fce7267c34170c1d0ea15cc9f65a7f4b79fefe955d185c44e8b3251cf9e38R76))
- all current fixtures define `FEATURE_MASK_SUPPORTS_AUTHORITY_HEADER`
and hence `authority_not_supported` has not been run in years - deleted
- bad_hostname similarly hasn't been triggered in a long while, so
deleted
- load_reporting_hook has never been enabled, so deleted
(f23fb4cf31/test/core/end2end/generate_tests.bzl (L145-L148))
- filter_latency & filter_status_code rely on global variables and so
don't convert particularly cleanly - and their value seems marginal, so
deleted
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Built atop #31448
Offers a simple framework for testing filters.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>