<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
@sampajano
Previously, we didn't configure the `failureThreshold`, so it used its
default value. The final `startupProbe` looked like this:
```json
{
  "startupProbe": {
    "failureThreshold": 3,
    "periodSeconds": 3,
    "successThreshold": 1,
    "tcpSocket": {
      "port": 8081
    },
    "timeoutSeconds": 1
  }
}
```
Because of this, the total time before k8s killed the container was 3
failed probes (`failureThreshold`) * 3 seconds between probes
(`periodSeconds`) = 9 seconds total (±3 seconds waiting for the probe
response).
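As a rough check of that arithmetic, the budget can be computed from the probe fields. This is a minimal sketch: the helper name `startup_budget_seconds` is hypothetical, and reading the ±3 seconds as up to `timeoutSeconds` per failed probe is an assumption rather than something stated above.

```python
def startup_budget_seconds(probe: dict) -> tuple[int, int]:
    """Return (base, slack) in seconds for a startupProbe spec.

    base:  failureThreshold * periodSeconds
    slack: failureThreshold * timeoutSeconds (assumed meaning of the ±3 s)
    """
    base = probe["failureThreshold"] * probe["periodSeconds"]
    slack = probe["failureThreshold"] * probe["timeoutSeconds"]
    return base, slack


# The probe values from the JSON above.
old_probe = {
    "failureThreshold": 3,
    "periodSeconds": 3,
    "successThreshold": 1,
    "timeoutSeconds": 1,
}

base, slack = startup_budget_seconds(old_probe)
```

With the values above, `base` is 9 and `slack` is 3.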
This greatly affected the PSM Security test server, some implementations
of which waited for the ADS stream to be configured before starting to
listen on the maintenance port. This led to the server container
being killed ~7 times before a successful startup:
```
15:55:08.875586 "Killing container with a grace period"
15:53:38.875812 "Killing container with a grace period"
15:52:47.875752 "Killing container with a grace period"
15:52:38.874696 "Killing container with a grace period"
15:52:14.874491 "Killing container with a grace period"
15:52:05.875400 "Killing container with a grace period"
15:51:56.876138 "Killing container with a grace period"
```
These extra delays led to PSM security tests timing out.
ref b/277336725
The very non-trivial upgrade of third_party/protobuf to 22.x
This PR strives to be as small as possible and many changes that were
compatible with protobuf 21.x and didn't have to be merged atomically
with the upgrade were already merged.
Due to the complexity of the upgrade, this PR wasn't created
automatically by a tool, but manually. Subsequent upgrades of
third_party/protobuf with our OSS release script should work again once
this change is merged.
This is best reviewed commit-by-commit; I tried to group changes into
logical areas.
Notable changes:
- the upgrade of third_party/protobuf submodule, the bazel protobuf
dependency itself
- upgrade of UPB dependency to 22.x (in the past, we used to always
upgrade upb to "main", but upb now has a release branch as well). UPB
needs to be upgraded atomically with protobuf since there's a de-facto
circular dependency (new protobuf depends on new upb, which depends on
new protobuf for codegen).
- some protobuf and upb bazel rules are now aliases, so
`extract_metadata_from_bazel_xml.py` and `gen_upb_api_from_bazel_xml.py`
had to be modified to be able to follow aliases and reach the actual
aliased targets.
- some protobuf public headers were renamed, so especially
`src/compiler` needed to be updated to use the new headers.
- protobuf and upb now both depend on utf8_range project, so since we
bundle upb with grpc in some languages, we now have to bundle utf8_range
as well (hence changes in build for python, PHP, objC, cmake etc).
- protoc now depends on absl and utf8_range (previously protobuf had
absl dependency, but not for the codegen part), so python's
make_grpcio_tools.py required partial rewrite to be able to handle those
dependencies in the grpcio_tools build.
- many updates and fixes required for C++ distribtests (currently they
all pass, but we'll probably need to follow up, make protobuf's and
grpc's handling of dependencies more aligned and revisit the
distribtests)
- a bunch of other changes, mostly due to the overhaul of protobuf's and
upb's internal build layout.
TODOs:
- [DONE] make sure IWYU and clang_tidy_code pass
- create a list of followups (e.g. work to reenable the few tests I had
to disable and to remove the workarounds I had to use)
- [DONE in cl/523706129] figure out problem(s) with internal import
---------
Co-authored-by: Craig Tiller <ctiller@google.com>
Followup for https://github.com/grpc/grpc/pull/32649 (which disabled the
tests mentioned below).
Also sets the correct path for tests built by ninja on windows, so that
they don't get skipped.
Once merged, I'll backport to 1.54.x and 1.53.x
The original issue with tests being skipped:
```
+ python3 workspace_c_windows_dbg_native/tools/run_tests/run_tests.py -t -j 8 -x run_tests/c_windows_dbg_native/sponge_log.xml --report_suite_name c_windows_dbg_native -l c -c dbg --iomgr_platform native --bq_result_table aggregate_results --measure_cpu_costs
2023-03-20 07:56:53,523 START: tools\run_tests\helper_scripts\build_cxx.bat
2023-03-20 08:04:51,388 PASSED: tools\run_tests\helper_scripts\build_cxx.bat [time=477.9sec, retries=0:0; cpu_cost=0.0; estimated=1.0]
2023-03-20 08:04:52,434 detected port server running version 21
2023-03-20 08:04:52,672 my port server is version 21
2023-03-20 08:04:52,703 SUCCESS: All tests passed
WARNING: binary not found, skipping cmake/build/Debug/bad_server_response_test.exe
WARNING: binary not found, skipping cmake/build/Debug/connection_refused_test.exe
WARNING: binary not found, skipping cmake/build/Debug/goaway_server_test.exe
WARNING: binary not found, skipping cmake/build/Debug/invalid_call_argument_test.exe
WARNING: binary not found, skipping cmake/build/Debug/multiple_server_queues_test.exe
WARNING: binary not found, skipping cmake/build/Debug/no_server_test.exe
WARNING: binary not found, skipping cmake/build/Debug/pollset_windows_starvation_test.exe
WARNING: binary not found, skipping cmake/build/Debug/public_headers_must_be_c89.exe
```
Notes:
- `+trace` fixtures haven't run since 2016, so they're disabled for now
(7ad2d0b463 (diff-780fce7267c34170c1d0ea15cc9f65a7f4b79fefe955d185c44e8b3251cf9e38R76))
- all current fixtures define `FEATURE_MASK_SUPPORTS_AUTHORITY_HEADER`
and hence `authority_not_supported` has not been run in years - deleted
- bad_hostname similarly hasn't been triggered in a long while, so
deleted
- load_reporting_hook has never been enabled, so deleted
(f23fb4cf31/test/core/end2end/generate_tests.bzl (L145-L148))
- filter_latency & filter_status_code rely on global variables and so
don't convert particularly cleanly - and their value seems marginal, so
deleted
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Built atop #31448
Offers a simple framework for testing filters.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This reverts commit 7bd9267f32.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Earlier, we were simply using a 64-bit random number, but the spec
actually calls for UUIDv4.
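To illustrate the difference, here is a minimal Python sketch (not the language of the actual change). The function names `old_style_id` and `uuid4_id` are hypothetical, and the hex rendering of the 64-bit number is an assumption made for display only.

```python
import random
import uuid


def old_style_id() -> str:
    """A 64-bit random number rendered as 16 hex digits (the earlier approach)."""
    return f"{random.getrandbits(64):016x}"


def uuid4_id() -> str:
    """A random (version 4) UUID in canonical 8-4-4-4-12 text form."""
    return str(uuid.uuid4())
```

A UUIDv4 carries 122 random bits plus fixed version and variant bits, so a consumer that validates the format would reject a bare 64-bit value.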
This allows us to replace `absl::optional<TaskHandle>` with checks
against the invalid handle.
This PR also replaces the differently-named invalid handle instances
with a uniform way of accessing static invalid instances across all
handle types, which aids a bit in testing.
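The pattern can be sketched in Python as follows (the real change is in C++). Everything here is illustrative: the `keys` field, the `(-1, -1)` sentinel value, and the `find_task` helper are assumptions, not the actual gRPC definitions.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class TaskHandle:
    keys: tuple[int, int]
    # One uniformly named invalid instance per handle type.
    INVALID: ClassVar["TaskHandle"]


# Assigned after the class body because the sentinel is itself a TaskHandle.
TaskHandle.INVALID = TaskHandle(keys=(-1, -1))


def find_task(tasks: dict[str, TaskHandle], name: str) -> TaskHandle:
    """Return the handle for `name`, or the invalid sentinel instead of None."""
    return tasks.get(name, TaskHandle.INVALID)
```

Callers then compare against `TaskHandle.INVALID` rather than unwrapping an optional, and tests can refer to the same sentinel name for every handle type.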
Add an event manager that spawns as many threads as it possibly
can... to expose TSAN to the myriad thread ordering problems in our code
base.
Next steps for this will be to add a new test mode for tsan + thready
event engine + a few other doodads to increase threads in the system
(party.cc in particular has a good place for a hook).
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
(part of removing support for VS2017)
Also see https://github.com/grpc/grpc/pull/32649
Also see https://github.com/grpc/grpc/pull/32615
The switch to grpc-win2019 windows workers has already happened:
(cl/517400022).
Once this PR lands, I'll backport to 1.53.x branch as well (since that
release removes the VS2017 support).
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
Initial PR to establish a bazel dependency on
https://github.com/google/fuzztest, on which I'm planning to base a
hardening program.
Casting a relatively wide net with reviewers: I'm genuinely interested
in feedback on building up the docs and on the general ergonomics of
this change.
I've located relevant files in the `fuzztest/...` directory. The tests
only build with the `--config fuzztest` bazel argument for now (because
of needing C++17), so locating them separately keeps `bazel test
test/...` working as it does today. In a few years' time, when we adopt
C++17, we'll be able to rationalize the test directories a little bit.
We'll need to add some kokoro jobs (maybe with this PR?) to execute the
relevant tests.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This was missed from https://github.com/grpc/grpc/pull/32596, resulting
in k8s teardown exiting early with:
```
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: OMITTED
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.apps \"psm-grpc-client\" not found","reason":"NotFound","details":{"name":"psm-grpc-client","group":"apps","kind":"deployments"},"code":404}
```
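The 404 above is raised by the delete call and aborts the rest of the teardown. A minimal sketch of tolerating it is shown here; the `ApiException` class is a local stand-in for `kubernetes.client.exceptions.ApiException` (only its `status` and `reason` attributes are modeled), and `delete_ignoring_not_found` is a hypothetical helper name, not the framework's actual function.

```python
import logging


class ApiException(Exception):
    """Local stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status: int, reason: str):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


def delete_ignoring_not_found(delete_fn, name: str) -> bool:
    """Call delete_fn(name); log and continue if the resource is already gone.

    Returns True if the delete call succeeded, False on a 404.
    Any other API error is re-raised.
    """
    try:
        delete_fn(name)
        return True
    except ApiException as e:
        if e.status != 404:
            raise
        logging.info("Deployment %s deletion failed: %s", name, e.reason)
        return False
```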
With this change, teardown resumes with:
> `k8s_base_runner.py:282] Deployment psm-grpc-client deletion failed:
Kubernetes API returned 404 Not Found: deployments.apps
"psm-grpc-client" not found`
Fix incompatibilities identified when running adhoc runs on the new
custom win2019 image.
After merging this, it should be possible to switch to the new image
without breaking any tests.
- for most fixes I added a comment that explains why they're necessary.
- the new image won't have VS2015 installed, so I'm switching the protoc
artifact build to VS2017
This PR will need to be backported to older release branches to ensure
the windows tests continue working on those branches as well (IMHO I
haven't made any changes that would be difficult to backport and I tried
to keep the diff as small as possible to avoid issues when
backporting).
After we switch to the new image (and all the windows tests are green),
we can incrementally move the builds that are still using VS2017 to
VS2019.
This is a big rewrite of global config.
It does a few things, all somewhat intertwined:
1. centralize the list of configuration variables we have in a yaml file
that can be parsed, and generate code from it
2. add an initialization and a reset stage so that config vars can be
centrally accessed very quickly without the need for caching them
3. make the syntax more C++ like (fewer macros!)
4. (optionally) add absl flags to the OSS build
This first round of changes is intended to keep the system where it is
without major changes. We pick up absl flags to match internal code and
remove one point of deviation - but importantly continue to read from
the environment variables. In doing so we don't force absl flags on our
customers - it's possible to configure grpc without the flags - but
instead allow users that do use absl flags to configure grpc using that
mechanism. Importantly this lets internal customers configure grpc the
same everywhere.
Future changes along this path will be two-fold:
1. Move documentation generation into the code generation step, so that
within the source of truth yaml file we can find all documentation and
data about a configuration knob - eliminating the chance of forgetting
to document something in all the right places.
2. Provide fuzzing over configurations. Currently most config variables
get stashed in static constants across the codebase. To fuzz over these
we'd need a way to reset those cached values between fuzzing rounds,
something that is terrifically difficult right now, but with these
changes should simply be a reset on `ConfigVars`.
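The load-once, reset-between-rounds idea can be sketched in Python as follows (the real implementation is generated C++). The class below is only an illustration: the two fields and the `get`/`reset` method names are assumptions, though `GRPC_VERBOSITY` and `GRPC_TRACE` are existing gRPC environment variables.

```python
import os


class ConfigVars:
    """Central config snapshot: read the environment once, reset on demand."""

    _instance = None

    def __init__(self, verbosity: str, trace: str):
        self.verbosity = verbosity
        self.trace = trace

    @classmethod
    def get(cls) -> "ConfigVars":
        # Initialization stage: build the snapshot on first access only.
        if cls._instance is None:
            cls._instance = cls(
                verbosity=os.environ.get("GRPC_VERBOSITY", ""),
                trace=os.environ.get("GRPC_TRACE", ""),
            )
        return cls._instance

    @classmethod
    def reset(cls) -> None:
        # Reset stage: drop the snapshot so the next get() re-reads everything.
        cls._instance = None
```

Because every reader goes through `ConfigVars.get()`, a fuzzer only has to change the inputs and call `ConfigVars.reset()` between rounds instead of hunting down cached copies.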
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
- Increase kubernetes library default for urllib3 retries to 10
- Add custom retry logic to all API calls made by framework.k8s
Custom retry logic handles various errors we've experienced over
two years, based on ~140 failure reports:
1. Errors returned by the k8s API server itself:
- 401 Unauthorized
- 409 Conflict
- 429 Too Many Requests
- 500 Internal Server Error
2. Connection errors that might indicate k8s API server is temporarily
unavailable (such as a restart, upgrade, etc):
- All `NewConnectionError`s, e.g. "Connection timed out",
"Connection refused"
- All "connection aborted" `ProtocolError`s, e.g. "Remote end
closed connection without response", "Connection reset by peer"
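The classification above can be sketched roughly as follows. This is not the framework's actual code: `ApiError`, `is_retryable`, and `call_with_retries` are hypothetical names, and Python's built-in `ConnectionError` stands in for the urllib3 `NewConnectionError` and `ProtocolError` cases.

```python
import time

# Status codes listed above as worth retrying.
RETRYABLE_STATUSES = frozenset({401, 409, 429, 500})


class ApiError(Exception):
    """Hypothetical stand-in for the k8s client's API exception."""

    def __init__(self, status: int):
        super().__init__(f"API error {status}")
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Decide whether an error falls into one of the two groups above."""
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUSES
    # Built-in ConnectionError stands in for urllib3's connection errors.
    return isinstance(exc, ConnectionError)


def call_with_retries(fn, attempts: int = 3, delay_sec: float = 0.0):
    """Call fn(), retrying retryable errors up to `attempts` calls in total."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors outside the lists.
            if attempt == attempts or not is_retryable(exc):
                raise
            time.sleep(delay_sec)
```

A fixed delay keeps the sketch short; a real retry loop would more likely back off between attempts.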
ref b/178378578, b/258546394
PSM Interop: various local dev improvements
- Cleanup resources on ctrl+c
- Add startup probes to address the issue with port forwarding starting
before the workload listens on a port
- Remove misleading restartPolicy: it's silently ignored by k8s
- Extra debug message with port-forwarding command
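For the startup probe item, the condition being waited on is simply "something accepts TCP connections on the port". A minimal sketch of that check, assuming a hypothetical helper name `port_is_listening` (the probe itself is performed by k8s, not by code like this):

```python
import socket


def port_is_listening(host: str, port: int, timeout_sec: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Starting port forwarding only once such a check succeeds avoids forwarding to a workload that is not listening yet.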
PR #32215 added the verified root cert subject to the lower level
`tsi_peer`. This PR is a companion to that and completes the feature by
bubbling the information up to the `TsiCustomVerificationCheckRequest`
which is part of the user facing API for implementing custom
verification callbacks.
There are potentially surprising deployment bugs that can cause `EMFILE`
to be hit. For example, file descriptor limits can be easily reached if
- the round robin LB policy is used
- the load balancer hands out an assignment with a lot of backends
- Debian's default 1024 file descriptor limit is in effect.
To make such problems more apparent, we can pay special attention to
this error and log ERROR when it happens.
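In Python terms (the actual change is in the C++ core), the idea looks roughly like this. The helper name `log_if_fd_exhausted` and the log wording are hypothetical.

```python
import errno
import logging


def log_if_fd_exhausted(exc: OSError) -> bool:
    """Log at ERROR level when an OS error means the fd limit was hit."""
    if exc.errno == errno.EMFILE:
        logging.error("File descriptor limit reached (EMFILE): %s", exc)
        return True
    return False
```

It would be called from the `except OSError` branch around socket creation, leaving all other errors on the existing handling path.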
Related: b/265199104
This test has been occasionally failing on CI with "Bus Error" crashes
while requiring the grpc shared library.
These crashes have been unreproducible locally. Let's continue debugging
(b/266212253) but skip this on CI.
Upgrade boringssl to the latest "master-with-bazel"
- use the `'USE_HEADERMAP' => 'NO'` fix for ObjC
- update the key for asm optimizations on mac/apple in python's setup.py
This PR depends on monterey fixes here:
https://github.com/grpc/grpc/pull/32493 and the boringssl's build
simplification
https://boringssl-review.googlesource.com/c/boringssl/+/56465.
---------
Co-authored-by: Hannah Shi <hannahshisfb@gmail.com>
Make remaining objC jobs compatible with kokoro monterey workers and
prepare for boringssl upgrade.
The changes here are taken from https://github.com/grpc/grpc/pull/32357,
but they should be merged in a separate PR
(we need the changes to be able to upgrade to monterey anyway and
there's no reason to make the boringssl upgrade PR more complicated by
bundling more fixes into it).
I've checked that the grpc_basictests_objc_examples and
grpc_ios_binary_size are green if switched to monterey.
Unfortunately it's hard to make grpc_basictests_objc_examples pass on
both monterey and mojave, so I suggest merging this PR at the same time
as the CL to upgrade the kokoro jobs to monterey.
- that way both PR and continuous runs will remain green
- older branches would need a backport anyway
---------
Co-authored-by: Hannah Shi <hannahshisfb@gmail.com>