This change updates the TD bootstrap generator for prod tests. This is
part of the TD release process. The new image has already been merged
to staging and tested locally in google3.
cc: @sergiitk PTAL.
Previously, `black` wouldn't install because it required a newer
`packaging` package.
This fixes `pip install -r requirements-dev.txt`. In addition, `black`
in the dev dependencies file is changed to `black[d]`, which bundles
the `blackd` binary (["black as a
server"](https://black.readthedocs.io/en/stable/usage_and_configuration/black_as_a_server.html)).
Fixes an issue where the automatically selected active context was
picked up as the context for `secondary_k8s_api_manager`.
This was introducing an error in the GAMMA Baseline PoC:
```
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('100.71.2.143', 56723), raddr=('35.199.174.232', 443)>
```
Here's how the secondary context incorrectly falls back to the
default context when `--secondary_kube_context` is not set:
```
k8s.py:142] Using kubernetes context "gke_grpc-testing_us-central1-a_psm-interop-security", active host: https://35.202.85.90
k8s.py:142] Using kubernetes context "None", active host: https://35.202.85.90
```
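For illustration, with the standard kubernetes Python client a `None` context silently resolves to the active kubeconfig context, which is how the unset flag ended up pointing `secondary_k8s_api_manager` at the primary cluster (a minimal sketch, not the driver's actual code):
```py
# Illustration only, assuming the standard kubernetes Python client:
# a None context resolves to the currently active kubeconfig context.
from kubernetes import config

# Equivalent to "use whatever context is currently active in ~/.kube/config".
api_client = config.new_client_from_config(context=None)
```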
- Add GitHub Action to conditionally run PSM Interop unit tests:
- Only run when changes are detected in
`tools/run_tests/xds_k8s_test_driver` or any of the proto files used by
the driver
- Only run against PRs and pushes to `master`, `v1.*.*` branches
- Runs using `python3.9` and `python3.10`
- Ready to be added to the list of required GitHub checks
- Add `tools/run_tests/xds_k8s_test_driver/tests/unit/__main__.py` test
loader that recursively discovers all unit tests in
`tools/run_tests/xds_k8s_test_driver/tests/unit` (a sketch of such a
loader follows the list)
- Add basic coverage for `XdsTestClient` and `XdsTestServer` to verify
the test loader picks up all folders
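A minimal sketch of what such a recursive loader can look like (the `*_test.py` discovery pattern is an assumption; the actual `__main__.py` may differ):
```py
# Hypothetical sketch of tests/unit/__main__.py.
import os
import sys
import unittest

if __name__ == "__main__":
    start_dir = os.path.dirname(os.path.abspath(__file__))
    # Recursively discover all unit tests under tests/unit.
    suite = unittest.TestLoader().discover(start_dir=start_dir, pattern="*_test.py")
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    sys.exit(0 if result.wasSuccessful() else 1)
```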
Related:
- First unit tests without automated CI added in #34097
The tests are incorrectly skipped because `config.server_lang` is
compared with the string value "java" instead of `skips.Lang.JAVA`.
This has been broken since #26998.
```
xds_url_map_testcase.py:372] ----- Testing TestTimeoutInRouteRule -----
xds_url_map_testcase.py:373] Logs timezone: UTC
skips.py:121] Skipping TestConfig(client_lang='java', server_lang='java', version='v1.57.x')
[ SKIPPED ] setUpClass (timeout_test.TestTimeoutInRouteRule)
xds_url_map_testcase.py:372] ----- Testing TestTimeoutInApplication -----
xds_url_map_testcase.py:373] Logs timezone: UTC
skips.py:121] Skipping TestConfig(client_lang='java', server_lang='java', version='v1.57.x')
[ SKIPPED ] setUpClass (timeout_test.TestTimeoutInApplication)
```
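A minimal illustration of the bug (not the exact driver code): `server_lang` is a `skips.Lang` enum member, so comparing it to a plain string never matches and the skip logic takes the wrong branch:
```py
# Illustration only: config.server_lang is a skips.Lang enum member.
if config.server_lang == "java":          # always False: enum member != str
    ...

# Fixed comparison against the enum member:
if config.server_lang == skips.Lang.JAVA:
    ...
```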
This is to make sure that upgrading the packaging module won't break
our version-based test skipping logic.
This also fixes a small issue with the `dev-` prefix: it should only be
allowed on the left side of the comparison.
Context: the packaging module needs to be upgraded to be compatible
with `blackd`.
Disables the warning produced by kubernetes/client/rest.py calling the
deprecated `urllib3.response.HTTPResponse.getheaders` and
`urllib3.response.HTTPResponse.getheader` methods:
```
venv-test/lib/python3.9/site-packages/kubernetes/client/rest.py:44: DeprecationWarning: HTTPResponse.getheaders() is deprecated and will be removed in urllib3 v2.1.0. Instead access HTTPResponse.headers directly.
return self.urllib3_response.getheaders()
```
This issue was introduced by openapi-generator, and is solved in its
`v6.4.0`. To fix the issue properly, the kubernetes/python folks need
to regenerate the library using a newer openapi-generator. The most
recent release `v27.2.0` still uses openapi-generator
[`v4.3.0`](https://github.com/kubernetes-client/python/blob/v27.2.0/kubernetes/.openapi-generator/VERSION).
Since they release twice a year, and given the two-major-version gap in
openapi-generator, the fix may take a while.
Created an issue in their repo:
https://github.com/kubernetes-client/python/issues/2101.
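For reference, one way to silence only this warning (a sketch, not necessarily how this change implements it) is a targeted `warnings` filter:
```py
import warnings

# Hypothetical sketch: ignore the urllib3 deprecation warning attributed to
# kubernetes/client/rest.py, keeping other DeprecationWarnings visible.
warnings.filterwarnings(
    "ignore",
    category=DeprecationWarning,
    module="kubernetes.client.rest",
    message=r"HTTPResponse\.getheaders\(\) is deprecated",
)
```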
Addresses some issues of the initial triage hint PR:
https://github.com/grpc/grpc/pull/33898.
1. Print the unhealthy backend's name before its health info - previously
it was unclear which backend's health status was being dumped
2. Add missing `retry_err.add_note(note)` calls
3. Turn off the highlighter in triager hints, which isn't rendered
properly in the stack trace saved to junit.xml
I was hoping this would solve the issue with `DeprecationWarning:
HTTPResponse.getheaders() is deprecated`, but it didn't.
Anyway, we should be updating this from time to time.
Changelog:
https://github.com/kubernetes-client/python/blob/release-27.0/CHANGELOG.md
The client library changes from `25.3.0` to `27.2.0` are minimal.
The majority of the changelog is API updates pulled from k8s upstream.
This clearly indicates which errors are "blanket" errors and are not a
root cause on their own.
This also moves the debug info containing the last known status of an
object the framework was waiting for before bailing out due to a timeout.
Previously it was printed as the last error message in the test; now it
is printed after the stack trace that caused the test failure.
In addition, I added similar debug information to the "wait for NEGs
to become healthy" step. Now it prints the statuses of unhealthy backends.
To achieve that, I mimicked the upcoming [PEP
678](https://peps.python.org/pep-0678/) Exception notes feature. When
we upgrade to Python 3.11, we'll be able to remove the `add_note()`
methods and get the same functionality for free.
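A rough sketch of what mimicking exception notes can look like before Python 3.11 (the class name is illustrative, not the framework's actual exception type):
```py
# Hypothetical sketch: pre-3.11 stand-in for PEP 678 BaseException.add_note().
# The framework itself is responsible for printing __notes__ with the traceback.
class FrameworkError(Exception):
    def add_note(self, note: str) -> None:
        notes = getattr(self, "__notes__", None)
        if notes is None:
            notes = []
            self.__notes__ = notes
        notes.append(note)
```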
This PR fixes the bootstrap generator interop test by making the node
metadata flag dependent on the version. Previously this was causing a
breakage, because not all bootstrap generator versions support the
de-experimentalized flag.
The most important change here is setting the `resource_suffix` and
`server_xds_port` flags to "generate randomly" by default. Previously we
were suggesting static values, and devs ended up with resource conflict
errors.
PanCakes to the rescue!
We noticed that our 'sanity' test was going to fail, but we think we can
fix that automatically, so we put together this PR to do just that!
If you'd like to opt out of these PRs, add yourself to NO_AUTOFIX_USERS
in .github/workflows/pr-auto-fix.yaml
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
As part of the dualstack backend designs, subchannels will be created
lazily. Therefore, instead of asserting that there is 1 READY subchannel
and `n - 1` IDLE subchannels, we just assert that there is 1 READY
subchannel.
Similar to https://github.com/grpc/grpc/pull/33542.
Note that there's a ticket to automatically use the one specified in the
`--server_image_canonical` flag, but for now we just hardcode.
ref b/261911148, b/282106799.
This is a no-op, just reordering `requirements.lock`.
By providing `-r requirements.txt` to `pip freeze`, it's able to separate
the dependencies required via `requirements.txt` from the
sub-dependencies installed to satisfy them.
- Switched from yapf to black
- Reconfigure isort for black
- Resolve black/pylint idiosyncrasies
Note: I used `--experimental-string-processing` because black was
producing "implicit string concatenation", similar to what's described
here: https://github.com/psf/black/issues/1837. While this feature is
currently experimental, it will be enabled by default:
https://github.com/psf/black/issues/2188. I first ran black with the
new string processing so that the generated code merges these `"hello" "
world"` string concatenations, then removed
`--experimental-string-processing` for stability, and regenerated the
code again.
To the reviewer: don't even try to open "Files Changed" tab 😄 It's
better to review commit-by-commit, and ignore `run black and isort`.
Do not clutter the final error we see at the end with the before/after
stats.
#### Examples
###### Expected only status A, but found status B for method M:
```
[ FAILED ] CustomLbTest.test_custom_lb_config
======================================================================
FAIL: test_custom_lb_config (__main__.CustomLbTest)
CustomLbTest.test_custom_lb_config
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/sergiitk/Development/grpc/tools/run_tests/xds_k8s_test_driver/tests/custom_lb_test.py", line 113, in test_custom_lb_config
self.assertRpcStatusCodes(test_client,
File "/Users/sergiitk/Development/grpc/tools/run_tests/xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 345, in assertRpcStatusCodes
found_status = helpers_grpc.status_from_int(found_status_int)
AssertionError: Expected only status (15, DATA_LOSS), but found status (0, OK) for method UNARY_CALL.
Diff stats:
- method: UNARY_CALL
rpcs_started: 251
result:
(0, OK): 251
```
###### Expected non-zero RPCs with status A for method M.
```
[ FAILED ] AuthzTest.test_plaintext_allow
======================================================================
FAIL: test_plaintext_allow (__main__.AuthzTest)
AuthzTest.test_plaintext_allow
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/sergiitk/Development/grpc/tools/run_tests/xds_k8s_test_driver/tests/authz_test.py", line 224, in test_plaintext_allow
self.configure_and_assert(test_client, 'host-wildcard',
File "/Users/sergiitk/Development/grpc/tools/run_tests/xds_k8s_test_driver/tests/authz_test.py", line 204, in configure_and_assert
self.assertRpcStatusCodes(test_client,
File "/Users/sergiitk/Development/grpc/tools/run_tests/xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 355, in assertRpcStatusCodes
self.assertGreater(stats.result[expected_status_int],
AssertionError: 0 not greater than 0 : Expected non-zero completed RPCs with status (0, OK) for method EMPTY_CALL.
Diff stats:
- method: EMPTY_CALL
rpcs_started: 13
result: {}
```
Improvements to the `LoadBalancerAccumulatedStatsRequest` output. Makes
it readable.
This greatly affects `assertRpcStatusCodes()` output, used in authz and
custom_lb.
No more before-and-after stats, just the useful diff stats. Minimal and
readable.
The diff stats now also include `rpcs_started`.
![image](https://github.com/grpc/grpc/assets/672669/a4e38d82-be5a-4f31-9d88-da2bf9712d9b)
Output example:
```
--- Starting subTest __main__.AuthzTest.test_plaintext_allow.01_host_wildcard ---
[psm-grpc-client-765bfbf868-jqjm7:51561] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), wait_for_ready=True, timeout=600)
[psm-grpc-client-765bfbf868-jqjm7:51561] >> RPC XdsUpdateClientConfigureService.Configure(request=ClientConfigureRequest({'types': ['EMPTY_CALL'], 'metadata': [{'key': 'test', 'value': 'host-wildcard'}]}), timeout=5, wait_for_ready=True)
[psm-grpc-client-765bfbf868-jqjm7:51561] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), wait_for_ready=True, timeout=600)
[psm-grpc-client-765bfbf868-jqjm7:51561] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), wait_for_ready=True, timeout=600)
[psm-grpc-client-765bfbf868-jqjm7] << Received accumulated stats difference. Expecting RPCs with status (0, OK) for method EMPTY_CALL.
- method: EMPTY_CALL
rpcs_started: 13
result:
(0, OK): 14
--- Finished subTest __main__.AuthzTest.test_plaintext_allow.01_host_wildcard ---
```
In case of test failure, it'll still print all stats at the end,
including before and after:
```
AssertionError: Expected only status (15, DATA_LOSS), but found status (0, OK) for method UNARY_CALL.
Stats before:
- method: UNARY_CALL
rpcs_started: 2153
result:
(14, UNAVAILABLE): 1674
(0, OK): 479
Stats after:
- method: UNARY_CALL
rpcs_started: 2404
result:
(0, OK): 730
(14, UNAVAILABLE): 1674
Diff stats:
- method: UNARY_CALL
rpcs_started: 251
result:
(0, OK): 251
```
And while I was at it, I also made `LoadBalancerStatsResponse` nice:
![image](https://github.com/grpc/grpc/assets/672669/b15908a7-bae4-41a0-a2f7-c903e398432a)
Fixes the issue introduced in https://github.com/grpc/grpc/pull/33104,
where stopping the current run didn't reset `self.time_start_requested`,
`self.time_start_completed`, `self.time_start_stopped`. Because of this,
the subsetting test (the only one [redeploying the client
app](10001d16a9/tools/run_tests/xds_k8s_test_driver/tests/subsetting_test.py (L73C1-L74)))
started failing with:
```py
Traceback (most recent call last):
File "xds_k8s_test_driver/tests/subsetting_test.py", line 76, in test_subsetting_basic
test_client: _XdsTestClient = self.startTestClient(
File "xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 615, in startTestClient
test_client = self.client_runner.run(server_target=test_server.xds_uri,
File "xds_k8s_test_driver/framework/test_app/runners/k8s/k8s_xds_client_runner.py", line 110, in run
super().run()
File "xds_k8s_test_driver/framework/test_app/runners/k8s/k8s_base_runner.py", line 112, in run
raise RuntimeError(
RuntimeError: Deployment psm-grpc-client: has already been started at 2023-05-27T13:47:15.262461
```
This PR:
1. Instead of relying on `time_start_requested` and `time_start_stopped`
to produce GCP links, tracks the run history of each deployment (see the
sketch after this list). This fixes the issue described above, and adds
support for listing all past runs executed by a k8s runner.
2. Minor: removes the unnecessary call to `test_client.cleanup()` when
there are no past deployment runs (e.g. at the first iteration of `for i
in range(_NUM_CLIENTS):`)
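A hypothetical sketch of tracking run history on the runner instead of a single set of timestamps (class and field names are illustrative):
```py
import dataclasses
import datetime
from typing import List, Optional

# Hypothetical sketch: one record per deployment run, appended by the k8s runner.
@dataclasses.dataclass
class RunRecord:
    time_start_requested: datetime.datetime
    time_start_completed: Optional[datetime.datetime] = None
    time_stopped: Optional[datetime.datetime] = None

class RunHistory:
    def __init__(self) -> None:
        self._runs: List[RunRecord] = []

    def start_new_run(self) -> RunRecord:
        record = RunRecord(time_start_requested=datetime.datetime.now())
        self._runs.append(record)
        return record

    def all_runs(self) -> List[RunRecord]:
        # Lists every past run executed by the runner, e.g. for GCP links.
        return list(self._runs)
```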
I've noticed we add the cleanup hook after setting up the
infrastructure. Thus, if infra setup fails, the cleanup won't run.
This fixes it, and adds extra checks to not call
`cls.test_client_runner` if it's not set.
Fail test if client or server pods restarted during test.
#### Testing
Tested locally; the test will fail with a message similar to:
```
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/google/home/xuanwn/workspace/xds/grpc/tools/run_tests/xds_k8s_test_driver/framework/xds_k8s_testcase.py", line 501, in tearDown
))
AssertionError: 5 != 0 : Server pods unexpectedly restarted {sever_restarts} times during test.
----------------------------------------------------------------------
Ran 1 test in 886.867s
```
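A hypothetical sketch of such a check (the helper name is illustrative; in k8s the count comes from `containerStatuses[].restartCount` in the pod status):
```py
import unittest

class PodRestartCheckExample(unittest.TestCase):
    # Hypothetical sketch of the assertion used in tearDown.
    def assert_no_pod_restarts(self, restarts: int, what: str) -> None:
        self.assertEqual(
            0, restarts,
            msg=f"{what} pods unexpectedly restarted {restarts} times during test.")
```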
Better logging for `assertRpcStatusCodes`.
(got tired of looking up the status names)
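For reference, a small sketch of resolving `(code, NAME)` pairs from integer status codes (the driver has a similar helper, `helpers_grpc.status_from_int`; this sketch is only illustrative):
```py
import grpc

# Hypothetical sketch: map an integer status code to "(code, NAME)".
def status_pretty(code_int: int) -> str:
    status = next((s for s in grpc.StatusCode if s.value[0] == code_int), None)
    name = status.name if status else "UNKNOWN"
    return f"({code_int}, {name})"

# status_pretty(15) -> "(15, DATA_LOSS)"; status_pretty(0) -> "(0, OK)"
```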
#### Unexpected status found
Before:
```
AssertionError: AssertionError: Expected only status 15 but found status 0 for method UNARY_CALL:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
}
}
```
After:
```
AssertionError: Expected only status (15, DATA_LOSS), but found status (0, OK) for method UNARY_CALL:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
}
}
```
#### No traffic with expected status
Before:
```
AssertionError: 0 not greater than 0
```
After:
```
AssertionError: 0 not greater than 0 : Expected non-zero RPCs with status (15, DATA_LOSS) for method UNARY_CALL, got:
stats_per_method {
key: "UNARY_CALL"
value {
result {
key: 0
value: 251
}
result {
key: 15
value: 0
}
}
}
```
Before this change, `Found subchannel in state READY` and `Channel to
xds:///psm-grpc-server:61404 transitioned to state ` would dump the full
channel/subchannel; in some implementations that expose
ChannelData.trace (e.g. Go) this would add 300 extra lines of log.
Now we print a brief repr-like channel/subchannel info:
```
Found subchannel in state READY: <Subchannel subchannel_id=9 target=10.110.1.44:8080 state=READY>
Channel to xds:///psm-grpc-server:61404 transitioned to state READY: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=READY>
```
Also while waiting for the channel, we log channel_id now too:
```
Waiting to report a READY channel to xds:///psm-grpc-server:61404
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=TRANSIENT_FAILURE>
Server channel: <Channel channel_id=2 target=xds:///psm-grpc-server:61404 state=READY>
```
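A rough sketch of the brief formatting shown above (in the real code the values come from the channelz `Channel`/`Subchannel` protos):
```py
# Hypothetical sketch of the repr-like one-liners.
def channel_brief(channel_id: int, target: str, state: str) -> str:
    return f"<Channel channel_id={channel_id} target={target} state={state}>"

def subchannel_brief(subchannel_id: int, target: str, state: str) -> str:
    return f"<Subchannel subchannel_id={subchannel_id} target={target} state={state}>"
```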
Similar to what we already do in other test suites:
- Try cleaning up resources three times.
- If unsuccessful, don't fail the test and just log the error. The
cleanup script should be the one to deal with this.
ref b/282081851
Resolve `TESTING_VERSION` to `dev-VERSION` when the job is initiated by
a user, and not by the CI. Override this behavior by setting
`FORCE_TESTING_VERSION`.
This solves the problem of manual job runs executed against a WIP
branch (e.g. a PR) overriding the tag of the CI-built image we use for
daily testing.
The `dev` and `dev-VERSION` "magic" values supported by the
`--testing_version` flag (a sketch of the resolution follows the list):
- `dev` and `dev-master` are treated as `master`: all
`config.version_gte` checks resolve to `True`.
- `dev-VERSION` is treated as `VERSION`: `dev-v1.55.x` is treated as
simply `v1.55.x`. We do this so that when manually running jobs for old
branches, the feature skip check still works and unsupported tests are
skipped.
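A minimal sketch of this resolution (the function name is illustrative, not the driver's actual API):
```py
# Hypothetical sketch of resolving the --testing_version "magic" values.
def resolve_testing_version(testing_version: str) -> str:
    if testing_version in ("dev", "dev-master"):
        return "master"  # all config.version_gte checks resolve to True
    if testing_version.startswith("dev-"):
        return testing_version[len("dev-"):]  # dev-v1.55.x -> v1.55.x
    return testing_version
```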
This change will take care of all langs/branches; no backports needed.
ref b/256845629
Previously, the error message didn't provide much context. Example:
```py
Traceback (most recent call last):
File "/tmpfs/tmp/tmp.BqlenMyXyk/grpc/tools/run_tests/xds_k8s_test_driver/tests/affinity_test.py", line 127, in test_affinity
self.assertLen(
AssertionError: [] has length of 0, expected 1.
```
ref b/279990584.
---------
Co-authored-by: Sergii Tkachenko <hi@sergii.org>
- Fix broken `bin/run_channelz.py` helper
- Create `bin/run_ping_pong.py` helper that runs the baseline (aka
"ping_pong") test against preconfigured infra
- Setup automatic port forwarding when running `bin/run_channelz.py` and
`bin/run_ping_pong.py`
- Create `bin/cleanup_cluster.sh` helper to wipe out xds resources based
on the namespaces present on the cluster
Note: this involves a small change to the non-helper code, but it's just
moving the part that makes the XdsTestServer/XdsTestClient instances for
a given pod.
While a proper fix is on the way, this reduces the number of duplicated
container log entries in the xds test server/client pod logs.
The issue is that we only wait between stream restarts when an exception
is caught, which isn't always the reason the stream gets broken. Another
reason is the main container being shut down by k8s. In this situation,
we essentially do
```py
while True:
try:
restart_stream()
read_all_logs_from_pod_start()
except Exception:
logger.warning('error')
wait_seconds(1)
```
This PR makes it
```py
while True:
try:
restart_stream()
read_all_logs_from_pod_start()
except Exception:
logger.warning('error')
finally:
wait_seconds(5)
```
`tearDownClass` is not executed when `setUpClass` fails. In the URL Map
test suite, this leads to a test client that failed to start not being
cleaned up.
This PR changes the URL Map test suite to register a custom
`addClassCleanup` callback instead of relying on `tearDownClass`.
Unlike `tearDownClass`, cleanup callbacks are executed even when
`setUpClass` fails.
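A minimal sketch of the approach, assuming a standard `unittest.TestCase` (class and method names are illustrative):
```py
import unittest

class UrlMapTestCaseExample(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Class cleanups run even if setUpClass raises later in this method,
        # unlike tearDownClass, which is skipped when setUpClass fails.
        cls.addClassCleanup(cls.cleanup_test_client)
        cls.test_client = cls.start_test_client()  # may raise

    @classmethod
    def cleanup_test_client(cls):
        client = getattr(cls, "test_client", None)
        if client is not None:
            client.cleanup()

    @classmethod
    def start_test_client(cls):
        ...  # illustrative placeholder
```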
ref b/276761453
Previously, we didn't configure the `failureThreshold`, so it used its
default value. The final `startupProbe` looked like this:
```json
{
"startupProbe": {
"failureThreshold": 3,
"periodSeconds": 3,
"successThreshold": 1,
"tcpSocket": {
"port": 8081
},
"timeoutSeconds": 1
}
```
Because of this, the total time before k8s killed the container was 3
probes (`failureThreshold`) × a 3-second wait between probes
(`periodSeconds`) = 9 seconds total (±3 seconds waiting for the probe
response).
This greatly affected the PSM Security test server, some implementations
of which waited for the ADS stream to be configured before starting to
listen on the maintenance port. This led to the server container being
killed ~7 times before a successful startup:
```
15:55:08.875586 "Killing container with a grace period"
15:53:38.875812 "Killing container with a grace period"
15:52:47.875752 "Killing container with a grace period"
15:52:38.874696 "Killing container with a grace period"
15:52:14.874491 "Killing container with a grace period"
15:52:05.875400 "Killing container with a grace period"
15:51:56.876138 "Killing container with a grace period"
```
These extra delays led to PSM security tests timing out.
ref b/277336725