* WorkQueue
* weaken the large obj stress test for Windows; documentation
* update comment
* Add WorkQueue microbenchmark. Results below ...
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
BM_WorkQueueIntptrPopFront/1 297 ns 297 ns 2343500 items_per_second=3.3679M/s
BM_WorkQueueIntptrPopFront/8 7022 ns 7020 ns 99356 items_per_second=1.13956M/s
BM_WorkQueueIntptrPopFront/64 59606 ns 59590 ns 11770 items_per_second=1074k/s
BM_WorkQueueIntptrPopFront/512 477867 ns 477748 ns 1469 items_per_second=1071.7k/s
BM_WorkQueueIntptrPopFront/4096 3815786 ns 3814925 ns 184 items_per_second=1073.68k/s
I0902 19:05:22.138022069 12 test_config.cc:194] TestEnvironment ends
================================================================================
* use int64_t for times. 0 performance change
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
BM_WorkQueueIntptrPopFront/1 277 ns 277 ns 2450292 items_per_second=3.60967M/s
BM_WorkQueueIntptrPopFront/8 6718 ns 6716 ns 105497 items_per_second=1.19126M/s
BM_WorkQueueIntptrPopFront/64 56428 ns 56401 ns 12268 items_per_second=1.13474M/s
BM_WorkQueueIntptrPopFront/512 458953 ns 458817 ns 1550 items_per_second=1.11591M/s
BM_WorkQueueIntptrPopFront/4096 3686357 ns 3685120 ns 191 items_per_second=1.1115M/s
I0902 19:25:31.549382949 12 test_config.cc:194] TestEnvironment ends
================================================================================
* add PopBack tests: exactly the same performance profile
* use Mutex instead of Spinlock
It's safer, and so far equally performant in benchmarks of opt builds
* add deque test for comparison. It is faster on all tests.
* Add sparsely-populated multi-threaded benchmarks.
* fix
* fix
* refactor to help thread safety analysis
* Specialize WorkQueue for Closure*s and AnyInvocables
* remove unused callback storage
* add single-threaded benchmark for closure vs invocable
* sanitize
* missing include
* move bm_work_queue to microbenchmarks so it isn't exported
* s/workqueue/work_queue/g
* use nullptr instead of optionals for popped closures
* reviewer test suggestion
* private things are private
* add a work_queue fuzzer
Ran for 10 minutes @ 42 jobs @ 42 workers. Zero failures.
Checked in a selection of 100 good seeds after merging the thousands of
results.
* fix
* fix header guards
* nuke the corpora
* feedback
* sanitize
* Timestamp::Now
* fix
* fuzzers do not work on windows
* windows does not like multithreaded benchmark tests
* Revert "Revert "[event_engine] Thread pool that can handle deletion in a callback (#30763)" (#30972)"
This reverts commit ccc787a020.
* Update thread_pool.cc
* Revert "Revert "XdsClient: add unit test and fix watcher notification bugs (#30823)" (#30942)"
This reverts commit 6d2c4a8314.
* use GRPC_CUSTOM_JSONUTIL macro for JsonPrintOptions
* Support Python 3.11
* Update build images for 3.11
* Whoopsie
* The architecture of this thing is garbage
* Silence ownership warning
* Account for change in git behavior
* Fix directory
* I am in great pain
* Update Windows and arm linux
* Agh
* Clean up
This adds a unit test for XdsClient and fixes several watcher-notification bugs found in the process. Specifically:
- When an ADS stream fails or an xDS channel reports a connectivity failure, report an error only to the watchers for resources being subscribed to on that particular channel, not to watchers on other channels.
- Cache the error status for the channel, so that if a new watcher is started after the channel reports the error, we can immediately report that error to the new watcher.
- If a resource is NACKed and has not been previously cached, or does not exist, report that fact to any new watcher that may be started later.
- If a resource in an ADS response is unparseable but is wrapped in a `Resource` wrapper, we do know its name, so record the validation failure in the cache and report it to the watchers.
Co-authored-by: markdroth <markdroth@users.noreply.github.com>
* Upgrade google/benchmark to v1.7.0
Reason: Setup and Teardown methods are nice to have for multithreaded
tests.
* use version tag
* Revert "use version tag"
This reverts commit 6c7f5e0d35.
Fun edge case: when `rand_string()` happens to generate only digits,
YAML interprets the `deployment_id` label value as an integer,
but k8s expects label values to be strings.
K8s responds with a barely readable 400 Bad Request error:
`ReadString: expects \" or n, but found 9, error found in #10 byte of ...|ent_id`.
Prepending the deployment name forces `deployment_id` to be a string,
and it also makes for a better description.
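A minimal sketch of the YAML behavior described above; the label values are made up for illustration:
```
import yaml  # PyYAML

# A bare all-digit value is parsed as an integer, which k8s rejects
# for label values.
digits_only = yaml.safe_load("deployment_id: 20220820")
print(type(digits_only["deployment_id"]))  # <class 'int'>

# Prefixing with the deployment name keeps the value a string
# (and is a more useful description anyway).
prefixed = yaml.safe_load("deployment_id: psm-grpc-server-20220820")
print(type(prefixed["deployment_id"]))  # <class 'str'>
```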
* client_channel: rewrite illegal status codes from control plane
* rewrite illegal status codes for call creds
* move fail_lb policy out of retry_lb_fail test so it can be reused
* test resolver and LB policy status rewrites
* add test for ConfigSelector status rewriting
* attempt to add client_auth filter unit test
* fix client_auth_filter test
* cleanup test
* fix build
* fix some memory leaks
* Automated change: Fix sanity tests
* Update client_auth_filter_test.cc
* fix build
* code review comments
* clang-tidy
Co-authored-by: markdroth <markdroth@users.noreply.github.com>
Co-authored-by: Craig Tiller <ctiller@google.com>
* [cleanup] Remove profiling timers
- nobody has used this system in years
- if we needed it, we'd probably rewrite it at this point to be something more modern
- let's remove it until that need arises
* fix
* fixes
When we use retryers with `log_level=logging.INFO`, tenacity logs the result value (or an exception) after each unsuccessful retry attempt.
We often retry methods that return objects, resulting in unreadable log messages:
```
I0820 03:16:29.027635 140613877811008 before_sleep.py:45] Retrying framework.xds_k8s_testcase.IsolatedXdsKubernetesTestCase.cleanup in 10.0 seconds as it raised RetryError: RetryError[Attempts: 21, Value: {'api_version': 'v1',
'kind': 'Namespace',
'metadata': {'annotations': None,
'cluster_name': None,
'creation_timestamp': datetime.datetime(2022, 8, 20, 2, 55, 32, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': datetime.datetime(2022, 8, 20, 3, 6, 27, tzinfo=tzlocal()),
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'kubernetes.io/metadata.name': 'psm-interop-server-20220820-0253-yrmam',
'name': 'psm-interop-server-20220820-0253-yrmam',
'owner': 'xds-k8s-interop-test'},
'managed_fields': [{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:labels': {'.': {},
'f:kubernetes.io/metadata.name': {},
... (82 more lines)
```
This PR introduces a custom `before_sleep` logger that only logs the value if it's a primitive: `int`, `str`, `bool`.
Otherwise, it logs only the type. Example:
```
k8s_base_runner.py:311] Waiting for pod psm-grpc-client-5d5648478f-7vsf7 to start
retryers.py:192] Retrying framework.infrastructure.k8s.KubernetesNamespace.get_pod in 1.0 seconds as it returned type <class 'kubernetes.client.models.v1_pod.V1Pod'>.
retryers.py:192] Retrying framework.infrastructure.k8s.KubernetesNamespace.get_pod in 1.0 seconds as it returned type <class 'kubernetes.client.models.v1_pod.V1Pod'>.
```
Note that this only changes the behavior of unsuccessful retries, and doesn't affect the new feature that prints the formatted k8s status field when the *final* retry attempt fails.
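A minimal sketch of such a `before_sleep` callback built on `tenacity`; the function name and exact log format are illustrative, not the actual `retryers.py` implementation:
```
import logging

import tenacity

logger = logging.getLogger(__name__)

# Values of these types are short enough to log verbatim.
_PRIMITIVES = (bool, int, str)


def _before_sleep_brief(retry_state: tenacity.RetryCallState) -> None:
    """Logs a compact message before tenacity sleeps between attempts."""
    fn_name = getattr(retry_state.fn, "__qualname__", repr(retry_state.fn))
    outcome = retry_state.outcome
    if outcome.failed:
        # The attempt raised an exception: log it as usual.
        detail = f"raised {outcome.exception()!r}"
    else:
        result = outcome.result()
        if isinstance(result, _PRIMITIVES):
            detail = f"returned {result!r}"
        else:
            # Large objects (e.g. k8s models) would flood the log,
            # so log only their type.
            detail = f"returned type {type(result)}"
    logger.info("Retrying %s in %s seconds as it %s.",
                fn_name, retry_state.next_action.sleep, detail)


# Plugging the callback into a retryer:
retryer = tenacity.Retrying(
    retry=tenacity.retry_if_result(lambda pod: pod is None),
    wait=tenacity.wait_fixed(1),
    stop=tenacity.stop_after_attempt(5),
    before_sleep=_before_sleep_brief,
)
```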
- Added support for pod log collection. To enable, set the `--collect_app_logs` flag and specify `--log_dir` (see the sketch after this list).
- Added support and helpers for operating on the `--log_dir` (natively provided by absl)
- Added support for `--follow` to `bin/run_test_server.py` and `bin/run_test_client.py` to follow pod logs printed to stdout
- Moved `PortForwarder` from k8s.py to its own file
The collection itself will be enabled per-suite in https://github.com/grpc/grpc/pull/30735.
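A rough sketch of streaming a pod's log to a file under `--log_dir` with the official `kubernetes` client; the helper name and details are assumptions, not the framework's actual implementation:
```
from kubernetes import client, config


def collect_pod_log(namespace, pod_name, log_path, follow=False):
    """Streams one pod's log to a local file (illustrative helper)."""
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    # _preload_content=False returns the raw urllib3 response so the
    # log can be streamed instead of buffered in memory.
    resp = core_v1.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        follow=follow,
        _preload_content=False,
    )
    with open(log_path, "wb") as log_file:
        for chunk in resp.stream():
            log_file.write(chunk)
```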
* xDS interop: Fix default resource prefix
No longer just security tests.
This is done to avoid confusion when debugging resources managed
by the LB tests.
* s/xds/psm
All alternative server runners except the failover test reuse the primary server runners' namespace. The failover test uses the secondary cluster and manages its own namespace there. `reuse_namespace` disables namespace cleanup, and in this case it was incorrectly set to `True`.
- Changes the order of waiting for pods to start: wait for the pods first, then for the deployment to transition to active. This should provide more useful information in the logs, showing exactly why a pod didn't start, instead of the generic "Replicas not available" (ref b/200293121). This is also needed for https://github.com/grpc/grpc/pull/30594
- Add support for a `check_result` callback in the retryer helpers (see the sketch after this list)
- Completely replaces `retrying` with `tenacity` (ref b/200293121); `retrying` is no longer maintained.
- Improves the readability of timeout errors: they now state the timeout (or attempt limit) that was exceeded and why the retry failed (exception or check function):
Before:
> `tenacity.RetryError: RetryError[<Future at 0x7f8ce156bc18 state=finished returned dict>]`
After:
> `framework.helpers.retryers.RetryError: Retry error calling framework.infrastructure.k8s.KubernetesNamespace.get_pod: timeout 0:01:00 exceeded. Check result callback returned False.`
- Improves the readability of k8s wait operation errors: the log now includes the colorized and formatted status of the k8s object being watched, instead of a dump of the full object. For example, this is how an error caused by using an incorrect TD bootstrap image is reported:
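A minimal sketch of the `check_result`-style retryer and error message described above, built on `tenacity`; the names `retry_until` and `RetryError` are illustrative, not the actual `framework.helpers.retryers` API:
```
import datetime

import tenacity


class RetryError(Exception):
    """Raised when the retryer gives up, naming the callable and the reason."""


def retry_until(check_result, *, timeout: datetime.timedelta, wait_sec=1.0):
    """Decorator: keep calling fn until check_result(fn(...)) is True."""

    def decorator(fn):
        retrying = tenacity.Retrying(
            # Retry while the check callback says the result isn't good yet.
            retry=tenacity.retry_if_result(lambda result: not check_result(result)),
            stop=tenacity.stop_after_delay(timeout.total_seconds()),
            wait=tenacity.wait_fixed(wait_sec),
        )

        def wrapped(*args, **kwargs):
            try:
                return retrying(fn, *args, **kwargs)
            except tenacity.RetryError as e:
                # Surface a readable error instead of tenacity's bare
                # "RetryError[<Future ...>]".
                raise RetryError(
                    f"Retry error calling {fn.__qualname__}: "
                    f"timeout {timeout} exceeded. "
                    "Check result callback returned False.") from e

        return wrapped

    return decorator


# Usage sketch: wait until a pod reaches the Running phase.
# @retry_until(lambda pod: pod and pod.status.phase == "Running",
#              timeout=datetime.timedelta(minutes=1))
# def get_pod(...): ...
```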
* Enable outlier detection k8s interop test for Java. (#30641)
* xDS interop: enable outlier detection Java tests in >= 1.49.x
Co-authored-by: Terry Wilson <terrymwilson@gmail.com>