The `XdsFaultInjectionMaxFault` test has seen a few flakes since #32326
was merged. I believe the flakiness is caused by the fact that when a
large number of RPCs are queued up before the resolver result comes in,
those RPCs are now re-processed in parallel instead of sequentially,
which can cause us to delay more RPCs than the `max_faults` setting
should allow. To fix this, we change the test to ensure that the
channel is connected (i.e., the resolver result has already been
returned) before we start sending a large number of concurrent RPCs.
Although this is the only test that I've seen flakes in, I've made this
same change consistently to all fault injection tests that are creating
a large number of concurrent RPCs, since the same flake could affect any
of them.
This code is not plumbed through yet, but it provides the core
infrastructure for detecting the GCP environment resources needed to
set up the labels/attributes/resources for stats, tracing and logging.
The details of how the various environment resources are set up were
derived from Java's Cloud Logging library and OpenTelemetry's future
plans. (This could be better explained in an offline review, since some
links are internal.)
Requesting @veblush for a full review and @markdroth for a structural
review.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Clean up and remove the iOS C++ Cronet test
To test manually:
./tools/bazel test //src/objective-c/tests:CppCronetTests
@sampajano
The pooled allocator currently has an ABA issue in the allocation path.
This change should fix that; the algorithm is described reasonably well
in the PR.
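For readers unfamiliar with the ABA problem, here is a minimal sketch of one common way to avoid it in a lock-free free list: pack a version tag next to the head index, so a compare-and-swap fails if the head was popped and pushed back in between. This is not the algorithm from this PR. `TaggedPool`, `kNil`, and the index-based layout are hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size pool of slot indices; illustration only, not the
// allocator changed in this PR.
class TaggedPool {
 public:
  static constexpr uint32_t kNil = 0xffffffffu;

  explicit TaggedPool(uint32_t n) : next_(n) {
    // Chain all slots into the free list: 0 -> 1 -> ... -> n-1 -> nil.
    for (uint32_t i = 0; i < n; ++i) {
      next_[i].store(i + 1 < n ? i + 1 : kNil, std::memory_order_relaxed);
    }
    head_.store(Pack(n > 0 ? 0 : kNil, 0), std::memory_order_release);
  }

  // Pops a slot index off the free list, or returns kNil if none is left.
  uint32_t Allocate() {
    uint64_t old_head = head_.load(std::memory_order_acquire);
    while (true) {
      const uint32_t index = Index(old_head);
      if (index == kNil) return kNil;
      // This read may be stale if another thread already took `index`, but
      // then the tag has changed and the CAS below fails and retries.
      const uint32_t next = next_[index].load(std::memory_order_relaxed);
      const uint64_t new_head = Pack(next, Tag(old_head) + 1);
      if (head_.compare_exchange_weak(old_head, new_head,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        return index;
      }
    }
  }

  // Pushes a previously allocated slot index back onto the free list.
  void Free(uint32_t index) {
    uint64_t old_head = head_.load(std::memory_order_acquire);
    while (true) {
      next_[index].store(Index(old_head), std::memory_order_relaxed);
      const uint64_t new_head = Pack(index, Tag(old_head) + 1);
      if (head_.compare_exchange_weak(old_head, new_head,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        return;
      }
    }
  }

 private:
  // Head layout: tag in the upper 32 bits, slot index in the lower 32 bits.
  static uint64_t Pack(uint32_t index, uint32_t tag) {
    return (static_cast<uint64_t>(tag) << 32) | index;
  }
  static uint32_t Index(uint64_t head) { return static_cast<uint32_t>(head); }
  static uint32_t Tag(uint64_t head) {
    return static_cast<uint32_t>(head >> 32);
  }

  std::vector<std::atomic<uint32_t>> next_;
  std::atomic<uint64_t> head_;
};
```

Without the tag, a thread that read head `A` and its successor `B` could be preempted while other threads pop `A`, pop `B`, and push `A` back; the stale CAS would then succeed and install `B`, which is no longer free. Bumping the tag on every push and pop makes that CAS fail instead.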
These tests are failing because they're running with too few threads;
however, if we give them enough threads to catch bugs, they're flaky.
Remove them and get the team some bandwidth back.
* Revert "Revert "Revert "Revert "server: introduce ServerMetricRecorder API and move per-call reporting from a C++ interceptor to a C-core filter (#32106)" (#32272)" (#32279)" (#32293)"
This reverts commit 1f960697c5.
* Do not create CallMetricRecorder if call is null.
* Revert "Revert "server: introduce ServerMetricRecorder API and move per-call reporting from a C++ interceptor to a C-core filter (#32106)" (#32272)"
This reverts commit deb1e25543.
* Fix by caching call metric recording stuff in async request
PR #32106 caused msan errors in some tests while de-referencing the
server object where async calls are active after the server is
destroyed. Instead cache the ServerMetricRecorder pointer.
* copyright headers fixed
* clang fixes.
There was a ~1% flake in grpclb end2end tests that was reproducible in opt builds, manifesting as a hang, usually in the SingleBalancerTest.Fallback test. Through experimentation, I found that by skipping the death test in the grpclb end2end test suite, the hang was no longer reproducible in 10,000 runs. Similarly, moving this test to the end of the suite, or making it run first (as is the case in this PR), resulted in 0 failures in 3000 runs.
It's not yet clear to me why the death test destabilizes things in this way. The logs make it clear that one test does affect the rest: grpc_init is done once for all tests, so all tests share the same EventEngine ... until the death test completes, at which point a new EventEngine is created for the next test.
I think this death test is sufficiently artificial that it's fine to change the test ordering itself, and ignore the wonky intermediate state that results from it.
Reproducing the flake:
```
tools/bazel --bazelrc=tools/remote_build/linux.bazelrc test \
-c opt \
--test_env=GRPC_TRACE=event_engine \
--runs_per_test=5000 \
--test_output=summary \
test/cpp/end2end/grpclb_end2end_test@poller=epoll1
```
* WRR: port StaticStrideScheduler to OSS
* WIP
* Automated change: Fix sanity tests
* fix build
* remove unused aliases
* fix another type mismatch
* remove unnecessary include
* move benchmarks to their own file, and don't run it on windows
* Automated change: Fix sanity tests
* add OOB reporting
* generate_projects
* clang-format
* add config parser test
* clang-tidy and minimize lock contention
* add config defaults
* add oob_reporting_period config field and add basic test
* Automated change: Fix sanity tests
* fix test
* change test to use basic RR
* WIP: started exposing peer address to LB policy API
* first WRR test passing!
* small cleanup
* port RR fix to WRR
* test helper refactoring
* more test helper refactoring
* WIP: trying to fix test to have the right weights
* more WIP -- need to make pickers DualRefCounted
* fix timer ref handling and get tests working
* clang-format
* iwyu and generate_projects
* fix build
* add test for OOB reporting
* keep only READY subchannels in the picker
* add file missed in a previous commit
* fix sanity
* iwyu
* add weight expiration period
* add tests for weight update period and OOB reporting period
* Automated change: Fix sanity tests
* lower bound for timer interval
* consistently apply grpc_test_slowdown_factor()
* cache time in test
* add blackout_period tests
* avoid some unnecessary copies
* clang-format
* add field to config test
* simplify orca watcher tracking
* attempt to fix build
* iwyu
* generate_projects
* update xds proto dependency
* add xDS LB policy entry to registry
* add "_experimental" suffix to policy name
* update LB policy name and remove debug log
* add env var protection
* generate_projects
* gen_upb_api
* WRR: update tests to cover qps plumbing
* WIP
* Automated change: Fix sanity tests
* more WIP
* basic WRR e2e test working
* add OOB test
* add xDS WRR e2e test
* clang-format
* fix sanity
* ignore duplicate addresses
* Automated change: Fix sanity tests
* add new tracer to doc/environment_variables.md
* retain scheduler state across pickers
* Automated change: Fix sanity tests
* use separate mutexes for scheduler and timer
* sort addresses to avoid index churn
* remove fetch_sub for wrap around in RR case
Co-authored-by: markdroth <markdroth@users.noreply.github.com>
* WRR: port StaticStrideScheduler to OSS
* WIP
* Automated change: Fix sanity tests
* fix build
* remove unused aliases
* fix another type mismatch
* remove unnecessary include
* move benchmarks to their own file, and don't run it on windows
* Automated change: Fix sanity tests
* add OOB reporting
* generate_projects
* clang-format
* add config parser test
* clang-tidy and minimize lock contention
* add config defaults
* add oob_reporting_period config field and add basic test
* Automated change: Fix sanity tests
* fix test
* change test to use basic RR
* WIP: started exposing peer address to LB policy API
* first WRR test passing!
* small cleanup
* port RR fix to WRR
* test helper refactoring
* more test helper refactoring
* WIP: trying to fix test to have the right weights
* more WIP -- need to make pickers DualRefCounted
* fix timer ref handling and get tests working
* clang-format
* iwyu and generate_projects
* fix build
* add test for OOB reporting
* keep only READY subchannels in the picker
* add file missed in a previous commit
* fix sanity
* iwyu
* add weight expiration period
* add tests for weight update period and OOB reporting period
* Automated change: Fix sanity tests
* lower bound for timer interval
* consistently apply grpc_test_slowdown_factor()
* cache time in test
* add blackout_period tests
* avoid some unnecessary copies
* clang-format
* add field to config test
* simplify orca watcher tracking
* attempt to fix build
* iwyu
* generate_projects
* add "_experimental" suffix to policy name
* WRR: update tests to cover qps plumbing
* WIP
* more WIP
* basic WRR e2e test working
* add OOB test
* fix sanity
* ignore duplicate addresses
* Automated change: Fix sanity tests
* add new tracer to doc/environment_variables.md
* retain scheduler state across pickers
* Automated change: Fix sanity tests
* use separate mutexes for scheduler and timer
* sort addresses to avoid index churn
* remove fetch_sub for wrap around in RR case
Co-authored-by: markdroth <markdroth@users.noreply.github.com>
* Gcp Observability: Lazily initialize channels post-init
* IWYU and fix build deps
* Run RegistryPostInit for client census filters too
* Remove unused function
Updates the ProtoReflectionDescriptorDatabase ctor to take a reference
to std::shared_ptr<grpc::ChannelInterface> rather than
std::shared_ptr<grpc::Channel>. This saves code that uses
grpc::ChannelInterface from having to cast from the base to the derived
type when creating the reflection db.
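A minimal sketch of why the wider parameter type helps, using stand-in types rather than the real gRPC classes. `ChannelInterface`, `Channel`, `WrappedChannel`, and `ReflectionDb` below are hypothetical placeholders, and the const-reference parameter is an assumption about the signature. A shared_ptr to a derived type converts implicitly to a shared_ptr to its base, but not the other way around.

```cpp
#include <memory>
#include <type_traits>

// Stand-ins for grpc::ChannelInterface and grpc::Channel (hypothetical).
struct ChannelInterface {
  virtual ~ChannelInterface() = default;
};
struct Channel final : ChannelInterface {};
// Some other implementation of the interface, e.g. a wrapper or a mock.
struct WrappedChannel final : ChannelInterface {};

// Stand-in for the reflection db: accepts any channel via the interface.
class ReflectionDb {
 public:
  explicit ReflectionDb(const std::shared_ptr<ChannelInterface>& channel)
      : channel_(channel) {}
  bool has_channel() const { return channel_ != nullptr; }

 private:
  std::shared_ptr<ChannelInterface> channel_;
};

// Derived-to-base conversion is implicit, so existing callers that hold a
// shared_ptr<Channel> keep compiling unchanged.
static_assert(std::is_convertible<std::shared_ptr<Channel>,
                                  std::shared_ptr<ChannelInterface>>::value,
              "derived-to-base shared_ptr conversion is implicit");
// Base-to-derived is not implicit; this is the cast callers had to write
// when the ctor required the derived type.
static_assert(!std::is_convertible<std::shared_ptr<ChannelInterface>,
                                   std::shared_ptr<Channel>>::value,
              "base-to-derived shared_ptr conversion needs an explicit cast");
```

With this shape, a caller holding only a `std::shared_ptr<ChannelInterface>` (for example one that points at a `WrappedChannel`) can construct `ReflectionDb` directly, with no `std::static_pointer_cast` or `std::dynamic_pointer_cast`.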
* initial
* Intermediate
* Another try
* Try multiple necessary pulls
* Filter works other than client half close
* Fixes
* Add a cancelled RPC test
* Handle trailer only responses
* Tests for disabled logging and truncated payloads
* Fix authority and peer
* Add TODOs and asserts for half-close
* Fix tests for half-close and cancel
* 2d748fcb1cf45cac62729b8346ad15e6abc79e97
* Fix sanity checks
* Strict bazel build
* Fix package
* IWYU
* Fix cmake
* Explicit cast to string
* Size casts
* Fix Arena leak and disable macos build for now
* Reviewer comments