CNR a WindowsEventEngine listener flake in:
* 10k local Windows development machine runs
* 50k Windows RBE runs
* 10k Windows VM runs
It fails ~5 times per day on the master CI jobs.
This PR adds some logging to try to see if an edge is missed, and
switches the thread pool implementation to see if that makes the flake
go away. If the flakes disappear, I'll try removing one or the other to
see if either independently fix the problem (hopefully not logging).
---------
Co-authored-by: drfloob <drfloob@users.noreply.github.com>
Reverts grpc/grpc#33819
Verified that it passed these jobs:
`grpc/core/master/linux/grpc_basictests_c_cpp_dbg`
`grpc/core/master/linux/grpc_basictests_c_cpp_opt`
`grpc/core/master/linux/grpc_portability`
Why: Cleanup for chttp2_transport ahead of promise conversion - lots of
logic has become interleaved throughout chttp2, so some effort to
isolate logic out is warranted ahead of that conversion.
What: Split configuration and policy tracking for each of ping rate
throttling and abuse detection into their own modules. Add tests for
them.
Incidentally: Split channel args into their own header so that we can
split the policy stuff into separate build targets.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This PR is expected to fix the flakes of
`//test/core/tsi:ssl_transport_security_test` when built under UBSAN.
Why is this needed? There are several tests in
`ssl_transport_security_test.cc` that involve doing many expensive
operations and PR #33638 recently added one more (namely, repeatedly
signing with an ECDSA key). The slow tests are already altered for MSAN
and TSAN, and now we need to do the same for UBSAN.
Those tests are failing on CIs which do not have twisted installed. Skip
them for now and will fix the docker images next.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This PR implements a c-ares based DNS resolver for EventEngine with the
reference from the original
[grpc_ares_wrapper.h](../blob/master/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h).
The PosixEventEngine DNSResolver is implemented on top of that. Tests
which use the client channel resolver API
([resolver.h](../blob/master/src/core/lib/resolver/resolver.h#L54)) are
ported, namely the
[resolver_component_test.cc](../blob/master/test/cpp/naming/resolver_component_test.cc)
and the
[cancel_ares_query_test.cc](../blob/master/test/cpp/naming/cancel_ares_query_test.cc).
The WindowsEventEngine DNSResolver will use the same EventEngine's
grpc_ares_wrapper and will be worked on next.
The
[resolve_address_test.cc](https://github.com/grpc/grpc/blob/master/test/core/iomgr/resolve_address_test.cc)
which uses the iomgr
[DNSResolver](../blob/master/src/core/lib/iomgr/resolve_address.h#L44)
API has been ported to EventEngine's dns_test.cc. That leaves only 2
tests which use iomgr's API, notably the
[dns_resolver_cooldown_test.cc](../blob/master/test/core/client_channel/resolvers/dns_resolver_cooldown_test.cc)
and the
[goaway_server_test.cc](../blob/master/test/core/end2end/goaway_server_test.cc)
which probably need to be restructured to use EventEngine DNSResolver
(for one thing they override the original grpc_ares_wrapper's free
functions). I will try to tackle these in the next step.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
Promises code can prevent these bad requests from even reaching the
application, which is beneficial but this test needs a minor update to
handle it.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
This test has been disabled for a long time now due to flakiness, but
it's now causing problems with the import. And stress tests don't
provide positive ROI anyway, so let's just get rid of it.
Based on the discussion at:
595a75cc5d..e3b402a8fa (r1244325752)
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
There is a bug in the SSL stack that was only partially fixed in #29176:
if more than 17kb is written to the BIO buffer, then everything over
17kb will be discarded, and the SSL handshake will fail with a bad
record mac error or hang if not enough bytes have arrived yet.
It's relatively uncommon to hit this bug, because the TLS handshake
messages need to be much larger than normal for you to have a chance of
hitting this bug. However, there was a separate bug in the SSL stack
(recently fixed in #33558) that causes the ServerHello produced by a
gRPC-C++ TLS server to grow linearly in size with the size of the trust
bundle; these 2 bugs combined to cause a large number of TLS handshake
failures for gRPC-C++ clients talking to gRPC-C++ servers when the
server had a large trust bundle.
This PR fixes the bug by ensuring that all bytes are successfully
written to the BIO buffer. An initial quick fix for this bug was planned
in #33611, but abandoned because we were worried about temporarily
doubling the memory footprint of all SSL channels.
The complexity in this PR is mostly in the test: it is fairly tricky to
force gRPC-C++'s SSL stack to generate a sufficiently large ServerHello
to trigger this bug.
Note that the plugin is still under `grpc::internal` namespace and not
under `experimental` intentionally.
<!--
If you know who should review your pull request, please assign it to
that
person, otherwise the pull request would get assigned randomly.
If your pull request is for a specific language, please add the
appropriate
lang label.
-->
This reverts the following PRs: #32692#33087#33093#33427#33568
These changes seem to have introduced some flaky crashes. Reverting
while I investigate.
As per gRFC A58, when WRR sees a subchannel report READY, it reset the
non_empty_since value, thus restarting the blackout period. However,
there were two cases where we were incorrectly triggering this code:
1. When WRR got an updated address list that contained addresses that
were already present on the old list and whose subchannels were already
in READY state, the initial notification for those subchannels on the
new list was READY, which incorrectly triggered resetting the
non_empty_since value.
2. Due to a bug in the outlier_detection policy, whenever an update was
propagated down through the OD policy without actually enabling OD, it
would incorrectly send a duplicate connectivity state notification for
the subchannels. This meant that a subchannel that was already in state
READY would report READY again, which would also incorrectly trigger
resetting the non_empty_since value.
This PR makes two changes:
1. Fix the bug in outlier_detection that caused it to generate the
spurious duplicate READY updates.
2. Fix WRR to reset the non_empty_since value when a subchannel goes
READY only if the subchannel has seen a previous state update and only
if that previous state was not READY. (The duplicate READY notifications
should not actually happen anymore now that the OD policy has been
fixed, but better to be defensive.)
Fixes b/290983884.
Adds access token lifetime configuration for workload identity
federation with service account impersonation for both explicit and
implicit flows.
Changes:
1. Adds a new member "service_account_impersonation" to the
ExternalAccountCredentials class. "token_lifetime_seconds" is a member
of "service_account_impersonation".
2. Adds validation checks, like token_lifetime_seconds should be between
the minimum and maximum accepted value, during the creation of an
ExternalAccountCredentials object.
3. Appends "lifetime" to the body of the service account impersonation
request.
Tests:
1. Modifies a test to check if the default value is passed when
"service_account_impersonation" is empty.
2. Adds tests to check if the token_lifetime_seconds value is propagated
to the request body.
3. Adds tests to verify that an error is thrown when
token_lifetime_seconds is invalid.
- Support call finalizers in filter test.
- Add an accessor to the filter implementation from the channel, so that
it can be interrogated by tests.
- Matcher to ensure that some metadata is *not* in a metadata batch
(functionality needed to support the additional testing we talked about
this morning)
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
In real services most of our time ends up in the `Read1()` function,
which populates one byte into the bit buffer.
Change this to read in as many as possible bytes at a time into that
buffer.
Additionally, generate all possible (to some depth) parser geometries,
and add a benchmark for them. Run that benchmark and select the best
geometry for decoding base64 strings (since this is the main use-case).
(gives about a 30% speed boost parsing base64 then huffman encoded
random binary strings)
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
tcp_posix_test is incorrectly assuming that all endpoint_writes with
timestamps enabled will be successfully traced. Remove the timestamps
checking related tests to prevent flakes when the test is enabled
internally.
Most of the time parsing succeeds, and only rarely do we see an error.
This change reduces the parse memento size from 120 bytes to 56 bytes.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>
I'd been adding the following stanza regularly to debug flakes/fuzz
failures:
```
Expect(1, CoreEnd2endTest::MaybePerformAction{[&](bool success) {
Crash(absl::StrCat(
"Unexpected completion of client side call: success=",
success ? "true" : "false", " status=", server_status.ToString(),
" initial_md=", server_initial_metadata.ToString()));
}});
```
it was helpful because it indicated why a call batch finished
successfully and helped quickly identify next steps.
It occurred to me however that this would better be done inside of the
framework, and for *all* ops that have outputs, so this PR does just
that. Any time a batch with an op that outputs information finishes
successfully but unexpectedly we now display those outputs in human
readable form in the error message.
Sample output:
```
[ RUN ] CorpusExamples/FuzzerCorpusTest.RunOneExample/0
RUN TEST: Http2SingleHopTest.SimpleDelayedRequestShort/Chttp2SimpleSslFullstack
E0101 00:00:05.000000000 396633 simple_delayed_request.cc:37] Create client side call
E0101 00:00:05.000000000 396633 simple_delayed_request.cc:41] Start initial batch
E0101 00:00:05.000000000 396633 simple_delayed_request.cc:47] Start server
E0101 00:00:05.000000000 396633 cq_verifier.cc:364] Verify tag(101)-✅ for 600000ms
test/core/end2end/cq_verifier.cc:316: Unexpected event: OP_COMPLETE: tag:0x1 OK
with:
incoming_metadata: {}
status_on_client: status=4 msg=Deadline Exceeded trailing_metadata={}
checked @ test/core/end2end/tests/simple_delayed_request.cc:51
expected:
test/core/end2end/tests/simple_delayed_request.cc:50: tag(101) success=true
```
The intuition here is that these strings may end up in the hpack table,
and then unnecessarily extend the lifetime of the read blocks.
Instead, take a copy of these short strings when we need to and allow
the incoming large memory object to be discarded.
---------
Co-authored-by: ctiller <ctiller@users.noreply.github.com>