All these changes need to go together to make sense:
- changes to use the new version of upb in Bazel
- allowing includes in the build target option
- a script for generating C code (upb) for protos
- generated code for the example protos
- changes for non-Bazel builds
- changes to the sanity tests to ignore the generated files
TCP_INQ is a socket option we added to Linux to report pending bytes
on the socket as a control message.
Using TCP_INQ we can accurately decide whether to continue reading or not.
Also add an urgent parameter for reads where we do not want to wait
for EPOLLIN.
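For reference, consuming TCP_INQ from userspace looks roughly like the
sketch below. This is illustrative, not gRPC's actual endpoint code;
enable_inq and recv_with_inq are hypothetical helpers, and TCP_INQ
requires Linux 4.18+.

    #include <string.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_INQ /* older headers: values from linux/tcp.h (4.18+) */
    #define TCP_INQ 36
    #define TCP_CM_INQ TCP_INQ
    #endif

    static void enable_inq(int fd) {
      /* Done once when the socket is created. */
      int one = 1;
      setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
    }

    static ssize_t recv_with_inq(int fd, void* buf, size_t len, int* inq) {
      char control[CMSG_SPACE(sizeof(int))];
      struct iovec iov = {.iov_base = buf, .iov_len = len};
      struct msghdr msg;
      memset(&msg, 0, sizeof(msg));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control;
      msg.msg_controllen = sizeof(control);

      ssize_t n = recvmsg(fd, &msg, 0);
      if (n < 0) return n;

      *inq = -1; /* unknown unless the kernel reported it */
      for (struct cmsghdr* cm = CMSG_FIRSTHDR(&msg); cm != NULL;
           cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level == IPPROTO_TCP && cm->cmsg_type == TCP_CM_INQ)
          memcpy(inq, CMSG_DATA(cm), sizeof(int));
      }
      /* *inq == 0 means nothing is pending: go back to epoll instead of
       * issuing another read that would just return EAGAIN. */
      return n;
    }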
This commit significantly improves the latency of the 1-RPC unary
(minimal) benchmark:
Before:
l_50: 61.3584984733
l_90: 94.8328711277
l_99: 126.211351174
l_999: 158.722406029
After:
l_50: 51.3546011488 (-16%)
l_90: 72.3420731581 (-23%)
l_99: 103.280218974 (-18%)
l_999: 130.905689996 (-17%)
grpc_byte_buffer_reader_next() copies and references the slice. This
is not always necessary, since many callers never touch the slice
after destroying the byte buffer.
A prominent example is the protobuf parser, which
calls grpc_byte_buffer_reader_next() and immediately unrefs the slice
after the call. These ref() and unref() calls can be very expensive
in the hot path.
This commit introduces grpc_byte_buffer_reader_peek(), which
essentially returns a pointer to the slice in the buffer, i.e.,
no copies and no refs.
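The difference between the two access patterns looks roughly like this
(a sketch; consume() is a hypothetical stand-in for the caller's
parsing logic):

    #include <stddef.h>
    #include <stdint.h>
    #include <grpc/byte_buffer.h>
    #include <grpc/byte_buffer_reader.h>
    #include <grpc/slice.h>

    static void consume(const uint8_t* data, size_t len) {
      (void)data; (void)len; /* stand-in for real parsing */
    }

    /* next: each slice comes back ref'd, so the caller must unref it. */
    static void parse_with_next(grpc_byte_buffer* bb) {
      grpc_byte_buffer_reader reader;
      grpc_byte_buffer_reader_init(&reader, bb);
      grpc_slice slice;
      while (grpc_byte_buffer_reader_next(&reader, &slice)) {
        consume(GRPC_SLICE_START_PTR(slice), GRPC_SLICE_LENGTH(slice));
        grpc_slice_unref(slice); /* the ref/unref pair we want to avoid */
      }
      grpc_byte_buffer_reader_destroy(&reader);
    }

    /* peek: a pointer into the buffer; no copy, no ref. The slice is
     * only valid while the byte buffer is alive. */
    static void parse_with_peek(grpc_byte_buffer* bb) {
      grpc_byte_buffer_reader reader;
      grpc_byte_buffer_reader_init(&reader, bb);
      grpc_slice* slice;
      while (grpc_byte_buffer_reader_peek(&reader, &slice)) {
        consume(GRPC_SLICE_START_PTR(*slice), GRPC_SLICE_LENGTH(*slice));
      }
      grpc_byte_buffer_reader_destroy(&reader);
    }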
QPS of the 1MiB 1-channel callback benchmark increases by 5%.
More importantly, instructions per cycle increase by 10%.
Also add tests and benchmarks for byte_buffer_reader_peek().
Passing grpc_slice by value and/or returning it can be very costly,
introducing many extra instructions to push the structure onto the
stack and pop it.
This CL, wherever possible, changes grpc_slice to be passed by
reference.
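In C terms the change is just this (an illustration with made-up
function names; grpc_slice is a struct of roughly 32 bytes, so the
by-value version copies the whole struct at every call boundary):

    #include <stddef.h>
    #include <grpc/slice.h>

    /* Before: copies sizeof(grpc_slice) bytes onto the stack per call. */
    static size_t payload_length_by_value(grpc_slice s) {
      return GRPC_SLICE_LENGTH(s);
    }

    /* After: passes a single pointer; no struct copy at the boundary. */
    static size_t payload_length_by_ref(const grpc_slice* s) {
      return GRPC_SLICE_LENGTH(*s);
    }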
On a local benchmark, I observe 4-7% improvements in latency and QPS.
There are still copies due to the slice_ref vtable, which @arjunroy
is fixing as part of his major effort to use grpc_core::RefCount
for slices and devirtualize them.
We flush these closures only when the connection goes IDLE. If we
have a continuous stream of bytes that never stops, no completions
are ever sent, and memory bloats because we never run the callbacks
of the ops.
For example, we use 100s of GiB of memory after a minute of exchanging
1MiB RPCs with the callback API.
This patch runs the closures as soon as each write action finishes.
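Schematically, the fix moves the flush point (a hedged sketch with
hypothetical names, not the actual chttp2 internals):

    #include <stddef.h>

    typedef struct closure {
      void (*cb)(void* arg);
      void* arg;
      struct closure* next;
    } closure;

    typedef struct {
      closure* pending; /* completions queued while writing */
    } transport;

    static void flush_completions(transport* t) {
      while (t->pending != NULL) {
        closure* c = t->pending;
        t->pending = c->next;
        c->cb(c->arg); /* releases the op's resources; skipping this
                          under a continuous byte stream is the leak */
      }
    }

    static void write_action_end(transport* t) {
      /* Before: completions ran only from the on-idle path, which a
       * never-ending stream of bytes never reaches. After: run them
       * here, once per completed write action. */
      flush_completions(t);
    }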
After this change, memory usage remains stable for the 1MiB benchmark.
QPS increases from 520 to 749, and latency drops by 70ms, because we
were essentially page-faulting on every RPC.