Joshua Haberman
4132034634
Addressed PR comment.
4 years ago
Joshua Haberman
ed708fcd5d
Addressed PR comments.
4 years ago
Joshua Haberman
876abae2db
Removed some debug printing and simplified checktag slightly.
4 years ago
Joshua Haberman
286441afa7
Fixed a size regression due to inlining UTF-8 verification.
...
Overall size/speed impact on fasttable decoder is now:
name                                      old time/op  new time/op  delta
ArenaOneAlloc                             21.5ns ± 0%  21.5ns ± 0%     ~     (p=0.060 n=12+12)
ArenaInitialBlockOneAlloc                 6.33ns ± 0%  6.33ns ± 0%     ~     (p=0.413 n=11+12)
LoadDescriptor_Upb                        43.4µs ± 1%  45.5µs ± 1%   +4.79%  (p=0.000 n=12+12)
LoadAdsDescriptor_Upb                     2.50ms ± 0%  2.51ms ± 2%     ~     (p=0.512 n=10+11)
LoadDescriptor_Proto2                      240µs ± 0%   240µs ± 0%   -0.25%  (p=0.000 n=12+12)
LoadAdsDescriptor_Proto2                  12.9ms ± 0%  12.9ms ± 0%   +0.20%  (p=0.014 n=10+12)
Parse_Upb_FileDesc<UseArena,Copy>         4.99µs ± 0%  5.04µs ± 0%   +0.98%  (p=0.000 n=11+10)
Parse_Upb_FileDesc<UseArena,Alias>        4.02µs ± 0%  4.18µs ± 0%   +4.16%  (p=0.000 n=10+12)
Parse_Upb_FileDesc<InitBlock,Copy>        4.49µs ± 0%  4.54µs ± 0%   +1.16%  (p=0.000 n=11+10)
Parse_Upb_FileDesc<InitBlock,Alias>       3.60µs ± 0%  3.80µs ± 0%   +5.73%  (p=0.000 n=12+11)
Parse_Proto2<FileDesc,NoArena,Copy>       29.3µs ± 0%  29.3µs ± 0%     ~     (p=0.069 n=11+12)
Parse_Proto2<FileDesc,UseArena,Copy>      20.2µs ± 3%  20.3µs ± 2%     ~     (p=0.880 n=12+11)
Parse_Proto2<FileDesc,InitBlock,Copy>     16.5µs ± 0%  16.5µs ± 0%     ~     (p=1.000 n=12+12)
Parse_Proto2<FileDescSV,InitBlock,Alias>  16.4µs ± 0%  16.4µs ± 1%     ~     (p=0.590 n=12+12)
SerializeDescriptor_Proto2                5.31µs ± 1%  6.65µs ±29%  +25.07%  (p=0.000 n=12+12)
SerializeDescriptor_Upb                   12.4µs ± 0%  12.5µs ± 0%   +1.23%  (p=0.000 n=12+12)
    FILE SIZE        VM SIZE
 --------------  --------------
   +16%    +128  [ = ]       0    [Unmapped]
  -1.2%      -4  -1.2%      -4    [section .text]
  [NEW]      +2  [NEW]      +2        fastdecode_isdonefallback
  [DEL]      -6  [DEL]      -6        fastdecode_longstring_noutf8
  -0.2%    -124  -0.2%    -124    upb/decode_fast.c
  +5.8%     +64  +6.0%     +64        upb_pom_1bt_max64b
  +2.7%     +64  +2.7%     +64        upb_ppv8_2bt
  +2.7%     +32  +2.8%     +32        upb_psm_1bt_max256b
  +2.8%     +32  +3.0%     +32        upb_psm_1bt_max64b
  +2.8%     +32  +3.0%     +32        upb_psm_2bt_max64b
  +4.0%     +24  +4.2%     +24        upb_psv8_1bt
  +2.0%     +16  +2.1%     +16        upb_prf4_2bt
  +1.3%     +16  +1.4%     +16        upb_prz8_2bt
  -0.3%      -4  -0.3%      -4        [3 Others]
  -1.6%      -8  -1.7%      -8        upb_cob_1bt
  -1.6%      -8  -1.7%      -8        upb_csb_1bt
  -2.5%     -16  -2.6%     -16        upb_pov4_1bt
  -1.3%     -16  -1.3%     -16        upb_prv8_2bt
  -2.5%     -16  -2.7%     -16        upb_psv4_1bt
  -2.5%     -16  -2.6%     -16        upb_psv4_2bt
  -3.0%     -32  -3.1%     -32        upb_prs_2bt
  -2.6%     -32  -2.6%     -32        upb_prv4_2bt
  -4.9%     -48  -5.1%     -48        upb_prb_2bt
  -3.9%     -48  -4.0%     -48        upb_prv4_1bt
  -7.2%     -72  -7.5%     -72        upb_prb_1bt
  -7.8%     -88  -8.0%     -88        upb_prs_1bt
  [ = ]       0  -0.1%    -128    TOTAL
There is a bit of speed regression, but it appears there were bigger
CPU regressions prior to this. We probably need some separate
optimization attention again to get back to the performance numbers
we had when fasttable was first submitted.
4 years ago
Joshua Haberman
e84793dd73
Cleaned up debugging artifacts.
4 years ago
Joshua Haberman
a4b35aa388
Everything passes except 4 conformance tests.
4 years ago
Joshua Haberman
3881393907
Renamed .int.h to _internal.h, for greater clarity.
4 years ago
Joshua Haberman
823eb09694
Update all 2011 dates to 2021.
4 years ago
Joshua Haberman
e59d2c8fa7
Added license headers to all files.
4 years ago
Joshua Haberman
e4343f0fa3
Update comment for ARM64.
4 years ago
Joshua Haberman
65d166a6ba
Added API for copy vs. alias and added benchmarks to test both.
...
Benchmark output:
$ bazel-bin/benchmarks/benchmark '--benchmark_filter=BM_Parse'
2020-11-11 15:39:04
Running bazel-bin/benchmarks/benchmark
Run on (72 X 3700 MHz CPU s)
CPU Caches:
L1 Data 32K (x36)
L1 Instruction 32K (x36)
L2 Unified 1024K (x36)
L3 Unified 25344K (x2)
-------------------------------------------------------------------------------------
Benchmark                                           Time        CPU   Iterations
-------------------------------------------------------------------------------------
BM_Parse_Upb_FileDesc<UseArena, Copy>            4134 ns    4134 ns       168714   1.69152GB/s
BM_Parse_Upb_FileDesc<UseArena, Alias>           3487 ns    3487 ns       199509   2.00526GB/s
BM_Parse_Upb_FileDesc<InitBlock, Copy>           3727 ns    3726 ns       187581   1.87643GB/s
BM_Parse_Upb_FileDesc<InitBlock, Alias>          3110 ns    3110 ns       224970   2.24866GB/s
BM_Parse_Proto2<FileDesc, NoArena, Copy>        31132 ns   31132 ns        22437   229.995MB/s
BM_Parse_Proto2<FileDesc, UseArena, Copy>       21011 ns   21009 ns        33922   340.812MB/s
BM_Parse_Proto2<FileDesc, InitBlock, Copy>      17976 ns   17975 ns        38808   398.337MB/s
BM_Parse_Proto2<FileDescSV, InitBlock, Alias>   17357 ns   17356 ns        40244   412.539MB/s
4 years ago
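The copy vs. alias distinction measured above can be illustrated with a minimal sketch. This is not upb's API: the `sketch_strview` type and both functions are hypothetical names, and `malloc` stands in for the arena allocation a real decoder would presumably use. In alias mode the parsed string points straight into the input buffer, so the input must outlive the message; in copy mode the bytes are duplicated and the input can be discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string view type, for illustration only. */
typedef struct {
  const char *data;
  size_t size;
} sketch_strview;

/* Alias mode: no copy; the view is only valid while `input` stays alive. */
sketch_strview sketch_alias(const char *input, size_t size) {
  assert(input != NULL || size == 0);
  sketch_strview v = {input, size};
  return v;
}

/* Copy mode: duplicates the bytes. The caller owns (and frees) the copy.
 * Returns {NULL, 0} if allocation fails. */
sketch_strview sketch_copy(const char *input, size_t size) {
  assert(input != NULL || size == 0);
  sketch_strview v = {NULL, 0};
  char *buf = malloc(size ? size : 1);
  if (buf == NULL) return v;
  if (size) memcpy(buf, input, size);
  v.data = buf;
  v.size = size;
  return v;
}
```

Aliasing skips the allocation and the `memcpy`, which is consistent with the Alias rows being faster than the Copy rows in the benchmark output above.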
Joshua Haberman
982b634bc5
Fixed a few minor bugs found by fuzzing.
4 years ago
Joshua Haberman
1eb7bd39e7
Some formatting fixes.
4 years ago
Joshua Haberman
154f2c25f4
Added UTF-8 validation for proto3 string fields.
4 years ago
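For readers unfamiliar with what UTF-8 validation involves, here is a straightforward, unoptimized sketch. It is not the code from this commit; the function name is hypothetical. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if s[0..len) is well-formed UTF-8. */
bool sketch_utf8_is_valid(const unsigned char *s, size_t len) {
  assert(s != NULL || len == 0);
  size_t i = 0;
  while (i < len) {
    unsigned char c = s[i];
    size_t n;     /* number of continuation bytes expected */
    uint32_t cp;  /* code point being assembled */
    uint32_t min; /* smallest code point allowed for this length */
    if (c < 0x80) {
      i++; /* ASCII */
      continue;
    } else if ((c & 0xE0) == 0xC0) {
      n = 1; cp = c & 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      n = 2; cp = c & 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      n = 3; cp = c & 0x07; min = 0x10000;
    } else {
      return false; /* stray continuation byte or invalid lead byte */
    }
    if (len - i <= n) return false; /* sequence is truncated */
    for (size_t k = 1; k <= n; k++) {
      unsigned char cc = s[i + k];
      if ((cc & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min) return false;                     /* overlong encoding */
    if (cp > 0x10FFFF) return false;                /* out of range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
    i += n + 1;
  }
  return true;
}
```

A production decoder would typically add an ASCII fast path that checks many bytes at once; this version favors clarity.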
Joshua Haberman
e8f9eac68c
Added #defines UPB_ENABLE_FASTTABLE and UPB_TRY_ENABLE_FASTTABLE.
...
These control whether fasttable decoding is on.
4 years ago
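The commit message only says that the two macros control whether fasttable decoding is on. One plausible reading of the names, sketched below, is that `UPB_ENABLE_FASTTABLE` demands the fast path (failing the build where it is unsupported) while `UPB_TRY_ENABLE_FASTTABLE` enables it only where supported. That semantics is an assumption, as are the `SKETCH_*` macros, the platform check, and the helper function, none of which come from the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the fast path needs x86-64 and a GNU-compatible compiler. */
#if defined(__x86_64__) && defined(__GNUC__)
#define SKETCH_FASTTABLE_SUPPORTED 1
#else
#define SKETCH_FASTTABLE_SUPPORTED 0
#endif

#if defined(UPB_ENABLE_FASTTABLE)
/* Hard request: refuse to build if the platform cannot honor it. */
#if !SKETCH_FASTTABLE_SUPPORTED
#error "fasttable was requested but is not supported on this platform"
#endif
#define SKETCH_FASTTABLE 1
#elif defined(UPB_TRY_ENABLE_FASTTABLE)
/* Soft request: use the fast path only where it is supported. */
#define SKETCH_FASTTABLE SKETCH_FASTTABLE_SUPPORTED
#else
/* Neither macro defined: fast path stays off. */
#define SKETCH_FASTTABLE 0
#endif

/* Hypothetical helper that reports the compile-time decision. */
bool sketch_fasttable_enabled(void) {
  return SKETCH_FASTTABLE != 0;
}
```

Under this reading, a build would pass something like `-DUPB_TRY_ENABLE_FASTTABLE` for portable targets and `-DUPB_ENABLE_FASTTABLE` when the fast path is required.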
Joshua Haberman
e86541ac1d
Fixed the build after the merge.
4 years ago
Joshua Haberman
efd576b698
Added -std=gnu99 for fastdecode and ran Buildifier.
4 years ago
Joshua Haberman
b928696942
A few more fixes, and test fastdecode under Kokoro.
4 years ago
Joshua Haberman
55f3569cd2
A few minor fixes and more assertions.
4 years ago
Joshua Haberman
46eb82467a
Added comment to decode_fast.h.
4 years ago
Joshua Haberman
bd9f8f580d
Fixed a few bugs with the fast decoder.
...
1. For long tags we were putting table entries in the wrong slot.
2. For repeated strings, when the buffer flipped to no longer alias we
were failing to notice and kept aliasing anyway.
4 years ago
Joshua Haberman
021db6fcd5
Allow larger tags into the table if they are unique mod 31.
...
Also fixed a bug with fixed packed in decode_fast.c.
4 years ago
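The "unique mod 31" condition can be read as a collision check: a larger tag may share the table as long as no other field lands in the same slot. The sketch below makes that check concrete under stated assumptions. It treats the slot as the field number reduced by a modulus passed in by the caller, since the commit does not say exactly which value is reduced or how; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_SLOTS 64

/* Returns true if every field number maps to a distinct slot when reduced
 * modulo num_slots, i.e. the table can hold all of them without collisions. */
bool sketch_slots_are_unique(const uint32_t *field_numbers, size_t n,
                             uint32_t num_slots) {
  assert(num_slots > 0 && num_slots <= SKETCH_MAX_SLOTS);
  bool used[SKETCH_MAX_SLOTS] = {false};
  for (size_t i = 0; i < n; i++) {
    uint32_t slot = field_numbers[i] % num_slots;
    if (used[slot]) return false; /* two fields would share a slot */
    used[slot] = true;
  }
  return true;
}
```

For example, fields 1, 2, and 40 reduce to 1, 2, and 9 modulo 31 and can coexist, while fields 1 and 32 both reduce to 1 and would collide.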
Joshua Haberman
86d9908c55
Fastdecode support for packed fields.
...
This is not very optimized yet. There is a lot of room to
optimize it further.
4 years ago
Joshua Haberman
e3e797b680
Added fasttable support for oneofs.
4 years ago
Joshua Haberman
7ffa9c181a
Fixed some small bugs and performance problems in string copying.
...
Before this CL, with alias=false:
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_Parse_Upb_FileDesc_WithInitialBlock 3715 ns 3715 ns 188916 1.88206GB/s
Performance counter stats for 'bazel-bin/benchmarks/benchmark --benchmark_filter=BM_Parse_Upb_FileDesc_WithInitialBlock':
1,122.92 msec task-clock # 0.979 CPUs utilized
3 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
196 page-faults # 0.175 K/sec
4,144,746,717 cycles # 3.691 GHz
15,351,966,804 instructions # 3.70 insn per cycle
2,590,281,905 branches # 2306.728 M/sec
2,996,157 branch-misses # 0.12% of all branches
1.146615328 seconds time elapsed
1.115578000 seconds user
0.008025000 seconds sys
After this CL:
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_Parse_Upb_FileDesc_WithInitialBlock 3554 ns 3554 ns 197527 1.9674GB/s
Performance counter stats for 'bazel-bin/benchmarks/benchmark --benchmark_filter=BM_Parse_Upb_FileDesc_WithInitialBlock':
1,105.34 msec task-clock # 0.982 CPUs utilized
3 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
197 page-faults # 0.178 K/sec
4,077,736,892 cycles # 3.689 GHz
15,442,709,352 instructions # 3.79 insn per cycle
2,435,131,301 branches # 2203.068 M/sec
2,643,775 branch-misses # 0.11% of all branches
1.125393845 seconds time elapsed
1.097770000 seconds user
0.008012000 seconds sys
4 years ago
Joshua Haberman
e2c709e047
Repeated string and primitive support.
...
Much of the code was adapted from Gerben's code in:
6333031195
4 years ago
Joshua Haberman
d81ba58215
Optimized short string copying.
...
This sped up the alias=false case:
Before:
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_Parse_Upb_FileDesc_WithInitialBlock 4562 ns 4562 ns 153251 1.53276GB/s
Performance counter stats for 'bazel-bin/benchmarks/benchmark --benchmark_filter=BM_Parse_Upb_FileDesc_WithInitialBlock':
1,216.65 msec task-clock # 0.936 CPUs utilized
6 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
200 page-faults # 0.164 K/sec
4,490,925,650 cycles # 3.691 GHz
16,516,403,731 instructions # 3.68 insn per cycle
2,828,536,650 branches # 2324.861 M/sec
5,425,830 branch-misses # 0.19% of all branches
1.300178903 seconds time elapsed
1.211475000 seconds user
0.072207000 seconds sys
After:
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_Parse_Upb_FileDesc_WithInitialBlock 3587 ns 3587 ns 195749 1.94935GB/s
Performance counter stats for 'bazel-bin/benchmarks/benchmark --benchmark_filter=BM_Parse_Upb_FileDesc_WithInitialBlock':
1,109.69 msec task-clock # 0.930 CPUs utilized
5 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
198 page-faults # 0.178 K/sec
4,094,010,257 cycles # 3.689 GHz
15,672,677,812 instructions # 3.83 insn per cycle
2,589,291,160 branches # 2333.346 M/sec
3,306,386 branch-misses # 0.13% of all branches
1.193221789 seconds time elapsed
1.102538000 seconds user
0.072166000 seconds sys
4 years ago
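The commit does not describe which technique it used for short strings. One common approach, shown here purely as an illustration with a hypothetical function name, replaces a variable-length `memcpy` with two fixed-size copies that may overlap. Fixed-size `memcpy` calls usually compile to plain loads and stores, which avoids the call overhead and length-dependent branching of a general copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies len bytes (len <= 16) using at most two fixed-size copies.
 * For 8..16 bytes, the first 8 and the last 8 bytes are copied; the two
 * ranges overlap when len < 16, which is harmless because both write the
 * same data. Smaller sizes use the same trick with 4- and 2-byte copies. */
void sketch_copy_short(char *dst, const char *src, size_t len) {
  assert(len <= 16);
  if (len >= 8) {
    memcpy(dst, src, 8);
    memcpy(dst + len - 8, src + len - 8, 8);
  } else if (len >= 4) {
    memcpy(dst, src, 4);
    memcpy(dst + len - 4, src + len - 4, 4);
  } else if (len >= 2) {
    memcpy(dst, src, 2);
    memcpy(dst + len - 2, src + len - 2, 2);
  } else if (len == 1) {
    dst[0] = src[0];
  }
}
```

Unlike schemes that always copy a full 16 bytes, this variant never reads or writes outside the `len` bytes requested, so it needs no slack space after either buffer.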
Joshua Haberman
f3a2a79349
More optimization, back up to 2.56GB/s.
4 years ago
Joshua Haberman
199c914295
Simplify push/pop when msg fits in the current buffer.
4 years ago
Joshua Haberman
d5f5db2729
Put string-copying field parser into a separate function.
...
This helps to regain a bit of lost perf. Now at 2.3GB/s.
4 years ago
gerben-s
9e68ec033f
Add repeated varints and fixed parsers
4 years ago
Joshua Haberman
2a574d3d01
Added a bunch of comments for readability.
4 years ago
Joshua Haberman
5b0c5c7d4a
Dispatch inline.
4 years ago
Joshua Haberman
75edd3e59c
Changed to use table pairs, seems to ever-so-slightly regress.
4 years ago
Joshua Haberman
bca7edac8c
Cleaned up table compression a bit.
4 years ago
Joshua Haberman
b95f217996
A little speed boost, now hitting 2.51GB/s.
4 years ago
Joshua Haberman
8ed6b2fe85
Stored mask in the table pointer.
4 years ago
Joshua Haberman
a6dc88556d
Tables are compressed, but perf goes down to 2.44GB/s.
4 years ago
Joshua Haberman
f01efe8b64
Removed another C99-ism.
4 years ago
Joshua Haberman
1749082bbb
Removed C99-ism.
4 years ago
Gerben Stavenga
4053805759
Bugfixes
4 years ago
Gerben Stavenga
36662b3735
Refactor some code. I extracted some common code from all message field
...
parsers into a tail-recursive function. Replaced the varint jump table with
a simple varint parse loop, which removes the stack frames. Also took care
not to lose information in the repeated message tag check: when written
carefully, the checks and loads that already happen can be reused for tag
dispatch if the tag is not the expected one.
4 years ago
Joshua Haberman
9938cf8f27
Put submsg_index directly in table data. Drop oneof support for now to focus.
4 years ago
Joshua Haberman
89bd8b87e1
Fixed a few more C89 compat issues.
4 years ago
Joshua Haberman
64d293894a
Fixed bug introduced by last optimization.
4 years ago
Joshua Haberman
ff957b996c
Fixed C89 compat issues.
4 years ago
Joshua Haberman
537b6f42c2
A few updates to the benchmark and minor implementation changes.
4 years ago
Joshua Haberman
0dcc5641eb
Replicated dispatch and implemented array resizing logic. Up to 2.67GB/s.
4 years ago
Joshua Haberman
526e430794
I think this may have reached the optimization limit.
...
-------------------------------------------------------------------------
Benchmark                               Time        CPU   Iterations
-------------------------------------------------------------------------
BM_ArenaOneAlloc                       21 ns      21 ns     32994231
BM_ArenaInitialBlockOneAlloc            6 ns       6 ns    116318005
BM_ParseDescriptorNoHeap             3028 ns    3028 ns       231138   2.34354GB/s
BM_ParseDescriptor                   3557 ns    3557 ns       196583   1.99498GB/s
BM_ParseDescriptorProto2NoArena     33228 ns   33226 ns        21196   218.688MB/s
BM_ParseDescriptorProto2WithArena   22863 ns   22861 ns        30666   317.831MB/s
BM_SerializeDescriptorProto2         5444 ns    5444 ns       127368   1.30348GB/s
BM_SerializeDescriptor              12509 ns   12508 ns        55816   580.914MB/s
$ perf stat bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap
2020-10-08 14:07:06
Running bazel-bin/benchmark
Run on (72 X 3700 MHz CPU s)
CPU Caches:
L1 Data 32K (x36)
L1 Instruction 32K (x36)
L2 Unified 1024K (x36)
L3 Unified 25344K (x2)
----------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------
BM_ParseDescriptorNoHeap 3071 ns 3071 ns 227743 2.31094GB/s
Performance counter stats for 'bazel-bin/benchmark --benchmark_filter=BM_ParseDescriptorNoHeap':
1,050.22 msec task-clock # 0.978 CPUs utilized
4 context-switches # 0.004 K/sec
0 cpu-migrations # 0.000 K/sec
179 page-faults # 0.170 K/sec
3,875,796,334 cycles # 3.690 GHz
13,282,835,967 instructions # 3.43 insn per cycle
2,887,725,848 branches # 2749.627 M/sec
8,324,912 branch-misses # 0.29% of all branches
1.073924364 seconds time elapsed
1.042806000 seconds user
0.008021000 seconds sys
Profile:
23.96% benchmark benchmark [.] upb_prm_1bt_max192b
22.44% benchmark benchmark [.] fastdecode_dispatch
18.96% benchmark benchmark [.] upb_pss_1bt
14.20% benchmark benchmark [.] upb_psv4_1bt
8.33% benchmark benchmark [.] upb_prm_1bt_max64b
6.66% benchmark benchmark [.] upb_prm_1bt_max128b
1.29% benchmark benchmark [.] upb_psm_1bt_max64b
0.77% benchmark benchmark [.] fastdecode_generic
0.55% benchmark [kernel.kallsyms] [k] smp_call_function_single
0.42% benchmark [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.42% benchmark benchmark [.] upb_psm_1bt_max256b
0.31% benchmark benchmark [.] upb_psb1_1bt
0.21% benchmark benchmark [.] upb_plv4_5bv
0.14% benchmark benchmark [.] upb_psb1_2bt
0.12% benchmark benchmark [.] decode_longvarint64
0.08% benchmark [kernel.kallsyms] [k] vsnprintf
0.07% benchmark [kernel.kallsyms] [k] _raw_spin_lock
0.07% benchmark benchmark [.] _upb_msg_new
0.06% benchmark ld-2.31.so [.] check_match
4 years ago
Joshua Haberman
4c65b25daf
Handle long varints, now 2GB/s!
4 years ago
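As background for the "long varints" commit: protobuf varints store 7 bits per byte, least significant group first, with the high bit set on every byte except the last. A decoder can handle the common one-byte case inline and fall back to a loop for longer encodings. The sketch below shows that general shape; it is not the code from this commit, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one base-128 varint from [p, end). On success stores the value in
 * *out and returns a pointer just past the varint. Returns NULL if the input
 * is truncated or the varint is longer than 10 bytes. */
const uint8_t *sketch_decode_varint64(const uint8_t *p, const uint8_t *end,
                                      uint64_t *out) {
  assert(p != NULL && end != NULL && out != NULL);
  /* Fast path: a single byte with the continuation bit clear. */
  if (p < end && (*p & 0x80) == 0) {
    *out = *p;
    return p + 1;
  }
  /* Slow path: accumulate 7 bits per byte, for at most 10 bytes. */
  uint64_t val = 0;
  for (int i = 0; i < 10 && p < end; i++) {
    uint8_t byte = *p++;
    val |= (uint64_t)(byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) {
      *out = val;
      return p;
    }
  }
  return NULL;
}
```

For example, the two bytes `0xAC 0x02` decode to 300: the low seven bits of `0xAC` give 44, and `0x02` shifted left by 7 adds 256.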