upb has traditionally returned 16-byte-aligned pointers from arena allocation. This was out of an abundance of caution, since users could theoretically be using upb arenas to allocate memory that is then used for SSE/AVX values (eg. [`__m128`](https://docs.microsoft.com/en-us/cpp/cpp/m128?view=msvc-170), which require 16-byte alignment.
In practice, the protobuf C++ arena has used 8-byte alignment for 8 years with no significant problems I know of arising from SSE etc.
Reducing the alignment requirement to 8 will save memory. It will also help with compatibility on 32-bit architectures where `malloc()` only returns 8-byte aligned memory. The immediate motivation is to fix the win32 build for Python protobuf.
PiperOrigin-RevId: 448331777
There was a bug in our arena code where we assumed that
sizeof(upb_array) would be a multiple of 8. On i386 it was
not, and this was causing memory corruption on 32-bit builds.
New code is smaller (in both source size and compiled size) and faster.
# Speed
The decoder speeds up on all machines I tested, though the amount of speedup varies. I was only able to test Intel CPUs.
### Linux Desktop
```
CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
OS: Linux
name old time/op new time/op delta
CreateArena 4.72ns ± 0% 4.93ns ± 0% +4.47% (p=0.000 n=11+11)
ParseDescriptor 12.4µs ± 1% 9.1µs ± 1% -26.65% (p=0.000 n=11+11)
```
### Mac Laptop
```
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
OS: macOS
name old time/op new time/op delta
CreateArena 5.33ns ± 3% 5.58ns ± 2% +4.69% (p=0.000 n=12+12)
ParseDescriptor 15.0µs ± 2% 11.9µs ± 2% -20.20% (p=0.000 n=12+12)
```
### Linux Workstation
```
CPU: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
OS: Linux
name old time/op new time/op delta
CreateArena 5.29ns ± 0% 5.52ns ± 0% +4.37% (p=0.000 n=10+12)
ParseDescriptor 18.6µs ± 0% 16.4µs ± 0% -11.54% (p=0.000 n=12+12)
```
# Size
A few source files grow marginally because of some arena functionality moved inline. But `upb/decode.c` shrinks by 30% on Linux:
```
VM SIZE
--------------
+2.1% +283 upb/json_decode.c
+24% +205 upb/msg.c
+8.4% +115 upb/upb.c
+0.9% +28 upb/reflection.c
[ = ] 0 upb/def.c
[ = ] 0 upb/encode.c
[ = ] 0 upb/json_encode.c
[ = ] 0 upb/table.c
-30.3% -1.51Ki upb/decode.c
-0.7% -738 TOTAL
```
Map parsing/serializing relies on map entries always
having a predictable order. The code that generates
layout was not respecting this in the case of string
keys and primitive values.