Previously, extensions and unknown fields were stored on opposite ends of a growing buffer:
```
|------unknown fields-------|---------unallocated space------|--extensions---|
```
Unknown fields were appended and extensions were prepended during parse. When either side ran into the other, the buffer was reallocated to fit, rounding up to the nearest power of 2. This meant that for a proto with 70,000 bytes of unknown fields, the total memory consumed could be up to 128+256+512+1024+2048+4096+8192+16384+32768+65536+131072=262016 bytes allocated in the arena. In the more common case of a large, length-delimited field it'd be just 131072 bytes; but as a 3.74x increase or a 1.87x increase, that's a lot of extra memory.
The new representation still does exponential reallocation, but only for pointers to normal arena allocations. We exploit the fact that arena allocations are aligned to store data about whether the pointer is to an extension or a `upb_StringView` of unknown fields in the low bits of the pointer itself. This costs three pointers of overhead per unknown field and one pointer of overhead per extension, but that's a fixed overhead - we won't over-allocate large buffers for large unknown fields. If this overhead proves to be a problem, more compact representations could be implemented.
In addition, because unknown field bytes are now in their own allocations, they are pointer stable - in the future, this will allow us to exploit aliasing (when enabled during parse) for both unknown fields and lazy extensions (parsed from unknown fields), which can greatly reduce memory use for messages with a lot of unknown, string, or bytes fields.
PiperOrigin-RevId: 708058272
Rename upb_Log2CeilingSize() to upb_RoundUpToPowerOfTwo() to minimize the chance of this confusion happening in the future.
In practice this condition could never be hit by descriptors generated by protoc since FeatureSet is small and 1-byte-on-wire fields.
PiperOrigin-RevId: 702381123
Before, the upb_ExtensionRegistry_AddArray API would just return a boolean
indicating whether the operation succeeded or failed. This is not descriptive
enough in some cases and made the error output confusing.
Specifically, when trying to register an extension with a duplicate extension
number, AddArray first performs a map lookup before inserting the extension
entry into the registry. The code handled lookup failure (due to duplicates)
the same way as insertion failure (due to OOM), and printed an error message
that showed OOM when there is a duplicate array entry.
This was acknolwedged in a TODO in the AddArray code comment -- which is now
fixed. :)
PiperOrigin-RevId: 700764584
Introduce a upb_MessageValue_Zero() function to use for the cases we do want a zero'd union (typically a zero MessageValue union is not needed)
PiperOrigin-RevId: 672592274
The goal of the `names.h` convention is to have a single canonical place where a code generator can define the set of symbols it exports to other code generators, and a canonical place where the name mangling logic is implemented.
Each upb code generator now has its own `names.h` file defining the symbols that it owns & exports:
* `third_party/upb/upb_generator/c/names.h` (for `foo.upb.h` files)
* `third_party/upb/upb_generator/minitable/names.h` (for `foo.upb_minitable.h` files)
* `third_party/upb/upb_generator/reflection/names.h` (for `foo.upbdefs.h` files)
This is a significant improvement over the previous situation where the name mangling functions were co-mingled in `common.h`/`mangle.h`, or sprinkled throughout the generators, with no clear structure for which code generator owns which symbols.
With this structure in place, the visibility lists for the various `names.h` files provide a clear dependency graph for how different generators depend on each other. In general, we want to keep dependencies on the "C" code generator to a minimum, since it is the largest and most complicated of upb's generated APIs, and is also the most prone to symbol name clashes.
Note that upb's `names.h` headers are somewhat unusual, in that we do not want them to depend on C++'s reflection or upb's reflection. Most `names.h` headers in protobuf would use types like `proto2::Descriptor`, but we don't want upb to depend on C++ reflection, especially during its bootstrapping process. We also don't want to force users to build upb defs just to use these name mangling functions. So we use only plain string types like `absl::string_view` and `std::string`.
PiperOrigin-RevId: 672397247
This also increases compliance by adding `default_applicable_licenses` to several `BUILD` files that previously did not have it.
PiperOrigin-RevId: 670784686
This CL is mostly a no-op, except that now google3-only code is actually stripped from OSS, instead of being preserved in `# begin:google_only` blocks.
This follows the conventions of the greater Copybara ecosystem.
PiperOrigin-RevId: 669513564
Fixes the warning:
```
warning: a function declaration without a prototype is deprecated in all versions of C [-Wstrict-prototypes]
extern const upb_MiniTable* google__protobuf__OneofDescriptorProto_msg_init();
^
void
```
PiperOrigin-RevId: 664971019
Our bootstrapping setup compiles multiple versions of the generated code for `descriptor.proto` and `plugin.proto`, one for each stage of the bootstrap. For source files (`.c`), we can always select the correct version of the file in the BUILD rules, but for header files we need to make sure the correct stage's file is always selected via `#include`.
Previously we used `cc_library(includes=[])` to make it appear as though our bootstrapped headers had the same names as the "real" headers. This allowed a lot of the code to be agnostic to whether a bootstrap header was being used, which simplified things because we did not have to change the code performing the `#include`.
Unfortunately, due to build system limitations, this sometimes led to the incorrect header getting included. This should not have been possible, because we had a clean BUILD graph that should have removed all ambiguity about which header should be available. But in non-sandboxed builds, the compiler was able to find headers that were not actually in `deps=[]`, and worse it preferred those headers over the headers that actually were in `deps=[]`. This led to unintended results and errors about layering check violations.
This CL fixes the problem by removing all use of `includes=[]`. We now spell a full pathname to all bootstrap headers, so this class of errors is no longer possible. Unfortunately this adds some complexity, as we have to hard-code these full paths in several places.
A nice improvement in this CL is that `bootstrap_upb_proto_library()` can now only be used for bootstrapping; it only exposes the `descriptor_bootstrap.h` / `plugin_bootstrap.h` files. Anyone wanting to use the normal `net/proto2/proto/descriptor.upb.h` file should depend on `//net/proto2/proto:descriptor_upb_c_proto` target instead.
PiperOrigin-RevId: 664953196
We were failing to propagate the DefPool's platform to the MiniDescriptor builder. This caused upb's code generators to incorrectly generate a field rep of `kUpb_FieldRep_8Byte` for pointer-typed extension fields instead of the 32-bit clean output:
```
UPB_SIZE(kUpb_FieldRep_4Byte, kUpb_FieldRep_8Byte)
```
PiperOrigin-RevId: 653263168
This was previously fixed in C++ (https://github.com/protocolbuffers/protobuf/issues/16549), but not ported to other languages. Delimited field encoding can be inherited by fields where it's invalid, such as non-messages and maps. In these cases, the encoding should be ignored and length-prefixed should be used.
PiperOrigin-RevId: 642792988
This "feature" hasn't been implemented yet, but this puts a placeholder down to prevent compatibility issues in future editions. Once we provide versioning support on individual feature values, we don't want them becoming usable from edition 2023 protos
PiperOrigin-RevId: 635956805
All of this information is still available by merging fixed_features and overridable_features. This new split will make validation easier for runtimes that need to do dynamic builds.
PiperOrigin-RevId: 625815212
The new fields fixed_features and overridable_features can be simply merged to recover the old aggregate defaults. By splitting them though, plugins and runtimes get some extra information about lifetimes for enforcement.
PiperOrigin-RevId: 625527117
The only public target here is the edition defaults helper macro, which can be used by external runtimes and plugins. None of this code is C++-specific though, and should be organized higher up. Appropriate aliases are also placed at the top level for public targets
PiperOrigin-RevId: 625392504