First draft of document comparing upb's design with C++ protos.

3 years ago · 7c25c5728d
parent 096f2bcb2e
commit 7c25c5728d
1 changed files with 249 additions and 0 deletions
--- a/doc/vs-cpp-protos.md
+++ b/doc/vs-cpp-protos.md
@@ -0,0 +1,249 @@
+
+# upb vs. C++ Protobuf Design
+
+[upb](https://github.com/protocolbuffers/upb) is a small C protobuf library.
+While some of the design follows in the footsteps of the C++ Protobuf Library,
+upb departs from C++'s design in several key ways.  This document compares
+and contrasts the two libraries on several design points.
+
+## Design Goals
+
+Before we begin, it is worth calling out that upb and C++ have different design
+goals, and this motivates some of the differences we will see.
+
+C++ protobuf is a user-level library: it is designed to be used directly by C++
+applications.  These applications will expect a full-featured C++ API surface
+that uses C++ idioms.  The C++ library is also willing to add features to
+increase server performance, even if these features would add size or complexity
+to the library.  Because C++ protobuf is a user-level library, API stability is
+of utmost importance: breaking API changes are rare and carefully managed when
+they do occur.  The focus on C++ also means that ABI compatibility with C is not
+a priority.
+
+upb, on the other hand, is designed primarily to be wrapped by other languages.
+It is a C protobuf kernel that forms the basis on which a user-level protobuf
+library can be built.  This means we prefer to keep the API surface as small and
+orthogonal as possible.  While upb supports all protobuf features required for
+full conformance, upb prioritizes simplicity and small code size, and avoids
+adding features like lazy fields that can accelerate some use cases but at great
+cost in terms of complexity.  As upb is not aimed directly at users, there is
+much more freedom to make API-breaking changes when necessary, which helps the
+core to stay small and simple.  We want to be compatible with all FFI
+interfaces, so C ABI compatibility is a must.
+
+Despite these differences, C++ protos and upb offer [roughly the same core set
+of features](https://github.com/protocolbuffers/upb#features).
+
+## Arenas
+
+upb and C++ protos both offer arena allocation, but there are some key
+differences.
+
+### C++
+
+As a matter of history, when C++ protos were open-sourced in 2008, they did not
+support arenas.  Originally there was only unique ownership, whereby each
+message uniquely owns all child messages and will free them when the parent is
+freed.
+
+Arena allocation was added as a feature in 2014 as a way of dramatically
+reducing allocation and (especially) deallocation costs.  But the library was
+not at liberty to remove the unique ownership model, because it would break far
+too many users.  As a result, C++ has supported a **hybrid allocation model**
+ever since, allowing users to allocate messages either directly from the
+stack/heap or from an arena.  The library attempts to ensure that there are
+no dangling pointers by performing automatic copies in some cases (for example
+`a->set_allocated_b(b)`, where `a` and `b` are on different arenas).
+
+C++'s arena object itself, `google::protobuf::Arena`, is **thread-safe** by
+design, which allows users to allocate from multiple threads simultaneously with
+no synchronization.  The user can supply an initial block of memory to the
+arena, and can choose some parameters to control the arena block size.  The user
+can also supply block alloc/dealloc functions, but the alloc function is
+expected to always return some memory.  The C++ library in general does not
+attempt to handle out of memory conditions.
+
+### upb
+
+upb uses **arena allocation exclusively**. All messages must be allocated from
+an arena, and can only be freed by freeing the arena.  It is entirely the user's
+responsibility to ensure that there are no dangling pointers: when a user sets a
+message field, this will always trivially overwrite the pointer and will never
+perform an implicit copy.
+
+upb's `upb::Arena` is **thread-compatible**, which means it cannot be used
+concurrently without synchronization.  The arena can be seeded with an initial
+block of memory, but it does not explicitly support any parameters for choosing
+block size.  It supports a custom alloc/dealloc function, and this function is
+allowed to return `NULL` if no dynamic memory is available.  This allows upb
+arenas to have a max/fixed size, and makes it possible in theory to write code
+that is tolerant to out-of-memory errors.
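
As a rough illustration of why a fallible allocator matters, here is a minimal sketch of a fixed-size bump arena that reports exhaustion by returning `nullptr`. `FixedArena` and `TryMakeTwoMessages` are hypothetical names invented for this example; this is a toy model written in C++, not upb's arena implementation or API.

```cpp
#include <cstddef>

// Toy bump allocator over a caller-supplied buffer (hypothetical type).
class FixedArena {
 public:
  FixedArena(void* buf, std::size_t size)
      : base_(static_cast<unsigned char*>(buf)), size_(size) {}

  // Returns nullptr instead of aborting when the buffer is exhausted.
  void* Alloc(std::size_t n) {
    const std::size_t kAlign = alignof(std::max_align_t);
    std::size_t rounded = (n + kAlign - 1) / kAlign * kAlign;
    // `rounded < n` catches arithmetic overflow for huge requests.
    if (rounded < n || rounded > size_ - used_) return nullptr;
    void* p = base_ + used_;
    used_ += rounded;
    return p;
  }

  std::size_t used() const { return used_; }

 private:
  unsigned char* base_;
  std::size_t size_;
  std::size_t used_ = 0;
};

// Caller-side pattern: every allocation result is checked.
bool TryMakeTwoMessages() {
  alignas(std::max_align_t) unsigned char buf[64];
  FixedArena arena(buf, sizeof(buf));
  void* a = arena.Alloc(48);
  void* b = arena.Alloc(48);  // Does not fit in the remaining space.
  return a != nullptr && b != nullptr;
}
```

Because `Alloc()` can fail, every caller has to check its result. That is the cost of being able to run under a hard memory cap.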
+
+upb's arena also supports a novel operation known as **fuse**, which joins two
+arenas together into a single lifetime.  Though both arenas must still be freed
+separately, none of the memory will actually be freed until *both* arenas have
+been freed.  This is useful for avoiding dangling pointers when reparenting a
+message onto a parent that may be on a different arena.
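
The lifetime rule described above can be modeled in a few lines. The following is a toy sketch, assuming a shared "group" object that owns all blocks and is kept alive by every arena handle fused into it. `ToyArena`, `Group`, `Block`, and `g_live_blocks` are hypothetical names for this illustration; upb's actual implementation is different and is not shown here.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

int g_live_blocks = 0;  // Instrumentation so the demo can observe frees.

struct Block {
  explicit Block(std::size_t n) : data(new char[n]) { ++g_live_blocks; }
  ~Block() { --g_live_blocks; }
  std::unique_ptr<char[]> data;
};

// All memory belongs to a group; fused arenas share one root group.
struct Group {
  std::vector<std::unique_ptr<Block>> blocks;
  std::shared_ptr<Group> parent;  // Non-null once merged into another group.
};

std::shared_ptr<Group> Root(std::shared_ptr<Group> g) {
  while (g->parent) g = g->parent;
  return g;
}

class ToyArena {
 public:
  ToyArena() : group_(std::make_shared<Group>()) {}

  void* Alloc(std::size_t n) {
    std::shared_ptr<Group> root = Root(group_);
    root->blocks.push_back(std::make_unique<Block>(n));
    return root->blocks.back()->data.get();
  }

  // Joins the two lifetimes: blocks move to one root, and the other group
  // keeps that root alive for as long as its own arena handle exists.
  void Fuse(ToyArena& other) {
    std::shared_ptr<Group> a = Root(group_);
    std::shared_ptr<Group> b = Root(other.group_);
    if (a == b) return;  // Already fused.
    for (auto& blk : b->blocks) a->blocks.push_back(std::move(blk));
    b->blocks.clear();
    b->parent = a;
  }

 private:
  std::shared_ptr<Group> group_;
};

// Returns the number of live blocks after freeing only one of two fused arenas.
int LiveBlocksAfterFreeingOne() {
  auto a = std::make_unique<ToyArena>();
  auto b = std::make_unique<ToyArena>();
  a->Alloc(16);
  b->Alloc(16);
  a->Fuse(*b);
  a.reset();             // "Free" arena a; arena b is still alive.
  return g_live_blocks;  // Both blocks are still allocated here.
}
```

In this model, destroying a `ToyArena` plays the role of freeing an arena. Memory is released only when the last handle in the fused group goes away.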
+
+### Comparison
+
+**hybrid allocation vs. arena-only**
+* The C++ hybrid allocation model introduces a great deal of complexity and
+  unpredictability into the library.  upb benefits from having a much simpler
+  and more predictable design.
+* Some of the complexity in C++'s hybrid model arises from the fact that arenas
+  were added after the fact.  Designing for a hybrid model from the outset
+  would likely yield a simpler result.
+* Unique ownership does support some usage patterns that arenas cannot directly
+  accommodate.  For example, you can reparent a message and the child will precisely
+  follow the lifetime of its new parent.  An arena would require you to either
+  perform a deep copy or extend the lifetime.
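
To make the last point concrete, here is a small sketch of reparenting under unique ownership, using `std::unique_ptr` as a stand-in for a message that owns its child. `Parent`, `Child`, and `ChildFollowsNewParent` are hypothetical names for this illustration and are not protobuf types.

```cpp
#include <memory>
#include <utility>

int g_live_children = 0;  // Instrumentation so the demo can observe frees.

struct Child {
  Child() { ++g_live_children; }
  ~Child() { --g_live_children; }
};

struct Parent {
  std::unique_ptr<Child> child;  // The parent uniquely owns its child.
};

bool ChildFollowsNewParent() {
  auto old_parent = std::make_unique<Parent>();
  auto new_parent = std::make_unique<Parent>();
  old_parent->child = std::make_unique<Child>();

  // Reparent: ownership moves to the new parent and no copy is made.
  new_parent->child = std::move(old_parent->child);

  old_parent.reset();  // The child survives its old parent.
  bool survived_old_parent = (g_live_children == 1);

  new_parent.reset();  // The child is freed together with its new parent.
  bool died_with_new_parent = (g_live_children == 0);

  return survived_old_parent && died_with_new_parent;
}
```

With arenas there is no per-message ownership to transfer, which is why the equivalent operation needs either a deep copy or a longer arena lifetime.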
+
+**thread-compatible vs. thread-safe arena**
+* A thread-safe arena (as in C++) is safer and easier to use.  A thread-compatible
+  arena requires that the user prove that the arena cannot be used concurrently.
+* [Thread Sanitizer](https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual)
+  is far more accessible than it was in 2014 (when C++ introduced a thread-safe
+  arena).  We now have more tools at our disposal to ensure that we do not trigger
+  data races in a thread-compatible arena like upb.
+* A thread-compatible arena has a far simpler implementation.  The C++ thread-safe
+  arena relies on thread-local variables, which introduce complications on some
+  platforms.  It also requires far more subtle reasoning for correctness and
+  performance, and likely cannot match the performance of a thread-compatible arena.
+
+**fuse vs. no fuse**
+* The `upb_Arena_Fuse()` operation is a key part of how we can support reparenting
+  of messages when the parent may be on a different arena.  Without this, we have
+  no way of supporting `foo.bar = bar` in dynamic languages without performing a
+  deep copy.
+* A downside of `upb_Arena_Fuse()` is that passing an arena to a function can allow
+  that function to extend the lifetime of the arena in potentially
+  unpredictable ways.  This can be prevented if necessary, as fuse can fail, e.g. if
+  one arena has an initial block.  But this adds some complexity by requiring callers
+  to handle the case where fuse fails.
+
+## Code Generation vs. Tables
+
+The C++ protobuf library has always been built around code generation, while upb
+generates only tables.  In other words, `foo.pb.cc` files contain functions,
+whereas `foo.upb.c` files contain only data structures.
+
+### C++
+
+C++ generated code emits a large number of functions into `foo.pb.cc` files.
+An incomplete list:
+
+* `FooMsg::FooMsg()` (constructor): initializes all fields to their default value.
+* `FooMsg::~FooMsg()` (destructor): frees any present child messages.
+* `FooMsg::Clear()`: clears all fields back to their default/empty value.
+* `FooMsg::_InternalParse()`: generated code for parsing a message.
+* `FooMsg::_InternalSerialize()`: generated code for serializing a message.
+* `FooMsg::ByteSizeLong()`: calculates serialized size, as a first pass before serializing.
+* `FooMsg::MergeFrom()`: copies/appends present fields from another message.
+* `FooMsg::IsInitialized()`: checks whether required fields are set.
+
+This code lives in the `.text` section and contains function calls to the generated
+classes for child messages.
+
+### upb
+
+upb does not generate any code into `foo.upb.c` files, only data structures.  upb uses a
+compact data table known as a *mini table* to represent the schema and all fields.
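
To give a feel for what "data instead of code" means, here is a deliberately simplified sketch of a table that describes a message layout, plus one generic function that works for any message type by reading that table. The names `ToyField`, `ToyMiniTable`, `PointLayout`, and `SetInt` are hypothetical; upb's real mini table format is more compact and differs in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class FieldType : std::uint8_t { kInt32, kInt64 };

struct ToyField {
  std::uint32_t number;  // Field number from the .proto file.
  std::uint16_t offset;  // Byte offset of the value inside the message.
  FieldType type;
};

struct ToyMiniTable {
  const ToyField* fields;
  std::size_t field_count;
  std::size_t message_size;
};

// What a generated file might contain for a message like:
//   message Point { int32 x = 1; int64 id = 2; }
// Only data is emitted here; there are no per-message functions.
struct PointLayout {
  std::int32_t x;
  std::int64_t id;
};
static const ToyField kPointFields[] = {
    {1, offsetof(PointLayout, x), FieldType::kInt32},
    {2, offsetof(PointLayout, id), FieldType::kInt64},
};
static const ToyMiniTable kPointTable = {kPointFields, 2, sizeof(PointLayout)};

// Generic library code: one function serves every message type.
bool SetInt(void* msg, const ToyMiniTable& table, std::uint32_t number,
            std::int64_t value) {
  for (std::size_t i = 0; i < table.field_count; ++i) {
    const ToyField& f = table.fields[i];
    if (f.number != number) continue;
    char* p = static_cast<char*>(msg) + f.offset;
    if (f.type == FieldType::kInt32) {
      std::int32_t v32 = static_cast<std::int32_t>(value);
      std::memcpy(p, &v32, sizeof v32);
    } else {
      std::memcpy(p, &value, sizeof value);
    }
    return true;
  }
  return false;  // Unknown field number.
}
```

Adding another message type in this model adds another small table to the binary, while the amount of executable code stays the same.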
+
+upb uses mini tables to perform all of the operations that would traditionally be done
+with generated code.  Revisiting the list from the previous section:
+
+* `FooMsg::FooMsg()` (constructor): upb instead initializes all messages with `memset(msg, 0, size)`.
+   Non-zero defaults are injected in the accessors.
+* `FooMsg::~FooMsg()` (destructor): upb messages are freed by freeing the arena.
+* `FooMsg::Clear()`: can be performed with `memset(msg, 0, size)`.
+* `FooMsg::_InternalParse()`: upb's parser uses mini tables as data, instead of generating code.
+* `FooMsg::_InternalSerialize()`: upb's serializer also uses mini tables instead of generated code.
+* `FooMsg::ByteSizeLong()`: upb performs serialization in reverse so that an initial pass is not required.
+* `FooMsg::MergeFrom()`: upb supports this via serialize+parse from the other message.
+* `FooMsg::IsInitialized()`: upb's encoder and decoder have special flags to check for required fields.
+  A util library `upb/util/required_fields.h` handles the corner cases.
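
The `ByteSizeLong()` point is worth a small illustration. In the protobuf wire format a submessage is prefixed by its length, so a forward encoder needs to know the size before writing the body. Writing from the end of the buffer toward the front avoids that: by the time the length prefix is written, the body already exists. The sketch below is a toy encoder under that assumption; `ReverseWriter` and `EncodeOuter` are hypothetical names, and this is not upb's encoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy encoder that fills a fixed buffer from the end toward the front.
class ReverseWriter {
 public:
  ReverseWriter() : pos_(sizeof(buf_)) {}

  void PutByte(std::uint8_t b) {
    assert(pos_ > 0);  // This sketch does not grow the buffer.
    buf_[--pos_] = b;
  }

  // Writes a base-128 varint so that it reads forward in the final output.
  void PutVarint(std::uint64_t v) {
    std::uint8_t tmp[10];
    int n = 0;
    do {
      tmp[n] = v & 0x7f;
      v >>= 7;
      if (v) tmp[n] |= 0x80;  // Continuation bit: more bytes follow.
      ++n;
    } while (v);
    while (n > 0) PutByte(tmp[--n]);
  }

  std::size_t size() const { return sizeof(buf_) - pos_; }

  std::string str() const {
    return std::string(reinterpret_cast<const char*>(buf_ + pos_), size());
  }

 private:
  std::uint8_t buf_[64];
  std::size_t pos_;
};

// Encodes: message Inner { uint64 value = 1; }
//          message Outer { Inner inner = 1; }
std::string EncodeOuter(std::uint64_t inner_value) {
  ReverseWriter w;
  // The inner body is written first, so it ends up last in the output.
  w.PutVarint(inner_value);
  w.PutVarint((1 << 3) | 0);  // Tag: field 1, wire type 0 (varint).
  // The inner length is now known without a separate sizing pass.
  std::size_t inner_len = w.size();
  w.PutVarint(inner_len);
  w.PutVarint((1 << 3) | 2);  // Tag: field 1, wire type 2 (length-delimited).
  return w.str();
}
```

For example, `EncodeOuter(150)` yields the five bytes `0a 03 08 96 01`, which is the standard wire encoding for that message.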
+
+### Comparison
+
+If we compare compiled code size, upb is far smaller.  Here is a comparison of the code
+size of a trivial binary that does nothing but a parse and serialize of `descriptor.proto`.
+This means we are seeing both the overhead of the core library itself as well as the
+generated code (or table) for `descriptor.proto`.  (For extra clarity, we should break
+this down by generated code vs. core library in the future.)
+
+
+| Library         | `.text` | `.data` | `.bss` |
+|-----------------|---------|---------|--------|
+| upb             |  26Ki   | 0.6Ki   | 0.01Ki |
+| C++ (lite)      | 187Ki   | 2.8Ki   | 1.25Ki |
+| C++ (code size) | 904Ki   | 6.1Ki   | 1.88Ki |
+| C++ (full)      | 983Ki   | 6.1Ki   | 1.88Ki |
+
+## Bifurcated vs. Optional Reflection
+
+upb and C++ protos both offer reflection without making it mandatory.  However,
+the models for enabling/disabling reflection are very different.
+
+### C++
+
+C++ messages offer full reflection by default.  Messages in C++ generally
+derive from `Message`, and the base class provides a member function
+`const Reflection* Message::GetReflection() const`, which returns the
+reflection object.
+
+It follows that any message deriving from `Message` will always have reflection
+linked into the binary, whether or not the reflection object is ever used.
+Because `GetReflection()` is a function on the base class, it is not possible
+to statically determine if a given message's reflection is used:
+
+```c++
+#include <google/protobuf/message.h>
+using namespace google::protobuf;
+
+const Reflection* GetReflection(const Message& message) {
+    // `message` can refer to any message type in the whole binary.
+    return message.GetReflection();
+}
+```
+
+The C++ library does provide a way of omitting reflection: `MessageLite`.  We can
+cause a message to be lite in two different ways:
+
+* `optimize_for = LITE_RUNTIME` in a `.proto` file will cause all messages in that
+  file to be lite.
+* `lite` as a codegen param: this will force all messages to lite, even if the
+  `.proto` file does not have `optimize_for = LITE_RUNTIME`.
+
+A lite message will derive from `MessageLite` instead of `Message`.  Since
+`MessageLite` has no `GetReflection()` function, this means no reflection is
+available, so we can avoid taking the code size hit.
+
+### upb
+
+upb does not have the `Message` vs. `MessageLite` bifurcation.  There is only one
+kind of message type, `upb_Message`, which means there is no need to configure in
+a `.proto` file which messages will need reflection and which will not.
+Every message has the *option* to link in reflection from a separate `foo.upbdefs.o`
+file, without needing to change the message itself in any way.
+
+upb does not provide the equivalent of `Message::GetReflection()`: there is no
+facility for retrieving the reflection of a message whose type is not known statically.
+It would be possible to layer such a facility on top of the upb core, though this
+would probably require some kind of code generation.
+
+### Comparison
+
+* Most messages in C++ will not bother to declare themselves as "lite".  This means
+  that many C++ messages will link in reflection even when it is never used, bloating
+  binaries unnecessarily.
+* `optimize_for = LITE_RUNTIME` is difficult to use in practice, because it prevents 
+  any non-lite protos from `import`ing that file.
+* Forcing all protos to lite via a codegen parameter (for example, when building for
+  mobile) is more practical than `optimize_for = LITE_RUNTIME`.  But this will break
+  the compile for any code that tries to upcast to `Message`, or tries to use a
+  non-lite method.
+* The one major advantage of the C++ model is that it can support `msg.DebugString()`
+  on a type-erased proto.  For upb you have to explicitly pass the `upb_MessageDef*`
+  separately if you want to perform an operation like printing a proto to text format.
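
The last point can be sketched with a toy model in which the message is plain bytes and all naming information lives in a separate definition object that the caller must pass in. `ToyFieldDef`, `ToyMessageDef`, and `ToText` are hypothetical names for this illustration (restricted to `int32` fields for brevity); they are not upb's reflection API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for reflection data: field names and offsets.
struct ToyFieldDef {
  const char* name;
  std::size_t offset;  // Byte offset of an int32 value inside the message.
};

struct ToyMessageDef {
  const char* full_name;
  const ToyFieldDef* fields;
  std::size_t field_count;
};

// The message carries no pointer to its definition, so the caller supplies
// it explicitly. Code that never prints messages never references any def.
std::string ToText(const void* msg, const ToyMessageDef& def) {
  std::string out;
  for (std::size_t i = 0; i < def.field_count; ++i) {
    std::int32_t value;
    std::memcpy(&value, static_cast<const char*>(msg) + def.fields[i].offset,
                sizeof value);
    out += def.fields[i].name;
    out += ": ";
    out += std::to_string(value);
    out += "\n";
  }
  return out;
}
```

The trade-off is visible in the signature: the caller carries one extra argument, and in exchange the definition data can be left out of binaries that never need it.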
+
+## Explicit Registration vs. Globals
+
+TODO