Revamped the design doc and added a section about Arenas.

The old design doc had fallen out of date. Now that upb's core design has stabilized, it's time for a new design doc that walks through all of upb's major abstractions. We start with arenas; future CLs will cover other aspects of upb's design.

PiperOrigin-RevId: 549048285
2 years ago · 6d2b9e6d18
parent f67198ffef
commit 6d2b9e6d18
2 changed files with 167 additions and 201 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -1,201 +0,0 @@
 # upb Design
 upb aims to be a minimal C protobuf kernel.  It has a C API, but its primary
 goal is to be the core runtime for a higher-level API.
 ## Design goals
 - Full protobuf conformance
 - Small code size
 - Fast performance (without compromising code size)
 - Easy to wrap in language runtimes
 - Easy to adapt to different memory management schemes (refcounting, GC, etc)
 ## Design parameters
 - C99
 - 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
 - Uses pointer tagging, but avoids other implementation-defined behavior
 - Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
 - No global state, fully re-entrant
 ## Overall Structure
 The upb library is divided into two main parts:
 - A core message representation, which supports binary format parsing
  and serialization.
  - `upb/upb.h`: arena allocator (`upb_arena`)
  - `upb/msg_internal.h`: core message representation and parse tables
  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
  - `upb/decode.h`: binary format parsing
  - `upb/encode.h`: binary format serialization
  - `upb/table_internal.h`: hash table (used for maps)
  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
    accessing messages without reflection.
 - A reflection add-on library that supports JSON and text format.
  - `upb/def.h`: schema representation and loading from descriptors
  - `upb/reflection.h`: reflective access to message data.
  - `upb/json_encode.h`: JSON encoding
  - `upb/json_decode.h`: JSON decoding
  - `upb/text_encode.h`: text format encoding
  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
    APIs for loading reflection.
 ## Core Message Representation
 The representation for each message consists of:
 - One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
  pointer is `NULL` when no unknown fields or extensions are present.
 - Hasbits for any optional/required fields.
 - Case integers for each oneof.
 - Data for each field.
 For example, a layout for a message with two `optional int32` fields would end
 up looking something like this:
 ```c
 // For illustration only, upb does not actually generate structs.
 typedef struct {
  upb_msg_internaldata* internal;  // Unknown fields and extensions.
  uint32_t hasbits;                // We are only using two hasbits.
  int32_t field1;
  int32_t field2;
 } package_name_MessageName;
 ```
 Note in particular that messages do *not* have:
 - A pointer to reflection or a parse table (upb messages are not self-describing).
 - A pointer to an arena (the arena must be explicitly passed into any function that
  allocates).
 The upb compiler computes a layout for each message, and determines the offset for
 each field using normal alignment rules (each data member must be aligned to a
 multiple of its size).  This layout is then embedded into the generated `.upb.h`
 and `.upb.c` files in two different forms.  The first is a set of inline
 accessors that expect the data at a given offset:
 ```c
 // Example of a generated accessor, from foo.upb.h
 UPB_INLINE int32_t package_name_MessageName_field1(
    const package_name_MessageName *msg) {
  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
 }
 ```
 Secondly, the layout is emitted as a table which is used by the parser and serializer.
 We call these tables "mini-tables" to distinguish them from the larger and more
 optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
 2-3x the speed of the main parser, though the main parser is already quite fast).
 ```c
 // Definition of mini-table structure, from upb/msg_internal.h
 typedef struct {
  uint32_t number;
  uint16_t offset;
  int16_t presence;       /* If >0, hasbit_index.  If <0, ~oneof_index. */
  uint16_t submsg_index;  /* undefined if descriptortype != MESSAGE or GROUP. */
  uint8_t descriptortype;
  int8_t mode;            /* upb_fieldmode, with flags from upb_labelflags */
 } upb_msglayout_field;
 typedef enum {
  _UPB_MODE_MAP = 0,
  _UPB_MODE_ARRAY = 1,
  _UPB_MODE_SCALAR = 2,
 } upb_fieldmode;
 typedef struct {
  const struct upb_msglayout *const* submsgs;
  const upb_msglayout_field *fields;
  uint16_t size;
  uint16_t field_count;
  bool extendable;
  uint8_t dense_below;
  uint8_t table_mask;
 } upb_msglayout;
 // Example of a generated mini-table, from foo.upb.c
 static const upb_msglayout_field upb_test_MessageName__fields[2] = {
  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
 };
 const upb_msglayout upb_test_MessageName_msg_init = {
  NULL,
  &upb_test_MessageName__fields[0],
  UPB_SIZE(16, 16), 2, false, 2, 255,
 };
 ```
 The upb compiler computes separate layouts for 32 and 64 bit modes, since the
 pointer size will be 4 or 8 bytes respectively.  The upb compiler embeds both
 sizes into the source code, using a `UPB_SIZE(size32, size64)` macro that can
 choose the appropriate size at build time based on the size of `UINTPTR_MAX`.
 Note that `.upb.c` files contain data tables only.  There is no "generated code"
 except for the inline accessors in the `.upb.h` files: the entire footprint
 of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.
 ## Memory Management Model
 All memory management in upb is built around arenas.  A message is never
 considered to "own" the strings or sub-messages contained within it.  Instead a
 message and all of its sub-messages/strings/etc. are all owned by an arena and
 are freed when the arena is freed.  An entire message tree will probably be
 owned by a single arena, but this is not required or enforced.  As far as upb is
 concerned, it is up to the client how to partition its arenas.  upb only requires
 that when you ask it to serialize a message, that all reachable messages are
 still alive.
 The arena supports both a user-supplied initial block and a custom allocation
 callback, so there is a lot of flexibility in memory allocation strategy.  The
 allocation callback can even be `NULL` for heap-free operation.  The main
 constraint of the arena is that all of the memory in each arena must be freed
 together.
 `upb_arena` supports a novel operation called "fuse".  When two arenas are fused
 together, their lifetimes are irreversibly joined, such that none of the arena
 blocks in either arena will be freed until *both* arenas are freed with
 `upb_arena_free()`.  This is useful when joining two messages from separate
 arenas (making one a sub-message of the other).  Fuse is a very cheap
 operation, and an unlimited number of arenas can be fused together efficiently.
 ## Reflection and Descriptors
 upb offers a fully-featured reflection library.  There are two main ways of
 using reflection:
 1. You can load descriptors from strings using `upb_symtab_addfile()`.
  The upb runtime will dynamically create mini-tables like what the upb compiler
  would have created if you had compiled this type into a `.upb.c` file.
 2. You can load descriptors using generated `.upbdefs.h` interfaces.
  This will load reflection that references the corresponding `.upb.c`
  mini-tables instead of building a new mini-table on the fly.  This lets
  you reflect on generated types that are linked into your program.
 upb's design for descriptors is similar to protobuf C++ in many ways, with
 the following correspondences:
 | C++ Type | upb type |
 | ---------| ---------|
 | `google::protobuf::DescriptorPool` | `upb_symtab`
 | `google::protobuf::Descriptor` | `upb_msgdef`
 | `google::protobuf::FieldDescriptor` | `upb_fielddef`
 | `google::protobuf::OneofDescriptor` | `upb_oneofdef`
 | `google::protobuf::EnumDescriptor` | `upb_enumdef`
 | `google::protobuf::FileDescriptor` | `upb_filedef`
 | `google::protobuf::ServiceDescriptor` | `upb_servicedef`
 | `google::protobuf::MethodDescriptor` | `upb_methoddef`
 As in C++, descriptors (defs) are created by loading a
 `google_protobuf_FileDescriptorProto` into a `upb_symtab`.  This creates and
 links all of the def objects corresponding to that `.proto` file, and inserts
 the names into a symbol table so they can be looked up by name.
 Once you have loaded some descriptors into a `upb_symtab`, you can create and
 manipulate messages using the interfaces defined in `upb/reflection.h`.  If your
 descriptors are linked to your generated layouts using option (2) above, you can
 safely access the same messages using both reflection and generated interfaces.
--- a/docs/design.md
+++ b/docs/design.md
@@ -0,0 +1,167 @@
 # upb Design
 [TOC]
 upb is a protobuf kernel written in C.  It is a fast and conformant implementation
 of protobuf, with a low-level C API that is designed to be wrapped in other
 languages.
 upb is not designed to be used by applications directly.  The C API is very
 low-level, unsafe, and changes frequently.  It is important that upb is able to
 make breaking API changes as necessary, to avoid taking on technical debt that
 would compromise upb's goals of small code size and fast performance.
 ## Design goals
 Goals:
 - Full protobuf conformance
 - Small code size
 - Fast performance (without compromising code size)
 - Easy to wrap in language runtimes
 - Easy to adapt to different memory management schemes (refcounting, GC, etc)
 Non-Goals:
 - Stable API
 - Safe API
 - Ergonomic API for applications
 Parameters:
 - C99
 - 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
 - Uses pointer tagging, but avoids other implementation-defined behavior
 - Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
 - No global state, fully re-entrant
 ## Arenas
 All memory management in upb goes through arenas, represented by the type
 `upb_Arena`.  Arenas are an alternative to `malloc()` and `free()` that
 significantly reduces the costs of memory allocation.
 Arenas obtain blocks of memory using some underlying allocator (likely
 `malloc()` and `free()`), and satisfy allocations using a simple bump allocator
 that walks through each block in linear order.  Allocations cannot be freed
 individually: it is only possible to free the arena as a whole, which frees all
 of the underlying blocks.
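 The bump-allocation scheme described above can be sketched in a few dozen
 lines.  This is a minimal illustration, not upb's implementation: every
 `Sketch*` name is hypothetical, `malloc()`/`free()` stand in for the underlying
 allocator, and the 8-byte alignment and 256-byte minimum block size are
 arbitrary assumptions.
 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <stdlib.h>
 #define SKETCH_ALIGN 8       /* Assumed alignment for every allocation. */
 #define SKETCH_MIN_BLOCK 256 /* Assumed minimum block payload, in bytes. */
 // Header of one block obtained from the underlying allocator.
 typedef struct SketchBlock {
   struct SketchBlock* next;  // Previously obtained block, or NULL.
   size_t size;               // Payload bytes that follow this header.
 } SketchBlock;
 typedef struct {
   SketchBlock* blocks;  // All blocks, most recent first.
   char* ptr;            // Next free byte in the current block.
   size_t avail;         // Bytes remaining in the current block.
 } SketchArena;
 static void SketchArena_Init(SketchArena* a) {
   a->blocks = NULL;
   a->ptr = NULL;
   a->avail = 0;
 }
 // Obtains a new block whose payload holds at least `min` bytes.
 static int SketchArena_AddBlock(SketchArena* a, size_t min) {
   // The payload starts right after the header, so the header size must
   // preserve the allocation alignment.
   assert(sizeof(SketchBlock) % SKETCH_ALIGN == 0);
   size_t size = min > SKETCH_MIN_BLOCK ? min : SKETCH_MIN_BLOCK;
   SketchBlock* b = malloc(sizeof(SketchBlock) + size);
   if (!b) return 0;
   b->next = a->blocks;
   b->size = size;
   a->blocks = b;
   a->ptr = (char*)(b + 1);
   a->avail = size;
   return 1;
 }
 // Bump allocation: round the size up, then advance the pointer.
 // (Overflow checks on `size` are omitted to keep the sketch short.)
 static void* SketchArena_Malloc(SketchArena* a, size_t size) {
   size = (size + SKETCH_ALIGN - 1) / SKETCH_ALIGN * SKETCH_ALIGN;
   if (a->avail < size && !SketchArena_AddBlock(a, size)) return NULL;
   void* ret = a->ptr;
   a->ptr += size;
   a->avail -= size;
   return ret;
 }
 // Frees the arena as a whole and returns how many blocks were released.
 static int SketchArena_Free(SketchArena* a) {
   int released = 0;
   SketchBlock* b = a->blocks;
   while (b) {
     SketchBlock* next = b->next;
     free(b);
     b = next;
     released++;
   }
   SketchArena_Init(a);
   return released;
 }
 ```
 Note that there is no per-object free function at all: the only way to return
 memory is `SketchArena_Free()`, which walks the block list once.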
 Here is an example of using the `upb_Arena` type:
 ```c
  upb_Arena* arena = upb_Arena_New();
  // Perform some allocations.
  int* x = upb_Arena_Malloc(arena, sizeof(*x));
  int* y = upb_Arena_Malloc(arena, sizeof(*y));
  // We cannot free `x` and `y` separately; we can only free the arena
  // as a whole.
  upb_Arena_Free(arena);
 ```
 upb's reliance on arenas is reflected in the API for all upb data structures:
 every upb function that allocates takes a `upb_Arena*` parameter and performs
 its allocations using that arena rather than calling `malloc()` or `free()`.
 ```c
 // upb API to create a message.
 UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
                                     upb_Arena* arena);
 void MakeMessage(const upb_MiniTable* mini_table) {
  upb_Arena* arena = upb_Arena_New();
  // This message is allocated on our arena.
  upb_Message* msg = upb_Message_New(mini_table, arena);
  // We can free the arena whenever we want, but we cannot free the
  // message separately from the arena.
  upb_Arena_Free(arena);
  // msg is now a dangling pointer and must not be used.
 }
 ```
 Arenas are a key part of upb's performance story.  Parsing a large protobuf
 payload usually involves rapidly creating a series of messages, arrays (repeated
 fields), and maps.  It is crucial for parsing performance that these allocations
 are as fast as possible.  Equally important, freeing the tree of messages should
 be as fast as possible.  Because freeing an arena releases only its underlying
 blocks rather than each object, arenas can reduce this cost from `O(n)` to
 `O(lg n)`.
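 One way a logarithmic bound like this can arise is if each new block is twice
 the size of the previous one.  The text above does not state upb's block growth
 policy, so the doubling below is an assumption, and `SketchBlocksNeeded` is a
 hypothetical helper that only counts blocks.
 ```c
 #include <assert.h>
 #include <stddef.h>
 // Counts the blocks a hypothetical arena would obtain to serve `total`
 // bytes when the first block holds `first` bytes and every later block is
 // twice as large as the one before it.  (Not overflow-safe for huge totals.)
 static int SketchBlocksNeeded(size_t total, size_t first) {
   assert(first > 0);
   int blocks = 0;
   size_t capacity = 0;  // Bytes available across all blocks so far.
   size_t next = first;  // Size of the next block to obtain.
   while (capacity < total) {
     capacity += next;
     next *= 2;
     blocks++;
   }
   return blocks;
 }
 ```
 Under this assumed policy, serving 1,000,000 bytes from a 256-byte first block
 takes 12 blocks, so freeing the arena costs 12 calls to the underlying
 allocator no matter how many objects were allocated from it.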
 ### Avoiding Dangling Pointers
 Objects allocated on an arena will frequently contain pointers to other
 arena-allocated objects.  For example, a `upb_Message` will have pointers to
 sub-messages that are also arena-allocated.
 Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
 automatic safety from dangling pointers.  Instead, upb provides tools to help
 bridge between higher-level memory management schemes (GC, refcounting, RAII,
 borrow checkers) and arenas.
 If there is only one arena, dangling pointers within the arena are impossible,
 because all objects are freed at the same time.  This is the simplest case.  The
 user must still be careful not to keep dangling pointers that point at arena
 memory after it has been freed, but dangling pointers *between* the arena
 objects will be impossible.
 But what if there are multiple arenas?  If we have a pointer from one arena to
 another, how do we ensure that this will not become a dangling pointer?
 To help with the multiple arena case, upb provides a primitive called "fuse".
 ```c
 // Fuses the lifetimes of `a` and `b`.  None of the blocks from `a` or `b`
 // will be freed until both arenas are freed.
 UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
 ```
 When two arenas are fused together, their lifetimes are irreversibly joined,
 such that none of the arena blocks in either arena will be freed until *both*
 arenas are freed with `upb_Arena_Free()`.  This means that dangling pointers
 between the two arenas will no longer be possible.
 Fuse is useful when joining two messages from separate arenas (making one a
 sub-message of the other).  Fuse is a relatively cheap operation, on the order
 of 150ns, and is very nearly `O(1)` in the number of arenas being fused (the
 true complexity is the inverse Ackermann function, which grows extremely
 slowly).
 Each arena does consume some memory, so repeatedly creating and fusing an
 additional arena is not free, but the CPU cost of fusing two arenas together is
 modest.
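 The inverse Ackermann bound mentioned above is characteristic of a disjoint-set
 forest (union-find) with path compression and union by rank.  Whether upb uses
 exactly this scheme is not stated here, so the following is only a sketch of
 how fused lifetimes could be tracked; `FuseSketch` and its functions are
 hypothetical names, and block management is reduced to a single flag.
 ```c
 #include <assert.h>
 #include <stdbool.h>
 #include <stddef.h>
 typedef struct FuseSketch {
   struct FuseSketch* parent;  // Points to itself when this arena is a root.
   int rank;                   // Union-by-rank bookkeeping (root only).
   int refs;                   // Arenas in the group not yet freed (root only).
   bool released;              // Set on the root once the group is released.
 } FuseSketch;
 static void FuseSketch_Init(FuseSketch* a) {
   a->parent = a;
   a->rank = 0;
   a->refs = 1;
   a->released = false;
 }
 // Finds the root of the group, halving the path along the way.
 static FuseSketch* FuseSketch_Root(FuseSketch* a) {
   while (a->parent != a) {
     a->parent = a->parent->parent;
     a = a->parent;
   }
   return a;
 }
 // Joins the lifetimes of `a` and `b`.  This cannot be undone.
 static void FuseSketch_Fuse(FuseSketch* a, FuseSketch* b) {
   FuseSketch* ra = FuseSketch_Root(a);
   FuseSketch* rb = FuseSketch_Root(b);
   if (ra == rb) return;  // Already fused.
   if (ra->rank < rb->rank) {
     FuseSketch* tmp = ra;
     ra = rb;
     rb = tmp;
   }
   rb->parent = ra;       // Attach the shallower tree under the deeper one.
   ra->refs += rb->refs;  // The merged group waits for all of its members.
   if (ra->rank == rb->rank) ra->rank++;
 }
 // Drops one arena from its group.  Returns true only when this was the
 // last live arena, which is when a real arena would free every block.
 // The FuseSketch structs themselves must outlive the whole group.
 static bool FuseSketch_Free(FuseSketch* a) {
   FuseSketch* root = FuseSketch_Root(a);
   assert(root->refs > 0);
   if (--root->refs > 0) return false;
   root->released = true;
   return true;
 }
 ```
 With three arenas fused into one group, the first two calls to
 `FuseSketch_Free()` return `false` and only the third releases the group,
 which is the property that rules out dangling pointers between fused arenas.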
 ### Initial Block and Custom Allocators
 `upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
 underlying blocks.  But this default strategy can be customized to support
 the needs of a particular language.
 The lowest-level function for creating a `upb_Arena` is:
 ```c
 // Creates an arena from the given initial block (if any -- n may be 0).
 // Additional blocks will be allocated from |alloc|.  If |alloc| is NULL,
 // this is a fixed-size arena and cannot grow.
 UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
 ```
 The `n` bytes starting at `mem` will be used as an "initial block", which is
 used to satisfy allocations before calling any underlying allocation function.
 Note that the `upb_Arena` itself will be allocated from the initial block if
 possible, so the amount of memory available for allocation from the arena will
 be less than `n`.
 The `alloc` parameter specifies a custom memory allocation function which
 will be used once the initial block is exhausted.  The user can pass `NULL`
 as the allocation function, in which case the initial block is the only memory
 available in the arena.  This can allow upb to be used even in situations where
 there is no heap.
 It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
 operations like `upb_Message_New()` should be checked for failure (a `NULL`
 return) if there is any possibility that a fixed-size arena is in use.
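 The fixed-size case can be illustrated without upb itself.  The sketch below
 is not upb's code: `FixedSketch` and its functions are hypothetical, and the
 8-byte alignment is an assumption.  It shows the two consequences described
 above: the arena header consumes part of the initial block, and allocation
 returns `NULL` once the block is exhausted because there is no fallback
 allocator.
 ```c
 #include <assert.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <stdlib.h>
 #define FIXED_ALIGN 8 /* Assumed alignment for every allocation. */
 // Hypothetical fixed-size arena: no fallback allocator, so it cannot grow.
 typedef struct {
   char* ptr;     // Next free byte in the initial block.
   size_t avail;  // Bytes remaining in the initial block.
 } FixedSketch;
 // Places the arena header inside the caller's buffer, so fewer than `n`
 // bytes remain for allocations.  Returns NULL if the buffer is too small.
 static FixedSketch* FixedSketch_Init(void* mem, size_t n) {
   assert(sizeof(FixedSketch) % FIXED_ALIGN == 0);
   size_t pad = (FIXED_ALIGN - (uintptr_t)mem % FIXED_ALIGN) % FIXED_ALIGN;
   if (n < pad || n - pad < sizeof(FixedSketch)) return NULL;
   FixedSketch* a = (FixedSketch*)((char*)mem + pad);
   a->ptr = (char*)(a + 1);
   a->avail = n - pad - sizeof(FixedSketch);
   return a;
 }
 // Fallible allocation: returns NULL when the initial block is exhausted.
 static void* FixedSketch_Malloc(FixedSketch* a, size_t size) {
   if (size > SIZE_MAX - (FIXED_ALIGN - 1)) return NULL;  // Would overflow.
   size = (size + FIXED_ALIGN - 1) / FIXED_ALIGN * FIXED_ALIGN;
   if (size > a->avail) return NULL;  // Exhausted: there is no fallback.
   void* ret = a->ptr;
   a->ptr += size;
   a->avail -= size;
   return ret;
 }
 // Example caller: counts how many 8-byte allocations fit in 64 bytes.
 static size_t FixedSketch_Demo(void) {
   void* mem = malloc(64);  // Could equally be a static or stack buffer.
   if (!mem) return 0;
   FixedSketch* a = FixedSketch_Init(mem, 64);
   size_t count = 0;
   // Every allocation is checked; NULL ends the loop.
   while (a && FixedSketch_Malloc(a, 8)) count++;
   free(mem);
   return count;
 }
 ```
 The caller in `FixedSketch_Demo()` treats a `NULL` result as an ordinary
 outcome rather than a fatal error, which is the discipline a fixed-size arena
 requires.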