Revamped the design doc and added a section about Arenas.

The old design doc had fallen out of date. Now that upb's core design has stabilized, it's time for a new design doc that walks through all of upb's major abstractions. We start with arenas; future CLs will cover other aspects of upb's design.

PiperOrigin-RevId: 549048285
2 years ago · 6d2b9e6d18
parent f67198ffef
commit 6d2b9e6d18
2 changed files with 167 additions and 201 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -1,201 +0,0 @@
-
-# upb Design
-
-upb aims to be a minimal C protobuf kernel.  It has a C API, but its primary
-goal is to be the core runtime for a higher-level API.
-
-## Design goals
-
-- Full protobuf conformance
-- Small code size
-- Fast performance (without compromising code size)
-- Easy to wrap in language runtimes
-- Easy to adapt to different memory management schemes (refcounting, GC, etc)
-
-## Design parameters
-
-- C99
-- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
-- Uses pointer tagging, but avoids other implementation-defined behavior
-- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
-- No global state, fully re-entrant
-
-
-## Overall Structure
-
-The upb library is divided into two main parts:
-
-- A core message representation, which supports binary format parsing
-  and serialization.
-  - `upb/upb.h`: arena allocator (`upb_arena`)
-  - `upb/msg_internal.h`: core message representation and parse tables
-  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
-  - `upb/decode.h`: binary format parsing
-  - `upb/encode.h`: binary format serialization
-  - `upb/table_internal.h`: hash table (used for maps)
-  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
-    accessing messages without reflection.
-- A reflection add-on library that supports JSON and text format.
-  - `upb/def.h`: schema representation and loading from descriptors
-  - `upb/reflection.h`: reflective access to message data.
-  - `upb/json_encode.h`: JSON encoding
-  - `upb/json_decode.h`: JSON decoding
-  - `upb/text_encode.h`: text format encoding
-  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
-    APIs for loading reflection.
-
-## Core Message Representation
-
-The representation for each message consists of:
-- One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
-  pointer is `NULL` when no unknown fields or extensions are present.
-- Hasbits for any optional/required fields.
-- Case integers for each oneof.
-- Data for each field.
-
-For example, a layout for a message with two `optional int32` fields would end
-up looking something like this:
-
-```c
-// For illustration only, upb does not actually generate structs.
-typedef struct {
-  upb_msg_internaldata* internal;  // Unknown fields and extensions.
-  uint32_t hasbits;                // We are only using two hasbits.
-  int32_t field1;
-  int32_t field2;
-} package_name_MessageName;
-```
-
-Note in particular that messages do *not* have:
-- A pointer to reflection or a parse table (upb messages are not self-describing).
-- A pointer to an arena (the arena must be explicitly passed into any function that
-  allocates).
-
-The upb compiler computes a layout for each message, and determines the offset for
-each field using normal alignment rules (each data member must be aligned to a
-multiple of its size).  This layout is then embedded into the generated `.upb.h`
-and `.upb.c` headers in two different forms.  First as inline accessors that expect
-the data at a given offset:
-
-```c
-// Example of a generated accessor, from foo.upb.h
-UPB_INLINE int32_t package_name_MessageName_field1(
-    const upb_test_MessageName *msg) {
-  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
-}
-```
-
-Secondly, the layout is emitted as a table which is used by the parser and serializer.
-We call these tables "mini-tables" to distinguish them from the larger and more
-optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
-2-3x the speed of the main parser, though the main parser is already quite fast).
-
-```c
-// Definition of mini-table structure, from upb/msg_internal.h
-typedef struct {
-  uint32_t number;
-  uint16_t offset;
-  int16_t presence;       /* If >0, hasbit_index.  If <0, ~oneof_index. */
-  uint16_t submsg_index;  /* undefined if descriptortype != MESSAGE or GROUP. */
-  uint8_t descriptortype;
-  int8_t mode;            /* upb_fieldmode, with flags from upb_labelflags */
-} upb_msglayout_field;
-
-typedef enum {
-  _UPB_MODE_MAP = 0,
-  _UPB_MODE_ARRAY = 1,
-  _UPB_MODE_SCALAR = 2,
-} upb_fieldmode;
-
-typedef struct {
-  const struct upb_msglayout *const* submsgs;
-  const upb_msglayout_field *fields;
-  uint16_t size;
-  uint16_t field_count;
-  bool extendable;
-  uint8_t dense_below;
-  uint8_t table_mask;
-} upb_msglayout;
-
-// Example of a generated mini-table, from foo.upb.c
-static const upb_msglayout_field upb_test_MessageName__fields[2] = {
-  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
-  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
-};
-
-const upb_msglayout upb_test_MessageName_msg_init = {
-  NULL,
-  &upb_test_MessageName__fields[0],
-  UPB_SIZE(16, 16), 2, false, 2, 255,
-};
-```
-
-The upb compiler computes separate layouts for 32 and 64 bit modes, since the
-pointer size will be 4 or 8 bytes respectively.  The upb compiler embeds both
-sizes into the source code, using a `UPB_SIZE(size32, size64)` macro that can
-choose the appropriate size at build time based on the size of `UINTPTR_MAX`.
-
-Note that `.upb.c` files contain data tables only.  There is no "generated code"
-except for the inline accessors in the `.upb.h` files: the entire footprint
-of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.
-
-## Memory Management Model
-
-All memory management in upb is built around arenas.  A message is never
-considered to "own" the strings or sub-messages contained within it.  Instead a
-message and all of its sub-messages/strings/etc. are all owned by an arena and
-are freed when the arena is freed.  An entire message tree will probably be
-owned by a single arena, but this is not required or enforced.  As far as upb is
-concerned, it is up to the client how to partition its arenas.  upb only requires
-that when you ask it to serialize a message, that all reachable messages are
-still alive.
-
-The arena supports both a user-supplied initial block and a custom allocation
-callback, so there is a lot of flexibility in memory allocation strategy.  The
-allocation callback can even be `NULL` for heap-free operation.  The main
-constraint of the arena is that all of the memory in each arena must be freed
-together.
-
-`upb_arena` supports a novel operation called "fuse".  When two arenas are fused
-together, their lifetimes are irreversibly joined, such that none of the arena
-blocks in either arena will be freed until *both* arenas are freed with
-`upb_arena_free()`.  This is useful when joining two messages from separate
-arenas (making one a sub-message of the other).  Fuse is a very cheap
-operation, and an unlimited number of arenas can be fused together efficiently.
-
-## Reflection and Descriptors
-
-upb offers a fully-featured reflection library.  There are two main ways of
-using reflection:
-
-1. You can load descriptors from strings using `upb_symtab_addfile()`.
-  The upb runtime will dynamically create mini-tables like what the upb compiler
-  would have created if you had compiled this type into a `.upb.c` file.
-2. You can load descriptors using generated `.upbdefs.h` interfaces.
-  This will load reflection that references the corresponding `.upb.c`
-  mini-tables instead of building a new mini-table on the fly.  This lets
-  you reflect on generated types that are linked into your program.
-
-upb's design for descriptors is similar to protobuf C++ in many ways, with
-the following correspondences:
-
-| C++ Type | upb type |
-| ---------| ---------|
-| `google::protobuf::DescriptorPool` | `upb_symtab`
-| `google::protobuf::Descriptor` | `upb_msgdef`
-| `google::protobuf::FieldDescriptor` | `upb_fielddef`
-| `google::protobuf::OneofDescriptor` | `upb_oneofdef`
-| `google::protobuf::EnumDescriptor` | `upb_enumdef`
-| `google::protobuf::FileDescriptor` | `upb_filedef`
-| `google::protobuf::ServiceDescriptor` | `upb_servicedef`
-| `google::protobuf::MethodDescriptor` | `upb_methoddef`
-
-Like in C++ descriptors (defs) are created by loading a
-`google_protobuf_FileDescriptorProto` into a `upb_symtab`.  This creates and
-links all of the def objects corresponding to that `.proto` file, and inserts
-the names into a symbol table so they can be looked up by name.
-
-Once you have loaded some descriptors into a `upb_symtab`, you can create and
-manipulate messages using the interfaces defined in `upb/reflection.h`.  If your
-descriptors are linked to your generated layouts using option (2) above, you can
-safely access the same messages using both reflection and generated interfaces.
--- a/docs/design.md
+++ b/docs/design.md
@@ -0,0 +1,167 @@
+# upb Design
+
+[TOC]
+
+upb is a protobuf kernel written in C.  It is a fast and conformant implementation
+of protobuf, with a low-level C API that is designed to be wrapped in other
+languages.
+
+upb is not designed to be used by applications directly.  The C API is very
+low-level, unsafe, and changes frequently.  It is important that upb is able to
+make breaking API changes as necessary, to avoid taking on technical debt that
+would compromise upb's goals of small code size and fast performance.
+
+## Design goals
+
+Goals:
+
+- Full protobuf conformance
+- Small code size
+- Fast performance (without compromising code size)
+- Easy to wrap in language runtimes
+- Easy to adapt to different memory management schemes (refcounting, GC, etc)
+
+Non-Goals:
+
+- Stable API
+- Safe API
+- Ergonomic API for applications
+
+Parameters:
+
+- C99
+- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
+- Uses pointer tagging, but avoids other implementation-defined behavior
+- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
+- No global state, fully re-entrant
+
+## Arenas
+
+All memory management in upb is done with arenas, represented by the type
+`upb_Arena`.  Arenas are an alternative to `malloc()` and `free()` that
+significantly reduces the cost of memory allocation.
+
+Arenas obtain blocks of memory using some underlying allocator (likely
+`malloc()` and `free()`), and satisfy allocations using a simple bump allocator
+that walks through each block in linear order.  Allocations cannot be freed
+individually: it is only possible to free the arena as a whole, which frees all
+of the underlying blocks.
+
+Here is an example of using the `upb_Arena` type:
+
+```c
+  upb_Arena* arena = upb_Arena_New();
+
+  // Perform some allocations.
+  int* x = upb_Arena_Malloc(arena, sizeof(*x));
+  int* y = upb_Arena_Malloc(arena, sizeof(*y));
+
+  // We cannot free `x` and `y` separately; we can only free the arena
+  // as a whole.
+  upb_Arena_Free(arena);
+```
+
+Because upb uses arenas for all memory management, this is reflected in the API
+of every upb data structure.  All upb functions that allocate take a
+`upb_Arena*` parameter and allocate from that arena rather than calling
+`malloc()` or `free()` directly.
+
+```c
+// upb API to create a message.
+UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
+                                     upb_Arena* arena);
+
+void MakeMessage(const upb_MiniTable* mini_table) {
+  upb_Arena* arena = upb_Arena_New();
+
+  // This message is allocated on our arena.
+  upb_Message* msg = upb_Message_New(mini_table, arena);
+
+  // We can free the arena whenever we want, but we cannot free the
+  // message separately from the arena.
+  upb_Arena_Free(arena);
+
+  // msg is now a dangling pointer and must not be used.
+}
+```
+
+Arenas are a key part of upb's performance story.  Parsing a large protobuf
+payload usually involves rapidly creating a series of messages, arrays (repeated
+fields), and maps.  It is crucial for parsing performance that these allocations
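+are as fast as possible.  Equally important, freeing the tree of messages should
+be as fast as possible.  Arenas can reduce this cost from `O(n)` to `O(lg n)`,
+because freeing an arena releases a small number of blocks rather than `n`
+individual objects (the block count is logarithmic in `n` when block sizes grow
+geometrically).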
+
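+The bump allocation and cheap teardown described above can be sketched in plain
+C.  This is a toy model, not upb's implementation: `ToyArena`, `ToyBlock`, the
+256-byte first block, and the size-doubling growth policy are all assumptions
+made for illustration.
+
+```c
+#include <stddef.h>
+#include <stdlib.h>
+
+// Toy model only: these names and the growth policy are hypothetical.
+typedef struct ToyBlock {
+  struct ToyBlock* next;  // Previously allocated block, or NULL.
+  size_t size;            // Usable bytes that follow this header.
+} ToyBlock;
+
+typedef struct {
+  ToyBlock* head;  // Most recent block; allocations bump through it.
+  char* ptr;       // Next free byte in the head block.
+  size_t avail;    // Bytes remaining in the head block.
+} ToyArena;
+
+static void ToyArena_Init(ToyArena* a) {
+  a->head = NULL;
+  a->ptr = NULL;
+  a->avail = 0;
+}
+
+// Bump-allocates n bytes, rounded up to 8-byte alignment for this demo.
+static void* ToyArena_Malloc(ToyArena* a, size_t n) {
+  n = (n + 7) & ~(size_t)7;
+  if (a->avail < n) {
+    // Head block exhausted: fetch a block twice as large as the last one.
+    size_t size = a->head ? a->head->size * 2 : 256;
+    if (size < n) size = n;
+    ToyBlock* b = malloc(sizeof(ToyBlock) + size);
+    if (!b) return NULL;
+    b->next = a->head;
+    b->size = size;
+    a->head = b;
+    a->ptr = (char*)(b + 1);
+    a->avail = size;
+  }
+  void* ret = a->ptr;
+  a->ptr += n;
+  a->avail -= n;
+  return ret;
+}
+
+// Frees every block and returns how many calls to free() that took.
+static int ToyArena_Free(ToyArena* a) {
+  int blocks = 0;
+  while (a->head) {
+    ToyBlock* next = a->head->next;
+    free(a->head);
+    a->head = next;
+    blocks++;
+  }
+  ToyArena_Init(a);
+  return blocks;
+}
+```
+
+Under these assumptions, 10,000 allocations of 16 bytes each fit in 10 blocks,
+so `ToyArena_Free()` makes 10 calls to `free()` where a per-object scheme would
+make 10,000.
+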
+### Avoiding Dangling Pointers
+
+Objects allocated on an arena will frequently contain pointers to other
+arena-allocated objects.  For example, a `upb_Message` will have pointers to
+sub-messages that are also arena-allocated.
+
+Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
+automatic safety from dangling pointers.  Instead, upb provides tools to help
+bridge between higher-level memory management schemes (GC, refcounting, RAII,
+borrow checkers) and arenas.
+
+If there is only one arena, dangling pointers within the arena are impossible,
+because all objects are freed at the same time.  This is the simplest case.  The
+user must still be careful not to keep dangling pointers that point at arena
+memory after it has been freed, but dangling pointers *between* the arena
+objects will be impossible.
+
+But what if there are multiple arenas?  If we have a pointer from one arena to
+another, how do we ensure that this will not become a dangling pointer?
+
+To help with the multiple arena case, upb provides a primitive called "fuse".
+
+```c
+// Fuses the lifetimes of `a` and `b`.  None of the blocks from `a` or `b`
+// will be freed until both arenas are freed.
+UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
+```
+
+When two arenas are fused together, their lifetimes are irreversibly joined,
+such that none of the arena blocks in either arena will be freed until *both*
+arenas are freed with `upb_Arena_Free()`.  This means that dangling pointers
+between the two arenas will no longer be possible.
+
+Fuse is useful when joining two messages from separate arenas (making one a
+sub-message of the other).  Fuse is a relatively cheap operation, on the order
+of 150ns, and is very nearly `O(1)` in the number of arenas being fused (the
+true complexity is the inverse Ackermann function, which grows extremely
+slowly).
+
+Each arena does consume some memory, so repeatedly creating and fusing an
+additional arena is not free, but the CPU cost of fusing two arenas together is
+modest.
+
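+The near-`O(1)` cost quoted above is characteristic of a disjoint-set
+(union-find) structure with path compression and union by rank.  The sketch
+below models fuse that way.  It is an illustration under that assumption, not
+upb's actual code, and `FuseNode` with its `parent`, `rank`, and `refs` fields
+is hypothetical.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+
+// Hypothetical model: each arena is a node in a disjoint-set forest.
+typedef struct FuseNode {
+  struct FuseNode* parent;  // NULL if this node is the root of its group.
+  int rank;                 // Upper bound on tree height (roots only).
+  int refs;                 // Arenas in the group not yet freed (roots only).
+} FuseNode;
+
+static void FuseNode_Init(FuseNode* a) {
+  a->parent = NULL;
+  a->rank = 0;
+  a->refs = 1;
+}
+
+// Finds the group's root, halving the path along the way.
+static FuseNode* FuseNode_Root(FuseNode* a) {
+  while (a->parent) {
+    if (a->parent->parent) a->parent = a->parent->parent;
+    a = a->parent;
+  }
+  return a;
+}
+
+// Joins the lifetimes of a and b by merging their groups.
+static void FuseNode_Fuse(FuseNode* a, FuseNode* b) {
+  FuseNode* ra = FuseNode_Root(a);
+  FuseNode* rb = FuseNode_Root(b);
+  if (ra == rb) return;  // Already fused.
+  if (ra->rank < rb->rank) {
+    FuseNode* tmp = ra;
+    ra = rb;
+    rb = tmp;
+  }
+  rb->parent = ra;  // Attach the shallower tree under the deeper one.
+  ra->refs += rb->refs;
+  if (ra->rank == rb->rank) ra->rank++;
+}
+
+// Drops one reference.  Returns true only when every arena in the group
+// has been freed, i.e. when the group's blocks could be released.
+static bool FuseNode_Free(FuseNode* a) {
+  FuseNode* root = FuseNode_Root(a);
+  return --root->refs == 0;
+}
+```
+
+After `FuseNode_Fuse(&a, &b)`, calling `FuseNode_Free(&a)` returns `false`;
+only the later `FuseNode_Free(&b)` reports that the shared blocks can be
+released.
+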
+### Initial Block and Custom Allocators
+
+`upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
+underlying blocks.  But this default strategy can be customized to support
+the needs of a particular language.
+
+The lowest-level function for creating a `upb_Arena` is:
+
+```c
+// Creates an arena from the given initial block (if any -- n may be 0).
+// Additional blocks will be allocated from |alloc|.  If |alloc| is NULL,
+// this is a fixed-size arena and cannot grow.
+UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
+```
+
+The `n` bytes starting at `mem` are used as an "initial block", which satisfies
+allocations before any underlying allocation function is called.  Note that the
+`upb_Arena` itself will be allocated from the initial block if possible, so the
+amount of memory available for allocation from the arena will be less than
+`n`.
+
+The `alloc` parameter specifies a custom memory allocation function which
+will be used once the initial block is exhausted.  The user can pass `NULL`
+as the allocation function, in which case the initial block is the only memory
+available in the arena.  This can allow upb to be used even in situations where
+there is no heap.
+
+It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
+operations like `upb_Message_New()` should be checked for failure if there is
+any possibility that a fixed-size arena is in use.
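+
+A minimal sketch of why callers must check for failure, using a hypothetical
+`FixedArena` over a caller-supplied buffer.  It stands in for an arena created
+with a `NULL` allocator and is not upb's implementation; `MakeCounter()` is
+likewise an invented caller.
+
+```c
+#include <stddef.h>
+
+// Hypothetical fixed-size arena: it never asks for more memory.
+typedef struct {
+  char* ptr;     // Next free byte in the caller's buffer.
+  size_t avail;  // Bytes remaining.
+} FixedArena;
+
+// mem must be suitably aligned for the objects that will be allocated.
+static void FixedArena_Init(FixedArena* a, void* mem, size_t n) {
+  a->ptr = mem;
+  a->avail = n;
+}
+
+// Returns NULL once the buffer is exhausted; there is no fallback allocator.
+static void* FixedArena_Malloc(FixedArena* a, size_t n) {
+  size_t rounded = (n + 7) & ~(size_t)7;  // 8-byte alignment for this demo.
+  if (rounded < n || rounded > a->avail) return NULL;
+  void* ret = a->ptr;
+  a->ptr += rounded;
+  a->avail -= rounded;
+  return ret;
+}
+
+// An allocating operation must propagate failure instead of assuming success.
+static int* MakeCounter(FixedArena* a) {
+  int* counter = FixedArena_Malloc(a, sizeof(*counter));
+  if (!counter) return NULL;
+  *counter = 0;
+  return counter;
+}
+```
+
+With a 32-byte buffer, four `MakeCounter()` calls succeed and the fifth returns
+`NULL`, which the caller has to handle.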