From 6d2b9e6d18910b36ad4c602bb8fc838641fac932 Mon Sep 17 00:00:00 2001
From: Joshua Haberman
Date: Tue, 18 Jul 2023 10:45:28 -0700
Subject: [PATCH] Revamped the design doc and added a section about Arenas.

The old design doc had fallen out of date. Now that upb's core design has stabilized, it's time for a new design doc that walks through all of upb's major abstractions.

We start with arenas; future CLs will cover other aspects of upb's design.

PiperOrigin-RevId: 549048285
---
 DESIGN.md      | 201 -------------------------------------------------
 docs/design.md | 167 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+), 201 deletions(-)
 delete mode 100644 DESIGN.md
 create mode 100644 docs/design.md

diff --git a/DESIGN.md b/DESIGN.md
deleted file mode 100644
index 41a2097e7f..0000000000
--- a/DESIGN.md
+++ /dev/null
@@ -1,201 +0,0 @@
-
-# upb Design
-
-upb aims to be a minimal C protobuf kernel. It has a C API, but its primary
-goal is to be the core runtime for a higher-level API.
-
-## Design goals
-
-- Full protobuf conformance
-- Small code size
-- Fast performance (without compromising code size)
-- Easy to wrap in language runtimes
-- Easy to adapt to different memory management schemes (refcounting, GC, etc)
-
-## Design parameters
-
-- C99
-- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
-- Uses pointer tagging, but avoids other implementation-defined behavior
-- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
-- No global state, fully re-entrant
-
-
-## Overall Structure
-
-The upb library is divided into two main parts:
-
-- A core message representation, which supports binary format parsing
-  and serialization.
-  - `upb/upb.h`: arena allocator (`upb_arena`)
-  - `upb/msg_internal.h`: core message representation and parse tables
-  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
-  - `upb/decode.h`: binary format parsing
-  - `upb/encode.h`: binary format serialization
-  - `upb/table_internal.h`: hash table (used for maps)
-  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
-    accessing messages without reflection.
-- A reflection add-on library that supports JSON and text format.
-  - `upb/def.h`: schema representation and loading from descriptors
-  - `upb/reflection.h`: reflective access to message data.
-  - `upb/json_encode.h`: JSON encoding
-  - `upb/json_decode.h`: JSON decoding
-  - `upb/text_encode.h`: text format encoding
-  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
-    APIs for loading reflection.
-
-## Core Message Representation
-
-The representation for each message consists of:
-- One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
-  pointer is `NULL` when no unknown fields or extensions are present.
-- Hasbits for any optional/required fields.
-- Case integers for each oneof.
-- Data for each field.
-
-For example, a layout for a message with two `optional int32` fields would end
-up looking something like this:
-
-```c
-// For illustration only, upb does not actually generate structs.
-typedef struct {
-  upb_msg_internaldata* internal; // Unknown fields and extensions.
-  uint32_t hasbits; // We are only using two hasbits.
-  int32_t field1;
-  int32_t field2;
-} package_name_MessageName;
-```
-
-Note in particular that messages do *not* have:
-- A pointer to reflection or a parse table (upb messages are not self-describing).
-- A pointer to an arena (the arena must be explicitly passed into any function that
-  allocates).
-
-The upb compiler computes a layout for each message, and determines the offset for
-each field using normal alignment rules (each data member must be aligned to a
-multiple of its size). This layout is then embedded into the generated `.upb.h`
-and `.upb.c` headers in two different forms. First as inline accessors that expect
-the data at a given offset:
-
-```c
-// Example of a generated accessor, from foo.upb.h
-UPB_INLINE int32_t package_name_MessageName_field1(
-    const upb_test_MessageName *msg) {
-  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
-}
-```
-
-Secondly, the layout is emitted as a table which is used by the parser and serializer.
-We call these tables "mini-tables" to distinguish them from the larger and more
-optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
-2-3x the speed of the main parser, though the main parser is already quite fast).
-
-```c
-// Definition of mini-table structure, from upb/msg_internal.h
-typedef struct {
-  uint32_t number;
-  uint16_t offset;
-  int16_t presence; /* If >0, hasbit_index. If <0, ~oneof_index. */
-  uint16_t submsg_index; /* undefined if descriptortype != MESSAGE or GROUP. */
-  uint8_t descriptortype;
-  int8_t mode; /* upb_fieldmode, with flags from upb_labelflags */
-} upb_msglayout_field;
-
-typedef enum {
-  _UPB_MODE_MAP = 0,
-  _UPB_MODE_ARRAY = 1,
-  _UPB_MODE_SCALAR = 2,
-} upb_fieldmode;
-
-typedef struct {
-  const struct upb_msglayout *const* submsgs;
-  const upb_msglayout_field *fields;
-  uint16_t size;
-  uint16_t field_count;
-  bool extendable;
-  uint8_t dense_below;
-  uint8_t table_mask;
-} upb_msglayout;
-
-// Example of a generated mini-table, from foo.upb.c
-static const upb_msglayout_field upb_test_MessageName__fields[2] = {
-  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
-  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
-};
-
-const upb_msglayout upb_test_MessageName_msg_init = {
-  NULL,
-  &upb_test_MessageName__fields[0],
-  UPB_SIZE(16, 16), 2, false, 2, 255,
-};
-```
-
-The upb compiler computes separate layouts for 32 and 64 bit modes, since the
-pointer size will be 4 or 8 bytes respectively. The upb compiler embeds both
-sizes into the source code, using a `UPB_SIZE(size32, size64)` macro that can
-choose the appropriate size at build time based on the size of `UINTPTR_MAX`.
-
-Note that `.upb.c` files contain data tables only. There is no "generated code"
-except for the inline accessors in the `.upb.h` files: the entire footprint
-of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.
-
-## Memory Management Model
-
-All memory management in upb is built around arenas. A message is never
-considered to "own" the strings or sub-messages contained within it. Instead a
-message and all of its sub-messages/strings/etc. are all owned by an arena and
-are freed when the arena is freed. An entire message tree will probably be
-owned by a single arena, but this is not required or enforced. As far as upb is
-concerned, it is up to the client how to partition its arenas. upb only requires
-that when you ask it to serialize a message, that all reachable messages are
-still alive.
-
-The arena supports both a user-supplied initial block and a custom allocation
-callback, so there is a lot of flexibility in memory allocation strategy. The
-allocation callback can even be `NULL` for heap-free operation. The main
-constraint of the arena is that all of the memory in each arena must be freed
-together.
-
-`upb_arena` supports a novel operation called "fuse". When two arenas are fused
-together, their lifetimes are irreversibly joined, such that none of the arena
-blocks in either arena will be freed until *both* arenas are freed with
-`upb_arena_free()`. This is useful when joining two messages from separate
-arenas (making one a sub-message of the other). Fuse is a very cheap
-operation, and an unlimited number of arenas can be fused together efficiently.
-
-## Reflection and Descriptors
-
-upb offers a fully-featured reflection library. There are two main ways of
-using reflection:
-
-1. You can load descriptors from strings using `upb_symtab_addfile()`.
-   The upb runtime will dynamically create mini-tables like what the upb compiler
-   would have created if you had compiled this type into a `.upb.c` file.
-2. You can load descriptors using generated `.upbdefs.h` interfaces.
-   This will load reflection that references the corresponding `.upb.c`
-   mini-tables instead of building a new mini-table on the fly. This lets
-   you reflect on generated types that are linked into your program.
-
-upb's design for descriptors is similar to protobuf C++ in many ways, with
-the following correspondences:
-
-| C++ Type | upb type |
-| ---------| ---------|
-| `google::protobuf::DescriptorPool` | `upb_symtab`
-| `google::protobuf::Descriptor` | `upb_msgdef`
-| `google::protobuf::FieldDescriptor` | `upb_fielddef`
-| `google::protobuf::OneofDescriptor` | `upb_oneofdef`
-| `google::protobuf::EnumDescriptor` | `upb_enumdef`
-| `google::protobuf::FileDescriptor` | `upb_filedef`
-| `google::protobuf::ServiceDescriptor` | `upb_servicedef`
-| `google::protobuf::MethodDescriptor` | `upb_methoddef`
-
-Like in C++ descriptors (defs) are created by loading a
-`google_protobuf_FileDescriptorProto` into a `upb_symtab`. This creates and
-links all of the def objects corresponding to that `.proto` file, and inserts
-the names into a symbol table so they can be looked up by name.
-
-Once you have loaded some descriptors into a `upb_symtab`, you can create and
-manipulate messages using the interfaces defined in `upb/reflection.h`. If your
-descriptors are linked to your generated layouts using option (2) above, you can
-safely access the same messages using both reflection and generated interfaces.
diff --git a/docs/design.md b/docs/design.md
new file mode 100644
index 0000000000..a8b4b5ee85
--- /dev/null
+++ b/docs/design.md
@@ -0,0 +1,167 @@
+# upb Design
+
+[TOC]
+
+upb is a protobuf kernel written in C. It is a fast and conformant implementation
+of protobuf, with a low-level C API that is designed to be wrapped in other
+languages.
+
+upb is not designed to be used by applications directly. The C API is very
+low-level, unsafe, and changes frequently. It is important that upb is able to
+make breaking API changes as necessary, to avoid taking on technical debt that
+would compromise upb's goals of small code size and fast performance.
+
+## Design goals
+
+Goals:
+
+- Full protobuf conformance
+- Small code size
+- Fast performance (without compromising code size)
+- Easy to wrap in language runtimes
+- Easy to adapt to different memory management schemes (refcounting, GC, etc)
+
+Non-Goals:
+
+- Stable API
+- Safe API
+- Ergonomic API for applications
+
+Parameters:
+
+- C99
+- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
+- Uses pointer tagging, but avoids other implementation-defined behavior
+- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
+- No global state, fully re-entrant
+
+## Arenas
+
+All memory management in upb uses arenas, using the type `upb_Arena`. Arenas
+are an alternative to `malloc()` and `free()` that significantly reduces the
+costs of memory allocation.
+
+Arenas obtain blocks of memory using some underlying allocator (likely
+`malloc()` and `free()`), and satisfy allocations using a simple bump allocator
+that walks through each block in linear order. Allocations cannot be freed
+individually: it is only possible to free the arena as a whole, which frees all
+of the underlying blocks.
+
+Here is an example of using the `upb_Arena` type:
+
+```c
+  upb_Arena* arena = upb_Arena_New();
+
+  // Perform some allocations.
+  int* x = upb_Arena_Malloc(arena, sizeof(*x));
+  int* y = upb_Arena_Malloc(arena, sizeof(*y));
+
+  // We cannot free `x` and `y` separately, we can only free the arena
+  // as a whole.
+  upb_Arena_Free(arena);
+```
+
+upb uses arenas for all memory management, and this fact is reflected in the API
+for all upb data structures. All upb functions that allocate take a
+`upb_Arena*` parameter and perform allocations using that arena rather than
+calling `malloc()` or `free()`.
+
+```c
+// upb API to create a message.
+UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
+                                     upb_Arena* arena);
+
+void MakeMessage(const upb_MiniTable* mini_table) {
+  upb_Arena* arena = upb_Arena_New();
+
+  // This message is allocated on our arena.
+  upb_Message* msg = upb_Message_New(mini_table, arena);
+
+  // We can free the arena whenever we want, but we cannot free the
+  // message separately from the arena.
+  upb_Arena_Free(arena);
+
+  // msg is now deleted.
+}
+```
+
+Arenas are a key part of upb's performance story. Parsing a large protobuf
+payload usually involves rapidly creating a series of messages, arrays (repeated
+fields), and maps. It is crucial for parsing performance that these allocations
+are as fast as possible. Equally important, freeing the tree of messages should
+be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg
+n)`.
+
+### Avoiding Dangling Pointers
+
+Objects allocated on an arena will frequently contain pointers to other
+arena-allocated objects. For example, a `upb_Message` will have pointers to
+sub-messages that are also arena-allocated.
+
+Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
+automatic safety from dangling pointers. Instead, upb provides tools to help
+bridge between higher-level memory management schemes (GC, refcounting, RAII,
+borrow checkers) and arenas.
+
+If there is only one arena, dangling pointers within the arena are impossible,
+because all objects are freed at the same time. This is the simplest case. The
+user must still be careful not to keep dangling pointers that point at arena
+memory after it has been freed, but dangling pointers *between* the arena
+objects will be impossible.
+
+But what if there are multiple arenas? If we have a pointer from one arena to
+another, how do we ensure that this will not become a dangling pointer?
+
+To help with the multiple arena case, upb provides a primitive called "fuse".
+
+```c
+// Fuses the lifetimes of `a` and `b`. None of the blocks from `a` or `b`
+// will be freed until both arenas are freed.
+UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
+```
+
+When two arenas are fused together, their lifetimes are irreversibly joined,
+such that none of the arena blocks in either arena will be freed until *both*
+arenas are freed with `upb_Arena_Free()`. This means that dangling pointers
+between the two arenas will no longer be possible.
+
+Fuse is useful when joining two messages from separate arenas (making one a
+sub-message of the other). Fuse is a relatively cheap operation, on the order
+of 150ns, and is very nearly `O(1)` in the number of arenas being fused (the
+true complexity is the inverse Ackermann function, which grows extremely
+slowly).
+
+Each arena does consume some memory, so repeatedly creating and fusing an
+additional arena is not free, but the CPU cost of fusing two arenas together is
+modest.
+
+### Initial Block and Custom Allocators
+
+`upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
+underlying blocks. But this default strategy can be customized to support
+the needs of a particular language.
+
+The lowest-level function for creating a `upb_Arena` is:
+
+```c
+// Creates an arena from the given initial block (if any -- n may be 0).
+// Additional blocks will be allocated from |alloc|. If |alloc| is NULL,
+// this is a fixed-size arena and cannot grow.
+UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
+```
+
+The buffer `[mem, n]` will be used as an "initial block", which is used to
+satisfy allocations before calling any underlying allocation function. Note
+that the `upb_Arena` itself will be allocated from the initial block if
+possible, so the amount of memory available for allocation from the arena will
+be less than `n`.
+
+The `alloc` parameter specifies a custom memory allocation function which
+will be used once the initial block is exhausted. The user can pass `NULL`
+as the allocation function, in which case the initial block is the only memory
+available in the arena. This can allow upb to be used even in situations where
+there is no heap.
+
+It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
+operations like `upb_Message_New()` should be checked for failure if there is
+any possibility that a fixed size arena is in use.
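
The fixed-size case described at the end of the new `docs/design.md` can be illustrated with a minimal sketch. This is not upb's implementation: `toy_arena`, `toy_arena_init()`, `toy_arena_malloc()`, `toy_demo()`, and the 8-byte `TOY_ALIGN` are hypothetical names and choices made only for illustration. The sketch assumes a bump allocator over a caller-supplied buffer with no fallback allocator, which is roughly the situation when `alloc` is `NULL`. It shows two points from the text: the arena header is carved out of the initial block, so fewer than `n` bytes are usable, and allocation returns `NULL` once the block is exhausted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical toy arena for illustration only; this is NOT upb's code.
// It bump-allocates from a caller-supplied block and never grows.
#define TOY_ALIGN 8

typedef struct {
  char* ptr;  // Next free byte in the block.
  char* end;  // One past the last usable byte.
} toy_arena;

static size_t toy_align_up(size_t n) {
  return (n + TOY_ALIGN - 1) & ~(size_t)(TOY_ALIGN - 1);
}

// Places the arena header inside the initial block, so less than `n`
// bytes remain for allocations. Returns NULL if the block is too small.
static toy_arena* toy_arena_init(void* mem, size_t n) {
  if (mem == NULL) return NULL;
  uintptr_t start = (uintptr_t)mem;
  uintptr_t aligned = (start + TOY_ALIGN - 1) & ~(uintptr_t)(TOY_ALIGN - 1);
  size_t skip = (size_t)(aligned - start);
  size_t header = toy_align_up(sizeof(toy_arena));
  if (n < skip || n - skip < header) return NULL;
  toy_arena* a = (toy_arena*)((char*)mem + skip);
  a->ptr = (char*)a + header;
  a->end = (char*)mem + n;
  return a;
}

// Fallible: returns NULL when the fixed-size block cannot satisfy `size`.
static void* toy_arena_malloc(toy_arena* a, size_t size) {
  if (size > SIZE_MAX - TOY_ALIGN) return NULL;  // Avoid overflow below.
  size = toy_align_up(size);
  if (size > (size_t)(a->end - a->ptr)) return NULL;
  void* ret = a->ptr;
  a->ptr += size;
  return ret;
}

// Counts how many 16-byte allocations fit in a 64-byte block.
int toy_demo(void) {
  static uint64_t buf[8];  // 64 bytes with 8-byte alignment.
  toy_arena* a = toy_arena_init(buf, sizeof(buf));
  if (a == NULL) return -1;
  int count = 0;
  while (toy_arena_malloc(a, 16) != NULL) count++;
  return count;
}
```

A caller working with a real fixed-size `upb_Arena` would apply the same pattern: check every allocating call for `NULL` and treat exhaustion of the initial block as an ordinary, recoverable failure.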