The old design doc had fallen out of date. Now that upb's core design has stabilized, it's time for a new design doc that walks through all of upb's major abstractions. We start with arenas; future CLs will cover other aspects of upb's design.

PiperOrigin-RevId: 549048285

pull/13171/head
parent f67198ffef
commit 6d2b9e6d18
2 changed files with 167 additions and 201 deletions
@@ -1,201 +0,0 @@

# upb Design

upb aims to be a minimal C protobuf kernel. It has a C API, but its primary
goal is to be the core runtime for a higher-level API.

## Design goals

- Full protobuf conformance
- Small code size
- Fast performance (without compromising code size)
- Easy to wrap in language runtimes
- Easy to adapt to different memory management schemes (refcounting, GC, etc)

## Design parameters

- C99
- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
- Uses pointer tagging, but avoids other implementation-defined behavior
- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
- No global state, fully re-entrant


## Overall Structure

The upb library is divided into two main parts:

- A core message representation, which supports binary format parsing
  and serialization.
  - `upb/upb.h`: arena allocator (`upb_arena`)
  - `upb/msg_internal.h`: core message representation and parse tables
  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
  - `upb/decode.h`: binary format parsing
  - `upb/encode.h`: binary format serialization
  - `upb/table_internal.h`: hash table (used for maps)
  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
    accessing messages without reflection.
- A reflection add-on library that supports JSON and text format.
  - `upb/def.h`: schema representation and loading from descriptors
  - `upb/reflection.h`: reflective access to message data.
  - `upb/json_encode.h`: JSON encoding
  - `upb/json_decode.h`: JSON decoding
  - `upb/text_encode.h`: text format encoding
  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
    APIs for loading reflection.

## Core Message Representation

The representation for each message consists of:
- One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
  pointer is `NULL` when no unknown fields or extensions are present.
- Hasbits for any optional/required fields.
- Case integers for each oneof.
- Data for each field.

For example, a layout for a message with two `optional int32` fields would end
up looking something like this:

```c
// For illustration only, upb does not actually generate structs.
typedef struct {
  upb_msg_internaldata* internal;  // Unknown fields and extensions.
  uint32_t hasbits;                // We are only using two hasbits.
  int32_t field1;
  int32_t field2;
} package_name_MessageName;
```

Note in particular that messages do *not* have:
- A pointer to reflection or a parse table (upb messages are not self-describing).
- A pointer to an arena (the arena must be explicitly passed into any function that
  allocates).

The upb compiler computes a layout for each message, and determines the offset for
each field using normal alignment rules (each data member must be aligned to a
multiple of its size). This layout is then embedded into the generated `.upb.h`
and `.upb.c` files in two different forms. First, as inline accessors that expect
the data at a given offset:

```c
// Example of a generated accessor, from foo.upb.h
UPB_INLINE int32_t package_name_MessageName_field1(
    const package_name_MessageName *msg) {
  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
}
```

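The offset-based access pattern described above can be sketched in a self-contained form. Everything here is hypothetical: `ToyLayout`, `TOY_PTR_AT`, and the `Toy_*` functions are illustrative names, the macro is only an assumed shape for a pointer-at-offset helper, and the offsets come from `offsetof()` on a stand-in struct rather than from a compiler-computed layout. The hasbit index of 1 mirrors the `presence` value used for the first field in the mini-table example in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for a computed layout. Real upb does not generate
// structs; this one exists only so offsetof() can supply portable offsets.
typedef struct {
  void* internal;    // Unknown fields and extensions (unused in this sketch).
  uint32_t hasbits;  // One presence bit per optional field.
  int32_t field1;
  int32_t field2;
} ToyLayout;

// Assumed shape of a pointer-at-offset helper (not copied from upb).
#define TOY_PTR_AT(msg, ofs, type) ((type*)((char*)(msg) + (ofs)))

// Hasbit index assigned to field1 in this sketch.
enum { kToyField1Hasbit = 1 };

// Reads field1 from an opaque message pointer using only its offset.
static int32_t Toy_field1(const void* msg) {
  return *TOY_PTR_AT(msg, offsetof(ToyLayout, field1), const int32_t);
}

// Reports whether field1 has been set, by testing its hasbit.
static bool Toy_has_field1(const void* msg) {
  uint32_t bits = *TOY_PTR_AT(msg, offsetof(ToyLayout, hasbits), const uint32_t);
  return (bits >> kToyField1Hasbit) & 1;
}

// Stores field1 and marks it present.
static void Toy_set_field1(void* msg, int32_t value) {
  *TOY_PTR_AT(msg, offsetof(ToyLayout, field1), int32_t) = value;
  *TOY_PTR_AT(msg, offsetof(ToyLayout, hasbits), uint32_t) |=
      (uint32_t)1 << kToyField1Hasbit;
}

// Sets field1 on a zeroed message and reads it back through the accessors.
static int32_t ToyAccessor_Demo(void) {
  ToyLayout msg;
  memset(&msg, 0, sizeof(msg));
  Toy_set_field1(&msg, 42);
  return Toy_has_field1(&msg) ? Toy_field1(&msg) : -1;
}
```

Because the accessors take an opaque pointer and an offset, the message itself needs no embedded type information, which matches the point above that upb messages are not self-describing.
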
Secondly, the layout is emitted as a table which is used by the parser and serializer.
We call these tables "mini-tables" to distinguish them from the larger and more
optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
2-3x the speed of the main parser, though the main parser is already quite fast).

```c
// Definition of mini-table structure, from upb/msg_internal.h
typedef struct {
  uint32_t number;
  uint16_t offset;
  int16_t presence;       /* If >0, hasbit_index.  If <0, ~oneof_index. */
  uint16_t submsg_index;  /* undefined if descriptortype != MESSAGE or GROUP. */
  uint8_t descriptortype;
  int8_t mode;            /* upb_fieldmode, with flags from upb_labelflags */
} upb_msglayout_field;

typedef enum {
  _UPB_MODE_MAP = 0,
  _UPB_MODE_ARRAY = 1,
  _UPB_MODE_SCALAR = 2,
} upb_fieldmode;

typedef struct {
  const struct upb_msglayout *const* submsgs;
  const upb_msglayout_field *fields;
  uint16_t size;
  uint16_t field_count;
  bool extendable;
  uint8_t dense_below;
  uint8_t table_mask;
} upb_msglayout;

// Example of a generated mini-table, from foo.upb.c
static const upb_msglayout_field upb_test_MessageName__fields[2] = {
  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
};

const upb_msglayout upb_test_MessageName_msg_init = {
  NULL,
  &upb_test_MessageName__fields[0],
  UPB_SIZE(16, 16), 2, false, 2, 255,
};
```

The upb compiler computes separate layouts for 32 and 64 bit modes, since the
pointer size will be 4 or 8 bytes respectively. The upb compiler embeds both
sizes into the source code, using a `UPB_SIZE(size32, size64)` macro that can
choose the appropriate size at build time based on the size of `UINTPTR_MAX`.

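A macro of this kind can be sketched as follows. This is an assumed definition written for illustration, under the hypothetical name `TOY_SIZE`; the real `UPB_SIZE` may differ in detail. The example layout (one pointer followed by a 32-bit hasbits word) is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Assumed definition, for illustration only: pick the first argument when
// pointers are 32 bits wide, and the second otherwise.
#if UINTPTR_MAX == 0xffffffff
#define TOY_SIZE(size32, size64) size32
#else
#define TOY_SIZE(size32, size64) size64
#endif

// Hypothetical layout: a single pointer followed by a 32-bit hasbits word.
// The hasbits word therefore starts right after the pointer.
enum { kToyHasbitsOffset = TOY_SIZE(4, 8) };

// Returns the build-time-selected offset of the hasbits word.
static size_t ToyHasbitsOffset(void) { return kToyHasbitsOffset; }
```

Both candidate sizes live in the source, and the preprocessor discards the one that does not apply, so the same generated file can be compiled for either pointer width.
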
Note that `.upb.c` files contain data tables only. There is no "generated code"
except for the inline accessors in the `.upb.h` files: the entire footprint
of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.

## Memory Management Model

All memory management in upb is built around arenas. A message is never
considered to "own" the strings or sub-messages contained within it. Instead a
message and all of its sub-messages/strings/etc. are all owned by an arena and
are freed when the arena is freed. An entire message tree will probably be
owned by a single arena, but this is not required or enforced. As far as upb is
concerned, it is up to the client how to partition its arenas. upb only requires
that, when you ask it to serialize a message, all reachable messages are
still alive.

The arena supports both a user-supplied initial block and a custom allocation
callback, so there is a lot of flexibility in memory allocation strategy. The
allocation callback can even be `NULL` for heap-free operation. The main
constraint of the arena is that all of the memory in each arena must be freed
together.

`upb_arena` supports a novel operation called "fuse". When two arenas are fused
together, their lifetimes are irreversibly joined, such that none of the arena
blocks in either arena will be freed until *both* arenas are freed with
`upb_arena_free()`. This is useful when joining two messages from separate
arenas (making one a sub-message of the other). Fuse is a very cheap
operation, and an unlimited number of arenas can be fused together efficiently.

## Reflection and Descriptors

upb offers a fully-featured reflection library. There are two main ways of
using reflection:

1. You can load descriptors from strings using `upb_symtab_addfile()`.
   The upb runtime will dynamically create mini-tables like what the upb compiler
   would have created if you had compiled this type into a `.upb.c` file.
2. You can load descriptors using generated `.upbdefs.h` interfaces.
   This will load reflection that references the corresponding `.upb.c`
   mini-tables instead of building a new mini-table on the fly. This lets
   you reflect on generated types that are linked into your program.

upb's design for descriptors is similar to protobuf C++ in many ways, with
the following correspondences:

| C++ Type | upb type |
| ---------| ---------|
| `google::protobuf::DescriptorPool` | `upb_symtab` |
| `google::protobuf::Descriptor` | `upb_msgdef` |
| `google::protobuf::FieldDescriptor` | `upb_fielddef` |
| `google::protobuf::OneofDescriptor` | `upb_oneofdef` |
| `google::protobuf::EnumDescriptor` | `upb_enumdef` |
| `google::protobuf::FileDescriptor` | `upb_filedef` |
| `google::protobuf::ServiceDescriptor` | `upb_servicedef` |
| `google::protobuf::MethodDescriptor` | `upb_methoddef` |

As in C++, descriptors (defs) are created by loading a
`google_protobuf_FileDescriptorProto` into a `upb_symtab`. This creates and
links all of the def objects corresponding to that `.proto` file, and inserts
the names into a symbol table so they can be looked up by name.

Once you have loaded some descriptors into a `upb_symtab`, you can create and
manipulate messages using the interfaces defined in `upb/reflection.h`. If your
descriptors are linked to your generated layouts using option (2) above, you can
safely access the same messages using both reflection and generated interfaces.

@@ -0,0 +1,167 @@
# upb Design

[TOC]

upb is a protobuf kernel written in C. It is a fast and conformant implementation
of protobuf, with a low-level C API that is designed to be wrapped in other
languages.

upb is not designed to be used by applications directly. The C API is very
low-level, unsafe, and changes frequently. It is important that upb is able to
make breaking API changes as necessary, to avoid taking on technical debt that
would compromise upb's goals of small code size and fast performance.

## Design goals

Goals:

- Full protobuf conformance
- Small code size
- Fast performance (without compromising code size)
- Easy to wrap in language runtimes
- Easy to adapt to different memory management schemes (refcounting, GC, etc)

Non-Goals:

- Stable API
- Safe API
- Ergonomic API for applications

Parameters:

- C99
- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
- Uses pointer tagging, but avoids other implementation-defined behavior
- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
- No global state, fully re-entrant

## Arenas

All memory management in upb is done with arenas, represented by the type
`upb_Arena`. Arenas are an alternative to `malloc()` and `free()` that
significantly reduces the costs of memory allocation.

Arenas obtain blocks of memory using some underlying allocator (likely
`malloc()` and `free()`), and satisfy allocations using a simple bump allocator
that walks through each block in linear order. Allocations cannot be freed
individually: it is only possible to free the arena as a whole, which frees all
of the underlying blocks.
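
The bump-allocation scheme just described can be sketched with a toy arena. This is a minimal illustration, not upb's implementation: `ToyArena`, `ToyBlock`, and all `ToyArena_*` functions are hypothetical names, and the choices to start with 256-byte blocks, double the size of each new block, and round every request up to 8 bytes are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical toy arena, for illustration only.
typedef struct ToyBlock {
  struct ToyBlock* next;  // Previously allocated block, or NULL.
  size_t size;            // Usable bytes that follow this header.
} ToyBlock;

typedef struct {
  ToyBlock* blocks;  // Most recently allocated block first.
  char* ptr;         // Next free byte in the current block.
  size_t avail;      // Bytes left in the current block.
  size_t next_size;  // Size to request for the next block (assumed policy).
} ToyArena;

static void ToyArena_Init(ToyArena* a) {
  a->blocks = NULL;
  a->ptr = NULL;
  a->avail = 0;
  a->next_size = 256;
}

// Bump-allocates `n` bytes, fetching a new block from malloc() when the
// current block cannot satisfy the request. Returns NULL on failure.
static void* ToyArena_Malloc(ToyArena* a, size_t n) {
  if (n == 0) n = 1;
  if (n > SIZE_MAX - 7) return NULL;
  n = (n + 7) & ~(size_t)7;  // Keep every allocation 8-byte aligned.
  if (n > a->avail) {
    size_t size = n > a->next_size ? n : a->next_size;
    ToyBlock* block = malloc(sizeof(ToyBlock) + size);
    if (!block) return NULL;
    block->next = a->blocks;
    block->size = size;
    a->blocks = block;
    a->ptr = (char*)(block + 1);
    a->avail = size;
    a->next_size = size * 2;
  }
  void* ret = a->ptr;
  a->ptr += n;
  a->avail -= n;
  return ret;
}

// Frees every block at once and returns how many blocks were released.
static int ToyArena_Free(ToyArena* a) {
  int count = 0;
  ToyBlock* block = a->blocks;
  while (block) {
    ToyBlock* next = block->next;
    free(block);
    block = next;
    count++;
  }
  ToyArena_Init(a);
  return count;
}

// Makes 1000 small allocations, then frees them together.
// Returns the number of blocks freed, or -1 if an allocation failed.
static int ToyArena_Demo(void) {
  ToyArena arena;
  ToyArena_Init(&arena);
  for (int i = 0; i < 1000; i++) {
    int* p = ToyArena_Malloc(&arena, sizeof(*p));
    if (!p) {
      ToyArena_Free(&arena);
      return -1;
    }
    *p = i;
  }
  return ToyArena_Free(&arena);
}
```

In this sketch the common allocation path is a bounds check and a pointer increment, and freeing touches one header per block rather than one per allocation, so the demo releases far fewer blocks than the 1000 objects it allocated.
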

Here is an example of using the `upb_Arena` type:

```c
upb_Arena* arena = upb_Arena_New();

// Perform some allocations.
int* x = upb_Arena_Malloc(arena, sizeof(*x));
int* y = upb_Arena_Malloc(arena, sizeof(*y));

// We cannot free `x` and `y` separately, we can only free the arena
// as a whole.
upb_Arena_Free(arena);
```

upb uses arenas for all memory management, and this fact is reflected in the API
for all upb data structures. All upb functions that allocate take a
`upb_Arena*` parameter and perform allocations using that arena rather than
calling `malloc()` or `free()`.

```c
// upb API to create a message.
UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
                                     upb_Arena* arena);

void MakeMessage(const upb_MiniTable* mini_table) {
  upb_Arena* arena = upb_Arena_New();

  // This message is allocated on our arena.
  upb_Message* msg = upb_Message_New(mini_table, arena);

  // We can free the arena whenever we want, but we cannot free the
  // message separately from the arena.
  upb_Arena_Free(arena);

  // msg is now deleted.
}
```

Arenas are a key part of upb's performance story. Parsing a large protobuf
payload usually involves rapidly creating a series of messages, arrays (repeated
fields), and maps. It is crucial for parsing performance that these allocations
are as fast as possible. Equally important, freeing the tree of messages should
be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg
n)`.

### Avoiding Dangling Pointers

Objects allocated on an arena will frequently contain pointers to other
arena-allocated objects. For example, a `upb_Message` will have pointers to
sub-messages that are also arena-allocated.

Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
automatic safety from dangling pointers. Instead, upb provides tools to help
bridge between higher-level memory management schemes (GC, refcounting, RAII,
borrow checkers) and arenas.

If there is only one arena, dangling pointers within the arena are impossible,
because all objects are freed at the same time. This is the simplest case. The
user must still be careful not to keep dangling pointers that point at arena
memory after it has been freed, but dangling pointers *between* the arena
objects will be impossible.

But what if there are multiple arenas? If we have a pointer from one arena to
another, how do we ensure that this will not become a dangling pointer?

To help with the multiple arena case, upb provides a primitive called "fuse".

```c
// Fuses the lifetimes of `a` and `b`.  None of the blocks from `a` or `b`
// will be freed until both arenas are freed.
UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
```

When two arenas are fused together, their lifetimes are irreversibly joined,
such that none of the arena blocks in either arena will be freed until *both*
arenas are freed with `upb_Arena_Free()`. This means that dangling pointers
between the two arenas will no longer be possible.

Fuse is useful when joining two messages from separate arenas (making one a
sub-message of the other). Fuse is a relatively cheap operation, on the order
of 150ns, and is very nearly `O(1)` in the number of arenas being fused (the
true complexity is the inverse Ackermann function, which grows extremely
slowly).

Each arena does consume some memory, so repeatedly creating and fusing an
additional arena is not free, but the CPU cost of fusing two arenas together is
modest.
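
An inverse Ackermann bound is characteristic of a union-find (disjoint-set) structure with path compression and union by rank. The sketch below assumes that approach to show how joined lifetimes could be modeled; it is a single-threaded toy, not upb's data structure, and `ToyFusedArena` and all `ToyFused_*` names are hypothetical. The "blocks" are represented only by a `released` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical toy model of fused arenas, for illustration only.
typedef struct ToyFusedArena {
  struct ToyFusedArena* parent;  // NULL if this arena is its group's root.
  int rank;       // Upper bound on tree height (meaningful on roots).
  int live;       // Arenas in the group not yet freed (meaningful on roots).
  bool released;  // Set on the root once the whole group has been freed.
} ToyFusedArena;

static void ToyFused_Init(ToyFusedArena* a) {
  a->parent = NULL;
  a->rank = 0;
  a->live = 1;
  a->released = false;
}

// Finds the root of the arena's group, compressing the path as it goes.
static ToyFusedArena* ToyFused_Root(ToyFusedArena* a) {
  ToyFusedArena* root = a;
  while (root->parent) root = root->parent;
  while (a->parent) {
    ToyFusedArena* next = a->parent;
    a->parent = root;
    a = next;
  }
  return root;
}

// Joins the lifetimes of `a` and `b` using union by rank.
static void ToyFused_Fuse(ToyFusedArena* a, ToyFusedArena* b) {
  ToyFusedArena* ra = ToyFused_Root(a);
  ToyFusedArena* rb = ToyFused_Root(b);
  if (ra == rb) return;  // Already in the same group.
  if (ra->rank < rb->rank) {
    ToyFusedArena* tmp = ra;
    ra = rb;
    rb = tmp;
  }
  rb->parent = ra;
  ra->live += rb->live;
  if (ra->rank == rb->rank) ra->rank++;
}

// Frees one arena. The group's memory is released only once every arena
// that was fused into the group has been freed.
static void ToyFused_Free(ToyFusedArena* a) {
  ToyFusedArena* root = ToyFused_Root(a);
  if (--root->live == 0) root->released = true;
}

static bool ToyFused_IsReleased(ToyFusedArena* a) {
  return ToyFused_Root(a)->released;
}

// Fuses two arenas, frees one, and checks that the memory outlives it.
static bool ToyFused_Demo(void) {
  ToyFusedArena a, b;
  ToyFused_Init(&a);
  ToyFused_Init(&b);
  ToyFused_Fuse(&a, &b);
  ToyFused_Free(&a);
  bool alive_after_first_free = !ToyFused_IsReleased(&b);
  ToyFused_Free(&b);
  return alive_after_first_free && ToyFused_IsReleased(&a);
}
```

The shared counter on the root is what makes the join irreversible in this model: once two groups are merged there is no record of which arenas came from which side, so the only possible outcome is that everything is released together.
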

### Initial Block and Custom Allocators

`upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
underlying blocks. But this default strategy can be customized to support
the needs of a particular language.

The lowest-level function for creating a `upb_Arena` is:

```c
// Creates an arena from the given initial block (if any -- n may be 0).
// Additional blocks will be allocated from |alloc|.  If |alloc| is NULL,
// this is a fixed-size arena and cannot grow.
UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
```

The `n`-byte buffer starting at `mem` will be used as an "initial block", which
is used to satisfy allocations before calling any underlying allocation
function. Note that the `upb_Arena` itself will be allocated from the initial
block if possible, so the amount of memory available for allocation from the
arena will be less than `n`.

The `alloc` parameter specifies a custom memory allocation function which
will be used once the initial block is exhausted. The user can pass `NULL`
as the allocation function, in which case the initial block is the only memory
available in the arena. This can allow upb to be used even in situations where
there is no heap.

It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
operations like `upb_Message_New()` should be checked for failure if there is
any possibility that a fixed-size arena is in use.
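
The two consequences above (less than `n` bytes are usable, and allocation can fail) can be seen in a toy fixed-size arena. This sketch is not upb's implementation: `ToyFixedArena` and its functions are hypothetical, the 8-byte alignment requirement is an assumption of the sketch, and the demo uses `malloc()` only as a stand-in for a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical fixed-size arena whose header lives inside the initial block.
typedef struct {
  char* ptr;     // Next free byte.
  size_t avail;  // Bytes remaining in the initial block.
} ToyFixedArena;

// Places the arena header at the start of `mem`. Returns NULL if the buffer
// is not 8-byte aligned or is too small to hold the header.
static ToyFixedArena* ToyFixedArena_Init(void* mem, size_t n) {
  if (!mem || ((uintptr_t)mem & 7) != 0) return NULL;
  if (n < sizeof(ToyFixedArena)) return NULL;
  ToyFixedArena* a = mem;
  a->ptr = (char*)mem + sizeof(ToyFixedArena);
  a->avail = n - sizeof(ToyFixedArena);
  return a;
}

// Bump-allocates from the initial block. There is no fallback allocator,
// so this returns NULL once the block is exhausted.
static void* ToyFixedArena_Malloc(ToyFixedArena* a, size_t n) {
  if (n == 0) n = 1;
  if (n > SIZE_MAX - 7) return NULL;
  n = (n + 7) & ~(size_t)7;
  if (n > a->avail) return NULL;
  void* ret = a->ptr;
  a->ptr += n;
  a->avail -= n;
  return ret;
}

// Counts how many 8-byte allocations fit in a 64-byte initial block.
// Returns -1 if the block could not be set up.
static int ToyFixedArena_Demo(void) {
  void* buf = malloc(64);  // Stand-in for a caller-supplied buffer.
  if (!buf) return -1;
  ToyFixedArena* a = ToyFixedArena_Init(buf, 64);
  int count = -1;
  if (a) {
    count = 0;
    while (ToyFixedArena_Malloc(a, 8)) count++;
  }
  free(buf);
  return count;
}
```

A 64-byte block could hold eight 8-byte objects if nothing else were stored in it, but the demo yields fewer because the header is carved out of the same block. The loop ends on the first `NULL` return, which is the failure case that callers of a fixed-size arena need to check for.
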