6.4 KiB
upb Design
[TOC]
upb is a protobuf kernel written in C. It is a fast and conformant implementation of protobuf, with a low-level C API that is designed to be wrapped in other languages.
upb is not designed to be used by applications directly. The C API is very low-level, unsafe, and changes frequently. It is important that upb is able to make breaking API changes as necessary, to avoid taking on technical debt that would compromise upb's goals of small code size and fast performance.
Design goals
Goals:
- Full protobuf conformance
- Small code size
- Fast performance (without compromising code size)
- Easy to wrap in language runtimes
- Easy to adapt to different memory management schemes (refcounting, GC, etc)
Non-Goals:
- Stable API
- Safe API
- Ergonomic API for applications
Parameters:
- C99
- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
- Uses pointer tagging, but avoids other implementation-defined behavior
- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
- No global state, fully re-entrant
Arenas
All memory management in upb uses arenas, using the type upb_Arena
. Arenas
are an alternative to malloc()
and free()
that significantly reduces the
costs of memory allocation.
Arenas obtain blocks of memory using some underlying allocator (likely
malloc()
and free()
), and satisfy allocations using a simple bump allocator
that walks through each block in linear order. Allocations cannot be freed
individually: it is only possible to free the arena as a whole, which frees all
of the underlying blocks.
Here is an example of using the upb_Arena
type:
upb_Arena* arena = upb_Arena_New();
// Perform some allocations.
int* x = upb_Arena_Malloc(arena, sizeof(*x));
int* y = upb_Arena_Malloc(arena, sizeof(*y));
// We cannot free `x` and `y` separately, we can only free the arena
// as a whole.
upb_Arena_Free(arena);
upb uses arenas for all memory management, and this fact is reflected in the API
for all upb data structures. All upb functions that allocate take a
upb_Arena*
parameter and perform allocations using that arena rather than
calling malloc()
or free()
.
// upb API to create a message.
UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
upb_Arena* arena);
void MakeMessage(const upb_MiniTable* mini_table) {
upb_Arena* arena = upb_Arena_New();
// This message is allocated on our arena.
upb_Message* msg = upb_Message_New(mini_table, arena);
// We can free the arena whenever we want, but we cannot free the
// message separately from the arena.
upb_Arena_Free(arena);
// msg is now deleted.
}
Arenas are a key part of upb's performance story. Parsing a large protobuf
payload usually involves rapidly creating a series of messages, arrays (repeated
fields), and maps. It is crucial for parsing performance that these allocations
are as fast as possible. Equally important, freeing the tree of messages should
be as fast as possible, and arenas can reduce this cost from O(n)
to O(lg n)
.
Avoiding Dangling Pointers
Objects allocated on an arena will frequently contain pointers to other
arena-allocated objects. For example, a upb_Message
will have pointers to
sub-messages that are also arena-allocated.
Unlike unique ownership schemes (such as unique_ptr<>
), arenas cannot provide
automatic safety from dangling pointers. Instead, upb provides tools to help
bridge between higher-level memory management schemes (GC, refcounting, RAII,
borrow checkers) and arenas.
If there is only one arena, dangling pointers within the arena are impossible, because all objects are freed at the same time. This is the simplest case. The user must still be careful not to keep dangling pointers that point at arena memory after it has been freed, but dangling pointers between the arena objects will be impossible.
But what if there are multiple arenas? If we have a pointer from one arena to another, how do we ensure that this will not become a dangling pointer?
To help with the multiple arena case, upb provides a primitive called "fuse".
// Fuses the lifetimes of `a` and `b`. None of the blocks from `a` or `b`
// will be freed until both arenas are freed.
UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
When two arenas are fused together, their lifetimes are irreversibly joined,
such that none of the arena blocks in either arena will be freed until both
arenas are freed with upb_Arena_Free()
. This means that dangling pointers
between the two arenas will no longer be possible.
Fuse is useful when joining two messages from separate arenas (making one a
sub-message of the other). Fuse is a relatively cheap operation, on the order
of 150ns, and is very nearly O(1)
in the number of arenas being fused (the
true complexity is the inverse Ackermann function, which grows extremely
slowly).
Each arena does consume some memory, so repeatedly creating and fusing an additional arena is not free, but the CPU cost of fusing two arenas together is modest.
Initial Block and Custom Allocators
upb_Arena
normally uses malloc()
and free()
to allocate and return its
underlying blocks. But this default strategy can be customized to support
the needs of a particular language.
The lowest-level function for creating a upb_Arena
is:
// Creates an arena from the given initial block (if any -- n may be 0).
// Additional blocks will be allocated from |alloc|. If |alloc| is NULL,
// this is a fixed-size arena and cannot grow.
UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
The buffer [mem, n]
will be used as an "initial block", which is used to
satisfy allocations before calling any underlying allocation function. Note
that the upb_Arena
itself will be allocated from the initial block if
possible, so the amount of memory available for allocation from the arena will
be less than n
.
The alloc
parameter specifies a custom memory allocation function which
will be used once the initial block is exhausted. The user can pass NULL
as the allocation function, in which case the initial block is the only memory
available in the arena. This can allow upb to be used even in situations where
there is no heap.
It follows that upb_Arena_Malloc()
is a fallible operation, and all allocating
operations like upb_Message_New()
should be checked for failure if there is
any possibility that a fixed size arena is in use.