<!---
This document contains embedded graphviz diagrams inside ```dot blocks.

To convert it to rendered form using render.py:
  $ ./render.py wrapping-upb.in.md

You can also live-preview this document with all diagrams using Markdown Preview Enhanced
in Visual Studio Code:
  https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
--->

# Wrapping upb in other languages

upb is a C kernel that is designed to be wrapped in other languages. This is a
guide for creating a new protobuf implementation based on upb.

## What you will need

There are certain things that the language runtime must provide in order to
wrap upb.

1. **Finalizers, Destructors, or Cleaners**: This is the one unavoidable
   requirement: the language *must* provide finalizers or destructors of some sort.
   There must be a way of calling a C function when the language GCs or otherwise
   destroys an object. We don't care much whether it is a finalizer, a destructor,
   or a cleaner, as long as it gets called eventually when the object is destroyed.
   Without finalizers, we would have no way of cleaning up upb data and everything
   would leak.
2. **HashMap with weak values**: This is not an absolute requirement, but in
   languages with automatic memory management, we generally end up wanting a
   hash map with weak values to act as a `upb_msg* -> wrapper` object cache.
   We want the values to be weak (not the keys).

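As a rough illustration of how these two requirements fit together, here is a minimal sketch in plain C of an object cache whose values behave weakly. None of this is upb API: `Wrapper`, `wrapper_for`, `cache_get`, and `wrapper_finalize` are hypothetical names, and the fixed-size table merely stands in for whatever weak map your runtime provides. The key idea is that the cache never owns a wrapper; the wrapper's finalizer removes its own entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical wrapper, standing in for a language-level object.
typedef struct Wrapper {
  const void *msg;       // Raw pointer to the wrapped C message.
  struct Wrapper *next;  // Link in the cache's bucket chain.
} Wrapper;

// A tiny cache: message pointer -> Wrapper. It does not own its values.
#define CACHE_BUCKETS 64
static Wrapper *cache[CACHE_BUCKETS];

static size_t bucket_of(const void *msg) {
  return (size_t)(((uintptr_t)msg >> 4) % CACHE_BUCKETS);
}

// Returns the live wrapper for msg, or NULL if there is none.
Wrapper *cache_get(const void *msg) {
  for (Wrapper *w = cache[bucket_of(msg)]; w; w = w->next) {
    if (w->msg == msg) return w;
  }
  return NULL;
}

// Returns the canonical wrapper for msg, creating it on first use.
Wrapper *wrapper_for(const void *msg) {
  Wrapper *w = cache_get(msg);
  if (w) return w;
  w = malloc(sizeof(*w));
  if (!w) return NULL;
  w->msg = msg;
  w->next = cache[bucket_of(msg)];
  cache[bucket_of(msg)] = w;
  return w;
}

// Finalizer: the runtime would call this when the wrapper is collected.
// Removing the entry here is what makes the cached value "weak".
void wrapper_finalize(Wrapper *w) {
  assert(w != NULL);
  Wrapper **p = &cache[bucket_of(w->msg)];
  while (*p && *p != w) p = &(*p)->next;
  if (*p) *p = w->next;
  free(w);
}
```

Looking up the same message twice yields the same wrapper, which preserves object identity in the host language; once the wrapper is finalized, the next lookup creates a fresh one.
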
## Reflection vs. Direct Access

Each language wrapping upb gets to decide whether it will access messages
through *reflection* or through *direct access*. This decision has some deep
implications that will affect the design, features, and performance of your
library.

### Reflection

The simplest option is to load full reflection data into the upb library at
runtime. You can load reflection data using serialized descriptors, which are a
stable and widely supported format across all protobuf tooling.

```c
// A upb_symtab is a dynamic container that we can load reflection data into.
upb_symtab* symtab = upb_symtab_new();

// We load reflection data via a serialized descriptor. The code generator
// for your language should embed serialized descriptors into your generated
// files; for each file you load, you can add the descriptor to the symtab as
// shown.
upb_arena *tmp = upb_arena_new();
google_protobuf_FileDescriptorProto* file =
    google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp);
if (!file || !upb_symtab_addfile(symtab, file, NULL)) {
  // Handle error.
}
upb_arena_free(tmp);

// At application exit, we free the symtab.
upb_symtab_free(symtab);
```

The `upb_symtab` will give you full access to all data from the `.proto` file,
including convenient APIs like looking up a field by name. It will allow you to
use JSON and text format. The APIs for accessing a message through reflection
are simple and well-supported. These APIs cleanly encapsulate upb's internal
implementation details.

```c
upb_symtab* symtab = BuildSymtab();

// Look up a message type in the symtab.
const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage");

// Construct a new message of this type, via reflection.
upb_arena *arena = upb_arena_new();
upb_msg *msg = upb_msg_new(m, arena);

// Set a message field using reflection. The field is looked up by name on
// the message definition, and the value is stored into the message itself.
const upb_fielddef* f = upb_msgdef_ntofz(m, "bar_field");
upb_msgval val = {.int32_val = 123};
upb_msg_set(msg, f, val, arena);

// Free the message and symtab.
upb_arena_free(arena);
upb_symtab_free(symtab);
```

Using reflection is a natural choice in heavily reflective, dynamic runtimes
like Python, Ruby, PHP, or Lua. These languages generally perform method
dispatch through a dictionary/hash table anyway, so we are not adding any extra
overhead by using upb's hash table to look up fields by name at field access
time.

### Direct Access

Using reflection has some downsides. Reflection data is relatively large, both
in your binary (at rest) and in RAM (at runtime). It contains names of
everything, and these names will be exposed in your binary. Reflection APIs for
accessing a message will have more overhead than you might want, especially if
crossing the FFI boundary for your language runtime imposes significant
overhead.

We can reduce these overheads by using *direct access*. upb's parser and
serializer do not actually require full reflection data; they use a more compact
data structure known as **mini tables**. Mini tables will take up less space
than reflection, both in the binary and in RAM, and they will not leak field
names. Mini tables will let us parse and serialize binary wire format data
without reflection.

```c
// TODO: demonstrate upb API for loading mini table data at runtime.
// This API does not exist yet.
```

To access messages themselves without the reflection API, we will be using
different, lower-level APIs that will require you to supply precise data such as
the offset of a given field. This is information that will come from the upb
compiler framework, and the correctness (and even memory safety!) of the program
will rely on you passing these values through from the upb compiler libraries to
the upb runtime correctly.

```c
// TODO: demonstrate using low-level APIs for direct field access.
// These APIs do not exist yet.
```

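Until those APIs exist, the general idea of offset-based access can be sketched in plain C. This is an illustration only and assumes nothing about upb's real message layout: `FakeMsg`, `read_int64_at`, and `write_int32_at` are hypothetical names, and the offsets come from `offsetof` rather than from the upb compiler libraries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Stand-in for a message layout chosen by a compiler; not upb's real layout.
typedef struct {
  uint32_t hasbits;
  int32_t bar;
  int64_t foo;
} FakeMsg;

// Reads an int64 field located `offset` bytes from the start of the message.
// memcpy sidesteps alignment and aliasing problems. A wrong offset silently
// reads unrelated bytes (or reads out of bounds), which is why the offset
// must be passed through from the code generator unchanged.
int64_t read_int64_at(const void *msg, size_t offset) {
  assert(msg != NULL);
  int64_t val;
  memcpy(&val, (const char *)msg + offset, sizeof(val));
  return val;
}

// Writes an int32 field located `offset` bytes from the start of the message.
void write_int32_at(void *msg, size_t offset, int32_t val) {
  assert(msg != NULL);
  memcpy((char *)msg + offset, &val, sizeof(val));
}
```
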
It may even be possible in certain circumstances to bypass the upb API completely
and access raw field data directly at a given offset, using unsafe APIs like
`sun.misc.Unsafe`. This can theoretically allow for field access that is no
more expensive than referencing a struct/class field.

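```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

class FooProto {
  private static final Unsafe UNSAFE = loadUnsafe();
  private final long addr;
  private final Arena arena;  // Hypothetical wrapper class; keeps addr valid.
  FooProto(long addr, Arena arena) { this.addr = addr; this.arena = arena; }

  // Accessor that a Java library built on upb could conceivably generate.
  long getFoo() {
    // The offset 1234 came from the upb compiler library, and was injected by the
    // Java+upb code generator.
    return UNSAFE.getLong(addr + 1234);
  }

  private static Unsafe loadUnsafe() {  // Unsafe has no public constructor.
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }
}
```
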
It is always possible for a library built on direct access to also load reflection
data as an add-on, optional package for when users want JSON, text format, or
reflection-based access to a message. You do not give up any of these possibilities
by using direct access.

However, using direct access does have some noticeable downsides. It requires
tighter coupling with upb's implementation details, as the mini table format is
upb-specific and requires building your code generator against upb's compiler
libraries. Any direct access of memory is especially tightly coupled, and would
need to be changed if upb's in-memory format ever changes. It is also more
prone to hard-to-debug memory errors if you make any mistakes.

## Memory Management

One of the core design challenges when wrapping upb is memory management. Every
language runtime will have some memory management system, whether it is
garbage collection, reference counting, manual memory management, or some hybrid
of these. upb is written in C and uses arenas for memory management, but upb is
designed to integrate with a wide variety of memory management schemes, and it
provides a number of tools for making this integration as smooth as possible.

### Arenas

upb defines data structures in C to represent messages, arrays (repeated
fields), and maps. A protobuf message is a hierarchical tree of these objects.
For example, a relatively simple protobuf tree might look something like this:

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  graph[bgcolor=transparent]
  node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
  upb_msg -> upb_msg2;
  upb_msg -> upb_array;
  upb_msg [label="upb Message" fillcolor=1]
  upb_msg2 [label="upb Message"];
  upb_array [label="upb Array"]
}
```

All upb objects are allocated from an arena. An arena lets you allocate objects
individually, but you cannot free individual objects; you can only free the arena
as a whole. When the arena is freed, all of the individual objects allocated
from that arena are freed together.

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  graph[bgcolor=transparent]
  subgraph cluster_0 {
    label = "upb Arena"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
    upb_msg -> upb_array;
    upb_msg -> upb_msg2;
    upb_msg [label="upb Message" fillcolor=1]
    upb_msg2 [label="upb Message"];
    upb_array [label="upb Array"];
  }
}
```

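To make the "allocate individually, free together" contract concrete, here is a toy arena in plain C. It is a sketch of the semantics only, not upb's implementation or API: `ToyArena`, `toy_arena_new`, `toy_arena_malloc`, and `toy_arena_free` are hypothetical names, and for simplicity this version performs one `malloc` per object instead of carving objects out of larger blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Header placed in front of every allocation. The max_align_t member keeps
// the payload that follows suitably aligned for any object type.
typedef union Block {
  union Block *next;
  max_align_t align;
} Block;

// Toy arena: remembers every allocation so they can all be freed together.
typedef struct {
  Block *head;
  size_t live;  // Number of objects allocated from this arena.
} ToyArena;

ToyArena *toy_arena_new(void) {
  return calloc(1, sizeof(ToyArena));
}

// Allocates `size` bytes owned by the arena. There is deliberately no
// function for freeing a single object.
void *toy_arena_malloc(ToyArena *a, size_t size) {
  assert(a != NULL);
  Block *b = malloc(sizeof(Block) + size);
  if (!b) return NULL;
  b->next = a->head;
  a->head = b;
  a->live++;
  return b + 1;  // The payload starts right after the header.
}

// Frees every object allocated from the arena, then the arena itself.
// Returns the number of objects that were released.
size_t toy_arena_free(ToyArena *a) {
  assert(a != NULL);
  size_t n = a->live;
  Block *b = a->head;
  while (b) {
    Block *next = b->next;
    free(b);
    b = next;
  }
  free(a);
  return n;
}
```
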
In simple cases, the entire tree of objects will all live in a single arena.
This has the nice property that there cannot be any dangling pointers between
objects, since all objects are freed at the same time.

However, upb allows you to create links between any two objects, whether or
not they are in the same arena. The library does not know or care what arenas
the objects are in when you create links between them.

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  graph[bgcolor=transparent]
  subgraph cluster_0 {
    label = "upb Arena 1"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
    upb_msg -> upb_array;
    upb_msg -> upb_msg2;
    upb_msg [label="upb Message 1" fillcolor=1]
    upb_msg2 [label="upb Message 2"];
    upb_array [label="upb Array"];
  }
  subgraph cluster_1 {
    label = "upb Arena 2"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1]
    upb_msg3;
  }
  upb_msg2 -> upb_msg3;
  upb_msg3 [label="upb Message 3"];
}
```

When objects are on separate arenas, it is the user's responsibility to ensure
that there are no dangling pointers. In the example above, this means Arena 2
must outlive Message 1 and Message 2.

### Integrating GC with upb

In languages with automatic memory management, the goal is to handle all of the
arenas behind the scenes, so that the user does not have to manage them manually
or even know that they exist.

We can achieve this goal if we set up the object graph in a particular way. The
general strategy is to create wrapper objects around all of the C objects,
including the arena. Our key goal is to make sure the arena wrapper is not
GC'd until all of the C objects in that arena have become unreachable.

For this example, we will assume we are wrapping upb in Python:

```dot {align="center"}
digraph G {
  graph[bgcolor=transparent]
  rankdir=LR;
  newrank=true;
  compound=true;

  subgraph cluster_1 {
    label = "upb Arena"
    graph[bgcolor=transparent style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
    upb_msg -> upb_array [style=dashed];
    upb_msg -> upb_msg2 [style=dashed];
    upb_msg [label="upb Message" fillcolor=1]
    upb_msg2 [label="upb Message"];
    upb_array [label="upb Array"]
    dummy [style=invis]
  }
  subgraph cluster_python {
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2]
    peripheries=0
    py_upb_msg [label="Python Message"];
    py_upb_msg2 [label="Python Message"];
    py_upb_arena [label="Python Arena"];
  }
  py_upb_msg -> upb_msg [style=dashed];
  py_upb_msg2->upb_msg2 [style=dashed];
  py_upb_msg2 -> py_upb_arena [color=springgreen4];
  py_upb_msg -> py_upb_arena [color=springgreen4];
  py_upb_arena -> dummy [lhead=cluster_1, color=red];
  {
    rank=same;
    upb_msg;
    py_upb_msg;
  }
  {
    rank=same;
    upb_array;
    upb_msg2;
    py_upb_msg2;
  }
  { rank=same;
    dummy;
    py_upb_arena;
  }
  dummy->upb_array [style=invis];
  dummy->upb_msg2 [style=invis];

  subgraph cluster_01 {
    node [shape=plaintext]
    peripheries=0
    key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
      <tr><td align="right" port="i1">raw ptr</td></tr>
      <tr><td align="right" port="i2">unique ptr</td></tr>
      <tr><td align="right" port="i3">shared (GC) ptr</td></tr>
      </table>>]
    key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
      <tr><td port="i1"> </td></tr>
      <tr><td port="i2"> </td></tr>
      <tr><td port="i3"> </td></tr>
      </table>>]
    key:i1:e -> key2:i1:w [style=dashed]
    key:i2:e -> key2:i2:w [color=red]
    key:i3:e -> key2:i3:w [color=springgreen4]
  }
  key2:i1:w -> upb_msg [style=invis];
  {
    rank=same;
    key;
    upb_msg;
  }
}
```

In this example we have three different kinds of pointers:

* **raw ptr**: This is a pointer that carries no ownership.
* **unique ptr**: This is a pointer that has *unique ownership* of the target. The owner
  will free the target in its destructor (or finalizer, or cleaner). There can
  only be a single unique pointer to a given object.
* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the
  target. Many objects can point to the target, and the target will be deleted
  only when all such references are gone. In a runtime with automatic memory
  management (GC), this is a reference that participates in GC. In Python such
  references use reference counting, but in other VMs they may use mark and
  sweep or some other form of GC instead.

The Python Message wrappers have only raw pointers to the underlying message,
but they contain a shared pointer to the arena that will ensure that the raw
pointer remains valid. Only when all message wrapper objects are destroyed
will the Python Arena become unreachable, and the upb arena ultimately freed.

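The same ownership graph can be sketched in plain C, with manual reference counting standing in for the host language's GC. This is an illustration under stated assumptions rather than code from any real binding: `ArenaWrapper`, `MsgWrapper`, and their functions are hypothetical names, a plain `malloc`'d buffer stands in for the upb arena, and `arenas_freed` exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdlib.h>

static int arenas_freed = 0;  // Demo-only counter of finalized arenas.

// Wrapper around the C arena; shared by every message wrapper.
typedef struct {
  int refcount;   // Number of shared references held on this wrapper.
  void *c_arena;  // Unique ptr: this wrapper alone frees the C arena.
} ArenaWrapper;

// Wrapper around a single message.
typedef struct {
  void *c_msg;          // Raw ptr into the arena: carries no ownership.
  ArenaWrapper *arena;  // Shared ptr: keeps c_msg valid while we live.
} MsgWrapper;

// Creates an arena wrapper holding one reference for the caller.
ArenaWrapper *arena_wrapper_new(void) {
  ArenaWrapper *a = malloc(sizeof(*a));
  if (!a) return NULL;
  a->refcount = 1;
  a->c_arena = malloc(256);  // Stand-in for creating the C arena.
  return a;
}

// Drops one reference; dropping the last one runs the "finalizer".
void arena_wrapper_unref(ArenaWrapper *a) {
  assert(a != NULL && a->refcount > 0);
  if (--a->refcount == 0) {
    free(a->c_arena);  // Frees every message in the arena at once.
    free(a);
    arenas_freed++;
  }
}

// Wraps a message that lives in `a`, taking a shared reference on `a`.
MsgWrapper *msg_wrapper_new(ArenaWrapper *a, void *c_msg) {
  assert(a != NULL);
  MsgWrapper *m = malloc(sizeof(*m));
  if (!m) return NULL;
  m->c_msg = c_msg;
  m->arena = a;
  a->refcount++;
  return m;
}

// Finalizer for a message wrapper: releases its hold on the arena.
void msg_wrapper_free(MsgWrapper *m) {
  assert(m != NULL);
  arena_wrapper_unref(m->arena);
  free(m);
}
```

Even if the code that created the arena drops its own reference first, the arena stays alive until the last message wrapper has been finalized.
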
### Links between arenas with "Fuse"

The design given above works well for objects that live in a single arena. But
what if a user wants to create a link between two objects in different arenas?

TODO

## UTF-8 vs. UTF-16

TODO

## Object Cache

TODO