Protocol Buffers - Google's data interchange format (gRPC dependency)
https://developers.google.com/protocol-buffers/
|
|
<!--- |
|
This document contains embedded graphviz diagrams inside ```dot blocks. |
|
|
|
To convert it to rendered form using render.py: |
|
$ ./render.py wrapping-upb.in.md |
|
|
|
You can also live-preview this document with all diagrams using Markdown Preview Enhanced |
|
in Visual Studio Code: |
|
https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced |
|
---> |
|
|
|
# Wrapping upb in other languages |
|
|
|
upb is a C kernel that is designed to be wrapped in other languages. This is a |
|
guide for creating a new protobuf implementation based on upb. |
|
|
|
## What you will need |
|
|
|
There are certain things that the language runtime must provide in order to
wrap upb.
|
|
|
1. **Finalizers, Destructors, or Cleaners**: This is the one unavoidable
|
requirement: the language *must* provide finalizers or destructors of some sort. |
|
There must be a way of calling a C function when the language GCs or otherwise |
|
destroys an object. We don't care much whether it is a finalizer, a destructor, |
|
or a cleaner, as long as it gets called eventually when the object is destroyed. |
|
Without finalizers, we would have no way of cleaning up upb data and everything |
|
would leak. |
|
2. **HashMap with weak values**: This is not an absolute requirement, but in |
|
languages with automatic memory management, we generally end up wanting a |
|
hash map with weak values to act as a `upb_msg* -> wrapper` object cache. |
|
We want the values to be weak (not the keys). |
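
As a rough illustration of the second point: in a runtime that has no built-in weak-valued map, a similar effect can be obtained by having each wrapper's finalizer remove its own cache entry. The sketch below is a toy chained hash table in plain C. The names `cache_t`, `cache_get`, `cache_add`, and `cache_remove` are hypothetical and are not part of upb; the `const void *` key merely stands in for a `upb_msg*`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_BUCKETS 64

/* One entry mapping a C message pointer to its language-level wrapper. */
typedef struct cache_entry {
  const void *msg;          /* key: address of the C message */
  void *wrapper;            /* value: wrapper object, NOT owned by the cache */
  struct cache_entry *next; /* next entry in the same bucket */
} cache_entry;

typedef struct {
  cache_entry *buckets[CACHE_BUCKETS];
} cache_t;

size_t cache_bucket(const void *msg) {
  /* Drop the low bits, which are usually zero because of alignment. */
  return (size_t)(((uintptr_t)msg >> 4) % CACHE_BUCKETS);
}

/* Returns the cached wrapper for msg, or NULL if no wrapper is alive. */
void *cache_get(const cache_t *c, const void *msg) {
  for (cache_entry *e = c->buckets[cache_bucket(msg)]; e; e = e->next) {
    if (e->msg == msg) return e->wrapper;
  }
  return NULL;
}

/* Records a newly created wrapper. The cache takes no reference to it. */
void cache_add(cache_t *c, const void *msg, void *wrapper) {
  assert(cache_get(c, msg) == NULL);
  cache_entry *e = malloc(sizeof(*e));
  assert(e != NULL);
  size_t b = cache_bucket(msg);
  e->msg = msg;
  e->wrapper = wrapper;
  e->next = c->buckets[b];
  c->buckets[b] = e;
}

/* Meant to be called from the wrapper's finalizer, so that the cache never
 * hands out a pointer to a wrapper that has already been destroyed. */
void cache_remove(cache_t *c, const void *msg) {
  cache_entry **p = &c->buckets[cache_bucket(msg)];
  while (*p) {
    if ((*p)->msg == msg) {
      cache_entry *dead = *p;
      *p = dead->next;
      free(dead);
      return;
    }
    p = &(*p)->next;
  }
}
```

A binding would call `cache_get` before creating a wrapper for a given message pointer, `cache_add` right after creating one, and `cache_remove` from the finalizer. Where the runtime already offers a weak-valued map, that is likely the simpler choice.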
|
|
|
## Reflection vs. Direct Access |
|
|
|
Each language wrapping upb gets to decide whether it will access messages |
|
through *reflection* or through *direct access*. This decision has some deep |
|
implications that will affect the design, features, and performance of your |
|
library. |
|
|
|
### Reflection |
|
|
|
The simplest option is to load full reflection data into the upb library at |
|
runtime. You can load reflection data using serialized descriptors, which are a |
|
stable and widely supported format across all protobuf tooling. |
|
|
|
```c |
|
// A upb_symtab is a dynamic container that we can load reflection data into. |
|
upb_symtab* symtab = upb_symtab_new(); |
|
|
|
// We load reflection data via a serialized descriptor. The code generator |
|
// for your language should embed serialized descriptors into your generated |
|
// files; for each file you load, you can add the descriptor to the symtab as |
|
// shown. |
|
upb_arena *tmp = upb_arena_new(); |
|
google_protobuf_FileDescriptorProto* file = |
|
google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp); |
|
if (!file || !upb_symtab_addfile(symtab, file, NULL)) { |
|
// Handle error. |
|
} |
|
upb_arena_free(tmp); |
|
|
|
// At application exit, we free the symtab. |
|
upb_symtab_free(symtab); |
|
``` |
|
|
|
The `upb_symtab` will give you full access to all data from the `.proto` file, |
|
including convenient APIs like looking up a field by name. It will allow you to |
|
use JSON and text format. The APIs for accessing a message through reflection |
|
are simple and well-supported. These APIs cleanly encapsulate upb's internal |
|
implementation details. |
|
|
|
```c |
|
upb_symtab* symtab = BuildSymtab(); |
|
|
|
// Look up a message type in the symtab. |
|
const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage"); |
|
|
|
// Construct a new message of this type, via reflection. |
|
upb_arena *arena = upb_arena_new(); |
|
upb_msg *msg = upb_msg_new(m, arena); |
|
|
|
// Set a message field using reflection. |
|
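// Look up the field by name in the message definition.
const upb_fielddef* f = upb_msgdef_ntofz(m, "bar_field");
upb_msgval val = {.int32_val = 123};
// Note that we set the field on the message, not on its definition.
upb_msg_set(msg, f, val, arena);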
|
|
|
// Free the message and symtab. |
|
upb_arena_free(arena); |
|
upb_symtab_free(symtab); |
|
``` |
|
|
|
Using reflection is a natural choice in heavily reflective, dynamic runtimes
like Python, Ruby, PHP, or Lua. These languages generally perform method
dispatch through a dictionary/hash table anyway, so we are not adding any extra
overhead by using upb's hash table to look up fields by name at field access
time.
|
|
|
### Direct Access |
|
|
|
Using reflection has some downsides. Reflection data is relatively large, both |
|
in your binary (at rest) and in RAM (at runtime). It contains names of |
|
everything, and these names will be exposed in your binary. Reflection APIs for |
|
accessing a message will have more overhead than you might want, especially if |
|
crossing the FFI boundary for your language runtime imposes significant |
|
overhead. |
|
|
|
We can reduce these overheads by using *direct access*. upb's parser and
serializer do not actually require full reflection data; they use a more compact
data structure known as **mini tables**. Mini tables will take up less space
than reflection data, both in the binary and in RAM, and they will not leak field
names. Mini tables will let us parse and serialize binary wire format data
without reflection.
|
|
|
```c |
|
// TODO: demonstrate upb API for loading mini table data at runtime. |
|
// This API does not exist yet. |
|
``` |
|
|
|
To access messages themselves without the reflection API, we will be using |
|
different, lower-level APIs that will require you to supply precise data such as |
|
the offset of a given field. This is information that will come from the upb |
|
compiler framework, and the correctness (and even memory safety!) of the program |
|
will rely on you passing these values through from the upb compiler libraries to |
|
the upb runtime correctly. |
|
|
|
```c |
|
// TODO: demonstrate using low-level APIs for direct field access. |
|
// These APIs do not exist yet. |
|
``` |
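
To make the idea of offset-based access concrete, here is a minimal sketch in plain C. It does not use upb at all: `toy_msg`, `toy_get_int64`, and `toy_set_int64` are hypothetical names, and the struct merely stands in for a message layout that, in a real binding, would be chosen by upb and communicated to the generated code as offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a message layout. In a real binding the layout would be
 * decided by upb, not by a hand-written C struct like this one. */
typedef struct {
  uint32_t hasbits;
  int32_t id;
  int64_t foo;
} toy_msg;

/* Reads an int64 field located `offset` bytes into the message. Nothing
 * validates the offset: a wrong value silently reads unrelated memory. */
int64_t toy_get_int64(const void *msg, size_t offset) {
  int64_t val;
  assert(msg != NULL);
  memcpy(&val, (const char *)msg + offset, sizeof(val));
  return val;
}

/* Writes an int64 field located `offset` bytes into the message. */
void toy_set_int64(void *msg, size_t offset, int64_t val) {
  assert(msg != NULL);
  memcpy((char *)msg + offset, &val, sizeof(val));
}
```

For example, `toy_set_int64(&m, offsetof(toy_msg, foo), 42)` stores 42 into `m.foo`. The accessors trust the caller's offset completely, which is exactly why the offsets must be passed through from the compiler libraries without error.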
|
|
|
It may even be possible in certain circumstances to bypass the upb API completely
and access raw field data directly at a given offset, using unsafe APIs like
`sun.misc.Unsafe`. This can theoretically allow for field access that is no
more expensive than referencing a struct/class field.
|
|
|
```java |
|
import sun.misc.Unsafe; |
|
|
|
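class FooProto {
  // Unsafe has no public constructor; read its "theUnsafe" field reflectively.
  private static final Unsafe UNSAFE = loadUnsafe();

  private final long addr;
  private final Arena arena;

  // Accessor that a Java library built on upb could conceivably generate.
  long getFoo() {
    // The offset 1234 came from the upb compiler library, and was injected by the
    // Java+upb code generator.
    return UNSAFE.getLong(addr + 1234);
  }

  private static Unsafe loadUnsafe() {
    try {
      java.lang.reflect.Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }
}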
|
``` |
|
|
|
It is always possible for a library built on direct access to also load reflection |
|
data as an add-on, optional package for when users want JSON, text format, or |
|
reflection-based access to a message. You do not give up any of these possibilities |
|
by using direct access. |
|
|
|
However, using direct access does have some noticeable downsides. It requires |
|
tighter coupling with upb's implementation details, as the mini table format is |
|
upb-specific and requires building your code generator against upb's compiler |
|
libraries. Any direct access of memory is especially tightly coupled, and would |
|
need to be changed if upb's in-memory format ever changes. It also is more |
|
prone to hard-to-debug memory errors if you make any mistakes. |
|
|
|
## Memory Management |
|
|
|
One of the core design challenges when wrapping upb is memory management. Every |
|
language runtime will have some memory management system, whether it is |
|
garbage collection, reference counting, manual memory management, or some hybrid |
|
of these. upb is written in C and uses arenas for memory management, but upb is |
|
designed to integrate with a wide variety of memory management schemes, and it |
|
provides a number of tools for making this integration as smooth as possible. |
|
|
|
### Arenas |
|
|
|
upb defines data structures in C to represent messages, arrays (repeated |
|
fields), and maps. A protobuf message is a hierarchical tree of these objects. |
|
For example, a relatively simple protobuf tree might look something like this: |
|
|
|
```dot {align="center"} |
|
digraph G { |
|
rankdir=LR; |
|
newrank=true; |
|
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
|
upb_msg -> upb_msg2; |
|
upb_msg -> upb_array; |
|
upb_msg [label="upb Message" fillcolor=1] |
|
upb_msg2 [label="upb Message"]; |
|
upb_array [label="upb Array"] |
|
} |
|
``` |
|
|
|
All upb objects are allocated from an arena. An arena lets you allocate objects |
|
individually, but you cannot free individual objects; you can only free the arena |
|
as a whole. When the arena is freed, all of the individual objects allocated |
|
from that arena are freed together. |
|
|
|
```dot {align="center"} |
|
digraph G { |
|
rankdir=LR; |
|
newrank=true; |
|
subgraph cluster_0 { |
|
label = "upb Arena" |
|
graph[style="rounded,filled" fillcolor=gray] |
|
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
|
upb_msg -> upb_array; |
|
upb_msg -> upb_msg2; |
|
upb_msg [label="upb Message" fillcolor=1] |
|
upb_msg2 [label="upb Message"]; |
|
upb_array [label="upb Array"]; |
|
} |
|
} |
|
``` |
|
|
|
In simple cases, the entire tree of objects will all live in a single arena. |
|
This has the nice property that there cannot be any dangling pointers between |
|
objects, since all objects are freed at the same time. |
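
The allocate-individually, free-together behavior described above can be sketched with a toy arena in plain C. This is not upb's arena and makes no claim about how upb implements one: `toy_arena`, `toy_arena_new`, `toy_arena_malloc`, and `toy_arena_free` are hypothetical names, and for simplicity the toy issues one `malloc` per object, whereas a production arena would typically carve objects out of larger blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation, chaining it to the arena. */
typedef union toy_block {
  union toy_block *next;
  max_align_t align; /* keeps the payload after the header suitably aligned */
} toy_block;

typedef struct {
  toy_block *head; /* most recent allocation */
  size_t count;    /* number of allocations made so far, for illustration */
} toy_arena;

toy_arena *toy_arena_new(void) {
  toy_arena *a = malloc(sizeof(*a));
  assert(a != NULL);
  a->head = NULL;
  a->count = 0;
  return a;
}

/* Allocates one object from the arena. There is deliberately no function
 * for freeing an individual object. */
void *toy_arena_malloc(toy_arena *a, size_t size) {
  toy_block *b = malloc(sizeof(*b) + size);
  if (!b) return NULL;
  b->next = a->head;
  a->head = b;
  a->count++;
  return b + 1; /* the payload starts right after the header */
}

/* Frees every object allocated from the arena, then the arena itself. */
void toy_arena_free(toy_arena *a) {
  toy_block *b = a->head;
  while (b) {
    toy_block *next = b->next;
    free(b);
    b = next;
  }
  free(a);
}
```

Because the only way to release memory is `toy_arena_free`, two objects from the same toy arena can point at each other freely: neither can outlive the other.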
|
|
|
However, upb allows you to create links between any two objects, whether or
|
not they are in the same arena. The library does not know or care what arenas |
|
the objects are in when you create links between them. |
|
|
|
```dot {align="center"} |
|
digraph G { |
|
rankdir=LR; |
|
newrank=true; |
|
subgraph cluster_0 { |
|
label = "upb Arena 1" |
|
graph[style="rounded,filled" fillcolor=gray] |
|
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
|
upb_msg -> upb_array; |
|
upb_msg -> upb_msg2; |
|
upb_msg [label="upb Message 1" fillcolor=1] |
|
upb_msg2 [label="upb Message 2"]; |
|
upb_array [label="upb Array"]; |
|
} |
|
subgraph cluster_1 { |
|
label = "upb Arena 2" |
|
graph[style="rounded,filled" fillcolor=gray] |
|
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1] |
|
upb_msg3; |
|
} |
|
upb_msg2 -> upb_msg3; |
|
upb_msg3 [label="upb Message 3"]; |
|
} |
|
``` |
|
|
|
When objects are in separate arenas, it is the user's responsibility to ensure
|
that there are no dangling pointers. In the example above, this means Arena 2 |
|
must outlive Message 1 and Message 2. |
|
|
|
### Integrating GC with upb |
|
|
|
In languages with automatic memory management, the goal is to handle all of the |
|
arenas behind the scenes, so that the user does not have to manage them manually |
|
or even know that they exist. |
|
|
|
We can achieve this goal if we set up the object graph in a particular way. The |
|
general strategy is to create wrapper objects around all of the C objects, |
|
including the arena. Our key goal is to make sure the arena wrapper is not |
|
GC'd until all of the C objects in that arena have become unreachable. |
|
|
|
For this example, we will assume we are wrapping upb in Python: |
|
|
|
```dot {align="center"} |
|
digraph G { |
|
rankdir=LR; |
|
newrank=true; |
|
compound=true; |
|
|
|
subgraph cluster_1 { |
|
label = "upb Arena" |
|
graph[style="rounded,filled" fillcolor=gray] |
|
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
|
upb_msg -> upb_array [style=dashed]; |
|
upb_msg -> upb_msg2 [style=dashed]; |
|
upb_msg [label="upb Message" fillcolor=1] |
|
upb_msg2 [label="upb Message"]; |
|
upb_array [label="upb Array"] |
|
dummy [style=invis] |
|
} |
|
subgraph cluster_python { |
|
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2] |
|
peripheries=0 |
|
py_upb_msg [label="Python Message"]; |
|
py_upb_msg2 [label="Python Message"]; |
|
py_upb_arena [label="Python Arena"]; |
|
} |
|
py_upb_msg -> upb_msg [style=dashed]; |
|
py_upb_msg2->upb_msg2 [style=dashed]; |
|
py_upb_msg2 -> py_upb_arena [color=springgreen4]; |
|
py_upb_msg -> py_upb_arena [color=springgreen4]; |
|
py_upb_arena -> dummy [lhead=cluster_1, color=red]; |
|
{ |
|
rank=same; |
|
upb_msg; |
|
py_upb_msg; |
|
} |
|
{ |
|
rank=same; |
|
upb_array; |
|
upb_msg2; |
|
py_upb_msg2; |
|
} |
|
{ rank=same; |
|
dummy; |
|
py_upb_arena; |
|
} |
|
dummy->upb_array [style=invis]; |
|
dummy->upb_msg2 [style=invis]; |
|
|
|
subgraph cluster_01 { |
|
node [shape=plaintext] |
|
peripheries=0 |
|
key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> |
|
<tr><td align="right" port="i1">raw ptr</td></tr> |
|
<tr><td align="right" port="i2">unique ptr</td></tr> |
|
<tr><td align="right" port="i3">shared (GC) ptr</td></tr> |
|
</table>>] |
|
key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> |
|
<tr><td port="i1"> </td></tr> |
|
<tr><td port="i2"> </td></tr> |
|
<tr><td port="i3"> </td></tr> |
|
</table>>] |
|
key:i1:e -> key2:i1:w [style=dashed] |
|
key:i2:e -> key2:i2:w [color=red] |
|
key:i3:e -> key2:i3:w [color=springgreen4] |
|
} |
|
key2:i1:w -> upb_msg [style=invis]; |
|
{ |
|
rank=same; |
|
key; |
|
upb_msg; |
|
} |
|
} |
|
``` |
|
|
|
In this example we have three different kinds of pointers: |
|
|
|
* **raw ptr**: This is a pointer that carries no ownership. |
|
* **unique ptr**: This is a pointer that has *unique ownership* of the target. The owner
|
will free the target in its destructor (or finalizer, or cleaner). There can |
|
only be a single unique pointer to a given object. |
|
* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the |
|
target. Many objects can point to the target, and the target will be deleted |
|
only when all such references are gone. In a runtime with automatic memory |
|
management (GC), this is a reference that participates in GC. In Python such |
|
references use reference counting, but in other VMs they may use mark and |
|
sweep or some other form of GC instead. |
|
|
|
The Python Message wrappers have only raw pointers to the underlying message, |
|
but they contain a shared pointer to the arena that will ensure that the raw |
|
pointer remains valid. Only when all message wrapper objects are destroyed |
|
will the Python Arena become unreachable, and the upb arena ultimately freed. |
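
The ownership scheme in the diagram can be sketched in plain C with a hand-rolled reference count standing in for the runtime's GC. Everything here is hypothetical and independent of upb: `toy_arena` is a placeholder for the C arena, and `arena_wrapper` and `msg_wrapper` play the roles of the "Python Arena" and "Python Message" boxes.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder for the C arena. A real binding would hold a upb arena here. */
typedef struct {
  int *freed_flag; /* set to 1 when the arena is freed, for demonstration */
} toy_arena;

/* "Python Arena": uniquely owns the C arena and is shared by wrappers. */
typedef struct {
  int refcount;
  toy_arena *arena; /* unique ptr */
} arena_wrapper;

/* "Python Message": borrows the message, shares ownership of the arena. */
typedef struct {
  void *msg;            /* raw ptr into the arena, no ownership */
  arena_wrapper *arena; /* shared ptr, keeps the arena alive */
} msg_wrapper;

/* Creates an arena wrapper holding one reference for the caller. */
arena_wrapper *arena_wrapper_new(int *freed_flag) {
  arena_wrapper *w = malloc(sizeof(*w));
  assert(w != NULL);
  w->refcount = 1;
  w->arena = malloc(sizeof(*w->arena));
  assert(w->arena != NULL);
  w->arena->freed_flag = freed_flag;
  return w;
}

/* Drops one reference; the last one frees the C arena and the wrapper. */
void arena_wrapper_unref(arena_wrapper *w) {
  if (--w->refcount == 0) {
    *w->arena->freed_flag = 1; /* a real binding would free the arena here */
    free(w->arena);
    free(w);
  }
}

/* Creates a message wrapper that takes its own reference to the arena. */
msg_wrapper *msg_wrapper_new(arena_wrapper *a, void *msg) {
  msg_wrapper *m = malloc(sizeof(*m));
  assert(m != NULL);
  m->msg = msg;
  m->arena = a;
  a->refcount++;
  return m;
}

/* Finalizer the runtime would call when a message wrapper is collected. */
void msg_wrapper_finalize(msg_wrapper *m) {
  arena_wrapper_unref(m->arena);
  free(m);
}
```

In this sketch the arena is freed only after the creator's reference and every message wrapper's reference have been dropped, in whatever order that happens. In a real binding the runtime's own reference counting or tracing GC would take the place of the manual `refcount` field.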
|
|
|
### Links between arenas with "Fuse" |
|
|
|
The design given above works well for objects that live in a single arena. But |
|
what if a user wants to create a link between two objects in different arenas? |
|
|
|
TODO |
|
|
|
## UTF-8 vs. UTF-16 |
|
|
|
TODO |
|
|
|
## Object Cache |
|
|
|
TODO
|
|
|