parent
11c9468759
commit
8716432bfa
3 changed files with 691 additions and 135 deletions
@ -1,14 +1,32 @@ |
||||
#!/usr/bin/env python3 |
||||
|
||||
import subprocess |
||||
import sys |
||||
|
||||
if len(sys.argv) < 2: |
||||
print("Must pass a filename argument") |
||||
sys.exit(1) |
||||
|
||||
in_file = sys.argv[1] |
||||
out_file = in_file.replace(".in.md", ".md") |
||||
in_filename = sys.argv[1] |
||||
out_filename = in_filename.replace(".in.md", ".md") |
||||
|
||||
if in_file == out_file: |
||||
if in_filename == out_filename: |
||||
print("File must end in .in.md") |
||||
sys.exit(1) |
||||
|
||||
with open(out_filename, "wb") as out_file, open(in_filename, "rb") as in_file: |
||||
for line in in_file: |
||||
if line.startswith(b"```dot"): |
||||
dot_lines = [] |
||||
while True: |
||||
dot_line = next(in_file) |
||||
if dot_line == b"```\n": |
||||
break |
||||
dot_lines.append(dot_line) |
||||
dot_input = b"".join(dot_lines) |
||||
svg = subprocess.check_output(['dot', '-Tsvg'], input=dot_input) |
||||
out_file.write(b"<div align=center>") |
||||
out_file.write(svg) |
||||
out_file.write(b"</div>") |
||||
else: |
||||
out_file.write(line) |
||||
|
@ -0,0 +1,374 @@ |
||||
|
||||
<!--- |
||||
This document contains embedded graphviz diagrams inside ```dot blocks. |
||||
|
||||
To convert it to rendered form using render.py: |
||||
$ ./render.py wrapping-upb.in.md |
||||
|
||||
You can also live-preview this document with all diagrams using Markdown Preview Enhanced |
||||
in Visual Studio Code: |
||||
https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced |
||||
---> |
||||
|
||||
# Wrapping upb in other languages |
||||
|
||||
upb is a C kernel that is designed to be wrapped in other languages. This is a |
||||
guide for creating a new protobuf implementation based on upb. |
||||
|
||||
## What you will need |
||||
|
||||
There are certain things that the language runtime must provide in order to be |
||||
wrapped by upb. |
||||
|
||||
1. **Finalizers, Destructors, or Cleaners**: This is one unavoidable |
||||
requirement: the language *must* provide finalizers or destructors of some sort. |
||||
There must be a way of calling a C function when the language GCs or otherwise |
||||
destroys an object. We don't care much whether it is a finalizer, a destructor, |
||||
or a cleaner, as long as it gets called eventually when the object is destroyed. |
||||
Without finalizers, we would have no way of cleaning up upb data and everything |
||||
would leak. |
||||
2. **HashMap with weak values**: This is not an absolute requirement, but in |
||||
languages with automatic memory management, we generally end up wanting a |
||||
hash map with weak values to act as a `upb_msg* -> wrapper` object cache. |
||||
We want the values to be weak (not the keys). |
||||
|
||||
## Reflection vs. Direct Access |
||||
|
||||
Each language wrapping upb gets to decide whether it will access messages |
||||
through *reflection* or through *direct access*. This decision has some deep |
||||
implications that will affect the design, features, and performance of your |
||||
library. |
||||
|
||||
### Reflection |
||||
|
||||
The simplest option is to load full reflection data into the upb library at |
||||
runtime. You can load reflection data using serialized descriptors, which are a |
||||
stable and widely supported format across all protobuf tooling. |
||||
|
||||
```c |
||||
// A upb_symtab is a dynamic container that we can load reflection data into. |
||||
upb_symtab* symtab = upb_symtab_new(); |
||||
|
||||
// We load reflection data via a serialized descriptor. The code generator |
||||
// for your language should embed serialized descriptors into your generated |
||||
// files; for each file you load, you can add the descriptor to the symtab as |
||||
// shown. |
||||
upb_arena *tmp = upb_arena_new(); |
||||
google_protobuf_FileDescriptorProto* file = |
||||
google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp); |
||||
if (!file || !upb_symtab_addfile(symtab, file, NULL)) { |
||||
// Handle error. |
||||
} |
||||
upb_arena_free(tmp); |
||||
|
||||
// At application exit, we free the symtab. |
||||
upb_symtab_free(symtab); |
||||
``` |
||||
|
||||
The `upb_symtab` will give you full access to all data from the `.proto` file, |
||||
including convenient APIs like looking up a field by name. It will allow you to |
||||
use JSON and text format. The APIs for accessing a message through reflection |
||||
are simple and well-supported. These APIs cleanly encapsulate upb's internal |
||||
implementation details. |
||||
|
||||
```c |
||||
upb_symtab* symtab = BuildSymtab(); |
||||
|
||||
// Look up a message type in the symtab. |
||||
const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage"); |
||||
|
||||
// Construct a new message of this type, via reflection. |
||||
upb_arena *arena = upb_arena_new(); |
||||
upb_msg *msg = upb_msg_new(m, arena); |
||||
|
||||
// Set a message field using reflection. |
||||
const upb_fielddef* f = upb_msgdef_ntof("bar_field"); |
||||
upb_msgval val = {.int32_val = 123}; |
||||
upb_msg_set(m, f, val, arena); |
||||
|
||||
// Free the message and symtab. |
||||
upb_arena_free(arena); |
||||
upb_symtab_free(symtab); |
||||
``` |
||||
|
||||
Using reflection is an natural choice in heavily reflective, dynamic runtimes |
||||
like Python, Ruby, PHP, or Lua. These languages generally perform method |
||||
dispatch through a dictionary/hash table anyway, so we are not adding any extra |
||||
overhead by using upb's hash table to lookup fields by name at field access |
||||
time. |
||||
|
||||
### Direct Access |
||||
|
||||
Using reflection has some downsides. Reflection data is relatively large, both |
||||
in your binary (at rest) and in RAM (at runtime). It contains names of |
||||
everything, and these names will be exposed in your binary. Reflection APIs for |
||||
accessing a message will have more overhead than you might want, especially if |
||||
crossing the FFI boundary for your language runtime imposes significant |
||||
overhead. |
||||
|
||||
We can reduce these overheads by using *direct access*. upb's parser and |
||||
serializer do not actually require full reflection data, they use a more compact |
||||
data structure known as **mini tables**. Mini tables will take up less space |
||||
than reflection, both in the binary and in RAM, and they will not leak field |
||||
names. Mini tables will let us parse and serialize binary wire format data |
||||
without reflection. |
||||
|
||||
```c |
||||
// TODO: demonstrate upb API for loading mini table data at runtime. |
||||
// This API does not exist yet. |
||||
``` |
||||
|
||||
To access messages themselves without the reflection API, we will be using |
||||
different, lower-level APIs that will require you to supply precise data such as |
||||
the offset of a given field. This is information that will come from the upb |
||||
compiler framework, and the correctness (and even memory safety!) of the program |
||||
will rely on you passing these values through from the upb compiler libraries to |
||||
the upb runtime correctly. |
||||
|
||||
```c |
||||
// TODO: demonstrate using low-level APIs for direct field access. |
||||
// These APIs do not exist yet. |
||||
``` |
||||
|
||||
It can even be possible in certain circumstances to bypass the upb API completely |
||||
and access raw field data directly at a given offset, using unsafe APIs like |
||||
`sun.misc.unsafe`. This can theoretically allow for field access that is no |
||||
more expensive than referencing a struct/class field. |
||||
|
||||
```java |
||||
import sun.misc.Unsafe; |
||||
|
||||
class FooProto { |
||||
private final long addr; |
||||
private final Arena arena; |
||||
|
||||
// Accessor that a Java library built on upb could conceivably generate. |
||||
long getFoo() { |
||||
// The offset 1234 came from the upb compiler library, and was injected by the |
||||
// Java+upb code generator. |
||||
return Unsafe.getLong(self.addr + 1234); |
||||
} |
||||
} |
||||
``` |
||||
|
||||
It is always possible for a library built on direct access to also load reflection |
||||
data as an add-on, optional package for when users want JSON, text format, or |
||||
reflection-based access to a message. You do not give up any of these possibilities |
||||
by using direct access. |
||||
|
||||
However, using direct access does have some noticeable downsides. It requires |
||||
tighter coupling with upb's implementation details, as the mini table format is |
||||
upb-specific and requires building your code generator against upb's compiler |
||||
libraries. Any direct access of memory is especially tightly coupled, and would |
||||
need to be changed if upb's in-memory format ever changes. It also is more |
||||
prone to hard-to-debug memory errors if you make any mistakes. |
||||
|
||||
## Memory Management |
||||
|
||||
One of the core design challenges when wrapping upb is memory management. Every |
||||
language runtime will have some memory management system, whether it is |
||||
garbage collection, reference counting, manual memory management, or some hybrid |
||||
of these. upb is written in C and uses arenas for memory management, but upb is |
||||
designed to integrate with a wide variety of memory management schemes, and it |
||||
provides a number of tools for making this integration as smooth as possible. |
||||
|
||||
### Arenas |
||||
|
||||
upb defines data structures in C to represent messages, arrays (repeated |
||||
fields), and maps. A protobuf message is a hierarchical tree of these objects. |
||||
For example, a relatively simple protobuf tree might look something like this: |
||||
|
||||
```dot {align="center"} |
||||
digraph G { |
||||
rankdir=LR; |
||||
newrank=true; |
||||
graph[bgcolor=transparent] |
||||
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
||||
upb_msg -> upb_msg2; |
||||
upb_msg -> upb_array; |
||||
upb_msg [label="upb Message" fillcolor=1] |
||||
upb_msg2 [label="upb Message"]; |
||||
upb_array [label="upb Array"] |
||||
} |
||||
``` |
||||
|
||||
All upb objects are allocated from an arena. An arena lets you allocate objects |
||||
individually, but you cannot free individual objects; you can only free the arena |
||||
as a whole. When the arena is freed, all of the individual objects allocated |
||||
from that arena are freed together. |
||||
|
||||
```dot {align="center"} |
||||
digraph G { |
||||
rankdir=LR; |
||||
newrank=true; |
||||
graph[bgcolor=transparent] |
||||
subgraph cluster_0 { |
||||
label = "upb Arena" |
||||
graph[style="rounded,filled" fillcolor=gray] |
||||
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
||||
upb_msg -> upb_array; |
||||
upb_msg -> upb_msg2; |
||||
upb_msg [label="upb Message" fillcolor=1] |
||||
upb_msg2 [label="upb Message"]; |
||||
upb_array [label="upb Array"]; |
||||
} |
||||
} |
||||
``` |
||||
|
||||
In simple cases, the entire tree of objects will all live in a single arena. |
||||
This has the nice property that there cannot be any dangling pointers between |
||||
objects, since all objects are freed at the same time. |
||||
|
||||
However upb allows you to create links between any two objects, whether or |
||||
not they are in the same arena. The library does not know or care what arenas |
||||
the objects are in when you create links between them. |
||||
|
||||
```dot {align="center"} |
||||
digraph G { |
||||
rankdir=LR; |
||||
newrank=true; |
||||
graph[bgcolor=transparent] |
||||
subgraph cluster_0 { |
||||
label = "upb Arena 1" |
||||
graph[style="rounded,filled" fillcolor=gray] |
||||
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
||||
upb_msg -> upb_array; |
||||
upb_msg -> upb_msg2; |
||||
upb_msg [label="upb Message 1" fillcolor=1] |
||||
upb_msg2 [label="upb Message 2"]; |
||||
upb_array [label="upb Array"]; |
||||
} |
||||
subgraph cluster_1 { |
||||
label = "upb Arena 2" |
||||
graph[style="rounded,filled" fillcolor=gray] |
||||
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1] |
||||
upb_msg3; |
||||
} |
||||
upb_msg2 -> upb_msg3; |
||||
upb_msg3 [label="upb Message 3"]; |
||||
} |
||||
``` |
||||
|
||||
When objects are on separate arenas, it is the user's responsibility to ensure |
||||
that there are no dangling pointers. In the example above, this means Arena 2 |
||||
must outlive Message 1 and Message 2. |
||||
|
||||
### Integrating GC with upb |
||||
|
||||
In languages with automatic memory management, the goal is to handle all of the |
||||
arenas behind the scenes, so that the user does not have to manage them manually |
||||
or even know that they exist. |
||||
|
||||
We can achieve this goal if we set up the object graph in a particular way. The |
||||
general strategy is to create wrapper objects around all of the C objects, |
||||
including the arena. Our key goal is to make sure the arena wrapper is not |
||||
GC'd until all of the C objects in that arena have become unreachable. |
||||
|
||||
For this example, we will assume we are wrapping upb in Python: |
||||
|
||||
```dot {align="center"} |
||||
digraph G { |
||||
graph[bgcolor=transparent] |
||||
rankdir=LR; |
||||
newrank=true; |
||||
compound=true; |
||||
|
||||
subgraph cluster_1 { |
||||
label = "upb Arena" |
||||
graph[bgcolor=transparent style="rounded,filled" fillcolor=gray] |
||||
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] |
||||
upb_msg -> upb_array [style=dashed]; |
||||
upb_msg -> upb_msg2 [style=dashed]; |
||||
upb_msg [label="upb Message" fillcolor=1] |
||||
upb_msg2 [label="upb Message"]; |
||||
upb_array [label="upb Array"] |
||||
dummy [style=invis] |
||||
} |
||||
subgraph cluster_python { |
||||
node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2] |
||||
peripheries=0 |
||||
py_upb_msg [label="Python Message"]; |
||||
py_upb_msg2 [label="Python Message"]; |
||||
py_upb_arena [label="Python Arena"]; |
||||
} |
||||
py_upb_msg -> upb_msg [style=dashed]; |
||||
py_upb_msg2->upb_msg2 [style=dashed]; |
||||
py_upb_msg2 -> py_upb_arena [color=springgreen4]; |
||||
py_upb_msg -> py_upb_arena [color=springgreen4]; |
||||
py_upb_arena -> dummy [lhead=cluster_1, color=red]; |
||||
{ |
||||
rank=same; |
||||
upb_msg; |
||||
py_upb_msg; |
||||
} |
||||
{ |
||||
rank=same; |
||||
upb_array; |
||||
upb_msg2; |
||||
py_upb_msg2; |
||||
} |
||||
{ rank=same; |
||||
dummy; |
||||
py_upb_arena; |
||||
} |
||||
dummy->upb_array [style=invis]; |
||||
dummy->upb_msg2 [style=invis]; |
||||
|
||||
subgraph cluster_01 { |
||||
node [shape=plaintext] |
||||
peripheries=0 |
||||
key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> |
||||
<tr><td align="right" port="i1">raw ptr</td></tr> |
||||
<tr><td align="right" port="i2">unique ptr</td></tr> |
||||
<tr><td align="right" port="i3">shared (GC) ptr</td></tr> |
||||
</table>>] |
||||
key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> |
||||
<tr><td port="i1"> </td></tr> |
||||
<tr><td port="i2"> </td></tr> |
||||
<tr><td port="i3"> </td></tr> |
||||
</table>>] |
||||
key:i1:e -> key2:i1:w [style=dashed] |
||||
key:i2:e -> key2:i2:w [color=red] |
||||
key:i3:e -> key2:i3:w [color=springgreen4] |
||||
} |
||||
key2:i1:w -> upb_msg [style=invis]; |
||||
{ |
||||
rank=same; |
||||
key; |
||||
upb_msg; |
||||
} |
||||
} |
||||
``` |
||||
|
||||
In this example we have three different kinds of pointers: |
||||
|
||||
* **raw ptr**: This is a pointer that carries no ownership. |
||||
* **unique ptr**: This is a pointer has *unique ownership* of the target. The owner |
||||
will free the target in its destructor (or finalizer, or cleaner). There can |
||||
only be a single unique pointer to a given object. |
||||
* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the |
||||
target. Many objects can point to the target, and the target will be deleted |
||||
only when all such references are gone. In a runtime with automatic memory |
||||
management (GC), this is a reference that participates in GC. In Python such |
||||
references use reference counting, but in other VMs they may use mark and |
||||
sweep or some other form of GC instead. |
||||
|
||||
The Python Message wrappers have only raw pointers to the underlying message, |
||||
but they contain a shared pointer to the arena that will ensure that the raw |
||||
pointer remains valid. Only when all message wrapper objects are destroyed |
||||
will the Python Arena become unreachable, and the upb arena ultimately freed. |
||||
|
||||
### Links between arenas with "Fuse" |
||||
|
||||
The design given above works well for objects that live in a single arena. But |
||||
what if a user wants to create a link between two objects in different arenas? |
||||
|
||||
TODO |
||||
|
||||
## UTF-8 vs. UTF-16 |
||||
|
||||
TODO |
||||
|
||||
## Object Cache |
||||
|
||||
TODO |
Loading…
Reference in new issue