

<!---
This document contains embedded graphviz diagrams inside ```dot blocks.

To convert it to rendered form using render.py:
  $ ./render.py wrapping-upb.in.md

You can also live-preview this document with all diagrams using Markdown Preview Enhanced
in Visual Studio Code:
  https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
--->

# Wrapping upb in other languages

upb is a C kernel that is designed to be wrapped in other languages.  This is a
guide for creating a new protobuf implementation based on upb.

## What you will need

There are certain things that the language runtime must provide in order to
wrap upb.

1. **Finalizers, Destructors, or Cleaners**: This is one unavoidable
   requirement: the language *must* provide finalizers or destructors of some sort.
   There must be a way of calling a C function when the language GCs or otherwise
   destroys an object.  We don't care much whether it is a finalizer, a destructor,
   or a cleaner, as long as it gets called eventually when the object is destroyed.
   Without finalizers, we would have no way of cleaning up upb data and everything
   would leak.
2. **HashMap with weak values**: This is not an absolute requirement, but in
   languages with automatic memory management, we generally end up wanting a
   hash map with weak values to act as a `upb_msg* -> wrapper` object cache.
   We want the values to be weak (not the keys).

## Reflection vs. Direct Access

Each language wrapping upb gets to decide whether it will access messages
through *reflection* or through *direct access*.  This decision has some deep
implications that will affect the design, features, and performance of your
library.

### Reflection

The simplest option is to load full reflection data into the upb library at
runtime.  You can load reflection data using serialized descriptors, which are a
stable and widely supported format across all protobuf tooling.

```c
  // A upb_symtab is a dynamic container that we can load reflection data into.
  upb_symtab* symtab = upb_symtab_new();

  // We load reflection data via a serialized descriptor.  The code generator
  // for your language should embed serialized descriptors into your generated
  // files. For each generated file loaded by your library, you can add the
  // serialized descriptor to the symtab as shown.
  upb_arena *tmp = upb_arena_new();
  google_protobuf_FileDescriptorProto* file =
      google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp);
  if (!file || !upb_symtab_addfile(symtab, file, NULL)) {
    // Handle error.
  }
  upb_arena_free(tmp);

  // At application exit, we free the symtab.
  upb_symtab_free(symtab);
```

The `upb_symtab` will give you full access to all data from the `.proto` file,
including convenient APIs like looking up a field by name. It will allow you to
use JSON and text format.  The APIs for accessing a message through reflection
are simple and well-supported.  These APIs cleanly encapsulate upb's internal
implementation details.

```c
  upb_symtab* symtab = BuildSymtab();

  // Look up a message type in the symtab.
  const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage");

  // Construct a new message of this type, via reflection.
  upb_arena *arena = upb_arena_new();
  upb_msg *msg = upb_msg_new(m, arena);

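  // Set a message field using reflection.  The field is looked up by name in
  // the msgdef, and the value is stored into the message itself.
  const upb_fielddef* f = upb_msgdef_ntofz(m, "bar_field");
  upb_msgval val = {.int32_val = 123};
  upb_msg_set(msg, f, val, arena);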

  // Free the message and symtab.
  upb_arena_free(arena);
  upb_symtab_free(symtab);
```

Using reflection is a natural choice in heavily reflective, dynamic runtimes
like Python, Ruby, PHP, or Lua.  These languages generally perform method
dispatch through a dictionary/hash table anyway, so we are not adding any extra
overhead by using upb's hash table to look up fields by name at field access
time.

### Direct Access

Using reflection has some downsides.  Reflection data is relatively large, both
in your binary (at rest) and in RAM (at runtime).  It contains names of
everything, and these names will be exposed in your binary.  Reflection APIs for
accessing a message will have more overhead than you might want, especially if
crossing the FFI boundary for your language runtime imposes significant
overhead.

We can reduce these overheads by using *direct access*.  upb's parser and
serializer do not actually require full reflection data; they use a more compact
data structure known as **mini tables**.  Mini tables will take up less space
than reflection, both in the binary and in RAM, and they will not leak field
names.  Mini tables will let us parse and serialize binary wire format data
without reflection.

```c
  // TODO: demonstrate upb API for loading mini table data at runtime.
  // This API does not exist yet.
```

To access messages themselves without the reflection API, we will be using
different, lower-level APIs that will require you to supply precise data such as
the offset of a given field.  This is information that will come from the upb
compiler framework, and the correctness (and even memory safety!) of the program
will rely on you passing these values through from the upb compiler libraries to
the upb runtime correctly.

```c
  // TODO: demonstrate using low-level APIs for direct field access.
  // These APIs do not exist yet.
```
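
As a rough illustration of what offset-based access means, here is a toy sketch
in plain C.  It is not a upb API: `toy_msg`, `toy_get_int64()`, and
`toy_set_int64()` are hypothetical names, and the layout is an ordinary C struct
rather than upb's real in-memory format.  The point is only that the accessor
trusts whatever offset it is given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout.  In a real binding the layout and the
 * offsets would come from the upb compiler libraries, not from a struct. */
typedef struct {
  int32_t id;
  int64_t foo;
} toy_msg;

/* Reads an int64 located `offset` bytes into the message.  Nothing checks
 * that the offset is correct, which is why a wrong value is a memory
 * safety bug rather than a recoverable error. */
int64_t toy_get_int64(const void *msg, size_t offset) {
  int64_t val;
  assert(msg != NULL);
  memcpy(&val, (const char *)msg + offset, sizeof(val));
  return val;
}

/* Writes an int64 located `offset` bytes into the message. */
void toy_set_int64(void *msg, size_t offset, int64_t val) {
  assert(msg != NULL);
  memcpy((char *)msg + offset, &val, sizeof(val));
}
```

A generated accessor would call something like
`toy_get_int64(msg, offsetof(toy_msg, foo))`, with the offset baked in by the
code generator.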

It may even be possible in certain circumstances to bypass the upb API completely
and access raw field data directly at a given offset, using unsafe APIs like
`sun.misc.Unsafe`.  This can theoretically allow for field access that is no
more expensive than referencing a struct/class field.

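```java
import sun.misc.Unsafe;

class FooProto {
  // Unsafe's accessors are instance methods.  How the instance is obtained
  // (typically via reflection on its "theUnsafe" field) is omitted here.
  private static final Unsafe UNSAFE = UnsafeHolder.get();  // Hypothetical helper.
  private final long addr;    // Address of the underlying upb message.
  private final Arena arena;  // Keeps the memory behind addr alive.

  FooProto(long addr, Arena arena) {
    this.addr = addr;
    this.arena = arena;
  }

  // Accessor that a Java library built on upb could conceivably generate.
  long getFoo() {
    // The offset 1234 came from the upb compiler library, and was injected by the
    // Java+upb code generator.
    return UNSAFE.getLong(addr + 1234);
  }
}
```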

It is always possible to load reflection data as desired, even if your library
is designed primarily around direct access.  Users who want to use JSON, text
format, or reflection could potentially load reflection data from separate
generated modules, for cases where they do not mind the size overhead or the
leaking of field names. You do not give up any of these possibilities by using
direct access.

However, using direct access does have some noticeable downsides.  It requires
tighter coupling with upb's implementation details, as the mini table format is
upb-specific and requires building your code generator against upb's compiler
libraries.  Any direct access of memory is especially tightly coupled, and would
need to be changed if upb's in-memory format ever changes.  Direct access is
also more prone to hard-to-debug memory errors if you make a mistake.

## Memory Management

One of the core design challenges when wrapping upb is memory management.  Every
language runtime will have some memory management system, whether it is
garbage collection, reference counting, manual memory management, or some hybrid
of these.  upb is written in C and uses arenas for memory management, but upb is
designed to integrate with a wide variety of memory management schemes, and it
provides a number of tools for making this integration as smooth as possible.

### Arenas

upb defines data structures in C to represent messages, arrays (repeated
fields), and maps.  A protobuf message is a hierarchical tree of these objects.
For example, a relatively simple protobuf tree might look something like this:

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
  upb_msg -> upb_msg2;
  upb_msg -> upb_array;
  upb_msg [label="upb Message" fillcolor=1]
  upb_msg2 [label="upb Message"];
  upb_array [label="upb Array"]
}
```

All upb objects are allocated from an arena.  An arena lets you allocate objects
individually, but you cannot free individual objects; you can only free the arena
as a whole.  When the arena is freed, all of the individual objects allocated
from that arena are freed together.

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  subgraph cluster_0 {
    label = "upb Arena"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
    upb_msg -> upb_array;
    upb_msg -> upb_msg2;
    upb_msg [label="upb Message" fillcolor=1]
    upb_msg2 [label="upb Message"];
    upb_array [label="upb Array"];
  }
}
```
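
The arena contract described above (allocate individually, free only as a
whole) can be sketched with a toy allocator.  This is a minimal illustration
under stated assumptions, not upb's arena implementation: `toy_arena`,
`toy_arena_malloc()`, and `toy_arena_free()` are hypothetical names, and each
allocation is a separate `malloc()` call chained into a list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation.  The union forces the payload
 * that follows it to be suitably aligned for any type. */
typedef union toy_block {
  union toy_block *next;
  max_align_t align;
} toy_block;

typedef struct {
  toy_block *head; /* Most recent allocation, or NULL if the arena is empty. */
  size_t live;     /* Number of allocations currently owned by the arena. */
} toy_arena;

void toy_arena_init(toy_arena *a) {
  assert(a != NULL);
  a->head = NULL;
  a->live = 0;
}

/* Allocates `size` bytes owned by the arena.  There is deliberately no
 * function for freeing a single allocation. */
void *toy_arena_malloc(toy_arena *a, size_t size) {
  toy_block *b = malloc(sizeof(toy_block) + size);
  if (!b) return NULL;
  b->next = a->head;
  a->head = b;
  a->live++;
  return b + 1; /* The payload starts right after the header. */
}

/* Frees every allocation made from the arena in one pass. */
void toy_arena_free(toy_arena *a) {
  toy_block *b = a->head;
  while (b) {
    toy_block *next = b->next;
    free(b);
    b = next;
  }
  a->head = NULL;
  a->live = 0;
}
```

After `toy_arena_free()` returns, every pointer previously handed out by
`toy_arena_malloc()` for that arena is invalid at once.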

In simple cases, the entire tree of objects will all live in a single arena.
This has the nice property that there cannot be any dangling pointers between
objects, since all objects are freed at the same time.

However, upb allows you to create links between any two objects, whether or
not they are in the same arena.  The library does not know or care what arenas
the objects are in when you create links between them.

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  subgraph cluster_0 {
    label = "upb Arena 1"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
    upb_msg -> upb_array;
    upb_msg -> upb_msg2;
    upb_msg [label="upb Message 1" fillcolor=1]
    upb_msg2 [label="upb Message 2"];
    upb_array [label="upb Array"];
  }
  subgraph cluster_1 {
    label = "upb Arena 2"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1]
    upb_msg3;
  }
  upb_msg2 -> upb_msg3;
  upb_msg3 [label="upb Message 3"];
}
```

When objects are in separate arenas, it is the user's responsibility to ensure
that there are no dangling pointers.  In the example above, Message 2 points to
Message 3, so Arena 2 must outlive Message 1 and Message 2.

### Integrating GC with upb

In languages with automatic memory management, the goal is to handle all of the
arenas behind the scenes, so that the user does not have to manage them manually
or even know that they exist.

We can achieve this goal if we set up the object graph in a particular way.  The
general strategy is to create wrapper objects around all of the C objects,
including the arena.  Our key goal is to make sure the arena wrapper is not
GC'd until all of the C objects in that arena have become unreachable.

For this example, we will assume we are wrapping upb in Python:

```dot {align="center"}
digraph G {
  rankdir=LR;
  newrank=true;
  compound=true;

  subgraph cluster_1 {
    label = "upb Arena"
    graph[style="rounded,filled" fillcolor=gray]
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
    upb_msg -> upb_array [style=dashed];
    upb_msg -> upb_msg2 [style=dashed];
    upb_msg [label="upb Message" fillcolor=1]
    upb_msg2 [label="upb Message"];
    upb_array [label="upb Array"]
    dummy [style=invis]
  }
  subgraph cluster_python {
    node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2]
    peripheries=0
    py_upb_msg [label="Python Message"];
    py_upb_msg2 [label="Python Message"];
    py_upb_arena [label="Python Arena"];
  }
  py_upb_msg -> upb_msg [style=dashed];
  py_upb_msg2->upb_msg2 [style=dashed];
  py_upb_msg2 -> py_upb_arena [color=springgreen4];
  py_upb_msg -> py_upb_arena [color=springgreen4];
  py_upb_arena -> dummy [lhead=cluster_1, color=red];
  {
     rank=same;
     upb_msg;
     py_upb_msg;
  }
  {
     rank=same;
     upb_array;
     upb_msg2;
     py_upb_msg2;
  }
  {  rank=same;
     dummy;
     py_upb_arena;
  }
  dummy->upb_array [style=invis];
  dummy->upb_msg2 [style=invis];

  subgraph cluster_01 {
    node [shape=plaintext]
    peripheries=0
    key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
      <tr><td align="right" port="i1">raw ptr</td></tr>
      <tr><td align="right" port="i2">unique ptr</td></tr>
      <tr><td align="right" port="i3">shared (GC) ptr</td></tr>
      </table>>]
    key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
      <tr><td port="i1">&nbsp;</td></tr>
      <tr><td port="i2">&nbsp;</td></tr>
      <tr><td port="i3">&nbsp;</td></tr>
      </table>>]
    key:i1:e -> key2:i1:w [style=dashed]
    key:i2:e -> key2:i2:w [color=red]
    key:i3:e -> key2:i3:w [color=springgreen4]
  }
    key2:i1:w -> upb_msg [style=invis];
  {
    rank=same;
    key;
    upb_msg;
  }
}
```

In this example we have three different kinds of pointers:

* **raw ptr**: This is a pointer that carries no ownership.
* **unique ptr**: This is a pointer that has *unique ownership* of the target.  The
  owner will free the target in its destructor (or finalizer, or cleaner).  There
  can only be a single unique pointer to a given object.
* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the
  target.  Many objects can point to the target, and the target will be deleted
  only when all such references are gone.  In a runtime with automatic memory
  management (GC), this is a reference that participates in GC.  In Python such
  references use reference counting, but in other VMs they may use mark and
  sweep or some other form of GC instead.

The Python Message wrappers have only raw pointers to the underlying message,
but they contain a shared pointer to the arena that will ensure that the raw
pointer remains valid.  Only when all message wrapper objects are destroyed
will the Python Arena become unreachable, and the upb arena ultimately freed.
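
The ownership scheme above can be modeled in plain C with an explicit
reference count standing in for the runtime's GC.  This is only a sketch under
that assumption: all `toy_*` names are hypothetical, the "arena" is a single
`malloc()` buffer, and a real binding would rely on the host language's
reference tracking and finalizers instead of manual `unref` calls.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts freed arenas so the example can observe when cleanup happens. */
int toy_arenas_freed = 0;

/* Stands in for the "Python Arena" wrapper. */
typedef struct {
  int refcount;  /* Models the shared (GC) references to this wrapper. */
  void *c_arena; /* unique ptr: freed when the wrapper is finalized. */
} toy_arena_wrapper;

/* Stands in for a "Python Message" wrapper. */
typedef struct {
  void *c_msg;              /* raw ptr: carries no ownership. */
  toy_arena_wrapper *arena; /* shared ptr: keeps c_msg valid. */
} toy_msg_wrapper;

toy_arena_wrapper *toy_arena_wrapper_new(size_t size) {
  toy_arena_wrapper *a = malloc(sizeof(*a));
  if (!a) return NULL;
  a->c_arena = malloc(size);
  if (!a->c_arena) {
    free(a);
    return NULL;
  }
  a->refcount = 1; /* Reference held by the caller. */
  return a;
}

/* Drops one reference.  The finalizer logic runs when the last one goes. */
void toy_arena_wrapper_unref(toy_arena_wrapper *a) {
  assert(a->refcount > 0);
  if (--a->refcount == 0) {
    free(a->c_arena);
    toy_arenas_freed++;
    free(a);
  }
}

/* Wraps a message that lives `offset` bytes into the arena's memory. */
toy_msg_wrapper *toy_msg_wrapper_new(toy_arena_wrapper *a, size_t offset) {
  toy_msg_wrapper *m = malloc(sizeof(*m));
  if (!m) return NULL;
  m->c_msg = (char *)a->c_arena + offset;
  m->arena = a;
  a->refcount++; /* The message wrapper now shares ownership of the arena. */
  return m;
}

/* Destroying a message wrapper releases its share of the arena. */
void toy_msg_wrapper_free(toy_msg_wrapper *m) {
  toy_arena_wrapper_unref(m->arena);
  free(m);
}
```

With two message wrappers on one arena, the underlying memory is released only
when the last wrapper is destroyed, regardless of the order in which the
wrappers and the caller's own reference go away.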

### Links between arenas with "Fuse"

The design given above works well for objects that live in a single arena. But
what if a user wants to create a link between two objects in different arenas?

TODO

## UTF-8 vs. UTF-16

TODO

## Object Cache

TODO