Updated "wrapping" doc.

PiperOrigin-RevId: 447505841
3 years ago · 2a5919deb3
parent 49f356a801
commit 2a5919deb3
1 changed files with 198 additions and 126 deletions
--- a/docs/wrapping-upb.md
+++ b/docs/wrapping-upb.md
@@ -10,160 +10,232 @@ in Visual Studio Code:
  https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
 --->
-# Wrapping upb in other languages
+# Building a protobuf library on upb
-upb is a C kernel that is designed to be wrapped in other languages.  This is a
+This is a guide for creating a new protobuf implementation based on upb.  It
-guide for creating a new protobuf implementation based on upb.
+starts from the beginning and walks you through the process, highlighting
 some important design choices you will need to make.
-## What you will need
+## Overview
-There are certain things that the language runtime must provide in order to be
+A protobuf implementation consists of two main pieces:
 wrapped by upb.
-1. **Finalizers, Destructors, or Cleaners**: This is one unavoidable
+1. a code generator, run at compile time, to turn `.proto` files into source
-   requirement: the language *must* provide finalizers or destructors of some sort.
+   files in your language (we will call this "zlang", assuming an extension of ".z").
-   There must be a way of calling a C function when the language GCs or otherwise
+2. a runtime component, which implements the wire format and provides the data
-   destroys an object.  We don't care much whether it is a finalizer, a destructor,
+   structures for representing protobuf data and metadata.
   or a cleaner, as long as it gets called eventually when the object is destroyed.
   Without finalizers, we would have no way of cleaning up upb data and everything
   would leak.
 2. **HashMap with weak values**: This is not an absolute requirement, but in
   languages with automatic memory management, we generally end up wanting a
   hash map with weak values to act as a `upb_msg* -> wrapper` object cache.
   We want the values to be weak (not the keys).
-## Reflection vs. Direct Access
+<br/>
-Each language wrapping upb gets to decide whether it will access messages
+```dot {align="center"}
-through *reflection* or through *direct access*.  This decision has some deep
+digraph {
-implications that will affect the design, features, and performance of your
+    rankdir=LR; 
-library.
+    newrank=true;
    node [style="rounded,filled" shape=box]
    "foo.proto" -> protoc;
    "foo.proto" [shape=folder];
    protoc [fillcolor=lightgrey];
    protoc -> "protoc-gen-zlang";
    "protoc-gen-zlang" -> "foo.z";
    "protoc-gen-zlang" [fillcolor=palegreen3];
    "foo.z" [shape=folder];
    labelloc="b";
    label="Compile Time";
 }
 ```
 <br/>
 ```dot {align="center"}
 digraph {
    newrank=true;
    node [style="rounded,filled" shape=box fillcolor=lightgrey]
    "foo.z" -> "zlang/upb glue (FFI)";
    "zlang/upb glue (FFI)" -> "upb (C)";
    "zlang/upb glue (FFI)" [fillcolor=palegreen3];
    labelloc="b";
    label="Runtime";
 }
 ```
 The parts in green are what you will need to implement.
 Note that your code generator (`protoc-gen-zlang`) does *not* need to generate
 any C code (e.g. `foo.c`). While upb itself is written in C, upb's parsers and
 serializers are fully table-driven, which means there is never any need or even
 benefit to generating C code for each proto. upb is capable of full-speed
 parsing even when schema data is loaded at runtime from strings embedded into
 `foo.z`. This is a key benefit of upb compared with C++ protos, which have
 traditionally relied on generated parsers in `foo.pb.cc` files to achieve full
 parsing speed, and suffered a ~10x speed penalty in the parser when the schema
 data was loaded at runtime.
 ## Prerequisites
 There are a few things that the language runtime must provide in order to wrap
 upb.
 1.  **FFI**: To wrap upb, your language must be able to call into a C API
    through a Foreign Function Interface (FFI). Most languages support FFI in
    some form, either through "native extensions" (in which you write some C
    code to implement new methods in your language) or through a direct FFI (in
    which you can call into regular C functions directly from your language
    using a special library).
 2.  **Finalizers, Destructors, or Cleaners**: The runtime must provide
    finalizers or destructors of some sort. There must be a way of triggering a
    call to a C function when the language garbage collects or otherwise
    destroys an object. We don't care much whether it is a finalizer, a
    destructor, or a cleaner, as long as it gets called eventually when the
    object is destroyed. upb allocates memory in C space, and a finalizer is our
    only way of making sure that memory is freed and does not leak.
 3.  **HashMap with weak values**: (optional) This is not a strong requirement,
    but it is sometimes helpful to have a global hashmap with weak values to act
    as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not
    the keys). There is some question about whether we want to continue to use
    this pattern going forward.
 ## Reflection vs. MiniTables
 The first key design decision you will need to make is whether your generated
 code will access message data via reflection or minitables. Generally more
 dynamic languages will want to use reflection and more static languages will
 want to use minitables.
 ### Reflection
-The simplest option is to load full reflection data into the upb library at
+Reflection-based data access makes the most sense in highly dynamic language
-runtime.  You can load reflection data using serialized descriptors, which are a
+interpreters, where method dispatch is generally resolved via strings and hash
-stable and widely supported format across all protobuf tooling.
+table lookups.  
-
+
-```c
+In such languages, you can often implement a special method like `__getattr__`
-  // A upb_symtab is a dynamic container that we can load reflection data into.
+(Python) or `method_missing` (Ruby) that receives the method name as a string.
-  upb_symtab* symtab = upb_symtab_new();
+Using upb's reflection, you can look up a field name using the method name,
-
+thereby using a hash table belonging to upb instead of one provided by the
-  // We load reflection data via a serialized descriptor.  The code generator
+language.
-  // for your language should embed serialized descriptors into your generated
+
-  // files. For each generated file loaded by your library, you can add the
+```python
-  // serialized descriptor to the symtab as shown.
+class FooMessage:
-  upb_arena *tmp = upb_arena_new();
+  # Written in Python for illustration, but in practice we will want to
-  google_protobuf_FileDescriptorProto* file =
+  # implement this in C for speed.
-      google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp);
+  def __getattr__(self, name):
-  if (!file || !upb_symtab_addfile(symtab, file, NULL)) {
+    field = FooMessage.descriptor.fields_by_name[name]
-    // Handle error.
+    return field.get_value(self)
-  }
+```
  upb_arena_free(tmp);
-  // At application exit, we free the symtab.
+Using this design, we only need to attach a single `__getattr__` method to each
-  upb_symtab_free(symtab);
+message class, instead of defining a getter/setter for each field. In this way
 we can avoid duplicating hash tables between upb and the language interpreter,
 reducing memory usage.
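
A self-contained sketch of this pattern follows. The `FieldDescriptor` and `Descriptor` classes are hypothetical stand-ins for upb's reflection data, and the message stores its values in a plain dict rather than in upb-managed memory. The point is only that a single `__getattr__` serves every field.

```python
class FieldDescriptor:
    """Hypothetical stand-in for a upb field definition."""

    def __init__(self, name, number, default):
        self.name = name
        self.number = number
        self.default = default

    def get_value(self, msg):
        # A real binding would read from the upb message through the FFI.
        return msg._values.get(self.number, self.default)


class Descriptor:
    """Hypothetical stand-in for a upb message definition."""

    def __init__(self, fields):
        self.fields_by_name = {f.name: f for f in fields}


class FooMessage:
    descriptor = Descriptor([
        FieldDescriptor("bar_field", 1, 0),
        FieldDescriptor("baz_field", 2, ""),
    ])

    def __init__(self, **kwargs):
        fields = type(self).descriptor.fields_by_name
        # Store values keyed by field number.
        self._values = {fields[name].number: value
                        for name, value in kwargs.items()}

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so it never
        # shadows real attributes such as _values or descriptor.
        try:
            field = type(self).descriptor.fields_by_name[name]
        except KeyError:
            raise AttributeError(name) from None
        return field.get_value(self)
```

Raising `AttributeError` for unknown names keeps `hasattr()` and similar introspection working as users expect.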
 Reflection-based access requires loading full reflection at runtime. Your
 generated code will need to embed serialized descriptors (i.e. a serialized
 message of `descriptor.proto`), which has some amount of size overhead and
 exposes all message/field names to the binary. It also forces a hash table
 lookup in the critical path of field access. If method calls in your language
 already have this overhead, then this is no added burden, but for statically
 dispatched languages it would cause extra overhead.
 If we take this path to its logical conclusion, all class creation can be
 performed fully dynamically, using only a binary descriptor as input. The
 "generated code" becomes little more than an embedded descriptor plus a
 library call to load it. Python has recently gone down this path. Generated
 code now looks something like this:
 ```python
 # main_pb2.py
 from google3.net.proto2.python.internal import builder as _builder
 from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool
 DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>")
 _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
 _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
 ```
-The `upb_symtab` will give you full access to all data from the `.proto` file,
+This is all the runtime needs to create all of the classes for messages defined
-including convenient APIs like looking up a field by name. It will allow you to
+in that serialized descriptor.  This code has no pretense of readability, but
-use JSON and text format.  The APIs for accessing a message through reflection
+a separate `.pyi` stub file provides a fully expanded and readable list of all
-are simple and well-supported.  These APIs cleanly encapsulate upb's internal
+methods a user can expect to be available:
 implementation details.  
-```c
+```python
-  upb_symtab* symtab = BuildSymtab();
+# main_pb2.pyi
 from google3.net.proto2.python.public import descriptor as _descriptor
 from google3.net.proto2.python.public import message as _message
 from typing import ClassVar as _ClassVar, Optional as _Optional
-  // Look up a message type in the symtab.
+DESCRIPTOR: _descriptor.FileDescriptor
  const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage");
-  // Construct a new message of this type, via reflection.
+class MyMessage(_message.Message):
-  upb_arena *arena = upb_arena_new();
+    __slots__ = ["my_field"]
-  upb_msg *msg = upb_msg_new(m, arena);
+    MY_FIELD_FIELD_NUMBER: _ClassVar[int]
    my_field: str
    def __init__(self, my_field: _Optional[str] = ...) -> None: ...
 ```
-  // Set a message field using reflection.
+To use reflection-based access:
  const upb_fielddef* f = upb_msgdef_ntof("bar_field");
  upb_msgval val = {.int32_val = 123};
  upb_msg_set(m, f, val, arena);
-  // Free the message and symtab.
+1. Load and access descriptor data using the interfaces in google3/third_party/upb/upb/def.h.
-  upb_arena_free(arena);
+2. Access message data using the interfaces in google3/third_party/upb/upb/reflection.h.
  upb_symtab_free(symtab);
 ```
-Using reflection is a natural choice in heavily reflective, dynamic runtimes
+### MiniTables
 like Python, Ruby, PHP, or Lua.  These languages generally perform method
 dispatch through a dictionary/hash table anyway, so we are not adding any extra
 overhead by using upb's hash table to lookup fields by name at field access
 time.
 ### Direct Access
 Using reflection has some downsides.  Reflection data is relatively large, both
 in your binary (at rest) and in RAM (at runtime).  It contains names of
 everything, and these names will be exposed in your binary.  Reflection APIs for
 accessing a message will have more overhead than you might want, especially if
 crossing the FFI boundary for your language runtime imposes significant
 overhead.
 We can reduce these overheads by using *direct access*.  upb's parser and
 serializer do not actually require full reflection data, they use a more compact
 data structure known as **mini tables**.  Mini tables will take up less space
 than reflection, both in the binary and in RAM, and they will not leak field
 names.  Mini tables will let us parse and serialize binary wire format data
 without reflection.
 ```c
  // TODO: demonstrate upb API for loading mini table data at runtime.
  // This API does not exist yet.
 ```
-To access messages themselves without the reflection API, we will be using
+MiniTables are a "lite" schema representation that is much smaller than
-different, lower-level APIs that will require you to supply precise data such as
+reflection. MiniTables omit names, options, and almost everything else from the
-the offset of a given field.  This is information that will come from the upb
+`.proto` file, retaining only enough information to parse and serialize binary
-compiler framework, and the correctness (and even memory safety!) of the program
+format.
 will rely on you passing these values through from the upb compiler libraries to
 the upb runtime correctly.
-```c
+MiniTables can be loaded into upb through *MiniDescriptors*. MiniDescriptors are
-  // TODO: demonstrate using low-level APIs for direct field access.
+a byte-oriented format that can be embedded into your generated code and passed
-  // These APIs do not exist yet.
+to upb to construct MiniTables. MiniDescriptors only use printable characters,
-```
+and therefore do not require escaping when embedding them into generated code
 strings. Overall the size savings of MiniDescriptors are ~60x compared with
 regular descriptors.
-It can even be possible in certain circumstances to bypass the upb API completely
+MiniTables and MiniDescriptors are a natural choice for compiled languages that
-and access raw field data directly at a given offset, using unsafe APIs like
+resolve method calls at compile time. For languages that are sometimes compiled
-`sun.misc.unsafe`.  This can theoretically allow for field access that is no
+and sometimes interpreted, there might not be an obvious choice. When a method
-more expensive than referencing a struct/class field.
+call is statically bound, we want to remove as much overhead as possible,
 especially from accessors. In the extreme case, we can use unsafe APIs to read
 raw memory at a known offset:
 ```java
-import sun.misc.Unsafe;
+// Example of a maximally-optimized generated accessor.
 class FooMessage {
    public long getBarField() {
        // Using Unsafe should give us performance that is comparable to a
        // native member access.
        //
        // The constant "24" is obtained from upb at compile time.
        return sun.misc.Unsafe.getLong(this.ptr, 24);
    }
 }
 ```
-class FooProto {
+This design is very low-level, and tightly couples the generated code to one
-  private final long addr;
+specific version of the schema and compiler.  A slower but safer version would
-  private final Arena arena;
+look up a field by field number:
-  // Accessor that a Java library built on upb could conceivably generate.
+```java
-  long getFoo() {
+// Example of a more loosely-coupled accessor.
-    // The offset 1234 came from the upb compiler library, and was injected by the
+class FooMessage {
-    // Java+upb code generator.
+    public long getBarField() {
-    return Unsafe.getLong(self.addr + 1234);
+        // The constant "2" is the field number.  Internally this will look
-  }
+        // up the number "2" in the MiniTable and use that to read the value
        // from the message.
        return upb.glue.getLong(this.ptr, 2);
    }
 }
 ```
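
The difference between the two accessors can be modeled in a few lines of Python. This is a toy model only: the byte layout, the `MINI_TABLE` dict, and the offset `24` are illustrative assumptions and do not reflect upb's actual in-memory format or API.

```python
import struct

# Toy message memory: 24 bytes of padding followed by one little-endian int64.
# The layout is an assumption made for this example.
BAR_FIELD_OFFSET = 24
message_bytes = bytes(BAR_FIELD_OFFSET) + struct.pack("<q", 123)

# Hypothetical stand-in for a MiniTable: field number -> (offset, format).
MINI_TABLE = {2: (BAR_FIELD_OFFSET, "<q")}


def get_bar_field_direct(buf):
    """Tightly coupled: the offset is baked into the generated code."""
    return struct.unpack_from("<q", buf, BAR_FIELD_OFFSET)[0]


def get_long_by_number(buf, field_number):
    """Loosely coupled: the offset is looked up by field number at runtime."""
    offset, fmt = MINI_TABLE[field_number]
    return struct.unpack_from(fmt, buf, offset)[0]
```

If the layout changes, the second accessor only needs an updated table, whereas every generated copy of the first accessor must be regenerated with the new offset.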
-It is always possible to load reflection data as desired, even if your library
+One downside of MiniTables is that they cannot support parsing or serializing
-is designed primarily around direct access.  Users who want to use JSON, text
+to JSON or TextFormat, because they do not know the field names.  It should be
-format, or reflection could potentially load reflection data from separate
+possible to generate reflection data "on the side", into separate generated
-generated modules, for cases where they do not mind the size overhead or the
+code files, so that reflection is only pulled in if it is being used.  However,
-leaking of field names. You do not give up any of these possibilities by using
+APIs to do this do not exist yet.
-direct access.
+
-
+To use MiniTable-based access:
-However, using direct access does have some noticeable downsides.  It requires
+
-tighter coupling with upb's implementation details, as the mini table format is
+1. Load and access MiniDescriptor data using the interfaces in google3/third_party/upb/upb/mini_table.h.
-upb-specific and requires building your code generator against upb's compiler
+2. Access message data using the interfaces in google3/third_party/upb/upb/mini_table_accessors.h.
 libraries.  Any direct access of memory is especially tightly coupled, and would
 need to be changed if upb's in-memory format ever changes.  It also is more
 prone to hard-to-debug memory errors if you make any mistakes.
 ## Memory Management