Updated "wrapping" doc.

PiperOrigin-RevId: 447505841
3 years ago · 2a5919deb3
parent 49f356a801
commit 2a5919deb3
1 changed files with 198 additions and 126 deletions
--- a/docs/wrapping-upb.md
+++ b/docs/wrapping-upb.md
@ -10,160 +10,232 @@ in Visual Studio Code:
  https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
 --->

-# Wrapping upb in other languages
+# Building a protobuf library on upb

-upb is a C kernel that is designed to be wrapped in other languages.  This is a
-guide for creating a new protobuf implementation based on upb.
+This is a guide for creating a new protobuf implementation based on upb.  It
+starts from the beginning and walks you through the process, highlighting
+some important design choices you will need to make.

-## What you will need
+## Overview

-There are certain things that the language runtime must provide in order to be
-wrapped by upb.
+A protobuf implementation consists of two main pieces:

-1. **Finalizers, Destructors, or Cleaners**: This is one unavoidable
-   requirement: the language *must* provide finalizers or destructors of some sort.
-   There must be a way of calling a C function when the language GCs or otherwise
-   destroys an object.  We don't care much whether it is a finalizer, a destructor,
-   or a cleaner, as long as it gets called eventually when the object is destroyed.
-   Without finalizers, we would have no way of cleaning up upb data and everything
-   would leak.
-2. **HashMap with weak values**: This is not an absolute requirement, but in
-   languages with automatic memory management, we generally end up wanting a
-   hash map with weak values to act as a `upb_msg* -> wrapper` object cache.
-   We want the values to be weak (not the keys).
+1. a code generator, run at compile time, to turn `.proto` files into source
+   files in your language (we will call this "zlang", assuming an extension of ".z").
+2. a runtime component, which implements the wire format and provides the data
+   structures for representing protobuf data and metadata.

-## Reflection vs. Direct Access
+<br/>

-Each language wrapping upb gets to decide whether it will access messages
-through *reflection* or through *direct access*.  This decision has some deep
-implications that will affect the design, features, and performance of your
-library.
+```dot {align="center"}
+digraph {
+    rankdir=LR; 
+    newrank=true;
+    node [style="rounded,filled" shape=box]
+    "foo.proto" -> protoc;
+    "foo.proto" [shape=folder];
+    protoc [fillcolor=lightgrey];
+    protoc -> "protoc-gen-zlang";
+    "protoc-gen-zlang" -> "foo.z";
+    "protoc-gen-zlang" [fillcolor=palegreen3];
+    "foo.z" [shape=folder];
+    labelloc="b";
+    label="Compile Time";
+}
+```
+
+<br/>
+
+```dot {align="center"}
+digraph {
+    newrank=true;
+    node [style="rounded,filled" shape=box fillcolor=lightgrey]
+    "foo.z" -> "zlang/upb glue (FFI)";
+    "zlang/upb glue (FFI)" -> "upb (C)";
+    "zlang/upb glue (FFI)" [fillcolor=palegreen3];
+    labelloc="b";
+    label="Runtime";
+}
+```
+
+The parts in green are what you will need to implement.
+
+Note that your code generator (`protoc-gen-zlang`) does *not* need to generate
+any C code (eg. `foo.c`). While upb itself is written in C, upb's parsers and
+serializers are fully table-driven, which means there is never any need or even
+benefit to generating C code for each proto. upb is capable of full-speed
+parsing even when schema data is loaded at runtime from strings embedded into
+`foo.z`. This is a key benefit of upb compared with C++ protos, which have
+traditionally relied on generated parsers in `foo.pb.cc` files to achieve full
+parsing speed, and suffered a ~10x speed penalty in the parser when the schema
+data was loaded at runtime.
+
+## Prerequisites
+
+There are a few things that the language runtime must provide in order to wrap
+upb.
+
+1.  **FFI**: To wrap upb, your language must be able to call into a C API
+    through a Foreign Function Interface (FFI). Most languages support FFI in
+    some form, either through "native extensions" (in which you write some C
+    code to implement new methods in your language) or through a direct FFI (in
+    which you can call into regular C functions directly from your language
+    using a special library).
+2.  **Finalizers, Destructors, or Cleaners**: The runtime must provide
+    finalizers or destructors of some sort. There must be a way of triggering a
+    call to a C function when the language garbage collects or otherwise
+    destroys an object. We don't care much whether it is a finalizer, a
+    destructor, or a cleaner, as long as it gets called eventually when the
+    object is destroyed. upb allocates memory in C space, and a finalizer is our
+    only way of making sure that memory is freed and does not leak.
+3.  **HashMap with weak values**: (optional) This is not a strong requirement,
+    but it is sometimes helpful to have a global hashmap with weak values to act
+    as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not
+    the keys). There is some question about whether we want to continue to use
+    this pattern going forward.
+
+## Reflection vs. MiniTables
+
+The first key design decision you will need to make is whether your generated
+code will access message data via reflection or minitables. Generally more
+dynamic languages will want to use reflection and more static languages will
+want to use minitables.

 ### Reflection

-The simplest option is to load full reflection data into the upb library at
-runtime.  You can load reflection data using serialized descriptors, which are a
-stable and widely supported format across all protobuf tooling.
-
-```c
-  // A upb_symtab is a dynamic container that we can load reflection data into.
-  upb_symtab* symtab = upb_symtab_new();
-
-  // We load reflection data via a serialized descriptor.  The code generator
-  // for your language should embed serialized descriptors into your generated
-  // files. For each generated file loaded by your library, you can add the
-  // serialized descriptor to the symtab as shown.
-  upb_arena *tmp = upb_arena_new();
-  google_protobuf_FileDescriptorProto* file =
-      google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp);
-  if (!file || !upb_symtab_addfile(symtab, file, NULL)) {
-    // Handle error.
-  }
-  upb_arena_free(tmp);
+Reflection-based data access makes the most sense in highly dynamic language
+interpreters, where method dispatch is generally resolved via strings and hash
+table lookups.  
+
+In such languages, you can often implement a special method like `__getattr__`
+(Python) or `method_missing` (Ruby) that receives the method name as a string.
+Using upb's reflection, you can look up a field name using the method name,
+thereby using a hash table belonging to upb instead of one provided by the
+language.
+
+```python
+class FooMessage:
+  # Written in Python for illustration, but in practice we will want to
+  # implement this in C for speed.
+  def __getattr__(self, name):
+    field = FooMessage.descriptor.fields_by_name[name]
+    return field.get_value(self)
+```

-  // At application exit, we free the symtab.
-  upb_symtab_free(symtab);
+Using this design, we only need to attach a single `__getattr__` method to each
+message class, instead of defining a getter/setter for each field. In this way
+we can avoid duplicating hash tables between upb and the language interpreter,
+reducing memory usage.
+
+Reflection-based access requires loading full reflection at runtime. Your
+generated code will need to embed serialized descriptors (ie. a serialized
+message of `descriptor.proto`), which has some amount of size overhead and
+exposes all message/field names to the binary. It also forces a hash table
+lookup in the critical path of field access. If method calls in your language
+already have this overhead, then this is no added burden, but for statically
+dispatched languages it would cause extra overhead.
+
+If we take this path to its logical conclusion, all class creation can be
+performed fully dynamically, using only a binary descriptor as input. The
+"generated code" becomes little more than an embedded descriptor plus a
+library call to load it. Python has recently gone down this path. Generated
+code now looks something like this:
+
+```python
+# main_pb2.py
+from google3.net.proto2.python.internal import builder as _builder
+from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool
+
+DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>")
+_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
+_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
 ```

-The `upb_symtab` will give you full access to all data from the `.proto` file,
-including convenient APIs like looking up a field by name. It will allow you to
-use JSON and text format.  The APIs for accessing a message through reflection
-are simple and well-supported.  These APIs cleanly encapsulate upb's internal
-implementation details.  
+This is all the runtime needs to create all of the classes for messages defined
+in that serialized descriptor.  This code has no pretense of readability, but
+a separate `.pyi` stub file provides a fully expanded and readable list of all
+methods a user can expect to be available:

-```c
-  upb_symtab* symtab = BuildSymtab();
+```python
+# main_pb2.pyi
+from google3.net.proto2.python.public import descriptor as _descriptor
+from google3.net.proto2.python.public import message as _message
+from typing import ClassVar as _ClassVar, Optional as _Optional

-  // Look up a message type in the symtab.
-  const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage");
+DESCRIPTOR: _descriptor.FileDescriptor

-  // Construct a new message of this type, via reflection.
-  upb_arena *arena = upb_arena_new();
-  upb_msg *msg = upb_msg_new(m, arena);
+class MyMessage(_message.Message):
+    __slots__ = ["my_field"]
+    MY_FIELD_FIELD_NUMBER: _ClassVar[int]
+    my_field: str
+    def __init__(self, my_field: _Optional[str] = ...) -> None: ...
+```

-  // Set a message field using reflection.
-  const upb_fielddef* f = upb_msgdef_ntof("bar_field");
-  upb_msgval val = {.int32_val = 123};
-  upb_msg_set(m, f, val, arena);
+To use reflection-based access:

-  // Free the message and symtab.
-  upb_arena_free(arena);
-  upb_symtab_free(symtab);
-```
+1. Load and access descriptor data using the interfaces in google3/third_party/upb/upb/def.h.
+2. Access message data using the interfaces in google3/third_party/upb/upb/reflection.h.

-Using reflection is a natural choice in heavily reflective, dynamic runtimes
-like Python, Ruby, PHP, or Lua.  These languages generally perform method
-dispatch through a dictionary/hash table anyway, so we are not adding any extra
-overhead by using upb's hash table to lookup fields by name at field access
-time.
-
-### Direct Access
-
-Using reflection has some downsides.  Reflection data is relatively large, both
-in your binary (at rest) and in RAM (at runtime).  It contains names of
-everything, and these names will be exposed in your binary.  Reflection APIs for
-accessing a message will have more overhead than you might want, especially if
-crossing the FFI boundary for your language runtime imposes significant
-overhead.
-
-We can reduce these overheads by using *direct access*.  upb's parser and
-serializer do not actually require full reflection data, they use a more compact
-data structure known as **mini tables**.  Mini tables will take up less space
-than reflection, both in the binary and in RAM, and they will not leak field
-names.  Mini tables will let us parse and serialize binary wire format data
-without reflection.
-
-```c
-  // TODO: demonstrate upb API for loading mini table data at runtime.
-  // This API does not exist yet.
-```
+### MiniTables

-To access messages themselves without the reflection API, we will be using
-different, lower-level APIs that will require you to supply precise data such as
-the offset of a given field.  This is information that will come from the upb
-compiler framework, and the correctness (and even memory safety!) of the program
-will rely on you passing these values through from the upb compiler libraries to
-the upb runtime correctly.
+MiniTables are a "lite" schema representation that are much smaller that
+reflection. MiniTables omit names, options, and almost everything else from the
+`.proto` file, retaining only enough information to parse and serialize binary
+format.

-```c
-  // TODO: demonstrate using low-level APIs for direct field access.
-  // These APIs do not exist yet.
-```
+MiniTables can be loaded into upb through *MiniDescriptors*. MiniDescriptors are
+a byte-oriented format that can be embedded into your generated code and passed
+to upb to construct MiniTables. MiniDescriptors only use printable characters,
+and therefore do not require escaping when embedding them into generated code
+strings. Overall the size savings of MiniDescriptors are ~60x compared with
+regular descriptors.

-It can even be possible in certain circumstances to bypass the upb API completely
-and access raw field data directly at a given offset, using unsafe APIs like
-`sun.misc.unsafe`.  This can theoretically allow for field access that is no
-more expensive than referencing a struct/class field.
+MiniTables and MiniDescriptors are a natural choice for compiled languages that
+resolve method calls at compile time. For languages that are sometimes compiled
+and sometimes interpreted, there might not be an obvious choice. When a method
+call is statically bound, we want to remove as much overhead as possible,
+especially from accessors. In the extreme case, we can use unsafe APIs to read
+raw memory at a known offset:

 ```java
-import sun.misc.Unsafe;
+// Example of a maximally-optimized generated accessor.
+class FooMessage {
+    public long getBarField() {
+        // Using Unsafe should give us performance that is comparable to a
+        // native member access.
+        //
+        // The constant "24" is obtained from upb at compile time.
+        sun.misc.Unsafe.getLong(this.ptr, 24);
+    }
+}
+```

-class FooProto {
-  private final long addr;
-  private final Arena arena;
+This design is very low-level, and tightly couples the generated code to one
+specific version of the schema and compiler.  A slower but safer version would
+look up a field by field number:

-  // Accessor that a Java library built on upb could conceivably generate.
-  long getFoo() {
-    // The offset 1234 came from the upb compiler library, and was injected by the
-    // Java+upb code generator.
-    return Unsafe.getLong(self.addr + 1234);
-  }
+```java
+// Example of a more loosely-coupled accessor.
+class FooMessage {
+    public long getBarField() {
+        // The constant "2" is the field number.  Internally this will look
+        // up the number "2" in the MiniTable and use that to read the value
+        // from the message.
+        upb.glue.getLong(this.ptr, 2);
+    }
 }
 ```

-It is always possible to load reflection data as desired, even if your library
-is designed primarily around direct access.  Users who want to use JSON, text
-format, or reflection could potentially load reflection data from separate
-generated modules, for cases where they do not mind the size overhead or the
-leaking of field names. You do not give up any of these possibilities by using
-direct access.
-
-However, using direct access does have some noticeable downsides.  It requires
-tighter coupling with upb's implementation details, as the mini table format is
-upb-specific and requires building your code generator against upb's compiler
-libraries.  Any direct access of memory is especially tightly coupled, and would
-need to be changed if upb's in-memory format ever changes.  It also is more
-prone to hard-to-debug memory errors if you make any mistakes.
+One downside of MiniTables is that they cannot support parsing or serializing
+to JSON or TextFormat, because they do now know the field names.  It should be
+possible to generate reflection data "on the side", into separate generated
+code files, so that reflection is only pulled in if it is being used.  However
+APIs to do this do not exist yet.
+
+To use MiniTable-based access:
+
+1. Load and access MiniDescriptors data using the interfaces in google3/third_party/upb/upb/mini_table.h.
+2. Access message data using the interfaces in google3/third_party/upb/upb/mini_table_accessors.h.

 ## Memory Management