diff --git a/docs/wrapping-upb.md b/docs/wrapping-upb.md
index 10a2c6d6d1..68392d5d8f 100644
--- a/docs/wrapping-upb.md
+++ b/docs/wrapping-upb.md
@@ -10,160 +10,232 @@ in Visual Studio Code:
 https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
 --->
-# Wrapping upb in other languages
+# Building a protobuf library on upb
-upb is a C kernel that is designed to be wrapped in other languages. This is a
-guide for creating a new protobuf implementation based on upb.
+This is a guide for creating a new protobuf implementation based on upb. It
+starts from the beginning and walks you through the process, highlighting
+some important design choices you will need to make.
-## What you will need
+## Overview
-There are certain things that the language runtime must provide in order to be
-wrapped by upb.
+A protobuf implementation consists of two main pieces:
-1. **Finalizers, Destructors, or Cleaners**: This is one unavoidable
-   requirement: the language *must* provide finalizers or destructors of some sort.
-   There must be a way of calling a C function when the language GCs or otherwise
-   destroys an object. We don't care much whether it is a finalizer, a destructor,
-   or a cleaner, as long as it gets called eventually when the object is destroyed.
-   Without finalizers, we would have no way of cleaning up upb data and everything
-   would leak.
-2. **HashMap with weak values**: This is not an absolute requirement, but in
-   languages with automatic memory management, we generally end up wanting a
-   hash map with weak values to act as a `upb_msg* -> wrapper` object cache.
-   We want the values to be weak (not the keys).
+1. a code generator, run at compile time, to turn `.proto` files into source
+   files in your language (we will call this "zlang", assuming an extension of ".z").
+2. a runtime component, which implements the wire format and provides the data
+   structures for representing protobuf data and metadata.
-## Reflection vs. Direct Access
+
-Each language wrapping upb gets to decide whether it will access messages
-through *reflection* or through *direct access*. This decision has some deep
-implications that will affect the design, features, and performance of your
-library.
+```dot {align="center"}
+digraph {
+    rankdir=LR;
+    newrank=true;
+    node [style="rounded,filled" shape=box]
+    "foo.proto" -> protoc;
+    "foo.proto" [shape=folder];
+    protoc [fillcolor=lightgrey];
+    protoc -> "protoc-gen-zlang";
+    "protoc-gen-zlang" -> "foo.z";
+    "protoc-gen-zlang" [fillcolor=palegreen3];
+    "foo.z" [shape=folder];
+    labelloc="b";
+    label="Compile Time";
+}
+```
+
+
+```dot {align="center"}
+digraph {
+    newrank=true;
+    node [style="rounded,filled" shape=box fillcolor=lightgrey]
+    "foo.z" -> "zlang/upb glue (FFI)";
+    "zlang/upb glue (FFI)" -> "upb (C)";
+    "zlang/upb glue (FFI)" [fillcolor=palegreen3];
+    labelloc="b";
+    label="Runtime";
+}
+```
+
+The parts in green are what you will need to implement.
+
+Note that your code generator (`protoc-gen-zlang`) does *not* need to generate
+any C code (e.g. `foo.c`). While upb itself is written in C, upb's parsers and
+serializers are fully table-driven, which means there is never any need or even
+benefit to generating C code for each proto. upb is capable of full-speed
+parsing even when schema data is loaded at runtime from strings embedded into
+`foo.z`. This is a key benefit of upb compared with C++ protos, which have
+traditionally relied on generated parsers in `foo.pb.cc` files to achieve full
+parsing speed, and suffered a ~10x speed penalty in the parser when the schema
+data was loaded at runtime.
+
+## Prerequisites
+
+There are a few things that the language runtime must provide in order to wrap
+upb.
+
+1. **FFI**: To wrap upb, your language must be able to call into a C API
+   through a Foreign Function Interface (FFI). Most languages support FFI in
+   some form, either through "native extensions" (in which you write some C
+   code to implement new methods in your language) or through a direct FFI (in
+   which you can call into regular C functions directly from your language
+   using a special library).
+2. **Finalizers, Destructors, or Cleaners**: The runtime must provide
+   finalizers or destructors of some sort. There must be a way of triggering a
+   call to a C function when the language garbage collects or otherwise
+   destroys an object. We don't care much whether it is a finalizer, a
+   destructor, or a cleaner, as long as it gets called eventually when the
+   object is destroyed. upb allocates memory in C space, and a finalizer is our
+   only way of making sure that memory is freed and does not leak.
+3. **HashMap with weak values**: (optional) This is not a strong requirement,
+   but it is sometimes helpful to have a global hashmap with weak values to act
+   as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not
+   the keys). There is some question about whether we want to continue to use
+   this pattern going forward.
+
+## Reflection vs. MiniTables
+
+The first key design decision you will need to make is whether your generated
+code will access message data via reflection or MiniTables. Generally, more
+dynamic languages will want to use reflection and more static languages will
+want to use MiniTables.
 ### Reflection
-The simplest option is to load full reflection data into the upb library at
-runtime. You can load reflection data using serialized descriptors, which are a
-stable and widely supported format across all protobuf tooling.
-
-```c
-  // A upb_symtab is a dynamic container that we can load reflection data into.
-  upb_symtab* symtab = upb_symtab_new();
-
-  // We load reflection data via a serialized descriptor. The code generator
-  // for your language should embed serialized descriptors into your generated
-  // files. For each generated file loaded by your library, you can add the
-  // serialized descriptor to the symtab as shown.
-  upb_arena *tmp = upb_arena_new();
-  google_protobuf_FileDescriptorProto* file =
-      google_protobuf_FileDescriptorProto_parse(desc_data, desc_size, tmp);
-  if (!file || !upb_symtab_addfile(symtab, file, NULL)) {
-    // Handle error.
-  }
-  upb_arena_free(tmp);
+Reflection-based data access makes the most sense in highly dynamic language
+interpreters, where method dispatch is generally resolved via strings and hash
+table lookups.
+
+In such languages, you can often implement a special method like `__getattr__`
+(Python) or `method_missing` (Ruby) that receives the method name as a string.
+Using upb's reflection, you can look up a field name using the method name,
+thereby using a hash table belonging to upb instead of one provided by the
+language.
+
+```python
+class FooMessage:
+  # Written in Python for illustration, but in practice we will want to
+  # implement this in C for speed.
+  def __getattr__(self, name):
+    field = FooMessage.descriptor.fields_by_name[name]
+    return field.get_value(self)
+```
-  // At application exit, we free the symtab.
-  upb_symtab_free(symtab);
+Using this design, we only need to attach a single `__getattr__` method to each
+message class, instead of defining a getter/setter for each field. In this way
+we can avoid duplicating hash tables between upb and the language interpreter,
+reducing memory usage.
+
+Reflection-based access requires loading full reflection at runtime. Your
+generated code will need to embed serialized descriptors (i.e. a serialized
+message of `descriptor.proto`), which has some amount of size overhead and
+exposes all message/field names to the binary. It also forces a hash table
+lookup in the critical path of field access. If method calls in your language
+already have this overhead, then this is no added burden, but for statically
+dispatched languages it would cause extra overhead.
+
+If we take this path to its logical conclusion, all class creation can be
+performed fully dynamically, using only a binary descriptor as input. The
+"generated code" becomes little more than an embedded descriptor plus a
+library call to load it. Python has recently gone down this path. Generated
+code now looks something like this:
+
+```python
+# main_pb2.py
+from google3.net.proto2.python.internal import builder as _builder
+from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool
+
+DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>")
+_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
+_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
 ```
-The `upb_symtab` will give you full access to all data from the `.proto` file,
-including convenient APIs like looking up a field by name. It will allow you to
-use JSON and text format. The APIs for accessing a message through reflection
-are simple and well-supported. These APIs cleanly encapsulate upb's internal
-implementation details.
+This is all the runtime needs to create all of the classes for messages defined
+in that serialized descriptor. This code has no pretense of readability, but
+a separate `.pyi` stub file provides a fully expanded and readable list of all
+methods a user can expect to be available:
-```c
-  upb_symtab* symtab = BuildSymtab();
+```python
+# main_pb2.pyi
+from google3.net.proto2.python.public import descriptor as _descriptor
+from google3.net.proto2.python.public import message as _message
+from typing import ClassVar as _ClassVar, Optional as _Optional
-  // Look up a message type in the symtab.
-  const upb_msgdef* m = upb_symtab_lookupmsg(symtab, "FooMessage");
+DESCRIPTOR: _descriptor.FileDescriptor
-  // Construct a new message of this type, via reflection.
-  upb_arena *arena = upb_arena_new();
-  upb_msg *msg = upb_msg_new(m, arena);
+class MyMessage(_message.Message):
+    __slots__ = ["my_field"]
+    MY_FIELD_FIELD_NUMBER: _ClassVar[int]
+    my_field: str
+    def __init__(self, my_field: _Optional[str] = ...) -> None: ...
+```
-  // Set a message field using reflection.
-  const upb_fielddef* f = upb_msgdef_ntof("bar_field");
-  upb_msgval val = {.int32_val = 123};
-  upb_msg_set(m, f, val, arena);
+To use reflection-based access:
-  // Free the message and symtab.
-  upb_arena_free(arena);
-  upb_symtab_free(symtab);
-```
+1. Load and access descriptor data using the interfaces in google3/third_party/upb/upb/def.h.
+2. Access message data using the interfaces in google3/third_party/upb/upb/reflection.h.
-Using reflection is a natural choice in heavily reflective, dynamic runtimes
-like Python, Ruby, PHP, or Lua. These languages generally perform method
-dispatch through a dictionary/hash table anyway, so we are not adding any extra
-overhead by using upb's hash table to lookup fields by name at field access
-time.
-
-### Direct Access
-
-Using reflection has some downsides. Reflection data is relatively large, both
-in your binary (at rest) and in RAM (at runtime). It contains names of
-everything, and these names will be exposed in your binary. Reflection APIs for
-accessing a message will have more overhead than you might want, especially if
-crossing the FFI boundary for your language runtime imposes significant
-overhead.
-
-We can reduce these overheads by using *direct access*. upb's parser and
-serializer do not actually require full reflection data, they use a more compact
-data structure known as **mini tables**. Mini tables will take up less space
-than reflection, both in the binary and in RAM, and they will not leak field
-names. Mini tables will let us parse and serialize binary wire format data
-without reflection.
-
-```c
-  // TODO: demonstrate upb API for loading mini table data at runtime.
-  // This API does not exist yet.
-```
+### MiniTables
-To access messages themselves without the reflection API, we will be using
-different, lower-level APIs that will require you to supply precise data such as
-the offset of a given field. This is information that will come from the upb
-compiler framework, and the correctness (and even memory safety!) of the program
-will rely on you passing these values through from the upb compiler libraries to
-the upb runtime correctly.
+MiniTables are a "lite" schema representation that is much smaller than
+reflection. MiniTables omit names, options, and almost everything else from the
+`.proto` file, retaining only enough information to parse and serialize binary
+format.
-```c
-  // TODO: demonstrate using low-level APIs for direct field access.
-  // These APIs do not exist yet.
-```
+MiniTables can be loaded into upb through *MiniDescriptors*. MiniDescriptors are
+a byte-oriented format that can be embedded into your generated code and passed
+to upb to construct MiniTables. MiniDescriptors only use printable characters,
+and therefore do not require escaping when embedding them into generated code
+strings. Overall the size savings of MiniDescriptors are ~60x compared with
+regular descriptors.
-It can even be possible in certain circumstances to bypass the upb API completely
-and access raw field data directly at a given offset, using unsafe APIs like
-`sun.misc.unsafe`. This can theoretically allow for field access that is no
-more expensive than referencing a struct/class field.
+MiniTables and MiniDescriptors are a natural choice for compiled languages that
+resolve method calls at compile time. For languages that are sometimes compiled
+and sometimes interpreted, there might not be an obvious choice. When a method
+call is statically bound, we want to remove as much overhead as possible,
+especially from accessors. In the extreme case, we can use unsafe APIs to read
+raw memory at a known offset:
 ```java
-import sun.misc.Unsafe;
+// Example of a maximally-optimized generated accessor.
+class FooMessage {
+  public long getBarField() {
+    // Using Unsafe should give us performance that is comparable to a
+    // native member access.
+    //
+    // The constant "24" is obtained from upb at compile time.
+    return sun.misc.Unsafe.getLong(this.ptr, 24);
+  }
+}
+```
-class FooProto {
-  private final long addr;
-  private final Arena arena;
+This design is very low-level, and tightly couples the generated code to one
+specific version of the schema and compiler. A slower but safer version would
+look up a field by field number:
-  // Accessor that a Java library built on upb could conceivably generate.
-  long getFoo() {
-    // The offset 1234 came from the upb compiler library, and was injected by the
-    // Java+upb code generator.
-    return Unsafe.getLong(self.addr + 1234);
-  }
+```java
+// Example of a more loosely-coupled accessor.
+class FooMessage {
+  public long getBarField() {
+    // The constant "2" is the field number. Internally this will look
+    // up the number "2" in the MiniTable and use that to read the value
+    // from the message.
+    return upb.glue.getLong(this.ptr, 2);
+  }
 }
 ```
-It is always possible to load reflection data as desired, even if your library
-is designed primarily around direct access. Users who want to use JSON, text
-format, or reflection could potentially load reflection data from separate
-generated modules, for cases where they do not mind the size overhead or the
-leaking of field names. You do not give up any of these possibilities by using
-direct access.
-
-However, using direct access does have some noticeable downsides. It requires
-tighter coupling with upb's implementation details, as the mini table format is
-upb-specific and requires building your code generator against upb's compiler
-libraries. Any direct access of memory is especially tightly coupled, and would
-need to be changed if upb's in-memory format ever changes. It also is more
-prone to hard-to-debug memory errors if you make any mistakes.
+One downside of MiniTables is that they cannot support parsing or serializing
+to JSON or TextFormat, because they do not know the field names. It should be
+possible to generate reflection data "on the side", into separate generated
+code files, so that reflection is only pulled in if it is being used. However,
+APIs to do this do not exist yet.
+
+To use MiniTable-based access:
+
+1. Load and access MiniDescriptor data using the interfaces in google3/third_party/upb/upb/mini_table.h.
+2. Access message data using the interfaces in google3/third_party/upb/upb/mini_table_accessors.h.
 ## Memory Management