This involves:
- remove upb_msglayout -> upb_msgfactory dependency.
- remove upb_msglayout -> upb_msgdef dependency (in progress).
- make upb_msglayout use a representation that can be
statically initialized by generated code.
The goal here is that upb_msglayout becomes a kind of "descriptor
lite": it contains enough data to parser and serialize protobufs
and manipulate a upb_msg in memory, while being far smaller and
simpler than a full descriptor. It also does not include field
names, which can be a benefit for applications that do not want
to leak field names.
Generated code can then create a upb_msglayout, and do most things
without ever needing to construct full descriptors/defs if they
don't want to.
* Put oneofs in the same table as fields.
Oneofs and fields are not allowed to have names that conflict,
so we might as well put them all in the same table. This also
allows an efficient operation that looks for both fields and
oneofs in a single lookup.
Added support for OneofDef to Lua to allow testing of this.
* Addressed PR comments.
* Changed schema for JSON test to be defined in a .proto file.
Before we had lots of code to build these schemas manually,
but this was verbose and made it difficult to add to the
schema easily. Now we can just write a .proto file and
adding fields is easy.
To avoid making the tests depend on upbc (and thus Lua)
we check in the generated schema.
* Made protobuf-compiler a dependency of "make genfiles."
* For genfiles download recent protoc that can handle proto3.
* Only use new protoc for genfiles.
* Split upb::Arena/upb::Allocator from upb::Environment.
This will allow arenas and allocators to be used
independently of environments, which will be important
for an upcoming change (a message representation).
Overall this design feels cleaner that the previous
Environment/SeededAllocator design.
As part of this change, moved all allocations in upb
to use a global allocator instead of hard-coding
malloc/free. This will allow injecting OOM faults
for more robust testing.
One place that doesn't use the global allocator is
the tracked ref code. Instead of its previous approach
of CHECK_OOM() after every malloc() or table insert, it
simply uses an allocator that does this automatically.
I moved Allocator/Arena/Environment into upb.h.
This seems principled since these are the only types
in upb whose size is directly exposed to users, since
they form the basis of memory allocation strategy.
* Cleaned up some header includes and fixed more malloc -> upb_gmalloc().
* Changes from PR review.
* Don't use UINTPTR_MAX or UINT64_MAX.
* Punt on adding line/file for now.
* We actually can't store (uint64_t)-1, update comment and test.
It is entirely optional: MessageDef/EnumDef can still exist
on their own. But this can represent a def's file when it is
desirable to do so (eg. for code generators).
This approach will require that we change the way we handle
extensions. But I think it will be a good change overall.
Specifically, we previously handled extensions by duplicating
the extended message and then adding the extension as a regular
field to the duplicated message. This required also duplicating
any messages that could reach the extended message.
In the new world we will need a way of declaring and looking up
extensions separately from the message being extended.
This change also involves some notable changes to the generated
code:
- files are now called foo.upbdefs.h instead of foo.upb.h.
This reflects the fact that we might possibly generate several
different output files for a .proto file, and this one is just
for defs.
- we no longer generate selectors in the .h file.
- the upbdefs.c no longer vends a SymbolTable. Now it vends the
individual messages (and possibly a FileDef later). I think this
will compose better once we can generate files where one
generated files imports another.
We also make the descriptor reader vend a list of FileDefs now.
This is the best conceptual match for parsing a FileDescriptorSet.
This was previously broken -- it would try to set
the status object on the parser, but the pointer
was never initialized. Also it didn't report
errors properly to the environment object.
A large part of this change contains surface-level
porting, like moving variable declarations to the
top of the block.
However there are a few more substantial things too:
- moved internal-only struct definitions to a separate
file (structdefs.int.h), for greater encapsulation
and ABI compatibility.
- removed the UPB_UPCAST macro, since it requires access
to the internal-only struct definitions. Replaced uses
with calls to inline, type-safe casting functions.
- removed the UPB_DEFINE_CLASS/UPB_DEFINE_STRUCT macros.
Class and struct definitions are now more explicit -- you
get to see the actual class/struct keywords in the source.
The casting convenience functions have been moved into
UPB_DECLARE_DERIVED_TYPE() and UPB_DECLARE_DERIVED_TYPE2().
- the new way that we duplicate base methods in derived types
is also more convenient and requires less duplication.
It is also less greppable, but hopefully that is not
too big a problem.
Compiler flags (-std=c89 -pedantic) should help to rigorously
enforce that the code is free of C99-isms.
A few functions are not available in C89 (strtoll). There
are temporary, hacky solutions in place.
Also added a separate ndebug build for testing that
-DNDEBUG builds still work.
Also disabled reference debugging by default, since it
requires either a global lock or -DUPB_THREAD_UNSAFE.
This change has several parts:
1. Resurrected tools/upbc. The code was all there but the build was
broken for open-source. Now you can type "make tools/upbc" and
it will build all necessary Lua modules and create a robust shell
script for running upbc.
2. Changed Lua module loading to no longer rely on OS-level .so
dependencies. The net effect of this is that you now only need
to set LUA_PATH and LUA_CPATH; setting LD_LIBRARY_PATH or rpaths
is no longer required. Downside: this drops compatibility with
Lua 5.1, since it depends on a feature that only exists in Lua >=5.2
(and LuaJIT).
3. Since upbc works again, I fixed the re-generation of the descriptor
files (descriptor.upb.h, descriptor.upb.c). "make genfiles" will
re-generate these as well as the JIT code generator.
4. Added a Travis test target that ensures that the checked-in generated
files are not out of date. I would do this for the Ragel generated
file also, but we can't count on all versions of Ragel to necessarily
generate identical output.
5. Changed Makefile to no longer automatically run Ragel to regenerate
the JSON parser. This is unfortuante, because it's convenient when
you're developing the JSON parser. However, "git clone" sometimes
skews the timestamps a little bit so that "make" thinks it needs to
regenerate these files for a fresh "git clone." This would normally
be harmless, but if the user doesn't have Ragel installed, it makes
the build fail completely. So now you have to explicitly regenerate
the Ragel output. If you want to you can uncomment the auto-generation
during development.
There are a number of tweaks to get this to work:
- The #include dependence graph wasn't quite complete, and I had to add
a few #includes to get the tool to work.
- I had to change a number of symbol names to avoid conflicts between
'static' definitions in different .c files. This could be avoided if
the tool were smart enough to rename static symbols to have unique
prefixes instead, but (i) this requires semantic understanding of C,
and (ii) the macro-defined static functions (e.g., handlers for
primitive types in several places) would probably trip this up.
Verified that the resulting upb.h/upb.c compiles and doesn't have any
unresolved references.
- Added a JSON test that round-trips (parses then re-serializes) several
test messages, ensuring that the re-serialized form matches the
original exactly.
- Added support for printing and parsing symbolic enum names (rather than
integer values) in JSON.
- Updated JSON printer to properly handle string fields that come in
multiple pieces. ('bytes' fields still do not support this, and this
work is more challenging because it requires making the base64 encoder
resumable. Base64 encoding is not separable at an input-byte
granularity, unlike string escaping.)
- Fixed a < vs. <= bug in UTF-8 encoding generation (oops).
Notable changes:
- We now only build things by default that require
no dependencies. So you can build upb even if you
don't have Lua or Google protobuf installed.
- Checked in a pre-built version of the JIT, so you
don't need Lua installed at build time to run DynASM.
It will still notice if you change the .dasc file and
attempt to re-run DynASM in that case.
- The build system now builds all modules of upb into
separate libraries, reflecting the modularity that
is already inherent in upb's design. This should
make it easier to trim the fat.
- removed the GDB JIT interface. I wasn't using it
much; using a .so is easier and more robust.
- Better error reporting for upb::Def setters.
- error reporting for upb::Handlers setters.
- made the start/endmsg handlers a little less special-cased.
Many things have changed and been simplified.
The memory-management story for upb_def and upb_handlers
is much more robust; upb_def and upb_handlers should be
fairly stable interfaces now. There is still much work
to do for the runtime component (upb_sink).
Many improvements, too many to mention. One significant
perf regression warrants investigation:
omitfp.parsetoproto2_googlemessage1.upb_jit: 343 -> 252 (-26.53)
plain.parsetoproto2_googlemessage1.upb_jit: 334 -> 251 (-24.85)
25% regression for this benchmark is bad, but since I don't think
there's any fundamental design issue that caused it I'm going to
go ahead with the commit anyway. Can investigate and fix later.
Other benchmarks were neutral or showed slight improvement.