This change has several parts:
1. Resurrected tools/upbc. The code was all there but the build was
broken for open-source. Now you can type "make tools/upbc" and
it will build all necessary Lua modules and create a robust shell
script for running upbc.
2. Changed Lua module loading to no longer rely on OS-level .so
dependencies. The net effect of this is that you now only need
to set LUA_PATH and LUA_CPATH; setting LD_LIBRARY_PATH or rpaths
is no longer required. Downside: this drops compatibility with
Lua 5.1, since it depends on a feature that only exists in Lua >=5.2
(and LuaJIT).
3. Since upbc works again, I fixed the re-generation of the descriptor
files (descriptor.upb.h, descriptor.upb.c). "make genfiles" will
re-generate these as well as the JIT code generator.
4. Added a Travis test target that ensures that the checked-in generated
files are not out of date. I would do this for the Ragel-generated
file also, but we can't count on all versions of Ragel to
generate identical output.
5. Changed Makefile to no longer automatically run Ragel to regenerate
the JSON parser. This is unfortunate, because it's convenient when
you're developing the JSON parser. However, "git clone" sometimes
skews the timestamps a little, so that on a fresh clone "make" thinks
it needs to regenerate these files. This would normally
be harmless, but if the user doesn't have Ragel installed, it makes
the build fail completely. So now you have to explicitly regenerate
the Ragel output. If you want to, you can uncomment the auto-generation
during development.
There are a number of tweaks to get this to work:
- The #include dependence graph wasn't quite complete, and I had to add
a few #includes to get the tool to work.
- I had to change a number of symbol names to avoid conflicts between
'static' definitions in different .c files. This could be avoided if
the tool were smart enough to rename static symbols to have unique
prefixes instead, but (i) this requires semantic understanding of C,
and (ii) the macro-defined static functions (e.g., handlers for
primitive types in several places) would probably trip this up.
Verified that the resulting upb.h/upb.c compiles and doesn't have any
unresolved references.
- rewritten decoder; interpreted decoder is bytecode-based,
JIT decoder no longer falls back to the interpreter.
- C++ improvements: C++11-compatible iterators, upb::reffed_ptr
for RAII refcounting, better upcast/downcast support.
- removed the gross upb_value abstraction from public upb.h.
Major changes:
- Got rid of all bytestream interfaces in favor of
using regular handlers.
- new Pipeline object represents a upb pipeline, does
bump allocation internally to manage memory.
- proto2 support now can handle extensions.
Many things have changed and been simplified.
The memory-management story for upb_def and upb_handlers
is much more robust; upb_def and upb_handlers should be
fairly stable interfaces now. There is still much work
to do for the runtime component (upb_sink).
Many improvements, too many to mention. One significant
perf regression warrants investigation:
omitfp.parsetoproto2_googlemessage1.upb_jit: 343 -> 252 (-26.53%)
plain.parsetoproto2_googlemessage1.upb_jit: 334 -> 251 (-24.85%)
A 25% regression for this benchmark is bad, but since I don't think
any fundamental design issue caused it, I'm going to
go ahead with the commit anyway. I can investigate and fix it later.
Other benchmarks were neutral or showed slight improvement.
Includes are now via upb/foo.h.
Files specific to the protobuf format are
now in upb/pb (the core library is concerned
with message definitions, handlers, and
byte streams, but knows nothing about any
particular serialization format).
This should make it both easier to use and easier to
optimize, in exchange for a small amount of generality.
In practice, any remotely normal case is still very
natural.
The cost is that a upb_msg will now always have an overhead
of 2*sizeof(void*). This is comparable to proto2 overhead.
The benefit is that upb_msg is now self-describing, and
read-only algorithms can now operate on a upb_msg regardless
of the memory-management scheme.
Also, upb_array and upb_string now know inherently if they
own their associated memory, and upb_array has a generic
pointer for memory management purposes like upb_msg does.
There is significant refactoring here, as well as some more trivial
name changes. upb_msg has become upb_msgdef, to reflect the fact
that a upb_msg is not *itself* a message; it describes a message.
There are other renamings, such as upb_parse_state -> upb_stream_parser.
More significantly, the upb_msg class and parser have been refactored
to reflect my recent realization about how memory management should
work. upb_msg now has no memory management, and a memory-management
scheme (that works beautifully with multiple language runtimes) will
be layered on top of it.
This iteration has the new, read-only upb_msg. upb_mm_msg (a
memory-managed message class) will come in the next change.