Many improvements, too many to mention. One significant
perf regression warrants investigation:
omitfp.parsetoproto2_googlemessage1.upb_jit: 343 -> 252 (-26.53)
plain.parsetoproto2_googlemessage1.upb_jit: 334 -> 251 (-24.85)
A 25% regression for this benchmark is bad, but since I don't think
any fundamental design issue caused it, I'm going to go ahead with
the commit anyway. I can investigate and fix it later.
Other benchmarks were neutral or showed slight improvement.
Added a upb_byteregion that tracks a region of
the input buffer; decoders use this instead of
using a upb_bytesrc directly. upb_byteregion
is also used as the way of passing a string to
a upb_handlers callback. This symmetry makes
decoders compose better; if you want to take
a parsed string and decode it as something else,
you can take the string directly from the callback
and feed it as input to another parser.
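A minimal sketch of that composition idea, under loud assumptions:
region_t, count_fields and on_string are hypothetical names, not upb's
API, and the region is assumed to be contiguous in memory (the real
upb_byteregion sits on top of a upb_bytesrc). The point is only that
the type a callback receives is the same type a decoder consumes.

```c
#include <stddef.h>

/* Hypothetical stand-in for upb_byteregion: a window onto an input
 * buffer. Assumes contiguous memory, which the real type may not. */
typedef struct {
  const char *buf;  /* start of the underlying buffer */
  size_t start;     /* offset of the first byte in the region */
  size_t end;       /* offset one past the last byte in the region */
} region_t;

/* A toy "decoder": takes a region as input and returns the number of
 * comma-separated fields it contains (0 for an empty region). */
static size_t count_fields(const region_t *r) {
  if (r->start == r->end) return 0;
  size_t n = 1;
  for (size_t i = r->start; i < r->end; i++) {
    if (r->buf[i] == ',') n++;
  }
  return n;
}

/* A toy string callback. It receives the string as a region, so it can
 * hand it straight to another decoder without copying or converting. */
static size_t on_string(void *closure, const region_t *str) {
  (void)closure;  /* unused in this sketch */
  return count_fields(str);
}
```

For example, with input "xxa,b,cyy" and a region covering offsets
2 to 7, on_string() sees "a,b,c" and reports three fields.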
A commented-out version of a pinning interface
is present; I decline to actually implement it
(and accept its extra complexity) until/unless
it is clear that it is actually a win. But it
is included as a proof-of-concept, to show that
it fits well with the existing interface.
This leads to a major (20-40%) improvement in the parsetoproto2
benchmark with small messages. We are now faster than proto2 in all
apples-to-apples comparisons, at least given the (admittedly
limited) set of benchmarks in this source tree.
Includes are now via upb/foo.h.
Files specific to the protobuf format are
now in upb/pb (the core library is concerned
with message definitions, handlers, and
byte streams, but knows nothing about any
particular serialization format).
I'm realizing that basically all upb objects
will need to be refcounted to be sharable
across languages, but *not* messages, which
are on their way out so that we can get out of
the business of data representations.
Things which must be refcounted:
- encoders, decoders
- handlers objects
- defs
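A rough sketch of what a shared refcounting base could look like. All
names here (refcounted_t, rc_ref, rc_unref, toy_def_t) are hypothetical
and this is not upb's actual layout. The count is a plain int, so this
version is not thread-safe.

```c
#include <stdlib.h>

/* Hypothetical refcounted base, embedded as the first member of every
 * shareable object (defs, handlers, encoders, decoders). */
typedef struct refcounted {
  int refcount;
  void (*free_fn)(struct refcounted *obj);  /* type-specific destructor */
} refcounted_t;

static void rc_init(refcounted_t *obj, void (*free_fn)(refcounted_t *)) {
  obj->refcount = 1;  /* the creator owns the first reference */
  obj->free_fn = free_fn;
}

static void rc_ref(refcounted_t *obj) { obj->refcount++; }

/* Drops one reference and destroys the object when the last one goes
 * away. Returns 1 if the object was destroyed, 0 otherwise. */
static int rc_unref(refcounted_t *obj) {
  if (--obj->refcount > 0) return 0;
  obj->free_fn(obj);
  return 1;
}

/* Example object: a def-like struct that embeds the base. */
typedef struct {
  refcounted_t base;
  int field_count;
} toy_def_t;

static int toy_defs_freed = 0;  /* instrumentation for the example only */

static void toy_def_free(refcounted_t *obj) {
  toy_defs_freed++;
  free(obj);  /* base is the first member, so this frees the toy_def_t */
}

static toy_def_t *toy_def_new(int field_count) {
  toy_def_t *d = malloc(sizeof(*d));
  if (!d) return NULL;
  rc_init(&d->base, toy_def_free);
  d->field_count = field_count;
  return d;
}
```

A language binding would call rc_ref() when it wraps the object and
rc_unref() from its finalizer, so neither side needs to know who
releases last.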
It can successfully parse SpeedMessage1.
Preliminary results: 750MB/s on Core2 2.4GHz.
This number is 2.5x proto2.
This isn't apples-to-apples, because
proto2 is parsing to a struct and we are
just doing stream parsing, but for apps
that are currently using proto2, this is the
improvement they would see if they could
move to stream-based processing.
Unfortunately perf-regression-test.py is
broken, and I'm not 100% sure why. It would
be nice to fix it first (to ensure that
there are no performance regressions for
the table-based decoder) but I'm really
impatient to get the JIT checked in.
This doesn't reflect any material change in
how I will be working on upb, and I have no
problem making this change. It's still open
source under the BSD license, and I'll still
be working on it well beyond the hours that
constitute a normal job.
This is a significant change to the upb_stream
protocol, and should hopefully be the last
significant change.
All callbacks are now registered ahead-of-time
instead of having delegated callbacks registered
at runtime, which makes it much easier to
aggressively optimize ahead-of-time (like with a
JIT).
Other impacts of this change:
- You no longer need to have loaded descriptor.proto
as a upb_def to load other descriptors! This means
the special-case code we used for bootstrapping is
no longer necessary, and we no longer need to link
the descriptor for descriptor.proto into upb.
- A client can now register any upb_value as what
will be delivered to their value callback, not
just a upb_fielddef*. This should allow for other
clients to get more bang out of the streaming
decoder.
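To illustrate ahead-of-time registration with a client-chosen bound
value, here is a small sketch. Everything in it is hypothetical
(handlers_t, handlers_register, handlers_freeze, dispatch), including
the explicit "frozen" flag. It shows the shape of the idea, not the
real upb_handlers interface.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_FIELDS 16

/* Hypothetical value callback: gets the per-parse closure, the value
 * the client bound at registration time, and the decoded value. */
typedef void value_cb(void *closure, void *bound, uint64_t val);

typedef struct {
  value_cb *cb;
  void *bound;  /* anything the client likes, not only a fielddef */
} field_handler_t;

typedef struct {
  field_handler_t fields[MAX_FIELDS];  /* indexed by field number */
  int frozen;  /* once set, the table can no longer change */
} handlers_t;

/* Registers a callback before parsing starts. Returns 1 on success,
 * 0 if the table is frozen or the field number is out of range. */
static int handlers_register(handlers_t *h, unsigned fieldnum,
                             value_cb *cb, void *bound) {
  if (h->frozen || fieldnum >= MAX_FIELDS) return 0;
  h->fields[fieldnum].cb = cb;
  h->fields[fieldnum].bound = bound;
  return 1;
}

/* After this, a decoder (or JIT) may assume the table is immutable. */
static void handlers_freeze(handlers_t *h) { h->frozen = 1; }

/* What a decoder does per value: look up and call, nothing dynamic. */
static void dispatch(const handlers_t *h, void *closure,
                     unsigned fieldnum, uint64_t val) {
  if (fieldnum >= MAX_FIELDS) return;
  const field_handler_t *f = &h->fields[fieldnum];
  if (f->cb) f->cb(closure, f->bound, val);
}

/* Example callback: the bound value is a pointer to a running total. */
static void add_to_total(void *closure, void *bound, uint64_t val) {
  (void)closure;  /* unused in this sketch */
  *(uint64_t *)bound += val;
}
```

Because the full set of callbacks is known once the table is frozen, a
code generator could emit a direct call per field instead of going
through the table at all.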
This change unfortunately causes a bit of a performance
regression -- I think largely due to highly
suboptimal code that GCC generates when structs
are returned by value. See:
http://blog.reverberate.org/2011/03/19/when-a-compilers-slow-code-actually-bites-you/
On the other hand, once we have a JIT this should
no longer matter.
Performance numbers:
plain.parsestream_googlemessage1.upb_table: 374 -> 396 (5.88)
plain.parsestream_googlemessage2.upb_table: 616 -> 449 (-27.11)
plain.parsetostruct_googlemessage1.upb_table_byref: 268 -> 269 (0.37)
plain.parsetostruct_googlemessage1.upb_table_byval: 215 -> 204 (-5.12)
plain.parsetostruct_googlemessage2.upb_table_byref: 307 -> 281 (-8.47)
plain.parsetostruct_googlemessage2.upb_table_byval: 297 -> 272 (-8.42)
omitfp.parsestream_googlemessage1.upb_table: 423 -> 410 (-3.07)
omitfp.parsestream_googlemessage2.upb_table: 679 -> 483 (-28.87)
omitfp.parsetostruct_googlemessage1.upb_table_byref: 287 -> 282 (-1.74)
omitfp.parsetostruct_googlemessage1.upb_table_byval: 226 -> 219 (-3.10)
omitfp.parsetostruct_googlemessage2.upb_table_byref: 315 -> 298 (-5.40)
omitfp.parsetostruct_googlemessage2.upb_table_byval: 297 -> 287 (-3.37)
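For reading these tables: the figure in parentheses is consistent with
the percent change relative to the old number. A tiny helper
(pct_change is a name made up for this note, not part of the benchmark
scripts) reproduces it:

```c
/* Percent change from an old measurement to a new one.
 * Negative means the new number is lower than the old one. */
static double pct_change(double before, double after) {
  return (after - before) / before * 100.0;
}
```

For instance, pct_change(374, 396) is about 5.88 and
pct_change(616, 449) is about -27.11, matching the first two rows.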
The symtab that contains them is now hidden, and
you can look them up by name but there is no access
to the symtab itself, so there is no risk of
mutating it (by extending it, adding other defs
to it, etc).
upb_inttable now supports a "compact" operation that will
decide on an array size and put all entries with small enough
keys into the array part for faster lookup.
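A sketch of one way such a compact step could work. This is not the
actual upb_inttable code. The names are hypothetical, the "hash part"
is a plain unsorted list to keep the example short, and the sizing rule
(the largest power of two whose slots would be more than half full) is
only one plausible policy.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t key; int val; } ent_t;

typedef struct {
  int *array;           /* dense part: array[k] is the value for key k */
  unsigned char *has;   /* has[k] != 0 if key k is in the array part */
  uint32_t array_size;
  ent_t *rest;          /* stand-in for the hash part (linear scan) */
  size_t rest_count;
} toy_inttable_t;

static const int *toy_lookup(const toy_inttable_t *t, uint32_t key) {
  if (key < t->array_size) {  /* fast path: direct index, no hashing */
    return t->has[key] ? &t->array[key] : NULL;
  }
  for (size_t i = 0; i < t->rest_count; i++) {
    if (t->rest[i].key == key) return &t->rest[i].val;
  }
  return NULL;
}

/* Largest power of two n <= max_size such that more than n/2 of the
 * keys in [0, n) are present. Returns 0 if no size qualifies. */
static uint32_t choose_array_size(const ent_t *ents, size_t count,
                                  uint32_t max_size) {
  uint32_t best = 0;
  for (uint32_t n = 1; n != 0 && n <= max_size; n *= 2) {
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
      if (ents[i].key < n) used++;
    }
    if (used > n / 2) best = n;
  }
  return best;
}

/* Moves small keys into the array part. Assumes the table has no
 * array part yet. Returns 1 on success, 0 on allocation failure
 * (in which case the table is left unchanged). */
static int toy_compact(toy_inttable_t *t, uint32_t max_size) {
  uint32_t n = choose_array_size(t->rest, t->rest_count, max_size);
  if (n == 0) return 1;  /* keys are too sparse; nothing to do */
  int *array = calloc(n, sizeof(*array));
  unsigned char *has = calloc(n, 1);
  if (!array || !has) { free(array); free(has); return 0; }
  size_t kept = 0;
  for (size_t i = 0; i < t->rest_count; i++) {
    ent_t e = t->rest[i];
    if (e.key < n) {
      array[e.key] = e.val;
      has[e.key] = 1;
    } else {
      t->rest[kept++] = e;  /* large keys stay in the hash part */
    }
  }
  t->array = array;
  t->has = has;
  t->array_size = n;
  t->rest_count = kept;
  return 1;
}
```

With keys {1, 2, 3, 5, 1000}, this rule picks an array size of 4: keys
1 to 3 move into the array part, while 5 and 1000 stay behind.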
Also exposed the upb_itof_ent structure and put a few useful
values there, so they are one fewer pointer chase away.
Unfortunately this degrades hash table lookup performance by
about 8%, which affects the streaming benchmark for googlemessage1
by about 5%. We could get this back at the cost of some memory,
but it would be nice to avoid that.
* UPB_STOP -> UPB_BREAK, better represents breaking
out of a parsing loop.
* UPB_STATUS_OK -> UPB_OK, for all status codes, more
concise at no readability cost (perhaps an improvement).
1. the start and end callbacks can now return
a upb_flow_t and set a status message.
2. clarified some semantics around passing an
error status back from the callbacks.
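A hypothetical illustration of a start callback that returns a flow
value and sets a status message. flow_t, status_t and the driver below
are stand-ins invented for this note, not the real upb_flow_t or
upb_status definitions, and the exact error semantics are assumed.

```c
#include <stdio.h>

/* Stand-ins for upb_flow_t and upb_status (assumed shapes). */
typedef enum { FLOW_CONTINUE, FLOW_BREAK } flow_t;
typedef struct { int ok; char msg[64]; } status_t;

typedef flow_t start_cb(void *closure, status_t *status);

/* Example start callback: refuses to nest more than two levels deep.
 * On refusal it records why in the status and breaks out. */
static flow_t limit_depth_start(void *closure, status_t *status) {
  int *depth = closure;
  if (++*depth > 2) {
    status->ok = 0;
    snprintf(status->msg, sizeof(status->msg),
             "nesting too deep: %d", *depth);
    return FLOW_BREAK;
  }
  return FLOW_CONTINUE;
}

/* Toy driver: calls the start callback up to n times and stops at the
 * first one that does not return FLOW_CONTINUE. Returns the number of
 * calls made. */
static int run_starts(start_cb *cb, void *closure, int n,
                      status_t *status) {
  int calls = 0;
  while (calls < n) {
    calls++;
    if (cb(closure, status) != FLOW_CONTINUE) break;
  }
  return calls;
}
```

The caller can then inspect the status to tell a deliberate break from
an error, which is the part of the semantics this change clarified.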