Protocol Buffers - Google's data interchange format (grpc dependency)
https://developers.google.com/protocol-buffers/
# Editions: Group Migration Issues

**Authors**: [@mkruskal-google](https://github.com/mkruskal-google)

## Summary

Address some unexpected issues in delimited encoding in edition 2023 before its
OSS release.

## Background

Joshua Humphries reported some well-timed
[issues](https://github.com/protocolbuffers/protobuf/issues/16239) discovered
while experimenting with our early release of Edition 2023. He discovered that
our new message encoding feature piggybacked a bit too much on the old group
logic, and actually ended up being virtually useless in general.

None of our testing or migrations caught this because they were heavily focused
on *preserving* old behavior (which is the primary goal of edition 2023).
Delimited messages structured exactly like proto2 groups (e.g. message and field
in the same scope with matching names) continued to work exactly as before,
making it seem like everything was fine.

All of this is especially problematic in light of *Submessages: In Pursuit of a
More Perfect Encoding* (not available externally yet), which intends to migrate the
ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature
as a migration tool to eliminate a deprecated syntax is one thing, but trying to
push the ecosystem to it is especially bad.

## Overview

The problems here stem from the fact that before edition 2023, the field and
type names of group fields were guaranteed to always be unique and intuitive.
Proto2 splits groups into a synthetic nested message with a type name equivalent
to the group specification (required to be capitalized), and a field name that's
fully lowercased. For example,

```
optional group MyGroup = 1 { ... }
```

would become:

```
message MyGroup { ... }
optional MyGroup mygroup = 1;
```

The casing here is very important, since the transformation is irreversible. We
can't recover the group name from the field name in general, only if the group
is a single word.
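
To make the irreversibility concrete, here is a minimal Python sketch. The helper `group_field_name` is a hypothetical name used only for illustration; it models the lowercasing rule described above and is not protoc code.

```python
def group_field_name(group_name: str) -> str:
    """Field name proto2 derives from a group name: fully lowercased."""
    return group_name.lower()


# Several distinct capitalized group names collapse to the same field name,
# so the original casing cannot be recovered from the field name alone.
candidates = ["MyGroup", "Mygroup", "MYGroup"]
field_names = {name: group_field_name(name) for name in candidates}
print(field_names)

# Re-capitalizing the first letter only round-trips single-word names.
single_word_ok = group_field_name("Group").capitalize() == "Group"
multi_word_ok = group_field_name("MyGroup").capitalize() == "MyGroup"
print(single_word_ok, multi_word_ok)
```

Re-capitalizing the first letter restores `Group`, but turns `mygroup` into `Mygroup` rather than `MyGroup`, which is the single-word caveat noted above.
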
The problem under edition 2023 is that we've removed the generation of
synchronized synthetic messages from the language. Users now explicitly define
messages, and any message field can be marked `DELIMITED`. This means that
anyone assuming that the type and field name are synchronized could now be
broken.

### Codegen

While using the field name for generated APIs required less special-casing in
the generators, the field name ends up producing slightly-less-readable APIs for
multi-word camelcased groups. The result is that we see a fairly random-seeming
mix in different generators. Using protoc-explorer (not available externally),
we find the following:

<table>
  <tr>
    <td><strong>Language</strong></td>
    <td><strong>Generated APIs</strong></td>
    <td><strong>Example proto2 getter</strong></td>
  </tr>
  <tr>
    <td>C++</td>
    <td>field</td>
    <td><code>MyGroup mygroup()</code></td>
  </tr>
  <tr>
    <td>Java (all)</td>
    <td>message</td>
    <td><code>MyGroup getMyGroup()</code></td>
  </tr>
  <tr>
    <td>Python</td>
    <td>field</td>
    <td><code>mygroup</code></td>
  </tr>
  <tr>
    <td>Go (all)</td>
    <td>field</td>
    <td><code>GetMygroup() *Foo_MyGroup</code></td>
  </tr>
  <tr>
    <td>Dart V1</td>
    <td>field/message*</td>
    <td><code>get mygroup</code></td>
  </tr>
  <tr>
    <td>upb **</td>
    <td>field</td>
    <td><code>Foo_mygroup()</code></td>
  </tr>
  <tr>
    <td>Objective-C</td>
    <td>message</td>
    <td><code>MyGroup* myGroup</code></td>
  </tr>
  <tr>
    <td>Swift</td>
    <td>message</td>
    <td><code>MyGroup myGroup</code></td>
  </tr>
  <tr>
    <td>C#</td>
    <td>field/message*</td>
    <td><code>MyGroup Mygroup</code></td>
  </tr>
</table>

\* This codegen difference was [caught](cl/611144002) during the implementation
and intentionally "fixed" in Edition 2023. \
\*\* This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.) \
† Extensions use field

In the Dart V1 implementation, we decided to intentionally introduce a behavior
change on editions upgrades. It was determined that this only affected a handful
of protos in google3, and could probably be manually fixed as-needed. Java's
handling changes the story significantly, since over 50% of protos in google3
produce generated Java code. Objective-C is also noteworthy since we open-source
it, and Swift because it's widely used in OSS and we don't own it.

While the editions upgrade is still non-breaking, it means that the generated
APIs could have very surprising spellings and may not be unique. For example,
using the same type for two delimited fields in the same containing message will
create two sets of generated APIs with the same name in some languages!
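
The collision can be shown with a toy Python model. It assumes a generator that derives accessor names from the message type, in the style of the Java row in the table. `Field` and `getter_from_message_name` are hypothetical names for illustration and do not correspond to any real generator code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str          # field name as written in the .proto file
    message_type: str  # short name of the field's message type


def getter_from_message_name(field: Field) -> str:
    # Hypothetical message-name-based rule, modeled on `getMyGroup()`.
    return f"get{field.message_type}"


# Two delimited fields in one message that share the same type.
fields = [Field("first", "MyGroup"), Field("second", "MyGroup")]
getters = [getter_from_message_name(f) for f in fields]

# Any accessor name produced more than once is a codegen conflict.
duplicates = [name for name, count in Counter(getters).items() if count > 1]
print(duplicates)
```

Both fields map to `getMyGroup`, whereas a field-name-based rule would keep them distinct because duplicate field names are disallowed.
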
### Text Format

Our "official"
[draft specification](https://protobuf.dev/reference/protobuf/textformat-spec/)
of text-format explicitly states that group messages are encoded by the *message
name*, rather than the lowercased field name. A group `MyGroup` will be
serialized as:

```
MyGroup {
  ...
}
```

In C++, we always serialize the message name and have special handling to only
accept the message name in parsing. We also have conformance tests locking down
the positive path here (i.e. using the message name round-trip). The negative
path (i.e. failing to accept the field name) doesn't have a conformance test,
but C++/Java/Python all agree and there's no known implementation that behaves
differently.

To make things even stranger, for *extensions* (group fields extending other
messages), we always use the field name for groups. So as far as group
extensions are concerned, there's no problem for editions.

There are a few problems with non-extension group fields in editions:

* Refactoring the message name will change any text-format output
* New delimited fields will have unexpected text-format output that *could*
  conflict with other fields
* Text parsers will expect the message name, which is surprising and could be
  impossible to specify uniquely
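
The first and third problems can be sketched in Python. This is a simplified model of the message-name rule for non-extension fields described above. `Field` and `legacy_text_key` are hypothetical names, and the example field and type names are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    name: str          # field name as written in the .proto file
    message_type: str  # short name of the field's message type


def legacy_text_key(field: Field) -> str:
    # Group-style rule: text-format uses the message name as the key.
    return field.message_type


# Renaming only the message changes the serialized key, even though the
# field name and number are untouched.
before = Field(name="payload", message_type="Payload")
after = replace(before, message_type="Body")
print(legacy_text_key(before), legacy_text_key(after))

# Two delimited fields sharing a type produce the same key, so a parser
# keyed on the message name cannot tell them apart.
request, response = Field("request", "Payload"), Field("response", "Payload")
ambiguous = legacy_text_key(request) == legacy_text_key(response)
print(ambiguous)
```
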
## Recommendation

Clearly the end-state we want is for the field name to be used in all generated
APIs, and for text-format serialization/parsing. The only questions are: how do
we get there and can/should we do it in time for the 2023 release in 27.0 next
month?

We propose a combination of the alternatives listed below.
[Smooth Extension](#smooth-extension) seems like the best short-term path
forward to unblock the delimited migration. It *mostly* solves the problem and
doesn't require any new features. The necessary changes for this approach have
already been prepared, along with new conformance tests to lock down the
behavior changes.

[Global Feature](#global-feature) is a good long-term mitigation for tech debt
we're leaving behind with *Smooth Extension*. Ultimately we would like to remove
any labeling of fields by their type, and editions provides a good mechanism to
do this. Alternatively, we could implement [aliases](#aliases) and use that to
unify this old behavior and avoid a new feature. Either of these options will be
the next step after the release of 2023, with aliases being preferred as long as
the timing works out.

If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available
externally) is the quickest path forward to unblock the release of edition 2023.
It has a lot of downsides though, and will block any migration towards delimited
encoding until edition 2024 has started rolling out.

## Alternatives

### Smooth Extension {#smooth-extension}

Instead of trying to change the existing behavior, we could expand the current
spec to try to cover both proto2 and editions. We would define a "group-like"
concept, which applies to all fields which:

* Have `DELIMITED` encoding
* Have a type corresponding to a nested message directly under its containing
  message
* Have a name corresponding to its lowercased type name

Note that proto2 groups will *always* be "group-like."

For any group-like field we will use the old proto2 semantics, whatever they are
today. Otherwise, we will treat them as regular fields for both codegen and
text-format. This means that *most* new cases of delimited encoding will have
the desired behavior, while *all* old groups will continue to function. The main
exception here is that users will see the unexpected proto2 behavior if they
have message/field names that *happen* to match.
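
The three conditions translate into a small predicate. The Python below is a sketch under stated assumptions, not the actual implementation in any runtime. `Field`, `is_group_like`, and `text_format_name` are hypothetical names. The text-format rule models non-extension fields only, and "old proto2 semantics" is taken to mean the message-name spelling described in the Text Format section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str                # field name, e.g. "mygroup"
    delimited: bool          # True when the field uses DELIMITED encoding
    type_name: str           # short name of the message type, e.g. "MyGroup"
    type_parent: str         # full name of the scope that declares the type
    containing_message: str  # full name of the message declaring the field


def is_group_like(field: Field) -> bool:
    """All three "group-like" conditions must hold."""
    return (
        field.delimited
        and field.type_parent == field.containing_message
        and field.name == field.type_name.lower()
    )


def text_format_name(field: Field) -> str:
    # Group-like fields keep the proto2 spelling (the message name);
    # every other field is treated as a regular field.
    return field.type_name if is_group_like(field) else field.name


legacy = Field("mygroup", True, "MyGroup", "Foo", "Foo")
fresh = Field("payload", True, "MyGroup", "Foo", "Foo")
elsewhere = Field("mygroup", True, "MyGroup", "Bar", "Foo")
print(text_format_name(legacy), text_format_name(fresh), text_format_name(elsewhere))
```

Only `legacy` satisfies all three conditions. `fresh` fails the name check and `elsewhere` fails the nesting check, so both are spelled by field name.
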
While the old behavior will result in some unexpected capitalization when it's
hit, it's mostly safe. Because of the second and third conditions above (and the
fact that we disallow duplicate field names), we can guarantee that in both
codegen and text encoding there will never be any conflicting symbols. There can
never be two delimited fields of the same type using the old behavior, and no
other messages or fields will exist with either spelling.

Additionally, we will update the text parsers to accept **both** the old
message-based spelling and the new field-based spelling for group-like fields.
This will at least prevent parsing failures if users hit this unexpected change
in behavior.
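
A minimal Python sketch of the relaxed parser lookup follows. It assumes the caller has already decided whether the field is group-like, and `accepted_text_names` is a hypothetical helper rather than a real parser API.

```python
from typing import Set


def accepted_text_names(field_name: str, type_name: str, group_like: bool) -> Set[str]:
    """Spellings a text parser would accept for one field."""
    names = {field_name}  # the field-based spelling is always accepted
    if group_like:
        names.add(type_name)  # group-like fields also accept the message name
    return names


print(sorted(accepted_text_names("mygroup", "MyGroup", group_like=True)))
print(sorted(accepted_text_names("payload", "MyGroup", group_like=False)))
```
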
#### Pros

* Fully supports old proto2 behavior
* Treats most new editions fields correctly
* Doesn't allow for any of the problematic cases we see today
* By updating the parsers to accept both, we have a migration path to change
  the "wire"-format
* Decoupled from editions launch (since it's a non-breaking change w/o a
  feature)

#### Cons

* Requires coordinated changes in every editions-compatible runtime (and many
  generators)
* Keeps the old proto2 behavior around indefinitely, with no path to remove it
* Plants a surprising edge case for users if they happen to name their
  messages/fields a certain way

### Global Feature {#global-feature}

The simplest answer here is to introduce a new global message feature
`legacy_group_handling` to control all the changes we'd like. This will only be
applicable to group-like fields (see
[Smooth Extension](#smooth-extension)). With this feature enabled,
these fields will always use their message name for text-format. Each
non-conformant language could also use this feature to gate the codegen rules.

#### Pros

* Simple boolean to gate all the behavior changes
* Doesn't require adding language features to a bunch of languages that don't
  have them yet
* Uses editions to ratchet down the bad behavior

#### Cons

* It's a little late in the game to be introducing new features to 2023
  (go/edition-lifetimes)
* Requires coordinated changes in every editions-compatible runtime (and many
  generators)
* The migration story for users is unclear. Overriding the value of this
  feature is both a "wire"-breaking and API-breaking change they may not be
  able to do easily.
* With the feature set, users will still see all of the problems we have today

### Feature Suite

An extension of [Global Feature](#global-feature) would be to
split the codegen changes out into separate per-language features.

#### Pros

* Simple booleans to gate all the distinct behavior changes
* Uses editions to ratchet down the bad behavior
* Better migration story for users, since it separates API and "wire" breaking
  changes

#### Cons

* Requires a whole slew of new language features, which typically have a
  difficult first-time setup
* Requires coordinated changes in every editions-compatible runtime (and many
  generators)
* Increases the complexity of edition 2023 significantly
* With the features set, users will still see all of the problems we have
  today

### Nerf Delimited Encoding in 2023

A quick fix to avoid releasing a bad feature would be to simply ban the case
where the message and field names don't match. Adding this validation to protoc
would cover the majority of cases, although we might want additional checks in
every language that supports dynamic messages.
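
A Python sketch of what such a check could look like. It assumes "match" means the field name equals the lowercased message name, as with proto2 groups. `check_delimited_field` is a hypothetical function and does not reflect protoc's actual validation code or error messages.

```python
def check_delimited_field(field_name: str, type_name: str) -> None:
    """Reject delimited fields whose name is not the lowercased type name."""
    expected = type_name.lower()
    if field_name != expected:
        raise ValueError(
            f"delimited field '{field_name}' of type '{type_name}' "
            f"must be named '{expected}'"
        )


check_delimited_field("mygroup", "MyGroup")  # passes silently

try:
    check_delimited_field("payload", "MyGroup")
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```
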
This is a good fallback option if we can't implement anything better before 27.0
is released. It allows us to release editions in a reasonable state, where we
can fix these issues and release a more functional `DELIMITED` feature in 2024.

#### Pros

* Unblocks editions rollout
* Easy and safe to implement
* Avoids rushed implementation of a proper fix
* Avoids runtime issues with text format
* Avoids unexpected build breakages post-editions (e.g. renaming the nested
  message)

#### Cons

* We'd still be releasing a really bad feature. Instead of opening up new
  possibilities, it's just "like groups but worse"
* We couldn't fix this in 2023 without potential version skew from third party
  plugins. We'd likely have to wait until edition 2024
* Might require coordinated changes in a lot of runtimes
* Doesn't unblock our effort to roll out delimited

### Rename Fields in Editions

While it might be tempting to leverage the edition 2023 upgrade as a place where
we can just rename the group field (e.g. rename `mygroup` to `my_group`), that
doesn't actually work. Because so many runtimes already use the *field name*
in generated APIs, they would break under this transformation.

#### Pros

* Works really well for text-format and some languages

#### Cons

* Turns 2023 upgrade into a breaking change for many languages

### Aliases {#aliases}

We've discussed aliases a lot, mostly in the context of `Any`, but they would be
useful for any encoding scheme that locks down field/message names. If we had a
fully implemented alias system in place, it would be the perfect mitigation
here. Unfortunately, we don't have one yet, and the timeline here is probably
too tight to implement one.

#### Pros

* Fixes all of the problems mentioned above
* Allows us to specify the old behavior using the proto language, which allows
  it to be handled by Prototiller

#### Cons

* We want this to be a real fully thought-out feature, not a hack rushed into
  a tight timeline

### Do Nothing

Doing nothing doesn't actually break anyone, but it is embarrassing.

#### Pros

* Easy to do

#### Cons

* Releases a horrible feature full of foot-guns in our first edition
* Doesn't unblock our effort to roll out delimited