14 KiB
Editions: Group Migration Issues
Authors: @mkruskal-google
Summary
Address some unexpected issues in delimited encoding in edition 2023 before its OSS release.
Background
Joshua Humphries reported some well-timed issues discovered while experimenting with our early release of Edition 2023. He discovered that our new message encoding feature piggybacked a bit too much on the old group logic, and actually ended up being virtually useless in general.
None of our testing or migrations caught this because they were heavily focused on preserving old behavior (which is the primary goal of edition 2023). Delimited messages structured exactly like proto2 groups (e.g. message and field in the same scope with matching names) continued to work exactly as before, making it seem like everything was fine.
All of this is especially problematic in light of Submessages: In Pursuit of a More Perfect Encoding (not available externally yet), which intends to migrate the ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature as a migration tool to eliminate a deprecated syntax is one thing, but trying to push the ecosystem to it is especially bad.
Overview
The problems here stem from the fact that before edition 2023, the field and type name of group fields was guaranteed to always be unique and intuitive. Proto2 splits groups into a synthetic nested message with a type name equivalent to the group specification (required to be capitalized), and a field name that's fully lowercased. For example,
optional group MyGroup = 1 { ... }
would become:
message MyGroup { ... }
optional MyGroup mygroup = 1;
The casing here is very important, since the transformation is irreversible. We can't recover the group name from the field name in general, only if the group is a single word.
The problem under edition 2023 is that we've removed the generation of
synchronized synthetic messages from the language. Users now explicitly define
messages, and any message field can be marked DELIMITED
. This means that
anyone assuming that the type and field name are synchronized could now be
broken.
Codegen
While using the field name for generated APIs required less special-casing in the generators, the field name ends up producing slightly-less-readable APIs for multi-word camelcased groups. The result is that we see a fairly random-seeming mix in different generators. Using protoc-explorer (not available externally), we find the following:
Language | Generated APIs | Example proto2 getter |
C++ | field | MyGroup mygroup()
|
Java (all) | message | MyGroup getMyGroup()
|
Python | field | mygroup
|
Go (all) | field | GetMygroup() *Foo_MyGroup
|
Dart V1 | field/message* | get mygroup
|
upb ** | field | Foo_mygroup()
|
Objective-c | message | MyGroup* myGroup
|
Swift | message | MyGroup myGroup
|
C# | field/message* | MyGroup Mygroup
|
* This codegen difference was caught during the implementation
and intentionally "fixed" in Edition 2023.
** This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.)
† Extensions use field
In the Dart V1 implementation, we decided to intentionally introduce a behavior change on editions upgrades. It was determined that this only affected a handful of protos in google3, and could probably be manually fixed as-needed. Java's handling changes the story significantly, since over 50% of protos in google3 produce generated Java code. Objective-C is also noteworthy since we open-source it, and Swift because it's widely used in OSS and we don't own it.
While the editions upgrade is still non-breaking, it means that the generated APIs could have very surprising spellings and may not be unique. For example, using the same type for two delimited fields in the same containing message will create two sets of generated APIs with the same name in some languages!
Text Format
Our "official"
draft specification
of text-format explicitly states that group messages are encoded by the message
name, rather than the lowercases field name. A group MyGroup
will be
serialized as:
MyGroup {
...
}
In C++, we always serialize the message name and have special handling to only accept the message name in parsing. We also have conformance tests locking down the positive path here (i.e. using the message name round-trip). The negative path (i.e. failing to accept the field name) doesn't have a conformance test, but C++/Java/Python all agree and there's no known case that doesn't.
To make things even stranger, for extensions (group fields extending other messages), we always use the field name for groups. So as far as group extensions are concerned, there's no problem for editions.
There are a few problems with non-extension group fields in editions:
- Refactoring the message name will change any text-format output
- New delimited fields will have unexpected text-format output, that could conflict with other fields
- Text parsers will expect the message name, which is surprising and could be impossible to specify uniquely
Recommendation
Clearly the end-state we want is for the field name to be used in all generated APIs, and for text-format serialization/parsing. The only questions are: how do we get there and can/should we do it in time for the 2023 release in 27.0 next month?
We propose a combination of the alternatives listed below. Smooth Extension seems like the best short-term path forward to unblock the delimited migration. It mostly solves the problem and doesn't require any new features. The necessary changes for this approach have already been prepared, along with new conformance tests to lock down the behavior changes.
Global Feature is a good long-term mitigation for tech debt we're leaving behind with Smooth Extension. Ultimately we would like to remove any labeling of fields by their type, and editions provides a good mechanism to do this. Alternatively, we could implement aliases and use that to unify this old behavior and avoid a new feature. Either of these options will be the next step after the release of 2023, with aliases being preferred as long as the timing works out.
If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available externally) is the quickest path forward to unblock the release of edition 2023. It has a lot of downsides though, and will block any migration towards delimited encoding until edition 2024 has started rolling out.
Alternatives
Smooth Extension
Instead of trying to change the existing behavior, we could expand the current spec to try to cover both proto2 and editions. We would define a "group-like" concept, which applies to all fields which:
- Have
DELIMITED
encoding - Have a type corresponding to a nested message directly under its containing message
- Have a name corresponding to its lowercased type name.
Note that proto2 groups will always be "group-like."
For any group-like field we will use the old proto2 semantics, whatever they are today. Otherwise, we will treat them as regular fields for both codegen and text-format. This means that most new cases of delimited encoding will have the desired behavior, while all old groups will continue to function. The main exception here is that users will see the unexpected proto2 behavior if they have message/field names that happen to match.
While the old behavior will result in some unexpected capitalization when it's hit, it's mostly safe. Because of 2 and 3 (and the fact that we disallow duplicate field names), we can guarantee that in both codegen and text encoding there will never be any conflicting symbols. There can never be two delimited fields of the same type using the old behavior, and no other messages or fields will exist with either spelling.
Additionally, we will update the text parsers to accept both the old message-based spelling and the new field-based spelling for group-like fields. This will at least prevent parsing failures if users hit this unexpected change in behavior.
Pros
- Fully supports old proto2 behavior
- Treats most new editions fields correctly
- Doesn't allow for any of the problematic cases we see today
- By updating the parsers to accept both, we have a migration path to change the "wire"-format
- Decoupled from editions launch (since it's a non-breaking change w/o a feature)
Cons
- Requires coordinated changes in every editions-compatible runtime (and many generators)
- Keeps the old proto2 behavior around indefinitely, with no path to remove it
- Plants surprising edge case for users if they happen to name their message/fields a certain way
Global Feature
The simplest answer here is to introduce a new global message feature
legacy_group_handling
to control all the changes we'd like. This will only be
applicable to group-like fields (see
Smooth Extension). With this feature enabled,
these fields will always use their message name for text-format. Each
non-conformant language could also use this feature to gate the codegen rules.
Pros
- Simple boolean to gate all the behavior changes
- Doesn't require adding language features to a bunch of languages that don't have them yet
- Uses editions to ratchet down the bad behavior
Cons
- It's a little late in the game to be introducing new features to 2023 (go/edition-lifetimes)
- Requires coordinated changes in every editions-compatible runtime (and many generators)
- The migration story for users is unclear. Overriding the value of this feature is both a "wire"-breaking and API-breaking change they may not be able to do easily.
- With the feature set, users will still see all of the problems we have today
Feature Suite
An extension of Global feature would be to split the codegen changes out into separate per-language features.
Pros
- Simple booleans to gate all the distinct behavior changes
- Uses editions to ratchet down the bad behavior
- Better migration story for users, since it separates API and "wire" breaking changes
Cons
- Requires a whole slew of new language features, which typically have a difficult first-time setup
- Requires coordinated changes in every editions-compatible runtime (and many generators)
- Increases the complexity of edition 2023 significantly
- With the features set, users will still see all of the problems we have today
Nerf Delimited Encoding in 2023
A quick fix to avoid releasing a bad feature would be to simply ban the case where the message and field names don't match. Adding this validation to protoc would cover the majority of cases, although we might want additional checks in every language that supports dynamic messages.
This is a good fallback option if we can't implement anything better before 27.0
is released. It allows us to release editions in a reasonable state, where we
can fix these issues and release a more functional DELIMITED
feature in 2024.
Pros
- Unblocks editions rollout
- Easy and safe to implement
- Avoids rushed implementation of a proper fix
- Avoids runtime issues with text format
- Avoids unexpected build breakages post-editions (e.g. renaming the nested message)
Cons
- We'd still be releasing a really bad feature. Instead of opening up new possibilities, it's just "like groups but worse"
- We couldn't fix this in 2023 without potential version skew from third party plugins. We'd likely have to wait until edition 2024
- Might requires coordinated changes in a lot of runtimes
- Doesn't unblock our effort to roll out delimited
Rename Fields in Editions
While it might be tempting to leverage the edition 2023 upgrade as a place we
can just rename the group field, that doesn't actually work (e.g. rename
mygroup
to my_group
). Because so many runtimes already use the field name
in generated APIs, they would break under this transformation.
Pros
- Works really well for text-format and some languages
Cons
- Turns 2023 upgrade into a breaking change for many languages
Aliases
We've discussed aliases a lot mostly in the context of Any
, but they would be
useful for any encoding scheme that locks down field/message names. If we had a
fully implemented alias system in place, it would be the perfect mitigation
here. Unfortunately, we don't yet and the timeline here is probably too tight to
implement one.
Pros
- Fixes all of the problems mentioned above
- Allows us to specify the old behavior using the proto language, which allows it to be handled by Prototiller
Cons
- We want this to be a real fully thought-out feature, not a hack rushed into a tight timeline
Do Nothing
Doing nothing doesn't actually break anyone, but it is embarrassing.
Pros
- Easy to do
Cons
- Releases a horrible feature full of foot-guns in our first edition
- Doesn't unblock our effort to roll out delimited