Publishes "Editions: Group Migration Issues" to GitHub.

PiperOrigin-RevId: 626131189
10 months ago · 8bace99998
parent 2be4364e4c
commit 8bace99998
2 changed files with 403 additions and 1 deletions
--- a/docs/design/editions/README.md
+++ b/docs/design/editions/README.md
@ -37,4 +37,5 @@ The following topics are in this repository:
 *   [Edition Naming](edition-naming.md)
 *   [Editions Feature Visibility](editions-feature-visibility.md)
 *   [Legacy Syntax Editions](legacy-syntax-editions.md)
-*   [Editions: Feature Extension Layout](editions-feature-extension-layout.md)
+*   [Editions: Feature Extension Layout](editions-feature-extension-layout.md)
 *   [Editions: Group Migration Issues](group-migration-issues.md)
--- a/docs/design/editions/group-migration-issues.md
+++ b/docs/design/editions/group-migration-issues.md
@ -0,0 +1,401 @@
 # Editions: Group Migration Issues
 **Authors**: [@mkruskal-google](https://github.com/mkruskal-google)
 ## Summary
 Address some unexpected issues in delimited encoding in edition 2023 before its
 OSS release.
 ## Background
 Joshua Humphries reported some well-timed
 [issues](https://github.com/protocolbuffers/protobuf/issues/16239) discovered
 while experimenting with our early release of Edition 2023. He discovered that
 our new message encoding feature piggybacked a bit too much on the old group
 logic, and actually ended up being virtually useless in general.
 None of our testing or migrations caught this because they were heavily focused
 on *preserving* old behavior (which is the primary goal of edition 2023).
 Delimited messages structured exactly like proto2 groups (e.g. message and field
 in the same scope with matching names) continued to work exactly as before,
 making it seem like everything was fine.
 All of this is especially problematic in light of *Submessages: In Pursuit of a
 More Perfect Encoding* (not available externally yet), which intends to migrate the
 ecosystem to use delimited encoding everywhere. Releasing a semi-broken feature
 as a migration tool to eliminate a deprecated syntax is one thing, but trying to
 push the ecosystem to it is especially bad.
 ## Overview
 The problems here stem from the fact that before edition 2023, the field and
 type name of group fields was guaranteed to always be unique and intuitive.
 Proto2 splits groups into a synthetic nested message with a type name equivalent
 to the group specification (required to be capitalized), and a field name that's
 fully lowercased. For example,
 ```
 optional group MyGroup = 1 { ... }
 ```
 would become:
 ```
 message MyGroup { ... }
 optional MyGroup mygroup = 1;
 ```
 The casing here is very important, since the transformation is irreversible. We
 can't recover the group name from the field name in general, only if the group
 is a single word.
 The problem under edition 2023 is that we've removed the generation of
 synchronized synthetic messages from the language. Users now explicitly define
 messages, and any message field can be marked `DELIMITED`. This means that
 anyone assuming that the type and field name are synchronized could now be
 broken.
 ### Codegen
 While using the field name for generated APIs required less special-casing in
 the generators, the field name ends up producing slightly-less-readable APIs for
 multi-word camelcased groups. The result is that we see a fairly random-seeming
 mix in different generators. Using protoc-explorer (not available externally),
 we find the following:
 <table>
  <tr>
   <td><strong>Language</strong>
   </td>
   <td><strong>Generated APIs</strong>
   </td>
   <td><strong>Example proto2 getter</strong>
   </td>
  </tr>
  <tr>
   <td>C++
   </td>
   <td>field
   </td>
   <td><code>MyGroup mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Java (all)
   </td>
   <td>message
   </td>
   <td><code>MyGroup getMyGroup()</code>
   </td>
  </tr>
  <tr>
   <td>Python
   </td>
   <td>field
   </td>
   <td><code>mygroup</code>
   </td>
  </tr>
  <tr>
   <td>Go (all)
   </td>
   <td>field
   </td>
   <td><code>GetMygroup() *Foo_MyGroup</code>
   </td>
  </tr>
  <tr>
   <td>Dart V1
   </td>
   <td>field/message*
   </td>
   <td><code>get mygroup</code>
   </td>
  </tr>
  <tr>
   <td>upb **
   </td>
   <td>field
   </td>
   <td><code>Foo_mygroup()</code>
   </td>
  </tr>
  <tr>
   <td>Objective-c
   </td>
   <td>message
   </td>
   <td><code>MyGroup* myGroup</code>
   </td>
  </tr>
  <tr>
   <td>Swift
   </td>
   <td>message
   </td>
   <td><code>MyGroup myGroup</code>
   </td>
  </tr>
  <tr>
   <td>C#
   </td>
   <td>field/message*
   </td>
   <td><code>MyGroup Mygroup</code>
   </td>
  </tr>
 </table>
 \* This codegen difference was [caught](cl/611144002) during the implementation
 and intentionally "fixed" in Edition 2023. \
 \*\* This includes all upb-based runtimes as well (e.g. Ruby, Rust, etc.) \
 † Extensions use field
 In the Dart V1 implementation, we decided to intentionally introduce a behavior
 change on editions upgrades. It was determined that this only affected a handful
 of protos in google3, and could probably be manually fixed as-needed. Java's
 handling changes the story significantly, since over 50% of protos in google3
 produce generated Java code. Objective-C is also noteworthy since we open-source
 it, and Swift because it's widely used in OSS and we don't own it.
 While the editions upgrade is still non-breaking, it means that the generated
 APIs could have very surprising spellings and may not be unique. For example,
 using the same type for two delimited fields in the same containing message will
 create two sets of generated APIs with the same name in some languages!
 ### Text Format
 Our "official"
 [draft specification](https://protobuf.dev/reference/protobuf/textformat-spec/)
 of text-format explicitly states that group messages are encoded by the *message
 name*, rather than the lowercases field name. A group `MyGroup` will be
 serialized as:
 ```
 MyGroup {
  ...
 }
 ```
 In C++, we always serialize the message name and have special handling to only
 accept the message name in parsing. We also have conformance tests locking down
 the positive path here (i.e. using the message name round-trip). The negative
 path (i.e. failing to accept the field name) doesn't have a conformance test,
 but C++/Java/Python all agree and there's no known case that doesn't.
 To make things even stranger, for *extensions* (group fields extending other
 messages), we always use the field name for groups. So as far as group
 extensions are concerned, there's no problem for editions.
 There are a few problems with non-extension group fields in editions:
 *   Refactoring the message name will change any text-format output
 *   New delimited fields will have unexpected text-format output, that *could*
    conflict with other fields
 *   Text parsers will expect the message name, which is surprising and could be
    impossible to specify uniquely
 ## Recommendation
 Clearly the end-state we want is for the field name to be used in all generated
 APIs, and for text-format serialization/parsing. The only questions are: how do
 we get there and can/should we do it in time for the 2023 release in 27.0 next
 month?
 We propose a combination of the alternatives listed below.
 [Smooth Extension](#smooth-extension) seems like the best short-term path
 forward to unblock the delimited migration. It *mostly* solves the problem and
 doesn't require any new features. The necessary changes for this approach have
 already been prepared, along with new conformance tests to lock down the
 behavior changes.
 [Global Feature](#global-feature) is a good long-term mitigation for tech debt
 we're leaving behind with *Smooth Extension*. Ultimately we would like to remove
 any labeling of fields by their type, and editions provides a good mechanism to
 do this. Alternatively, we could implement [aliases](#aliases) and use that to
 unify this old behavior and avoid a new feature. Either of these options will be
 the next step after the release of 2023, with aliases being preferred as long as
 the timing works out.
 If we hit any unexpected delays, Nerf Delimited Encoding in 2023 (not available
 externally) is the quickest path forward to unblock the release of edition 2023.
 It has a lot of downsides though, and will block any migration towards delimited
 encoding until edition 2024 has started rolling out.
 ## Alternatives
 ### Smooth Extension {#smooth-extension}
 Instead of trying to change the existing behavior, we could expand the current
 spec to try to cover both proto2 and editions. We would define a "group-like"
 concept, which applies to all fields which:
 *   Have `DELIMITED` encoding
 *   Have a type corresponding to a nested message directly under its containing
    message
 *   Have a name corresponding to its lowercased type name.
 Note that proto2 groups will *always* be "group-like."
 For any group-like field we will use the old proto2 semantics, whatever they are
 today. Otherwise, we will treat them as regular fields for both codegen and
 text-format. This means that *most* new cases of delimited encoding will have
 the desired behavior, while *all* old groups will continue to function. The main
 exception here is that users will see the unexpected proto2 behavior if they
 have message/field names that *happen* to match.
 While the old behavior will result in some unexpected capitalization when it's
 hit, it's mostly safe. Because of 2 and 3 (and the fact that we disallow
 duplicate field names), we can guarantee that in both codegen and text encoding
 there will never be any conflicting symbols. There can never be two delimited
 fields of the same type using the old behavior, and no other messages or fields
 will exist with either spelling.
 Additionally, we will update the text parsers to accept **both** the old
 message-based spelling and the new field-based spelling for group-like fields.
 This will at least prevent parsing failures if users hit this unexpected change
 in behavior.
 #### Pros
 *   Fully supports old proto2 behavior
 *   Treats most new editions fields correctly
 *   Doesn't allow for any of the problematic cases we see today
 *   By updating the parsers to accept both, we have a migration path to change
    the "wire"-format
 *   Decoupled from editions launch (since it's a non-breaking change w/o a
    feature)
 #### Cons
 *   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
 *   Keeps the old proto2 behavior around indefinitely, with no path to remove it
 *   Plants surprising edge case for users if they happen to name their
    message/fields a certain way
 ### Global Feature {#global-feature}
 The simplest answer here is to introduce a new global message feature
 `legacy_group_handling` to control all the changes we'd like. This will only be
 applicable to group-like fields (see
 [Smooth Extension](?tab=t.0#heading=h.blnhard1tpyx)). With this feature enabled,
 these fields will always use their message name for text-format. Each
 non-conformant language could also use this feature to gate the codegen rules.
 #### Pros
 *   Simple boolean to gate all the behavior changes
 *   Doesn't require adding language features to a bunch of languages that don't
    have them yet
 *   Uses editions to ratchet down the bad behavior
 #### Cons
 *   It's a little late in the game to be introducing new features to 2023
    (go/edition-lifetimes)
 *   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
 *   The migration story for users is unclear. Overriding the value of this
    feature is both a "wire"-breaking and API-breaking change they may not be
    able to do easily.
 *   With the feature set, users will still see all of the problems we have today
 ### Feature Suite
 An extension of [Global feature](?tab=t.0#heading=h.mvtf74vplkdg) would be to
 split the codegen changes out into separate per-language features.
 #### Pros
 *   Simple booleans to gate all the distinct behavior changes
 *   Uses editions to ratchet down the bad behavior
 *   Better migration story for users, since it separates API and "wire" breaking
    changes
 #### Cons
 *   Requires a whole slew of new language features, which typically have a
    difficult first-time setup
 *   Requires coordinated changes in every editions-compatible runtime (and many
    generators)
 *   Increases the complexity of edition 2023 significantly
 *   With the features set, users will still see all of the problems we have
    today
 ### Nerf Delimited Encoding in 2023
 A quick fix to avoid releasing a bad feature would be to simply ban the case
 where the message and field names don't match. Adding this validation to protoc
 would cover the majority of cases, although we might want additional checks in
 every language that supports dynamic messages.
 This is a good fallback option if we can't implement anything better before 27.0
 is released. It allows us to release editions in a reasonable state, where we
 can fix these issues and release a more functional `DELIMITED` feature in 2024.
 #### Pros
 *   Unblocks editions rollout
 *   Easy and safe to implement
 *   Avoids rushed implementation of a proper fix
 *   Avoids runtime issues with text format
 *   Avoids unexpected build breakages post-editions (e.g. renaming the nested
    message)
 #### Cons
 *   We'd still be releasing a really bad feature. Instead of opening up new
    possibilities, it's just "like groups but worse"
 *   We couldn't fix this in 2023 without potential version skew from third party
    plugins. We'd likely have to wait until edition 2024
 *   Might requires coordinated changes in a lot of runtimes
 *   Doesn't unblock our effort to roll out delimited
 ### Rename Fields in Editions
 While it might be tempting to leverage the edition 2023 upgrade as a place we
 can just rename the group field, that doesn't actually work (e.g. rename
 `mygroup` to `my_group`). Because so many runtimes already use the *field name*
 in generated APIs, they would break under this transformation.
 #### Pros
 *   Works really well for text-format and some languages
 #### Cons
 *   Turns 2023 upgrade into a breaking change for many languages
 ### Aliases {#aliases}
 We've discussed aliases a lot mostly in the context of `Any`, but they would be
 useful for any encoding scheme that locks down field/message names. If we had a
 fully implemented alias system in place, it would be the perfect mitigation
 here. Unfortunately, we don't yet and the timeline here is probably too tight to
 implement one.
 #### Pros
 *   Fixes all of the problems mentioned above
 *   Allows us to specify the old behavior using the proto language, which allows
    it to be handled by Prototiller
 #### Cons
 *   We want this to be a real fully thought-out feature, not a hack rushed into
    a tight timeline
 ### Do Nothing
 Doing nothing doesn't actually break anyone, but it is embarrassing.
 #### Pros
 *   Easy to do
 #### Cons
 *   Releases a horrible feature full of foot-guns in our first edition
 *   Doesn't unblock our effort to roll out delimited