diff --git a/docs/design/editions/README.md b/docs/design/editions/README.md index 483ce860c0..c551f91ab0 100644 --- a/docs/design/editions/README.md +++ b/docs/design/editions/README.md @@ -36,3 +36,4 @@ The following topics are in this repository: * [Edition Naming](edition-naming.md) * [Editions Feature Visibility](editions-feature-visibility.md) * [Legacy Syntax Editions](legacy-syntax-editions.md) +* [Editions: Feature Extension Layout](editions-feature-extension-layout.md) \ No newline at end of file diff --git a/docs/design/editions/editions-feature-extension-layout.md b/docs/design/editions/editions-feature-extension-layout.md new file mode 100644 index 0000000000..67626a4388 --- /dev/null +++ b/docs/design/editions/editions-feature-extension-layout.md @@ -0,0 +1,150 @@ +# Editions: Feature Extension Layout + +**Author:** [@mkruskal-google](https://github.com/mkruskal-google), +[@zhangskz](https://github.com/zhangskz) + +**Approved:** 2023-08-23 + +## Background + +"[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan +for allowing for more targeted features not owned by the protobuf team. It uses +extensions of the global features proto to implement this. One thing that was +left a bit ambiguous was *who* should own these extensions. Language, code +generator, and runtime implementations are all similar but not identical +distinctions. + +"Editions Zero Feature: utf8_validation" (not available externally, though a +later version, +"[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)" +is) is a recent plan to add a new set of generator features for utf8 validation. +While the sole feature we had originally created (`legacy_closed_enum` in Java +and C++) didn't have any ambiguity here, this one did. Specifically in Python, +the current behaviors across proto2/proto3 are distinct for all 3 +implementations: pure python, Python/C++, Python/upb. + +## Overview + +In meetings, we've discussed various alternatives, captured below. The original +plan was to make feature extensions runtime implementation-specific (e.g. C++, +Java, Python, upb). There are some notable complications that came up though: + +1. **Polyglot** - it's not clear how upb or C++ runtimes should behave in + multi-language situations. Which feature sets do they consider for runtime + behaviors? *Note: this is already a serious issue today, where all proto2 + strings and many proto3 strings are completely unsafe across languages.* + +2. **Shared Implementations** - Runtimes like upb and C++ are used as backing + implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP). + If we have a single set of `upb` or `cpp` features, migrating to those + shared implementations would be more difficult (since there's no independent + switches per-language). *Note: this is already the situation we're in today, + where switching the runtime implementation can cause subtle and dangerous + behavior changes.* + +Given that we only have two behaviors, and one of them is unambiguous, it seems +reasonable to punt on this decision until we have more information. We may +encounter more edge cases that require feature extensions (and give us more +information) during the rollout of edition zero. We also have a lot of freedom +to re-model features in later editions, so keeping the initial implementation as +simple as possible seems best (i.e. Alternative 2). + +## Alternatives + +### Alternative 1: Runtime Implementation Features + +Features would be per-runtime implementation as originally described in +"Editions Zero Feature: utf8_validation." For example, Protobuf Python users +would set different features depending on the backing implementation (e.g. +`features.(pb.cpp).`, `features.(pb.upb).`). + +#### Pros + +* Most consistent with range of behaviors expressible pre-Editions + +#### Cons + +* Implementation may / should not be obvious to users. +* Lack of levers specifically for language / implementation combos. For + example, there is no way to set Python-C++ behavior independently of C++ + behavior which may make migration harder from other Python implementations. + +### Alternative 2: Generator Features + +Features would be per-generator only (i.e. each protoc plugin would own one set +of features). This was the second decision we made in later discussions, and +while very similar to the above alternative, it's more inline with our goal of +making features primarily for codegen. + +For example, all Python implementations would share the same set of features +(e.g. `features.(pb.python).`). However, certain features could be +targeted to specific implementations (e.g. +`features.(pb.python).upb_utf8_validation` would only be used by Python/upb). + +#### Pros + +* Allows independent controls of shared implementations in different target + languages (e.g. Python's upb feature won't affect PHP). + +#### Cons + +* Possible complexity in upb to understand which language's features to + respect. UPB is not currently aware of what language it is being used for. +* Limits in-process sharing across languages with shared implementations (e.g. + Python upb, PHP upb) in the case of conflicting behaviors. + * Additional checks may be needed. + +### Alternative 3: Migrate to bytes + +Since this whole discussion revolves around the utf8 validation feature, one +option would be to just remove it from edition zero. Instead of adding a new +toggle for UTF8 behavior, we could simply migrate everyone who doesn't enforce +utf8 today to `bytes`. This would likely need another new *codegen* feature for +generating byte getters/setters as strings, but that wouldn't have any of the +ambiguity we're seeing today. + +Unfortunately, this doesn't seem feasible because of all the different behaviors +laid out in "Editions Zero Feature: utf8_validation." UTF8 validation isn't +really a binary on/off decision, and it can vary widely between languages. There +are many cases where UTF8 is validated in **some** languages but not others, and +there's also the C++ "hint" behavior that logs errors but allows invalid UTF8. + +**Note:** This could still be partially done in a follow-up LSC by targeting +specific combinations of the new feature that disable validation in all relevant +languages. + +#### Pros + +* Punts on the issue, we wouldn't need any upb features and C++ features would + all be code-gen only +* Simplifies the situation, avoids adding a very complicated feature in + edition zero + +#### Cons + +* Not really possible given the current complexity +* There are O(10M) proto2 string fields that would be blindly changed to bytes + +### Alternative 4: Nested Features + +Another option is to allow for shared feature set messages. For example, upb +would define a feature message, but *not* make it an extension of the global +`FeatureSet`. Instead, languages with upb implementations would have a field of +this type to allow for finer-grained controls. C++ would both extend the global +`FeatureSet` and also be allowed as a field in other languages. + +For example, python utf8 validation could be specified as: + +We could have checks during feature validation that enforce that impossible +combinations aren't specified. For example, with our current implementation +`features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`, +since we don't have any mechanism for distinguishing them. + +#### Pros + +* Much more explicit than options 1 and 2 + +#### Cons + +* Maybe too explicit? Proto owners would be forced to duplicate a lot of + features