# Editions: Feature Extension Layout

**Author:** [@mkruskal-google](https://github.com/mkruskal-google),
[@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-08-23

## Background

"[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan
to allow more targeted features that are not owned by the protobuf team. It uses
extensions of the global features proto to implement this. One thing that was
left ambiguous was *who* should own these extensions: language, code generator,
and runtime implementation are similar but not identical ways to draw the
boundary.

"Editions Zero Feature: utf8_validation" is a recent plan to add a new set of
generator features for utf8 validation (it is not available externally, though a
later version,
"[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)",
is). While the sole feature we had originally created (`legacy_closed_enum` in
Java and C++) didn't have any ambiguity here, this one did. Specifically in
Python, the current behaviors across proto2/proto3 are distinct for all three
implementations: pure Python, Python/C++, and Python/upb.

## Overview

In meetings, we've discussed various alternatives, captured below. The original
plan was to make feature extensions runtime implementation-specific (e.g. C++,
Java, Python, upb). Some notable complications came up, though:

1.  **Polyglot** - It's not clear how upb or C++ runtimes should behave in
    multi-language situations. Which feature sets do they consider for runtime
    behaviors? *Note: this is already a serious issue today, where all proto2
    strings and many proto3 strings are completely unsafe across languages.*

2.  **Shared Implementations** - Runtimes like upb and C++ are used as backing
    implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP).
    If we have a single set of `upb` or `cpp` features, migrating to those
    shared implementations would be more difficult (since there are no
    independent per-language switches). *Note: this is already the situation
    we're in today, where switching the runtime implementation can cause subtle
    and dangerous behavior changes.*

Given that we only have two behaviors, and one of them is unambiguous, it seems
reasonable to punt on this decision until we have more information. We may
encounter more edge cases that require feature extensions (and give us more
information) during the rollout of edition zero. We also have a lot of freedom
to re-model features in later editions, so keeping the initial implementation as
simple as possible seems best (i.e. Alternative 2).

## Alternatives

### Alternative 1: Runtime Implementation Features

Features would be per-runtime implementation, as originally described in
"Editions Zero Feature: utf8_validation." For example, Protobuf Python users
would set different features depending on the backing implementation (e.g.
`features.(pb.cpp).<feature>`, `features.(pb.upb).<feature>`).

#### Pros

*   Most consistent with the range of behaviors expressible pre-Editions

#### Cons

*   The backing implementation may not (and arguably should not) be obvious to
    users.
*   Lack of levers specifically for language / implementation combos. For
    example, there is no way to set Python-C++ behavior independently of C++
    behavior, which may make migration from other Python implementations
    harder.

### Alternative 2: Generator Features

Features would be per-generator only (i.e. each protoc plugin would own one set
of features). This was the second decision we made in later discussions, and
while very similar to the above alternative, it's more in line with our goal of
making features primarily for codegen.

For example, all Python implementations would share the same set of features
(e.g. `features.(pb.python).<feature>`). However, certain features could be
targeted to specific implementations (e.g.
`features.(pb.python).upb_utf8_validation` would only be used by Python/upb).

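The difference between runtime-owned and generator-owned features can be
sketched with a small Python model. This is only an illustration under
simplifying assumptions, not protobuf code: the `FeatureMap` type, both lookup
functions, and the `php` owner are hypothetical, and feature values are reduced
to booleans keyed by (owner, field name).

```python
from typing import Dict, Tuple

# Hypothetical model: resolved feature values keyed by (extension owner, field name).
FeatureMap = Dict[Tuple[str, str], bool]


def runtime_scoped(features: FeatureMap, language: str, runtime: str, name: str) -> bool:
    """Alternative 1: only the backing runtime selects the value; the language is ignored."""
    return features[(runtime, name)]


def generator_scoped(features: FeatureMap, language: str, runtime: str, name: str) -> bool:
    """Alternative 2: the generator's language owns the set; runtime-specific
    fields carry the runtime as a prefix, as in `upb_utf8_validation`."""
    return features[(language, f"{runtime}_{name}")]


# With runtime-owned features, Python/upb and PHP/upb necessarily agree.
alt1: FeatureMap = {("upb", "utf8_validation"): True}

# With generator-owned features, each language sets its own upb switch.
alt2: FeatureMap = {
    ("python", "upb_utf8_validation"): True,
    ("php", "upb_utf8_validation"): False,  # hypothetical PHP counterpart
}

python_alt1 = runtime_scoped(alt1, "python", "upb", "utf8_validation")
php_alt1 = runtime_scoped(alt1, "php", "upb", "utf8_validation")
python_alt2 = generator_scoped(alt2, "python", "upb", "utf8_validation")
php_alt2 = generator_scoped(alt2, "php", "upb", "utf8_validation")
```

In this model `python_alt1` and `php_alt1` can never differ, while `python_alt2`
and `php_alt2` can, which is the per-language independence this alternative is
meant to provide.
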
#### Pros

*   Allows independent control of shared implementations in different target
    languages (e.g. Python's upb feature won't affect PHP).

#### Cons

*   Possible complexity in upb to understand which language's features to
    respect. upb is not currently aware of what language it is being used for.
*   Limits in-process sharing across languages with shared implementations (e.g.
    Python upb, PHP upb) in the case of conflicting behaviors.
    *   Additional checks may be needed.

### Alternative 3: Migrate to bytes

Since this whole discussion revolves around the utf8 validation feature, one
option would be to just remove it from edition zero. Instead of adding a new
toggle for UTF8 behavior, we could simply migrate everyone who doesn't enforce
utf8 today to `bytes`. This would likely need another new *codegen* feature for
generating byte getters/setters as strings, but that wouldn't have any of the
ambiguity we're seeing today.

Unfortunately, this doesn't seem feasible because of all the different behaviors
laid out in "Editions Zero Feature: utf8_validation." UTF8 validation isn't
really a binary on/off decision, and it can vary widely between languages. There
are many cases where UTF8 is validated in **some** languages but not others, and
there's also the C++ "hint" behavior that logs errors but allows invalid UTF8.

**Note:** This could still be partially done in a follow-up LSC by targeting
specific combinations of the new feature that disable validation in all relevant
languages.

#### Pros

*   Punts on the issue: we wouldn't need any upb features, and C++ features
    would all be codegen-only
*   Simplifies the situation by avoiding a very complicated feature in edition
    zero

#### Cons

*   Not really possible given the current complexity
*   There are O(10M) proto2 string fields that would be blindly changed to
    `bytes`

### Alternative 4: Nested Features

Another option is to allow for shared feature set messages. For example, upb
would define a feature message, but *not* make it an extension of the global
`FeatureSet`. Instead, languages with upb implementations would have a field of
this type to allow for finer-grained controls. C++ would both extend the global
`FeatureSet` and also be allowed as a field in other languages.

For example, Python utf8 validation could be specified per backing
implementation through nested fields such as `features.(pb.python).cpp`.

We could have checks during feature validation that enforce that impossible
combinations aren't specified. For example, with our current implementation
`features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`,
since we don't have any mechanism for distinguishing them.

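The consistency check described above can be sketched in Python. This is a
minimal model under stated assumptions rather than the real validation code:
feature sets are plain dictionaries, a nested feature set is a dict-valued
field, the boolean `utf8_validation` field is a placeholder, and
`find_nested_conflicts` is a hypothetical helper.

```python
from typing import Any, Dict, List


def find_nested_conflicts(features: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return "<language>.<runtime>" paths whose nested feature set disagrees
    with the top-level extension of the same name."""
    conflicts = []
    for language, fields in features.items():
        for name, value in fields.items():
            # Only dict-valued fields are nested feature sets, and only those
            # that also exist as a top-level extension (e.g. cpp) are constrained.
            if isinstance(value, dict) and name in features and value != features[name]:
                conflicts.append(f"{language}.{name}")
    return sorted(conflicts)


# upb is deliberately absent at the top level: in this alternative it is
# only reachable as a field of a language's feature set.
consistent = {
    "cpp": {"utf8_validation": True},
    "python": {
        "cpp": {"utf8_validation": True},
        "upb": {"utf8_validation": False},
    },
}

inconsistent = {
    "cpp": {"utf8_validation": True},
    "python": {"cpp": {"utf8_validation": False}},
}

ok_conflicts = find_nested_conflicts(consistent)
bad_conflicts = find_nested_conflicts(inconsistent)
```

Under these assumptions `ok_conflicts` is empty, while `bad_conflicts` reports
`python.cpp`, the combination that the current implementation has no way to
honor.
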
#### Pros

*   Much more explicit than Alternatives 1 and 2

#### Cons

*   Maybe too explicit? Proto owners would be forced to duplicate a lot of
    features