# Editions: Feature Extension Layout

**Author:** [@mkruskal-google](https://github.com/mkruskal-google),
[@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-08-23

## Background

"[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan
for allowing more targeted features that are not owned by the protobuf team. It
uses extensions of the global features proto to implement this. One thing that
was left a bit ambiguous was *who* should own these extensions. Language, code
generator, and runtime implementation are similar but not identical
distinctions.

"Editions Zero Feature: utf8_validation" (not available externally, though a
later version,
"[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)"
is) is a recent plan to add a new set of generator features for UTF-8
validation. While the sole feature we had originally created
(`legacy_closed_enum` in Java and C++) didn't have any ambiguity here, this one
did. Specifically in Python, the current behaviors across proto2/proto3 are
distinct for all three implementations: pure Python, Python/C++, and Python/upb.

## Overview

In meetings, we've discussed various alternatives, captured below. The original
plan was to make feature extensions runtime implementation-specific (e.g. C++,
Java, Python, upb). Some notable complications came up, though:

1. **Polyglot** - it's not clear how upb or C++ runtimes should behave in
   multi-language situations. Which feature sets do they consider for runtime
   behaviors? *Note: this is already a serious issue today, where all proto2
   strings and many proto3 strings are completely unsafe across languages.*

2. **Shared Implementations** - Runtimes like upb and C++ are used as backing
   implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP).
   If we have a single set of `upb` or `cpp` features, migrating to those
   shared implementations would be more difficult (since there are no
   independent switches per language). *Note: this is already the situation
   we're in today, where switching the runtime implementation can cause subtle
   and dangerous behavior changes.*

Given that we only have two behaviors, and one of them is unambiguous, it seems
reasonable to punt on this decision until we have more information. We may
encounter more edge cases that require feature extensions (and give us more
information) during the rollout of edition zero. We also have a lot of freedom
to re-model features in later editions, so keeping the initial implementation as
simple as possible seems best (i.e. Alternative 2).

## Alternatives

### Alternative 1: Runtime Implementation Features

Features would be per-runtime implementation as originally described in
"Editions Zero Feature: utf8_validation." For example, Protobuf Python users
would set different features depending on the backing implementation (e.g.
`features.(pb.cpp).<feature>`, `features.(pb.upb).<feature>`).
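
As a rough illustration of this layout, the sketch below shows a `.proto` file
whose owner wants the same string handling in Python regardless of the backing
runtime. The import paths, the `utf8_validation` field on both extensions, and
its `VERIFY` value are placeholders assumed for this example; the text above
only specifies the `features.(pb.cpp).<feature>` and
`features.(pb.upb).<feature>` shape. The file would not compile without feature
definitions that match those assumptions.

```proto
edition = "2023";

package example;

// Placeholder import paths for the per-runtime feature definitions.
import "path/to/cpp_features.proto";
import "path/to/upb_features.proto";

message User {
  // Under Alternative 1 the owner names each runtime, not the target language:
  //   - (pb.cpp) would be read by C++ and by Python/C++
  //   - (pb.upb) would be read by every upb-backed language
  // `utf8_validation` and `VERIFY` are placeholder names, not real definitions.
  string name = 1 [
    features.(pb.cpp).utf8_validation = VERIFY,
    features.(pb.upb).utf8_validation = VERIFY
  ];
}
```

In this sketch the `(pb.cpp)` setting would apply to plain C++ as well as
Python/C++, so the two could not be tuned separately.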

#### Pros

* Most consistent with the range of behaviors expressible pre-Editions

#### Cons

* The backing implementation may not be (and arguably should not be) obvious
  to users.
* Lack of levers specifically for language / implementation combos. For
  example, there is no way to set Python-C++ behavior independently of C++
  behavior, which may make migration from other Python implementations harder.

### Alternative 2: Generator Features

Features would be per-generator only (i.e. each protoc plugin would own one set
of features). This was the second decision we made in later discussions, and
while very similar to the above alternative, it's more in line with our goal of
making features primarily for codegen.

For example, all Python implementations would share the same set of features
(e.g. `features.(pb.python).<feature>`). However, certain features could be
targeted to specific implementations (e.g.
`features.(pb.python).upb_utf8_validation` would only be used by Python/upb).
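
A minimal sketch of how this might look in a `.proto` file follows. The import
path, the generic `utf8_validation` field with its `VERIFY` value, and the
`bool` type of `upb_utf8_validation` are assumptions made for illustration;
only the `features.(pb.python).upb_utf8_validation` name comes from the text
above. The file would not compile without feature definitions that match those
assumptions.

```proto
edition = "2023";

package example;

// Placeholder import path for the Python generator's feature definitions.
import "path/to/python_features.proto";

message User {
  // One extension per generator: every Python implementation reads
  // (pb.python). `utf8_validation` and `VERIFY` are placeholder names.
  //
  // `upb_utf8_validation` is the implementation-targeted field named in the
  // text. Its `bool` type is an assumption, and only Python/upb would read it.
  string name = 1 [
    features.(pb.python).utf8_validation = VERIFY,
    features.(pb.python).upb_utf8_validation = false
  ];
}
```

Because the setting lives under `(pb.python)`, another upb-backed generator
such as PHP would not see it.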

#### Pros

* Allows independent control of shared implementations in different target
  languages (e.g. Python's upb feature won't affect PHP).

#### Cons

* Possible complexity in upb to understand which language's features to
  respect. upb is not currently aware of what language it is being used for.
* Limits in-process sharing across languages with shared implementations (e.g.
  Python upb, PHP upb) in the case of conflicting behaviors.
* Additional checks may be needed.

### Alternative 3: Migrate to bytes

Since this whole discussion revolves around the UTF-8 validation feature, one
option would be to just remove it from edition zero. Instead of adding a new
toggle for UTF-8 behavior, we could simply migrate everyone who doesn't enforce
UTF-8 today to `bytes`. This would likely need another new *codegen* feature for
generating byte getters/setters as strings, but that wouldn't have any of the
ambiguity we're seeing today.
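
The migration described here can be sketched for a single field. The import
path and the `bytes_as_string` feature name are placeholders assumed for this
example, and `pb.cpp` stands in for whichever generator would own such a
codegen feature; none of these are defined by this document, so the file would
not compile as written.

```proto
edition = "2023";

package example;

// Placeholder import path for the owning generator's feature definitions.
import "path/to/cpp_features.proto";

message User {
  // Before the migration this was `string name = 1;` in a file that did not
  // enforce UTF-8. `string` and `bytes` share the same length-delimited wire
  // format, so the field number stays the same.
  //
  // `bytes_as_string` is a placeholder name for the new codegen feature: it
  // would ask the generator to keep string-typed getters/setters.
  bytes name = 1 [features.(pb.cpp).bytes_as_string = true];
}
```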

Unfortunately, this doesn't seem feasible because of all the different behaviors
laid out in "Editions Zero Feature: utf8_validation." UTF-8 validation isn't
really a binary on/off decision, and it can vary widely between languages. There
are many cases where UTF-8 is validated in **some** languages but not others,
and there's also the C++ "hint" behavior that logs errors but allows invalid
UTF-8.

**Note:** This could still be partially done in a follow-up LSC (large-scale
change) by targeting specific combinations of the new feature that disable
validation in all relevant languages.

#### Pros

* Punts on the issue: we wouldn't need any upb features, and C++ features
  would all be codegen-only
* Simplifies the situation and avoids adding a very complicated feature in
  edition zero

#### Cons

* Not really possible given the current complexity
* There are O(10M) proto2 string fields that would be blindly changed to bytes

### Alternative 4: Nested Features

Another option is to allow for shared feature set messages. For example, upb
would define a feature message, but *not* make it an extension of the global
`FeatureSet`. Instead, languages with upb implementations would have a field of
this type to allow for finer-grained controls. C++ would both extend the global
`FeatureSet` and also be allowed as a field in other languages.
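
One way to picture the message layout this paragraph describes is the sketch
below. The message names, the `utf8_validation` fields, and both extension
numbers are placeholders assumed for illustration, and any per-field metadata
that real feature definitions carry is omitted. It is not the shipped
definition of any feature file.

```proto
syntax = "proto2";

package pb;

import "google/protobuf/descriptor.proto";

// Shared upb feature message. It is deliberately NOT an extension of
// FeatureSet, so it is only reachable through a language's own features.
message UpbFeatures {
  optional bool utf8_validation = 1;  // placeholder field
}

// C++ feature message, used both as an extension and as a nested field.
message CppFeatures {
  optional bool utf8_validation = 1;  // placeholder field
}

// Python features nest one message per backing implementation.
message PythonFeatures {
  optional CppFeatures cpp = 1;
  optional UpbFeatures upb = 2;
}

extend google.protobuf.FeatureSet {
  // Both extension numbers are placeholders chosen for this sketch.
  optional CppFeatures cpp = 1000;
  optional PythonFeatures python = 9999;
}
```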

For example, Python UTF-8 validation could be specified as:
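
A minimal sketch of one possible shape, assuming a `utf8_validation` field of
type `bool` inside each nested message and a placeholder import path (neither
is defined by this document, so the file would not compile as written):

```proto
edition = "2023";

package example;

// Placeholder import path for the Python generator's feature definitions.
import "path/to/python_features.proto";

message User {
  // Each backing implementation gets its own nested message under
  // (pb.python). `utf8_validation` and its `bool` type are placeholders.
  string name = 1 [
    features.(pb.python).cpp.utf8_validation = true,
    features.(pb.python).upb.utf8_validation = false
  ];
}
```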

We could have checks during feature validation that enforce that impossible
combinations aren't specified. For example, with our current implementation
`features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`,
since we don't have any mechanism for distinguishing them.

#### Pros

* Much more explicit than Alternatives 1 and 2

#### Cons

* Maybe too explicit? Proto owners would be forced to duplicate a lot of
  features