# Editions: Feature Extension Layout

**Author:** [@mkruskal-google](https://github.com/mkruskal-google),
[@zhangskz](https://github.com/zhangskz)

**Approved:** 2023-08-23

## Background

"[What are Protobuf Editions](what-are-protobuf-editions.md)" lays out a plan
for allowing more targeted features not owned by the protobuf team, implemented
as extensions of the global features proto. What it left ambiguous is *who*
should own these extensions. Language, code generator, and runtime
implementation are similar but not identical ways of drawing that boundary.

"Editions Zero Feature: utf8_validation" (not available externally, though a
later version,
"[Editions Zero: utf8_validation Without Problematic Options](editions-zero-utf8_validation.md)",
is) is a recent plan to add a new set of generator features for UTF-8
validation. While the sole feature we had originally created
(`legacy_closed_enum` in Java and C++) didn't have any ambiguity here, this one
did. Specifically in Python, the current behaviors across proto2/proto3 are
distinct for all three implementations: pure Python, Python/C++, and
Python/upb.

## Overview

In meetings, we've discussed various alternatives, captured below. The original
plan was to make feature extensions specific to each runtime implementation
(e.g. C++, Java, Python, upb). Some notable complications came up, though:

1.  **Polyglot** - it's not clear how the upb or C++ runtimes should behave in
    multi-language situations. Which feature sets do they consider for runtime
    behaviors? *Note: this is already a serious issue today, where all proto2
    strings and many proto3 strings are completely unsafe across languages.*

2.  **Shared Implementations** - Runtimes like upb and C++ are used as backing
    implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP).
    If we have a single set of `upb` or `cpp` features, migrating to those
    shared implementations would be more difficult (since there are no
    independent per-language switches). *Note: this is already the situation
    we're in today, where switching the runtime implementation can cause subtle
    and dangerous behavior changes.*

Given that only two behaviors currently need feature extensions, and one of
them (`legacy_closed_enum`) is unambiguous, it seems reasonable to punt on this
decision until we have more information. We may encounter more edge cases that
require feature extensions (and give us more information) during the rollout of
edition zero. We also have a lot of freedom to re-model features in later
editions, so keeping the initial implementation as simple as possible seems
best (i.e. Alternative 2).

## Alternatives

### Alternative 1: Runtime Implementation Features

Features would be per-runtime implementation as originally described in
"Editions Zero Feature: utf8_validation." For example, Protobuf Python users
would set different features depending on the backing implementation (e.g.
`features.(pb.cpp).<feature>`, `features.(pb.upb).<feature>`).
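
As a rough sketch of what this would look like in a `.proto` file, assume
hypothetical `pb.cpp` and `pb.upb` extensions that each define a
`utf8_validation` feature. The import paths, the feature name, and its values
are placeholders for illustration, not a shipped API. A Python user whose code
may run on either backing implementation would have to set both:

```proto
// Hypothetical sketch of Alternative 1: one feature extension per runtime.
edition = "2023";

package example;

// Assumed file names; each would extend google.protobuf.FeatureSet.
import "cpp_features.proto";
import "upb_features.proto";

message Greeting {
  string text = 1 [
    // Applies to every language backed by the C++ runtime.
    features.(pb.cpp).utf8_validation = VERIFY,
    // Applies to every language backed by upb.
    features.(pb.upb).utf8_validation = VERIFY
  ];
}
```

Note that nothing in this layout names Python itself, which is the source of
the "lack of levers" concern listed under the cons.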

#### Pros

*   Most consistent with the range of behaviors expressible pre-Editions

#### Cons

*   Users may not know, and ideally should not need to know, which
    implementation backs their language.
*   Lack of levers specifically for language / implementation combos. For
    example, there is no way to set Python/C++ behavior independently of C++
    behavior, which may make migration from other Python implementations
    harder.

### Alternative 2: Generator Features

Features would be per-generator only (i.e. each protoc plugin would own one set
of features). This was the second decision we made in later discussions, and
while very similar to the above alternative, it's more in line with our goal of
making features primarily for codegen.

For example, all Python implementations would share the same set of features
(e.g. `features.(pb.python).<feature>`). However, certain features could be
targeted to specific implementations (e.g.
`features.(pb.python).upb_utf8_validation` would only be used by Python/upb).
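
A minimal sketch of how such a generator-owned feature message might be
declared. Only the names `pb.python` and `upb_utf8_validation` come from the
text above; the `bool` type, the second field, and the extension number are
assumptions made for illustration:

```proto
// Hypothetical sketch of Alternative 2: one feature message per generator.
// Real feature definitions also carry retention, target, and edition-default
// annotations, which are omitted here for brevity.
syntax = "proto2";

package pb;

import "google/protobuf/descriptor.proto";

message PythonFeatures {
  // Shared by all Python implementations (placeholder name).
  optional bool some_python_feature = 1;

  // Only consulted by Python/upb; other Python implementations ignore it.
  // The bool type is an assumption.
  optional bool upb_utf8_validation = 2;
}

extend google.protobuf.FeatureSet {
  optional PythonFeatures python = 9999;  // placeholder extension number
}

// A schema would then set the feature per field, for example:
//   string bar = 1 [features.(pb.python).upb_utf8_validation = true];
```

Because the feature lives under `pb.python`, setting it has no effect on other
languages that share the upb runtime.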

#### Pros

*   Allows independent control of shared implementations in different target
    languages (e.g. Python's upb feature won't affect PHP).

#### Cons

*   Possible complexity in upb to understand which language's features to
    respect. upb is not currently aware of what language it is being used for.
*   Limits in-process sharing across languages with shared implementations
    (e.g. Python upb, PHP upb) in the case of conflicting behaviors.
*   Additional checks may be needed.

### Alternative 3: Migrate to bytes

Since this whole discussion revolves around the UTF-8 validation feature, one
option would be to just remove it from edition zero. Instead of adding a new
toggle for UTF-8 behavior, we could simply migrate everyone who doesn't enforce
UTF-8 today to `bytes`. This would likely need another new *codegen* feature
for generating byte getters/setters as strings, but that wouldn't have any of
the ambiguity we're seeing today.

Unfortunately, this doesn't seem feasible because of all the different
behaviors laid out in "Editions Zero Feature: utf8_validation." UTF-8
validation isn't really a binary on/off decision, and it can vary widely
between languages. There are many cases where UTF-8 is validated in **some**
languages but not others, and there's also the C++ "hint" behavior that logs
errors but allows invalid UTF-8.

**Note:** This could still be partially done in a follow-up LSC by targeting
specific combinations of the new feature that disable validation in all
relevant languages.
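
The migration proposed at the start of this alternative can be sketched as a
before/after pair. The `bytes_as_string` name below is hypothetical and stands
in for the new codegen feature that the proposal says would likely be needed:

```proto
// Hypothetical sketch of the Alternative 3 migration.
//
// Before: a proto2 string field whose UTF-8 is not enforced today.
//   optional string name = 1;
//
// After: the same field number, migrated to bytes under editions.
edition = "2023";

package example;

message User {
  // string and bytes fields share the length-delimited wire encoding, so
  // previously serialized data still parses after the type change.
  // A codegen-only feature (placeholder name) could keep string accessors:
  //   bytes name = 1 [features.(pb.python).bytes_as_string = true];
  bytes name = 1;
}
```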

#### Pros

*   Punts on the issue: we wouldn't need any upb features, and C++ features
    would all be codegen-only
*   Simplifies the situation and avoids adding a very complicated feature in
    edition zero

#### Cons

*   Not really possible given the current complexity
*   There are O(10M) proto2 string fields that would be blindly changed to
    bytes

### Alternative 4: Nested Features

Another option is to allow for shared feature set messages. For example, upb
would define a feature message, but *not* make it an extension of the global
`FeatureSet`. Instead, languages with upb implementations would have a field of
this type to allow for finer-grained controls. C++ would both extend the global
`FeatureSet` and also be allowed as a field in other languages.

For example, Python UTF-8 validation could be specified as:
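
A minimal sketch of such a specification, assuming the Python feature message
has a nested `upb` field that holds a `utf8_validation` feature. The field
name, feature name, value, and import path are all hypothetical:

```proto
// Hypothetical sketch of Alternative 4: upb features nested under Python.
edition = "2023";

package example;

import "python_features.proto";  // assumed file defining pb.python

message Foo {
  // Only affects Python running on upb. Other upb-backed languages would
  // read their own nested copy, e.g. features.(pb.php).upb.
  string bar = 1 [features.(pb.python).upb.utf8_validation = true];
}
```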

We could have checks during feature validation that enforce that impossible
combinations aren't specified. For example, with our current implementation,
`features.(pb.python).cpp` should always be identical to `features.(pb.cpp)`,
since we don't have any mechanism for distinguishing them.

#### Pros

*   Much more explicit than alternatives 1 and 2

#### Cons

*   Maybe too explicit? Proto owners would be forced to duplicate a lot of
    features