# Stricter Schemas with Editions

**Author:** [@mcy](https://github.com/mcy)

**Approved:** 2022-11-28

## Overview

The Protobuf language is surprisingly lax in what it allows in some places, even
though these corners of the syntax space are rarely exercised in real use and
add complexity to backends and runtimes.

This document describes several such corners in the language, and how we might
use Editions to fix them (spoiler: we'll add a feature for each one and then
ratchet the features).

This is primarily a memo on a use-case for Editions, and not a design doc per
se.

## Potential Lints

### Entity Names

Protobuf does not enforce any constraints on names other than the "ASCII
identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This
results in problems for backends:

* Backends need to be able to convert between PascalCase, camelCase,
  snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
* Extraneous underscores, such as underscores in names that want to be
  PascalCase, trailing underscores, leading underscores, and repeated
  underscores, create problems for case conversion and can clash with private
  names generated by backends.
* Protobuf does not support non-ASCII identifiers, more out of inertia than
  anything else. Because some languages (Java most prominent among them)
  do not support them, we can never support them, but we are not particularly
  clear on this point.

The Protobuf language should be as strict as possible in what patterns it
accepts for identifiers, since these need to be transformed to many languages.
Thus, we propose the following regexes for the three casings used in Protobuf:

* `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for:
  * Messages.
  * Enums.
  * Services.
  * Methods.
* `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for:
  * Fields (including extensions).
  * Package components.
* `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for:
  * Enum values.

These patterns are intended to reject extraneous underscores, and to make casing
of ASCII letters consistent. We explicitly only support ASCII for maximal
portability to target languages. Note that option names are not included, since
those are defined as fields in a proto, and would be subject to this rule
automatically.

To migrate, we would introduce a bool feature `feature.relax_identifier_rules`,
which can be applied to any entity. When set to false, it would cause the
compiler to reject `.proto` files which contain identifiers that don't match the
above constraints. It would default to true and would switch to false in a
future edition.

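As a rough illustration of the casing rules proposed in this section, the following Python sketch checks a name against the pattern required for its kind of entity. The `CASING_RULES` table, the entity-kind labels, and the `check_name` helper are assumptions made for this example; they are not part of `protoc`.

```python
import re

# The three patterns proposed in this section.
PASCAL = re.compile(r"([A-Z][a-zA-Z0-9]*)+")
SNAKE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
SHOUTY = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")

# Hypothetical entity-kind labels mapped to the casing each kind must use.
CASING_RULES = {
    "message": PASCAL,
    "enum": PASCAL,
    "service": PASCAL,
    "method": PASCAL,
    "field": SNAKE,
    "package_component": SNAKE,
    "enum_value": SHOUTY,
}


def check_name(kind: str, name: str) -> bool:
    """Return True if `name` matches the casing required for `kind`."""
    return CASING_RULES[kind].fullmatch(name) is not None
```

With these patterns, `check_name("field", "foo__bar")` and `check_name("message", "Foo_Bar")` are both rejected, which is the intended effect on repeated and misplaced underscores.
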
### Keywords as Identifiers

Currently, the Protobuf language allows using keywords as identifiers. This
makes the parser somewhat more complicated than it has to be for minimal
benefit, and shadowing behavior is not well-specified. For example, what does
the following compile as?

```
message Foo {
  message int32 {}
  optional int32 foo = 1;
}
```

This is particularly fraught in places where either a keyword or a type name can
follow. For example, `optional foo = 1;` is a proto3 non-optional field of type
`optional`, but the parser can't tell until it sees the `=` sign.

To avoid this ambiguity and eventually stop supporting it in the parser, we make
the following set of keywords true reserved names that cannot be used as
identifiers:

```
bool bytes double edition enum extend extensions fixed32
fixed64 float group import int32 int64 map max
message oneof option optional package public repeated required
reserved returns rpc service sfixed32 sfixed64 sint32 sint64
stream string syntax to uint32 uint64 weak
```

Additionally, we introduce the syntax `#optional` for escaping a keyword as an
identifier. This may *only* be used on keywords, and not non-keyword
identifiers.

To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`,
which can be applied to any entity. When set to false, it would cause the
compiler to reject `.proto` files which contain identifiers that use the names
of keywords. It would migrate true->false in a future edition. The `#optional`
syntax would not need to be feature-gated.

From time to time we may introduce new keywords. The best procedure for doing so
is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then
switch it to false in an edition, which would cause it to be treated as a
keyword for the purposes of this check. There's nothing stopping us from
starting to use it in the syntax without an edition if it would be relatively
unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they
really hate contextual keywords since they complicate the parser, so keywords
start out as contextual and become properly reserved in the next Rust edition.

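The reserved-name rule and the `#` escape described in this section could be sketched as follows. This is an illustration only: the `identifier_from_token` helper and its error messages are hypothetical, and the sketch assumes the lexer hands over `#optional` as a single token.

```python
# The reserved names listed in this section.
KEYWORDS = frozenset("""
bool bytes double edition enum extend extensions fixed32
fixed64 float group import int32 int64 map max
message oneof option optional package public repeated required
reserved returns rpc service sfixed32 sfixed64 sint32 sint64
stream string syntax to uint32 uint64 weak
""".split())


def identifier_from_token(token: str) -> str:
    """Return the identifier that `token` denotes, or raise ValueError.

    A bare keyword is rejected, and `#name` is accepted only when `name`
    is itself a keyword.
    """
    if token.startswith("#"):
        name = token[1:]
        if name not in KEYWORDS:
            raise ValueError(f"'#' may only escape keywords, got {name!r}")
        return name
    if token in KEYWORDS:
        raise ValueError(f"{token!r} is a keyword; write '#{token}' to use it as a name")
    return token
```

Rejecting `#foo` for a non-keyword `foo` keeps a single canonical spelling for every identifier, which is one plausible reading of the "only on keywords" restriction.
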
### Nonempty Package

Right now, an empty package is technically permitted. We should remove this
functionality from the language completely and require every file to declare a
package.

We would introduce a feature like `feature.allow_missing_package`, start it out
as true, and switch it to false.

### Invalid Names in `reserved`

Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field
and thus should be rejected. Ideally we should remove this syntax altogether and
only permit the use of identifiers in this position, such as `reserved foo,
bar;`.

We would introduce a feature like `feature.allow_strings_in_reserved`, start it
out as true, and then switch it to false.

### Almost All Names are Fully Qualified

Right now, Protobuf defines a complicated name resolution scheme that involves
matching subsets of names, inspired by that of C++ (which is even more
complicated than ours!). Instead, we should require that every name be either a
single identifier OR fully-qualified. This is an attempt to move to Go-style
name resolution, which is significantly simpler to implement and explain.

In particular, if a name is a single identifier, then:

* It must be the name of a type defined at the top level of the current file.
* If it is the name of a message or enum for a field's type, it may be the
  name of a type defined in the current message. This does *not* apply to
  extension fields.

Because any multi-component path must be fully qualified, we no longer need the
`.foo.Bar` syntax, except to refer to messages defined in files without
a package. We forbid `.`-prefixed names except in that case.

We would introduce a feature like `features.use_cpp_style_name_resolution`,
start it out as true, and then switch it to false.

Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted
at a message, rather than a package. In that case, we could go as far as saying
that "names that start with a lower-case letter are fully-qualified, otherwise
they are relative to the current package, and will only find things defined in
the current file."

Unlike Go, we do not allow finding things in other packages without being
fully-qualified; this mostly comes from doing source-diving in very large
packages, like the Go runtime, where it is very hard to find where something is
defined.

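A minimal sketch of the simplified lookup described in this section, under stated assumptions: the function name, its parameters, and the choice to try the enclosing message before the file's top level are all hypothetical, since the memo does not specify a precedence between the two.

```python
def resolve_type_name(name, package, file_top_level, enclosing_message,
                      known_types, allow_nested=True):
    """Resolve `name` to a fully-qualified type name, or return None.

    name:              the name as written in the .proto file
    package:           the package of the current file, e.g. "foo"
    file_top_level:    simple names of types declared at the top of this file
    enclosing_message: simple name of the current message, or None
    known_types:       set of every fully-qualified type name that is visible
    allow_nested:      False for extension fields, which may not use the
                       "type defined in the current message" rule
    """
    if "." not in name:
        # Single identifier: a type nested in the current message (assumed to
        # take precedence), or a top-level type of the current file.
        if allow_nested and enclosing_message is not None:
            nested = f"{package}.{enclosing_message}.{name}"
            if nested in known_types:
                return nested
        if name in file_top_level:
            return f"{package}.{name}"
        return None
    # Multi-component paths must already be fully qualified; no partial
    # matching against enclosing scopes is attempted.
    return name if name in known_types else None
```

Under this sketch a partially qualified name such as `Bar.Inner` written inside package `foo` does not resolve; it would have to be spelled `foo.Bar.Inner`.
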
### Unique Enum Values

Right now, we allow aliases in enums:

```
enum Foo {
  BAR = 5;
  BAZ = 5;
}
```

This results in significant complexity in some parts of the backend, and weird
behavior in textproto and JSON. We should disallow this.

We would introduce a feature like `features.allow_enum_aliases`, which would
switch from true to false.

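Detecting aliases amounts to grouping value names by number. The sketch below assumes the enum has already been parsed into `(name, number)` pairs; `find_enum_aliases` is a hypothetical helper, not an existing compiler API.

```python
from collections import defaultdict


def find_enum_aliases(values):
    """Return {number: [names...]} for every number used by more than one name.

    `values` is a list of (name, number) pairs in declaration order.
    """
    by_number = defaultdict(list)
    for name, number in values:
        by_number[number].append(name)
    return {number: names for number, names in by_number.items() if len(names) > 1}
```

For the `Foo` enum above, `find_enum_aliases([("BAR", 5), ("BAZ", 5)])` reports that `5` is shared by `BAR` and `BAZ`; a compiler with `features.allow_enum_aliases` set to false would turn any non-empty result into an error.
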
### Imports are Used

We should adopt the Go rule that all non-public imports are used (i.e., every
import provides at least one type referred to in the file).

We would introduce a feature like `features.allow_unused_imports`, which would
switch from true to false.

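The check itself is a set intersection per import. In this sketch the shape of the `imports` mapping and the `unused_imports` helper are assumptions made for illustration.

```python
def unused_imports(imports, referenced_types):
    """Return the sorted paths of non-public imports that provide no used type.

    imports:          dict mapping import path -> (is_public, set of
                      fully-qualified type names that the file provides)
    referenced_types: set of fully-qualified type names used in this file
    """
    return sorted(
        path
        for path, (is_public, provided) in imports.items()
        if not is_public and not (provided & referenced_types)
    )
```

Public imports are skipped because they exist to re-export types to other files rather than to be used locally.
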
### Next Field # is Reserved

There are a few idioms for this checked by linters, such as `// Next ID: N`. We
should codify this in the language by requiring that every message begin with
`reserved N to max;`, with the intent that `N` is the next never-used field
number. Because it is required to be the first production in the message, it can
be

We could, additionally, require that *every* field number be either used or
reserved, in addition to having a single `N to max;` reservation. Alternatively,
we could require that every field number up to the largest one used be reserved;
gaps between field numbers are usually a smell.

This applies equally to message fields and enum values.

We would introduce a feature like `features.allow_unused_numbers`, which we
would switch from true to false.

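The stricter "every number is used or reserved" variant could be checked as sketched below. The helper name and its inputs are hypothetical, and the sketch covers message fields only, whose numbers start at 1; enum values would need a different lower bound.

```python
def unaccounted_numbers(used, reserved_ranges, next_number):
    """Return field numbers below `next_number` that are neither used nor reserved.

    used:            set of field numbers assigned to fields
    reserved_ranges: list of inclusive (start, end) pairs, not counting the
                     trailing `reserved N to max;` entry
    next_number:     the N in `reserved N to max;`
    """
    reserved = set()
    for start, end in reserved_ranges:
        reserved.update(range(start, end + 1))
    return [n for n in range(1, next_number)
            if n not in used and n not in reserved]
```

A message that uses fields 1, 2, and 5, reserves 3, and declares `reserved 6 to max;` would have field number 4 flagged as a gap.
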
### Disallow Implicit String Concatenation

Protobuf will implicitly concatenate two adjacent strings in any place it allows
quoted strings, e.g. `option foo = "bar " "baz";`. This has caused interesting
problems around `reserved` in the past, if a comma is omitted: `reserved "foo"
"bar";` is `reserved "foobar";`.

We would introduce a feature like `features.concatenate_adjacent_strings`, which
would switch from true to false.

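One way to picture the lint is as a pass over the token stream that looks for two string literals in a row. The tokenizer below is a deliberately tiny stand-in that only understands double-quoted strings, identifiers, and a few punctuation marks; it is an assumption for this example and does not reflect the real Protobuf lexer.

```python
import re

# Toy token grammar: a double-quoted string, an identifier, or one of = ; ,
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[A-Za-z_][A-Za-z0-9_.]*|[=;,])')


def tokenize(source):
    """Split `source` into tokens, raising ValueError on unknown input."""
    source = source.rstrip()
    pos, tokens = 0, []
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def has_adjacent_strings(source):
    """Return True if two string literals appear with nothing between them."""
    tokens = tokenize(source)
    return any(a.startswith('"') and b.startswith('"')
               for a, b in zip(tokens, tokens[1:]))
```

Here `reserved "foo" "bar";` is flagged, while `reserved "foo", "bar";` is not.
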
### Package Is First

The `package` declaration can appear anywhere in the file after `syntax` or
`edition`. We should take cues from Go and require it to be the first thing in
the file, after the edition.

We would introduce a feature like `features.package_anywhere`, which would
switch from true to false.

### Strict Boolean Options

Boolean options can use `true`, `false`, `True`, `False`, `T`, or `F` as a
value: `option my_bool = T;`. We should restrict these to only `true` and
`false`.

We would introduce a feature like `features.loose_bool_options`, which would
switch from true to false.

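The effect of the feature on option parsing might look like the following sketch. The `parse_bool_option` helper is hypothetical; its `loose` parameter mirrors `features.loose_bool_options`, and the accepted spellings are the six listed in this section.

```python
# Spellings accepted today, according to this section.
LOOSE_SPELLINGS = {
    "true": True, "True": True, "T": True,
    "false": False, "False": False, "F": False,
}


def parse_bool_option(text, loose=True):
    """Parse a boolean option value, raising ValueError if it is not allowed."""
    if not loose and text not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {text!r}")
    if text not in LOOSE_SPELLINGS:
        raise ValueError(f"not a boolean value: {text!r}")
    return LOOSE_SPELLINGS[text]
```
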
### Decimal Field Numbers

We permit non-decimal integer literals for field numbers, e.g. `optional int32
x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should
require decimal literals, since there is very little reason to allow other
literals and they make the Protobuf language harder to parse.

We would introduce a feature like `features.non_decimal_field_numbers`, which
would switch from true to false.

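A sketch of what the restriction means for literal parsing, assuming the non-decimal forms in question are hexadecimal (`0x...`) and octal (leading `0`) literals. The `parse_field_number` helper and its `allow_non_decimal` parameter, which mirrors `features.non_decimal_field_numbers`, are hypothetical.

```python
import re

DECIMAL = re.compile(r"[1-9][0-9]*")
HEX = re.compile(r"0[xX][0-9a-fA-F]+")
OCTAL = re.compile(r"0[0-7]+")


def parse_field_number(literal, allow_non_decimal=True):
    """Parse a field-number literal, raising ValueError if it is not allowed."""
    if DECIMAL.fullmatch(literal):
        return int(literal)
    if allow_non_decimal:
        if HEX.fullmatch(literal):
            return int(literal, 16)
        if OCTAL.fullmatch(literal):
            return int(literal, 8)
    raise ValueError(f"invalid field number literal: {literal!r}")
```

With the feature switched to false, only the first branch remains, so the parser needs a single digit-run rule for field numbers.
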