PiperOrigin-RevId: 563172650pull/13815/head
parent
75af7f9838
commit
d83e9f3498
2 changed files with 246 additions and 0 deletions
@ -0,0 +1,245 @@ |
||||
# Stricter Schemas with Editions |
||||
|
||||
**Author:** [@mcy](https://github.com/mcy) |
||||
|
||||
**Approved:** 2022-11-28 |
||||
|
||||
## Overview |
||||
|
||||
The Protobuf language is surprisingly lax in what it allows in some places, even |
||||
though these corners of the syntax space are rarely exercised in real use, and |
||||
which add complexity to backends and runtimes. |
||||
|
||||
This document describes several such corners in the language, and how we might |
||||
use Editions to fix them (spoiler: we'll add a feature for each one and then |
||||
ratchet the features). |
||||
|
||||
This is primarily a memo on a use-case for Editions, and not a design doc per |
||||
se. |
||||
|
||||
## Potential Lints |
||||
|
||||
### Entity Names |
||||
|
||||
Protobuf does not enforce any constraints on names other than the "ASCII |
||||
identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This |
||||
results in problems for backends: |
||||
|
||||
* Backends need to be able to convert between PascalCase, camelCase, |
||||
snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky. |
||||
* Extraneous underscores, such as underscores in names that want to be |
||||
PascalCase, trailing underscores, leading underscores, and repeated |
||||
underscores create problems for case conversion and can clash with private |
||||
names generated by backends. |
||||
* Protobuf does not support non-ASCII identifiers, mostly out of inertia more |
||||
than anything else. Because some languages (Java most prominent among them) |
||||
do not support them, we can never support them, but we are not particularly |
||||
clear on this point. |
||||
|
||||
The Protobuf language should be as strict as possible in what patterns it |
||||
accepts for identifiers, since these need to be transformed to many languages. |
||||
Thus, we propose the following regexes for the three casings used in Protobuf: |
||||
|
||||
* `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for: |
||||
* Messages. |
||||
* Enums. |
||||
* Services. |
||||
* Methods. |
||||
* `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for: |
||||
* Fields (including extensions). |
||||
* Package components. |
||||
* `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for: |
||||
* Enum values. |
||||
|
||||
These patterns are intended to reject extraneous underscores, and to make casing |
||||
of ASCII letters consistent. We explicitly only support ASCII for maximal |
||||
portability to target languages. Note that option names are not included, since |
||||
those are defined as fields in a proto, and would be subject to this rule |
||||
automatically. |
||||
|
||||
To migrate, we would introduce a bool feature `feature.relax_identifier_rules`, |
||||
which can be applied to any entity. When set, it would cause the compiler to |
||||
reject `.proto` files which contain identifiers that don't match the above |
||||
constraints. It would default to true and would switch to false in a future |
||||
edition. |
||||
|
||||
### Keywords as Identifiers |
||||
|
||||
Currently, the Protobuf language allows using keywords as identifiers. This |
||||
makes the parser somewhat more complicated than it has to be for minimal |
||||
benefit, and shadowing behavior is not well-specified. For example, what does |
||||
the following compile as? |
||||
|
||||
``` |
||||
message Foo { |
||||
message int32 {} |
||||
optional int32 foo = 1; |
||||
} |
||||
``` |
||||
|
||||
This is particularly fraught in places where either a keyword or a type name can |
||||
follow. For example, `optional foo = 1;` is a proto3 non-optional with type |
||||
`optional`, but the parser can't tell until it sees the `=` sign. |
||||
|
||||
To avoid this and eventually stop supporting this in the parser, we make the |
||||
following set of keywords true reserved names that cannot be used as |
||||
identifiers: |
||||
|
||||
``` |
||||
bool bytes double edition enum extend extensions fixed32 |
||||
fixed64 float group import int32 int64 map max |
||||
message oneof option optional package public repeated required |
||||
reserved returns rpc service sfixed32 sfixed64 sint32 sint64 |
||||
stream string syntax to uint32 uint64 weak |
||||
``` |
||||
|
||||
Additionally, we introduce the syntax `#optional` for escaping a keyword as an |
||||
identifier. This may *only* be used on keywords, and not non-keyword |
||||
identifiers. |
||||
|
||||
To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`, |
||||
which can be applied to any entity. When set, it would cause the compiler to |
||||
reject `.proto` files which contain identifiers that use the names of keywords. |
||||
It would migrate true->false in a future edition. The `#optional` syntax would |
||||
not need to be feature-gated. |
||||
|
||||
From time to time we may introduce new keywords. The best procedure for doing so |
||||
is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then |
||||
switch it to false in an edition, which would cause it to be treated as a |
||||
keyword for the purposes of this check. There's nothing stopping us from |
||||
starting to use it in the syntax without an edition if it would be relatively |
||||
unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they |
||||
really hate contextual keywords since it complicates the parser, so keywords |
||||
start out as contextual and become properly reserved in the next Rust edition. |
||||
|
||||
### Nonempty Package |
||||
|
||||
Right now, an empty package is technically permitted. We should remove this |
||||
functionality from the language completely and require every file to declare a |
||||
package. |
||||
|
||||
We would introduce a feature like `feature.allow_missing_package`, start it out |
||||
as true, and switch it to false. |
||||
|
||||
### Invalid Names in `reserved` |
||||
|
||||
Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field |
||||
and thus should be rejected. Ideally we should remove this syntax altogether and |
||||
only permit the use of identifiers in this position, such as `reserved foo, |
||||
bar;`. |
||||
|
||||
We would introduce a feature like `feature.allow_strings_in_reserved`, start it |
||||
out as true, and then switch it to false. |
||||
|
||||
### Almost All Names are Fully Qualified |
||||
|
||||
Right now, Protobuf defines a complicated name resolution scheme that involves |
||||
matching subsets of names inspired by that of C++ (which is even more |
||||
complicated than ours!). Instead, we should require that every name be either a |
||||
single identifier OR fully-qualified. This is an attempt to move to Go-style |
||||
name resolution, which is significantly simpler to implement and explain. |
||||
|
||||
In particular, if a name is a single identifier, then: |
||||
|
||||
* It must be the name of a type defined at the top level of the current file. |
||||
* If it is the name of a message or enum for a field's type, it may be the |
||||
name of a type defined in the current message. This does *not* apply to |
||||
extension fields. |
||||
|
||||
Because any multi-component path must be fully qualified, we no longer need the |
||||
`.foo.Bar` syntax anymore, except to refer to messages defined in files without |
||||
a package. We forbid `.`-prefixed names except in that case. |
||||
|
||||
We would introduce a feature like `features.use_cpp_style_name_resolution`, |
||||
start it out as true, and then switch it to false. |
||||
|
||||
Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted |
||||
at a message, rather than a package. In that case, we could go as far as saying |
||||
that "names that start with a lower-case letter are fully-qualified, otherwise |
||||
they are relative to the current package, and will only find things defined in |
||||
the current file." |
||||
|
||||
Unlike Go, we do not allow finding things in other packages without being |
||||
fully-qualified; this mostly comes from doing source-diving in very large |
||||
packages, like the Go runtime, where it is very hard to find where something is |
||||
defined. |
||||
|
||||
### Unique Enum Values |
||||
|
||||
Right now, we allow aliases in enums: |
||||
|
||||
``` |
||||
enum Foo { |
||||
BAR = 5; |
||||
BAZ = 5; |
||||
} |
||||
``` |
||||
|
||||
This results in significant complexity in some parts of the backend, and weird |
||||
behavior in textproto and JSON. We should disallow this. |
||||
|
||||
We would introduce a feature like `features.allow_enum_aliases`, which would |
||||
switch from true to false. |
||||
|
||||
### Imports are Used |
||||
|
||||
We should adopt the Go rule that all non-public imports are used (i.e, every |
||||
import provides at least one type referred to in the file). |
||||
|
||||
We would introduce a feature like `features.allow_unused_imports`, which would |
||||
switch from true to false. |
||||
|
||||
### Next Field # is Reserved |
||||
|
||||
There's a few idioms for this checked by linters, such as `// Next ID: N`. We |
||||
should codify this in the language by rewriting that every message begin with |
||||
`reserved N to max;`, with the intent that `N` is the next never-used field |
||||
number. Because it is required to be the first production in the message, it can |
||||
be |
||||
|
||||
We could, additionally, require that *every* field number be either used or |
||||
reserved, in addition to having a single `N to max;` reservation. Alternatively, |
||||
we could require that every field number up to the largest one used be reserved; |
||||
gaps between message numbers are usually a smell. |
||||
|
||||
This applies equally to message fields and enum values. |
||||
|
||||
We would introduce a feature like `features.allow_unused_numbers`, which we |
||||
would switch from true to false. |
||||
|
||||
### Disallow Implicit String Concatenation |
||||
|
||||
Protobuf will implicitly concatenate two adjacent strings in any place it allows |
||||
quoted strings, e.g. `option foo = "bar " "baz;`. This has caused interesting |
||||
problems around `reserved` in the past, if a comma is omitted: `reserved "foo" |
||||
"bar";` is `reserved "foobar";`. |
||||
|
||||
We would introduce a feature like `features.concatenate_adjacent_strings`, which |
||||
would switch from true to false. |
||||
|
||||
### Package Is First |
||||
|
||||
The `package` declaration can appear anywhere in the file after `syntax` or |
||||
`edition`. We should take cues from Go and require it to be the first thing in |
||||
the file, after the edition. |
||||
|
||||
We would introduce a feature like `features.package_anywhere`, which would |
||||
switch from true to false. |
||||
|
||||
### Strict Boolean Options |
||||
|
||||
Boolean options can use true, false, True, False, T, or F as a value: `option |
||||
my_bool = T;`. We should restrict to only `true` and `false`. |
||||
|
||||
We would introduce a feature like `features.loose_bool_options`, which would |
||||
switch from true to false. |
||||
|
||||
### Decimal Field Numbers |
||||
|
||||
We permit non-decimal integer literals for field numbers, e.g. `optional int32 |
||||
x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should |
||||
require decimal literals, since there is very little reason to allow other |
||||
literals and makes the Protobuf language harder to parse. |
||||
|
||||
We would introduce a feature like `features.non_decimal_field_numbers`, which |
||||
would switch from true to false. |
Loading…
Reference in new issue