10 KiB
Stricter Schemas with Editions
Author: @mcy
Approved: 2022-11-28
Overview
The Protobuf language is surprisingly lax in what it allows in some places, even though these corners of the syntax space are rarely exercised in real use, and which add complexity to backends and runtimes.
This document describes several such corners in the language, and how we might use Editions to fix them (spoiler: we'll add a feature for each one and then ratchet the features).
This is primarily a memo on a use-case for Editions, and not a design doc per se.
Potential Lints
Entity Names
Protobuf does not enforce any constraints on names other than the "ASCII
identifier" rule: they must match the regex [A-Za-z_][A-Za-z0-9_]*
. This
results in problems for backends:
- Backends need to be able to convert between PascalCase, camelCase, snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
- Extraneous underscores, such as underscores in names that want to be PascalCase, trailing underscores, leading underscores, and repeated underscores create problems for case conversion and can clash with private names generated by backends.
- Protobuf does not support non-ASCII identifiers, mostly out of inertia more than anything else. Because some languages (Java most prominent among them) do not support them, we can never support them, but we are not particularly clear on this point.
The Protobuf language should be as strict as possible in what patterns it accepts for identifiers, since these need to be transformed to many languages. Thus, we propose the following regexes for the three casings used in Protobuf:
([A-Z][a-zA-Z0-9]*)+
for PascalCase. We require this case for:- Messages.
- Enums.
- Services.
- Methods.
[a-z][a-z0-9]*(_[a-z0-9]+)*
for snake_case. We require this case for:- Fields (including extensions).
- Package components.
[A-Z][A-Z0-9]*(_[A-Z0-9]+)*
for SHOUTY_CASE. We require this case for:- Enum values.
These patterns are intended to reject extraneous underscores, and to make casing of ASCII letters consistent. We explicitly only support ASCII for maximal portability to target languages. Note that option names are not included, since those are defined as fields in a proto, and would be subject to this rule automatically.
To migrate, we would introduce a bool feature feature.relax_identifier_rules
,
which can be applied to any entity. When set, it would cause the compiler to
reject .proto
files which contain identifiers that don't match the above
constraints. It would default to true and would switch to false in a future
edition.
Keywords as Identifiers
Currently, the Protobuf language allows using keywords as identifiers. This makes the parser somewhat more complicated than it has to be for minimal benefit, and shadowing behavior is not well-specified. For example, what does the following compile as?
message Foo {
message int32 {}
optional int32 foo = 1;
}
This is particularly fraught in places where either a keyword or a type name can
follow. For example, optional foo = 1;
is a proto3 non-optional with type
optional
, but the parser can't tell until it sees the =
sign.
To avoid this and eventually stop supporting this in the parser, we make the following set of keywords true reserved names that cannot be used as identifiers:
bool bytes double edition enum extend extensions fixed32
fixed64 float group import int32 int64 map max
message oneof option optional package public repeated required
reserved returns rpc service sfixed32 sfixed64 sint32 sint64
stream string syntax to uint32 uint64 weak
Additionally, we introduce the syntax #optional
for escaping a keyword as an
identifier. This may only be used on keywords, and not non-keyword
identifiers.
To migrate, we would introduce a bool feature feature.keywords_as_identifiers
,
which can be applied to any entity. When set, it would cause the compiler to
reject .proto
files which contain identifiers that use the names of keywords.
It would migrate true->false in a future edition. The #optional
syntax would
not need to be feature-gated.
From time to time we may introduce new keywords. The best procedure for doing so
is to add a feature.xxx_is_a_keyword
feature, start it out as true, and then
switch it to false in an edition, which would cause it to be treated as a
keyword for the purposes of this check. There's nothing stopping us from
starting to use it in the syntax without an edition if it would be relatively
unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they
really hate contextual keywords since it complicates the parser, so keywords
start out as contextual and become properly reserved in the next Rust edition.
Nonempty Package
Right now, an empty package is technically permitted. We should remove this functionality from the language completely and require every file to declare a package.
We would introduce a feature like feature.allow_missing_package
, start it out
as true, and switch it to false.
Invalid Names in reserved
Currently, reserved "foo-bar";
is accepted. It is not a valid name for a field
and thus should be rejected. Ideally we should remove this syntax altogether and
only permit the use of identifiers in this position, such as reserved foo, bar;
.
We would introduce a feature like feature.allow_strings_in_reserved
, start it
out as true, and then switch it to false.
Almost All Names are Fully Qualified
Right now, Protobuf defines a complicated name resolution scheme that involves matching subsets of names inspired by that of C++ (which is even more complicated than ours!). Instead, we should require that every name be either a single identifier OR fully-qualified. This is an attempt to move to Go-style name resolution, which is significantly simpler to implement and explain.
In particular, if a name is a single identifier, then:
- It must be the name of a type defined at the top level of the current file.
- If it is the name of a message or enum for a field's type, it may be the name of a type defined in the current message. This does not apply to extension fields.
Because any multi-component path must be fully qualified, we no longer need the
.foo.Bar
syntax anymore, except to refer to messages defined in files without
a package. We forbid .
-prefixed names except in that case.
We would introduce a feature like features.use_cpp_style_name_resolution
,
start it out as true, and then switch it to false.
Ideally, if we get strict identifier names, we can tell that Foo.Bar
is rooted
at a message, rather than a package. In that case, we could go as far as saying
that "names that start with a lower-case letter are fully-qualified, otherwise
they are relative to the current package, and will only find things defined in
the current file."
Unlike Go, we do not allow finding things in other packages without being fully-qualified; this mostly comes from doing source-diving in very large packages, like the Go runtime, where it is very hard to find where something is defined.
Unique Enum Values
Right now, we allow aliases in enums:
enum Foo {
BAR = 5;
BAZ = 5;
}
This results in significant complexity in some parts of the backend, and weird behavior in textproto and JSON. We should disallow this.
We would introduce a feature like features.allow_enum_aliases
, which would
switch from true to false.
Imports are Used
We should adopt the Go rule that all non-public imports are used (i.e, every import provides at least one type referred to in the file).
We would introduce a feature like features.allow_unused_imports
, which would
switch from true to false.
Next Field # is Reserved
There's a few idioms for this checked by linters, such as // Next ID: N
. We
should codify this in the language by rewriting that every message begin with
reserved N to max;
, with the intent that N
is the next never-used field
number. Because it is required to be the first production in the message, it can
be
We could, additionally, require that every field number be either used or
reserved, in addition to having a single N to max;
reservation. Alternatively,
we could require that every field number up to the largest one used be reserved;
gaps between message numbers are usually a smell.
This applies equally to message fields and enum values.
We would introduce a feature like features.allow_unused_numbers
, which we
would switch from true to false.
Disallow Implicit String Concatenation
Protobuf will implicitly concatenate two adjacent strings in any place it allows
quoted strings, e.g. option foo = "bar " "baz;
. This has caused interesting
problems around reserved
in the past, if a comma is omitted: reserved "foo" "bar";
is reserved "foobar";
.
We would introduce a feature like features.concatenate_adjacent_strings
, which
would switch from true to false.
Package Is First
The package
declaration can appear anywhere in the file after syntax
or
edition
. We should take cues from Go and require it to be the first thing in
the file, after the edition.
We would introduce a feature like features.package_anywhere
, which would
switch from true to false.
Strict Boolean Options
Boolean options can use true, false, True, False, T, or F as a value: option my_bool = T;
. We should restrict to only true
and false
.
We would introduce a feature like features.loose_bool_options
, which would
switch from true to false.
Decimal Field Numbers
We permit non-decimal integer literals for field numbers, e.g. optional int32 x = 0x01;
. Thankfully(?) we do not already permit a leading + or -. We should
require decimal literals, since there is very little reason to allow other
literals and makes the Protobuf language harder to parse.
We would introduce a feature like features.non_decimal_field_numbers
, which
would switch from true to false.