
# Stricter Schemas with Editions

**Author:** [@mcy](https://github.com/mcy)

**Approved:** 2022-11-28

## Overview

The Protobuf language is surprisingly lax in what it allows in some places.
These corners of the syntax space are rarely exercised in real use, yet they add
complexity to backends and runtimes.

This document describes several such corners in the language, and how we might
use Editions to fix them (spoiler: we'll add a feature for each one and then
ratchet the features).

This is primarily a memo on a use-case for Editions, and not a design doc per
se.

## Potential Lints

### Entity Names

Protobuf does not enforce any constraints on names other than the "ASCII
identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This
results in problems for backends:

*   Backends need to be able to convert between PascalCase, camelCase,
    snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
*   Extraneous underscores, such as underscores in names that want to be
    PascalCase, trailing underscores, leading underscores, and repeated
    underscores create problems for case conversion and can clash with private
    names generated by backends.
*   Protobuf does not support non-ASCII identifiers, mostly out of inertia more
    than anything else. Because some languages (Java most prominent among them)
    do not support them, we can never support them, but we are not particularly
    clear on this point.

The Protobuf language should be as strict as possible in what patterns it
accepts for identifiers, since these need to be transformed to many languages.
Thus, we propose the following regexes for the three casings used in Protobuf:

*   `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for:
    *   Messages.
    *   Enums.
    *   Services.
    *   Methods.
*   `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for:
    *   Fields (including extensions).
    *   Package components.
*   `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for:
    *   Enum values.

These patterns are intended to reject extraneous underscores, and to make casing
of ASCII letters consistent. We explicitly only support ASCII for maximal
portability to target languages. Note that option names are not included, since
those are defined as fields in a proto, and would be subject to this rule
automatically.
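As a rough illustration of how a linter might apply these patterns, the sketch
below checks names against the three regexes using Python's `re.fullmatch`. The
`CASING_RULES` table and the `check_name` helper are hypothetical names invented
for this example; they are not part of `protoc`.

```python
import re

# The three patterns proposed above.
PASCAL_CASE = re.compile(r"([A-Z][a-zA-Z0-9]*)+")
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
SHOUTY_CASE = re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*")

# Hypothetical mapping from entity kind to the casing it must use.
CASING_RULES = {
    "message": PASCAL_CASE,
    "enum": PASCAL_CASE,
    "service": PASCAL_CASE,
    "method": PASCAL_CASE,
    "field": SNAKE_CASE,
    "package_component": SNAKE_CASE,
    "enum_value": SHOUTY_CASE,
}


def check_name(kind: str, name: str) -> bool:
    """Return True if `name` satisfies the casing rule for `kind`."""
    # fullmatch anchors the pattern at both ends of the name.
    return CASING_RULES[kind].fullmatch(name) is not None
```

With these rules, `check_name("field", "foo_bar")` is accepted, while
`foo__bar`, `foo_`, and `_foo` are all rejected as field names.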

To migrate, we would introduce a bool feature `feature.relax_identifier_rules`,
which can be applied to any entity. When unset, it would cause the compiler to
reject `.proto` files which contain identifiers that don't match the above
constraints. It would default to true and would switch to false in a future
edition.

### Keywords as Identifiers

Currently, the Protobuf language allows using keywords as identifiers. This
makes the parser somewhat more complicated than it has to be for minimal
benefit, and shadowing behavior is not well-specified. For example, what does
the following compile as?

```
message Foo {
  message int32 {}
  optional int32 foo = 1;
}
```

This is particularly fraught in places where either a keyword or a type name can
follow. For example, `optional foo = 1;` is a proto3 non-optional field named
`foo` with type `optional`, but the parser can't tell until it sees the `=` sign.

To avoid this and eventually stop supporting this in the parser, we make the
following set of keywords true reserved names that cannot be used as
identifiers:

```
bool      bytes    double  edition   enum      extend    extensions  fixed32
fixed64   float    group   import    int32     int64     map         max
message   oneof    option  optional  package   public    repeated    required
reserved  returns  rpc     service   sfixed32  sfixed64  sint32      sint64
stream    string   syntax  to        uint32    uint64    weak
```

Additionally, we introduce the syntax `#optional` for escaping a keyword as an
identifier. This may *only* be used on keywords, and not non-keyword
identifiers.

To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`,
which can be applied to any entity. When unset, it would cause the compiler to
reject `.proto` files which contain identifiers that use the names of keywords.
It would migrate true->false in a future edition. The `#optional` syntax would
not need to be feature-gated.
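A minimal sketch of the check described above, assuming the feature is switched
off and that `#optional` denotes the identifier `optional`. The helper name
`identifier_from_token` is hypothetical, and the token is assumed to have been
split out by a lexer already.

```python
# The reserved names listed above.
KEYWORDS = frozenset("""
bool bytes double edition enum extend extensions fixed32
fixed64 float group import int32 int64 map max
message oneof option optional package public repeated required
reserved returns rpc service sfixed32 sfixed64 sint32 sint64
stream string syntax to uint32 uint64 weak
""".split())


def identifier_from_token(token: str) -> str:
    """Return the identifier a token denotes, or raise ValueError."""
    if token.startswith("#"):
        name = token[1:]
        # The escape may only be used on keywords.
        if name not in KEYWORDS:
            raise ValueError(f"'#' escape used on non-keyword {name!r}")
        return name
    if token in KEYWORDS:
        # A bare keyword can no longer serve as an identifier.
        raise ValueError(f"keyword {token!r} must be written as '#{token}'")
    return token
```

Under these assumptions, `identifier_from_token("#optional")` yields
`"optional"`, while both `optional` and `#foo` are rejected.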

From time to time we may introduce new keywords. The best procedure for doing so
is to add a `feature.xxx_is_a_keyword` feature, start it out as false, and then
switch it to true in an edition, which would cause `xxx` to be treated as a
keyword for the purposes of this check. There's nothing stopping us from
starting to use it in the syntax without an edition if it would be relatively
unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they
really hate contextual keywords since they complicate the parser, so keywords
start out as contextual and become properly reserved in the next Rust edition.

### Nonempty Package

Right now, an empty package is technically permitted. We should remove this
functionality from the language completely and require every file to declare a
package.

We would introduce a feature like `feature.allow_missing_package`, start it out
as true, and switch it to false.

### Invalid Names in `reserved`

Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field
and thus should be rejected. Ideally we should remove this syntax altogether and
only permit the use of identifiers in this position, such as `reserved foo,
bar;`.

We would introduce a feature like `feature.allow_strings_in_reserved`, start it
out as true, and then switch it to false.

### Almost All Names are Fully Qualified

Right now, Protobuf defines a complicated name resolution scheme that involves
matching subsets of names inspired by that of C++ (which is even more
complicated than ours!). Instead, we should require that every name be either a
single identifier OR fully-qualified. This is an attempt to move to Go-style
name resolution, which is significantly simpler to implement and explain.

In particular, if a name is a single identifier, then:

*   It must be the name of a type defined at the top level of the current file.
*   If it is the name of a message or enum for a field's type, it may be the
    name of a type defined in the current message. This does *not* apply to
    extension fields.

Because any multi-component path must be fully qualified, we no longer need the
`.foo.Bar` syntax anymore, except to refer to messages defined in files without
a package. We forbid `.`-prefixed names except in that case.
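The rule above can be sketched as a small lookup function. This is only an
illustration: `resolve_type` and its parameters are hypothetical, checking the
enclosing message before the top level is an assumption, and for brevity the
sketch rejects every `.`-prefixed name rather than modeling the exception for
files without a package.

```python
def resolve_type(name, *, package, top_level, known, enclosing=None, nested=()):
    """Resolve `name` to a fully-qualified type name, or raise LookupError.

    package:   package of the current file, e.g. "my.pkg"
    top_level: names of types defined at the top level of the current file
    known:     every fully-qualified type name visible to the file
    enclosing: name of the top-level message containing the field, or None
               (pass None for extension fields)
    nested:    names of types defined directly in that message
    """
    if name.startswith("."):
        # Simplification: the package-less exception is not modeled here.
        raise LookupError(f"leading '.' is not allowed in {name!r}")
    if "." in name:
        # Multi-component paths must already be fully qualified.
        if name in known:
            return name
        raise LookupError(f"{name!r} is not a fully-qualified type name")
    # Single identifier: a type in the current message (assumed to win),
    # otherwise a type at the top level of the current file.
    if enclosing is not None and name in nested:
        return f"{package}.{enclosing}.{name}"
    if name in top_level:
        return f"{package}.{name}"
    raise LookupError(f"{name!r} is not defined in the current file")
```

Note that a partially qualified path such as `Foo.Inner` fails here even when
`my.pkg.Foo.Inner` exists, which is the simplification the rule is after.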

We would introduce a feature like `features.use_cpp_style_name_resolution`,
start it out as true, and then switch it to false.

Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted
at a message, rather than a package. In that case, we could go as far as saying
that "names that start with a lower-case letter are fully-qualified, otherwise
they are relative to the current package, and will only find things defined in
the current file."

Unlike Go, we do not allow finding things in other packages without being
fully-qualified; this mostly comes from doing source-diving in very large
packages, like the Go runtime, where it is very hard to find where something is
defined.

### Unique Enum Values

Right now, we allow aliases in enums:

```
enum Foo {
  BAR = 5;
  BAZ = 5;
}
```

This results in significant complexity in some parts of the backend, and weird
behavior in textproto and JSON. We should disallow this.

We would introduce a feature like `features.allow_enum_aliases`, which would
switch from true to false.
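For illustration, detecting aliases amounts to grouping value names by number.
The sketch below assumes the enum has already been parsed into a mapping from
value name to number; `find_enum_aliases` is a hypothetical helper, not an
existing compiler API.

```python
from collections import defaultdict


def find_enum_aliases(values: dict[str, int]) -> dict[int, list[str]]:
    """Map each number shared by several enum value names to those names."""
    names_by_number = defaultdict(list)
    for name, number in values.items():
        names_by_number[number].append(name)
    # Keep only the numbers that more than one name maps to.
    return {n: names for n, names in names_by_number.items() if len(names) > 1}
```

For the `Foo` enum above, `find_enum_aliases({"BAR": 5, "BAZ": 5})` returns
`{5: ["BAR", "BAZ"]}`; an enum without aliases yields an empty dict.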

### Imports are Used

We should adopt the Go rule that all non-public imports are used (i.e., every
import provides at least one type referred to in the file).

We would introduce a feature like `features.allow_unused_imports`, which would
switch from true to false.
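A minimal sketch of this check, assuming the compiler already knows which types
each non-public import provides and which fully-qualified types the file refers
to. The `unused_imports` helper and its inputs are hypothetical.

```python
def unused_imports(provided: dict[str, set[str]], referenced: set[str]) -> list[str]:
    """Return the non-public imports that provide no referenced type.

    provided:   import path -> fully-qualified types defined by that file
                (public imports are assumed to be excluded by the caller)
    referenced: fully-qualified types referred to in the importing file
    """
    # An import is used if at least one of its types is referenced.
    return sorted(path for path, types in provided.items()
                  if not (types & referenced))
```

For example, if `a.proto` provides `pkg.A`, `b.proto` provides `pkg.B`, and the
file only refers to `pkg.A`, the function reports `["b.proto"]`.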

### Next Field # is Reserved

There are a few idioms for this checked by linters, such as `// Next ID: N`. We
should codify this in the language by requiring that every message begin with
`reserved N to max;`, with the intent that `N` is the next never-used field
number. Because it is required to be the first production in the message, it can
be

We could, additionally, require that *every* field number be either used or
reserved, in addition to having a single `N to max;` reservation. Alternatively,
we could require that every field number up to the largest one used be reserved;
gaps between message numbers are usually a smell.

This applies equally to message fields and enum values.

We would introduce a feature like `features.allow_unused_numbers`, which we
would switch from true to false.
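The stricter variant, in which every field number below `N` is either used or
reserved, could be checked roughly as follows. This is a sketch for message
fields only (numbering starts at 1); `unaccounted_numbers` and its parameters
are hypothetical.

```python
def unaccounted_numbers(used, reserved_ranges, next_number):
    """Return the numbers below `next_number` that are neither used nor reserved.

    used:            set of field numbers assigned to fields
    reserved_ranges: inclusive (start, end) pairs from `reserved` statements,
                     excluding the leading `N to max` reservation
    next_number:     the N in `reserved N to max;`
    """
    if any(n >= next_number for n in used):
        raise ValueError("a field uses a number inside `reserved N to max`")
    reserved = set()
    for start, end in reserved_ranges:
        reserved.update(range(start, end + 1))
    # Every number in 1..N-1 must be accounted for one way or the other.
    return [n for n in range(1, next_number)
            if n not in used and n not in reserved]
```

A message that uses 1, 2, and 5, reserves 3, and begins with
`reserved 6 to max;` would be flagged for the gap at 4.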

### Disallow Implicit String Concatenation

Protobuf will implicitly concatenate two adjacent strings in any place it allows
quoted strings, e.g. `option foo = "bar " "baz";`. This has caused interesting
problems around `reserved` in the past, if a comma is omitted: `reserved "foo"
"bar";` is parsed as `reserved "foobar";`.

We would introduce a feature like `features.concatenate_adjacent_strings`, which
would switch from true to false.

### Package Is First

The `package` declaration can appear anywhere in the file after `syntax` or
`edition`. We should take cues from Go and require it to be the first thing in
the file, after the edition.

We would introduce a feature like `features.package_anywhere`, which would
switch from true to false.

### Strict Boolean Options

Boolean options can use true, false, True, False, T, or F as a value: `option
my_bool = T;`. We should restrict to only `true` and `false`.

We would introduce a feature like `features.loose_bool_options`, which would
switch from true to false.

### Decimal Field Numbers

We permit non-decimal integer literals for field numbers, e.g. `optional int32
x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should
require decimal literals, since there is very little reason to allow other
literals and allowing them makes the Protobuf language harder to parse.

We would introduce a feature like `features.non_decimal_field_numbers`, which
would switch from true to false.
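As a sketch, the stricter rule could be a single pattern applied to the literal
text of the field number. The pattern below is an assumption: it also rejects
leading zeros, since a leading `0` introduces an octal literal, and it rejects
`0` itself, which is not a valid field number. The helper name is hypothetical.

```python
import re

# Assumed pattern: a positive decimal literal without leading zeros.
DECIMAL_FIELD_NUMBER = re.compile(r"[1-9][0-9]*")


def is_decimal_field_number(literal: str) -> bool:
    """Return True if `literal` is a plain decimal field number."""
    return DECIMAL_FIELD_NUMBER.fullmatch(literal) is not None
```

With this pattern, `1` and `12` are accepted, while `0x01`, `01`, and `+1` are
rejected.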