diff --git a/docs/design/editions/README.md b/docs/design/editions/README.md index 1dc076a2bf..751c45390e 100644 --- a/docs/design/editions/README.md +++ b/docs/design/editions/README.md @@ -24,4 +24,5 @@ The following topics are in this repository: * [Protobuf Editions Design: Features](protobuf-editions-design-features.md) * [Editions: Life of a Featureset](editions-life-of-a-featureset.md) * [Protobuf Editions for Schema Producers](protobuf-editions-for-schema-producers.md) +* [Stricter Schemas with Editions](stricter-schemas-with-editions.md) * [Edition Naming](edition-naming.md) diff --git a/docs/design/editions/stricter-schemas-with-editions.md b/docs/design/editions/stricter-schemas-with-editions.md new file mode 100644 index 0000000000..e4d8e4a487 --- /dev/null +++ b/docs/design/editions/stricter-schemas-with-editions.md @@ -0,0 +1,245 @@ +# Stricter Schemas with Editions + +**Author:** [@mcy](https://github.com/mcy) + +**Approved:** 2022-11-28 + +## Overview + +The Protobuf language is surprisingly lax in what it allows in some places, even +though these corners of the syntax space are rarely exercised in real use, and +which add complexity to backends and runtimes. + +This document describes several such corners in the language, and how we might +use Editions to fix them (spoiler: we'll add a feature for each one and then +ratchet the features). + +This is primarily a memo on a use-case for Editions, and not a design doc per +se. + +## Potential Lints + +### Entity Names + +Protobuf does not enforce any constraints on names other than the "ASCII +identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This +results in problems for backends: + +* Backends need to be able to convert between PascalCase, camelCase, + snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky. +* Extraneous underscores, such as underscores in names that want to be + PascalCase, trailing underscores, leading underscores, and repeated + underscores create problems for case conversion and can clash with private + names generated by backends. +* Protobuf does not support non-ASCII identifiers, mostly out of inertia more + than anything else. Because some languages (Java most prominent among them) + do not support them, we can never support them, but we are not particularly + clear on this point. + +The Protobuf language should be as strict as possible in what patterns it +accepts for identifiers, since these need to be transformed to many languages. +Thus, we propose the following regexes for the three casings used in Protobuf: + +* `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for: + * Messages. + * Enums. + * Services. + * Methods. +* `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for: + * Fields (including extensions). + * Package components. +* `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for: + * Enum values. + +These patterns are intended to reject extraneous underscores, and to make casing +of ASCII letters consistent. We explicitly only support ASCII for maximal +portability to target languages. Note that option names are not included, since +those are defined as fields in a proto, and would be subject to this rule +automatically. + +To migrate, we would introduce a bool feature `feature.relax_identifier_rules`, +which can be applied to any entity. When set, it would cause the compiler to +reject `.proto` files which contain identifiers that don't match the above +constraints. It would default to true and would switch to false in a future +edition. + +### Keywords as Identifiers + +Currently, the Protobuf language allows using keywords as identifiers. This +makes the parser somewhat more complicated than it has to be for minimal +benefit, and shadowing behavior is not well-specified. For example, what does +the following compile as? + +``` +message Foo { + message int32 {} + optional int32 foo = 1; +} +``` + +This is particularly fraught in places where either a keyword or a type name can +follow. For example, `optional foo = 1;` is a proto3 non-optional with type +`optional`, but the parser can't tell until it sees the `=` sign. + +To avoid this and eventually stop supporting this in the parser, we make the +following set of keywords true reserved names that cannot be used as +identifiers: + +``` +bool bytes double edition enum extend extensions fixed32 +fixed64 float group import int32 int64 map max +message oneof option optional package public repeated required +reserved returns rpc service sfixed32 sfixed64 sint32 sint64 +stream string syntax to uint32 uint64 weak +``` + +Additionally, we introduce the syntax `#optional` for escaping a keyword as an +identifier. This may *only* be used on keywords, and not non-keyword +identifiers. + +To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`, +which can be applied to any entity. When set, it would cause the compiler to +reject `.proto` files which contain identifiers that use the names of keywords. +It would migrate true->false in a future edition. The `#optional` syntax would +not need to be feature-gated. + +From time to time we may introduce new keywords. The best procedure for doing so +is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then +switch it to false in an edition, which would cause it to be treated as a +keyword for the purposes of this check. There's nothing stopping us from +starting to use it in the syntax without an edition if it would be relatively +unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they +really hate contextual keywords since it complicates the parser, so keywords +start out as contextual and become properly reserved in the next Rust edition. + +### Nonempty Package + +Right now, an empty package is technically permitted. We should remove this +functionality from the language completely and require every file to declare a +package. + +We would introduce a feature like `feature.allow_missing_package`, start it out +as true, and switch it to false. + +### Invalid Names in `reserved` + +Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field +and thus should be rejected. Ideally we should remove this syntax altogether and +only permit the use of identifiers in this position, such as `reserved foo, +bar;`. + +We would introduce a feature like `feature.allow_strings_in_reserved`, start it +out as true, and then switch it to false. + +### Almost All Names are Fully Qualified + +Right now, Protobuf defines a complicated name resolution scheme that involves +matching subsets of names inspired by that of C++ (which is even more +complicated than ours!). Instead, we should require that every name be either a +single identifier OR fully-qualified. This is an attempt to move to Go-style +name resolution, which is significantly simpler to implement and explain. + +In particular, if a name is a single identifier, then: + +* It must be the name of a type defined at the top level of the current file. +* If it is the name of a message or enum for a field's type, it may be the + name of a type defined in the current message. This does *not* apply to + extension fields. + +Because any multi-component path must be fully qualified, we no longer need the +`.foo.Bar` syntax anymore, except to refer to messages defined in files without +a package. We forbid `.`-prefixed names except in that case. + +We would introduce a feature like `features.use_cpp_style_name_resolution`, +start it out as true, and then switch it to false. + +Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted +at a message, rather than a package. In that case, we could go as far as saying +that "names that start with a lower-case letter are fully-qualified, otherwise +they are relative to the current package, and will only find things defined in +the current file." + +Unlike Go, we do not allow finding things in other packages without being +fully-qualified; this mostly comes from doing source-diving in very large +packages, like the Go runtime, where it is very hard to find where something is +defined. + +### Unique Enum Values + +Right now, we allow aliases in enums: + +``` +enum Foo { + BAR = 5; + BAZ = 5; +} +``` + +This results in significant complexity in some parts of the backend, and weird +behavior in textproto and JSON. We should disallow this. + +We would introduce a feature like `features.allow_enum_aliases`, which would +switch from true to false. + +### Imports are Used + +We should adopt the Go rule that all non-public imports are used (i.e, every +import provides at least one type referred to in the file). + +We would introduce a feature like `features.allow_unused_imports`, which would +switch from true to false. + +### Next Field # is Reserved + +There's a few idioms for this checked by linters, such as `// Next ID: N`. We +should codify this in the language by rewriting that every message begin with +`reserved N to max;`, with the intent that `N` is the next never-used field +number. Because it is required to be the first production in the message, it can +be + +We could, additionally, require that *every* field number be either used or +reserved, in addition to having a single `N to max;` reservation. Alternatively, +we could require that every field number up to the largest one used be reserved; +gaps between message numbers are usually a smell. + +This applies equally to message fields and enum values. + +We would introduce a feature like `features.allow_unused_numbers`, which we +would switch from true to false. + +### Disallow Implicit String Concatenation + +Protobuf will implicitly concatenate two adjacent strings in any place it allows +quoted strings, e.g. `option foo = "bar " "baz;`. This has caused interesting +problems around `reserved` in the past, if a comma is omitted: `reserved "foo" +"bar";` is `reserved "foobar";`. + +We would introduce a feature like `features.concatenate_adjacent_strings`, which +would switch from true to false. + +### Package Is First + +The `package` declaration can appear anywhere in the file after `syntax` or +`edition`. We should take cues from Go and require it to be the first thing in +the file, after the edition. + +We would introduce a feature like `features.package_anywhere`, which would +switch from true to false. + +### Strict Boolean Options + +Boolean options can use true, false, True, False, T, or F as a value: `option +my_bool = T;`. We should restrict to only `true` and `false`. + +We would introduce a feature like `features.loose_bool_options`, which would +switch from true to false. + +### Decimal Field Numbers + +We permit non-decimal integer literals for field numbers, e.g. `optional int32 +x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should +require decimal literals, since there is very little reason to allow other +literals and makes the Protobuf language harder to parse. + +We would introduce a feature like `features.non_decimal_field_numbers`, which +would switch from true to false. \ No newline at end of file