Adds "Stricter Schemas with Editions" to the GitHub code repository.

PiperOrigin-RevId: 563172650
2 years ago · d83e9f3498
parent 75af7f9838
commit d83e9f3498
2 changed files with 246 additions and 0 deletions
--- a/docs/design/editions/README.md
+++ b/docs/design/editions/README.md
@ -24,4 +24,5 @@ The following topics are in this repository:
 *   [Protobuf Editions Design: Features](protobuf-editions-design-features.md)
 *   [Editions: Life of a Featureset](editions-life-of-a-featureset.md)
 *   [Protobuf Editions for Schema Producers](protobuf-editions-for-schema-producers.md)
+*   [Stricter Schemas with Editions](stricter-schemas-with-editions.md)
 *   [Edition Naming](edition-naming.md)
--- a/docs/design/editions/stricter-schemas-with-editions.md
+++ b/docs/design/editions/stricter-schemas-with-editions.md
@ -0,0 +1,245 @@
+# Stricter Schemas with Editions
+
+**Author:** [@mcy](https://github.com/mcy)
+
+**Approved:** 2022-11-28
+
+## Overview
+
+The Protobuf language is surprisingly lax in what it allows in some places, even
+though these corners of the syntax space are rarely exercised in real use, and
+which add complexity to backends and runtimes.
+
+This document describes several such corners in the language, and how we might
+use Editions to fix them (spoiler: we'll add a feature for each one and then
+ratchet the features).
+
+This is primarily a memo on a use-case for Editions, and not a design doc per
+se.
+
+## Potential Lints
+
+### Entity Names
+
+Protobuf does not enforce any constraints on names other than the "ASCII
+identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This
+results in problems for backends:
+
+*   Backends need to be able to convert between PascalCase, camelCase,
+    snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
+*   Extraneous underscores, such as underscores in names that want to be
+    PascalCase, trailing underscores, leading underscores, and repeated
+    underscores create problems for case conversion and can clash with private
+    names generated by backends.
+*   Protobuf does not support non-ASCII identifiers, mostly out of inertia more
+    than anything else. Because some languages (Java most prominent among them)
+    do not support them, we can never support them, but we are not particularly
+    clear on this point.
+
+The Protobuf language should be as strict as possible in what patterns it
+accepts for identifiers, since these need to be transformed to many languages.
+Thus, we propose the following regexes for the three casings used in Protobuf:
+
+*   `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for:
+    *   Messages.
+    *   Enums.
+    *   Services.
+    *   Methods.
+*   `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for:
+    *   Fields (including extensions).
+    *   Package components.
+*   `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for:
+    *   Enum values.
+
+These patterns are intended to reject extraneous underscores, and to make casing
+of ASCII letters consistent. We explicitly only support ASCII for maximal
+portability to target languages. Note that option names are not included, since
+those are defined as fields in a proto, and would be subject to this rule
+automatically.
+
+To migrate, we would introduce a bool feature `feature.relax_identifier_rules`,
+which can be applied to any entity. When set, it would cause the compiler to
+reject `.proto` files which contain identifiers that don't match the above
+constraints. It would default to true and would switch to false in a future
+edition.
+
+### Keywords as Identifiers
+
+Currently, the Protobuf language allows using keywords as identifiers. This
+makes the parser somewhat more complicated than it has to be for minimal
+benefit, and shadowing behavior is not well-specified. For example, what does
+the following compile as?
+
+```
+message Foo {
+  message int32 {}
+  optional int32 foo = 1;
+}
+```
+
+This is particularly fraught in places where either a keyword or a type name can
+follow. For example, `optional foo = 1;` is a proto3 non-optional with type
+`optional`, but the parser can't tell until it sees the `=` sign.
+
+To avoid this and eventually stop supporting this in the parser, we make the
+following set of keywords true reserved names that cannot be used as
+identifiers:
+
+```
+bool      bytes    double  edition   enum      extend    extensions  fixed32
+fixed64   float    group   import    int32     int64     map         max
+message   oneof    option  optional  package   public    repeated    required
+reserved  returns  rpc     service   sfixed32  sfixed64  sint32      sint64
+stream    string   syntax  to        uint32    uint64    weak
+```
+
+Additionally, we introduce the syntax `#optional` for escaping a keyword as an
+identifier. This may *only* be used on keywords, and not non-keyword
+identifiers.
+
+To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`,
+which can be applied to any entity. When set, it would cause the compiler to
+reject `.proto` files which contain identifiers that use the names of keywords.
+It would migrate true->false in a future edition. The `#optional` syntax would
+not need to be feature-gated.
+
+From time to time we may introduce new keywords. The best procedure for doing so
+is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then
+switch it to false in an edition, which would cause it to be treated as a
+keyword for the purposes of this check. There's nothing stopping us from
+starting to use it in the syntax without an edition if it would be relatively
+unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they
+really hate contextual keywords since it complicates the parser, so keywords
+start out as contextual and become properly reserved in the next Rust edition.
+
+### Nonempty Package
+
+Right now, an empty package is technically permitted. We should remove this
+functionality from the language completely and require every file to declare a
+package.
+
+We would introduce a feature like `feature.allow_missing_package`, start it out
+as true, and switch it to false.
+
+### Invalid Names in `reserved`
+
+Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field
+and thus should be rejected. Ideally we should remove this syntax altogether and
+only permit the use of identifiers in this position, such as `reserved foo,
+bar;`.
+
+We would introduce a feature like `feature.allow_strings_in_reserved`, start it
+out as true, and then switch it to false.
+
+### Almost All Names are Fully Qualified
+
+Right now, Protobuf defines a complicated name resolution scheme that involves
+matching subsets of names inspired by that of C++ (which is even more
+complicated than ours!). Instead, we should require that every name be either a
+single identifier OR fully-qualified. This is an attempt to move to Go-style
+name resolution, which is significantly simpler to implement and explain.
+
+In particular, if a name is a single identifier, then:
+
+*   It must be the name of a type defined at the top level of the current file.
+*   If it is the name of a message or enum for a field's type, it may be the
+    name of a type defined in the current message. This does *not* apply to
+    extension fields.
+
+Because any multi-component path must be fully qualified, we no longer need the
+`.foo.Bar` syntax anymore, except to refer to messages defined in files without
+a package. We forbid `.`-prefixed names except in that case.
+
+We would introduce a feature like `features.use_cpp_style_name_resolution`,
+start it out as true, and then switch it to false.
+
+Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted
+at a message, rather than a package. In that case, we could go as far as saying
+that "names that start with a lower-case letter are fully-qualified, otherwise
+they are relative to the current package, and will only find things defined in
+the current file."
+
+Unlike Go, we do not allow finding things in other packages without being
+fully-qualified; this mostly comes from doing source-diving in very large
+packages, like the Go runtime, where it is very hard to find where something is
+defined.
+
+### Unique Enum Values
+
+Right now, we allow aliases in enums:
+
+```
+enum Foo {
+  BAR = 5;
+  BAZ = 5;
+}
+```
+
+This results in significant complexity in some parts of the backend, and weird
+behavior in textproto and JSON. We should disallow this.
+
+We would introduce a feature like `features.allow_enum_aliases`, which would
+switch from true to false.
+
+### Imports are Used
+
+We should adopt the Go rule that all non-public imports are used (i.e, every
+import provides at least one type referred to in the file).
+
+We would introduce a feature like `features.allow_unused_imports`, which would
+switch from true to false.
+
+### Next Field # is Reserved
+
+There's a few idioms for this checked by linters, such as `// Next ID: N`. We
+should codify this in the language by rewriting that every message begin with
+`reserved N to max;`, with the intent that `N` is the next never-used field
+number. Because it is required to be the first production in the message, it can
+be
+
+We could, additionally, require that *every* field number be either used or
+reserved, in addition to having a single `N to max;` reservation. Alternatively,
+we could require that every field number up to the largest one used be reserved;
+gaps between message numbers are usually a smell.
+
+This applies equally to message fields and enum values.
+
+We would introduce a feature like `features.allow_unused_numbers`, which we
+would switch from true to false.
+
+### Disallow Implicit String Concatenation
+
+Protobuf will implicitly concatenate two adjacent strings in any place it allows
+quoted strings, e.g. `option foo = "bar " "baz;`. This has caused interesting
+problems around `reserved` in the past, if a comma is omitted: `reserved "foo"
+"bar";` is `reserved "foobar";`.
+
+We would introduce a feature like `features.concatenate_adjacent_strings`, which
+would switch from true to false.
+
+### Package Is First
+
+The `package` declaration can appear anywhere in the file after `syntax` or
+`edition`. We should take cues from Go and require it to be the first thing in
+the file, after the edition.
+
+We would introduce a feature like `features.package_anywhere`, which would
+switch from true to false.
+
+### Strict Boolean Options
+
+Boolean options can use true, false, True, False, T, or F as a value: `option
+my_bool = T;`. We should restrict to only `true` and `false`.
+
+We would introduce a feature like `features.loose_bool_options`, which would
+switch from true to false.
+
+### Decimal Field Numbers
+
+We permit non-decimal integer literals for field numbers, e.g. `optional int32
+x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should
+require decimal literals, since there is very little reason to allow other
+literals and makes the Protobuf language harder to parse.
+
+We would introduce a feature like `features.non_decimal_field_numbers`, which
+would switch from true to false.