PiperOrigin-RevId: 564482619pull/14039/head
parent
e46855432e
commit
f98fa1897e
2 changed files with 369 additions and 0 deletions
@ -0,0 +1,368 @@ |
||||
# Java Lite For Editions |
||||
|
||||
**Author:** [@zhangskz](https://github.com/zhangskz) |
||||
|
||||
**Approved:** 2023-05-26 |
||||
|
||||
## Background |
||||
|
||||
The "Lite" implementation for Java utilizes a custom format for embedding |
||||
descriptors motivated by critical code-size and performance requirements for |
||||
Android. |
||||
|
||||
The code generator for Java Lite encodes an descriptor-like info string which is |
||||
stored into `RawMessageInfo`. This is decoded into `MessageSchema` which serves |
||||
as the descriptor-like schema for Java lite for parsing and serialization. |
||||
|
||||
The current implementation makes significant use of an `is_proto3` bit in the |
||||
encoding, which is problematic for editions. Note that any parser changes to the |
||||
format would also need to maintain backwards compatibility, due to our |
||||
guarantees for parsers to remain backwards compatible within a major version. |
||||
|
||||
## Overview |
||||
|
||||
Fortunately, we already have corresponding bits for most |
||||
[Editions Zero Features](edition-zero-features.md) in the corresponding |
||||
`MessageInfo` field entry encoding. |
||||
|
||||
We will move existing remaining syntax usages reading `is_proto3` to use these |
||||
bits. Several other syntax usages need to be made to be editions compatible by |
||||
merging implementations. |
||||
|
||||
As new editions features are added that must be represented in `MessageInfo`, we |
||||
will eventually need to revamp `MessageInfo` encoding to support these changes. |
||||
However, this should be avoidable for Editions Zero. |
||||
|
||||
## Recommendation |
||||
|
||||
### Encoding: Add Is Edition Bit |
||||
|
||||
`RawMessageInfo` should be augmented with an additional `is_edition` bit in |
||||
flags' unused bits. |
||||
|
||||
\[0]: flags, flags & 0x1 = is proto2?, flags & 0x2 = is message?, flags & |
||||
**0x4 = is edition?** |
||||
|
||||
The decoded `ProtoSyntax` should add a corresponding Editions option based on |
||||
this bit. |
||||
|
||||
``` |
||||
public enum ProtoSyntax |
||||
PROTO2; |
||||
PROTO3; |
||||
EDITIONS; |
||||
``` |
||||
|
||||
For now, there is no need to explicitly encode the raw editions string or |
||||
feature options. These resolved features will be encoded directly in their |
||||
corresponding field entries. |
||||
|
||||
### Encoding: Editions Zero Features |
||||
|
||||
Field entries in `RawMessageInfo` already encode bits corresponding to most |
||||
***resolved*** Editions Zero features in `GetExperimentalJavaFieldType`. This is |
||||
decoded in `fieldTypeWithExtraBits` by reading the corresponding bits. |
||||
|
||||
<table> |
||||
<tr> |
||||
<td><strong>Edition Zero Feature</strong> |
||||
</td> |
||||
<td><strong>Existing Encoding </strong> |
||||
</td> |
||||
<td><strong>Changes</strong> |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td>features.field_presence |
||||
</td> |
||||
<td> <code>kHasHasBit (0x1000)</code> |
||||
</td> |
||||
<td>Keep as-is. |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td>java.legacy_closed_enum |
||||
</td> |
||||
<td><code>kMapWithProto2EnumValue (0x800)</code> |
||||
</td> |
||||
<td>Replace with <code>kLegacyEnumIsClosedBit</code> |
||||
<p> |
||||
This will now be set for all enum fields, instead of just enum map values. |
||||
<p> |
||||
We will still need to check syntax in the interim in case of gencode. |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td><em>features.enum_type</em> |
||||
</td> |
||||
<td><em><code>EnumLiteGenerator</code> writes <code>UNRECOGNIZED(-1)</code> value for open enums in gencode.</em> |
||||
<p> |
||||
<em>This is not encoded in MessageInfo since this is an enum feature.</em> |
||||
</td> |
||||
<td><em>This is not needed in Editions Zero since enum closedness in Java Lite's runtime is dictated per-field by java.legacy_closed_enum. (<a href="edition-zero-feature-enum-field-closedness.md">Edition Zero Feature: Enum Field Closedness</a>), but should be used when Java non-conformance is fixed.</em> |
||||
<p> |
||||
<em>Note, this is implicitly encoded in kLegacyEnumIsClosedBit if java.legacy_closed_enum is unset since the corresponding FieldDescriptor helper should fall back on the EnumDescriptor.</em> |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td>features.repeated_field_encoding |
||||
</td> |
||||
<td><code>GetExperimentalJavaFieldTypeForPacked</code> |
||||
</td> |
||||
<td>Keep as-is. |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td>features.string_field_validation |
||||
</td> |
||||
<td><code>kUtf8CheckBit (0x200)</code> |
||||
</td> |
||||
<td>Keep as-is. |
||||
<p> |
||||
HINT does not apply to Java and will have the same behavior as MANDATORY or NONE |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td>features.message_encoding |
||||
</td> |
||||
<td>Not present. |
||||
</td> |
||||
<td>Encode as type group. |
||||
<p> |
||||
See below. |
||||
</td> |
||||
</tr> |
||||
</table> |
||||
|
||||
Several places already use these bits properly, but there are a few syntax |
||||
usages in the decoding that should be replaced by checking the corresponding |
||||
feature bit. |
||||
|
||||
There are several unused bits that we could use for future field-level features |
||||
before breaking the encoding format, but we should not need these for editions |
||||
zero. |
||||
|
||||
The results of the `is_proto3` and feature bits only seem to be used within |
||||
protobuf, and don't seem to be publicly exposed. |
||||
|
||||
#### features.message_encoding |
||||
|
||||
In the compiler, message fields with `features.message_encoding = DELIMITED` |
||||
should be treated as a group *before* encoding message info. |
||||
|
||||
This means that `GetExperimentalJavaFieldTypeForSingular`, should encode the |
||||
field's type `GROUP` (17), instead of its actual type `MESSAGE` (9), e.g. |
||||
|
||||
``` |
||||
int GetExperimentalJavaFieldTypeForSingular(const FieldDescriptor* field) { |
||||
int result = field->type(); |
||||
if (result == FieldDescriptor::TYPE_MESSAGE) { |
||||
if (field->isDelimited()) { |
||||
return 17; // GROUP |
||||
} |
||||
} |
||||
} |
||||
``` |
||||
|
||||
`ImmutableMessageFieldLiteGenerator::GenerateFieldInfo` calls this when |
||||
generating the message field's field info. |
||||
|
||||
The nested message's `MessageInfo` encoding does not need to be changed as this |
||||
is already identical for group and message. |
||||
|
||||
Since each message field will be handled separately, this means that the |
||||
post-editions proto file below |
||||
|
||||
``` |
||||
// foo.proto |
||||
edition = "tbd" |
||||
|
||||
message Foo { |
||||
message Bar { |
||||
int32 x = 1; |
||||
repeated int32 y = 2; |
||||
} |
||||
Bar bar = 1 [features.message_encoding = DELIMITED]; |
||||
Bar baz = 2; // not DELIMITED |
||||
|
||||
} |
||||
``` |
||||
|
||||
will be encoded and treated by `MessageSchema` like its pre-editions equivalent |
||||
below. |
||||
|
||||
``` |
||||
message Foo { |
||||
group Bar = 1 { |
||||
int32 x = 1; |
||||
repeated int32 y = 2; |
||||
} |
||||
Bar baz = 2; // not DELIMITED |
||||
} |
||||
``` |
||||
|
||||
We recommended this alternative to minimize changes to the encoding and how |
||||
groups are treated. |
||||
|
||||
In a future breaking change, we could consider renaming `FieldType.GROUP` to |
||||
`FieldType.MESSAGE_DELIMITED` while preserving the same number and encoding for |
||||
clarity. For now, we will leave the naming for this enum as-is. |
||||
|
||||
##### Alternative: Add kIsMessageEncodingDelimitedBit |
||||
|
||||
Alternatively, we could encode `features.message_encoding = DELIMITED` as-is as |
||||
type `MESSAGE`. The `MessageInfo` encoding would encode these as a normal |
||||
message field, using an unused (0x1100) bit as `kIsMessageEncodingDelimitedBit`. |
||||
|
||||
This could be used to indicate that the message should be parsed/serialized from |
||||
the wire-format as if it were a group. This would need to be passed along to |
||||
`MessageSchema` which would then handle treating Messages with this bit set as |
||||
groups e.g. in `case Message`. |
||||
|
||||
This is less ideal, since it would require handling this in multiple places. |
||||
|
||||
### Unify non-feature syntax usages |
||||
|
||||
There are several places that branch on syntax into separate proto2/proto3 |
||||
codepaths. These generally duplicate a lot of code and should be unified into a |
||||
single syntax-agnostic code path branching on the relevant feature bits. |
||||
|
||||
This code tends to be pretty opaque, so we should document this with comments or |
||||
add helpers (e.g. `isEnforceUtf8`) to indicate what feature bits are used as we |
||||
make changes here. |
||||
|
||||
<table> |
||||
<tr> |
||||
<td><code>ManifestSchemaFactory.newSchema()</code> |
||||
</td> |
||||
<td>MessageInfo -> Schema |
||||
</td> |
||||
<td>Allow extensions for editions. |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td><code>MessageSchema.getSerializedSize()</code> |
||||
</td> |
||||
<td>Message -> Serialized Size |
||||
</td> |
||||
<td>Unify getSerializedSizeProto2/3 |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td><code>MessageSchema.writeTo()</code> |
||||
</td> |
||||
<td>Serialize Message |
||||
</td> |
||||
<td>Unify writeFieldsInAscendingOrderProto2/3 |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td><code>MessageSchema.mergeFrom()</code> |
||||
</td> |
||||
<td>Parse Message |
||||
</td> |
||||
<td>Unify parseProto2/3Message |
||||
</td> |
||||
</tr> |
||||
<tr> |
||||
<td><code>DescriptorMessageInfoFactory.convert()</code> |
||||
</td> |
||||
<td>Descriptor -> MessageInfo |
||||
</td> |
||||
<td>Unify convertProto2/3 |
||||
</td> |
||||
</tr> |
||||
</table> |
||||
|
||||
There is a lot of dead code in Java Lite so several syntax usages can also be |
||||
deleted or merged where possible. |
||||
|
||||
## Alternatives |
||||
|
||||
### Alternative 1: Introduce New Backwards-compatible MessageInfo Encoding |
||||
|
||||
Add a new backwards-compatible `MessageInfo` encoding for editions. |
||||
|
||||
The `is_edition` bit could toggle the encoding format being used, where |
||||
`is_edition == true` indicates the new encoding format but `is_edition == false` |
||||
indicates the old encoding. |
||||
|
||||
This would allow us to encode additional information that the current encoding |
||||
format does not currently have available bits to support, such as the editions |
||||
string or additional features. |
||||
|
||||
For example, the current encoding format only has a fixed number of available |
||||
field entry bits where we could encode new feature bits. We will need to |
||||
introduce a new encoding format once we exceed these, or if we want to encode |
||||
features at the message level. |
||||
|
||||
In a future major version bump when support for proto2/3 is officially dropped, |
||||
we could drop support for the previous encoding format. |
||||
|
||||
The recommendation is to revisit alternative 1 along with alternative 2 |
||||
post-Editions zero as we need to support additional feature bits. |
||||
|
||||
#### Pros |
||||
|
||||
* Future-proof for future editions and features |
||||
|
||||
#### Cons |
||||
|
||||
* Blocks editions zero on more complex encoding changes that won't be used |
||||
yet. |
||||
* Requires more invasive updates to all MessageInfo decodings |
||||
|
||||
### Alternative 2: Move to MiniDescriptor encoding |
||||
|
||||
We could switch Java Lite to use the MiniDescriptor encoding specification. |
||||
|
||||
Like Java Lite, this encoding seems to be optimized to be lightweight and with |
||||
minimal descriptor information. |
||||
|
||||
MiniDescriptors do not encode proto2/proto3 syntax currently, which makes it |
||||
mostly editions-compatible. MiniDescriptors encode FieldModifier/MessageModifier |
||||
bits that correspond to some editions zero similarly to the Java Lite field |
||||
feature bits, and can be augmented to support additional features. |
||||
|
||||
Supposedly, this encoding format *should* support an arbitrary number of |
||||
modifier bits, but this needs to be double-checked to verify there isn't a |
||||
similar hard limit to the number of features. |
||||
|
||||
It is unclear whether this is sufficiently optimized for Android's needs and how |
||||
compatible this would be with Java Lite's Schemas. |
||||
|
||||
The recommendation is to revisit alternative 2 along with alternative 1 |
||||
post-Editions zero as we need to support additional feature bits. |
||||
|
||||
#### Pros |
||||
|
||||
* Unify implementations for lower long-term maintenance cost |
||||
|
||||
* MiniDescriptor encoding will eventually need to be updated for editions |
||||
anyways. |
||||
|
||||
#### Cons |
||||
|
||||
* Blocks editions zero on more complex encoding changes that aren't necessary. |
||||
|
||||
* Requires even more invasive updates to all MessageInfo decodings |
||||
|
||||
* Probably requires major version bumps to break compatibility |
||||
|
||||
* Unknown code size /schema compatibility constraints that would need to be |
||||
explored. |
||||
|
||||
* There are a few possible changes to MiniDescriptors on the table that we |
||||
should wait to settle before bringing on additional implementations. |
||||
|
||||
### Alternative 3: Do Nothing |
||||
|
||||
Doing nothing is always an alternative. Describe the pros and cons of it. |
||||
|
||||
#### Pros |
||||
|
||||
* No work |
||||
|
||||
#### Cons |
||||
|
||||
* Editions is blocked since Java Lite protos are stuck in the past |
Loading…
Reference in new issue