Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Definition::Enum::tag_width field #215

Merged
merged 3 commits into from
Sep 11, 2023
Merged

Add Definition::Enum::tag_width field #215

merged 3 commits into from
Sep 11, 2023

Conversation

mina86
Copy link
Contributor

@mina86 mina86 commented Sep 8, 2023

The field allows specifying how many bytes the discriminant byte
takes. This in turn allows for better support of custom encoding
formats with more than 256 variants and untagged unions.

    Enum {
        /// Width in bytes of the discriminant tag.
        ///
        /// Zero indicates this is an untagged union.  In standard borsh
        /// encoding this is one however custom encoding formats may use larger
        /// width if they need to encode more than 256 variants.
        ///
        /// Note: This definition must not be used if the tag is not encoded
        /// using little-endian format.
        tag_width: u8,

        /// Possible variants of the enumeration.
        variants: Vec<(VariantName, Declaration)>,
    },

relates to: #181

@mina86 mina86 requested review from frol and dj8yfo as code owners September 8, 2023 18:40
@mina86
Copy link
Contributor Author

mina86 commented Sep 9, 2023

Another idea is to introduce a Width type:

pub enum Width {
    NoWidth,
    LE8 = 1,
    LE16 = 2,
    LE24 = 3,
    LE32 = 4,
    LE40 = 5,
    LE48 = 6,
    LE56 = 7,
    LE64 = 8,
}

which would specify how the tag is encoded. That would leave possibility for extensions in the future.

@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 9, 2023

Offtopic:

    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },
    /// A structure, structurally similar to a tuple.
    Struct {
        /// Named fields of the structure with their types.
        ///
        /// Tuple-like structures are modelled as a `Tuple`.
        fields: Vec<(FieldName, Declaration)>,
    },

will be better than

    Tuple {
       /// Types of elements of the tuple.
       elements: Vec<(Option<FieldName>, Declaration)>,
   },

as the latter allows a mix of named and unnamed fields.

@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 9, 2023

Please update branch with new changes in trunk as well as (periodically) other prs.
Occasionally compile or other errors occur despite a pr being able to be merged cleanly.

@dj8yfo dj8yfo marked this pull request as draft September 9, 2023 17:08
@mina86 mina86 force-pushed the e branch 3 times, most recently from 9d72f84 to f7d9c47 Compare September 9, 2023 17:53
The field allows specifying how many bytes the discriminant byte takes.
This in turn allows for better support of custom encoding formats with
more than 256 variants and untagged unions.

Issue: near#181
@mina86 mina86 marked this pull request as ready for review September 9, 2023 21:41
@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 10, 2023

@mina86 the change to Definition, proposed in comment is a little bit inconsistent with respect to mixing information about serialized form and the rust syntax forms that serialized representation originated from.

Let's take Enum first as an example.
Current version in master contains this:

    Enum {
        variants: Vec<(VariantName, Declaration)>,
    },

This pr adds some info about serialized form:

    Enum {
        tag_width: u8,
        variants: Vec<(VariantName, Declaration)>,
    },

But, some info was skipped. With the addition of enum discriminants the enum tag in serialized form for VariantName cannot be computed just by its index in containing Vec<...>, so the full definition,
containing both full info about serialized form and VariantName-s, which are relevant only for hinting, what the original enum in rust looked like, would look like:

    Enum {
        tag_width: u8,
        variants: Vec<(u64, VariantName, Declaration)>,
    },

If it were supposed to only convey information about serialized form, than it would look like:

    Enum {
        tag_width: u8,
        variants: BTreeMap<u64, Declaration>,    // the order of variants, as they were declared in rust enum,  
                                                                          // is irrelevant to serialization
    },

Similarly, this change for Structs, proposed in the same comment

    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },
    /// A structure, structurally similar to a tuple.
    Struct {
        /// Named fields of the structure with their types.
        ///
        /// Tuple-like structures are modelled as a `Tuple`.
        fields: Vec<(FieldName, Declaration)>,
    },

contains some loss of information about originating forms in rust, as some Tuple-s originated from tuples in rust, and some - from tuple structs, that info is lost, while structs with named fields don't lose that kind of information.

If it were supposed to only convey information about serialized form, than it would look like this both for tuples and structs on rust side:

    Tuple { elements: Vec<Declaration> },

as the names of the fields are irrelevant.

A variant with full info (both serialization & origin) is what it looks like now:

    Tuple { elements: Vec<Declaration> },
    Struct { fields: Fields },
    ...
pub enum Fields {
    NamedFields(Vec<(FieldName, Declaration)>),
    UnnamedFields(Vec<Declaration>),
    Empty,
}

thus, i believe, to keep up the same logic about the Definition as a whole, it should be either:

// max info
pub enum Definition {
    Sequence {    // extends prev `Sequence` and replaces `Array`
        length_width: u8,
        min_length: u64,
        max_length: u64,
        elements: Declaration,
    },
    Tuple {    // stays the same
        elements: Vec<Declaration>,
    },
    Enum {  // extended with `tag_width` and `u64` added to variants tuple
        tag_width: u8,
        variants: Vec<(u64, VariantName, Declaration)>,
    },
    Struct { fields: Fields },  // stays the same
}

pub enum Fields {
    NamedFields(Vec<(FieldName, Declaration)>),
    UnnamedFields(Vec<Declaration>),
    Empty,
}

or

// only serialized form
pub enum Definition {
    Sequence {  // changes
        length_width: u8,
        min_length: u64,
        max_length: u64,
        elements: Declaration,
    },
    Tuple {  // stays the same, replaces `Struct`
        elements: Vec<Declaration>,
    },
    Enum {    // changes
        tag_width: u8,
        variants: BTreeMap<u64, Declaration>,    // the order of variants, as they were declared in rust enum,  
    }
}

@frol what do you think?

As there's also process of deserialization, which turns serialized bytes back into rust forms, i vote for variant with max info,
where adding u64 to Vec<(u64, VariantName, Declaration)> would be the most complex change.

On the other hand, if BorshSchema's job is to only concisely describe the format of output data, and the task of implementing actual conversions is delegated to BorshSerialize and BorshDeserialize traits, then the minimal variant would be better.

@mina86
Copy link
Contributor Author

mina86 commented Sep 10, 2023

@mina86 the change to Definition, proposed in comment is a little bit inconsistent with respect to mixing information about serialized form and the rust syntax forms that serialized representation originated from.

Note that one more piece of information one has to go on is declaration. Tuples are called Tuple<...> which can be used to distinguish a tuple from a struct. (Though as a side note I’d argue for replacing the declaration by (...) to not clash with someone creating a generic struct called Tuple). Even if it’s important to keep information about the rust types, I don’t think Fields is necessary for it. (I’d be happy removing all information not related to encoding but I don’t have strong feelings).

Lastly, Enum::tag_width is being added in both situations so can we get this PR merged regardless? ;)

@frol
Copy link
Collaborator

frol commented Sep 10, 2023

@dj8yfo I don't have full context about Borsh Schema, and I still believe there are more problems with it than it solves, so anything that kind of works is fine with me. I am fine with iterative approach and this PR only adds tag_width, so we don't need to block it.

@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 10, 2023

@mina86 with all of that articulated,

so can we get this PR merged regardless?

yes
i'd be happy to see

    Sequence {    // extends prev `Sequence` and replaces `Array`
        length_width: u8,
        min_length: u64,
        max_length: u64,
        elements: Declaration,
    },

in another pr.

If you also considered adding info about enum discriminants' values (in this or another pr), that would be consistent with current pr's drive for adding details about data layout:

    Enum {  // extended with `tag_width` and `u128` added to variants tuple
        tag_width: u8,
        variants: Vec<(u128, VariantName, Declaration)>,
    },

The change of this pr might be useful not in the context of having enums defined with more than 256 variants, but having enums defined with reasonable number of variants which have discriminants outside of u8 range.

Tuples are called Tuple<...>

Due to the clash of names, i think there's no need to introduce any change to Definition::{Struct, Tuple} , it looks more like a change for the sake of a change, not adding anything new.

@mina86
Copy link
Contributor Author

mina86 commented Sep 10, 2023

If you also considered adding info about enum discriminants' values (in this or another pr), that would be consistent with current pr's drive for adding details about data layout:

I can look into it, but I’m essentially doing all this in my spare time so I give no promises that I’ll add discriminants to the definition.

@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 11, 2023

it might actually be

    Enum {  // extended with `tag_width` and `i64` added to variants tuple
        tag_width: u8,
        variants: Vec<(i64, VariantName, Declaration)>,
    },

as rust allows isize , if BorshSchema were to mirror what rust specifies for discriminants

I give no promises that I’ll add discriminants to the definition

no, i just meant it would be a more meaningful investment of time for this repo than churning Struct/Tuple again

dj8yf0μl added 2 commits September 11, 2023 17:05
to make `BorshSchema` more congruent with `BorshSerialize, BorshDeserialize`
@mina86
Copy link
Contributor Author

mina86 commented Sep 11, 2023

no, i just meant it would be a more meaningful investment of time for this repo than churning Struct/Tuple again

Difference is I already have Struct/Tuple mostly written. And adding discriminant to Enum seems like a huge can of worms.

@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 11, 2023

Difference is I already have Struct/Tuple mostly written

well, then you should be ready it gets rejected if it doesn't look absolutely stunning )

@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 11, 2023

seems like a huge can of worms.

adding #[borsh(tag_type = "i16")] for derives would be,
just adding the discriminants to variants' tuples shouldn't be too hard

@mina86
Copy link
Contributor Author

mina86 commented Sep 11, 2023

I think I don’t understand GitHub… I’m trying to comment on your changes but my comment don’t seem to appear.

Firstly, why is actually dynamically-sized sequence of zero-sized encoding forbidden?

Secondly, rather than calling is_zero_size it’s probably better to check sz == 0 not to duplicate the work.

@dj8yfo dj8yfo force-pushed the e branch 2 times, most recently from 668a885 to 2adf4ca Compare September 11, 2023 19:29
@dj8yfo
Copy link
Collaborator

dj8yfo commented Sep 11, 2023

I’m trying to comment on your changes but my comment don’t seem to appear.

there must be this button in upper right corner Finish your review

@dj8yfo dj8yfo merged commit 655119a into near:master Sep 11, 2023
@frol frol mentioned this pull request Sep 11, 2023
@mina86 mina86 deleted the e branch September 12, 2023 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants