BorshSchema vs custom serialisation #211

Closed
mina86 opened this issue Aug 31, 2023 · 4 comments · Fixed by #229
@mina86
Contributor

mina86 commented Aug 31, 2023

Say I’d like to use varint in borsh. Or have a custom SmallVec type which is encoded with 8-bit length rather than 32-bit length.

This is easy enough to do by implementing custom BorshSerialize and BorshDeserialize. However, BorshSchema becomes an issue. Varint could be modelled as a nested enum with 256 variants. Similarly SmallVec could be modeled as an enum with 256 variants each being an array. That’s hardly a clean solution though.
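As an illustration of the custom encoding being discussed, here is a minimal sketch of a vector with an 8-bit length prefix. It uses only `std::io` so that it stands alone; the `SmallVec` type and the `write_to` / `read_from` names are hypothetical, and in a real crate the same bodies would presumably live in `BorshSerialize` / `BorshDeserialize` impls.

```rust
use std::io::{self, Read, Write};

/// Hypothetical vector whose length prefix is one byte instead of borsh's four.
#[derive(Debug, PartialEq)]
pub struct SmallVec(pub Vec<u8>);

impl SmallVec {
    /// Writes a 1-byte length followed by the raw elements.
    /// Fails if the vector holds more than 255 elements.
    pub fn write_to<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        let len = u8::try_from(self.0.len()).map_err(|_| {
            io::Error::new(
                io::ErrorKind::InvalidInput,
                "SmallVec holds at most 255 elements",
            )
        })?;
        writer.write_all(&[len])?;
        writer.write_all(&self.0)
    }

    /// Reads a 1-byte length and then exactly that many elements.
    pub fn read_from<R: Read>(reader: &mut R) -> io::Result<Self> {
        let mut len = [0u8; 1];
        reader.read_exact(&mut len)?;
        let mut items = vec![0u8; usize::from(len[0])];
        reader.read_exact(&mut items)?;
        Ok(SmallVec(items))
    }
}
```

Writing `SmallVec(vec![1, 2, 3])` this way produces four bytes (one length byte plus three elements), where a regular borsh `Vec<u8>` would use seven. The encoding itself is simple; the difficulty raised here is only that the schema has no way to say "length prefix of one byte".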

Do you guys have any thoughts on that?

@frol frol added the question Further information is requested label Aug 31, 2023
@frol
Collaborator

frol commented Aug 31, 2023

I would avoid expanding the scope of borsh spec with varint/smallvec specializations. I would treat these types as application-specific ones and leave app developers to optimize their custom types on their end.

@mina86
Contributor Author

mina86 commented Aug 31, 2023

So my question is how do I implement BorshSchema for such a type? There’s no Definition for an application-specific encoding. The options seem to be:

  • Accept there’s not going to be impl BorshSchema.
  • Add Declaration for the type but don’t provide Definition for it.
  • Do extremely hacky stuff with Definition::Enum.

Perhaps it would make sense to have Definition::AppSpecific with at least a rudimentary description of the format (e.g. min and max encoded length). For varint, for example, this would mean a definition mapping the declaration "VarInt<u32>" to Definition::AppSpecific(1..5).

I think this may also relate to #181. Perhaps it would make sense to extend Sequence and Enum by adding length_size and tag_size fields respectively? So currently we’d have Sequence { length_size: 4, elements: ... } and Enum { tag_size: 1, variants: ... }. This would allow expressing smallvec and enums with a different tag representation.
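To make the proposal concrete, the extended definitions could be sketched roughly as below. This is a standalone, hypothetical enum and not the actual borsh `Definition` type: the field types (`u8`, `String`, `Range<usize>`) and the `max_sequence_len` helper are assumptions chosen for illustration.

```rust
use std::ops::Range;

/// Hypothetical schema description; NOT the actual borsh `Definition` enum.
#[derive(Debug, Clone, PartialEq)]
pub enum Definition {
    /// Sequence whose length prefix occupies `length_size` bytes.
    Sequence { length_size: u8, elements: String },
    /// Enum whose discriminant occupies `tag_size` bytes.
    /// Each variant is a (name, declaration) pair.
    Enum { tag_size: u8, variants: Vec<(String, String)> },
    /// Opaque application-specific encoding, described only by its
    /// encoded length in bytes. The thread writes `1..5`; whether the
    /// upper bound is meant to be inclusive is not stated there.
    AppSpecific(Range<usize>),
}

/// Largest element count a length prefix of `length_size` bytes can express.
pub fn max_sequence_len(length_size: u8) -> u64 {
    if length_size >= 8 {
        u64::MAX
    } else {
        (1u64 << (8 * u32::from(length_size))) - 1
    }
}
```

With such a shape, today's `Vec<u8>` would be `Sequence { length_size: 4, elements: "u8".to_string() }`, while a smallvec would differ only in `length_size: 1`, which caps it at `max_sequence_len(1)`, i.e. 255 elements.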

@dj8yfo
Collaborator

dj8yfo commented Sep 3, 2023

A vector of varints, Vec<VarInt>, can first be serialized into a Vec<u8> and then presented to borsh as such, if the compression that varint provides is required.
The total number of VarInts will be lost, but the total number of bytes will not. So, with respect to schema, it will look like a Sequence { elements: "u8".to_string() }.

It's about the same with Rust's String at the moment. A String is essentially a Vec<VarInt>. It's serialized as a Vec<u8>, with the total number of characters lost in the serialized form, and it has a "string" Declaration for itself and an empty Definition (the second option in the comment above).

Similarly to String, one can define a type VarintsVec(Vec<VarInt>), serialize and deserialize its contents as a Vec<u8>, with error checking during deserialization (on the lengths of the varints encountered), and define its BorshSchema as a special "varint_vector" Declaration with an empty Definition.
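A rough sketch of the byte-level part of such a VarintsVec follows. It assumes unsigned LEB128 as the varint format and `u32` as the value type; neither is specified in this thread, and the function names are hypothetical. The resulting `Vec<u8>` is what would then be handed to borsh.

```rust
/// Encodes each value as unsigned LEB128 (an assumed varint format)
/// into a single byte buffer.
pub fn encode_varints(values: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &value in values {
        let mut v = value;
        loop {
            let byte = (v & 0x7f) as u8;
            v >>= 7;
            if v == 0 {
                // Last byte of this varint: continuation bit clear.
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

/// Decodes a buffer produced by `encode_varints`, rejecting varints that
/// overflow `u32` and a varint that is cut off at the end of the buffer.
/// Non-canonical (zero-padded) encodings are not rejected in this sketch.
pub fn decode_varints(bytes: &[u8]) -> Result<Vec<u32>, String> {
    let mut values = Vec::new();
    let mut current: u32 = 0;
    let mut shift: u32 = 0;
    for &byte in bytes {
        let chunk = u32::from(byte & 0x7f);
        // The fifth byte (shift == 28) may only carry the top 4 bits of a u32.
        if shift > 28 || (shift == 28 && chunk > 0x0f) {
            return Err("varint does not fit in u32".to_string());
        }
        current |= chunk << shift;
        if byte & 0x80 == 0 {
            values.push(current);
            current = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    if shift != 0 {
        return Err("truncated varint at end of buffer".to_string());
    }
    Ok(values)
}
```

Note that the decoder has to build a new `Vec<u32>`; the serialized form only records the byte count, so the number of values is known only after the whole buffer has been walked.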

A SmallVec will on average be about 127 bytes long (taking the minimal nonzero size of an element type to be 1 byte, according to #209). Adding a header_size field to Definition::Sequence in order to spend 3 fewer bytes on the header of an average payload of ~120 bytes doesn't appear to be a big gain compared to just using Vec.

@mina86
Contributor Author

mina86 commented Sep 3, 2023

> It's about the same with rust's String at the moment. A String is essentially a Vec<VarInt>. It's serialized as Vec<u8> with info about total characters lost in serialized form, and having a string Declaration for itself and empty Definition.

That’s not quite the same though. In String case, I can deserialise Vec<u8> and then convert it with no additional allocations to String. With Vec<VarInt> I’d have to first deserialise Vec<u8> and then allocate a new (say) Vec<VarInt<u32>>.
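The difference in allocations can be seen in a small standalone sketch. `String::from_utf8` is the standard-library call that validates the bytes and takes over the existing buffer without copying; the varint side is simplified here to single-byte varints (values below 128), which is an assumption made only to keep the example short, and both function names are hypothetical.

```rust
/// Reuses the byte buffer: `String::from_utf8` validates the bytes
/// and takes ownership of the existing allocation without copying.
pub fn into_string(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Simplified stand-in for a varint decoder that only accepts single-byte
/// varints (continuation bit clear). Even in this trivial case the decoded
/// values need a freshly allocated `Vec<u32>`, because the element size
/// differs from that of the byte buffer.
pub fn decode_single_byte_varints(bytes: &[u8]) -> Option<Vec<u32>> {
    bytes
        .iter()
        .map(|&b| if b < 0x80 { Some(u32::from(b)) } else { None })
        .collect()
}
```

In the first function the returned String points at the same heap buffer as the input vector; in the second, the input buffer is only read and a second allocation holds the result.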

However, this is a bit beside the point. Of course, I can always write serialisation which can be described by BorshSchema. The question is what to do when the serialisation I’m using cannot be described by BorshSchema.

3 participants