Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stdin.readByte is broken. #17

Open
mhermier opened this issue Nov 2, 2018 · 39 comments · May be fixed by #111
Open

Stdin.readByte is broken. #17

mhermier opened this issue Nov 2, 2018 · 39 comments · May be fixed by #111

Comments

@mhermier
Copy link
Contributor

mhermier commented Nov 2, 2018

Hi,
When usin Stdin.readByte it incorrectly read bytes depending on the value of the previously byte read.
If you look at the code:

static readByte() {
  return read_ {
    // Peel off the first byte.
    var byte = __buffered.bytes[0] // This operates on bytes
    __buffered = __buffered[1..-1] // This operates on UTF8 strings
    return byte
  }
}

Meaning that for every byte values > 127 the whole utf8 character is skipped instead of a single byte.

@ruby0x1 ruby0x1 transferred this issue from wren-lang/wren Sep 18, 2019
@joshgoebel
Copy link
Contributor

What should the correct behavior here be with regards to readLine then? It seems there is a conflict... if you read a single byte of a 4 byte UTF-8... what is readLine to do if you call it next? Your "position" in the buffer is now invalid for returning lines... the next 3 bytes are invalid...

@joshgoebel
Copy link
Contributor

One idea would be for readByte to have a mini-buffer of it's own... such that readByte always pulls a full UTF-8 character off the buffer and then subsequent calls returns the remaining bytes... but if sone were to alternate willy-nilly between readByte and readLine one could still lose data.

@mhermier
Copy link
Contributor Author

readLine should technically returns a buffer no matter what encoding errors happens in there. Strictly speaking UTF-8 encoding is supported but not required to be in a String. If we are in the middle of an UTF-8 char operator[_] should handle it somehow (or bomb which would be saner than what we currently have).

@joshgoebel
Copy link
Contributor

joshgoebel commented Apr 25, 2021

If we are in the middle of an UTF-8 char operator[_] should handle it somehow (or bomb which would be saner than what we currently have).

You mean add the support to String itself? Right now I can't even see any easy way of truncating the string and cutting off a portion of a character... it seems all the indexing stuff is based on full characters.

I could definitely see the path to making readLine throw an error if readByte was in the "middle" of a character though but then it almost feels like you need some API to check the status of the stream/buffer...

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 25, 2021

If you mean indexing on String, indices are bytes. See the description here.

@mhermier
Copy link
Contributor Author

To solve the original problem, the proper solution would be to add range support to StringByteSequence.[_] or do some proper buffering for efficiency.

@mhermier
Copy link
Contributor Author

@ruby0x1 The doc is correct, but the non documented range implementation is not byte indexed....

@joshgoebel
Copy link
Contributor

joshgoebel commented Apr 25, 2021

If you mean indexing on String, indices are bytes.

Yes, right, I read that... but the behavior around "half-characters" is what's causing this bug.

__buffered = __buffered[lineSeparator + 1..-1]

Trying to advance the buffer by a single index seems to eat forward to the next whole UTF-8 character. So that results in bytes being dropped. I already have a fix if you assume that one doesn't use both readByte and readLine which seems problematic.

the proper solution would be to add range support to StringByteSequence.[_]

How would that help here? It seems (I suppose I'm assuming) that readLine expects to work with UTF-8 [strings] while readByte expects to work with bytes... The problem seems to be a desire to hold a cursor at TWO different places in the buffer concurrently.

// add some UTF-8 charcters to the buffer
Stdin.readByte() 
// we're now in the "middle" of a UTF-8 character
Stdin.readLine() 

What is readLine() expected to return here exactly? I was suggesting an error but then it would seem we need to add an API to detect this or advance to the next full-character so that user land could handle this, no?

If it could return the broken string at the exact correct position that might be one answer, but I'm not sure if that's possible or what that would look like.

Is Fiber.abort what one would use here to raise an error?

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 25, 2021

@mhermier on the same page, I documented it recently, but the fact it's not indexed in bytes should be documented too.

image

@mhermier
Copy link
Contributor Author

mhermier commented Apr 25, 2021

@ruby0x1 test it with unicode characters, results should be broken, and be based on UTF-8 indexes:

@mhermier
Copy link
Contributor Author

mhermier commented Apr 25, 2021

EDIT: Previous message contained a broken test.

The following patch should be more conform to the byte indexing specification. I can make a branch if required.

commit 5baaffacb4909c83ea9302c8feed582d101e2600
Author: Michel Hermier <[email protected]>
Date:   Sun Apr 25 21:58:31 2021 +0200

    wren/core: Fix `String` range subscript so that it conforms to bytes indexing requirement.

diff --git a/src/vm/wren_value.c b/src/vm/wren_value.c
index 2700c4def..323517034 100644
--- a/src/vm/wren_value.c
+++ b/src/vm/wren_value.c
@@ -724,26 +724,12 @@ Value wrenNewStringLength(WrenVM* vm, const char* text, size_t length)
 Value wrenNewStringFromRange(WrenVM* vm, ObjString* source, int start,
                              uint32_t count, int step)
 {
+  ObjString* result = allocateString(vm, count);
   uint8_t* from = (uint8_t*)source->value;
-  int length = 0;
-  for (uint32_t i = 0; i < count; i++)
-  {
-    length += wrenUtf8DecodeNumBytes(from[start + i * step]);
-  }
-
-  ObjString* result = allocateString(vm, length);
-  result->value[length] = '\0';
-
   uint8_t* to = (uint8_t*)result->value;
   for (uint32_t i = 0; i < count; i++)
   {
-    int index = start + i * step;
-    int codePoint = wrenUtf8Decode(from + index, source->length - index);
-
-    if (codePoint != -1)
-    {
-      to += wrenUtf8Encode(codePoint, to);
-    }
+    to[i] = from[start + i *step];
   }
 
   hashString(result);
diff --git a/test/core/string/subscript_range.wren b/test/core/string/subscript_range.wren
index f4d6cd298..6c8e4ddb9 100644
--- a/test/core/string/subscript_range.wren
+++ b/test/core/string/subscript_range.wren
@@ -42,12 +42,13 @@ System.print("abc"[3..-1] == "") // expect: true
 // Bytes:           11111
 //        012345678901234
 // Chars: sø mé ஃ  thî ng
+System.print("søméஃthîng".bytes.join(", ")) // expect: 115, 195, 184, 109, 195, 169, 224, 174, 131, 116, 104, 195, 174, 110, 103
 System.print("søméஃthîng"[0..3]) // expect: søm
 System.print("søméஃthîng"[3...10]) // expect: méஃt
 
 // Only includes sequences whose first byte is in the range.
-System.print("søméஃthîng"[2..6]) // expect: méஃ
-System.print("søméஃthîng"[2...6]) // expect: mé
-System.print("søméஃthîng"[2...7]) // expect: méஃ
+System.print("søméஃthîng"[2..6].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
+System.print("søméஃthîng"[2...6].bytes.join(", ")) // expect: 184, 109, 195, 169
+System.print("søméஃthîng"[2...7].bytes.join(", ")) // expect: 184, 109, 195, 169, 224
 
 // TODO: Strings including invalid UTF-8.

@joshgoebel
Copy link
Contributor

Personally I find the string tests easier to read than the byte value tests... from the tests though I can't tell that any behavior changed here? Looks like it's doing exactly what it was before?

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 25, 2021

It changed indexing with a range (substring) to use bytes instead of whole characters. A single character can be multiple bytes which was what happened before, it counts those, the change doesn't.

@mhermier
Copy link
Contributor Author

It is completely different, since some produce invalid UTF-8 strings that makes the test suite explodes.

@joshgoebel
Copy link
Contributor

joshgoebel commented Apr 25, 2021

Yeah I think I see now, it just was very confusing that the entire format of the tests also changed so you can't look at the tests diff at a glance to see the diff.

I assume previously that in this specific case the beginning bytes were "rounded up" and the end bytes were also to the next whole character (that's how I'm reading it).

@joshgoebel
Copy link
Contributor

joshgoebel commented Apr 25, 2021

So this patch would actually resolve the whole issue here I think... but PERHAPS introduce the problem where someone calling readLine would blow up because they retrieved invalid UTF-8... (I'm not sure what the behavior of indexOf is in this scenario) and that would be a problem for the user?

And can + on a String still operate if the original string is not valid UTF-8? IE, can we still append the buffer here?

@mhermier
Copy link
Contributor Author

Yes, and a little bit more than that, if there was legitimate invalid chars in the middle of the sequence they would have bin previously silently removed.

The change is breaking some existing code, since now we don't trim invalid character anymore, next valid character is no longer reached at index 1...

+ should be safe, it does some printf magic, so it does not really care about invalid bytes, since it blindly memcpy wren String content.

@mhermier
Copy link
Contributor Author

However, when previously it was possible to revert utf-8 strings using negative range steps, it does only work with ascii-7 strings now...

@joshgoebel
Copy link
Contributor

if there was legitimate invalid chars in the middle of the sequence they would have bin previously silently removed.

I don't follow this and I'm not sure what a "legitimate invalid char" is. :)

@mhermier
Copy link
Contributor Author

Legitimate like, you use an invalid utf8 char on purpose to make a split between 2 strings, in a protocol.

Well while it fix somehow, the fact that we have bytes indexing in String is a not really logical to me, it would be more logical if this was accessible through StringByteSequence

@joshgoebel
Copy link
Contributor

the fact that we have bytes indexing in String is a not really logical to me,

Agree, it's more than a tad bit confusion.

@mhermier
Copy link
Contributor Author

Rethinking at it, maybe we should provide the old behavior unchanged, and provide the new one via StringByteSequence operator [_]. Previous one was allowing to invert letters of the string, so even if indexing was odd it has some usage (even if some test where missing)

@mhermier
Copy link
Contributor Author

Argg we are doomed, StringByteSequence[_] returns a Num ... Have to sleep on this, to see if an idea pops up.

@joshgoebel
Copy link
Contributor

and provide the new one via StringByteSequence operator [_]

Doesn't that already work? The problem this issue is raising is that StringByteSequence is being used for readBytes where-as the String (with it's indexing behavior) is being used to advance the cursor.

readByte itself can be fixed by buffering each character and only advancing after all the bytes in the character are read:

  static readByte() {
    return read_ {
      if (__char==null) {
        __offset = 0
        __char = __buffered[0]
        __max = __char.bytes.count
      }
      // Peel off the next byte.
      var byte = __char.bytes[__offset]
      __offset = __offset + 1

      if (__offset == __max) {
        __char = null
        __buffered = __buffered[1..-1]
      }

      return byte
    }
  }

But then you're left with specifying what the correct behavior of readLine should be... or readCodePoint as I suggested also.

@mhermier
Copy link
Contributor Author

That does not work, StringByteSequence operator [_] Does not work with range because smart index returns the byte as Num.

Yup, this is a mess from the start, I don't know what to think. As this your code only works on valid UTF-8. For correctness and efficiency you should store __buffered.bytes and peel off bytes from it.

@joshgoebel
Copy link
Contributor

joshgoebel commented Apr 25, 2021

StringByteSequence operator [_] Does not work with range because smart index returns the byte as Num.

I guess I'm not sure why we need it to? It would really help if we back-up and first specified the correct/desired behavior here otherwise we're just chasing our tails without end. How are readBytes and readLines intended to interact in this edge case? My assumptions:

  • readBytes always returns the next byte from the string, encoding is irrelevant
  • readLines returns native UTF-8 strings because that's our native String storage implementation

. For correctness and efficiency you should store __buffered.bytes and peel off bytes from it.

You're talking about the case where our input is just random gibberish? That this sort of relies on the input being UTF-8 at core...? I went with that because I feel like the underlying string implementation does that anyways, but I guess that's not 100% true if + is a memcpy...

SO... lets go back and figure out the correct behavior first. I get the impression the String semantics of Wren have been designed with intention (even if they are confusing) and that the correct solution here may not be to just break String, effectively turning it into a byte sequence. :-) So:

// first input some UTF-8 characters to STDIN the buffer
// then
Stdin.readByte() 
// we're now in the "middle" of a UTF-8 character
Stdin.readLine() 

What is readLine() expected to return here exactly? Or should it error? OTTOMH I feel like it should error and someone wanting to mix UTF-8 and bytes (building a protocol layer, etc) should do so very carefully. It seems more common that one would be reading most input via bytes or lines, rather than mixing.

@mhermier
Copy link
Contributor Author

mhermier commented Apr 26, 2021 via email

@mhermier
Copy link
Contributor Author

mhermier commented Apr 26, 2021 via email

@joshgoebel
Copy link
Contributor

I may have to bow out of this one. I thought I could solve it by storing a list of strings/packets as they arrive, but you end up left with all sorts of edge cases... what if a line is spread across multiple packets, what if a codePoint is? You can lazy join the packets before every read, but how many packets? Worse case you might scan the WHOLE input stream looking for a line ending you never find. And compiling strings into bigger ones just makes it harder to advance the cursor since we have no great way to do that one byte at a time... (while also GCing the old data you've moved past)

The simplest thing I tried was just storing the stream as a List of bytes... but this is slow because of all the conversions and also the cost of removing items from the beginning of a list. You might say that doesn't matter for typing in the Repl - but if you wanted to do string processing on a huge input file you were piping in suddenly it would matter a lot.

The fastest (and simplest) way to do this might just be using a circular buffer on the C side... store the data as literal bytes in RAM, append as needed... never any need to copy, transform, convert to or from lists. "Truncating" the beginning of the buffer is instant as it's just updating the head to point to the new location in the buffer. The most expensive operation then would become readCodePoint which requires building UTF-8 characters/codes.

Thats my thought for now. Happy to discuss further with anyone who wanted to pick this up and run with it.


If we only care about solving the exact issue filed here you could accomplish that with:

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join() 

This would be far more correct behavior... at the cost of rebuilding the whole buffer every time a character is read...

@mhermier
Copy link
Contributor Author

I think we can reduce the problem by giving a max size to readLine(), so in code it should be something like:

readLine() { readLine(1024) }
readLine(maxSize) { ... }

That way you upper bound lookup and buffering.

Making a circular buffer, would also requires such upper bound as the size of the buffer. So API wise, it would not change anything as the naive implementation. So we can go the naive implementation for now to make it work, and if it really become a bottle neck, we could go to a dedicated ring buffer.

@joshgoebel
Copy link
Contributor

Scoping readLine indeed helps reduce worse case on one of the edge cases. :-)

Making a circular buffer, would also requires such upper bound as the size of the buffer.

Of course, but the bigger win is in all the other things you'd get for free. Anyways I just burnt out on this one. So many other things I can better contribute to. Looking forward to seeing if this can truly be solved simply and if so how well that will hold up over time. :-)

@joshgoebel
Copy link
Contributor

__buffered = __buffered[1..-1]

Ruby, out of curiosity just how performant was the original to begin with? Is it smart enough to do a memcpy of the original string when given a range? I think because of UTF-8 it has to walk the whole string character by character, yes?

If so then perhaps the "simple fix" posted above might not even be that terrible of a performance regression.

@mhermier
Copy link
Contributor Author

The performance is poor, as it was only really used in the cli. So it is not really important to be efficient for now, since most of the effort is still on the VM, but yet we need something to be functional for the cli to show case.

@joshgoebel
Copy link
Contributor

Well if it was already quite slow as-is then I'd recommend we just patch it as I suggested above and close out this issue. It'll indeed be fine enough either way for reading CLI input from the keyboard.

I can still make a PR if I can get a confirmation everyone is on board with that solution.

@ruby0x1
Copy link
Member

ruby0x1 commented Apr 30, 2021

Slow is relative, it's probably not slow in the sense you're expecting. It's not the fastest it possibly could be but that doesn't mean performance is poor.

@joshgoebel
Copy link
Contributor

joshgoebel commented Apr 30, 2021

Slow is relative, it's probably not slow in the sense you're expecting.

Actually, it is slow in the sense I'm expecting. 🙃 I wasn't suggesting the performance was inadequate. I was meaning/expecting "in an absolute sense" (compared to C optimized code) not whether it's "fast enough" practically speaking.

My point was only IF it was already super optimized on the C side that I'd be more hesitant to drop in a change that then made it 20x slower. BUT since that's not the case I'll again suggest the tiny patch of just converting to bytes, shifting the data and converting back to a string.

Keeping in mind the actual bug/issue here is that we're reading bytes, but shifting by UTF-8. I feel like I may have gone down a rabbit hole and expanded the scope of the discussion too much originally vs fixing the issue and perhaps creating follow-ups for the larger concerns.

The bad code:

var byte = __buffered.bytes[0] // This operates on bytes
__buffered = __buffered[1..-1] // This operates on UTF8 strings

__buffered = __buffered.bytes[1..-1].map { |x| String.fromByte(x) }.join() 

Do you have any specific thoughts on that as a fix?

@mhermier
Copy link
Contributor Author

mhermier commented May 1, 2021

Maybe we could add a String.bytesAt(index_or_range) or alike, that would return a new string but that would operate on bytes?
It would not solve everything, but would allow a fast byte array manipulations, and would have a low impact on the API.

@joshgoebel
Copy link
Contributor

I think that would certainly make some sense, though what to name it is a question. bytesAt to me doesn't make me think I'm getting back a string.

@mhermier
Copy link
Contributor Author

mhermier commented May 1, 2021

Well thing is that I don't find a more sensible name to it, and have nothing more to propose for now.

The other possibility would be to make it a StringByteSequence method, but it would cause a double allocation (String and the new StringByteSequence).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants