stream_data icon indicating copy to clipboard operation
stream_data copied to clipboard

Strange behaviour of string(:printable) with min_length opt given

Open volnyvolnyvolny opened this issue 7 years ago โ€ข 5 comments

Hi! I'm using stream_data to test my Ecto schemas, but I ran into a small problem concerning min_length opt given to StreamData.string(:printable).

property "string(:printable) min_length" do
  # While this works fine, giving 41,41,41,...,44,45,42,43,44,47,44,...
  check all s <- string(:alphanumeric, min_length: 41) do
    IO.inspect(String.length(s))
    assert String.length(s) >= 41
  end

  # This returns 41,41,40,40,...39,40.
  check all s <- string(:printable, min_length: 41) do
    IO.inspect(String.length(s))
    assert String.length(s) >= 41
  end
end

Error:

1) property string string(:printable) min_length (StringTest)
   test/string_test.exs:6
   Failed with generated values (after 4 successful runs):
 
       * Clause:    s <- string(:printable, min_length: 41)
         Generated: "๐€€๐€€๐€€๐€€๐€€๊ขด๑ญบ๑…‚ค๐œšธ๒ซฑž๒‘‘๑„งธ๓ซ‹Ž๓ฃชฃ๒ตงธ๒™ ˜๐ฏœฌ๓ฝณท๓Žซ๒ฐนก๓‘ณƒ๒ญฒ ๑—ช™๑€Œพ๐–œ๓ ฑบ๑…–ฌ๓”ฌบ๑ฝ‡พ๐ฌ†Ÿ๑˜‚‘๑ฆชฅ๒–€ฃ๒ฟ–๑ชฒฒ๓™ญ ๐›ขข๑‰—’๔Š ฎ๑ฑ‰๒„“"
 
   Assertion with >= failed
   code:  assert String.length(s) >= 41
   left:  40
   right: 41
   stacktrace:
     test/string_test.exs:15: anonymous fn/2 in StringTest."property string string(:printable) min_length"/1
     (stream_data) lib/stream_data.ex:2063: StreamData.shrink_failure/6
     (stream_data) lib/stream_data.ex:2027: StreamData.check_all/7
     test/string_test.exs:13: (test)

PS> Off topic. Don't want to spam with feature requests, but is it possible to add invert function that reverts generators? So invert(string(:alphanumeric, min_length: 10)) gives us a one_of([string(:alphanumeric, max_length: 9), string(:printable) |> filter(&(not String.match?(&1, ~r/^[0-9A-Za-z]+$/)))])? At the first glance It is always set-theoretically possible and may be very handy.

volnyvolnyvolny avatar Apr 18 '18 12:04 volnyvolnyvolny

I'll investigate but my guess is String.length/1 which counts codepoints (while :min_length is probably for bytes). If you want you can try to see if byte_size instead of String.length behaves correctly.

PS> Off topic

Please open another issue for discussing this.

whatyouhide avatar Apr 18 '18 16:04 whatyouhide

I thought so too at first, but here is what I got:

string_length: 40
byte_size: 160

string_length: 39
byte_size: 160

string_length: 41
byte_size: 163

string_length: 41
byte_size: 161

string_length: 40
byte_size: 160

So, for 40-symbol codepoints string it is around 160 bytes.

property "string(:printable) min_length" do
   # While this works fine, giving 41,41,41,...,44,45,42,43,44,47,44,...
   check all s <- string(:alphanumeric, min_length: 41) do
     IO.inspect(String.length(s))
     assert String.length(s) >= 41
   end

   check all s <- string(:printable, min_length: 41) do
      IO.inspect(String.length(s), label: :string_length)
      IO.inspect(byte_size(s), label: :byte_size)
      assert String.length(s) >= 41
   end
end

volnyvolnyvolny avatar Apr 18 '18 23:04 volnyvolnyvolny

String.length considers graphemes and not codepoints, so maybe thatโ€™s it?

Josรฉ Valimwww.plataformatec.com.br http://www.plataformatec.com.br/Founder and Director of R&D

josevalim avatar Apr 19 '18 06:04 josevalim

The thing is that we are generating a list of :min_length elements, so that part is correct, the problem is that each element in the list can be multiple bytes long. For example, 0xFFFD is 3 bytes long. So our strategy is already "off" in some way, because we're not guaranteeing that String.length/1 will be more than :min_length (because of graphemes I suspect, as pointed out by @josevalim) but also we're saying that byte_size/1 will probably be much bigger because on average the codepoints we use are several bytes long. Not sure how to approach the problem yet.

whatyouhide avatar Apr 19 '18 11:04 whatyouhide

Yes, as the @josevalim pointed it's because of the difference between counting codepoints and counting graphems.

codepoints_length: 41
string_length: 41
byte_size: 159

codepoints_length: 41
string_length: 41
byte_size: 162

(!)    codepoints_length: 41
(!)    string_length: 40
byte_size: 163


defp codepoints_length(<<_::utf8, rest::binary>>, acc), do: codepoints_length(rest, acc + 1)
defp codepoints_length(<<_, rest::binary>>, acc), do: codepoints_length(rest, acc + 1)
defp codepoints_length(<<>>, acc), do: acc

property "string(:printable) min_length" do
  check all s <- string(:printable, min_length: 41) do
    IO.inspect(codepoints_length(s, 0), label: :codepoints_length)
    IO.inspect(String.length(s), label: :string_length)
    IO.inspect(byte_size(s), label: :byte_size)
    assert String.length(s) >= 41
  end
end

For me, personally, this problem arose when I tested my Ecto schemas (similar to the example in https://github.com/whatyouhide/stream_data/issues/100). In validator I had validate_length(max: 40) and I was very-very surprised when string(:printable, min_length: 41) gave me an Ecto-valid schema. But the thing is that Ecto.validate_length has a special option count that defaults to :graphems (but could be :codepoints). So for me now the bypass is to provide count: :codepoints option to all the validators :)

Maybe we could add count option to the &StreamData.string/2 similar to the one used in &Ecto.Changeset.validate_length/2? This way user could generate codepoints-counted and graphems-counted strings.

volnyvolnyvolny avatar Apr 19 '18 12:04 volnyvolnyvolny