Strange behaviour of string(:printable) with min_length opt given
Hi! I'm using stream_data to test my Ecto schemas, but I ran into a small problem concerning min_length opt given to StreamData.string(:printable).
property "string(:printable) min_length" do
# While this works fine, giving 41,41,41,...,44,45,42,43,44,47,44,...
check all s <- string(:alphanumeric, min_length: 41) do
IO.inspect(String.length(s))
assert String.length(s) >= 41
end
# This returns 41,41,40,40,...39,40.
check all s <- string(:printable, min_length: 41) do
IO.inspect(String.length(s))
assert String.length(s) >= 41
end
end
Error:
1) property string string(:printable) min_length (StringTest)
test/string_test.exs:6
Failed with generated values (after 4 successful runs):
* Clause: s <- string(:printable, min_length: 41)
Generated: "๐๐๐๐๐๊ขด๑ญบ๑
ค๐ธ๒ซฑ๒๑งธ๓ซ๓ฃชฃ๒ตงธ๒ ๐ฏฌ๓ฝณท๓ซ๒ฐนก๓ณ๒ญฒ ๑ช๑พ๐๓ ฑบ๑
ฌ๓ฌบ๑ฝพ๐ฌ๑๑ฆชฅ๒ฃ๒ฟ๑ชฒฒ๓ญ ๐ขข๑๔ ฎ๑ฑ๒"
Assertion with >= failed
code: assert String.length(s) >= 41
left: 40
right: 41
stacktrace:
test/string_test.exs:15: anonymous fn/2 in StringTest."property string string(:printable) min_length"/1
(stream_data) lib/stream_data.ex:2063: StreamData.shrink_failure/6
(stream_data) lib/stream_data.ex:2027: StreamData.check_all/7
test/string_test.exs:13: (test)
PS> Off topic. Don't want to spam with feature requests, but is it possible to add invert function that reverts generators? So invert(string(:alphanumeric, min_length: 10)) gives us a one_of([string(:alphanumeric, max_length: 9), string(:printable) |> filter(&(not String.match?(&1, ~r/^[0-9A-Za-z]+$/)))])? At the first glance It is always set-theoretically possible and may be very handy.
I'll investigate but my guess is String.length/1 which counts codepoints (while :min_length is probably for bytes). If you want you can try to see if byte_size instead of String.length behaves correctly.
PS> Off topic
Please open another issue for discussing this.
I thought so too at first, but here is what I got:
string_length: 40
byte_size: 160
string_length: 39
byte_size: 160
string_length: 41
byte_size: 163
string_length: 41
byte_size: 161
string_length: 40
byte_size: 160
So, for 40-symbol codepoints string it is around 160 bytes.
property "string(:printable) min_length" do
# While this works fine, giving 41,41,41,...,44,45,42,43,44,47,44,...
check all s <- string(:alphanumeric, min_length: 41) do
IO.inspect(String.length(s))
assert String.length(s) >= 41
end
check all s <- string(:printable, min_length: 41) do
IO.inspect(String.length(s), label: :string_length)
IO.inspect(byte_size(s), label: :byte_size)
assert String.length(s) >= 41
end
end
String.length considers graphemes and not codepoints, so maybe thatโs it?
Josรฉ Valimwww.plataformatec.com.br http://www.plataformatec.com.br/Founder and Director of R&D
The thing is that we are generating a list of :min_length elements, so that part is correct, the problem is that each element in the list can be multiple bytes long. For example, 0xFFFD is 3 bytes long. So our strategy is already "off" in some way, because we're not guaranteeing that String.length/1 will be more than :min_length (because of graphemes I suspect, as pointed out by @josevalim) but also we're saying that byte_size/1 will probably be much bigger because on average the codepoints we use are several bytes long. Not sure how to approach the problem yet.
Yes, as the @josevalim pointed it's because of the difference between counting codepoints and counting graphems.
codepoints_length: 41
string_length: 41
byte_size: 159
codepoints_length: 41
string_length: 41
byte_size: 162
(!) codepoints_length: 41
(!) string_length: 40
byte_size: 163
defp codepoints_length(<<_::utf8, rest::binary>>, acc), do: codepoints_length(rest, acc + 1)
defp codepoints_length(<<_, rest::binary>>, acc), do: codepoints_length(rest, acc + 1)
defp codepoints_length(<<>>, acc), do: acc
property "string(:printable) min_length" do
check all s <- string(:printable, min_length: 41) do
IO.inspect(codepoints_length(s, 0), label: :codepoints_length)
IO.inspect(String.length(s), label: :string_length)
IO.inspect(byte_size(s), label: :byte_size)
assert String.length(s) >= 41
end
end
For me, personally, this problem arose when I tested my Ecto schemas (similar to the example in https://github.com/whatyouhide/stream_data/issues/100). In validator I had validate_length(max: 40) and I was very-very surprised when string(:printable, min_length: 41) gave me an Ecto-valid schema. But the thing is that Ecto.validate_length has a special option count that defaults to :graphems (but could be :codepoints). So for me now the bypass is to provide count: :codepoints option to all the validators :)
Maybe we could add count option to the &StreamData.string/2 similar to the one used in &Ecto.Changeset.validate_length/2? This way user could generate codepoints-counted and graphems-counted strings.