Function to take a number of random values from a list

Open sobolevn opened this issue 1 year ago • 10 comments

I started looking at the state of random in Gleam and I have several ideas.

  1. int.random and float.random do not have seeds. This might be a problem for libraries. For example, a fake-data library for tests needs to generate the same data from a given test seed value; otherwise, you won't have reproducible failures. See https://docs.python.org/3/library/random.html#notes-on-reproducibility
  2. There's no int.random_range(x, y) function, which would generate an int in the range >= x, < y. I think that this function is essential.
  3. There's no choice(List) function, which in my experience is the second most frequently used random function in Python.
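
For comparison, here is a minimal Python sketch of the three behaviours listed above: a seeded generator, a half-open integer range, and a single-element choice. The function names `fake_ages` and `pick_one` are hypothetical and only illustrate the kind of API being asked for; nothing here is an existing or proposed Gleam function.

```python
import random


def fake_ages(seed: int, count: int) -> list[int]:
    """Generate reproducible fake data: the same seed gives the same list."""
    rng = random.Random(seed)  # private generator, independent of global state
    # randrange(18, 90) yields an int with 18 <= value < 90
    return [rng.randrange(18, 90) for _ in range(count)]


def pick_one(seed: int, items: list):
    """Pick a single element from a non-empty list, reproducibly."""
    return random.Random(seed).choice(items)
```

Calling `fake_ages` twice with the same seed returns the same list, which is what makes a failing test repeatable.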

Maybe we should add random module with the needed functions and a nice api for Seed(Int)?

sobolevn avatar Aug 22 '24 06:08 sobolevn

For that there's the prng package https://hex.pm/packages/prng

giacomocavalieri avatar Aug 22 '24 06:08 giacomocavalieri

Maybe prng could have random_range and choice added?

inoas avatar Aug 22 '24 07:08 inoas

No plans for 1 and 2 at present. 3 sounds good 👍

lpil avatar Aug 22 '24 08:08 lpil

Hm, the gleam-@alias idea reappears for me, where we could document and help people find functions when they come from other stdlibs such as Python's, PHP's and JavaScript's.

inoas avatar Aug 22 '24 08:08 inoas

Isn't this basically list.shuffle |> list.take(n)?

I suppose this can be made more efficient than shuffling the full list, if we have a large list and only want to pick a few random elements, but it could be a good first version.

shuffle runs a fold that generates a random number for each element of the list, then sorts by those numbers and iterates over all elements to remove them. take runs in linear time.
A possible improvement would be to iterate over only the first n elements after the sort, which would avoid one full iteration of the list.

A possible version without the sort would generate a new random int every time, that is at most the length of the list, then pop the element at that position into an accumulator and repeat until it has n elements (or the list is empty).
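
The two approaches described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `sample_by_shuffle` and `sample_by_popping` are hypothetical, and a Python list stands in for a Gleam linked list, so the costs differ from what Gleam would see.

```python
import random


def sample_by_shuffle(items, n, rng):
    """Shuffle a copy of the whole list, then keep the first n elements."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:max(n, 0)]


def sample_by_popping(items, n, rng):
    """Repeatedly remove a random element until n are picked or none remain."""
    remaining = list(items)
    picked = []
    while remaining and len(picked) < n:
        index = rng.randrange(len(remaining))  # 0 <= index < len(remaining)
        picked.append(remaining.pop(index))
    return picked
```

On a linked list, removing the element at a random position takes time proportional to the list length, so the second version does roughly n passes over the list instead of one sort.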

Varpie avatar Aug 22 '24 21:08 Varpie

There are algorithms for random sampling from a linked list. I haven't done the research to say how each approach compares.

lpil avatar Aug 22 '24 22:08 lpil

Reservoir sampling would be the obvious choice, but the implementations I'm familiar with index into arrays, which wouldn't work here since it would have to work with lists instead. I guess you could use a dictionary instead of an array as the reservoir, but there's probably a better approach out there.
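
A rough Python sketch of that idea, using basic reservoir sampling (often called Algorithm R) with a dictionary keyed by slot number as the reservoir. The function name `reservoir_sample` is hypothetical, and this is not what any existing library does; it only shows that the algorithm needs "replace the value in slot k" rather than a real array.

```python
import random


def reservoir_sample(items, n, rng):
    """Take up to n random elements in a single pass over items."""
    reservoir = {}  # slot number -> element, standing in for an array
    for seen, item in enumerate(items):
        if seen < n:
            # Fill the reservoir with the first n elements.
            reservoir[seen] = item
        else:
            # Pick a slot uniformly from 0..seen; keep the item only if
            # the slot falls inside the reservoir.
            slot = rng.randrange(seen + 1)
            if slot < n:
                reservoir[slot] = item
    return list(reservoir.values())
```

Every element ends up in the result with the same probability, and the input is traversed only once, which suits a linked list.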

apainintheneck avatar Sep 06 '24 06:09 apainintheneck

Oops, misclick.

How about we copy whatever Elixir or Elm or some other similar language does?

lpil avatar Sep 10 '24 15:09 lpil

Good point. It's worth taking a look at what other languages do in this area.

The Elixir standard library has the Enum.take_random function, which now uses a modified reservoir sampling algorithm for performance reasons (relevant commit). Internally it uses a tuple as the reservoir in place of the traditional fixed-length array.

I looked at Elm and couldn't find any relevant functions in the standard library or packages.

A few Haskell libraries had implemented versions of it which used things like IntMap as the reservoir data structure.

None of this really helps us but it's still interesting.

apainintheneck avatar Sep 14 '24 01:09 apainintheneck

Are y'all wanting a multiple sample like Python's random.sample(List, Int) or just a single element sample like random.choice(List)?

ethanthoma avatar Nov 19 '24 23:11 ethanthoma

We want a function which takes a specified number of values from a list.

lpil avatar Nov 20 '24 14:11 lpil

I was looking at implementations, and Algorithm L seems optimal (I could be wrong, but it seems similar to what Elixir does). However, it requires taking the natural log, which I couldn't find an implementation of in the stdlib. What's the ideal solution: a slower implementation that doesn't need the natural log, or adding natural log to gleam/float?
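
For reference, a minimal Python sketch of Algorithm L, showing where the natural log comes in. The names `algorithm_l_sample` and `_open_unit` are hypothetical, and the reservoir here is a plain Python list mutated at a random index, which Gleam lists do not offer in constant time, so a Gleam version would need a different reservoir structure.

```python
import itertools
import math
import random

_END = object()  # sentinel marking an exhausted input


def _open_unit(rng):
    """Return a uniform float strictly between 0 and 1, so log() is defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def algorithm_l_sample(items, n, rng):
    """Take up to n random elements, skipping ahead instead of testing each one."""
    if n <= 0:
        return []
    iterator = iter(items)
    reservoir = list(itertools.islice(iterator, n))
    if len(reservoir) < n:
        return reservoir  # fewer than n elements available
    just_below_one = math.nextafter(1.0, 0.0)  # keeps log1p(-w) finite
    w = min(math.exp(math.log(_open_unit(rng)) / n), just_below_one)
    while True:
        # Number of elements to skip before the next replacement.
        skip = math.floor(math.log(_open_unit(rng)) / math.log1p(-w))
        candidate = next(itertools.islice(iterator, skip, None), _END)
        if candidate is _END:
            return reservoir
        reservoir[rng.randrange(n)] = candidate  # overwrite a random slot
        w *= math.exp(math.log(_open_unit(rng)) / n)
```

The two `math.log` calls per replacement are the reason the algorithm needs a natural log function.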

ethanthoma avatar Nov 20 '24 18:11 ethanthoma

Sounds like a good reason to add that function to me! Unless anyone else has any other suggestions.

It looks like that algorithm wants array mutation at a random index. How would you do that in Gleam, given we don't have constant time indexing or array mutation?

lpil avatar Nov 21 '24 11:11 lpil

I will try something and let you know what I come up with!

ethanthoma avatar Nov 27 '24 19:11 ethanthoma

Should I make a separate PR just for natural log? I think it warrants inclusion irrespective of this issue

ethanthoma avatar Nov 27 '24 19:11 ethanthoma

Same PR for both please 🙏

lpil avatar Nov 28 '24 12:11 lpil