foundry icon indicating copy to clipboard operation
foundry copied to clipboard

Performance Problem with random

Open amyAMeQ opened this issue 1 year ago • 25 comments

While diagnosing why some of our tests are so slow, I discovered the culprit for us is with random(). We have a function:

public static function randomApiUser(): ApiUser
    {
        $user = self::random()->object();
        return new ApiUser($user);
    }

And it was making tests take five times as long as when we didn't use it.

After poking around, I noticed that rather than getting X random records, it's getting ALL of them and then getting random ones from that.

public function randomRange(int $min, int $max, array $attributes = []): array
    {
        if ($min < 0) {
            throw new \InvalidArgumentException(\sprintf('$min must be positive (%d given).', $min));
        }

        if ($max < $min) {
            throw new \InvalidArgumentException(\sprintf('$max (%d) cannot be less than $min (%d).', $max, $min));
        }

        $all = \array_values($this->findBy($attributes));

        \shuffle($all);

        if (\count($all) < $max) {
            throw new \RuntimeException(\sprintf('At least %d "%s" object(s) must have been persisted (%d persisted).', $max, $this->getClassName(), \count($all)));
        }

        return \array_slice($all, 0, \random_int($min, $max)); // @phpstan-ignore-line
    }

I think it would be more performant if the code used something like this to just get random records from the database (with the appropriate attribute filters):

SELECT * 
FROM table_name
ORDER BY RAND()
LIMIT 1;

I am willing to do a PR for this if you would like.

amyAMeQ avatar Jun 06 '24 19:06 amyAMeQ

Hi @amyAMeQ

thanks for the report, you're totally right about this, it could be improved.

Problem is: I think it is not the same syntax between mysql (ORDER BY RAND()) and postgresql (ORDER BY RANDOM())

We're about to release Foundry 2.0, certainly the next week. I'd rather this to be fixed in 2.0 than in 1.x

nikophil avatar Jun 07 '24 07:06 nikophil

@nikophil - OK. I will pull the 2.0 branch and play locally. Hoping I can find a way to do it in a database agnostic way.

amyAMeQ avatar Jun 07 '24 13:06 amyAMeQ

you'll find an upgrade guide here: https://github.com/zenstruck/foundry/blob/1.x/UPGRADE-2.0.md (not 100% up to date)

BTW don't hesitate to report any problems you'd find in 2.x branch :sweat_smile:

nikophil avatar Jun 07 '24 13:06 nikophil

Thanks for this report @amyAMeQ, I actually remember when writing that code that this could be a problem but decided to wait until someone reported. Here we are ;)

kbond avatar Jul 24 '24 20:07 kbond

Hoping I can find a way to do it in a database agnostic way

sadly that's not possible. Doctrine does not ship any way to proceed a random. You'll have to add some new DQL function to do this, but we don't want to do this is in Foundry.

then we need to do it with raw SQL, but ordering by rand is slightly different whether if you use MySQL or Postgres.... or Mongo which we actually support! IMO, this would introduce too much complexity....

The first solution that comes in my mind, which is a trade-off, may be to use the query builder (maybe we'll need to split orm and odm behaviors), and select all ids, and then only take few randomly. This is not ideal, but it would prevent some unneeded complexity.

nikophil avatar Dec 12 '24 15:12 nikophil

Aside from ordering by RAND(), I think a different (and possibly more database agnostic) approach would be to first COUNT() the number of rows, then pick a random number between 1 and that number and then finally using LIMIT <random number>,1 to get that specific row.

At least for MySQL the first argument of LIMIT is the offset, the second is the number of rows to return.

This would potentially be much faster than a heavy ORDER operation.

But again, I have not looked into MongoDB.

--

Ok sadly for Postgres the syntax is similar but different:

MySQL: LIMIT 10,1 Postgres: LIMIT 1 OFFSET 10

But maybe there is a clean way we can implement this. The thing is that the functionality is working as is, just doesn't perform very well. So maybe we can add a configuration option that enables this performance optimization, but with the side note that this requires some user setup (i.e. configuring Doctrine). Then it would be backwards compatible but people that really need this, can decide to benefit from the performance optimization.

mdeboer avatar Apr 04 '25 12:04 mdeboer

hey @mdeboer

this sounds like a good idea, do you think it is doable with doctrine/persistence only? we're abstracting away the DBMS (Mongo included) thanks to doctrine/persistence, this would be awesome that we could do this in a db agnostic way

nikophil avatar Apr 04 '25 12:04 nikophil

@nikophil I will have a look this weekend! 👍🏻

mdeboer avatar Apr 04 '25 12:04 mdeboer

I just saw your edit: isn't doctrine/persistence + dbal helps with this? here is the prototype of ObjectRepositroy::findBy():

    public function findBy(
        array $criteria,
        ?array $orderBy = null,
        ?int $limit = null,
        ?int $offset = null
    );

it actually has limit + offset parameters, I think your approach is ok for any DBs!

nikophil avatar Apr 04 '25 12:04 nikophil

Ah yes, I had a suspicion it was in there as it is such a trivial task, but wasn't sure. That will make things a lot easier.

But I will spend some time on it over the weekend, to see what I can come up with. Probably do a benchmark as well.

mdeboer avatar Apr 04 '25 12:04 mdeboer

A benchmark which can be reused from time to time to ensure there is not a performance issue before a release would be definitely very useful!

nikophil avatar Apr 04 '25 13:04 nikophil

I mean, it could be a lot of queries but this could dramatically reduce memory usage.

kbond avatar Apr 04 '25 14:04 kbond

I mean, it could be a lot of queries

how I understand it, it would be one extra query per "random" call, but I think this is worth it!

nikophil avatar Apr 04 '25 15:04 nikophil

Oh right. I forgot we only return a single entity here.

kbond avatar Apr 04 '25 15:04 kbond

hum it could be challenging for randomRange() maybe 🤔

nikophil avatar Apr 04 '25 15:04 nikophil

Another thing to consider is that for example we use this in our tests and we usually seed the RNG to get random yet consistent results. This would not be possible with ORDER BY RAND() as far as I know.

I'm pulling up the code now to have a look at random() and randomRange() :)

mdeboer avatar Apr 06 '25 13:04 mdeboer

Alright, so fixing this for random() (one random entity) is easy and yields a nice improvement. But I haven't spend much time working on randomSet() and randomRange(), they seem a bit harder to implement at first glance.

Most time went to setting up phpbench but I got it working. Writing benchmarks is pretty similar to writing either unit or integration tests now.

Anyway, here are the results for random() I got with ORM so far. Comparing the original implementation in the 2.x repository with my implementation I explained before (using the random offset):

PHPBench (1.4.1) running benchmarks... #standwithukraine
with configuration file: /home/maarten/Projects/Forks/foundry/phpbench.json
with PHP version 8.4.5, xdebug ❌, opcache ❌
comparing [actual vs. before]

\Zenstruck\Foundry\Tests\Benchmark\ORM\GenericEntityFactoryBench

    bench_random # 1........................I9 - [Mo2.111ms vs. Mo2.020ms] +4.48% (±3.89%)
    bench_random # 50.......................I9 - [Mo2.464ms vs. Mo3.504ms] -29.68% (±6.08%)
    bench_random # 500......................I9 - [Mo2.738ms vs. Mo11.773ms] -76.75% (±9.23%)
    bench_random # 1000.....................I9 - [Mo3.515ms vs. Mo35.876ms] -90.20% (±12.87%)

Subjects: 1, Assertions: 0, Failures: 0, Errors: 0
+---------------------------+--------------+------+------+-----+------------------+-----------------+-----------------+
| benchmark                 | subject      | set  | revs | its | mem_peak         | mode            | rstdev          |
+---------------------------+--------------+------+------+-----+------------------+-----------------+-----------------+
| GenericEntityFactoryBench | bench_random | 1    | 1    | 10  | 36.310mb +0.00%  | 2.111ms +4.48%  | ±3.89% -2.19%   |
| GenericEntityFactoryBench | bench_random | 50   | 1    | 10  | 37.108mb +0.00%  | 2.464ms -29.68% | ±6.08% -59.30%  |
| GenericEntityFactoryBench | bench_random | 500  | 1    | 10  | 45.293mb -1.22%  | 2.738ms -76.75% | ±9.23% -74.94%  |
| GenericEntityFactoryBench | bench_random | 1000 | 1    | 10  | 59.239mb -10.32% | 3.515ms -90.20% | ±12.87% -47.94% |
+---------------------------+--------------+------+------+-----+------------------+-----------------+-----------------+

Depending on the number of items in the repository, performance is increased by 30 to 90%! 😍 Also the stability in speed is improved considerable (50-75%).

You can find my changes here: https://github.com/mdeboer/foundry/tree/feat/speedup-random

If you like I can make a PR 👍🏻

mdeboer avatar Apr 06 '25 17:04 mdeboer

Hi @mdeboer

this is very cool! We were actually talking with @kbond the we would like to add a CI check about performances, using PHPBench, maybe based on this example

Would you mind to create two PRs? one which just fixes the random stuff, and another one with your work on PHPBench?

If you're willing to introduce the GitHub Action, this is super cool, but we can also handle this ourselves, based on your work

Alright, so fixing this for random() (one random entity) is easy and yields a nice improvement. But I haven't spend much time working on randomSet() and randomRange(), they seem a bit harder to implement at first glance.

We can work on randomRange() another time.

Another thing to consider is that for example we use this in our tests and we usually seed the RNG to get random yet consistent results. This would not be possible with ORDER BY RAND() as far as I know

makes totally sense.

thanks!

nikophil avatar Apr 06 '25 17:04 nikophil

about randomRange(): I'd really like that we keep only relying on doctrine/persistence.

Based on this constraint, I can only see half solutions, both being trade-offs:

  • we can use what you've done with random() and perform only one query, and change the limit... This would be kinda fake random, but I think this is ok-ish for testing purpose
  • we can perform as many queries as requested objects, at the price of performance (but still be better than the current implementation), but this would give real randomness

nikophil avatar Apr 06 '25 17:04 nikophil

Hey @nikophil that sounds great! I will create twos separate PRs in that case.

Will probably leave the GitHub action to you - I have experience with CI but not with GH and not too much time unfortunately.

mdeboer avatar Apr 06 '25 17:04 mdeboer

yeah, this is perfectly understandable, what you've already done will really help, and have a perfect timing :)

nikophil avatar Apr 06 '25 18:04 nikophil

Alright, got some time tonight to do it 👍🏻

mdeboer avatar Apr 08 '25 11:04 mdeboer

There we go, I created two PRs. I see that at least one of them has some psalm issues but one of the reasons being that phpbench is not installed there. But alright, maybe best to discuss that there and not pollute this issue.

mdeboer avatar Apr 08 '25 19:04 mdeboer

@amyAMeQ just curious, is this still an issue for you?

mdeboer avatar Aug 19 '25 11:08 mdeboer

I think randomRange() is still problematic: https://github.com/zenstruck/foundry/blob/2.x/src%2FPersistence%2FRepositoryDecorator.php#L219

See https://github.com/zenstruck/foundry/issues/612#issuecomment-2781526431

nikophil avatar Aug 20 '25 16:08 nikophil