random cdisc data very slow for larger data
Original message
Running the following code takes a long time! This is on r.roche.com, R 3.6.3.
<REDACTED>NEST/nest_on_bee/master/bee_nest_utils.R")
bee_use_nest(release = "2021_05_05")
ADSL <- radsl(N = 1002)
ADLB <- radlb(ADSL)
I reduced N from 15000 because that took way too long. Using system.time, I get the following results:
   user  system elapsed
 37.852   0.584  38.436
This is an extremely long time to make a dataset with 21,000 records! I know random.cdisc.data really only exists for dummy data, but this seems like extremely poor performance.
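To see whether the cost grows linearly with the number of subjects, the same system.time measurement can be repeated at a few sizes. The sketch below is only an illustration: `slow_gen` is a hypothetical stand-in that grows a data frame inside a loop, not the actual radlb code, and the sizes are arbitrary.

```r
# Hypothetical slow generator: appends one block of rows per subject with
# rbind(), a common cause of poor scaling. This is NOT the real radlb().
slow_gen <- function(n) {
  out <- NULL
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(id = i, aval = rnorm(21)))
  }
  out
}

# Time the generator at several sizes and keep only the elapsed seconds.
sizes <- c(50, 100, 200)
elapsed <- vapply(
  sizes,
  function(n) system.time(slow_gen(n))[["elapsed"]],
  numeric(1)
)

# If sec_per_subject rises with n, the cost is growing faster than linearly.
timings <- data.frame(n = sizes, elapsed = elapsed, sec_per_subject = elapsed / sizes)
print(timings)
```

Swapping `slow_gen` for a wrapper around radsl and radlb would give the equivalent table for the real functions.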
Provenance:
Creator: martik32
TODO
Improve performance. A few suggestions:
- use mclapply
- use data.table if necessary
See: internal_github_url/NEST/random.cdisc.data/issues/242 - I suspect there are a lot of places which could be improved
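A minimal sketch of how the two suggestions could fit together, under stated assumptions: `make_subject_rows` is a hypothetical stand-in for the per-subject work, and the parameter codes, visit count, and core count are invented for illustration. None of this is taken from the random.cdisc.data source. The idea is to build each subject's rows independently with parallel::mclapply and bind them once at the end, using data.table::rbindlist when it is installed.

```r
library(parallel)

# Hypothetical per-subject generator (not the real radlb() internals):
# one row per parameter/visit combination for a single subject.
make_subject_rows <- function(usubjid, params = c("ALT", "CRP", "IGA"), visits = 7L) {
  grid <- expand.grid(
    PARAMCD = params,
    AVISITN = seq_len(visits) - 1L,
    stringsAsFactors = FALSE
  )
  data.frame(
    USUBJID = usubjid,
    grid,
    AVAL = rnorm(nrow(grid), mean = 50, sd = 8),
    stringsAsFactors = FALSE
  )
}

subjects <- sprintf("SUBJ-%04d", seq_len(200))

# mclapply relies on forking, which is not available on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# L'Ecuyer-CMRG gives each forked worker its own random number stream.
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
chunks <- mclapply(subjects, make_subject_rows, mc.cores = n_cores)

# Bind all chunks in one step instead of growing a data frame in a loop.
adlb <- if (requireNamespace("data.table", quietly = TRUE)) {
  as.data.frame(data.table::rbindlist(chunks))
} else {
  do.call(rbind, chunks)
}
```

Whether parallelism helps depends on where the time actually goes. If the bottleneck is repeated row binding or per-row operations, vectorizing across all subjects at once may gain more than adding cores.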
In the past, users used rcd directly, calling radsl etc. Now we don't release rcd to users (and only use it to create a snapshot to be saved in scda), so I guess there's less value in optimizing this than there was at the time the issue was created.
@shajoezhu does it matter for you guys? We don't use rcd at all; I'd close it if it was just for us. NEST users should switch to scda instead.
Thanks @gogonzo, we will put this back into the backlog. I agree we are using scda data most of the time for our NEST package development, but I remember a discussion that teams were using these functions to create large fake data for stress-testing tasks. Let's keep this open please. Thanks