random cdisc data very slow for larger data

Open cicdguy opened this issue 4 years ago • 3 comments

Original message

Running the following code takes a long time! This is on r.roche.com, R 3.6.3.

<REDACTED>NEST/nest_on_bee/master/bee_nest_utils.R")
bee_use_nest(release = "2021_05_05")
ADSL <- radsl(N = 1002)
ADLB <- radlb(ADSL)

I reduced this from 15000 as it took way too long. Using system.time I get the following results:

   user  system elapsed
 37.852   0.584  38.436

That is an extremely long time to create a dataset with 21,000 records! I know random.cdisc really only exists for dummy data, but this seems like extremely poor performance.
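To check whether the runtime grows linearly or worse with N, one could time the generator at a few sizes. This is a minimal sketch: `time_generator()` and `toy_gen()` are hypothetical names that are not part of random.cdisc.data, and the toy generator only stands in for the real calls. In practice the function passed in would wrap `radlb(radsl(N = n))`.

```r
# Hypothetical helper: time a data-generating function at several sizes.
# `gen` is any function that takes a single size argument.
time_generator <- function(gen, sizes) {
  elapsed <- vapply(
    sizes,
    function(n) system.time(gen(n))[["elapsed"]],
    numeric(1)
  )
  data.frame(n = sizes, elapsed = elapsed)
}

# Toy stand-in for the real generator (assumption, for illustration only).
toy_gen <- function(n) data.frame(id = seq_len(n), value = rnorm(n))

timings <- time_generator(toy_gen, c(100, 1000, 10000))
print(timings)
```

If the elapsed time grows much faster than N, that usually points at repeated copying (for example, growing a data frame inside a loop) rather than at the random number generation itself.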

Provenance:

Creator: martik32

TODO

Improve performance. A few suggestions:

  1. use parallel::mclapply
  2. use data.table if necessary
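
The two suggestions could look roughly like this. It is a sketch under assumptions, not how random.cdisc.data is implemented: `make_subject_rows()` and `make_lab_data()` are hypothetical functions, and the parameter codes and visit count are invented so that each subject gets 21 rows, roughly matching the 21,000 records for 1002 subjects reported above. Each subject's rows are generated independently with `parallel::mclapply()`, and the pieces are bound once at the end instead of growing a data frame inside a loop. `mclapply()` relies on forking, so on Windows it only runs with `mc.cores = 1`.

```r
library(parallel)

# Hypothetical per-subject generator: one row per parameter and visit.
make_subject_rows <- function(id, params = c("ALT", "CRP", "IGA"), n_visits = 7L) {
  grid <- expand.grid(AVISITN = seq_len(n_visits), PARAMCD = params,
                      stringsAsFactors = FALSE)
  data.frame(USUBJID = id, grid,
             AVAL = rnorm(nrow(grid), mean = 50, sd = 8),
             stringsAsFactors = FALSE)
}

# Hypothetical driver: generate all subjects in parallel, then bind once.
make_lab_data <- function(ids, cores = 1L) {
  # Forking is unavailable on Windows, so fall back to a single core there.
  if (.Platform$OS.type == "windows") cores <- 1L
  pieces <- mclapply(ids, make_subject_rows, mc.cores = cores)
  # data.table::rbindlist(pieces) is a faster alternative if data.table
  # is installed (it returns a data.table rather than a plain data frame).
  do.call(rbind, pieces)
}

ids <- sprintf("SUBJ-%04d", 1:50)
labs <- make_lab_data(ids, cores = 2L)
nrow(labs)  # 50 subjects * 3 parameters * 7 visits = 1050
```

Whether parallelism helps at all depends on where the time is actually spent, so profiling the existing functions first would be the safer starting point.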

cicdguy, Aug 05 '21 13:08

See: internal_github_url/NEST/random.cdisc.data/issues/242 - I suspect there are a lot of places which could be improved

In the past, users called rcd directly (radsl etc.). Now we don't release rcd to users and only use it to create a snapshot that is saved in scda, so I guess there's less value in optimizing this than there was when the issue was created.

nikolas-burkoff, May 04 '22 10:05

@shajoezhu does it matter for you guys? We don't use rcd at all; I'd close it if it were up to us. NEST users should switch to scda instead.

gogonzo, May 23 '22 12:05

Thanks @gogonzo, we will put this back into the backlog. I agree that we use scda data most of the time for our NEST package development, but I remember a discussion that teams were using these functions to create large fake datasets for stress testing. Let's keep this open, please. Thanks

shajoezhu, May 24 '22 07:05