bedshift
Shift is too slow
Shift is really slow. I think it can be improved if regions are not modified in place: add the shifted regions as new rows and remove the old ones.
Well, switching to creating new regions and dropping the old ones didn't improve shift performance by much. I think the slow part of shift is the pandas DataFrame access, when the code needs to get the chromosome, start, and end position of a given row. With a 50,000-region BED file and a high shift rate of 0.8, the code has to access a lot of regions one at a time.
New idea:
- Take a subset of the DataFrame, which will be the rows to modify.
- Use an `apply` function on the start and end columns to get shifted positions.
- Drop the old rows, and append this new DataFrame.
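
A rough sketch of what that could look like, not the actual bedshift code. The function name `shift_regions`, the column names `chrom`/`start`/`end`, and the normally distributed offset with `mean`/`stdev` are all assumptions for illustration. It also swaps `apply` for plain column arithmetic, since adding an array of offsets to a whole column avoids the per-row Python call that `apply` would still make. It assumes the DataFrame index is unique and does not clamp positions to chromosome bounds.

```python
import numpy as np
import pandas as pd


def shift_regions(df, shift_rate, mean=0.0, stdev=150.0, seed=None):
    """Shift a random subset of rows and return a new DataFrame.

    Hypothetical helper: column names and the offset distribution
    are assumptions, not bedshift's real interface.
    """
    rng = np.random.default_rng(seed)
    n_shift = int(len(df) * shift_rate)

    # Pick all rows to modify at once instead of visiting them one by one.
    picked = rng.choice(df.index.to_numpy(), size=n_shift, replace=False)
    subset = df.loc[picked].copy()

    # One offset per selected row, added to both columns in a single step,
    # so each region keeps its width.
    offsets = rng.normal(mean, stdev, size=n_shift).round().astype(int)
    subset["start"] = subset["start"] + offsets
    subset["end"] = subset["end"] + offsets

    # Drop the old rows and append the shifted ones.
    return pd.concat([df.drop(index=picked), subset], ignore_index=True)


# Small made-up BED-like table to try it on.
bed = pd.DataFrame({
    "chrom": ["chr1"] * 5,
    "start": [100, 200, 300, 400, 500],
    "end": [150, 250, 350, 450, 550],
})
shifted = shift_regions(bed, shift_rate=0.8, seed=0)
```

The row order changes because the shifted rows end up at the bottom, so a sort by `chrom` and `start` would be needed afterwards if order matters.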