bedshift icon indicating copy to clipboard operation
bedshift copied to clipboard

Bedshift seems to produce invalid regions

Open nleroy917 opened this issue 2 years ago • 1 comments

I noticed using bedshift recently that the algorithm seems to produce regions that are invalid. Specifically, it shifted my bedfile to create a region with a start that occurred after the end (i.e. chr1 1400 900). bedtools was throwing errors because of this.

I saw this happen after I did repeated rounds of shifting on a single bedfile so I'm not sure if this is because of that.

To create the shifted bed files, this is what I ran (I can provide filesm GitHub doesn't let you upload bedfiles):

# %%
import os
from tqdm import tqdm
from bedshift import Bedshift

# %%
# params
ADD_RATE = 0.01
ADD_MEAN = 320.0
ADD_STDEV = 20.0
SHIFT_RATE = 0.01
SHIFT_MEAN = -10.0
SHIFT_STDEV = 120.0
CUT_RATE = 0.01
MERGE_RATE = 0.01
DROP_RATE = 0.03
N_SHIFTS = 100
OUT_DIR = "shifted"

# %%
# create bedshifter object on original bed file
bedshifter = Bedshift("pbmcs.bed", "hg38.chrom.sizes")

for shift_num in tqdm(range(N_SHIFTS), total=N_SHIFTS):
    file_name = f"shifted_{shift_num}.bed"
    file_path = os.path.join(OUT_DIR, file_name)

    bedshifter.all_perturbations(
        addrate=ADD_RATE,  # the rate (as a proportion of the total number of regions) to add regions
        addmean=ADD_MEAN,  # the mean length of added regions
        addstdev=ADD_STDEV,  # the standard deviation of the length of added regions
        shiftrate=SHIFT_RATE,  #  the rate to shift regions (both the start and end are shifted by the same amount)
        shiftmean=SHIFT_MEAN,  #  the mean shift distance
        shiftstdev=SHIFT_STDEV,  #  the standard deviation of the shift distance
        cutrate=CUT_RATE,  #  the rate to cut regions into two separate regions
        mergerate=MERGE_RATE,  #  the rate to merge two regions into one
        droprate=DROP_RATE,  # the rate to drop/remove regions
        seed=42,
    )

    bedshifter.to_bed(file_path)

    # start at the file we just created
    bedshifter = Bedshift(file_path, "hg38.chrom.sizes")

Then I analyzed overlaps with bedtools with (one directory up):

#!/bin/bash

mkdir -p results/
rm results/intersect.csv
touch results/intersect.csv

original_file="data/pbmcs.bed"

# add baseline to results
total=$(wc -l < "$original_file")
n_overlaps=$(bedtools intersect -wa -a "$original_file" -b "$original_file" | wc -l)
percentage=$(echo "scale=2; $n_overlaps / $total" | bc)

echo "pbmcs.bed,$percentage" >> results/intersect.csv

for file in data/shifted_clean/*.bed; do
    filename=$(basename "$file")

    # compute num intersections and store
    total=$(wc -l < "$file")
    n_overlaps=$(bedtools intersect -u -a "$file" -b "$original_file" | wc -l)
    percentage=$(echo "scale=2; $n_overlaps / $total" | bc)
    
    echo "Processing $file"
    echo "$filename,$percentage" >> results/intersect.csv

done

nleroy917 avatar Jun 07 '23 00:06 nleroy917

if you're providing "hg38.chrom.sizes" then it shouldn't go outside of that coordinate system...

so probably something isn't getting that setting correctly, maybe in the Python API...

nsheff avatar Jun 07 '23 12:06 nsheff