
IndexError when running diffSplice

Open ipatop opened this issue 11 months ago • 3 comments

Hi!

I am using SUPPA in a preliminary experiment with 1 replicate per condition, and I am getting an error when running "diffSplice" that I have not been able to resolve. Thanks in advance for your help.

"Calculating differential analysis between conditions: SUPPA_0kcl and SUPPA_1kcl 
ERROR:__main__:Unknown error: (<class 'IndexError'>, IndexError('list index out of range'), <traceback object at 0x7fcb7ac603c0>)"

I created the TPM counts with IsoQuant, and I am using SUPPA for the PSI calculations and the differential splicing. I verified that the first column of the psi files has the same names across samples and that the values are not all nan.

This is my main pipeline; the error appears at the last step:

#Load modules and set environment
module load gcc/9.2.0 python/3.8.12
source ~/IsoQuant.python.3.8.12/bin/activate #environment with IsoQuant and SUPPA
source ~/.profile

#SETUP
ref=Mus_musculus.GRCm39.dna_sm.chromosome.ALL.fa
gtf=Mus_musculus.GRCm39.110.chr.gtf
exp=IsoQuant/my_ont/my_ont.transcript_grouped_tpm.tsv #expression file from IsoQuant
out=SUPPApSI #outname
suppa=/scripts/SUPPA/suppa.py #suppa directory

#I separate the tpm counts from IsoQuant by sample
awk '{print $1,$3}' IsoQuant/my_ont/my_ont.transcript_grouped_tpm.tsv > SUPPA.1.tpm
awk '{print $1,$2}' IsoQuant/my_ont/my_ont.transcript_grouped_tpm.tsv > SUPPA.0.tpm

#run the generation of splicing cases
python $suppa generateEvents -i $gtf -o $out -f ioi 
oi=${out}.ioi

#quantify isoforms for both samples together
python $suppa psiPerIsoform -g $gtf -e $exp -o $out
psi=${out}_isoform.psi
#separate psi by sample
awk '{print $1,$3}' SUPPApSI.txt_isoform.psi > SUPPA.1.psi
awk '{print $1,$2}' SUPPApSI.txt_isoform.psi > SUPPA.0.psi

#diffSplice
python $suppa diffSplice -m empirical -nan 1 --input $oi --psi SUPPA_0kcl.psi SUPPA_1kcl.psi --tpm SUPPA_0kcl.tpm SUPPA_1kcl.tpm --area 1000 --lower-bound 0.05 -pa -gc -o $out

This is what the head of the tpm and psi files looks like:

`sample1
ENSMUST00000000266 0.000000
ENSMUST00000000514 0.119406
ENSMUST00000000834 0.000000`

`sample1
ENSMUST00000000266 0.000000
ENSMUST00000000514 0.000000
ENSMUST00000000834 0.000000`

`sample1
ENSMUSG00000104478;ENSMUST00000194081 nan
ENSMUSG00000104385;ENSMUST00000194393 nan
ENSMUSG00000102135;ENSMUST00000194605 nan`

`sample1
ENSMUSG00000104478;ENSMUST00000194081 nan
ENSMUSG00000104385;ENSMUST00000194393 nan
ENSMUSG00000102135;ENSMUST00000194605 nan`

ipatop avatar May 10 '25 15:05 ipatop

Hi,

Thanks for your message.

That type of error appears when there is something wrong with the format of the input files.

Does awk's print insert a tab ("\t") between fields by default? If not, that could be the cause, since tab-separated fields are expected.
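For reference, awk's `print $1,$3` joins the fields with its output field separator, which defaults to a single space, so files split that way are space-delimited. A minimal Python sketch of the same column selection with tab output is shown here; `select_columns` is a hypothetical helper (not part of SUPPA), and the IDs and values are placeholders.

```python
def select_columns(lines, indices, delimiter="\t"):
    """Keep the given 0-based whitespace-separated columns of each line.

    Hypothetical helper (not part of SUPPA): similar to awk '{print $1,$3}',
    but the kept fields are joined with a tab instead of awk's default space.
    """
    selected = []
    for line in lines:
        fields = line.split()  # split on any run of spaces or tabs, like awk
        # Fields missing from a short line are skipped rather than left empty.
        kept = [fields[i] for i in indices if i < len(fields)]
        selected.append(delimiter.join(kept))
    return selected


# Placeholder rows: an ID followed by one value per sample.
rows = ["tx1 0.0 0.119406", "tx2 0.5 nan"]
second_sample = select_columns(rows, [0, 2])
print(second_sample)
```

In awk itself, the same effect comes from setting the output field separator, for example `awk -v OFS='\t' '{print $1,$3}'`.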

Also, please check that you're using the latest version of the code from GitHub (v2.4), which adds some bug fixes with respect to v2.3.

I hope this helps

Eduardo


EduEyras avatar May 25 '25 13:05 EduEyras

Hi, I can confirm that I get the same error in v2.4 when performing the same operation of splitting my .psi and .tpm files by condition using awk. I have written the files both space-delimited and tab-delimited in different attempts, and neither resolves the error.

I looked into this a bit deeper. Similar to #196 this only happens when the method is set to empirical. At least in my case, the error appears to be that the numpy array output by create_replicates_distribution is empty and is then input to calculate_events_pvals, which causes diffSplice to fail with the error message ERROR:__main__:Unknown error: (<class 'IndexError'>, IndexError('list index out of range'), <traceback object at 0x7f91c272f440>).

I also have only 1 replicate per condition in my case. Could this be the cause of the error?

brandon-hastings avatar Jun 10 '25 13:06 brandon-hastings

Hey, I had the same problem! The split_file.R script did not work for me, so I wrote my own Python script for it. There were 2 issues with my script:

  1. The nan values became blank when I used df.to_csv(). I fixed this by passing na_rep="nan" to to_csv(). The blanks were breaking the formatting of the tsv.
  2. My output files did not contain the event/transcript ID in the first column. I fixed this by adding index=True to my to_csv() call.
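The two fixes above can be sketched with pandas as follows. This is an illustration only, not the commenter's actual script: `split_by_sample` is a hypothetical helper, the IDs and values are placeholders, and it assumes the event/transcript IDs are stored in the DataFrame index.

```python
import io

import pandas as pd


def split_by_sample(table):
    """Return one tab-delimited text table per sample column.

    Hypothetical helper: assumes the IDs are the DataFrame index.
    """
    outputs = {}
    for sample in table.columns:
        buffer = io.StringIO()
        # na_rep="nan" writes missing values as the text "nan" instead of blanks;
        # index=True keeps the ID column as the first field of every row.
        table[[sample]].to_csv(buffer, sep="\t", na_rep="nan", index=True)
        outputs[sample] = buffer.getvalue()
    return outputs


# Placeholder data: two events, two samples, with missing values.
psi = pd.DataFrame(
    {"sample1": [float("nan"), 0.25], "sample2": [0.5, float("nan")]},
    index=["geneA;tx1", "geneA;tx2"],
)
per_sample = split_by_sample(psi)
print(per_sample["sample1"])
```

Note that with an unnamed index, the header line written by to_csv() starts with an empty field before the sample name. Whether that matches the header layout SUPPA expects should be checked against its documentation.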

I would suggest you carefully look into the formatting of your split TPM and events files.
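One way to do that check is a small script that flags the usual suspects: rows without the expected delimiter, blank fields, and rows with a different number of fields. The sketch below is hypothetical (`find_format_problems` is not part of SUPPA) and assumes the first line is a header that is not checked.

```python
def find_format_problems(lines, delimiter="\t"):
    """Return human-readable problems found in the data rows of a table."""
    problems = []
    expected_fields = None
    # Assumption: the first line is a header, so checking starts at line 2.
    for number, raw in enumerate(lines[1:], start=2):
        fields = raw.rstrip("\r\n").split(delimiter)
        if len(fields) < 2:
            problems.append(f"line {number}: delimiter not found")
            continue
        if any(field == "" for field in fields):
            problems.append(f"line {number}: empty field")
        if expected_fields is None:
            # The first splittable data row sets the expected field count.
            expected_fields = len(fields)
        elif len(fields) != expected_fields:
            problems.append(
                f"line {number}: {len(fields)} fields, expected {expected_fields}"
            )
    return problems


# Placeholder rows: a space-delimited row and a blank value are both reported.
example = ["sample1", "tx1\t0.0", "tx2 0.5", "tx3\t"]
for problem in find_format_problems(example):
    print(problem)
```

To check a real file, pass the result of `open(path).readlines()` as `lines`.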

P.S. I just noticed that your head command has "sample1" at the start. You should remove that.

ugh-astya avatar Sep 10 '25 14:09 ugh-astya