evodiff Question about generating IDRs from EvoDiff-Seq

Hello, I am having trouble running the example in the "Generating intrinsically disordered regions" section of the README.

Per #41, I downloaded the required dataset from https://zenodo.org/records/5146063, extracted human_idr_homologues.zip, and saved the contents as the human_protein_alignments directory, so the directory layout looks like this:

data/
├── blosum62-special-MSA.mat
├── human_idr_alignments
│   ├── human_idr_boundaries_gap.tsv
│   ├── human_idr_boundaries.tsv
│   └── human_protein_alignments
│       ├── HUMAN00009_1to68.fasta
│       ├── HUMAN00009_633to749.fasta
│       ├── HUMAN00009_92to145.fasta
...

From the root directory of the repository, I ran the following and observed this output:

export AMLT_OUTPUT_DIR=./test_output
python evodiff/conditional_generation_msa.py --model-type msa_oa_dm_maxsub --cond-task idr --num-seqs 1 --amlt
INDEX FILE LEN 10634
Traceback (most recent call last):
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 1065, in <module>
    main()
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 150, in main
    src, start_idx, end_idx, original_msa, num_sequences, b_src, b_start_idx, b_end_idx, oma_id = get_IDR_MSAs(index_file, data_top_dir,
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 826, in get_IDR_MSAs
    msa_data, new_start_idx, new_end_idx, num_sequences, b_start_idx, b_end_idx, oma_id = subsample_IDR_MSA(index_file, tokenizer, max_seq_len=max_seq_len, n_sequences=n_sequences,
  File "/home/xxxxx/evodiff/evodiff/conditional_generation_msa.py", line 893, in subsample_IDR_MSA
    query_idx = [i for i, name in enumerate(msa_names) if name == row['OMA_ID']][0]  # get query index
IndexError: list index out of range

I stepped through the code with pdb and found the following:

(Pdb) p index_file.loc[index]
OMA_ID                                                HUMAN04185
UNIPROT_ID                                                Q96K76
START                                                        424
END                                                          479
IDR_SEQ        EDEKSPQTESCTDSGAENEGSCHSDQMSNDFSNDDGVDEGICLETN...
LENGTHS                                                       55
GAP START                                                    997
GAP END                                                     1141
GAP LENGTHS                                                  144
(Pdb) p row['OMA_ID']
'HUMAN04185'
(Pdb) p [file for i, file in enumerate(all_files) if 'HUMAN04185' in file]
['HUMAN04185_1to38.fasta', 'HUMAN04185_424to479.fasta', 'HUMAN04185_839to1026.fasta']
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_1to38.fasta', return_names=True)
(Pdb) bb
['BRAFL21358 0 to 5', 'EPTBU02539 0 to 0', 'LEPOC10560 3 to 40', 'ANATE13683 3 to 20', 'SERDU25819 0 to 11', 'SCOMX25917 1 to 18', 'GASAC17394 1 to 37', 'TAKRU19760 1 to 40', 'TETNG11216 1 to 37', 'ORYLA12382 1 to 37', 'ORYME02443 0 to 14', 'NOTFU11912 3 to 20', 'CYPVA13923 3 to 20', 'POEFO06820 1 to 37', 'XIPMA06130 3 to 20', 'ORENI17527 1 to 38', 'AMPOC21119 3 to 20', 'HIPCM02252 3 to 20', 'GADMO19517 1 to 38', 'ASTMX08999 5 to 38', 'PYGNA16253 0 to 12', 'ICTPU01019 9 to 31', 'DANRE39301 3 to 20', 'LATCH10026 1 to 38', 'ORNAN18050 0 to 26', 'PROCA13584 0 to 25', 'LOXAF12537 1 to 39', 'ECHTE14028 0 to 25', 'RABIT01068 1 to 38', 'OCHPR15109 0 to 25', 'DIPOR05931 0 to 0', 'FUKDA04471 0 to 5', 'HETGA12775 0 to 25', 'CAVAP13955 0 to 17', 'CAVPO05047 0 to 25', 'CHILA04061 1 to 38', 'OCTDE12798 1 to 38', 'JACJA01745 0 to 25', 'CRIGR16916 1 to 38', 'MOUSE45885 1 to 18', 'RATNO01797 1 to 38', 'NANGA02552 1 to 38', 'CERAT32976 1 to 13', 'CHLSB00649 1 to 18', 'MACFA09490 1 to 13', 'MACMU07436 1 to 38', 'MACNE29351 1 to 13', 'MANLE36987 1 to 13', 'PAPAN05860 0 to 25', 'COLAP32362 1 to 13', 'RHIBE07503 1 to 13', 'RHIRO33601 0 to 0', 'GORGO03243 0 to 6', 'HUMAN04185 1 to 38', 'PANPA06196 0 to 0', 'PANTR02333 0 to 0', 'PONAB01347 1 to 38', 'NOMLE01511 1 to 18', 'AOTNA04675 1 to 13', 'SAIBB00262 1 to 13', 'TARSY11018 0 to 25', 'PROCO03960 1 to 13', 'OTOGA19308 0 to 25', 'TUPBE14316 0 to 0', 'CANLF08543 0 to 12', 'VULVU21503 0 to 0', 'MUSPF13712 0 to 24', 'AILME06514 0 to 39', 'URSAM01994 0 to 0', 'URSMA27578 0 to 12', 'FELCA11798 1 to 39', 'TURTR04946 0 to 21', 'BOVIN04360 0 to 38', 'SHEEP06239 0 to 39', 'PIGXX17664 1 to 38', 'VICPA03255 0 to 25', 'PTEVA15708 0 to 25', 'MYOLU05549 1 to 39', 'ERIEU12752 0 to 21', 'HORSE18107 0 to 25', 'DASNO16007 0 to 38', 'CHOHO10481 0 to 5', 'SARHA06263 1 to 18', 'MONDO10274 1 to 38', 'MACEU07613 0 to 25', 'PHACI02145 1 to 38', 'ANAPL07288 0 to 25', 'MELGA10549 0 to 25', 'CHICK11008 0 to 43', 'FICAL13955 0 to 0', 'TAEGU16862 0 to 25', 
'CHRPI18449 1 to 38', 'SPHPU04621 0 to 6', 'ANOCA16740 1 to 38', 'XENTR16027 0 to 24', 'CIOSA04555 0 to 0', 'STRPU17710 1 to 56', 'STRMM09003 1 to 20', 'DAPPU07360 0 to 0', 'ORCCI04184 11 to 56', 'DROPE01541 1 to 2', 'DROPS09123 1 to 2', 'LUCCU03187 10 to 33', 'CULSO18336 1 to 4', 'ANOGA02647 1 to 4', 'AEDAE08107 1 to 4', 'CULQU04626 1 to 13', 'APIME11570 0 to 5', 'BOMIM10786 0 to 5', 'LINHU12916 0 to 5', 'OOCBI04348 0 to 5', 'CAMFO12507 0 to 5', 'ATTCE04431 0 to 5', 'SOLIN10701 0 to 0', 'HARSA07974 0 to 5', 'RHOPR10225 0 to 0', 'PEDHC04140 31 to 113', 'ZOONE05774 1 to 20', 'LINUN26257 1 to 18', 'CRAGI03987 1 to 61', 'OCTBM24223 1 to 18', 'NEMVE01956 1 to 18', 'HYDVU05760 1 to 12', 'AMPQE22746 9 to 63']
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]
(Pdb) [i for i, name in enumerate(bb) if 'HUMAN04185' in name]
[53]
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_424to479.fasta', return_names=True)
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]
(Pdb) aa, bb=parse_fasta(data_dir + 'human_protein_alignments/HUMAN04185_839to1026.fasta', return_names=True)
(Pdb) [i for i, name in enumerate(bb) if name == 'HUMAN04185']
[]

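It seems to me that the comparison name == row['OMA_ID'] at

https://github.com/microsoft/evodiff/blob/f696cfc0e58dcb17b31bf4110aaf11a8a612b07b/evodiff/conditional_generation_msa.py#L793

needs to be changed to row['OMA_ID'] in name, because the FASTA headers carry a residue-range suffix (for example 'HUMAN04185 1 to 38') and so never equal the bare OMA_ID. Is this correct?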
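To make the proposed change concrete, here is a minimal sketch of a stricter variant of the match. It assumes headers always have the form 'ID start to end', as in the pdb output above, and compares only the leading token. The helper name find_query_index is hypothetical and is not part of evodiff.

```python
def find_query_index(msa_names, oma_id):
    """Return the index of the first header whose leading token equals oma_id."""
    for i, name in enumerate(msa_names):
        # Headers look like 'HUMAN04185 1 to 38', so compare only the first
        # whitespace-delimited token rather than the whole string.
        tokens = name.split()
        if tokens and tokens[0] == oma_id:
            return i
    # Fail with a descriptive message instead of a bare IndexError.
    raise ValueError(f"{oma_id} not found in MSA headers")


names = ['GORGO03243 0 to 6', 'HUMAN04185 1 to 38', 'PANPA06196 0 to 0']
query_idx = find_query_index(names, 'HUMAN04185')
print(query_idx)  # 1
```

Compared with a plain substring test, matching on the first token avoids accidental hits if one ID happens to be a prefix of another.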

Sep 21 '24 15:09 zhang-bo-lilly

Additionally, the line at

https://github.com/microsoft/evodiff/blob/f696cfc0e58dcb17b31bf4110aaf11a8a612b07b/evodiff/conditional_generation_msa.py#L888

should use iloc instead of loc, because the row labels in human_idr_boundaries_gap.tsv are not consecutive. This also seems consistent with the following commented-out code:

https://github.com/microsoft/evodiff/blob/f696cfc0e58dcb17b31bf4110aaf11a8a612b07b/evodiff/conditional_generation_msa.py#L816-L824
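To illustrate the loc versus iloc distinction, here is a small sketch using a toy DataFrame. The rows and index labels are made up for illustration and are not taken from human_idr_boundaries_gap.tsv.

```python
import pandas as pd

# Toy stand-in for the index file; the labels 0, 2, 5 are deliberately
# non-consecutive, and the row contents are illustrative only.
index_file = pd.DataFrame(
    {"OMA_ID": ["HUMAN00009", "HUMAN04185", "HUMAN10000"]},
    index=[0, 2, 5],
)

position = 1

# iloc selects by position: the second row, whatever its label is.
row_by_position = index_file.iloc[position]
print(row_by_position["OMA_ID"])  # HUMAN04185

# loc selects by label: label 1 does not exist here, so this raises KeyError.
try:
    index_file.loc[position]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)  # True
```

If positions are sampled from range(len(index_file)), iloc is the accessor that matches; loc only works when the labels happen to coincide with positions.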

Sep 21 '24 19:09 zhang-bo-lilly

Next, assuming the line at https://github.com/microsoft/evodiff/blob/f696cfc0e58dcb17b31bf4110aaf11a8a612b07b/evodiff/conditional_generation_msa.py#L888 is changed to iloc, execution throws another error at https://github.com/microsoft/evodiff/blob/f696cfc0e58dcb17b31bf4110aaf11a8a612b07b/evodiff/conditional_generation_msa.py#L686:

(Pdb) n
IndexError: index 662 is out of bounds for dimension 2 with size 201
> /home/c271831/evodiff/evodiff/conditional_generation_msa.py(686)generate_idr_msa()
-> p = preds[:, random_x, random_y, :]
(Pdb) p random_x
33
(Pdb) p random_y
662
(Pdb) p preds.shape
torch.Size([1, 64, 201, 31])
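As a diagnostic aid only (not a fix), the mismatch above can be reproduced without torch by checking the sampled positions against the reported shape. The helper assert_in_bounds is hypothetical and is not part of evodiff.

```python
def assert_in_bounds(shape, indices, dims):
    """Raise IndexError for the first index that falls outside shape."""
    for dim, idx in zip(dims, indices):
        if not 0 <= idx < shape[dim]:
            raise IndexError(
                f"index {idx} is out of bounds for dimension {dim} "
                f"with size {shape[dim]}"
            )


# Values copied from the pdb session above.
preds_shape = (1, 64, 201, 31)
random_x, random_y = 33, 662

try:
    assert_in_bounds(preds_shape, (random_x, random_y), dims=(1, 2))
    message = ""
except IndexError as err:
    message = str(err)
print(message)
```

Here random_x (33) fits within the 64 rows, but random_y (662) exceeds the 201 columns, which suggests the sampled column refers to coordinates in a longer alignment than the one passed to the model.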

I would appreciate any help getting the code running.

Sep 21 '24 19:09 zhang-bo-lilly

I don't think this is the correct dataset (the folder should contain alignments, not single FASTA files). I have messaged the authors to get the IDR alignments uploaded to their Zenodo. In the meantime, please send me an email ([email protected]) so I can share the correct dataset with you.

Sep 30 '24 20:09 sarahalamdari