DIPS-Plus question - are all protein pairs direct interactors?

I was thinking of using DIPS-Plus to train a classifier for protein-protein interaction (I confess I have little experience in ML, so any advice is welcome). The database describes the pairs based on the distance between their atoms, and I saw a few examples of pairs where fewer than 5 atoms are within the threshold distance. So two questions:

Are all pairs direct interactors? (E.g., say we have a PDB entry with chains A, B, and C, where A::B and B::C interact but A::C do not. Would the pair A::C be included in the database?)
Any suggestions for creating a set of non-interactors? As I understand it, DIPS-Plus lets you train a model to tell whether any two atoms are interacting, but not (as is) a model to tell whether two proteins are interactors.

Perhaps the second question is better suited to, e.g., Stack Exchange, but I'm open to any advice here!
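
To make the A/B/C example concrete, here is a minimal sketch of how a distance-based definition would treat the three chains. Everything in it is an assumption for illustration: the coordinates are toy values, `count_close_atoms` is a hypothetical helper, and the 6.0 cutoff is just a placeholder, not necessarily the threshold DIPS-Plus uses.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def count_close_atoms(chain_a: List[Coord], chain_b: List[Coord], cutoff: float) -> int:
    """Count atoms of chain_a that have at least one atom of chain_b within cutoff."""
    return sum(
        1
        for atom_a in chain_a
        if any(math.dist(atom_a, atom_b) <= cutoff for atom_b in chain_b)
    )


# Toy coordinates (hypothetical): three chains laid out along the x axis.
chain_A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
chain_B = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
chain_C = [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]

CUTOFF = 6.0  # assumed value, for illustration only

contacts_AB = count_close_atoms(chain_A, chain_B, CUTOFF)
contacts_BC = count_close_atoms(chain_B, chain_C, CUTOFF)
contacts_AC = count_close_atoms(chain_A, chain_C, CUTOFF)
```

In this toy layout A::B and B::C each have atoms within the cutoff, while A::C has none, so whether A::C ends up in the dataset depends entirely on how pairs are filtered by contact count.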

Jul 02 '24 15:07 rubenalv

I decided to map the chain pairs to IntAct, to check if they were annotated as direct interactors. I used these resources:

IntAct (https://www.ebi.ac.uk/intact/download/ftp)
Mapping of PDB to UniProt (http://www.bioinf.org.uk/pdbsws; the updated server is http://www.bioinf.org.uk/servers/, where you can download a mapping file)
The list of PDB pairs in this GitHub repository

With an IntAct miscore >= 0.45 (the recommended setting) and selecting only direct interactors, I could map only 4564 of the 42K pairs in DIPS-Plus. With a miscore < 0.45, I collected 1758 pairs.
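
For anyone who wants to repeat this, the mapping logic can be sketched roughly as below. This is not the exact script I ran, just a minimal illustration under assumptions: pair names follow the `PDBID_chain_chain` pattern, `chain_to_uniprot` and `intact_direct` are hypothetical in-memory dictionaries that you would have to build yourself from the PDB-to-UniProt mapping file and the IntAct download, and the accessions and scores are made-up placeholders.

```python
from typing import Dict, FrozenSet, List, Tuple

MISCORE_CUTOFF = 0.45  # threshold quoted in the post above


def classify_pairs(
    pair_names: List[str],
    chain_to_uniprot: Dict[Tuple[str, str], str],
    intact_direct: Dict[FrozenSet[str], float],
) -> Tuple[List[str], List[str], List[str]]:
    """Split pairs into (miscore >= cutoff, miscore < cutoff, not mapped)."""
    high, low, unmapped = [], [], []
    for name in pair_names:
        # Assumes names look like "1abc_A_B" (hypothetical format).
        pdb_id, chain_a, chain_b = name.split("_")
        acc_a = chain_to_uniprot.get((pdb_id, chain_a))
        acc_b = chain_to_uniprot.get((pdb_id, chain_b))
        if acc_a is None or acc_b is None:
            unmapped.append(name)
            continue
        # frozenset makes the lookup independent of the order of the two proteins.
        score = intact_direct.get(frozenset((acc_a, acc_b)))
        if score is None:
            unmapped.append(name)
        elif score >= MISCORE_CUTOFF:
            high.append(name)
        else:
            low.append(name)
    return high, low, unmapped


# Placeholder data, not real accessions or scores.
pairs = ["1abc_A_B", "1abc_A_C", "2xyz_A_B", "3def_A_B"]
chain_to_uniprot = {
    ("1abc", "A"): "ACC1", ("1abc", "B"): "ACC2", ("1abc", "C"): "ACC3",
    ("2xyz", "A"): "ACC4", ("2xyz", "B"): "ACC5",
}
# Only interactions annotated as direct, keyed by the unordered accession pair.
intact_direct = {
    frozenset(("ACC1", "ACC2")): 0.62,
    frozenset(("ACC4", "ACC5")): 0.30,
}

high, low, unmapped = classify_pairs(pairs, chain_to_uniprot, intact_direct)
```

Pairs that end up in `unmapped` either have a chain without a UniProt accession or have no direct-interaction record for that accession pair, which is where most of the pairs landed in my case.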

So the conclusion, at least based on the IntAct data, is that only a fraction of the DIPS-Plus pairs contain chains in direct interaction. Anyone who wants to use this dataset to classify direct protein-protein interactions should curate it first. If the goal is to annotate atoms in proximity between the chains in each pair, or to create point cloud embeddings like the dMaSIF ones, 42K pairs make a great dataset.

@amorehead, I'll leave the issue open in case you would like to comment, otherwise feel free to close it. Thanks for the resource!

Jul 08 '24 15:07 rubenalv

I noticed that the chain pairs in the .dill files do not take homomultimers into account. E.g., 2YKS, which is a pentamer of 5 identical sequences, generates the pairs 2YKS_A_B, 2YKS_A_C, etc., which are identical. This will create some imbalance (and burden) when training. I realised this when running FoldSeek cluster, so using the FoldSeek-based split in the database is encouraged.
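
If you just want to see how much redundancy the homomultimers add before clustering, a cruder alternative to FoldSeek is to collapse pairs whose two chains have the same sequences within one entry. The sketch below assumes that approach: `dedup_pairs` and `chain_seq` are hypothetical names, and the sequences are placeholder strings, not the real 2YKS sequence. Note that this also merges geometrically different interfaces between identical chains, so it is only a rough check.

```python
from typing import Dict, List, Tuple


def dedup_pairs(
    pair_names: List[str],
    chain_seq: Dict[Tuple[str, str], str],
) -> List[str]:
    """Keep the first pair per entry for each unordered pair of chain sequences."""
    seen = set()
    kept = []
    for name in pair_names:
        # Assumes names look like "2YKS_A_B".
        pdb_id, chain_a, chain_b = name.split("_")
        key = (
            pdb_id,
            frozenset((chain_seq[(pdb_id, chain_a)], chain_seq[(pdb_id, chain_b)])),
        )
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


# Placeholder sequences: every 2YKS chain shares one (fake) sequence.
chain_seq = {
    ("2YKS", "A"): "SEQ_X", ("2YKS", "B"): "SEQ_X", ("2YKS", "C"): "SEQ_X",
    ("1ABC", "A"): "SEQ_Y", ("1ABC", "B"): "SEQ_Z",
}
pairs = ["2YKS_A_B", "2YKS_A_C", "2YKS_B_C", "1ABC_A_B"]

kept = dedup_pairs(pairs, chain_seq)
```

With the toy input, the three 2YKS pairs collapse to a single representative and the heterodimer pair is kept unchanged.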

Jul 18 '24 15:07 rubenalv