ProteinMPNN produces low-complexity result from RFDiffusion output

Open bwllc opened this issue 2 years ago • 0 comments

This message is a near-duplicate of an issue I opened on the ProteinMPNN discussion forum (https://github.com/dauparas/ProteinMPNN/issues/61). Since the issue involves using RFDiffusion output in ProteinMPNN and I have yet to see a reply in that forum, I will also try to ask my questions here.

I am attempting to follow the workflow recommended in the RFDiffusion paper, https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2.

I am obtaining low amino acid sequence diversity in my ProteinMPNN outputs. My problem is not as severe as the one shown in an earlier reported issue (https://github.com/dauparas/ProteinMPNN/issues/46), but it is problematic. Here is a typical example. I am executing ProteinMPNN as shown in ProteinMPNN/examples/submit_example_1.sh.

MIYKHAGYYNAKKGKGKGYTFSTGAKGKGYTKRFKKFSVGKGKATDKETLRAMLTLGGIIFEIDKKKKNKWKGYSTDKGLTAGYSTGKGTKALGYQITPNFGVGYAYNKKPYFGVSYQTKDGSVGVGYNFGLRIVSVSYGNPKTGKGAGYSYKA
{ A : 6.5%
  C : 0.0%
  D : 2.6%
  E : 1.3%
  F : 4.5%
  G : 18.2%
  H : 0.6%
  I : 3.9%
  K : 18.2%
  L : 3.9%
  M : 1.3%
  N : 3.9%
  P : 1.9%
  Q : 1.3%
  R : 1.9%
  S : 5.8%
  T : 8.4%
  V : 4.5%
  W : 0.6%
  Y : 10.4% }

There is a surprisingly large number of G and K residues (18.2% each). I also wonder about the high abundance of Y (10.4%). The calculated isoelectric point is 10.08. I generated 10 candidate sequences from this particular structure, and they were all quite similar to this one.
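For anyone who wants to reproduce these numbers, here is a minimal sketch of how the composition table and the similarity among candidates could be tabulated. It assumes all candidate sequences were designed for the same backbone and therefore have equal length, so identity is compared position by position without alignment. The function names are illustrative and are not part of ProteinMPNN.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return {residue: percent of sequence} for the 20 standard amino acids."""
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq.upper())
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


def pairwise_identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_identity(seqs):
    """Average identity over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

A mean pairwise identity close to 1.0 across the 10 candidates would quantify the "quite similar" impression, and sorting the output of `composition` by value makes the dominant residues easy to spot.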

I have also tried other structures with similar geometry. Their designed sequences are dominated by E residues and have highly acidic isoelectric points. I can't seem to get sequences with good diversity and a pI in the range of 5 to 9. Something seems wrong.

A response to the earlier issue report was as follows:

"Hello! This might happen if the model is uncertain about the prediction, or the input backbone is of low quality. You could try adding negative alanine bias."

Originally posted by @dauparas in https://github.com/dauparas/ProteinMPNN/issues/46#issuecomment-1497947341

I can of course attempt to apply negative biases to certain amino acids in ProteinMPNN, as recommended in the earlier post. Before I do this, I would like to ask whether there are any criteria we can use to measure, or adjust, the "quality" of input backbones.
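For reference, the negative-bias suggestion could be tried roughly as follows. This is a sketch under an assumption: that the `--bias_AA_jsonl` option of `protein_mpnn_run.py` reads a file holding one JSON dictionary of per-residue biases, which is the kind of file that `helper_scripts/make_bias_AA.py` in the ProteinMPNN repository appears to produce. The bias values and function names below are placeholders for illustration, not recommended settings.

```python
import json

# One-letter codes accepted in this sketch (20 standard residues plus X).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")

# Hypothetical values: penalize the residues that dominate the output above.
EXAMPLE_BIASES = {"G": -1.0, "K": -1.0, "Y": -0.5}


def bias_jsonl_line(biases):
    """Serialize a {residue: bias} mapping as a single JSON line."""
    unknown = set(biases) - ALLOWED
    if unknown:
        raise ValueError(f"unknown residue codes: {sorted(unknown)}")
    return json.dumps({aa: float(b) for aa, b in biases.items()}) + "\n"


def write_bias_file(path, biases):
    """Write the bias mapping to `path` (assumed input for --bias_AA_jsonl)."""
    with open(path, "w") as handle:
        handle.write(bias_jsonl_line(biases))
```

Calling `write_bias_file("bias.jsonl", EXAMPLE_BIASES)` would produce a file that could then be passed to the run script, assuming the option behaves as described above. Negative values make a residue less likely to be sampled.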

My PDB input files are generated by RFDiffusion. I specify a partial scaffold, and RFDiffusion inpaints the rest. At least in PyMOL, the secondary structures of the RFDiffusion output files look reasonable: PyMOL's automated secondary structure assignment identifies regions of alpha helix and beta sheet. That doesn't rule out problems with my RFDiffusion outputs, but I don't know what to look for.
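One basic check that could be run on a backbone file is the distance between consecutive C-alpha atoms, which is about 3.8 Angstroms for residues joined by a trans peptide bond. The sketch below reads standard fixed-column PDB ATOM records and flags pairs that deviate from that value. The 0.2 Angstrom tolerance and the function names are assumptions chosen for illustration. Passing this check does not establish that a backbone is designable; failing it only points to local chain geometry worth inspecting.

```python
import math


def ca_atoms(pdb_lines):
    """Yield (chain, residue_number, (x, y, z)) for each CA in ATOM records."""
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], int(line[22:26]), xyz


def suspicious_ca_gaps(pdb_lines, ideal=3.8, tolerance=0.2):
    """Return (chain, res_i, res_j, distance) for sequential CA pairs
    whose separation differs from `ideal` by more than `tolerance`."""
    atoms = list(ca_atoms(pdb_lines))
    flagged = []
    for (c1, r1, p1), (c2, r2, p2) in zip(atoms, atoms[1:]):
        if c1 != c2 or r2 != r1 + 1:
            continue  # skip chain breaks and numbering gaps
        d = math.dist(p1, p2)
        if abs(d - ideal) > tolerance:
            flagged.append((c1, r1, r2, round(d, 2)))
    return flagged
```

A typical call would be `suspicious_ca_gaps(open("design.pdb"))`, where the file name is a placeholder. An empty list means no sequential C-alpha spacing fell outside the chosen tolerance.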

My structures include some purely computer-generated residues which are expected to become part of a beta sheet. The geometry is accepted by PyMOL, but I wonder whether something as simple as placing beta strands, say, 0.2 Angstroms closer together than they "should be" would cause the problems I am seeing.

Will RFDiffusion attempt to correct small positional errors in specified scaffold residues? Does RFDiffusion refuse to proceed if the positions of specified scaffold residues are physically unrealistic?

I am considering experimenting with the geometric parameters of my beta-sheet generator. Alternatively, I am considering specifying side chains to give the protein some actual mass (all residues in my RFDiffusion inputs are glycines), then performing a molecular-dynamics energy minimization using GROMACS. Both of those approaches are stabs in the dark, and potentially quite tedious.

Is there a better way?

Thanks for any information you can provide.

Jul 31 '23 22:07 bwllc