Calculate gene similarity on the HPO
Dear Kevin,
I would like to calculate the similarity for a few genes (~2000). I annotated these genes with the HPO codes from the human phenotype ontology webpage (http://compbio.charite.de/jenkins/job/hpo.annotations/lastSuccessfulBuild/artifact/util/annotation/genes_to_phenotype.txt).
I reshaped it and got a file like this:
A4GALT . HP:0010970|HP:0000006
AAAS . HP:0040281|HP:0040282|HP:0040283|HP:0011463|HP:0001278|HP:0000972|HP:0012332|HP:0008259|HP:0004322|HP:0001251|HP:0000648|HP:0000007|HP:0002571|HP:0004319|HP:0001263|HP:0008163|HP:0001249|HP:0009916|HP:0003487|HP:0007002|HP:0000252|HP:0001347|HP:0000522|HP:0003676|HP:0000649|HP:0001324|HP:0000953|HP:0001260|HP:0000846|HP:0001250|HP:0007440|HP:0000505|HP:0000982|HP:0001761|HP:0010486|HP:0000830|HP:0007556|HP:0002093|HP:0001430|HP:0001252|HP:0002376|HP:0000612|HP:0000407
AASS . HP:0000119|HP:0000752|HP:0001083|HP:0001903|HP:0003593|HP:0001250|HP:0002161|HP:0000736|HP:0001252|HP:0100543|HP:0000007|HP:0001256|HP:0000750|HP:0001249
ABAT . HP:0025356|HP:0000278|HP:0000098|HP:0007291|HP:0000007|HP:0002415|HP:0001321|HP:0000494|HP:0001347|HP:0006829|HP:0001263|HP:0001274|HP:0001250|HP:0001254|HP:0025430|HP:0003819
ABCA4 . HP:0040280|HP:0040281|HP:0040282|HP:0040283|HP:0040284|HP:0000006|HP:0007663|HP:0000662|HP:0001133|HP:0000608|HP:0000512|HP:0000543|HP:0000007|HP:0007737|HP:0007722|HP:0000510|HP:0007984|HP:0007843|HP:0000548|HP:0000580|HP:0000572|HP:0008035|HP:0000639|HP:0000618|HP:0000405|HP:0000603|HP:0000135|HP:0000493|HP:0000463|HP:0001249|HP:0007703|HP:0000613|HP:0000987|HP:0030329|HP:0000649|HP:0000648|HP:0000551|HP:0008046|HP:0000407|HP:0007704|HP:0007814|HP:0008736|HP:0000035|HP:0008002|HP:0007675|HP:0000431|HP:0000610|HP:0000518|HP:0000602|HP:0001513|HP:0008059|HP:0000501|HP:0000563|HP:0000842|HP:0030500|HP:0001347|HP:0000505|HP:0005978|HP:0011504|HP:0011462|HP:0011463|HP:0003621|HP:0007994
ABCB11 . HP:0040283|HP:0000989|HP:0002014|HP:0003155|HP:0000952|HP:0001081|HP:0003593|HP:0001394|HP:0001744|HP:0001046|HP:0002240|HP:0002630|HP:0002908|HP:0000007|HP:0003819|HP:0004322|HP:0001508|HP:0001406|HP:0001402
which I think is the correct format for phenopy. I then used the command:
phenopy score gene_lists_with_HPO.txt --threads 12 --self
and I got as output something like this:
#query entity_id score
A4GALT A4GALT 1.0
A4GALT ABCD1 0.0
A4GALT ACAT1 0.010405043493187662
A4GALT ACVRL1 0.03336405048957507
A4GALT ADGRG1 0.0
A4GALT AGXT 0.009234121604447244
A4GALT AKT1 0.003509945769583653
A4GALT ALG1 0.0
A4GALT AMER1 0.0
However, for some genes the score against itself is not 1, contrary to what I was expecting. For instance:
ABCB7 ABCB7 0.5558528984777618
Would you expect something like this? How would you explain it? Should I use a different --summarization-method?
Best regards,
Luca
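A quick way to see how many genes are affected is to pull the self-comparison rows out of the score output. The sketch below is minimal and assumes the output is whitespace-separated with the three columns shown above (query, entity_id, score) and a header line starting with `#`. The helper name `self_scores` and the file name in the usage note are hypothetical.

```python
def self_scores(lines):
    """Return {gene: score} for rows where a gene is compared with itself.

    Assumes three whitespace-separated columns (query, entity_id, score)
    and skips blank lines and header/comment lines starting with '#'.
    """
    scores = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        query, entity_id, score = line.split()
        if query == entity_id:
            scores[query] = float(score)
    return scores


# Sample rows copied from the output shown in this thread.
sample = [
    "#query entity_id score",
    "A4GALT A4GALT 1.0",
    "A4GALT ABCD1 0.0",
    "ABCB7 ABCB7 0.5558528984777618",
]

selfs = self_scores(sample)
# Keep only the genes whose self-comparison is below 1.0.
below_one = {gene: s for gene, s in selfs.items() if s < 1.0}
print(below_one)
```

For a real run, an open file handle can be passed instead of `sample`, for example `with open("scores.txt") as fh: self_scores(fh)`, where `scores.txt` is a hypothetical file holding the phenopy output.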
Hi Luca,
Thank you for checking out the repo. It looks like you have successfully run phenopy on your input files; that's great! The behavior you describe is expected: it is a property of the HRSS semantic similarity scoring algorithm, which scales similarity scores to reward comparisons between nodes further down the ontology. As implemented here, even a phenotype compared with itself scores 1.0 under HRSS only when its beta_ic is 0.0, which is the case for leaf nodes. Does this explanation help?
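To make the leaf-node point concrete, here is a minimal sketch of an HRSS-style score. It assumes the published form of HRSS, score = 1 / (1 + gamma) * alpha_ic / (alpha_ic + beta_ic), where alpha_ic is the information content (IC) of the most informative common ancestor, beta_ic is the average IC gap between each term and its most informative leaf descendant, and gamma is the IC distance between the two terms through that ancestor. The function name and the numbers are illustrative only, and phenopy's actual implementation may differ in detail.

```python
def hrss_like(alpha_ic, beta_ic, gamma):
    """HRSS-style similarity under the assumed formula (not phenopy's code)."""
    if alpha_ic + beta_ic == 0:
        return 0.0  # no shared information at all
    return (1.0 / (1.0 + gamma)) * (alpha_ic / (alpha_ic + beta_ic))


# A term compared with itself: the common ancestor is the term, so gamma = 0.
# Leaf node: nothing more specific exists below it, so beta_ic = 0.
leaf_self = hrss_like(alpha_ic=2.0, beta_ic=0.0, gamma=0.0)
# Internal node: more informative leaves exist below it, so beta_ic > 0.
internal_self = hrss_like(alpha_ic=2.0, beta_ic=2.0, gamma=0.0)

print(leaf_self, internal_self)
```

Under this assumption, a gene annotated mostly with internal (non-leaf) terms would score below 1.0 against itself, which would be consistent with the ABCB7 row reported in the question.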
So how would I set a network-cutoff value then, if identical terms might not score 1.0? Also, is there any possibility to introduce my own scores, if I have frequency values attached to phenotypes?