Which column is the ground-truth label used in the paper?

Open YueCao94 opened this issue 4 years ago • 0 comments

Hi! This is really nice work and I like it a lot. Thank you very much for open-sourcing it! I am trying to reproduce the results in your paper, but I am confused by the CSV files in 'calc_logprobs/output/' that are downloaded by 'demo_calc_logprobs.sh'.

1). I am using the files whose names end with 'autoregressive_results.csv'. I see 40 such files, which matches the number in the paper. Is that correct?

2). In some files, several columns look like they could be the ground-truth label. For instance, in 'MTH3_HAEAESTABILIZED_Tawfik2015_MTH3_HAEAESTABILIZED_hmmerbit_plmc_n5_m30_f50_t0.2_r1-330_id100_b165_autoregressive_results.csv', I see 6 such columns, named 'Wrel_G3', 'Wrel_G7', ... Which one should I use as the ground-truth label?

3). In some rows the ground-truth column is empty. Should I simply remove those rows? Moreover, some files contain no-mutation entries such as 'A16A' or 'C173C' (e.g. CALM1_HUMAN_Roth2017_autoregressive_results.csv). Should I remove those as well? If not: some files give these no-mutation samples a ground-truth label of 0 while others give 1, which confuses me given that the label is supposed to be the log of a ratio.

4). Based on the paper, I think the performance reported in the end is that of the model 'mutation_effect_prediction_all_mean_channels-48'. Is that right?
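
To make question 3 concrete, here is a minimal sketch of the filtering I have in mind. It is only my assumption about how the files might be cleaned, not the procedure from the paper. The column name `mutant` and the helper names `is_synonymous` and `filter_rows` are hypothetical, and `Wrel_G3` is used only as an example label column.

```python
import csv
import io
import re

# One substitution such as "A16G": wild-type residue, position, mutant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def is_synonymous(mutant):
    """Return True for a no-mutation entry such as 'A16A' or 'C173C'."""
    m = MUTATION_RE.match(mutant.strip())
    return m is not None and m.group(1) == m.group(3)


def filter_rows(csv_text, mutant_col, label_col):
    """Drop rows with an empty label and rows describing a no-mutation entry.

    mutant_col and label_col are assumptions about the file layout;
    adjust them to the actual header of each results file.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = (row.get(label_col) or "").strip()
        if not label:
            continue  # empty ground-truth value
        if is_synonymous(row[mutant_col]):
            continue  # wild-type and mutant residue are identical
        kept.append(row)
    return kept


# Made-up rows for illustration only, not taken from the real files.
example = "mutant,Wrel_G3\nA16A,1.0\nA16G,0.5\nC173C,\nK20R,\n"
rows = filter_rows(example, "mutant", "Wrel_G3")
print([r["mutant"] for r in rows])  # ['A16G']
```

Whether dropping these rows matches what was done for the paper is exactly what I am unsure about.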

Again, thanks a lot for sharing the data. I would really appreciate your help!

Aug 24 '21 03:08 YueCao94