What should be the MCMC parameters for different datasets?
I am using SCITE on our data, where the number of cells is one of [100, 500, 1000, 1500] and the number of mutations is one of [50, 200, 500]. For these datasets, what should I set as the desired number of repetitions and the desired chain length for each MCMC repetition?
I also wanted to know why we get so many ML/MAP trees as an output. Usually, we get one ML tree as output. Is there any way to get one final ML/MAP tree as an output?
Hi, for simulated data we found good performance for the mutation tree with the number of iterations scaling as n^2 log(n), where n is the number of mutations (see figure 6 of the original article). For the constant, one could check when you start to get similar results across runs, but I think around 10 should work (i.e. 10n^2 log(n) iterations, so around 100k for 50 mutations but about 15 million for 500 mutations, which can take quite some time).
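As a rough illustration of the 10n^2 log(n) rule of thumb, the sketch below computes a chain length for each mutation count mentioned in the question. It assumes the natural logarithm, which reproduces the figure of roughly 100k iterations for 50 mutations. The function name and the default constant of 10 are illustrative only, and the result is meant as a starting value for -l that still needs checking by comparing runs.

```python
import math


def suggested_chain_length(n_mutations, constant=10):
    """Rule-of-thumb chain length: constant * n^2 * ln(n)."""
    # Assumption: natural log, which matches ~100k iterations for n = 50.
    return round(constant * n_mutations ** 2 * math.log(n_mutations))


# Mutation counts taken from the question above.
for n in (50, 200, 500):
    print(n, suggested_chain_length(n))
```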
You can get lots of ML/MAP trees when there are many trees with exactly the same likelihood (or marginal likelihood). This rarely happens with lots of cells well spread across the tree, but can occur when there are regions of the tree with no cells attached. To only get one tree as output, you can use the -max_treelist_size 1 option.
Thank you, Jack, for answering all my questions!
Hi Jack,
I ran SCITE on our dataset with 500 cells and 200 mutations with the number of iterations set to 1. It took 56 minutes to complete. Is this run time normal for a dataset of this size? If so, what is a reasonable number of iterations to use (apart from the 10n^2 log(n) formula you proposed)?
Thanks, Ritu
Hi Ritu,
that seems much too slow. For example when I run
./scite -i test.csv -n 200 -m 500 -r 1 -l 1 -fd 6.e-5 -ad 0.2 0.2 -cc 1e-05
on a random test csv file, it takes 75 milliseconds on my desktop, and only 53 milliseconds if I compile with the -O3 option:
clang++ *.cpp -o scite -O3
All the best, Jack
Hi Jack,
I apologize for the confusion. I used the following command to run SCITE on our dataset:
scite -i input_t8_rep1.D.csv -n 200 -m 500 -r 1 -l 1565000 -fd 0.01 -ad 0.2 0.2 -a -max_treelist_size 1 -o testData
Here is the csv file: input_ap001_rep1.D.csv
I set the number of iterations (-r) to 1 and calculated the constant (-l) with the formula you suggested, 10n^2 log(n). With these settings the run finished in 56 minutes. To reduce the run time, what would be reasonable values for -r and -l?
Thanks, Ritu
Hi Ritu,
the parameter -l is the number of iterations in each MCMC run, so taking about an hour for 1.5 million iterations on data of this size is around what we would expect (a couple of milliseconds per iteration). On my computer it is a bit faster on your data, doing a million iterations in about 10 minutes (you could also try compiling with the -O3 flag to make it faster).
The -r parameter is the number of repetitions, i.e. how many times it runs the whole MCMC chain. Repetitions can also be done externally by running several jobs with -r 1 on different cores.
In terms of reasonable values of -l for your data, you would want each run to give similar results. The suggestion would be to run several copies of your command (with different seeds from the -seed option) and check that the output tree scores or output trees are similar. If they are, then -l is large enough (and could even have been reduced); if not, you may want to run even longer chains (which of course takes more compute time). Playing around with your data, the runs start to get similar at the order of hundreds of thousands to a million iterations, so the 1.5M you have seems reasonable.
All the best, Jack
PS: In your case you only have states 0, 1 and 3 in the data, so you only need one value for the -ad parameter.
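The convergence check described in this reply (several single-repetition runs with different seeds, then comparing scores) could be set up roughly as follows. This is only a sketch: it builds the command lines without running them, uses only flags that appear in this thread, and takes the scores as numbers you read off each run yourself, since the output format is not described here. The helper names, the output prefix, and the tolerance of 1.0 log-likelihood units are hypothetical choices.

```python
import shlex


def build_scite_commands(input_csv, n_mutations, n_cells, chain_length, seeds):
    """Return one scite argument list per seed, each a single repetition."""
    commands = []
    for seed in seeds:
        commands.append([
            "scite", "-i", input_csv,
            "-n", str(n_mutations), "-m", str(n_cells),
            "-r", "1", "-l", str(chain_length),
            "-fd", "0.01", "-ad", "0.2",  # single -ad value, as in the PS
            "-a", "-max_treelist_size", "1",
            "-seed", str(seed),
            "-o", f"testData_seed{seed}",  # hypothetical output prefix
        ])
    return commands


def scores_agree(scores, tolerance=1.0):
    """True if the best scores of all runs lie within `tolerance` of each other."""
    # Assumption: the tolerance is an arbitrary illustrative threshold.
    return max(scores) - min(scores) <= tolerance


for cmd in build_scite_commands("input_t8_rep1.D.csv", 200, 500, 1565000, [1, 2, 3]):
    print(shlex.join(cmd))  # each line can be submitted as a separate job
```

If the scores from the separate runs do not agree, the same commands can be regenerated with a larger chain length.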
Hi Jack,
Thank you for the detailed explanation of the parameters. I am able to run SCITE much faster after compiling it with -O3. This is very helpful!
Thanks, Ritu
Hi Jack,
I am using the -a option to get the cells attached to the leaf nodes because I want to get the clones with the cells. I noticed that some cells are attached to multiple leaf nodes. For example, in the output testData_ml0.gv I noticed that cell s499 is attached to the following three nodes:
185 -> s499;
191 -> s499;
199 -> s499;
I included the contents of testData_ml0.gv in the attached file testData_ml0.docx.
Does SCITE have multiple clones for one cell? Please let me know if my understanding is correct here.
Thanks, Ritu
For the maximum likelihood result, especially with missing data, there may be several locations where a cell can attach that have exactly the same likelihood (for example, when the predicted genotypes for the cell differ only at positions with missing data). The code returns them all for completeness, and you may pick any one or average across them.
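To see which cells have several equally likely attachment points, the edges in the .gv output can be grouped by cell. The sketch below assumes one edge per line in the form shown in the question ("185 -> s499;") and that cell nodes are the ones whose names start with "s"; both are inferred from the example in this thread rather than from a format specification. Picking the first listed attachment is an arbitrary choice, which is acceptable here because all listed attachments have the same likelihood.

```python
import re
from collections import defaultdict

# Matches lines such as "185 -> s499;"
EDGE = re.compile(r"^\s*(\w+)\s*->\s*(\w+)\s*;")


def cell_attachments(gv_text, cell_prefix="s"):
    """Map each cell node to the list of nodes it is attached to."""
    attachments = defaultdict(list)
    for line in gv_text.splitlines():
        match = EDGE.match(line)
        # Assumption: cell nodes are identified by their name prefix.
        if match and match.group(2).startswith(cell_prefix):
            attachments[match.group(2)].append(match.group(1))
    return dict(attachments)


def pick_one(attachments):
    """Keep a single (arbitrary, first-listed) attachment per cell."""
    return {cell: parents[0] for cell, parents in attachments.items()}


example = """185 -> s499;
191 -> s499;
199 -> s499;
"""
found = cell_attachments(example)
print(found)
print(pick_one(found))
```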
Thank you for answering my question!