reduce 200x200000 into 200x1000
Hi, I have a ChIP-seq style dataset of RPKM values that I want to reduce from 200x200000 into 200x1000, so that I only end up with 1000 variables at the end of the MDR process, for my 200 records.
What would be the recommended way to use scikit-mdr for this task?
Hi @avilella,
MDR can perform feature construction to compress some number of features down to a single feature. Theoretically, MDR could do so with thousands of features; practically, MDR works best when passed no more than about 5 features. As such, a common practice with MDR is to exhaustively evaluate every MDR model up to n-way and keep only the best k, where n and k are chosen by the user. In your case, k=1000 and perhaps n=2. With 200,000 features, the 2-way models alone number 200000 choose 2 = 19,999,900,000 (about 20 billion), which is likely outside your computational budget.
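To make that count concrete, here is a small sketch using only the Python standard library. The helper name count_mdr_models is made up for illustration and is not part of scikit-mdr; it simply counts how many distinct feature combinations of a given order exist.

```python
from math import comb


def count_mdr_models(n_features, order):
    """Number of distinct feature combinations of exactly `order` features.

    Hypothetical helper for illustration; not part of scikit-mdr.
    """
    return comb(n_features, order)


# 2-way models for a dataset with 200,000 features
n_models = count_mdr_models(200_000, 2)
print(f"{n_models:,} two-way models")
```

The count grows very quickly with the order, so calling the same helper with order=3 gives a sense of why exhaustive search is only practical after the feature set has been cut down.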
For that reason, we've developed some feature selection algorithms in the scikit-rebate package that may be better for your use case. The scikit-rebate algorithms can scan your dataset and assign feature importance scores to every feature (in terms of their ability to predict the outcome, potentially interacting with other features) and select a subset of features down to, say, 1000 features. From there, MDR can more reasonably be used in the way I describe above to explicitly construct new, condensed features from the remaining 1000 features.
Hope that helps.
Beautiful! I will try it!
Great. I should note that scikit-rebate may take a while to run on a dataset with 200k features, but there is an n_jobs parameter that lets it use multiple processors to speed the algorithm up.
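For reference, a rough sketch of what the selection step could look like, assuming X is a 200 x 200000 NumPy array of RPKM values and y holds one outcome label per record (the thread does not say what the outcome is, so that part is an assumption). ReliefF is only one of the scikit-rebate algorithms, and n_neighbors=10 is an arbitrary illustrative value, not a recommendation. The function names top_k_indices and select_features are made up for this example.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def select_features(X, y, k=1000, n_jobs=-1):
    """Score every column of X with ReliefF and keep the k best columns.

    X: NumPy array of shape (n_samples, n_features), e.g. (200, 200000)
    y: NumPy array of shape (n_samples,) with the outcome labels (assumed)
    Returns the reduced array and the indices of the kept columns.
    """
    # Imported here so top_k_indices stays usable without scikit-rebate.
    from skrebate import ReliefF

    # n_jobs=-1 asks for all available cores; n_neighbors is illustrative.
    selector = ReliefF(n_features_to_select=k, n_neighbors=10, n_jobs=n_jobs)
    selector.fit(X, y)

    # feature_importances_ holds one score per original feature.
    keep = top_k_indices(selector.feature_importances_, k)
    return X[:, keep], keep
```

Keeping the column indices alongside the reduced array makes it possible to map the 1000 surviving features back to their original genomic regions before passing them on to MDR.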