Sample duplication issue in ROSMAP RNASeq and Genotype
This serves as a reminder of the questions to be answered. @Rhopala will write a comprehensive description of the scenario later on. @gaow, please correct me if I got the questions wrong.
According to the ROSMAP metadata CSV on Synapse:
- RNA-seq
  - 1.1. For some individual IDs (e.g. "R2809589"), there are two RNA-seq records for the same tissue.
    - 1.1.1. Why is that?
    - 1.1.2. How does it impact our analysis?
    - 1.1.3. What to do with it?
- Genotype
  - 2.1. For some individual IDs (e.g. "R3257830"), there are two WGS records for different tissues, or even for the same tissue (e.g. "R1631616").
    - 2.1.1. Why?
    - 2.1.2. How does it impact our analysis?
    - 2.1.3. The way to deal with it is by removing duplicates while maximizing overlap.
A description of the question is included in the file mapping notebook: WGS_file_mapping.ipynb
- First, duplicates exist in both the Synapse RNA-seq metadata and the Synapse WGS metadata, i.e. multiple RNA-seq records or multiple WGS records with unique specimenIDs are mapped to the same individualID. For the scenario of having duplicates, we need to figure out:
- why duplicates occur
- how does it impact our analysis
- what to do with it.
The exact indices of the individualIDs that have duplicates, and the duplicated specimenIDs for RNA-seq and WGS, can be found in the synapse metadata RNA-seq and synapse metadata WGS (= CTCN WGS) sections as dup_dlpfc_rna and dup_wgs.
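As a rough illustration of the duplicate check described above, here is a minimal pandas sketch. It assumes only that the metadata has `individualID` and `specimenID` columns; the toy table and the helper name `find_duplicated_individuals` are hypothetical and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for a Synapse metadata table; all IDs here are made up.
meta = pd.DataFrame({
    "individualID": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "specimenID": ["s1", "s2", "s3", "s4", "s5", "s6"],
})

def find_duplicated_individuals(df):
    """Return the rows whose individualID maps to more than one unique specimenID."""
    # Count distinct specimenIDs per individualID.
    counts = df.groupby("individualID")["specimenID"].nunique()
    dup_ids = counts[counts > 1].index
    # Keep every record belonging to an individual with duplicates.
    return df[df["individualID"].isin(dup_ids)].sort_values(
        ["individualID", "specimenID"]
    )

dup = find_duplicated_individuals(meta)
print(dup)
```

Applied to the real RNA-seq and WGS metadata, the same filter should give tables comparable to dup_dlpfc_rna and dup_wgs, although the notebook may build them differently.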
- Second, among the 1196 Synapse metadata records for WGS, 45 have a specimen ID that is simply a copy of the corresponding individual ID, for example:
individualID specimenID
R9809661 R9809661 (problematic data)
R9996478 SM-CJFME (Normal data)
The problematic records are effectively missing the specimen ID that is needed to map them to genotype data (WGS or array). Also, there are only 35 unique individual IDs among these 45 samples. Still, we need to figure out:
- why these 45 anomalous records exist
- how does it impact our analysis
- what to do with it.
Most of these problematic records should correspond to a specimen ID in the format "ROSXXXXXX" or "MAPxxxxxx". The exact list of the 45 records can be found in the 45 missing specimen IDs investigation section at the end.
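The check for these records can be sketched as follows. This is a minimal example on made-up IDs, assuming only the `individualID` and `specimenID` columns shown in the example above; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the WGS metadata; all IDs here are made up.
wgs_meta = pd.DataFrame({
    "individualID": ["R0000001", "R0000001", "R0000002", "R0000003"],
    "specimenID": ["R0000001", "R0000001", "SM-AAAAA", "R0000003"],
})

# A record is flagged when its specimenID is just a copy of its individualID.
placeholder = wgs_meta[wgs_meta["specimenID"] == wgs_meta["individualID"]]

n_records = len(placeholder)                           # flagged records
n_individuals = placeholder["individualID"].nunique()  # distinct individuals among them
print(n_records, n_individuals)
```

On the real WGS metadata, these two numbers would be expected to match the 45 records and 35 unique individual IDs reported above.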
For the first question, indeed that is what we need to figure out now. For the second question, out of the 45 samples, I have identified 28 that have an RNA-seq ID.
If we decide to keep all the RNA-seq samples, there are 895 overlapping samples. However, if we keep only one RNA-seq sample for each genotype sample, we will have 860 samples.
I think we should make sure that the samples are unique, i.e. keep only 860 of the 895 overlapping samples.
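The two counts discussed above (all overlapping RNA-seq samples versus one per genotyped individual) can be illustrated with a small pandas sketch. The toy data, the `genotyped` set, and the choice of `keep="first"` are assumptions for illustration only; which of the duplicated RNA-seq samples should actually be kept is still an open question.

```python
import pandas as pd

# Toy RNA-seq metadata; all IDs here are made up.
rna = pd.DataFrame({
    "individualID": ["A1", "A1", "B2", "C3", "D4"],
    "specimenID": ["r1", "r2", "r3", "r4", "r5"],
})

# Hypothetical set of individuals that have genotype data.
genotyped = {"A1", "B2", "C3"}

# Option 1: keep every RNA-seq sample whose individual is genotyped.
overlap_all = rna[rna["individualID"].isin(genotyped)]

# Option 2: keep a single RNA-seq sample per genotyped individual.
# keep="first" is an arbitrary rule used only for this sketch.
overlap_unique = overlap_all.drop_duplicates(subset="individualID", keep="first")

print(len(overlap_all), len(overlap_unique))
```

In the real data, the first count corresponds to the 895 overlapping samples and the second to the 860 unique ones.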