Mem optimization of sumstat standardization
This ticket is dedicated to problem 8 in #412, to record potential optimization options.
- Reduce the passing of unneeded data. At the moment, full rows of the query table are passed into the compare_snps function. However, most of that information is not actually used there. So perhaps changing
def snps_match_dup(query, subject, keep_ambiguous=True):
    pm = compare_snps(query, subject)
    if not keep_ambiguous:
        pm = pm[~pm.ambiguous]
    new_subject = subject.loc[pm.sidx]
    # update beta and snp info
    new_query = pd.concat([new_subject.iloc[:, :5], query.loc[pm.qidx].iloc[:, 5:]], axis=1)
    new_query.loc[list(pm.flip), "STAT"] = -new_query.STAT[list(pm.flip)]
    return new_query, new_subject
into
def snps_match_dup(query, subject, keep_ambiguous=True):
    pm = compare_snps(query.iloc[:, 0:5], subject)
    if not keep_ambiguous:
        pm = pm[~pm.ambiguous]
    new_subject = subject.loc[pm.sidx]
    # update beta and snp info
    new_query = pd.concat([new_subject.iloc[:, :5], query.loc[pm.qidx].iloc[:, 5:]], axis=1)
    new_query.loc[list(pm.flip), "STAT"] = -new_query.STAT[list(pm.flip)]
    return new_query, new_subject
can save us some memory.
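To get a rough feel for how much smaller the sliced input is, here is a minimal sketch on a made-up table. The column names other than STAT and the table size are assumptions for illustration only, and whether the saving shows up in practice depends on how compare_snps handles its input internally, which I have not checked.

```python
import numpy as np
import pandas as pd

# Hypothetical query table: 5 SNP-info columns followed by statistic columns.
# Only "STAT" appears in the real code; the other names are made up.
n = 1000
query = pd.DataFrame({
    "CHR": np.ones(n, dtype=np.int64),
    "POS": np.arange(n, dtype=np.int64),
    "A0": ["A"] * n,
    "A1": ["G"] * n,
    "SNP": [f"rs{i}" for i in range(n)],
    "STAT": np.random.default_rng(0).normal(size=n),
    "SE": np.ones(n),
    "P": np.full(n, 0.5),
})

# Bytes held by the full table versus only the first five columns.
full_bytes = query.memory_usage(deep=True).sum()
info_bytes = query.iloc[:, 0:5].memory_usage(deep=True).sum()

print(f"full table: {full_bytes} bytes, first 5 columns: {info_bytes} bytes")
```

The gap grows with the number of statistic columns in the query table, so the real tables would need to be measured the same way to see if this is worth it.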
I was under the impression that numpy leverages C code under the hood and is thus more efficient than a Python for-loop. That might be one option.
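As a small illustration of the numpy-versus-loop idea, the sketch below flips the sign of a statistic for flagged SNPs in both styles. The arrays are random stand-ins and the function names are hypothetical; this is not the actual cugg code, just the general pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
stat = rng.normal(size=10_000)    # stand-in for the STAT column
flip = rng.random(10_000) < 0.5   # stand-in for the per-SNP flip flag


def flip_loop(stat, flip):
    """Flip signs one element at a time in a Python for-loop."""
    out = stat.copy()
    for i in range(len(out)):
        if flip[i]:
            out[i] = -out[i]
    return out


def flip_vectorized(stat, flip):
    """Flip signs in one vectorized call, with the loop running in C."""
    return np.where(flip, -stat, stat)


# Both versions give the same result; only the speed differs.
same = np.array_equal(flip_loop(stat, flip), flip_vectorized(stat, flip))
print(same)
```

Timing the two with timeit on realistically sized sumstats would show whether the loop is actually a bottleneck here.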
Since it is part of the cugg package, when optimizing the sumstat merger we should also optimize the liftover to reduce its dependency on the internet. That dependency is why the liftover sometimes doesn't work within JupyterLab, failing with the following error:
ConnectionError: HTTPSConnectionPool(host='hgdownload.cse.ucsc.edu', port=443): Max retries exceeded with url: /goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2ad41b3f83a0>: Failed to establish a new connection: [Errno 110] Connection timed out'))
If you have the chain file offline, then it should not redownload it every time, right?
I have downloaded it, but somehow the software can't recognize it. I have struggled to figure this out for some time but decided to postpone it until after other work.
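One way to make sure the chain file is only fetched once is to resolve it through a small cache helper and hand the resulting local path to the liftover step. This is a sketch under assumptions: get_chain_file and its parameters are hypothetical names, not part of cugg, and it does not explain why the current code fails to pick up the downloaded file.

```python
import os
import urllib.request

# URL taken from the error message above.
CHAIN_URL = "https://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz"


def get_chain_file(cache_dir, url=CHAIN_URL, download=urllib.request.urlretrieve):
    """Return a local path to the chain file, downloading only if it is missing.

    `download` is any callable taking (url, destination_path); it is a
    parameter so the network call can be swapped out or mocked.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Only reached when no cached copy exists, so offline runs
        # succeed as long as the file was placed in cache_dir beforehand.
        download(url, local_path)
    return local_path
```

With the file already copied into the cache directory, a call like get_chain_file("/path/to/cache") never touches the network, which should avoid the ConnectionError inside JupyterLab.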