
Mem optimization of sumstat standardization

Open hsun3163 opened this issue 3 years ago • 4 comments

This ticket is dedicated to problem 8 in #412, to record potential optimization options.


  1. Reduce the passing of unneeded data. At the moment, full rows of the query table are passed into the compare_snps function. However, most of that information is not actually used there; only the first five columns seem to be needed. So perhaps changing
def snps_match_dup(query,subject,keep_ambiguous=True):
    pm = compare_snps(query,subject)
    if not keep_ambiguous:
        pm = pm[~pm.ambiguous]
    new_subject = subject.loc[pm.sidx]
    #update beta and snp info
    new_query = pd.concat([new_subject.iloc[:,:5],query.loc[pm.qidx].iloc[:,5:]],axis=1)
    new_query.loc[list(pm.flip) , "STAT"] = -new_query.STAT[list(pm.flip)]
    return new_query, new_subject

into

def snps_match_dup(query,subject,keep_ambiguous=True):
    pm = compare_snps(query.iloc[:,0:5],subject)
    if not keep_ambiguous:
        pm = pm[~pm.ambiguous]
    new_subject = subject.loc[pm.sidx]
    #update beta and snp info
    new_query = pd.concat([new_subject.iloc[:,:5],query.loc[pm.qidx].iloc[:,5:]],axis=1)
    new_query.loc[list(pm.flip) , "STAT"] = -new_query.STAT[list(pm.flip)]
    return new_query, new_subject

can save us some memory.
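To get a rough feel for how much is at stake, here is a minimal sketch that compares the footprint of a full query table with that of its first five columns. The table is a toy one: every column name except STAT is made up for illustration, and whether the saving is actually realized depends on whether compare_snps copies its input, which this thread does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sumstat table: five SNP-identity columns followed by
# statistics columns. Only "STAT" is a name taken from the function above.
n = 1000
query = pd.DataFrame({
    "CHR": np.ones(n, dtype=np.int64),
    "POS": np.arange(n, dtype=np.int64),
    "SNP": [f"rs{i}" for i in range(n)],
    "A0": ["A"] * n,
    "A1": ["G"] * n,
    "STAT": np.zeros(n),
    "SE": np.ones(n),
    "P": np.ones(n),
})

# Bytes held by the whole table versus by the five columns used for matching.
full_bytes = int(query.memory_usage(deep=True).sum())
key_bytes = int(query.iloc[:, 0:5].memory_usage(deep=True).sum())

print(f"full table: {full_bytes} bytes, first five columns: {key_bytes} bytes")
```

With real sumstats, where many statistics columns follow the first five, the gap would be correspondingly larger.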

hsun3163 avatar Oct 03 '22 20:10 hsun3163

I was under the impression that numpy leverages C code under the hood and is thus more efficient than a Python for-loop. That might be one option.
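As a small illustration of the idea, the sketch below flips the sign of a statistic for flagged entries, once with a Python for-loop and once with a single vectorized numpy call. The arrays are made-up toy data, and this is not the actual loop in the sumstat code, only the general pattern.

```python
import numpy as np

# Toy data: statistics and a boolean flag marking which entries to flip.
stat = np.array([0.5, -1.2, 2.0, 0.3])
flip = np.array([False, True, False, True])

# Element-by-element Python loop.
looped = stat.copy()
for i in range(len(looped)):
    if flip[i]:
        looped[i] = -looped[i]

# Same result in one vectorized call; the loop over elements runs in compiled code.
vectorized = np.where(flip, -stat, stat)

print(looped.tolist(), vectorized.tolist())
```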

hsun3163 avatar Nov 28 '22 17:11 hsun3163

Since it is part of the cugg package, when optimizing the sumstat merger we should also optimize the liftover to reduce its dependency on the internet, which is why the liftover sometimes doesn't work within JupyterLab and fails with the following error:

ConnectionError: HTTPSConnectionPool(host='hgdownload.cse.ucsc.edu', port=443): Max retries exceeded with url: /goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x2ad41b3f83a0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

hsun3163 avatar Dec 15 '22 18:12 hsun3163

If you have the chain file offline then it should not redownload it every time, right?

gaow avatar Dec 15 '22 19:12 gaow

> If you have the chain file offline, then it should not redownload it every time, right?

I have downloaded it, but somehow the software can't recognize it. I have struggled to figure this out for some time but decided to postpone it until after other stuff.
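One way to narrow this down might be to check explicitly where a local copy of the chain file is being looked for. The sketch below uses only the standard library; the function name and the search directories are hypothetical, and the file name is the one from the error message above. If the liftover turns out to be built on pyliftover (an assumption, the thread does not say), the resolved path could be passed to its LiftOver constructor, which accepts a chain file path in place of genome names.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional

# File name taken from the ConnectionError above.
CHAIN_NAME = "hg19ToHg38.over.chain.gz"


def find_local_chain(search_dirs: Iterable[str],
                     chain_name: str = CHAIN_NAME) -> Optional[Path]:
    """Return the first existing, non-empty copy of chain_name, or None.

    Hypothetical helper for debugging; not part of cugg.
    """
    for directory in search_dirs:
        candidate = Path(directory).expanduser() / chain_name
        if candidate.is_file() and candidate.stat().st_size > 0:
            return candidate.resolve()
    return None


# Usage with a throwaway directory standing in for a real download location.
with tempfile.TemporaryDirectory() as tmp:
    print(find_local_chain([tmp]))            # nothing there yet
    (Path(tmp) / CHAIN_NAME).write_bytes(b"placeholder")
    print(find_local_chain([tmp]))            # now resolves to the local file
    # Assumption: with pyliftover, LiftOver(str(path)) would load this file
    # without contacting the UCSC server.
```

Printing the resolved path (or None) for each candidate directory makes it easier to see whether the downloaded file sits where the software expects it.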

hsun3163 avatar Dec 15 '22 19:12 hsun3163