bio-transformers

a worker died or was killed while executing a task by an unexpected system error

Open wushixian opened this issue 4 years ago • 6 comments

I use 4 GPUs to calculate MSA embeddings, but the process terminates every time. The error is raised by Ray, with the message "a worker died or was killed while executing a task by an unexpected system error". The GPU processes terminate one by one. I tried several times and updated Ray to the latest version, but the problem is the same. How can I fix this? Thanks!

wushixian avatar Aug 19 '21 09:08 wushixian

I tried again using only the CPU to calculate embeddings, and it still terminated. I checked the ESM documentation, which says there is a problem with the model esm_msa1_t12_100M_UR50S and recommends using esm_msa1b_t12_100M_UR50S instead, but I can't find where to modify the code to use esm_msa1b_t12_100M_UR50S. Could somebody tell me? Thanks.

wushixian avatar Aug 20 '21 09:08 wushixian

Hello,

I will add the esm_msa1b_t12_100M_UR50S model in a few minutes.

delfosseaurelien avatar Aug 20 '21 11:08 delfosseaurelien

> I tried again using only the CPU to calculate embeddings, and it still terminated. I checked the ESM documentation, which says there is a problem with the model esm_msa1_t12_100M_UR50S and recommends using esm_msa1b_t12_100M_UR50S instead, but I can't find where to modify the code to use esm_msa1b_t12_100M_UR50S. Could somebody tell me? Thanks.

delfosseaurelien avatar Aug 20 '21 11:08 delfosseaurelien

The esm_msa1b_t12_100M_UR50S model has been added.
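
For the question above about where to switch models, here is a minimal sketch. It assumes the bio-transformers constructor takes the model name through a `backend` argument, as in `BioTransformers(backend=..., num_gpus=...)`. The helpers `choose_msa_backend` and `build_model` are hypothetical names used only for illustration and are not part of the library.

```python
# Model names are taken from this thread; everything else is illustrative.
OLDER_MSA_MODEL = "esm_msa1_t12_100M_UR50S"
RECOMMENDED_MSA_MODEL = "esm_msa1b_t12_100M_UR50S"


def choose_msa_backend(requested: str) -> str:
    """Return the recommended MSA checkpoint when the older one is requested.

    Any other backend name is passed through unchanged.
    """
    if requested == OLDER_MSA_MODEL:
        return RECOMMENDED_MSA_MODEL
    return requested


def build_model(backend: str, num_gpus: int = 0):
    """Create the model wrapper (hypothetical helper).

    Assumption: bio-transformers exposes
    BioTransformers(backend=..., num_gpus=...). The import is done lazily
    so the name-selection logic above can be used without the package.
    """
    from biotransformers import BioTransformers

    return BioTransformers(backend=choose_msa_backend(backend), num_gpus=num_gpus)


if __name__ == "__main__":
    # Only prints the backend name that would be used; no model is loaded.
    print(choose_msa_backend(OLDER_MSA_MODEL))
```

Starting with `num_gpus=0` may make it easier to check that the new model loads before moving back to the multi-GPU setup.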

delfosseaurelien avatar Aug 20 '21 11:08 delfosseaurelien

> I use 4 GPUs to calculate MSA embeddings, but the process terminates every time. The error is raised by Ray, with the message "a worker died or was killed while executing a task by an unexpected system error". The GPU processes terminate one by one. I tried several times and updated Ray to the latest version, but the problem is the same. How can I fix this? Thanks!

I will check this. It seems there is an issue with Ray.

delfosseaurelien avatar Aug 20 '21 11:08 delfosseaurelien

Thank you very much!

wushixian avatar Aug 21 '21 02:08 wushixian