
MPIPool: Child processes continue when parent job is deleted on cluster

Open joostwardenier opened this issue 8 years ago • 7 comments

General information:

  • emcee version: 2.2.1
  • platform: Linux
  • installation method (pip/conda/source/other?): pip

Problem description:

I am running a 30-core job on a cluster using emcee. When I delete the job with the usual qdel <job ID> command, only the parent process is killed, while the child processes continue to run. In other words: qdel makes the job ID disappear from the queue, but the 30 Python processes (one per core) remain present in the background, contributing heavily to the cluster load. Furthermore, I can only kill the background processes manually, and only on the one node I am logged into.

What have you tried so far?:

I have consulted the emcee documentation and the GitHub source code for MPIPool, which I use to parallelise the job, but I have not managed to pinpoint where exactly things go wrong. Ideally, each child process would check on its parent regularly and terminate itself once it notices that the parent is no longer there. At the moment, deleting the job seems to simply cut the rope that ties parent and child together, without killing the child.
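
To illustrate the kind of behaviour I have in mind, here is a rough sketch (not emcee code; the function name is my own). It assumes Linux, where an orphaned process is re-parented to PID 1 or a subreaper, so a changed parent PID means the original parent has died:

import os
import threading
import time

def watch_parent(poll_interval=10):
    """Terminate this process if its parent disappears."""
    original_ppid = os.getppid()
    while True:
        if os.getppid() != original_ppid:
            # The parent PID has changed, so the parent is gone;
            # exit immediately, skipping any Python-level cleanup.
            os._exit(1)
        time.sleep(poll_interval)

# Each worker would start the watchdog as a daemon thread at startup.
threading.Thread(target=watch_parent, daemon=True).start()

Note that under mpirun the OS-level "parent" of each rank is the MPI launcher rather than the master Python process, so a watchdog like this would only help if the launcher itself dies.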

If you know what should be changed, I would be delighted to hear it!

Minimal example:

The snippet below shows how the pool object is integrated into my main script. In addition, I use a bash script to submit the job to the cluster queue (containing echo 'mpirun -np '$NCORES' python '$SKRIPTNAME >> $TMPFILE) and to request the number of cores I want to use. The latter should work fine.

import sys

import emcee
from emcee.utils import MPIPool

pool = MPIPool()

if not pool.is_master():
    # Worker processes wait here for tasks from the master.
    pool.wait()
    sys.exit(0)

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
pos, prob, state = sampler.run_mcmc(p0, 1000)  # p0 contains the initial walker positions

pool.close()

joostwardenier avatar Jan 31 '18 13:01 joostwardenier

This is not an issue with emcee, because emcee does not take responsibility for a pool that is handed to it. The correct way to do this is to wrap the pool in a context manager, which makes sure it is shut down:

import sys
import emcee
from schwimmbad import MPIPool

with MPIPool() as pool:

    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
    pos, prob, state = sampler.run_mcmc(p0, 1000) # p0 contains the initial walker positions

schwimmbad is a Python package used for parallelisation.

andyfaff avatar Jan 31 '18 21:01 andyfaff

Many thanks for the suggestion!

joostwardenier avatar Feb 02 '18 09:02 joostwardenier

@andyfaff: I implemented your solution, but the problem seems to persist. My current suspicion is that the issue is related to the fact that the lnprob function calls a piece of Fortran code. It could be that deleting the (Python) job while the Fortran code is running does not kill the latter. In the past, I have also noticed that Fortran does not respond to Ctrl-C; the programme only stops once control returns to the Python layer. I am using the f2py module to enable communication between the two languages.
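
To make this concrete: as far as I understand, CPython only runs Python-level signal handlers between bytecode instructions. So even a handler like the sketch below (my own illustration, not something emcee installs) would stay pending while the interpreter is stuck inside a long f2py call:

import os
import signal

def handle_term(signum, frame):
    # Exit immediately, without Python-level cleanup.
    os._exit(128 + signum)

# Installed in every MPI process before any work starts. CPython only
# invokes this handler between bytecode instructions, so while the
# interpreter is inside a compiled Fortran routine the signal stays
# pending until that call returns to the Python layer.
signal.signal(signal.SIGTERM, handle_term)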

joostwardenier avatar Feb 11 '18 14:02 joostwardenier

I am very sorry to bother you again, but the issue has not been solved yet. It even occurs for the minimal piece of code below. I am completely at a loss as to where the source of the problem lies. Hopefully one of you can help!

import numpy as np
import sys
import emcee
from emcee.utils import MPIPool

def calc_log_prob(a, b, c, d):
    # The nested loops serve no purpose other than making each
    # likelihood evaluation expensive.
    for i in np.arange(1000):
        for j in np.arange(1000):
            for k in np.arange(1000):
                for g in np.arange(1000):
                    x = i + j + k + g
    return -np.abs(a + b)

def lnprob(x):
    return calc_log_prob(*x)

ndim, nwalkers = 4, 180

p0 = [np.array([np.random.normal(loc=-5.5, scale=2.),
                np.random.normal(loc=-0.3, scale=1.),
                3000. * np.random.uniform(),
                -6. + 3. * np.random.uniform()])
      for i in range(nwalkers)]

with MPIPool() as pool:

    if not pool.is_master():
        # Wait for instructions from the master process.
        pool.wait()
        sys.exit(0)

    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)

    pos, prob, state = sampler.run_mcmc(p0, 560)

pool.close()

joostwardenier avatar Feb 12 '18 18:02 joostwardenier

Have you tried the example from the emcee documentation? That works for me, both on a cluster and locally. If it doesn't work, then I suspect it is something to do with your MPI/compiler/mpi4py setup.
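
As a quick check of the MPI layer on its own, a minimal mpi4py script (the file name here is arbitrary) should print one line per rank:

# mpi_check.py -- run with e.g. `mpirun -np 4 python mpi_check.py`
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d" % (comm.Get_rank(), comm.Get_size()))

If deleting a job running only that script also leaves processes behind, then emcee and schwimmbad are not involved at all.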

andyfaff avatar Feb 13 '18 01:02 andyfaff

Thanks for your quick reply! The excerpt above was directly inspired by the examples from the emcee documentation, plus the change you suggested. I am also inclined to conclude that we are dealing with a cluster-specific problem here. The programme and the parallelisation work perfectly well, but when the job is deleted, the processes (one per core) simply do not disappear...

joostwardenier avatar Feb 13 '18 07:02 joostwardenier

@jpw96: This does seem like it's probably a problem with your cluster config, but there is also one bug that I can see in your code: if you use a with statement for the pool, then you can't also include the pool.close() line at the end. I don't suspect that that would cause this problem, but it will throw an exception, I think!
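
Concretely, the end of the script should look something like:

with MPIPool() as pool:
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
    pos, prob, state = sampler.run_mcmc(p0, 560)

# No pool.close() here: the with statement already closes the pool on
# exit, so closing it a second time can raise.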

dfm avatar Jun 07 '18 16:06 dfm