Downloading external data

Open ygrange opened this issue 7 years ago • 5 comments

I'd like to build a library that requires an external dataset to be downloaded and untarred somewhere in the install directory (and then this directory needs to be configured in CMake).

So let me show how I would define a dummy version of the workflow in a shell script so that you see what I want to achieve:

git clone https://github.com/somecode/code.git
wget ftp://my.ftp.server/somedata.ztar
mkdir -p ${INSTALLPREFIX}/data
cd ${INSTALLPREFIX}/data
tar xzf /path/to/somedata.ztar
cd /path/to/gitcode
mkdir build && cd build
cmake -DDATA_DIR=${INSTALLPREFIX}/data .. && make

I'm not really sure what the best approach is here. Should I treat it as a source_url and define a custom install script? (If I only add it as an extra source URL, EasyBuild bails out telling me it doesn't know how to build, which makes sense.) Or should it be done using extensions parameters? Or is there a different approach altogether?

ygrange avatar May 25 '18 13:05 ygrange

@ygrange You can download the dataset as an additional 'source', and then copy it in place via preconfigopts for example:

easyblock = 'CMakeMake'
...
source_urls = [
    '<source_urls_for_software>',
    'ftp://my.ftp.server/',
]
sources = [
    '<source_tarball_for_software>',
    'somedata.tar.gz',
]
...
preconfigopts = "mkdir -p %(installdir)s && cp -a %(builddir)s/data %(installdir)s/data && "
configopts = "-DDATA_DIR=%(installdir)s/data"

One potential problem with this is that the installation directory you're creating via mkdir -p is probably going to be wiped away in the installation step (make install), so you may need to use keeppreviousinstall = True too (which doesn't cooperate well when doing a forced reinstall using --force).

If the data doesn't need to actually be in place during cmake, it's better to copy the dataset via postinstallcmds (and then you won't have the problem mentioned above):

postinstallcmds = ["cp -a %(builddir)s/data %(installdir)s/"]
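
To make the template substitution concrete, here is a minimal sketch of how such a command expands. It uses plain Python %-formatting only as an illustration; EasyBuild resolves the templates itself at build time, and the two paths below are hypothetical values, not anything from a real installation.

```python
# Hypothetical paths, chosen for illustration only;
# EasyBuild fills in the real values during the build.
template_values = {
    "builddir": "/tmp/eb-build/mylib",
    "installdir": "/opt/software/mylib/1.0",
}

# Each template has the form %(name)s: the trailing 's' comes after
# the closing parenthesis, not inside it.
postinstallcmds = ["cp -a %(builddir)s/data %(installdir)s/"]

# Expand every command with ordinary %-formatting to see the result.
expanded = [cmd % template_values for cmd in postinstallcmds]
print(expanded[0])
```

With the values assumed above, the command copies the unpacked `data` directory from the build directory into the top level of the installation directory.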

boegel avatar May 27 '18 09:05 boegel

Thanks for the primer! I think the main issue I encounter here is that even though the data is not needed at build time, the tests do depend on the data and the tests are executed before the install.

I see different "XXXopts" parameters and will try to look through the code if any of those happens after the build but before the tests. Else I'll have to revert to the preconfigopts idea. Any hints are obviously welcome :)

ygrange avatar May 28 '18 11:05 ygrange

I ended up doing it twice (since it's not a huge data set and I don't like breaking the --force functionality in this specific case). Once I use preconfigopts = "mkdir -p %(installdir)s/data/ && cp -a %(builddir)s/data/* %(installdir)s/data/ && " and later I have

postinstallcmds = ["mkdir -p %(installdir)s/data/",
"cp -a %(builddir)s/data/* %(installdir)s/data/"]

Works quite well actually. As far as I'm concerned, this issue can be closed (thanks again for the help).
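
Since easyconfig files are Python, the duplicated copy command could probably be defined once and reused for both steps. The following is an untested sketch: the variable name `local_copy_data` is made up for illustration, and the `local_` prefix is an assumption about the naming convention that newer EasyBuild versions expect for helper variables.

```python
# Define the copy command once so the pre-configure and post-install
# steps cannot drift apart. 'local_copy_data' is a hypothetical name.
local_copy_data = (
    "mkdir -p %(installdir)s/data/ && "
    "cp -a %(builddir)s/data/* %(installdir)s/data/"
)

# Run the copy before configuring, so the tests can find the data.
# The trailing ' && ' chains it in front of the cmake command.
preconfigopts = local_copy_data + " && "

# Run the copy again after installation, in case the install step
# wiped the directory that was created earlier.
postinstallcmds = [local_copy_data]
```

This merges the two post-install commands into a single `&&` chain, which should behave the same as listing them separately.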

ygrange avatar May 29 '18 07:05 ygrange

Actually, one more question:

Now, my file looks something like this:

source_urls = [
    '<source_urls_for_software>',
    'ftp://my.ftp.server/',
    'http://some/patchlocation/'
]
sources = [
    '<source_tarball_for_software>',
    'somedata.tar.gz',
]
patches=['mypatch.patch']

checksums = [
    'abcce',  # source tarball
    'cddee',  # external data
    'fhgji',  # mypatch
]

The problem now is that the external data set will be updated far more regularly than the code (say: weekly vs a few times per year). So having a checksum in there will quite easily break the eb file. Since I have the ambition to provide the ebs to the central repo and I am not looking forward to committing a weekly pull request, I'd like to drop the checksum check on the second file. I haven't really been able to do so though. (edit after happily hacking away):

So I tried the following based on some documentation I found in a pull request:

checksums = [
    'abcce',  # source tarball
    None,     # external data
    'fhgji',  # mypatch
]

Now it becomes a bit strange because what happens here is that '<source_tarball_for_software>' is checked against abcce (as expected), but the external data gets checked against fhgji and the patch then gets None as checksum (since the script fails before checking it I don't really know what happens).
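
To spell out what I expected, here is the positional pairing written as plain Python. This only illustrates the intended behaviour and is not EasyBuild's actual implementation; `pair_checksums` is a hypothetical helper made up for this example.

```python
sources = ["<source_tarball_for_software>", "somedata.tar.gz"]
patches = ["mypatch.patch"]
checksums = ["abcce", None, "fhgji"]


def pair_checksums(sources, patches, checksums):
    """Pair each file with the checksum at the same position.

    Sources come first, then patches. A None entry keeps its slot,
    so the entries after it do not shift.
    """
    files = sources + patches
    if len(files) != len(checksums):
        raise ValueError("expected one checksum entry per source and patch")
    return list(zip(files, checksums))


for name, checksum in pair_checksums(sources, patches, checksums):
    status = checksum if checksum is not None else "skip verification"
    print(name, "->", status)
```

In this pairing the external data keeps its `None` slot and the patch is still matched with `fhgji`, which is not what I observe.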

ygrange avatar May 30 '18 13:05 ygrange

Just bumping this one to make sure this question (the one in my last reply) doesn't fall through the cracks. I am planning to submit my additions as a pull request, but I'd like to have this resolved first.

ygrange avatar Jun 11 '18 09:06 ygrange