Very long resquiggle runtime with tombo 1.4 on DRS data...
I've been testing tombo 1.4 on some of our DRS data and I'm finding that the resquiggle step is taking a very long time to complete and is using a lot of resources.
The dataset I'm running it on is small: only ~260k reads. I'm running it on an HPC cluster using 28 cores. Tombo and its dependencies were installed from bioconda into a conda environment. The resquiggle command has now been running for 44hrs with a maximum memory footprint of 96GB... and it's not finished yet. The log output is not reporting any errors or problems. Is this expected? It is considerably longer than previous versions used to take, but I'm aware that considerable improvements to the RNA workflow were made in this update, so I don't know if this is typical now.
The command:
tombo resquiggle path/to/fast5dir transcriptome.fasta --processes 28 --overwrite
Abridged output:
[14:16:54] Loading minimap2 reference.
[14:16:59] Getting file list.
[14:17:03] Loading default canonical ***** RNA ***** model.
[14:17:03] Re-squiggling reads (raw signal to genomic sequence alignment).
0%| | 250/255282 [00:06<1:46:51, 39.78it/s]
0%| | 496/255282 [00:12<1:50:30, 38.43it/s]
0%| | 745/255282 [00:19<1:51:47, 37.95it/s]
...
98%|█████████▊| 250650/255282 [2:34:27<02:51, 27.05it/s]
99%|█████████▊| 251550/255282 [2:35:12<02:18, 27.01it/s]
99%|█████████▉| 252750/255282 [2:36:46<01:37, 27.03it/s]
99%|█████████▉| 253750/255282 [2:36:19<00:56, 27.05it/s]
100%|█████████▉| 254791/255282 [2:36:43<00:18, 27.09it/s]
OK, so this finally core dumped on me over the weekend:
*** glibc detected *** more: double free or corruption (!prev): 0x0000000001f1cef0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x30a6275dee]
/lib64/libc.so.6[0x30a6278c80]
/lib64/libc.so.6(fclose+0x14d)[0x30a626633d]
more[0x405bd4]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x30a621ed1d]
more[0x401a89]
======= Memory map: ========
00400000-00408000 r-xp 00000000 fd:01 6029373 /bin/more
00608000-00609000 rw-p 00008000 fd:01 6029373 /bin/more
01f1a000-01f3b000 rw-p 00000000 00:00 0 [heap]
30a5e00000-30a5e20000 r-xp 00000000 fd:01 2359654 /lib64/ld-2.12.so
30a6020000-30a6021000 r--p 00020000 fd:01 2359654 /lib64/ld-2.12.so
30a6021000-30a6022000 rw-p 00021000 fd:01 2359654 /lib64/ld-2.12.so
30a6022000-30a6023000 rw-p 00000000 00:00 0
30a6200000-30a638a000 r-xp 00000000 fd:01 2361686 /lib64/libc-2.12.so
30a638a000-30a658a000 ---p 0018a000 fd:01 2361686 /lib64/libc-2.12.so
30a658a000-30a658e000 r--p 0018a000 fd:01 2361686 /lib64/libc-2.12.so
30a658e000-30a6590000 rw-p 0018e000 fd:01 2361686 /lib64/libc-2.12.so
30a6590000-30a6594000 rw-p 00000000 00:00 0
30a7e00000-30a7e16000 r-xp 00000000 fd:01 2361690 /lib64/libgcc_s-4.4.7-20120601.so.1
30a7e16000-30a8015000 ---p 00016000 fd:01 2361690 /lib64/libgcc_s-4.4.7-20120601.so.1
30a8015000-30a8016000 rw-p 00015000 fd:01 2361690 /lib64/libgcc_s-4.4.7-20120601.so.1
30b4c00000-30b4c1d000 r-xp 00000000 fd:01 2361727 /lib64/libtinfo.so.5.7
30b4c1d000-30b4e1c000 ---p 0001d000 fd:01 2361727 /lib64/libtinfo.so.5.7
30b4e1c000-30b4e20000 rw-p 0001c000 fd:01 2361727 /lib64/libtinfo.so.5.7
30b4e20000-30b4e21000 rw-p 00000000 00:00 0
2b8d24470000-2b8d24473000 rw-p 00000000 00:00 0
2b8d24473000-2b8d2a308000 r--p 00000000 fd:01 6835494 /usr/lib/locale/locale-archive
2b8d2a308000-2b8d2a309000 rw-p 00000000 00:00 0
2b8d2a388000-2b8d2a38f000 r--s 00000000 fd:01 6960203 /usr/lib64/gconv/gconv-modules.cache
2b8d2a38f000-2b8d2a390000 rw-p 00000000 00:00 0
2b8d2c000000-2b8d2c021000 rw-p 00000000 00:00 0
2b8d2c021000-2b8d30000000 ---p 00000000 00:00 0
7ffcc8011000-7ffcc8027000 rw-p 00000000 00:00 0 [stack]
7ffcc80e2000-7ffcc80e3000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Abort (core dumped)
I'm guessing this is because of the ancient glibc version (2.12)?? So I'm going to try rerunning it with a newer glibc.
The abridged output indicates that reads are being processed at a rate of ~27 reads per second and that just over two and a half hours have elapsed. So I think there are two most likely possibilities here: 1) there are a small number of very long reads in here that require a very large amount of memory to re-squiggle (no chunking is currently performed in Tombo), or 2) something in the post-processing is hanging (or something else entirely).
There is not a whole lot going on in the main-thread post-processing (which would have to be what causes this process to hang and consume large compute resources), so my inclination is that the first situation is the answer. The abridged output shows 491 reads left to process at the last update. Did this progress update at all after that? Do you expect (or have, from some mapping results) some very long (>100kb) reads from this run?
This is RNA data; there are no reads anywhere near 100kb.
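As an aside, raw signal lengths can be checked directly from the fast5 files. A rough sketch, assuming single-read fast5 files with the standard Raw/Reads/.../Signal layout (multi-read files would need ont_fast5_api instead):

import glob
import os
import h5py

def signal_lengths(fast5_dir):
    # Yield the raw signal length of each read in each single-read fast5.
    for fn in glob.glob(os.path.join(fast5_dir, '*.fast5')):
        with h5py.File(fn, 'r') as f5:
            for read in f5['Raw/Reads'].values():
                yield read['Signal'].shape[0]

lengths = sorted(signal_lengths('path/to/fast5dir'))
print('longest raw signal:', lengths[-1], 'samples')
print('reads over 1M raw samples:', sum(n > 1_000_000 for n in lengths))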
I tried it on the first batch of 6148 reads from this run (because MinKNOW batches reads like that) as a test set. This also fails with the same error right at the end of processing, after about 6hrs of runtime (so the delay between getting close to the end of processing and the core dump is related to the number of reads?? n=2). Here is the full output:
[10:19:15] Loading minimap2 reference.
[10:19:19] Getting file list.
[10:19:20] Loading default canonical ***** RNA ***** model.
[10:19:20] Re-squiggling reads (raw signal to genomic sequence alignment).
5%|▍ | 298/6148 [00:07<02:38, 37.07it/s]
10%|█ | 622/6148 [00:16<02:24, 38.42it/s]
15%|█▍ | 899/6148 [00:23<02:18, 37.98it/s]
18%|█▊ | 1137/6148 [00:31<02:19, 35.91it/s]
21%|██▏ | 1314/6148 [00:39<02:24, 33.49it/s]
25%|██▌ | 1560/6148 [00:46<02:18, 33.33it/s]
29%|██▉ | 1798/6148 [00:53<02:12, 33.05it/s]
33%|███▏ | 1985/6148 [01:01<02:08, 32.37it/s]
35%|███▌ | 2169/6148 [01:08<02:05, 31.74it/s]
38%|███▊ | 2321/6148 [01:15<02:04, 30.73it/s]
39%|███▉ | 2414/6148 [01:22<02:08, 29.12it/s]
43%|████▎ | 2620/6148 [01:32<02:05, 28.20it/s]
47%|████▋ | 2890/6148 [01:43<01:56, 28.15it/s]
52%|█████▏ | 3180/6148 [01:52<01:46, 28.06it/s]
56%|█████▌ | 3450/6148 [02:04<01:37, 27.71it/s]
60%|█████▉ | 3660/6148 [02:12<01:29, 27.72it/s]
63%|██████▎ | 3870/6148 [02:20<01:22, 27.63it/s]
66%|██████▌ | 4050/6148 [02:27<01:16, 27.54it/s]
70%|██████▉ | 4270/6148 [02:35<01:08, 27.53it/s]
73%|███████▎ | 4490/6148 [02:43<01:00, 27.44it/s]
76%|███████▌ | 4680/6148 [02:52<00:54, 27.15it/s]
80%|███████▉ | 4880/6148 [03:02<00:47, 26.79it/s]
84%|████████▍ | 5150/6148 [03:12<00:37, 26.71it/s]
88%|████████▊ | 5390/6148 [03:22<00:28, 26.65it/s]
92%|█████████ | 5610/6148 [03:31<00:20, 26.49it/s]
95%|█████████▍| 5840/6148 [03:39<00:11, 26.63it/s]
99%|█████████▊| 6030/6148 [03:48<00:04, 26.35it/s]
100%|█████████▉| 6139/6148 [03:49<00:00, 26.71it/s]
*** glibc detected *** more: double free or corruption (!prev): 0x00000000010f4ef0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x30a6275dee]
/lib64/libc.so.6[0x30a6278c80]
/lib64/libc.so.6(fclose+0x14d)[0x30a626633d]
more[0x405bd4]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x30a621ed1d]
more[0x401a89]
======= Memory map: ========
00400000-00408000 r-xp 00000000 fd:01 6029373 /bin/more
00608000-00609000 rw-p 00008000 fd:01 6029373 /bin/more
010f2000-01113000 rw-p 00000000 00:00 0 [heap]
30a5e00000-30a5e20000 r-xp 00000000 fd:01 2359654 /lib64/ld-2.12.so
30a6020000-30a6021000 r--p 00020000 fd:01 2359654 /lib64/ld-2.12.so
30a6021000-30a6022000 rw-p 00021000 fd:01 2359654 /lib64/ld-2.12.so
30a6022000-30a6023000 rw-p 00000000 00:00 0
30a6200000-30a638a000 r-xp 00000000 fd:01 2361686 /lib64/libc-2.12.so
30a638a000-30a658a000 ---p 0018a000 fd:01 2361686 /lib64/libc-2.12.so
30a658a000-30a658e000 r--p 0018a000 fd:01 2361686 /lib64/libc-2.12.so
30a658e000-30a6590000 rw-p 0018e000 fd:01 2361686 /lib64/libc-2.12.so
30a6590000-30a6594000 rw-p 00000000 00:00 0
30a7e00000-30a7e16000 r-xp 00000000 fd:01 2361690 /lib64/libgcc_s-4.4.7-20120601.so.1
30a7e16000-30a8015000 ---p 00016000 fd:01 2361690 /lib64/libgcc_s-4.4.7-20120601.so.1
30a8015000-30a8016000 rw-p 00015000 fd:01 2361690 /lib64/libgcc_s-4.4.7-20120601.so.1
30b4c00000-30b4c1d000 r-xp 00000000 fd:01 2361727 /lib64/libtinfo.so.5.7
30b4c1d000-30b4e1c000 ---p 0001d000 fd:01 2361727 /lib64/libtinfo.so.5.7
30b4e1c000-30b4e20000 rw-p 0001c000 fd:01 2361727 /lib64/libtinfo.so.5.7
30b4e20000-30b4e21000 rw-p 00000000 00:00 0
2b4eab014000-2b4eab017000 rw-p 00000000 00:00 0
2b4eab017000-2b4eb0eac000 r--p 00000000 fd:01 6835494 /usr/lib/locale/locale-archive
2b4eb0eac000-2b4eb0ead000 rw-p 00000000 00:00 0
2b4eb0f2c000-2b4eb0f33000 r--s 00000000 fd:01 6960203 /usr/lib64/gconv/gconv-modules.cache
2b4eb0f33000-2b4eb0f34000 rw-p 00000000 00:00 0
2b4eb4000000-2b4eb4021000 rw-p 00000000 00:00 0
2b4eb4021000-2b4eb8000000 ---p 00000000 00:00 0
7ffd55018000-7ffd5502e000 rw-p 00000000 00:00 0 [stack]
7ffd55143000-7ffd55144000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Abort (core dumped)
Does the progress output continue for the 6 hours? Or does that output actually complete in the 3-4 minutes indicated in the progress message and then hang for the next 6 hours?
The progress shown completes in a few minutes, then the process stops logging output and crashes. I hesitate to say it hangs exactly - it's registered as running according to our cluster monitoring software, but it certainly doesn't log anything.
Doing a bit of googling, this appears to potentially be a numpy version/gcc compiler version issue. What is your numpy version?
- numpy=1.14.3=py36h14a74c5_0
- numpy-base=1.14.3=py36hdbf6ddf_0
installed from conda
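For comparing the failing and working environments, numpy can also report its own build configuration, which is where binary-incompatibility issues tend to show up. A quick sketch:

import numpy as np

# Print the resolved numpy version and the compiler/BLAS configuration it
# was built against; useful when comparing envs for binary-compat issues.
print(np.__version__)
np.show_config()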
So I tried this again in a completely fresh conda env with the small ~6k-read DRS dataset and I'm still getting the same problem, two reads short of the end. The fresh conda env with tombo installed from conda has numpy 1.15.0. Here is what I did:
> conda create -n tombotest -c bioconda ont-tombo
> source activate tombotest
> qsub -b y -V -cwd -pe smp 24 -q all.q tombo resquiggle pathto/fast5 pathto/transcriptome.fasta --processes 24 --overwrite --ignore-read-locks
The conda env export is attached for versions, and the cluster error log file shows the system write and the crash report. As far as I'm aware the accessible gcc version is the system version, 4.4.7.
I'm at a bit of a loss here. Doing a bit of research on these types of errors, it appears as though this is happening within a call to more, since that is noted right after the *** glibc detected *** text. I am a bit confused by this, but it suggests to me that this may be an issue with the SGE job scheduler on your system, potentially to do with the logging from that system. Are you able to complete this command outside of a job submission system?
I ran the command on the login node of the cluster (::looks nervously around for angry sysadmins::) and got the same error:
(tombo) bash-4.1$ tombo resquiggle /cluster/gjb_lab/cdr/ON_MinION_datastore/20170725_1532_fpa8_sample3_2222/fast5 /cluster/gjb_lab/nschurch/Projects/Arabidopsis_annotations/Araport/v11/201606/Araport11_genes.201606.cdna.fasta --processes 4 --overwrite --ignore-read-locks
_frozen_importlib:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
[10:38:01] Loading minimap2 reference.
[10:38:08] Getting file list.
[10:38:09] Loading default canonical ***** RNA ***** model.
[10:38:09] Re-squiggling reads (raw signal to genomic sequence alignment).
47%|█████████████████████████████████████████████████████████████████████████████████████████████▏ | 2910/6148 [03:38<04:03, 13.30it/s]Exception in thread Thread-3:
Traceback (most recent call last):
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1564, in _io_and_mappy_thread_worker
fast5_fn, corr_grp, overwrite, True, bc_grp, return_fp=True)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/tombo_helper.py", line 2108, in prep_fast5
corr_grp = analyses_grp.create_group(corr_grp)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/h5py/_hl/group.py", line 60, in create_group
gid = h5g.create(self.id, name, lcpl=lcpl, gcpl=gcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5g.pyx", line 161, in h5py.h5g.create
ValueError: Unable to create group (name already exists)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 6139/6148 [09:59<00:00, 10.24it/s]
^CProcess Process-6:
Process Process-7:
Process Process-4:
Traceback (most recent call last):
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1476, in _resquiggle_worker
for conn_num, rsqgl_conn in enumerate(rsqgl_conns)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1479, in _resquiggle_worker
sleep(0.1)
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1638, in _get_progress_fail_queues
tot_num_rec_proc += progress_q.get(block=False)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/queues.py", line 107, in get
raise Empty
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1724, in _get_index_queue
reads_index.add_read_data(*index_q.get(block=False))
queue.Empty
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/queues.py", line 107, in get
raise Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
queue.Empty
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1646, in _get_progress_fail_queues
errorType, fn, is_tombo_error = failed_reads_q.get(block=False)
During handling of the above exception, another exception occurred:
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/queues.py", line 107, in get
raise Empty
Traceback (most recent call last):
queue.Empty
During handling of the above exception, another exception occurred:
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
Traceback (most recent call last):
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1728, in _get_index_queue
sleep(0.1)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
KeyboardInterrupt
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1663, in _get_progress_fail_queues
sleep(0.5)
KeyboardInterrupt
Traceback (most recent call last):
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/bin/tombo", line 11, in <module>
sys.exit(main())
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/__main__.py", line 192, in main
resquiggle._resquiggle_main(args)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1969, in _resquiggle_main
args.num_most_common_errors)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/site-packages/tombo/resquiggle.py", line 1821, in resquiggle_all_reads
rsqgl_p.join()
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/cluster/gjb_lab/nschurch/cluster_installs/miniconda2/envs/tombo/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
The errors at the end are caused by me ctrl-C-ing after hours of it hanging. I'm downgrading to tombo 1.2.1.2 - the version that was working previously - to check that it still works and that something screwy isn't going on with our system here...
This is actually a bit informative. The (non-KeyboardInterrupt) error is occurring immediately after the line that is supposed to delete that hdf5 group (https://github.com/nanoporetech/tombo/blob/1.4/tombo/tombo_helper.py#L2108), which the error indicates is still there. So it looks like I should probably upgrade the error handling a bit at this point in the code. Given that you are also providing the --ignore-read-locks flag, I am guessing that there may be some zombie processes around that are still accessing this file.
I have tried to create a setup where the child processes are killed when a tombo process is killed (and added the extra layer of the reads directory lock), but it has proven difficult to make this robust across various architectures/dependency versions. In any case, it seems as though there are other processes still active and accessing/writing to these read files. This is likely not the root of the problem here (why a few reads are taking a long time to process), but it is most likely causing the more recent errors, and possibly the glibc error as well.
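Something along these lines would make that spot more forgiving of a leftover group (a minimal sketch, not the actual tombo code; the function and argument names are illustrative):

def prep_corrected_group(analyses_grp, corr_grp_name):
    # analyses_grp: an open h5py.Group ('Analyses') inside the read file.
    # A previous (possibly killed) run can leave the corrected group
    # behind; unlink it before re-creating instead of assuming it's gone.
    if corr_grp_name in analyses_grp:
        del analyses_grp[corr_grp_name]
    return analyses_grp.create_group(corr_grp_name)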
hmmm, I can't see any zombie processes on our cluster management system that might be causing the error, but who knows.
The downgrade to 1.2.1.2 (via pip) in a new conda env was successful and the code ran fine:
bash-4.1$ conda env export | grep tombo
- ont-tombo==1.2.1.2
(tombo_old) bash-4.1$ tombo -v
tombo version: 1.2.1.2
(tombo_old) bash-4.1$ tombo resquiggle /cluster/gjb_lab/cdr/ON_MinION_datastore/20170725_1532_fpa8_sample3_2222/fast5 /cluster/gjb_lab/nschurch/Projects/Arabidopsis_annotations/Araport/v11/201606/Araport11_genes.201606.cdna.fasta --processes 4 --overwrite --ignore-read-locks
Loading minimap2 reference.
Getting file list.
Using default canonical ***** RNA ***** model.
Estimating global scale parameter
Correcting 6148 files with 1 subgroup(s)/read(s) each (Will print a dot for each 1000 reads completed).
......
Failed reads summary (3262 total failed):
Alignment not produced : 1483
Error opening or writing to fast5 file : 1
Fastq slot not present in --basecall-group : 1119
Not enough raw signal around potential genomic deletion(s) : 3
Poor raw to expected signal matching : 470
Poor raw to expected signal matching at read start : 178
Read event to sequence alignment extends beyond --bandwidth : 8
So it doesn't look like anything funny in our system is causing the problem. I'll try upgrading again fresh and see if things still fail...
I have not had a chance to dig into this enough, but I am starting to think this might have to do with a cython/numpy memory allocation issue. I think this is brought out by RNA runs since the signal is generally much longer for the same number of bases and thus requests more memory from the system. All just guesses at this point, but this may point towards a potential fix.
@marcus1487 I can comment that this is not only for RNA; I have a DNA run with several very long reads: >10 over 200kb and about 1,000 over 100kb. Is this entirely what is contributing to the very high memory consumption? Are you putting these into numpy arrays? I can imagine constructing a numpy array for a matrix this size would be memory-hungry. Here are some specs for a failed run (I have not had a successful run, but I currently have one going, using 12 cores: 18 processes, 35 threads). Here is my job output: tombo resquiggle ./SingleFast5s/ /research/rgs01/reference/public/genomes/Homo_sapiens/GRCh37-lite/FASTA/GRCh37-lite.fa --dna --processes 12 --overwrite --num-most-common-errors 5
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit. Exited with signal termination: 9.
Resource usage summary:
CPU time : 641464.00 sec.
Max Memory : 612910 MB
Average Memory : 301615.03 MB
Total Requested Memory : 600000.00 MB
Delta Memory : -12910.00 MB
Max Swap : 909640 MB
Max Processes : 18
Max Threads : 35
Run time : 143396 sec.
Turnaround time : 143412 sec.
The output (if any) follows:
[15:57:13] Loading minimap2 reference.
[15:59:33] Getting file list.
[16:05:16] Re-squiggling reads (raw signal to genomic sequence alignment).
The single largest resquiggle run used over 600GB before dying, which is insane. I have one now using almost 1TB (12 cores * 80GB per core = 960GB), but if you have to go much larger than this, it seems unusable on long nanopore reads.
Hi @nickschurch,
This is certainly what is causing the very high memory consumption. As a temporary workaround, the resquiggle command has added the --signal-length-range and --sequence-length-range arguments (as discussed in #139; help found with tombo resquiggle --print-advanced-arguments), though these simply skip such reads.
For the current version of the resquiggle command this is actually the best path to results, as it is very unlikely that a read this long would resquiggle successfully. This is because either the reference or the read is likely to contain a large enough observed indel to throw the resquiggle algorithm out of the adaptive bandwidth. Trying to resquiggle these ultra-long reads is therefore likely to fail after a long time and a lot of memory consumption.
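For example (the two-value lower/upper form is per --print-advanced-arguments; the bounds here are illustrative, roughly capping reads at ~200kb of signal at ~10 raw obs per DNA base and 100kb of sequence):

tombo resquiggle path/to/fast5dir reference.fasta --dna --processes 12 --overwrite --signal-length-range 0 2000000 --sequence-length-range 0 100000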
On the topic of memory consumption, tombo currently allocates an array with dimensions n_events (determined dynamically within tombo) by bandwidth. If a read then fails to resquiggle for any reason, a larger bandwidth is attempted to "save" the read. So for a 200kb read we would have ~2M raw observations and ~400k events. The default save_bandwidth is 1500 events, so a numpy array with dimensions 400k by 1500 is allocated to attempt re-squiggling.
Unfortunately, the solution for this is not simple. First, in order for ultra-long reads to have a good chance of resquiggling successfully, a split-mapping approach would have to be supported. Additionally, some sort of chunking and stitching (similar to guppy) would have to be implemented so that long contiguous mapped stretches can be completed within a reasonable memory footprint. Given that this is intended as a research tool, this work is not planned for this repo at this time.
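To put a number on that (back-of-envelope only, assuming one 8-byte float per band cell):

import numpy as np

# Size of the save_bandwidth rescue array described above, for one read.
n_events = 400_000       # ~400k events for a ~200kb read
save_bandwidth = 1_500   # default save bandwidth, in events
cell_bytes = np.dtype(np.float64).itemsize   # 8 bytes, if cells are float64

gb = n_events * save_bandwidth * cell_bytes / 1e9
print(f'~{gb:.1f} GB for a single rescue attempt')   # ~4.8 GB

Several worker processes each holding an array like this (on top of the initial, narrower-bandwidth attempts) multiplies that footprint quickly.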
Hi marcus, thanks for the above post. I think best practice going forward will be to drop reads >100kb. As for now, my current run has about 1k reads over 100kb. It is running with about 1TB of RAM in total. It has been going for about 8 days now and looks to be finishing soon (based on the last date of file modification). All the singleFast5 directories appear to be almost done resquiggling. I will update with success or failure, but I thought it might be interesting for you to know.
Dennis
Check this out - it was a monster of a run, but it completed successfully. The longest read in this run was ~300kb and I'm not sure if the alignment was produced yet, but I thought I'd share the specs so people know what to expect in terms of RAM:
Your job looked like:
LSBATCH: User input
tombo resquiggle ./SingleFast5s/ ./genomes/Homo_sapiens/GRCh37-lite/FASTA/GRCh37-lite.fa --dna --overwrite --num-most-common-errors 5
Successfully completed.
Resource usage summary:
CPU time : 465244.66 sec.
Max Memory : 746462 MB
Average Memory : 356162.56 MB
Total Requested Memory : 960000.00 MB
Delta Memory : 213538.00 MB
Max Swap : 831001 MB
Max Processes : 7
Max Threads : 67
Run time : 586384 sec.
Turnaround time : 586402 sec.
Hi, I was looking for recommendations on the specification of the machine to use for running tombo steps on RNA reads and came across this issue. I can see the resource summary of a successful job, but I wondered if you have any general recommendations in terms of RAM, GPU vs CPU, etc. for handling RNA data. I would appreciate it if you could point me to that. Thanks!
We do not have general recommendations for Tombo RNA processing. There are many dataset variables and Tombo settings that can affect the compute resources required. This allows for a robust tool set, but does not lend itself easily to a generic compute resource recommendation. Best of luck with your RNA research!