sleuth icon indicating copy to clipboard operation
sleuth copied to clipboard

HDF5 files not being read properly (hdf5 lib incompatibility, or base R and rhdf5 clash?)

Open channsoden opened this issue 7 years ago • 39 comments

sleuth_prep seems to always error with "subscript out of bounds" during step "normalizing est_counts". I'm trying this on my own data since, as others have pointed out, the link to the sample data is broken.

design <- read.table(file.path(".", "metadata_table.tsv"),
                      header = TRUE,
                      stringsAsFactors=FALSE)
s2c <- dplyr::select(design, sample, condition = carbon, path)
s2c

sample | condition | path
-- | -- | --
MM10_16C_L_wt | L | wt/MM10_16C_L_corrected_kallisto
MM10_16C_R_wt | R | wt/MM10_16C_R_corrected_kallisto
MM10_22C_L_wt | L | wt/MM10_22C_L_corrected_kallisto
MM10_22C_R_wt | R | wt/MM10_22C_R_corrected_kallisto
MM18_16C_L_wt | L | wt/MM18_16C_L_corrected_kallisto
MM18_16C_R_wt | R | wt/MM18_16C_R_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_1_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_2_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_3_corrected_kallisto
MM18_22C_L_wt | L | wt/MM18_22C_L_corrected_kallisto
MM18_22C_R_wt | R | wt/MM18_22C_R_corrected_kallisto
MM20_16C_L_wt | L | wt/MM20_16C_L_corrected_kallisto
MM20_16C_R_wt | R | wt/MM20_16C_R_corrected_kallisto
MM20_22C_L_wt | L | wt/MM20_22C_L_corrected_kallisto
...

so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE)

reading in kallisto results
dropping unused factor levels
..............................................................................................
normalizing est_counts
Error in result[, which_order, drop = FALSE]: subscript out of bounds
Traceback:

1. sleuth_prep(s2c, extra_bootstrap_summary = TRUE)
2. spread_abundance_by(obs_raw, "est_counts", sample_to_covariates$sample)

Any advice on troubleshooting this?

channsoden avatar May 25 '18 21:05 channsoden

Hi @channsoden,

Thanks for using the tool, and for reporting your issue. To start identifying the problem, I would suggest running the following code, which is using some of the guts of sleuth_prep to help (assume you have s2c with path already created):

## collect the data from your kallisto results
kal_list <- lapply(seq_along(s2c$path), function(i) {
  path <- s2c$path[i]
  suppressMessages(kal <- read_kallisto(path, read_bootstrap = FALSE))
  kal$abundance <- dplyr::mutate(kal$abundance, sample = s2c$sample[i])
  kal
})
## make the 'obs_raw' data frame
obs_raw <- dplyr::bind_rows(lapply(kal_list, function(k) k$abundance))
## FIRST CHECK
stopifnot(all(s2c$sample %in% obs_raw$sample))
## THE ABOVE SHOULD PASS; IF IT FAILS, RUN:
s2c$sample[which(!(s2c$sample %in% obs_raw$sample))]
## to ID which samples are causing the problem
## This is the problem

## If the above passes, then that isn't the issue
## now run the code below (guts of 'spread_abundance_by'):
abund <- data.table::as.data.table(obs_raw)
var_spread <- data.table::dcast(abund, target_id ~ sample, 
        value.var = "est_counts")
var_spread <- var_spread[order(var_spread$target_id), ]
var_spread <- as.data.frame(var_spread, stringsAsFactors = FALSE)
rownames(var_spread) <- var_spread$target_id
var_spread["target_id"] <- NULL
result <- as.matrix(var_spread)
s2c$sample[which(!(s2c$sample %in% colnames(result)))]
### The above line should identify which samples are causing a problem
### If this doesn't yield anything, then we have something very unusual happening

Let me know if that helps identify the problematic samples.

warrenmcg avatar May 25 '18 21:05 warrenmcg

Thank you for your prompt reply!

It does indeed fail at the first check - obs_raw is entirely empty.

channsoden avatar May 25 '18 21:05 channsoden

The read_kallisto function doesn't look like it's succeeding. e.g.:

read_kallisto("wt/W795_22C_L_corrected_kallisto", read_bootstrap = FALSE) kallisto object

transcripts: 0 bootstraps: 0

channsoden avatar May 25 '18 21:05 channsoden

It seems like the kallisto files might be empty, or otherwise something went really wrong with the input process.

If the files are empty, though that pushes the troubleshooting for you back to the kallisto step, it's still important for us because it means we should put a sanity check to make sure the file has something in it.

If the kallisto file is present and an appropriate size (a few MBs usually), would you mind sending me an example kallisto file for me to do some additional troubleshooting on my end?

warrenmcg avatar May 25 '18 21:05 warrenmcg

The files don't look empty to me. The tsvs look sane enough, and the h5s are several MB. The error logs for the kallisto runs all looked normal enough to me too.

Here's the above sample: https://drive.google.com/file/d/1puBadjckcFas0pDHZsf4UOdlYVFdMvP-/view?usp=sharing

channsoden avatar May 25 '18 21:05 channsoden

Hi @channsoden,

If the kallisto log files do not show any errors, then the abundance.h5 files are corrupted. The one you sent me can be read if you do the following line:

test_kal <- rhdf5::h5read('wt/W795_22C_L_corrected_kallisto/abundance.h5', '/')
rhdf5::H5close()

When I did this, I first of all got several read errors. I also found at least three things that stood out as troubling:

  1. test_kal$est_counts is NULL, which it should not be.
  2. test_kal$aux$ids was also empty.
  3. test_kal$aux$lengths was also showing very weird results.

As I have seen with other issues, it's probably not an issue with kallisto or sleuth, but with whatever cloud storage you're using to transfer files around. Let me know if you get different results when you try the h5read code above out on your end.

warrenmcg avatar May 25 '18 22:05 warrenmcg

The fields of test_kal are entirely NULL. Weird. I'll run a checksum on the original files, maybe try running Kallisto over again to see if I get something different.

channsoden avatar May 25 '18 22:05 channsoden

The odd thing for me is that your abundance.tsv file that you sent me does not have anything grossly wrong with it. It's just your abundance.h5 file. My guess is that it's a sync/cloud issue, but feel free to report an error to the kallisto team if the problem persists and you can confirm it's not related to syncing / cloud storage problems.

warrenmcg avatar May 25 '18 22:05 warrenmcg

FYI, the checksums check out, so if they got corrupted it was on my serve filesystem somehow.

channsoden avatar May 25 '18 22:05 channsoden

@channsoden This is definitely a kallisto issue. I've downloaded the files and played with them a bit and am getting a lot of NULL values.

But before we open report a bug there, I noticed you are running version 0.43.0. Any chance you can update to the latest v0.44.0 and rerun?

Thanks,

Harold

pimentel avatar Jun 03 '18 19:06 pimentel

I repeated the run with v.0.44.0 to the same effect. Should I open up a bug for Kallisto? I can provide you with additional sample data for the 0.44.0 run.

channsoden avatar Jun 06 '18 00:06 channsoden

Hi @channsoden,

thanks for verifying. Yes, please open a kallisto bug: https://github.com/pachterlab/kallisto/issues

Things that will be important:

  • Noting the reference to build the index
  • System info (e.g. OS, GCC version)
  • HDF5 version
  • If you built from source or if you used the pre-built binary
  • ld $(which kallisto)
  • An example run used v0.44.0

Thanks!

Harold

pimentel avatar Jun 06 '18 01:06 pimentel

Sorry, that's supposed to be;

ldd $(which kallisto)

pimentel avatar Jun 06 '18 01:06 pimentel

@channsoden I looked at your files in the kallisto bug report and I'm not having any issues with these. Is the error the same or different now? Note: with your old files I couldn't read them, these ones I can. Is it possible you didn't update the path in "metadata_table.tsv"? (I only ask because I am guilty of doing this sort of thing myself)

Here is what I get:

R> Sys.glob('*/abundance.h5')
[1] "MM2_16C_L_A100-A101-A102_L001_corrected_kallisto/abundance.h5"
[2] "MM2_16C_L_A100-A101-A102_L002_corrected_kallisto/abundance.h5"
[3] "MM2_16C_L_A100-A101-A102_L003_corrected_kallisto/abundance.h5"
R> x = Sys.glob('*/abundance.h5')
R> tmp = lapply(x, read_kallisto)
Found 100 bootstrap samples
Found 100 bootstrap samples
Found 100 bootstrap samples
R> tmp[[1]]
	kallisto object

transcripts:  10345
original number of transcripts:  10345
Original or Subset:  Subsetbootstraps:  100
R> tmp[[1]]$abundance %>% head
     target_id est_counts  eff_len  len        tpm
1 NEUDI_100023        199 1849.681 2060   6.390675
2 NEUDI_100042        415 1148.070 1309  21.471897
3 NEUDI_100045        400 3507.601 3708   6.773926
4 NEUDI_100052       4863 1643.335 1866 175.779709
5 NEUDI_100057        857 1519.923 1697  33.492676
6 NEUDI_100075       1695 2522.850 2703  39.908822

Everything looks fairly normal now?

pimentel avatar Jun 06 '18 20:06 pimentel

@pimentel Yes, I am still getting the exact same errors. I put the new files in same path as before, and when I do

test_kal <- rhdf5::h5read('MM2_16C_L_A100-A101-A102_L001_corrected_kallisto/abundance.h5', '/')  
rhdf5::H5close()

the h5read function doesn't throw any errors but test_kal is NULL in all fields.

channsoden avatar Jun 07 '18 16:06 channsoden

ah, not the case for me. it definitely works and looks correct for me now.

in some ways this is good and bad. Perhaps good for me since it means there probably isn't a hdf5 bug in kallisto. bad for you because it means that there's something weird with your rhdf5.

do you have the H5 utilities installed on that machine? if so, what do you get when you type()

h5dump -d /est_counts abundance.h5

pimentel avatar Jun 08 '18 17:06 pimentel

Hey, thanks for helping me trouble-shoot this. Good to narrow down the source of the problem - I may try a different machine to see if I can replicate.

The dump seems to work:

HDF5 "abundance.h5" { DATASET "/est_counts" { DATATYPE H5T_IEEE_F64LE DATASPACE SIMPLE { ( 10345 ) / ( 10345 ) } DATA { (0): 202, 310, 306, 2492, 301, 1077, 466, 398, 1242, 214, 1373, 358, 302, (13): 203, 766, 3540, 898, 2216, 764, 1467, 261, 358, 1279, 517, 29, 123, (26): 138, 451, 118, 150, 565, 6141, 558, 126, 235, 5, 174, 255, 304, 50, (40): 31, 287, 204, 436, 59, 1636, 780, 63, 2593, 351, 292.011, 852, 1295, (53): 50, 465, 170, 165, 10, 301, 440, 121, 242, 975, 146, 562, 2486, 260, . . . (10340): 0, 0, 0, 0.2, 0 } } }

channsoden avatar Jun 08 '18 17:06 channsoden

@channsoden, what version of rhdf5 do you have installed? In other words, it will be helpful to see your sessionInfo() output after the error.

warrenmcg avatar Jun 08 '18 19:06 warrenmcg

This affects me with rhdf5 2.22.0 and 2.25.3.

Here's my sessionInfo immediately after the error:

R version 3.4.3 (2017-11-30) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.4 LTS

Matrix products: default BLAS: /home/christopher/anaconda3/lib/R/lib/libRblas.so LAPACK: /home/christopher/anaconda3/lib/R/lib/libRlapack.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] bindrcpp_0.2.2 sleuth_0.29.0 dplyr_0.7.5
[4] ggplot2_2.2.1 RevoUtils_10.0.8 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached): [1] Rcpp_0.12.17 compiler_3.4.3 pillar_1.2.3
[4] plyr_1.8.4 bindr_0.1.1 base64enc_0.1-3
[7] tools_3.4.3 digest_0.6.15 uuid_0.1-2
[10] rhdf5_2.25.3 jsonlite_1.5 evaluate_0.10.1
[13] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1
[16] rlang_0.2.1 IRdisplay_0.5.0 parallel_3.4.3
[19] IRkernel_0.8.12.9000 repr_0.15.0 stringr_1.3.1
[22] grid_3.4.3 tidyselect_0.2.4 data.table_1.11.4
[25] glue_1.2.0 R6_2.2.2 pbdZMQ_0.3-3
[28] tidyr_0.8.1 Rhdf5lib_1.0.0 purrr_0.2.5
[31] magrittr_1.5 scales_0.5.0 htmltools_0.3.6
[34] assertthat_0.2.0 colorspace_1.3-2 stringi_1.2.2
[37] lazyeval_0.2.1 munsell_0.4.3 crayon_1.3.4

channsoden avatar Jun 08 '18 20:06 channsoden

Interesting, I'm running rhdf5 version 2.18.0. If Harold is also running an older version, this would put the newer versions of rhdf5 as the cause. Still don't know the exact nature of the cause, but it seems we're getting closer?

warrenmcg avatar Jun 08 '18 20:06 warrenmcg

@pimentel and @channsoden, after some thinking, here's an idea:

  • kallisto uses whatever system hdf5 library is available (on a linux, which h5cc would show that path and h5cc -showconfig | head -6 would show the version number)
  • after digging into rhdf5, it uses its own version of the hdf5 library internally (this can be checked by running rhdf5::h5version())

Chris, if you run those, do you get different versions?? If my hypothesis is correct, this bug is occurring because the system hdf5 lib (possibly 1.10?) is newer than the hdf5 lib (which will be 1.8.x) used by rhdf5, so it can't recognize the new-format HDF5 files, as discussed on the HDF5 website.

warrenmcg avatar Jun 08 '18 20:06 warrenmcg

You're on the nose with the versions.

Server (Kallisto): HDF5 Version: 1.10.1 Local (Sleuth): This is Bioconductor rhdf5 2.25.3 linking to C-library HDF5 1.8.19

So I need to install the new HDF5 library locally? Any guidance there, or shall I just start BASHing away?

channsoden avatar Jun 08 '18 21:06 channsoden

rhdf5 installs its own internally, so we either would have to ask the developer there to switch the package to hdf5 lib 1.10, or you would need to install the older version of hdf5 lib on your machine (1.8.x). What I don’t know is whether it works to have both versions on your machine and which kallisto will use. If you’re comfortable experimenting, that would be the next step.

@pimentel, from our end, I think the best thing to do is open an issue in kallisto to see if the hdf5 lib version can be included either in the h5 files themselves or the run info json file. It would also be worth it to include a warning of hdf5 lib is version 1.10, since it can’t currently be processed by sleuth. Then, sleuth could do a check to make sure that the rhdf5 internal hdf5 lib version is compatible with the version used by kallisto.

warrenmcg avatar Jun 08 '18 23:06 warrenmcg

ah, somehow I missed this issue:

https://github.com/pachterlab/kallisto/issues/171#issuecomment-387706627

I think we should really file something with the rhdf5 people. the way that kallisto works is that hdf5 is built into the binary

pimentel avatar Jun 09 '18 22:06 pimentel

@warrenmcg For my purposes installing hdf5 1.8.x seems untenable. I am working on an institutional cluster and would need to install it alongside 1.10 within my home directory. I foresee many headaches getting the environment configured correctly. . .

channsoden avatar Jun 10 '18 01:06 channsoden

@pimentel, what version of hdf5 is built into the prebuilt binaries that are downloadable? How could a user find this out? In the short-term, I think the only we can get @channsoden up and running as quickly as possible is to give him a pre-built binary built on hdf5 lib version 1.8.x. Could I send @channsoden my binary of kallisto, for example, which I presume is built on hdf5 lib 1.8.x?

warrenmcg avatar Jun 11 '18 21:06 warrenmcg

The kallisto binary we distribute always uses hdf5 version 1.8.15

pmelsted avatar Jun 11 '18 22:06 pmelsted

Thanks for the info @pmelsted!

So, that argues against my idea that it’s a clash of hdf5 libs and leaves us with the weird issue described over at kallisto. @channsoden, could you confirm which version of R you are running? We can collect all of this and open an issue on rhdf5

warrenmcg avatar Jun 12 '18 03:06 warrenmcg

I am indeed running 3.4.3. Here's my sessionInfo():

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /home/christopher/anaconda3/lib/R/lib/libRblas.so
LAPACK: /home/christopher/anaconda3/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.28.0 RevoUtils_10.0.8     RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
 [1] httr_1.3.1      compiler_3.4.3  R6_2.2.2        tools_3.4.3    
 [5] withr_2.1.2     curl_3.2        memoise_1.1.0   git2r_0.21.0   
 [9] digest_0.6.15   devtools_1.13.5

channsoden avatar Jun 12 '18 23:06 channsoden

I've been working with Mike Smith of rhdf5 to try to get it to use the system libraries, but I have so far not been able to get that installed. I have also been trying some other workarounds to no avail. Eventually I got fed up and just installed it within a Docker container. Success!

The image based on Jupyter's r-notebook, and is hosted on Docker Cloud. Use docker run -p 8888:8888 -v /path/to/kallisto/results:/home/jovyan/kallisto channsoden/sleuth to start the container. Then copy the token and point your browser to http://localhost:8888/?tocken=[TOKEN]. The directory mounted to /home/jovyan/kallisto is shared between the host and container filesystems, so changes or output written there is easy to grab.

channsoden avatar Jun 14 '18 22:06 channsoden