[bat_int] hcla dataset cannot be processed

Open · KaiWaldrant opened this issue 2 years ago · 1 comment

Describe the bug

Processing the cellxgene dataset hcla with the batch integration dataset processor raises an error:

Traceback (most recent call last):
  File "/tmp/nxf.DNtRDk7Ba7/.viash_script.sh", line 60, in <module>
    adata_with_hvg = compute_batched_hvg(input, n_hvgs=par['hvgs'])
  File "/tmp/nxf.DNtRDk7Ba7/.viash_script.sh", line 49, in compute_batched_hvg
    hvg_list = scib.pp.hvg_batch(
  File "/usr/local/lib/python3.10/site-packages/scib/preprocessing.py", line 504, in hvg_batch
    sc.pp.highly_variable_genes(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 469, in highly_variable_genes
    hvg = _highly_variable_genes_single_batch(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 248, in _highly_variable_genes_single_batch
    df['mean_bin'] = pd.cut(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 293, in cut
    fac, bins = _bins_to_cuts(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 421, in _bins_to_cuts
    raise ValueError(
ValueError: Bin edges must be unique: array([      -inf, 0.00013226, 0.00014937, 0.00014937, 0.00016693,
       0.00016958, 0.00016958, 0.00018138, 0.0001983 , 0.0001983 ,
       0.00020949, 0.00020964, 0.00021124, 0.00030184, 0.00034767,
       0.00037922, 0.00048056, 0.00062685, 0.00096363, 0.007547  ,
              inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
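
For reference, the pandas error itself can be reproduced outside the pipeline. The sketch below is not the scanpy code path; it assumes the bin edges are percentiles of per-gene means (the 21 edges in the message are consistent with that) and uses made-up values with many ties so that several percentiles coincide. It also shows the `duplicates` kwarg that the error message mentions.

```python
import numpy as np
import pandas as pd

# Toy per-gene means with many ties (made-up values, not taken from hcla).
means = pd.Series([0.1] * 8 + [0.2, 0.5, 1.0, 5.0])

# Percentile-based edges padded with -inf/inf; the tied values make
# several neighbouring percentiles identical.
edges = np.r_[-np.inf, np.percentile(means, np.arange(10, 105, 5)), np.inf]

try:
    pd.cut(means, edges)  # default is duplicates="raise"
    raised = False
    message = ""
except ValueError as err:
    raised = True
    message = str(err)

# duplicates="drop" collapses the repeated edges instead of raising.
binned = pd.cut(means, edges, duplicates="drop")

print(raised, message.splitlines()[0] if message else "")
print(binned.value_counts().sort_index())
```

Whether dropping duplicate edges is the right fix for the HVG selection is a separate question; this only illustrates why `pd.cut` refuses the edges.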

https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/4L4a4swu0PnnpT

To Reproduce

Steps to reproduce the behavior:

bash src/tasks/batch_integration/resources_scripts/process_datasets.sh

Feb 05 '24 13:02 KaiWaldrant

Another way of reproducing the issue:

aws s3 sync "s3://openproblems-data/resources/datasets/cellxgene_census/hcla/log_cp10k/" "resources/datasets/cellxgene_census/hcla/log_cp10k/"

viash run src/tasks/batch_integration/process_dataset/config.vsh.yaml -- --input resources/datasets/cellxgene_census/hcla/log_cp10k/dataset.h5ad --output_dataset dataset.h5ad --output_solution solution.h5ad

Or debug with:

## VIASH START
par = {
    'input': 'resources/datasets/cellxgene_census/hcla/log_cp10k/dataset.h5ad',
    'hvgs': 2000,
    'obs_label': 'cell_type',
    'obs_batch': 'batch',
    'subset_hvg': False,
    'output_dataset': 'dataset.h5ad',
    'output_solution': 'solution.h5ad'
}
meta = {
    "config": "target/nextflow/batch_integration/process_dataset/.config.vsh.yaml",
    "resources_dir": "src/common/helper_functions"
}
## VIASH END

  Traceback (most recent call last):
    File "/viash_automount/tmp//viash-run-process_dataset-HETEzD.py", line 60, in <module>
      adata_with_hvg = compute_batched_hvg(input, n_hvgs=par['hvgs'])
    File "/viash_automount/tmp//viash-run-process_dataset-HETEzD.py", line 49, in compute_batched_hvg
      hvg_list = scib.pp.hvg_batch(
    File "/usr/local/lib/python3.10/site-packages/scib/preprocessing.py", line 504, in hvg_batch
      sc.pp.highly_variable_genes(
    File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 469, in highly_variable_genes
      hvg = _highly_variable_genes_single_batch(
    File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 248, in _highly_variable_genes_single_batch
      df['mean_bin'] = pd.cut(
    File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 293, in cut
      fac, bins = _bins_to_cuts(
    File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 421, in _bins_to_cuts
      raise ValueError(
  ValueError: Bin edges must be unique: array([      -inf, 0.27518785, 0.27518785, 0.27518785, 0.27662799,
         0.2801334 , 0.2801334 , 0.30019245, 0.31369418, 0.31369418,
         0.32693949, 0.33407092, 0.35471782, 0.55181584, 0.59382758,
         0.62713194, 0.74299753, 0.9963228 , 1.50465243, 5.59722102,
                inf]).
  You can drop duplicate edges by setting the 'duplicates' kwarg
  Files and logs are stored at '/tmp/viash_process_dataset4742220387422529871'
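
To see which edges collide, the array printed in the error above can be checked directly. This is a small diagnostic sketch using the values as printed (so rounded to the displayed precision), not part of the component:

```python
import numpy as np

# Bin edges copied from the ValueError message above (printed precision).
edges = np.array([
    -np.inf, 0.27518785, 0.27518785, 0.27518785, 0.27662799,
    0.2801334, 0.2801334, 0.30019245, 0.31369418, 0.31369418,
    0.32693949, 0.33407092, 0.35471782, 0.55181584, 0.59382758,
    0.62713194, 0.74299753, 0.9963228, 1.50465243, 5.59722102,
    np.inf,
])

# Count how often each edge occurs and keep only the repeated ones.
values, counts = np.unique(edges, return_counts=True)
dups = {float(v): int(c) for v, c in zip(values, counts) if c > 1}

print(len(edges), len(values))
print(dups)
```

The repeats sit at the low end of the range, which suggests that many genes share the same low mean value in the data being binned.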

Mar 15 '24 12:03 rcannood