[bat_int] hcla dataset cannot be processed
**Describe the bug**
When processing the cellxgene dataset `hcla` with the batch integration dataset processor, an error is raised:
```
Traceback (most recent call last):
  File "/tmp/nxf.DNtRDk7Ba7/.viash_script.sh", line 60, in <module>
    adata_with_hvg = compute_batched_hvg(input, n_hvgs=par['hvgs'])
  File "/tmp/nxf.DNtRDk7Ba7/.viash_script.sh", line 49, in compute_batched_hvg
    hvg_list = scib.pp.hvg_batch(
  File "/usr/local/lib/python3.10/site-packages/scib/preprocessing.py", line 504, in hvg_batch
    sc.pp.highly_variable_genes(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 469, in highly_variable_genes
    hvg = _highly_variable_genes_single_batch(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 248, in _highly_variable_genes_single_batch
    df['mean_bin'] = pd.cut(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 293, in cut
    fac, bins = _bins_to_cuts(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 421, in _bins_to_cuts
    raise ValueError(
ValueError: Bin edges must be unique: array([      -inf, 0.00013226, 0.00014937, 0.00014937, 0.00016693,
       0.00016958, 0.00016958, 0.00018138, 0.0001983 , 0.0001983 ,
       0.00020949, 0.00020964, 0.00021124, 0.00030184, 0.00034767,
       0.00037922, 0.00048056, 0.00062685, 0.00096363, 0.007547  ,
              inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
```
https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/4L4a4swu0PnnpT
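For context, the `ValueError` originates in pandas' `pd.cut`, which scanpy uses to bin genes by mean expression while computing highly variable genes; when many genes share (near-)identical means, the percentile-derived bin edges collide. A minimal sketch with synthetic gene means (hypothetical values, not taken from the hcla dataset):

```python
import numpy as np
import pandas as pd

# Synthetic gene means with heavy ties, mimicking a sparse dataset in which
# many genes have identical near-zero mean expression.
means = pd.Series(np.repeat([0.0001, 0.0002, 0.001], [50, 45, 5]))

# Percentile-derived bin edges, as used for mean-expression binning.
# With so many tied means, several edges come out identical.
edges = means.quantile(np.linspace(0, 1, 21)).values

try:
    pd.cut(means, bins=edges)
except ValueError as e:
    print(e)  # "Bin edges must be unique: ..."

# Passing duplicates='drop' merges the colliding edges instead of raising,
# at the cost of ending up with fewer bins than requested.
binned = pd.cut(means, bins=edges, duplicates="drop")
print(binned.cat.categories.size)  # fewer than the 20 bins requested
```

This suggests the failure is a property of the hcla expression matrix (many genes with tied means within a batch) rather than of the processor itself.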
**To Reproduce**
Steps to reproduce the behavior:

```sh
bash src/tasks/batch_integration/resources_scripts/process_datasets.sh
```
Another way of reproducing the issue:

```sh
aws s3 sync "s3://openproblems-data/resources/datasets/cellxgene_census/hcla/log_cp10k/" "resources/datasets/cellxgene_census/hcla/log_cp10k/"
viash run src/tasks/batch_integration/process_dataset/config.vsh.yaml -- --input resources/datasets/cellxgene_census/hcla/log_cp10k/dataset.h5ad --output_dataset dataset.h5ad --output_solution solution.h5ad
```
Or debug with:

```python
## VIASH START
par = {
    'input': 'resources/datasets/cellxgene_census/hcla/log_cp10k/dataset.h5ad',
    'hvgs': 2000,
    'obs_label': 'cell_type',
    'obs_batch': 'batch',
    'subset_hvg': False,
    'output_dataset': 'dataset.h5ad',
    'output_solution': 'solution.h5ad'
}
meta = {
    "config": "target/nextflow/batch_integration/process_dataset/.config.vsh.yaml",
    "resources_dir": "src/common/helper_functions"
}
## VIASH END
```
```
Traceback (most recent call last):
  File "/viash_automount/tmp//viash-run-process_dataset-HETEzD.py", line 60, in <module>
    adata_with_hvg = compute_batched_hvg(input, n_hvgs=par['hvgs'])
  File "/viash_automount/tmp//viash-run-process_dataset-HETEzD.py", line 49, in compute_batched_hvg
    hvg_list = scib.pp.hvg_batch(
  File "/usr/local/lib/python3.10/site-packages/scib/preprocessing.py", line 504, in hvg_batch
    sc.pp.highly_variable_genes(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 469, in highly_variable_genes
    hvg = _highly_variable_genes_single_batch(
  File "/usr/local/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 248, in _highly_variable_genes_single_batch
    df['mean_bin'] = pd.cut(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 293, in cut
    fac, bins = _bins_to_cuts(
  File "/usr/local/lib/python3.10/site-packages/pandas/core/reshape/tile.py", line 421, in _bins_to_cuts
    raise ValueError(
ValueError: Bin edges must be unique: array([      -inf, 0.27518785, 0.27518785, 0.27518785, 0.27662799,
       0.2801334 , 0.2801334 , 0.30019245, 0.31369418, 0.31369418,
       0.32693949, 0.33407092, 0.35471782, 0.55181584, 0.59382758,
       0.62713194, 0.74299753, 0.9963228 , 1.50465243, 5.59722102,
              inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
```
Files and logs are stored at '/tmp/viash_process_dataset4742220387422529871'