symphonypy

question about batch

Open Flu09 opened this issue 1 year ago • 10 comments

Hello, I have 3 studies that I want to annotate using a built reference, and I wonder if what I am doing is correct. I transferred labels from the built reference to each dataset separately. I had integrated the 3 studies with Seurat and Harmony in R using Seurat v5, but here in symphonypy I started from counts and followed the tutorial. Should I transfer labels to the whole object rather than one dataset at a time? Would the batch-corrected object help at all?

Flu09 avatar Aug 21 '24 03:08 Flu09

Hi, @Flu09!

First of all, if you're more familiar with R, it's better to use the original Symphony: https://github.com/immunogenomics/symphony

Secondly, you can pass batch information explicitly during label transfer using the key argument (this is the better way to do it, and the results should be similar to label transfer on individual batches):
sp.tl.map_embedding(adata_query=adata_query, adata_ref=adata_ref, key=batch_key)

Overall Symphony performance on Seurat-corrected expressions wasn't benchmarked, so we can't say if it will give some meaningful results.

serjisa avatar Aug 21 '24 09:08 serjisa

I see, thank you so much. I have this error. Do you have any suggestions?

sp.tl.map_embedding(adata_query=sample, adata_ref=adata)
538 out of 3000 genes from the reference are missing in the query dataset or have zero std in the reference, their expressions in the query will be set to zero
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/symphonypy/tools.py", line 336, in map_embedding
    _map_query_to_ref(
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/symphonypy/_utils.py", line 278, in _map_query_to_ref
    t = _adjust_for_missing_genes(
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/symphonypy/_utils.py", line 240, in _adjust_for_missing_genes
    X = adata[:, use_genes_list[use_genes_list_present]].X
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/anndata/_core/anndata.py", line 591, in X
    _subset(self._adata_ref.X, (self._oidx, self._vidx)),
  File "/usr/lib64/python3.9/functools.py", line 888, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/anndata/_core/index.py", line 165, in _subset_spmatrix
    return a[subset_idx]
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scipy/sparse/_index.py", line 68, in __getitem__
    return self._get_sliceXarray(row, col)
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scipy/sparse/_csr.py", line 326, in _get_sliceXarray
    return self._major_slice(row)._minor_index_fancy(col)
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scipy/sparse/_compressed.py", line 768, in _minor_index_fancy
    csr_column_index1(k, idx, M, N, self.indptr, self.indices,
ValueError: Output dtype not compatible with inputs.

Flu09 avatar Aug 21 '24 13:08 Flu09

Hi @Flu09! I'm so sorry that you are encountering this bug! What's the datatype of your sparse matrix adata_query.X in the example above?

potulabe avatar Aug 21 '24 16:08 potulabe

float64 for both the reference and the samples. I think they need to be converted to float32, and the cell type column to category?

print(adata.obs['cell_type_high_resolution'].dtype)
object
adata.X
<1353075x33538 sparse matrix of type '<class 'numpy.float64'>' with 4457926739 stored elements in Compressed Sparse Row format>
sample.X
<3057x38152 sparse matrix of type '<class 'numpy.float64'>' with 4187950 stored elements in Compressed Sparse Row format>
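Besides the value dtype, it may be worth looking at the dtypes of the index arrays. The reference above reports 4457926739 stored elements, which is more than the int32 maximum of 2147483647, so its index arrays cannot all be 32-bit. Whether that has anything to do with the ValueError is only an assumption here; nothing in this thread confirms it. A small sketch, where `describe_sparse` is a hypothetical helper and the random matrix is a stand-in for `adata.X` or `sample.X`:

```python
import numpy as np
from scipy import sparse


def describe_sparse(m):
    """Summarise value and index dtypes of a CSR/CSC matrix."""
    if not sparse.issparse(m) or m.format not in ("csr", "csc"):
        raise TypeError("expected a scipy CSR or CSC matrix")
    return {
        "format": m.format,
        "shape": m.shape,
        "nnz": int(m.nnz),
        "data_dtype": str(m.data.dtype),
        "indices_dtype": str(m.indices.dtype),
        "indptr_dtype": str(m.indptr.dtype),
        # True when the number of stored elements does not fit into int32
        "needs_64bit_index": bool(m.nnz > np.iinfo(np.int32).max),
    }


# Stand-in for adata.X / sample.X
small = sparse.random(5, 8, density=0.3, format="csr", dtype=np.float64, random_state=0)
info = describe_sparse(small)
for key, value in info.items():
    print(f"{key}: {value}")
```

Running `describe_sparse(adata.X)` and `describe_sparse(sample.X)` and posting both outputs would show whether the two matrices differ in anything other than size.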

Flu09 avatar Aug 21 '24 17:08 Flu09

Eh, float64 seems to be OK. I was just hoping it was connected to this bug with np.float16: https://stackoverflow.com/questions/40046118/why-cant-i-assign-data-to-part-of-sparse-matrix-in-the-first-try

potulabe avatar Aug 21 '24 17:08 potulabe

@Flu09 Would you mind sharing the smallest subsample of data that reproduces the error? A couple of cells per dataset would probably be enough.
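One possible way to pick such a subsample, sketched with pandas on an `obs`-like table. The helper `pick_cells_per_group`, the `study` column and the output file name are all hypothetical:

```python
import numpy as np
import pandas as pd


def pick_cells_per_group(obs, group_key, n_per_group=2, seed=0):
    """Return sorted row positions with up to n_per_group cells from each group."""
    rng = np.random.default_rng(seed)
    picked = []
    # .indices maps each group label to the integer positions of its rows
    for positions in obs.groupby(group_key, observed=True).indices.values():
        take = min(n_per_group, len(positions))
        picked.extend(rng.choice(positions, size=take, replace=False))
    return np.sort(np.asarray(picked, dtype=int))


# Stand-in for adata.obs with a per-dataset column
obs = pd.DataFrame(
    {"study": ["a", "a", "a", "b", "b", "c"]},
    index=[f"cell{i}" for i in range(6)],
)
positions = pick_cells_per_group(obs, "study", n_per_group=2)
print(obs.iloc[positions])

# With a real AnnData object the same positions could be used as:
#   small = adata[positions].copy()
#   small.write_h5ad("repro_subset.h5ad")  # hypothetical file name
```

It is worth rerunning the failing call on the subset before sharing it, since a bug that depends on matrix size may not reproduce on a few cells.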

potulabe avatar Aug 21 '24 23:08 potulabe

Probably related to https://github.com/scverse/anndata/issues/1349?

potulabe avatar Aug 22 '24 02:08 potulabe

I can try preparing some data to share. Changing both the reference and the sample to float32 solved the previous issue.
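For anyone landing here with the same ValueError, the conversion can be sketched as below. `to_float32` is a hypothetical helper and the small matrix and label series are stand-ins for `adata.X` and `adata.obs['cell_type_high_resolution']`. The category conversion is included only because it was asked about earlier; the thread does not confirm that it is needed.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_float32(matrix):
    """Return a float32 copy of a scipy sparse matrix or a NumPy array."""
    return matrix.astype(np.float32)


# Stand-ins for adata.X and a cell type column stored with object dtype
x = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]], dtype=np.float64))
labels = pd.Series(["T cell", "B cell"], dtype=object)

x32 = to_float32(x)
labels_cat = labels.astype("category")
print(x32.dtype, labels_cat.dtype)

# With AnnData objects this would look like:
#   adata.X = adata.X.astype(np.float32)
#   sample.X = sample.X.astype(np.float32)
#   adata.obs["cell_type_high_resolution"] = (
#       adata.obs["cell_type_high_resolution"].astype("category")
#   )
```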

New error message below

sp.tl.map_embedding(adata_query=sample, adata_ref=adata)
538 out of 3000 genes from the reference are missing in the query dataset or have zero std in the reference, their expressions in the query will be set to zero
>>> 
>>> # Mapping UMAP coordinates
>>> sp.tl.ingest(adata_query=sample, adata_ref=adata)
/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/umap/umap_.py:1943: UserWarning: n_jobs value -1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
TypeError: float() argument must be a string or a number, not 'csr_matrix'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/symphonypy/tools.py", line 238, in ingest
    ing.map_embedding(method)
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scanpy/tools/_ingest.py", line 499, in map_embedding
    self._obsm['X_umap'] = self._umap_transform()
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scanpy/tools/_ingest.py", line 488, in _umap_transform
    return self._umap.transform(self._obsm['rep'])
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/umap/umap_.py", line 3028, in transform
    indices, dists = self._knn_search_index.query(
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/pynndescent/pynndescent_.py", line 1696, in query
    query_data = np.asarray(query_data).astype(np.float32, order="C")
ValueError: setting an array element with a sequence.
>>> 
>>> # Labels prediction
>>> sp.tl.transfer_labels_kNN(
...     adata_query=sample,
...     adata_ref=adata,
...     ref_labels=["leiden", "cell_type_high_resolution"],
... )
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/symphonypy/tools.py", line 411, in transfer_labels_kNN
    knn.fit(adata_ref.obsm[ref_basis], adata_ref.obs[ref_labels])
  File "/home/x/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/anndata/_core/aligned_mapping.py", line 196, in __getitem__
    return self._data[key]
KeyError: 'X_pca_harmony'
>>> 

Flu09 avatar Aug 22 '24 12:08 Flu09

@Flu09 I'm so sorry, could you please share a small subset of your data :(

potulabe avatar Aug 24 '24 16:08 potulabe

And please also share the versions of the anndata and scanpy packages you are using.
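A quick way to collect them, using only the standard library. `package_versions` is a hypothetical helper, and the package list beyond anndata and scanpy is a guess at what else might be relevant:

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return installed versions by distribution name, or 'not installed'."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


report = package_versions(["anndata", "scanpy", "scipy", "symphonypy"])
for name, ver in report.items():
    print(f"{name}: {ver}")
```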

potulabe avatar Aug 24 '24 16:08 potulabe