[inhomogeneous shape unresolved] [Colab] ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Open semmyk-research opened this issue 1 year ago • 1 comments

While using BERTopic for a research task, I get: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

Is there a plan for BERTopic to support numpy > 1.23.5 and numba > 0.56.4? This would make it easier to use BERTopic in Colab without having to downgrade manually every time (!pip install numba==0.56.4), especially for those without Colab Pro. Thanks.

[Noted]
Related issues (closed): #1697, #1602, #1309
Related issues (open): #1814, #1799, #1684, #1584, #1421, #1269
SO: https://stackoverflow.com/a/76504825

PS: I gather that a workaround is to downgrade numpy. In issue #1421, @aaron-imani pointed to np.average() in _guided_topic_modelling(), and @MaartenGr indicated that setting numba to 0.56.4 or earlier should ideally fix the issue.
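For anyone trying to reproduce the message outside BERTopic, here is a minimal sketch. It assumes the failure comes from handing NumPy a list that mixes a 2-D block of document embeddings with a 1-D seed embedding. The shapes and the helper name try_ragged_stack are made up for illustration and are not taken from the BERTopic source.

```python
import numpy as np


def try_ragged_stack(parts):
    """Return the ValueError message if `parts` cannot form a regular array, else None."""
    try:
        np.asarray(parts)
    except ValueError as err:
        return str(err)
    return None


# Hypothetical shapes: 3 documents with 4-dimensional embeddings, plus one seed vector.
doc_embeddings = np.zeros((3, 4))
seed_embedding = np.zeros(4)

message = try_ragged_stack([doc_embeddings, seed_embedding])
print(message or "no error (older NumPy builds an object array instead)")
```

np.average converts its first argument to an array before averaging, so a ragged list like this one would fail at that conversion step on recent NumPy releases, while older releases quietly built an object array.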

Thanks @MaartenGr for your insightful engagement in previous issues and suggestions.

Environment: colab.research.google.com
NB: (as at 26 April 2024, Colab runs python: 3.10.12, bertopic: 0.16.1, numpy: 1.25.2, numba: 0.58.1)
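To collect the same version details on another machine or a fresh Colab runtime, something like the following could be used. It is only a convenience sketch built on the standard library; the helper name installed_versions is hypothetical, and packages that are not installed are reported as None rather than raising.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version string, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["bertopic", "numpy", "numba"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```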

BERTopic:

# Install BERTopic
!pip install bertopic
# Load libraries
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sentence_transformers import SentenceTransformer

[Code snippet]

# Train BERTopic

topic_model_03 = BERTopic(
    verbose=True,
    min_topic_size= 5, #10,  #12,  #15,  ## the higher, the lower the clusters/topics
    nr_topics = 4, #5, ## reduce the initial number of topics
    #seed_topic_list=seed_topic_list,  ##ValueError: ...
    n_gram_range = (1,3), ## n-gram range for the CountVectorizer
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.35,  #.55,  #.75,  #.85,
    embedding_model="thenlper/gte-small",  ## pass string directly to sbert sentence-transformers models
    #umap_model = umap_model, ## dimensionality
    ctfidf_model = ctfidf_model,
    representation_model=KeyBERTInspired(),
    ) 

The seed topics look like this {redacted in part}

### Try with Guided Representation
seed_topic_list = [["sustainability", "...", "sustain", "...", "..."],
                   ["climate change", "climate", "...", "...", "...", "...", "ozone"],
                   ["social justice", "social", "...", "...", "...", "...", "...", "..."],
                   ["net zero", "..."]
                   ]

## fit and transform
#topics_03, probs_03 = topic_model_03.fit_transform(docs)
## visualise documents' topics spread
#topic_model_03.visualize_documents(docs)

semmyk-research, Apr 26 '24 12:04

> Is there plan for BERTopic supporting numpy >1.23.5 and numba > 0.56.4?

I believe this is only the case for the specific use of seeded topic modeling. All other instances of BERTopic should support these versions. Having said that, I would indeed prefer to update this line:

https://github.com/MaartenGr/BERTopic/blob/127e794f5630bc0d48071f012b07e9e41dd7d8ba/bertopic/_bertopic.py#L3762

To something that still does a weighted average but without the need to use the weights parameter, which is where the main issue lies (I think). Preferably, I would want to remove the use of np.average altogether since it is just giving a lot of issues with numba. Instead, doing the weighted average manually by multiplying the embeddings, adding the seed_topic_embeddings, and dividing by 4 might be a straightforward solution.
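The manual weighted average described here could look roughly like the sketch below. It assumes weights of 3 for the document embeddings and 1 for the seed topic embedding, which is what "dividing by 4" suggests but is not confirmed in this thread. The function name blend_with_seed and the random test data are hypothetical, and this is not the actual BERTopic patch.

```python
import numpy as np


def blend_with_seed(doc_embeddings, seed_embedding, doc_weight=3.0, seed_weight=1.0):
    """Weighted average of each document embedding with one seed embedding.

    Uses plain broadcasting, so no ragged list is ever built and
    np.average (with its weights parameter) is not needed.
    """
    doc_embeddings = np.asarray(doc_embeddings, dtype=float)   # shape (n_docs, dim)
    seed_embedding = np.asarray(seed_embedding, dtype=float)   # shape (dim,)
    total = doc_weight + seed_weight                           # 4 with the assumed weights
    return (doc_weight * doc_embeddings + seed_weight * seed_embedding) / total


# Made-up data: 5 documents with 8-dimensional embeddings and one seed vector.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
seed = rng.normal(size=8)

blended = blend_with_seed(docs, seed)

# Cross-check against np.average on a regular (non-ragged) stack.
stacked = np.stack([docs, np.broadcast_to(seed, docs.shape)])
reference = np.average(stacked, axis=0, weights=[3, 1])
print(np.allclose(blended, reference))
```

Because the seed vector is broadcast across the rows, the result keeps the shape of the document embeddings, so it could be written back into the same slice of the embedding matrix.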

MaartenGr, Apr 30 '24 07:04