[inhomogeneous shape unresolved] [Colab] ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
Using BERTopic for a research task, I am getting:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part
Is there a plan for BERTopic to support numpy > 1.23.5 and numba > 0.56.4?
This would make it easier to use BERTopic in Colab without having to downgrade manually in every session (`!pip install numba==0.56.4`), especially for those without Colab Pro.
Thanks.
Related issues (closed): #1697, #1602, #1309
Related issues (open): #1814, #1799, #1684, #1584, #1421, #1269
SO: https://stackoverflow.com/a/76504825
PS: I gather that a workaround is to downgrade numpy!
In issue #1421, @aaron-imani hinted that the culprit is np.average() in _guided_topic_modelling(). @MaartenGr indicated that pinning numba to 0.56.4 or earlier should fix the issue.
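For context, here is a hypothetical minimal repro of the same error message (toy vectors, not BERTopic's actual arrays): np.average first converts its input with np.asanyarray, and on numpy >= 1.24 a ragged list of vectors raises this ValueError instead of silently building an object array.

```python
import numpy as np

# Toy illustration only: two vectors of mismatched length make the input
# "inhomogeneous", so the array conversion inside np.average fails.
topic_embedding = np.zeros(3)
seed_embedding = np.zeros(2)

try:
    np.average([topic_embedding, seed_embedding], weights=[3, 1], axis=0)
except ValueError as err:
    print(err)  # mentions an "inhomogeneous shape after 1 dimensions"
```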
Thanks @MaartenGr for your insightful engagement and suggestions in previous issues.
Environment: colab.research.google.com
NB: as of 26 April 2024, Colab runs python 3.10.12, bertopic 0.16.1, numpy 1.25.2, numba 0.58.1.
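The versions above can be confirmed in a single Colab cell; this snippet is just a convenience and degrades gracefully if a package is missing:

```python
import importlib
import sys

# Print the runtime versions relevant to this issue
print("python:", sys.version.split()[0])
for pkg in ("bertopic", "numpy", "numba"):
    try:
        print(pkg + ":", importlib.import_module(pkg).__version__)
    except ImportError:
        print(pkg + ": not installed")
```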
BERTopic:
# Install BERTopic
!pip install bertopic
# Load libraries
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sentence_transformers import SentenceTransformer
# Train BERTopic
topic_model_03 = BERTopic(
    verbose=True,
    min_topic_size=5,        # tried 10, 12, 15; the higher, the fewer clusters/topics
    nr_topics=4,             # tried 5; reduce the initial number of topics
    # seed_topic_list=seed_topic_list,  # triggers the ValueError above
    n_gram_range=(1, 3),     # n-gram range for the CountVectorizer
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=.35,  # tried .55, .75, .85
    embedding_model="thenlper/gte-small",  # pass model name directly for sentence-transformers models
    # umap_model=umap_model,  # dimensionality reduction
    ctfidf_model=ctfidf_model,
    representation_model=KeyBERTInspired(),
)
The seed topics look like this (redacted in part):
### Try with Guided Representation
seed_topic_list = [["sustainability", "...", "sustain", "...", "..."],
["climate change", "climate", "...", "...", "...", "...", "ozone"],
["social justice", "social", "...", "...", "...", "...", "...", "..."],
["net zero", "..."]
]
## fit and transform
#topics_03, probs_03 = topic_model_03.fit_transform(docs)
## visualise the documents' topic spread
#topic_model_03.visualize_documents(docs)
> Is there a plan for BERTopic to support numpy > 1.23.5 and numba > 0.56.4?
I believe this only happens for the specific use of seeded (guided) topic modeling. All other uses of BERTopic should support these versions. That said, I would indeed prefer to update this line:
https://github.com/MaartenGr/BERTopic/blob/127e794f5630bc0d48071f012b07e9e41dd7d8ba/bertopic/_bertopic.py#L3762
To something that still computes a weighted average but without the weights parameter, which is where the main issue lies (I think). Preferably, I would remove the use of np.average altogether, since it causes a lot of issues with numba. Instead, doing the weighted average manually by multiplying the embeddings, adding the seed_topic_embeddings, and dividing by 4 might be a straightforward solution.
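If I read the suggestion correctly, a sketch with toy vectors (the variable names are illustrative, not BERTopic internals) would look like this:

```python
import numpy as np

# Toy stand-ins for one topic embedding and its matching seed topic embedding
topic_embedding = np.array([0.2, 0.4, 0.6])
seed_topic_embedding = np.array([0.6, 0.0, 0.2])

# Manual 3:1 weighted average, avoiding np.average's weights path entirely
manual = (3 * topic_embedding + seed_topic_embedding) / 4

# Equivalent to np.average with weights=[3, 1] on a properly stacked array
reference = np.average(
    np.stack([topic_embedding, seed_topic_embedding]), weights=[3, 1], axis=0
)
assert np.allclose(manual, reference)
```

The explicit arithmetic keeps the same 3:1 weighting while sidestepping the array-conversion path that raises the inhomogeneous-shape error.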