Visualization issues with dimensionality reduction and clustering #2388
-
Hi, I have been using BERTopic with datamapplot for some time, on a dataset of around 3,200 abstracts. I just realized, however, that to train my BERTopic model I used a 5-dimensional UMAP, while the embeddings I pass to datamapplot are reduced to 2 dimensions, and in the resulting plot not all clusters form tight groups. My main question is the following: should I train the model on the 2-dimensional embeddings instead, so that the clustering matches the visualization? Thank you very much!

```python
# Pre-calculate embeddings
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
import datamapplot

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pre-reduce embeddings to 2D for visualization purposes
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0,
                          metric='cosine', random_state=30).fit_transform(embeddings)

# Define sub-models (note: clustering happens in 5D, not the 2D space above)
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=30)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)

# Define a custom list of stopwords (or use an extended one)
vectorizer_model = CountVectorizer(stop_words="english")

# Placeholder -- the representation model's definition was not shown in the post
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    verbose=True,
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)

# Labels: one topic name per document, with the outlier topic renamed
labels = topic_model.get_document_info(docs).Name.values
topic_info = topic_model.get_topic_info()
excluded_topic = str(topic_info.Name[0])  # row 0 is the -1 outlier topic
clean_labels = [item.replace(excluded_topic, "Unlabelled") for item in labels]

# Datamapplot visualisation
fig, ax = datamapplot.create_plot(
    reduced_embeddings,
    clean_labels,
    color_label_arrows=True,
    label_over_points=True,
    dynamic_label_size=True,
    title="BERTopic",
    point_size=3,
    marker_type="p",
    force_matplotlib=True,
    min_font_size=10,
)
```
-
Just to be sure I understand correctly: did you mean that not all clusters are tight groups? If so, that wouldn't necessarily be a bad thing, considering you lose information going from 5D to 2D. 2D really is nothing more than a proxy for the original data, so it would make sense that training on 2D "looks" better even though it might not be. Have you checked whether the points you refer to still make sense in their clusters?
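To make that check concrete, here's a minimal sketch reusing `topic_model`, `docs`, and `topics` from your post (the topic id `8` is purely hypothetical): read a few documents from the suspect cluster, and compare the cluster's spread in the 5D training space with what the 2D plot shows.

```python
import numpy as np

# Read a few documents from the cluster that looks loose in the 2D plot.
# Topic id 8 is a placeholder -- substitute the cluster you are inspecting.
doc_info = topic_model.get_document_info(docs)
for doc in doc_info[doc_info.Topic == 8].Document.head(5):
    print(doc[:200], "\n---")

# Compare the cluster's spread in the 5D space HDBSCAN actually clustered on.
# After fitting, the UMAP sub-model exposes its output via `embedding_`.
emb_5d = topic_model.umap_model.embedding_
mask = np.array(topics) == 8
dists = np.linalg.norm(emb_5d[mask] - emb_5d[mask].mean(axis=0), axis=1)
print(dists.round(2))  # small distances here + scatter in 2D => plotting artefact
```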
-
Yes, but only if you feel that it actually improves the resulting topics. 5D is generally better than 2D, but definitely not always. There's no free lunch here, so additional testing is needed for your specific use case.
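For reference, a sketch of that test, reusing the objects from the original post: fit a second model whose training UMAP is the same 2D reducer used for plotting, then compare the topics of both runs by eye. Whether this actually helps is an empirical question for your data.

```python
# Retrain with a 2D UMAP so clustering and visualization share one space.
# All names (embedding_model, embeddings, docs, vectorizer_model) come from
# the original post above.
umap_2d = UMAP(n_neighbors=15, n_components=2, min_dist=0.0,
               metric='cosine', random_state=30)
topic_model_2d = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_2d,
    hdbscan_model=HDBSCAN(min_cluster_size=15, metric='euclidean',
                          cluster_selection_method='eom', prediction_data=True),
    vectorizer_model=vectorizer_model,
    top_n_words=10,
)
topics_2d, _ = topic_model_2d.fit_transform(docs, embeddings)
print(topic_model_2d.get_topic_info().head(10))  # compare against the 5D run
```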