Visualization issues with dimensionality reduction and clustering #2388
-
Hi, I have been using BERTopic with datamapplot for some time, on a dataset of around 3,200 abstracts. I just realized, however, that to train my BERTopic model I used a 5-dimensional UMAP, while the embeddings I pass to datamapplot are reduced to 2 dimensions, and in the resulting plot not all clusters form tight groups. My main question is the following: should I train the model on the 2-dimensional embeddings instead, so that the clustering matches the visualization? Thank you very much!

```python
# Pre-calculate embeddings
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
import datamapplot

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pre-reduce embeddings to 2D for visualization purposes
reduced_embeddings = UMAP(n_neighbors=15, n_components=2, min_dist=0.0,
                          metric='cosine', random_state=30).fit_transform(embeddings)

# Define sub-models (note: clustering happens in 5D, not the 2D space above)
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=30)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)

# Define a custom list of stopwords (or use an extended one)
vectorizer_model = CountVectorizer(stop_words="english")

# Placeholder -- the representation model's definition was not shown in the post
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    verbose=True,
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)

# Labels: one topic name per document, with the outlier topic renamed
labels = topic_model.get_document_info(docs).Name.values
topic_info = topic_model.get_topic_info()
excluded_topic = str(topic_info.Name[0])  # row 0 is the -1 outlier topic
clean_labels = [item.replace(excluded_topic, "Unlabelled") for item in labels]

# Datamapplot visualisation
fig, ax = datamapplot.create_plot(
    reduced_embeddings,
    clean_labels,
    color_label_arrows=True,
    label_over_points=True,
    dynamic_label_size=True,
    title="BERTopic",
    point_size=3,
    marker_type="p",
    force_matplotlib=True,
    min_font_size=10,
)
```
-
Just to be sure I understand correctly: did you mean that not all clusters are tight groups? If so, that wouldn't necessarily be a bad thing, considering you lose information going from 5D to 2D. 2D really is nothing more than a proxy for the original data, so it would make sense that training on 2D "looks" better even though it might not be. Have you checked whether the points you refer to still make sense in their clusters?
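To make that check concrete, here's a minimal sketch reusing `topic_model`, `docs`, and `topics` from your post (the topic id `8` is purely hypothetical): read a few documents from the suspect cluster, and compare the cluster's spread in the 5D training space with what the 2D plot shows.

```python
import numpy as np

# Read a few documents from the cluster that looks loose in the 2D plot.
# Topic id 8 is a placeholder -- substitute the cluster you are inspecting.
doc_info = topic_model.get_document_info(docs)
for doc in doc_info[doc_info.Topic == 8].Document.head(5):
    print(doc[:200], "\n---")

# Compare the cluster's spread in the 5D space HDBSCAN actually clustered on.
# After fitting, the UMAP sub-model exposes its output via `embedding_`.
emb_5d = topic_model.umap_model.embedding_
mask = np.array(topics) == 8
dists = np.linalg.norm(emb_5d[mask] - emb_5d[mask].mean(axis=0), axis=1)
print(dists.round(2))  # small distances here + scatter in 2D => plotting artefact
```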
-
Yes, but only if you feel that it actually improves the resulting topics. 5D is generally better than 2D, but definitely not always. There's no free lunch here, so additional testing is needed for your specific use case.
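For reference, a sketch of that test, reusing the objects from the original post: fit a second model whose training UMAP is the same 2D reducer used for plotting, then compare the topics of both runs by eye. Whether this actually helps is an empirical question for your data.

```python
# Retrain with a 2D UMAP so clustering and visualization share one space.
# All names (embedding_model, embeddings, docs, vectorizer_model) come from
# the original post above.
umap_2d = UMAP(n_neighbors=15, n_components=2, min_dist=0.0,
               metric='cosine', random_state=30)
topic_model_2d = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_2d,
    hdbscan_model=HDBSCAN(min_cluster_size=15, metric='euclidean',
                          cluster_selection_method='eom', prediction_data=True),
    vectorizer_model=vectorizer_model,
    top_n_words=10,
)
topics_2d, _ = topic_model_2d.fit_transform(docs, embeddings)
print(topic_model_2d.get_topic_info().head(10))  # compare against the 5D run
```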