
Conversation

angelonazzaro
Contributor

What does this PR do?

This PR adds a missing comma when instantiating BERTopic in the Exploration section, under the Outliers reduction page.

Fixes

Add the missing comma when instantiating BERTopic in the Exploration section, under the Outliers reduction page. The comma is inserted between the vectorizer_model and calculate_probabilities keyword arguments.

Before Fix:

from umap import UMAP
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Prepare data, extract embeddings, and prepare sub-models
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# We reduce our embeddings to 2D as it will allow us to quickly iterate later on
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)

# Train our topic model
topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
                       vectorizer_model=vectorizer_model calculate_probabilities=True, nr_topics=40)
topics, probs = topic_model.fit_transform(docs, embeddings)

After Fix:

from umap import UMAP
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Prepare data, extract embeddings, and prepare sub-models
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# We reduce our embeddings to 2D as it will allow us to quickly iterate later on
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)

# Train our topic model
topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
                       vectorizer_model=vectorizer_model, calculate_probabilities=True, nr_topics=40)
topics, probs = topic_model.fit_transform(docs, embeddings)
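For context, the error the fix addresses is purely syntactic: Python does not allow two keyword arguments to sit next to each other without a separating comma, so the original snippet could never run. A minimal way to confirm this, without installing BERTopic, is to check whether each call parses (the BERTopic name here is only being parsed, not executed):

```python
def parses(src: str) -> bool:
    """Return True if the snippet is valid Python syntax, False on SyntaxError."""
    try:
        compile(src, "<snippet>", "eval")
        return True
    except SyntaxError:
        return False

# The call as it appeared in the docs (missing comma) vs. the fixed call.
before = "BERTopic(vectorizer_model=vectorizer_model calculate_probabilities=True)"
after = "BERTopic(vectorizer_model=vectorizer_model, calculate_probabilities=True)"

print(parses(before))  # False: keyword arguments must be comma-separated
print(parses(after))   # True
```

On recent Python versions the interpreter even hints at the cause ("Perhaps you forgot a comma?"), but older versions only report a generic invalid syntax, which is why the broken docs snippet was easy to trip over when copy-pasting.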

Before submitting

  • [ ✅ ] This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

@MaartenGr
Owner

Awesome, thank you for the PR and the extensive description. It is highly appreciated.

@MaartenGr MaartenGr merged commit e85c8bb into MaartenGr:master Jun 6, 2025
6 checks passed