
Conversation

angelonazzaro
Contributor

What does this PR do?

This PR adds a missing comma when instantiating BERTopic in the Exploration section, under the Outliers reduction page.

Fixes

Add the missing comma when instantiating BERTopic in the Exploration section, under the Outliers reduction page. The comma is inserted between the vectorizer_model and calculate_probabilities keyword arguments.

Before Fix:

from umap import UMAP
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Prepare data, extract embeddings, and prepare sub-models
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# We reduce our embeddings to 2D as it will allow us to quickly iterate later on
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)

# Train our topic model
topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
                       vectorizer_model=vectorizer_model calculate_probabilities=True, nr_topics=40)
topics, probs = topic_model.fit_transform(docs, embeddings)

After Fix:

from umap import UMAP
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Prepare data, extract embeddings, and prepare sub-models
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
vectorizer_model = CountVectorizer(stop_words="english")
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# We reduce our embeddings to 2D as it will allow us to quickly iterate later on
reduced_embeddings = UMAP(n_neighbors=10, n_components=2,
                          min_dist=0.0, metric='cosine').fit_transform(embeddings)

# Train our topic model
topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model,
                       vectorizer_model=vectorizer_model, calculate_probabilities=True, nr_topics=40)
topics, probs = topic_model.fit_transform(docs, embeddings)
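For context, the error the fix addresses is purely syntactic: Python does not allow two keyword arguments to sit next to each other without a separating comma, so the original snippet could never run. A minimal way to confirm this, without installing BERTopic, is to check whether each call parses (the BERTopic name here is only being parsed, not executed):

```python
def parses(src: str) -> bool:
    """Return True if the snippet is valid Python syntax, False on SyntaxError."""
    try:
        compile(src, "<snippet>", "eval")
        return True
    except SyntaxError:
        return False

# The call as it appeared in the docs (missing comma) vs. the fixed call.
before = "BERTopic(vectorizer_model=vectorizer_model calculate_probabilities=True)"
after = "BERTopic(vectorizer_model=vectorizer_model, calculate_probabilities=True)"

print(parses(before))  # False: keyword arguments must be comma-separated
print(parses(after))   # True
```

On recent Python versions the interpreter even hints at the cause ("Perhaps you forgot a comma?"), but older versions only report a generic invalid syntax, which is why the broken docs snippet was easy to trip over when copy-pasting.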

Before submitting

  • [ ✅ ] This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

@MaartenGr
Owner

Awesome, thank you for the PR and the extensive description. It is highly appreciated.

@MaartenGr MaartenGr merged commit e85c8bb into MaartenGr:master Jun 6, 2025
6 checks passed