Topic Modeling Using BERTopic on Newsgroup Dataset: Python Implementation
We go step by step, from creating a Google Colab workspace to visualizing the clusters of topics in a group of text documents. The explanation and code are kept to a minimum, and concepts are conveyed from a layman's perspective. The article's goal is to make it easy for anyone to apply the BERTopic algorithm in a free Google Colab workspace, extract the essential topics from their own text documents, and cluster them.
We are now ready to run our program. We start by installing the bertopic package, which gives our workspace access to the functions and libraries needed to run the BERTopic algorithm.
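Installation is a single pip command:

```shell
# Install BERTopic together with its main dependencies
# (sentence-transformers, UMAP, and HDBSCAN are pulled in automatically)
pip install bertopic
```

In a Colab cell, prefix the command with an exclamation mark (!pip install bertopic) so it runs in the shell rather than as Python; restart the runtime if Colab prompts you to after installation.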
Before we start with any analysis and modeling, it is important to get a basic understanding of the dataset. The dataset we will be using here is the 20 Newsgroups dataset, and it is fairly easy to understand: it contains roughly 19,000 messages organized into 20 newsgroups, each holding a collection of a specific type of message or news. If you scroll down the dataset's page, you will find all 20 groups listed on the right side.
from sklearn.datasets import fetch_20newsgroups: This line imports the fetch_20newsgroups function from the sklearn.datasets module. This function allows you to download the "20 Newsgroups" dataset.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']: This line fetches the dataset using the fetch_20newsgroups function with the following parameters:
subset='all': This specifies that you want to fetch all the documents from the dataset. Alternatively, you could use other options like 'train' or 'test' to fetch only the training or test subset of the dataset.
remove=('headers', 'footers', 'quotes'): This specifies that you want to remove certain parts from the documents, namely the headers, footers, and quotes. These parts are typically removed to focus on the main content of the documents.
['data']: This accesses the 'data' key in the returned dictionary-like object, which contains the actual text of the documents.
The fetched documents are then assigned to the variable docs, which will contain a list of text strings representing the individual documents in the dataset.
Now, the docs variable contains a list of cleaned messages spread across the 20 groups. By using the len function, we see that there are precisely 18,846 messages. We can print one of the messages by indexing into the list, for example docs[0].
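Putting the loading steps together (the first run downloads the dataset, so an internet connection is needed; scikit-learn caches it afterwards):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch all messages, stripping headers, footers, and quoted replies
docs = fetch_20newsgroups(subset='all',
                          remove=('headers', 'footers', 'quotes'))['data']

print(len(docs))  # 18846
print(docs[0])    # inspect the first cleaned message
```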
from bertopic import BERTopic: This line imports the BERTopic class from the bertopic library. It is like fetching the functionalities of the algorithm from a list of functions and algorithms available in the entire package.
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True): This line creates an instance of the BERTopic class with the following parameters. It is like creating our own copy of the algorithm so that we can apply it to our custom dataset.
language="english": Specifies the language of the documents. In this case, it is set to English.
calculate_probabilities=True: Specifies whether to calculate the probabilities of topics for each document. Setting it to True will enable the calculation of topic probabilities.
verbose=True: Enables verbose mode, which provides additional information during the topic modeling process.
topics, probs = topic_model.fit_transform(docs): This line fits the topic_model to the document data. The fit_transform method takes the docs as input and performs topic modeling. It returns two outputs:
topics: A list that contains the assigned topic for each document in docs.
probs: A list that contains the corresponding probability values for each topic assignment.
In BERTopic, the probabilities of topic assignments are calculated based on the document representations obtained from the BERT language model. Here’s an overview of how the probabilities are calculated.
Consider the following three example documents:
- Document 1: “I love hiking in the mountains.”
- Document 2: “I enjoy swimming in the ocean.”
- Document 3: “I like playing soccer with my friends.”
Step 1: Preprocessing. BERTopic preprocesses the input documents by performing text cleaning, tokenization, and encoding. This involves converting the text into numerical representations. For example, after tokenization, the first document might be represented as ["I", "love", "hiking", "in", "the", "mountains"].
Step 2: BERT Embeddings. BERTopic uses a pre-trained BERT model to obtain contextualized embeddings for each document. These embeddings capture the semantic meaning of the text by considering the context in which each word appears. For example, the BERT embeddings for Document 1 might be [0.2, 0.4, 0.6, 0.1, 0.3, 0.5] (these are arbitrary values for illustration).
Step 3: Dimensionality Reduction. BERTopic applies UMAP (Uniform Manifold Approximation and Projection) to reduce the dimensionality of the document embeddings while preserving the local structure of the data. This reduces the embeddings from, for example, 6 dimensions to 2 dimensions for visualization purposes.
Step 4: Clustering. BERTopic applies the HDBSCAN algorithm to cluster the reduced embeddings. Let's say the clustering identifies two clusters: Cluster 1 and Cluster 2. Document 1 and Document 3 are assigned to Cluster 1, while Document 2 is assigned to Cluster 2.
Step 5: Topic Assignment. BERTopic assigns a topic label to each document based on the cluster it belongs to. For example, Cluster 1 might be labeled as "Outdoor Activities" and Cluster 2 as "Water Sports." Therefore, Document 1 and Document 3 are assigned the topic label "Outdoor Activities," while Document 2 is assigned the topic label "Water Sports."
Step 6: Probability Calculation. The probabilities of topic assignments are calculated based on the distances between the document embeddings and the centroid of each topic cluster. Let's assume that Document 1's embedding is closer to the centroid of Cluster 1 than Cluster 2. This means that Document 1 has a higher probability of being assigned to the "Outdoor Activities" topic compared to the "Water Sports" topic. Similarly, Document 2 has a higher probability of being assigned to the "Water Sports" topic. The probabilities are normalized so that they sum up to 1 across all topics.
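Step 6 can be illustrated with a toy numerical sketch (this is a simplified illustration, not BERTopic's exact internals; HDBSCAN's soft-clustering probabilities are more involved). The embedding and centroid values below are arbitrary, chosen only to mirror the three example documents:

```python
import numpy as np

# Toy 2-D embeddings for the three example documents (arbitrary values)
doc_embeddings = np.array([[0.2, 0.9],    # Document 1: hiking
                           [0.8, 0.1],    # Document 2: swimming
                           [0.3, 0.8]])   # Document 3: soccer

# Centroids of the two clusters found in Step 4 (arbitrary values)
centroids = np.array([[0.25, 0.85],   # Cluster 1: "Outdoor Activities"
                      [0.80, 0.10]])  # Cluster 2: "Water Sports"

# Distance from each document to each cluster centroid
dists = np.linalg.norm(doc_embeddings[:, None, :] - centroids[None, :, :],
                       axis=2)

# Convert distances to similarities and normalize each row to sum to 1
sims = 1.0 / (dists + 1e-9)
probs = sims / sims.sum(axis=1, keepdims=True)

print(probs.round(2))  # each row sums to 1; most of Document 1's
                       # probability mass falls on Cluster 1
```

Documents 1 and 3 end up with high probability for "Outdoor Activities," and Document 2 for "Water Sports," matching the assignments described above.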
In summary, BERTopic processes the documents, obtains BERT embeddings, reduces the dimensionality, performs clustering, assigns topics, and calculates probabilities of topic assignments.
From here, we can see that 216 topic groups were detected in total, with Topic 0 being the most frequently occurring. Topic -1 collects outlier documents that do not fit well into any topic, and is typically ignored.
In BERTopic, the similarity between topics is typically calculated with cosine similarity between the topic representations (embeddings). This yields a score that, in practice, falls between 0 and 1, where values near 0 indicate little similarity and values near 1 indicate nearly identical topics. When generating a heatmap in BERTopic, the similarity between topics is represented by the values in the heatmap cells: higher values indicate a stronger similarity or association between the corresponding topics.
We can reduce the number of topics by following the code below:
The following images are generated after reducing the number of topics.
The article will be updated soon for a clearer explanation and deep dive into the code. Notebook Link