BERTopic for Topic Modeling with CORDIS Data

BERTopic can be considered the current (2022) state of the art in topic modeling; you'll find the corresponding paper here. Its advantage lies in a clever combination of sentence transformers with dimensionality reduction and clustering (by default UMAP and HDBSCAN). Sentence transformers encode natural language efficiently, even in very large amounts, and UMAP and HDBSCAN are two high-performance algorithms. The author, Maarten Grootendorst, released a well-documented and increasingly popular package that implements all steps, including useful visualization and representation tools.
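To get a feel for the API before we customize anything: the default workflow is only a few lines (a minimal sketch, where docs stands for any list of strings):

# default pipeline: sentence-transformer embeddings, then UMAP + HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)  # docs: a list of strings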

In this tutorial we will use the approach to identify topics in CORDIS data (EU FP and H2020 project results). This is a basic application tutorial, adjusted to work with “smaller data” (500 summaries), following this tutorial.

Also: we are going to use a GPU-enabled instance… You can get very far with Google Colab (clear the legal side first).
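A quick sanity check that the runtime actually has a GPU attached (Colab-specific):

# confirm that a GPU is attached to the runtime
!nvidia-smi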

# Start by installing the package (in quiet mode)
!pip install bertopic -q
# Colab specific widget handling
from google.colab import output
output.enable_custom_widget_manager()
# Load packages for the analysis
import pandas as pd #handling / opening data
import random #create random years (this table does not have clear years)

from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

from sklearn.feature_extraction.text import CountVectorizer
# Load report-data
reports = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/cordis-h2020reports.gz')
years = pd.to_datetime(reports.lastUpdateDate)
set([y.year for y in years])  # check which years are present in lastUpdateDate
# creating "fake years" for this tutorial...don't do that in a real analysis :-)
reports['year'] = [random.choice(range(2010,2018)) for _ in range(len(reports))]
reports['summary']  # inspect the summaries that will serve as model input

We need to specify a few things to make the approach work in our setting. This will involve:

  • Use a custom vectorizer that will remove stop-words (e.g. the, and, to, I)

  • Tweak UMAP and HDBSCAN to produce more, and more specific, clusters (check the BERTopic FAQ and documentation)

  • Request the use of n-grams from BERTopic for “reporting”

  • Use the specialized allenai-specter transformer, pretrained on scientific text

# custom vectorizer to get rid of stopwords
vectorizer_model = CountVectorizer(stop_words="english")

# lower n_neighbors to 3 (default: 15) and n_components to 3 (default: 5)
umap_model = UMAP(n_neighbors=3, n_components=3, 
                  min_dist=0.0, metric='cosine', random_state=42)

# reduce min_cluster_size and min_samples
hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean', 
                        cluster_selection_method='eom', prediction_data=True, min_samples=3)

# specify all custom models and n_grams
topic_model = BERTopic(verbose=True, 
                       embedding_model="allenai-specter", 
                       n_gram_range=(2, 3), 
                       hdbscan_model=hdbscan_model, 
                       umap_model=umap_model,
                       vectorizer_model=vectorizer_model)
# Run the modeling and check the number of topics found
topics, _ = topic_model.fit_transform(reports['summary'])
len(topic_model.get_topic_info())

The object topics is a vector with the cluster numbers, which can be used in other analyses…
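For instance, you can attach the assignments to the dataframe and inspect topic sizes (a minimal sketch; topic -1 collects the outlier documents):

# attach topic assignments to the dataframe and inspect topic sizes
reports['topic'] = topics
reports['topic'].value_counts().head(10)  # topic -1 = outliers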

Below are some built-in ways to explore the results.

topic_model.get_topic_info().head(10)  # overview of the largest topics
topic_model.visualize_barchart(top_n_topics=9, height=200)  # top terms per topic
topic_model.visualize_topics(top_n_topics=50)  # intertopic distance map
topic_model.visualize_hierarchy(top_n_topics=50, width=800)  # topic hierarchy
# dynamic analysis with "fake years"
topics_over_time = topic_model.topics_over_time(reports['summary'], topics, reports['year'])
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20, width=900, height=500)

You can also use BERTopic to generate embeddings and use them in other analyses… for instance in some supervised task. However, it is probably easier to go directly to SBERT (sentence transformers).

# create embeddings with the fitted model's embedding backend
embeddings = topic_model.embedding_model.embed_documents(reports['summary'])
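For reference, going directly to sentence-transformers would look roughly like this (a sketch; the package is installed as a BERTopic dependency):

# encode the summaries directly with sentence-transformers
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer('allenai-specter')
embeddings = sbert.encode(reports['summary'].tolist(), show_progress_bar=True)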