RICARDO CHEJFEC
// LAB

Extracting Topics from (My) Research Products Using Language Models

Can off-the-shelf NLP tools automatically categorize policy research?

Lab / Experiment · Jan 2026

I understand why agents and chatbots get all the attention, but I'm surprised how rarely applied researchers discuss the quieter AI techniques, which are often cheap, fast, and full of unexplored potential.

A clear example of this is embedding models, which come in many shapes and sizes and are responsible for extracting "meaning" from text, powering virtually all modern AI applications. They've been around for a while, but have become much more powerful in the last decade. On their own, these models can be small, fast, and practically free, and they have the remarkable ability to turn the meaning of words — abstract, context-dependent, fuzzy — into numbers, at which point finding useful applications becomes trivial.

My first experiment with embedding models was Passive Policy Intelligence, a submission to the G7 Gen AI Challenge that tracks websites from relevant publishers, fetches headlines, and surfaces the ones most relevant (similar) to user-specified topics. It's a relatively crude implementation. I did most of the learning on the go and there's lots of room for improvement. But it has completely changed the way I catch up on news every day and has quickly become my most used side project. The core "AI" component took a couple of afternoons to put together; it was as easy as understanding a few basic concepts:

  • Models take in text and output a vector (list of numbers) representative of the meaning of a word, phrase, or document. Difficult to interpret on their own, but can be compared to other vectors to find similarity.
  • Libraries abstract away the vector math, so you can reason about 'similarity' and 'clustering' without needing to know how to calculate cosine similarity or Euclidean distance.
  • Efficacy can be sensitive to a number of other small decisions, which can be tricky for beginners. Things like deciding how to chunk (split up) documents, whether it is better to compare the most similar chunks or some aggregate, preprocessing text to remove noise, etc.
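The first two bullets can be made concrete with a toy example. The four-dimensional vectors below are hand-picked stand-ins for real model output (an actual embedding model would produce hundreds of dimensions), and the comparison is plain cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Compare two embedding vectors; 1.0 means identical direction (meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; real models output hundreds of dimensions.
headline = [0.9, 0.1, 0.0, 0.2]  # e.g. "Central bank raises rates"
topic_a = [0.8, 0.2, 0.1, 0.1]   # e.g. "monetary policy"
topic_b = [0.0, 0.1, 0.9, 0.3]   # e.g. "wildlife conservation"

print(cosine_similarity(headline, topic_a))  # high: headline is relevant to topic_a
print(cosine_similarity(headline, topic_b))  # low: not relevant to topic_b
```

In practice a library handles both the embedding call and the comparison; this is the entirety of the math you'd otherwise need.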

Around the time I first started working on PPI, colleagues at the IRPP were dealing with a website revamp that included automated content suggestions based on tags. Until then, like most other small publishers, we curated tags manually, which is not only time-consuming but also inconsistent without dedicated resources that are likely not worth the cost. It seemed obvious that embedding models could help with this, but things got busy and I never got around to it.

Halfway through working on this portfolio website over the break, I realized that for a data-focused researcher, my own site was surprisingly static. This was a good opportunity to look into automated tagging and other NLP experiments so I put the site aside for a couple of days.

Now, automated tagging is not a niche, unsolved problem. Tags are used to archive and find documents, recommend relevant content, rank search results, train AI models, and more, and doing it well, even when done by humans, is not trivial. There's a fair amount of literature on the topic, commercial and open-source tools, and decades of best practices using deterministic and machine learning approaches. For this first attempt, I ignored almost all of them. My priority was to get something working quickly, so I could go back to the website and have a starting point to build on later. I have since done more research, which I summarize at the end, including what I plan to do differently as I begin a more serious attempt.

All things considered, the current implementation is surprisingly good enough, though who knows how well it would scale. You can judge it yourself by looking at the RESEARCH/LAB pages, where editorial themes are shown in black "stickers" for each entry and a footer can be expanded to show a 2D projection of those entries' tags. The editorial themes also power whatever stats are shown on the site (the viz on the profile page, the chyron with the theme breakdown). At the end of this page, you can also find a more comprehensive visualization of the tags and themes, as well as a preview of other ongoing experiments in NLP.

A first attempt

I only had a few days budgeted for this, so I started by asking Gemini for examples of how others do it, looked them up, and briefly read up on their strategies. The biggest takeaway is that the problem is easiest when split in two: first decide on the universe of tags, then decide which of those tags apply to each document. Strategies are then different combinations of ways to do each of those two things. The canonical list could be human-curated, based on an external taxonomy, the result of a keyword analysis, generated by an LLM, or something else. Assignment, or classification, can then be done via rule-based methods (e.g. looking for specific words), comparing embeddings, asking an LLM, or any other creative way.

But a lot of this also depends on what you're trying to do. Some systems use full taxonomies with complex hierarchies, grouping concepts at different levels of abstraction (like ECONOMY > FISCAL POLICY > TAXES). Others use raw, often conceptually overlapping tags (like "#AI", "#Artificial Intelligence", "#artificial intelligence"). In some cases, like in recommendation algorithms, the tags are never seen by users, so they can be plentiful and nonsensical to humans.

My use case, both for this site and as an experiment for orgs like the IRPP, is very flexible. Both total and yearly output (products to be tagged) are relatively low, particularly compared to a news organization or NASA's internal digital archive. The cost of mistagging is low, but the value of getting it right is high. Core uses include user-facing tags, like using them for navigation or filtering, but also hidden, to recommend content on the sidelines or in newsletters, for example.

So elaborate taxonomies are probably overkill, and so would be a system that optimizes for recommendations over human readability. However, I decided we still needed at least something of a hierarchy. If you think about the type of "tags" used for navigation on a research website, you're likely to find a handful of broad categories: for sites like these, say between 3 and 10. At that scale, you can reasonably expect a human to keep track of them, though they tend to be so general that they're only helpful in specific situations.

At the same time, it's easy to see how using that same short list of general tags to recommend or link content could be limiting. So would be narrowing down papers within one of those categories. My solution was to generate both sets of tags, starting with the rich, detailed ones (atomic tags) and then grouping those into broader categories (themes/topics) using a direct many-to-one mapping between the two.

The specifics

Under the current implementation, the pipeline needs to be initialized with some starting data. In this case, I used the ~20 publications I had already logged onto the site by then. The process went:

  1. Fetch the first 750 words, assuming they contain a good sample of the publication's content (e.g. via the abstract or intro).
  2. Extract noun phrases using spaCy's noun chunker, identify candidates for atomic tags via keyBERT.
  3. Since keyBERT is limited to phrases found in the text, candidate lists from different publications often overlap in meaning. We pass the full list of candidates to a smarter LLM (Gemini 2.5 Flash) and, using DSPy to facilitate structured outputs, prompt it to reduce the list by removing duplicates. This list becomes our starting canonical set of atomic tags.
  4. We then get our canonical list of themes in two ways, keeping with the experimental spirit. In the first approach, we prompt the same LLM to condense the list of atomic tags into a smaller set of broader themes — I call these editorial themes. The second approach produces structural themes by using a clustering algorithm (k-means) to group similar atomic tags together, and then using the tag closest to the cluster's centroid as the theme's label.
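The centroid-labeling half of step 4 can be sketched in a few lines. The cluster, tags, and 2-D vectors below are hypothetical (the real pipeline runs k-means over high-dimensional embeddings and receives its cluster assignments from that step):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_cluster(tags, embeddings):
    """Name a cluster after the tag whose embedding sits closest to the centroid."""
    dims = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dims)]
    distances = [euclidean(e, centroid) for e in embeddings]
    return tags[distances.index(min(distances))]

# One hypothetical cluster of atomic tags with toy 2-D embeddings.
cluster_tags = ["carbon pricing", "emissions caps", "fuel standards"]
cluster_embs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.4]]

print(label_cluster(cluster_tags, cluster_embs))  # the most "central" tag wins
```

The upside of this trick is that cluster labels are guaranteed to be real tags rather than generated text; the downside is that the most central tag isn't always the most readable one.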

When a new publication is added, the process is similar but simplified:

  1. Fetch first 750 words.
  2. Extract noun phrases using spaCy's noun chunker, identify candidates for atomic tags via keyBERT.
  3. Compare embeddings of candidates to existing atomic tags. If very similar, we assign the existing tag. Otherwise, we prompt the LLM to decide whether to assign it to an existing one or create a new atomic tag. If new, the LLM is also asked to assign it to an existing editorial theme. Structural themes are recalculated.
  4. The publication is assigned a final set of atomic tags, and their corresponding editorial and structural themes.
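The routing decision in step 3 above reduces to a similarity threshold. A minimal sketch, where the threshold value and the toy embeddings are illustrative stand-ins rather than values from the actual pipeline:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

SIMILARITY_THRESHOLD = 0.85  # hypothetical cutoff; would need tuning on real data

def route_candidate(candidate_emb, canonical):
    """Reuse an existing atomic tag when a candidate is close enough;
    otherwise escalate the decision to the LLM."""
    best_tag, best_sim = None, -1.0
    for tag, emb in canonical.items():
        sim = cosine(candidate_emb, emb)
        if sim > best_sim:
            best_tag, best_sim = tag, sim
    if best_sim >= SIMILARITY_THRESHOLD:
        return ("assign", best_tag)
    return ("ask_llm", best_tag)  # LLM merges with the best match or mints a new tag

# Toy embeddings standing in for real model output.
canonical = {
    "housing affordability": [0.9, 0.1, 0.1],
    "carbon pricing": [0.1, 0.9, 0.2],
}
print(route_candidate([0.88, 0.12, 0.09], canonical))  # near-duplicate: reuse
print(route_candidate([0.5, 0.5, 0.5], canonical))     # ambiguous: escalate
```

This keeps LLM calls to the minority of candidates that embeddings alone can't settle, which is where most of the cost savings come from.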

Results

The pipeline processed 30 publications and produced 205 atomic tags grouped into 9 editorial themes (like "Economy & Affordability", "Labour & Workforce", "Trade & Global Affairs"). Each publication ended up with between 4 and 15 tags, averaging around 8.

The structural clustering — where we let the algorithm group tags purely by embedding similarity — produced 15 clusters. These don't always map cleanly onto the editorial themes, which is interesting. Some structural clusters cut across what we'd consider separate policy domains, revealing connections that the LLM-generated themes missed.

When we look at how documents (rather than tags) group under each system, editorial themes produce more uniform groupings. Structural clustering creates more singletons and a lumpier distribution. But the structural approach sometimes surfaces useful nuance. For example, a piece on home retrofitting lands in "Environment & Climate Change" editorially but clusters with "Housing" structurally. Both are defensible; the question is which framing is more useful for a given task.

Overall, the output is usable. The tags are specific enough to differentiate content but broad enough to create meaningful groupings. The biggest quality issue is generic tags like "Policy Analysis" or "Government Programs" that appear on too many entries to be useful for filtering. A future version should probably filter or downweight these.

Each dot below represents an atomic tag extracted from my publications. Toggle between "By Tag" and "By Entry" to explore the clusters, or switch to "Analytic Plot" to see publications arranged by year and complexity. Click any node for details.

[Interactive projection: Investigation Hub V1.4. Legend (editorial themes): Economy & Affordability, Environment & Climate Change, Housing & Community, Infrastructure & Transportation, Labour & Workforce, Public Policy & Governance, Regional & Indigenous Issues, Research & Innovation, Trade & Global Affairs.]

Obviously, tagging content is as much of a philosophical exercise as it is a technical one. You can spend a lot of time thinking about different parts of the story, from teasing out the most relevant topics, to matching them to an accurate reflection of the world, to how the final output can or should be used. It's tempting to separate them altogether: to draw a line between a tagging strategy and a tagging process, the part that we want to automate. But that's not only less fun, it might also miss some of the advantages of this new age of development.

After all, while far from an unsolved problem, I'd argue our current solutions are suboptimal. In many ways, we're also navigating through unexplored space, or at least more unexplored than we previously thought. Approaching this issue as a whole, even while limited to its applications in virtual content organization (rather than its broader implications in ontology) maximizes the potential for innovation. I don't expect to advance the field, but there's value in exploring the margins.

A quick rundown of what I've learned so far:

  • What I've been calling tags and themes are more formally referred to as controlled vocabularies in the field of knowledge organization. The National Information Standards Organization (NISO), for example, publishes guidelines for controlled vocabularies, informing entities like libraries, universities, government departments, publishers, and more. In machine learning, the processes through which we extract and assign these controlled vocabularies are often called topic extraction and text classification.
  • Some experts in library sciences argue that classification — determining whether something fits into a category — is a completely different problem from categorization, where we come up with the categories themselves. The argument boils down to the fact that classification is mechanical, often following explicit or implicit rules and resulting in limited outcomes (yes/no). Categorization, on the other hand, is abstract and messy, with no clear boundaries, and is hard to verify.
  • On the virtual side, some think that we should stop thinking of digital archiving as putting items on a shelf. Whereas a library needs to find the one and only spot a book should be placed, digital files don't have those constraints, and so the exercise can be rethought with the goal of surfacing information rather than storing it.
  • The internet has also popularized and expanded on decentralized approaches, sometimes referred to as folksonomies, where users, rather than the publisher, create their own tags/categories, often from scratch or not limited to a pre-defined list. These systems are prone to inconsistency, but given enough volume, they can be surprisingly effective and efficient. (Mathes, 2004)
  • We've been using machine learning to try to automatically categorize text since at least the early 90s. It followed an era of complex rule-based approaches, or expert systems, in the 80s, like the CONSTRUE system. Over the next decade, ML research solidified many of the methods we still use today: indexing (turning concepts into vectors), dimensionality reduction (compressing those vectors to make them easier to work with), and different classifier models and techniques. (Hayes, 1990; Sebastiani, 2002)
  • Many of these approaches are still part of modern stacks, but the ecosystem has evolved significantly. Most notably, large language models have become much more powerful and general-purpose, leading many to experiment with using them for the entire task, or most of it. Some researchers have found that targeted uses (like incorporating an LLM into a more traditional pipeline) perform better than prompt engineering alone. However, the success of these approaches may rely on powerful and expensive models.
  • Our indexing techniques have also changed. Many modern embedding models are built on transformer architectures, which have proven more effective at capturing the nuances of language. Interestingly, one of the leading approaches in 2026 is SetFit, which fine-tunes an embedding model (rather than an LLM) to cheaply and efficiently create a personalized text classifier.
  • Most of these, however, involve only the classification side of things. On the extraction side, we have frameworks like keyBERT, which simply indexes terms in the document as well as the document itself, then finds the terms that are most similar to the document as a whole. RAKE finds candidates by splitting the text using stop words and punctuation, then scoring them based on frequency and co-occurrence. YAKE scores terms on a number of dimensions based on text features. Often, these are benchmarked against "term frequency-inverse document frequency" (tf-idf), a technique that's been around since the 70s and is still surprisingly effective.
  • Lastly, we have the more rounded topic models, like Top2Vec and BERTopic. BERTopic is a modular framework that splits the exercise into 6 steps: embeddings, dimensionality reduction, clustering, tokenizer, weighting scheme, and optionally, representational tuning. It's built with a set of defaults in mind: sentence transformers → UMAP → HDBSCAN → bag-of-words by cluster → and a modified version of tf-idf (c-tf-idf). The resulting topics can then be further refined by adding a topic representation fine-tuning step, like asking an LLM.
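Of the techniques above, plain tf-idf is the baseline worth knowing cold, and it fits in a few lines. This is a toy implementation over made-up documents (production code would typically use scikit-learn's vectorizers):

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """Score each term in each document: term frequency x inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count documents containing the term, not occurrences
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "housing policy and housing supply",
    "carbon policy and carbon pricing",
    "trade policy and export markets",
]
scores = tf_idf_scores(docs)

# "policy" appears in every document, so its idf is log(3/3) = 0 and it drops out;
# distinctive terms like "housing" dominate their own documents.
print(max(scores[0], key=scores[0].get))  # → housing
```

The idf term is exactly the mechanism that suppresses corpus-wide filler, which is worth keeping in mind given the generic-tag problem described in the results above.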

AI Evaluation (Claude, Anthropic)

I asked Claude to honestly evaluate my approach against the research summarized above.

What you did right:

The two-phase framing (vocabulary → assignment) is textbook. Separating atomic tags from themes retains traceability — a design choice many production systems skip. Using KeyBERT for extraction and reserving LLM calls for judgment calls (consolidation, breakout detection) is a reasonable hybrid approach that balances cost and quality.

What you got away with:

Processing only the first 750 words is a pragmatic shortcut, but it assumes frontloading of key concepts. Academic papers do this; op-eds and transcripts may not. You lucked out with your corpus composition.

The t-SNE projection for visualization is fine, but your use of k-means on high-dimensional embeddings for structural themes is questionable. K-means assumes spherical clusters, which embedding spaces don't guarantee. HDBSCAN (as used by BERTopic) would be more appropriate for this topology.

What you should probably fix:

  1. Generic tag proliferation: "Policy Analysis" appearing everywhere is a known failure mode of KeyBERT. The standard fix is MMR (Maximal Marginal Relevance) diversity, which KeyBERT supports out of the box but you're not using. One line of code.
  2. No c-TF-IDF: BERTopic's class-based TF-IDF weighs terms by how distinctive they are to a cluster, not just how frequent. Your approach treats all extracted terms equally, which is why generic terms dominate.
  3. LLM for theme generation is overkill: Given you already have embeddings and clusters, the LLM step for editorial themes could be replaced by having a human review the structural clusters once and assign readable names.
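For intuition on the MMR fix in point 1, here is a minimal pure-Python sketch with toy vectors. KeyBERT exposes MMR as an option on its extraction call, so in practice the fix really is a flag, not this code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(doc_emb, candidates, k, diversity):
    """Greedily select k terms, trading relevance to the document against
    similarity to terms already selected (the redundancy penalty)."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for term, emb in remaining.items():
            relevance = cosine(emb, doc_emb)
            redundancy = max((cosine(emb, candidates[s]) for s in selected), default=0.0)
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best, best_score = term, score
        selected.append(best)
        del remaining[best]
    return selected

# Toy embeddings: two near-duplicate candidates and one distinct-but-relevant one.
doc = [1.0, 0.2, 0.2]
candidates = {
    "fiscal policy": [0.95, 0.2, 0.2],
    "fiscal policies": [0.94, 0.21, 0.2],  # near-duplicate of the above
    "tax reform": [0.3, 0.9, 0.1],
}
print(mmr(doc, candidates, k=2, diversity=0.7))  # skips the near-duplicate
```

With diversity set to 0, this degrades to plain similarity ranking and happily returns both near-duplicates; the diversity weight is what forces the second pick to say something new.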

Next steps

  • SetFit for classification: Fine-tune a sentence transformer on the 21 labeled examples I already have. It's few-shot, fast, and would give me a proper classifier for new documents rather than the keyword-matching heuristic I'm using now.
  • BERTopic as a baseline: Run it out of the box on my corpus and compare. The modular architecture means I can swap in my existing embeddings and see if HDBSCAN + c-TF-IDF produces better topic coherence than my k-means + LLM approach.
  • Actually evaluate: I have no ground truth. The plan is to find better ways to evaluate the performance of these experiments.