Work in Progress: Draft Write-up

Extracting Topics from (My) Research Products Using Language Models

Can off-the-shelf NLP tools automatically categorize policy research?

Lab / Experiment · Jan 2026 · 4 min read

I have about 20 policy publications from the past five years: reports, op-eds, methodologies, dashboards. I wanted to see if I could automatically extract topics and themes from them using the cheap and accessible NLP tools available today.

The goal wasn't a production system. It was a proof of concept: could a small organization (a think tank, an advocacy group, a research shop) use this approach to automate research tagging without expensive custom solutions or dedicated data science staff?

The answer, tentatively: yes, mostly. The pipeline runs locally, costs nothing beyond compute time, and produces results that are interesting if not perfect.

What the pipeline does

The system processes each publication's abstract or summary, extracts keyword candidates using KeyBERT, then uses an LLM (Gemini) to consolidate synonyms into a clean vocabulary. Those keywords get embedded into vectors, projected to 2D coordinates using t-SNE, and clustered into themes.
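As a rough sketch of the extraction step (the function name and parameter choices here are mine, not the pipeline's), KeyBERT can pull ranked keyphrase candidates from an abstract using the same MiniLM model listed in the tech stack below:

  # Keyword-extraction sketch. The embedding model matches the tech stack;
  # the ngram range and top_n are illustrative assumptions.
  from keybert import KeyBERT

  kw_model = KeyBERT(model="all-MiniLM-L6-v2")

  def extract_candidates(abstract: str, top_n: int = 10) -> list[str]:
      """Return the top-ranked keyphrase candidates for one abstract."""
      pairs = kw_model.extract_keywords(
          abstract,
          keyphrase_ngram_range=(1, 2),  # single words and bigrams
          stop_words="english",
          top_n=top_n,
      )
      return [phrase for phrase, _score in pairs]

Run over every abstract, this produces a noisy candidate pool full of near-duplicates, which is exactly what the LLM consolidation step is there to clean up.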

The result: each publication gets tagged with a consistent set of topics, and the keywords themselves can be visualized on a scatter plot where semantic neighbors cluster together. That's what you see in the footer of this site.
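A minimal sketch of how those scatter-plot positions could be produced, assuming the consolidated keywords are embedded with sentence-transformers and projected with scikit-learn's t-SNE (the perplexity and seed here are guesses, not the values actually used):

  # Embed consolidated keywords and project them to 2D for plotting.
  # Perplexity and random_state are illustrative, not the real settings.
  from sentence_transformers import SentenceTransformer
  from sklearn.manifold import TSNE

  model = SentenceTransformer("all-MiniLM-L6-v2")

  def project_keywords(keywords: list[str]) -> list[dict]:
      embeddings = model.encode(keywords)      # one 384-dim vector per keyword
      coords = TSNE(
          n_components=2,
          perplexity=15,       # keep well below the ~100 keywords
          random_state=42,
      ).fit_transform(embeddings)
      return [
          {"keyword": kw, "x": float(x), "y": float(y)}
          for kw, (x, y) in zip(keywords, coords)
      ]

Serialized to JSON, those (x, y) pairs are the kind of payload a Canvas plot like the footer one can draw directly.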

What you see on this site

The footer scatter plot shows ~100 extracted keywords positioned by semantic similarity. Hover over any entry on a collection page and its keywords light up, revealing its "semantic neighborhood."

The domain breakdown on the Profile page uses the same extracted categories, which are the output of the pipeline (not hand-curated labels).

Honest limitations

The pipeline can hallucinate keywords that aren't in the source text. Some clusters are too broad, others too narrow. I've tuned the LLM prompts and embedding parameters by hand based on what "looks right" (no rigorous validation).

This is an experiment, not a product. But it shows what's possible with a week of work and models you can run on a laptop.

Pipeline Steps

  1. Chunk documents (abstracts preferred)
  2. Extract keywords via KeyBERT
  3. Consolidate synonyms via LLM (sketched after this list)
  4. Embed with sentence-transformers
  5. Project to 2D via t-SNE
  6. Cluster into ~10 themes
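Step 3 is the only step that calls an external model. Here is a minimal sketch of how the Gemini call could be wired through DSPy; only "Gemini 2.5 Flash via DSPy" comes from the tech stack, while the signature, field names, and docstring prompt are my assumptions:

  # Synonym-consolidation sketch. Model choice is from the tech stack;
  # the signature and field names are assumptions, not the real prompt.
  import dspy

  dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))  # expects GEMINI_API_KEY in the environment

  class ConsolidateKeywords(dspy.Signature):
      """Merge near-synonyms in a keyword list into a clean, deduplicated vocabulary."""

      raw_keywords: list[str] = dspy.InputField(desc="noisy KeyBERT candidates")
      consolidated: list[str] = dspy.OutputField(desc="canonical keywords, one per concept")

  consolidate = dspy.Predict(ConsolidateKeywords)

  def consolidate_vocabulary(raw_keywords: list[str]) -> list[str]:
      return consolidate(raw_keywords=raw_keywords).consolidated

This is also the step most likely to introduce keywords that never appear in the source text, as noted in the limitations above.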

Tech Stack

  • Embeddings: all-MiniLM-L6-v2
  • Keywords: KeyBERT
  • LLM: Gemini 2.5 Flash (via DSPy)
  • Projection: t-SNE
  • Clustering: K-Means (k=10; sketched below)
  • Visualization: Svelte + Canvas

Interactive Pivot: The Semantic Switchboard

The embedded visualization ("Investigation Hub V1.4") offers a Projection selector plus "Color DNA" and "Spatial Overlays" toggles. Its legend lists nine editorial themes:

  • Economy & Affordability
  • Environment & Climate Change
  • Housing & Community
  • Infrastructure & Transportation
  • Labour & Workforce
  • Public Policy & Governance
  • Regional & Indigenous Issues
  • Research & Innovation
  • Trade & Global Affairs

Navigating the Switchboard

  • Cluster DNA: In Tag View, toggle "Color DNA" to switch between editorial themes (curated) and structural clusters (purely algorithmic).
  • Deep Dive: Click any node to open the Selection DNA panel, revealing specific metrics (Audience, Intent, Methodology) and a full list of related publications.
  • Hybridity: Hover over an entry in the map to see its "Thematic Tethers," lines that trace its connections to multiple semantic neighborhoods.
  • Analytic Plot: Pivot to the Analytic Plot to see publications arranged by year, length, or complexity. Purple nodes represent multiple publications sharing the same coordinates.