I have about 20 policy publications from the past five years: reports, op-eds, methodologies, dashboards. I wanted to see if I could automatically extract topics and themes from them using the cheap and accessible NLP tools available today.
The goal wasn't a production system. It was a proof of concept: could a small organization (a think tank, an advocacy group, a research shop) use this approach to automate research tagging without expensive custom solutions or dedicated data science staff?
The answer, tentatively: yes, mostly. The pipeline runs almost entirely locally (the Gemini consolidation step is the one hosted API call), costs little beyond compute time, and produces results that are interesting if not perfect.
What the pipeline does
The system processes each publication's abstract or summary, extracts keyword candidates using KeyBERT, then uses an LLM (Gemini) to consolidate synonyms into a clean vocabulary. Those keywords get embedded into vectors, projected to 2D coordinates using t-SNE, and clustered into themes.
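For concreteness, here is a minimal sketch of that flow in Python. The model names, prompt wording, cluster count, and t-SNE settings below are illustrative assumptions, not the exact values the pipeline uses, and the real version involves more prompt iteration and error handling.

```python
# Sketch of the pipeline: extract -> consolidate -> embed -> project -> cluster.
# Model names, prompt, and parameters are illustrative, not the production settings.
import google.generativeai as genai
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

abstracts = ["...publication abstract or summary...", "..."]  # one string per publication

# 1. Candidate keywords per abstract (KeyBERT scores candidate phrases
#    against the document embedding).
kw_model = KeyBERT(model="all-MiniLM-L6-v2")
candidates = [
    [kw for kw, _score in kw_model.extract_keywords(
        text, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=10)]
    for text in abstracts
]

# 2. One LLM pass to merge synonyms and near-duplicates into a clean vocabulary.
genai.configure(api_key="YOUR_API_KEY")          # assumes the google-generativeai SDK
llm = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
prompt = (
    "Consolidate these keywords by merging synonyms and near-duplicates. "
    "Return one canonical keyword per line:\n"
    + "\n".join(sorted({kw for doc in candidates for kw in doc}))
)
vocabulary = [line.strip()
              for line in llm.generate_content(prompt).text.splitlines()
              if line.strip()]

# 3. Embed the consolidated keywords.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(vocabulary)

# 4. Project to 2D for the scatter plot (perplexity must stay below the keyword count).
coords = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(embeddings)

# 5. Cluster the embeddings into themes (cluster count chosen by eye).
themes = KMeans(n_clusters=8, n_init="auto", random_state=42).fit_predict(embeddings)
```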
The result: each publication gets tagged with a consistent set of topics, and the keywords themselves can be visualized on a scatter plot where semantic neighbors cluster together. That's what you see in the footer of this site.
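The footer plot itself is rendered in the site's front end, but a static version of the same idea, continuing the sketch above, is just a scatter of the t-SNE coordinates colored by cluster:

```python
import matplotlib.pyplot as plt

# Scatter the 2D coordinates, color by theme, and label each point with its keyword.
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(coords[:, 0], coords[:, 1], c=themes, cmap="tab10", s=30)
for (x, y), kw in zip(coords, vocabulary):
    ax.annotate(kw, (x, y), fontsize=7, alpha=0.7)
ax.set_axis_off()
plt.tight_layout()
plt.show()
```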
What you see on this site
The footer scatter plot shows ~100 extracted keywords positioned by semantic similarity. Hover over any entry on a collection page and its keywords light up, revealing its "semantic neighborhood."
The domain breakdown on the Profile page uses the same extracted categories, which are the output of the pipeline (not hand-curated labels).
Honest limitations
The pipeline can hallucinate keywords that aren't in the source text. Some clusters are too broad, others too narrow. I've tuned the LLM prompts and embedding parameters by hand based on what "looks right" (no rigorous validation).
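One crude guard against the hallucination problem, sketched here as an illustration rather than something the pipeline currently does, is to drop any keyword whose words never appear in the source abstract (reusing names from the pipeline sketch above). The obvious cost is that it also rejects legitimate paraphrases introduced by the consolidation step.

```python
def grounded(keyword: str, source: str) -> bool:
    """Crude grounding check: keep a keyword only if every word in it
    appears somewhere in the source text (case-insensitive)."""
    source_lower = source.lower()
    return all(word in source_lower for word in keyword.lower().split())

# Filter each publication's tags against its own abstract.
tags_per_doc = [
    [kw for kw in vocabulary if grounded(kw, text)]
    for text in abstracts
]
```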
This is an experiment, not a product. But it shows what's possible with a week of work, open models you can run on a laptop, and one hosted LLM in the loop.