Topic Modeling of Wikidata Items via fastText and WikiProjects taxonomy

This notebook walks through an example of how we might predict topics for a given Wikipedia article from its associated Wikidata item. In practice, the training dataset would be much larger (all articles that are part of a WikiProject on English Wikipedia), and further model adjustments might be made.


fastText Model

The fastText classification model is a simple linear model that learns an embedding for each vocabulary word (in this case, Wikidata properties and values), averages those embeddings for a given document, and then learns a multinomial logistic classifier on top of this document embedding. In practice, it is very quick to train and often matches (or exceeds) the performance of more complex approaches.
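
To make the description above concrete, here is a minimal pure-Python sketch of the forward pass only: look up an embedding per token, average them into a document vector, and apply a linear layer followed by a softmax. It is not the fastText library itself. The vocabulary, the label names (`topic_a`, `topic_b`), the embedding size, and the randomly initialized weights are all hypothetical placeholders, so the printed probabilities carry no meaning; a trained model would have learned both the embeddings and the classifier weights from labeled data.

```python
import math
import random


def average_embedding(tokens, embeddings, dim):
    """Average the embeddings of all tokens found in the vocabulary."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known tokens: fall back to a zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, embeddings, weights, dim):
    """Document embedding -> linear scores (one per label) -> probabilities."""
    doc = average_embedding(tokens, embeddings, dim)
    scores = [sum(w * x for w, x in zip(row, doc)) for row in weights]
    return softmax(scores)


# Hypothetical toy setup: untrained, randomly initialized parameters.
rng = random.Random(0)
DIM = 4
vocab = ["P31", "Q5", "P106", "Q901"]   # illustrative Wikidata-style tokens
labels = ["topic_a", "topic_b"]         # placeholder label names

embeddings = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in vocab}
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in labels]

# A "document" is the bag of properties and values on one Wikidata item.
probs = predict_proba(["P31", "Q5", "P106", "Q901"], embeddings, weights, DIM)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

Because the document vector is a plain average, token order does not affect the prediction, which suits Wikidata statements since they have no inherent ordering.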

