Topic Modeling of Wikidata Items via fastText and WikiProjects taxonomy

This notebook documents an example of how we might predict topics for a given Wikipedia article based on its associated Wikidata item. In practice, the training dataset would be much larger (all articles that are part of a WikiProject on English Wikipedia), and further model adjustments might be made.

Data

fastText Model

The fastText classification model is a simple linear model that learns an embedding for each vocabulary word (in this case, Wikidata properties and values), averages those embeddings for a given document, and then learns a multinomial logistic classifier on top of this document embedding. In practice, it is very quick and often matches (or exceeds) the performance of more complex approaches.
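To make the description above concrete, here is a minimal sketch of the forward pass only: average the token embeddings, then apply a softmax over linear scores. It is not the fastText library's implementation. The tokens, label names, dimensionality, and randomly initialised weights are all hypothetical placeholders, and no training is performed.

```python
import math
import random


def average_embedding(tokens, embeddings, dim):
    """Average the embeddings of the known tokens in one document."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero document vector.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens, embeddings, weights, labels, dim):
    """Return a label -> probability mapping for one document."""
    doc = average_embedding(tokens, embeddings, dim)
    # One linear score per label: dot product of its weight row with the doc vector.
    scores = [sum(w_i * d_i for w_i, d_i in zip(row, doc)) for row in weights]
    return dict(zip(labels, softmax(scores)))


# Toy setup (assumed values, not taken from the notebook's data).
rng = random.Random(0)
DIM = 4
vocab = ["P31", "Q5", "P106", "Q36180"]          # example property/value tokens
labels = ["Culture.Biography", "STEM.Physics"]    # hypothetical topic labels
embeddings = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in vocab}
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in labels]

probs = predict(["P31", "Q5"], embeddings, weights, labels, DIM)
print(probs)
```

In a trained model, both the embeddings and the classifier weights would be learned jointly from labeled documents rather than drawn at random as they are here.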

Imports, parameters, etc.

Train model

Example datapoints

Collect statistics

Full statistics

Top-level categories only

Recall

Precision

F1
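For reference, the three metrics named in the headings above can be computed per label from counts of true positives, false positives, and false negatives. The sketch below assumes each article has a set of true labels and a set of predicted labels; the function names and the toy labels `A` and `B` are illustrative and not taken from the notebook.

```python
from collections import defaultdict


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def per_label_stats(true_sets, pred_sets):
    """Return label -> (precision, recall, F1) for multi-label predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for true, pred in zip(true_sets, pred_sets):
        for label in true & pred:
            counts[label]["tp"] += 1   # predicted and correct
        for label in pred - true:
            counts[label]["fp"] += 1   # predicted but not correct
        for label in true - pred:
            counts[label]["fn"] += 1   # correct but missed
    return {
        label: precision_recall_f1(c["tp"], c["fp"], c["fn"])
        for label, c in counts.items()
    }


# Toy example with hypothetical labels.
true_sets = [{"A", "B"}, {"A"}, {"B"}]
pred_sets = [{"A"}, {"A", "B"}, {"B"}]
stats = per_label_stats(true_sets, pred_sets)
for label in sorted(stats):
    print(label, stats[label])
```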