This project aims to productionize a Wikidata-based topic prediction model for the ORES environment. The initial proposal and descriptions for the project can be found here.

This project was undertaken as a part of the 2020 Summer Outreachy internship. I would like to thank my amazing mentors Isaac Johnson and Aaron Halfaker for their guidance and support throughout the project.

The coding part of the project was done in two parts:

  1. Preprocess the Wikidata dump and learn word embeddings for the relevant PIDs and QIDs using Fasttext. Find my work here.
  2. Train a supervised model using a Gradient Boosting Classifier (GBC) on article embeddings (the average of the word embeddings of all the words in an article) of labeled Wikidata items. Find my work here.

Phase-1: Preprocessing and learning the word embeddings

The mwtext library already provides a pipeline that preprocesses Wikipedia dumps and learns embeddings for the preprocessed wikitext. We added a new utility to this library so that it supports Wikidata as well. The new utility does the following in its preprocessing step:

  1. Filter out irrelevant Wikidata items in the dump. We filtered items that belong to at least one of the following categories:
    1. Wikidata items that are redirects (e.g., Q18511155)
    2. Wikidata items with no sitelinks to any Wikipedia (e.g., Q47586969)
    3. Wikidata items whose sitelinks point to Wikipedia pages that aren't articles, i.e., pages with a non-zero namespace (e.g., Q8207058)
  2. Extract relevant information. For each Wikidata item that isn't filtered out, the utility extracts and returns a list of PIDs and QIDs corresponding to that item. See Topic_Classification_of_Wikidata_Items.
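The filtering rules above can be sketched as follows. This is a simplified illustration, not the actual mwtext code: the entry shapes loosely follow the Wikidata JSON dump format, and the non-zero-namespace check is approximated by a title-prefix test.

```python
# Illustrative sketch of the filtering rules (not the actual mwtext
# implementation). Entries are shaped like Wikidata JSON dump records.

def is_relevant(entity):
    """Return True if a Wikidata entity should be kept for training."""
    # 1. Skip redirects (redirect entries carry a 'redirect' field in the dump).
    if "redirect" in entity:
        return False
    # 2. Skip items with no sitelinks to any Wikipedia. As a simplification,
    #    any site key ending in "wiki" is treated as a Wikipedia here.
    sitelinks = entity.get("sitelinks", {})
    wiki_links = {k: v for k, v in sitelinks.items() if k.endswith("wiki")}
    if not wiki_links:
        return False
    # 3. Skip items whose sitelinks all point to non-article pages. As a
    #    simplification, a namespace prefix in the title (e.g. "Category:...")
    #    marks a non-zero-namespace page.
    if all(":" in link["title"] for link in wiki_links.values()):
        return False
    return True

entities = [
    {"id": "Q1", "sitelinks": {"enwiki": {"title": "Universe"}}},          # kept
    {"id": "Q2", "redirect": "Q1"},                                        # redirect
    {"id": "Q3", "sitelinks": {}},                                         # no sitelinks
    {"id": "Q4", "sitelinks": {"enwiki": {"title": "Category:Physics"}}},  # non-article
]
kept = [e["id"] for e in entities if is_relevant(e)]
print(kept)  # ['Q1']
```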

At the end of the preprocessing step, we have a set of Wikidata items, each with its corresponding list of properties and values. Treating these PIDs and QIDs as words, we use Fasttext to learn an embedding for each such ID.
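The "PIDs and QIDs as words" idea can be made concrete with a small sketch: each item becomes one line of space-separated IDs, which is exactly the kind of corpus Fasttext's unsupervised mode consumes. The items and claim structure below are made-up stand-ins, and the real pipeline lives in mwtext.

```python
# Minimal sketch of preparing Fasttext input: each Wikidata item becomes one
# "sentence" whose words are its PIDs and QIDs. (Illustrative only; the real
# preprocessing is done by the mwtext utility described above.)
import os
import tempfile

items = {
    "Q42": [("P31", "Q5"), ("P106", "Q36180")],   # instance of: human; occupation: writer
    "Q64": [("P31", "Q515"), ("P17", "Q183")],    # instance of: city; country: Germany
}

lines = []
for qid, claims in items.items():
    words = []
    for pid, value_qid in claims:
        words.extend([pid, value_qid])
    lines.append(" ".join(words))

path = os.path.join(tempfile.mkdtemp(), "wikidata_corpus.txt")
with open(path, "w") as f:
    f.write("\n".join(lines) + "\n")

# Fasttext would then learn embeddings from this corpus, e.g.:
#   model = fasttext.train_unsupervised(path, dim=50)
print(lines[0])  # P31 Q5 P106 Q36180
```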

Phase-2: Training a classifier

For each item in the training dataset, a list of PIDs and QIDs is extracted. The article embedding is calculated by averaging the word embeddings (obtained in Phase-1) of those IDs. We then train a Gradient Boosting Classifier model with the following hyperparameters:

  • n_estimators = 150
  • max_depth = 5
  • max_features = log2
  • learning_rate = 0.1
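A minimal sketch of this setup with scikit-learn is below. It is illustrative only: the production model is trained through the drafttopic/revscoring tooling, the embeddings and labels here are random stand-ins, and the real task is multilabel (an item can have several topics), whereas this sketch uses a single binary label.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stand-in 50-dimensional word embeddings for PIDs/QIDs
# (Phase-1 would supply the real Fasttext vectors).
vocab = {f"Q{i}": rng.normal(size=50) for i in range(20)}

def article_embedding(ids):
    """Average the word embeddings of the IDs extracted from one item."""
    return np.mean([vocab[i] for i in ids], axis=0)

# Toy training set: 100 items, each a small bag of IDs, with a fake topic label.
items = [[f"Q{rng.integers(0, 20)}" for _ in range(5)] for _ in range(100)]
X = np.stack([article_embedding(ids) for ids in items])
y = rng.integers(0, 2, size=100)  # binary stand-in for topic labels

clf = GradientBoostingClassifier(
    n_estimators=150, max_depth=5, max_features="log2", learning_rate=0.1
)
clf.fit(X, y)
preds = clf.predict(X[:5])
print(preds.shape)  # (5,)
```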

While all the existing drafttopic models for wikitext use a Gradient Boosting Classifier, the experimental API for Wikidata was trained using Fasttext. We conducted several experiments to compare the performance of the two classifiers and decide which one works best for Wikidata. All the results of the experiments can be found here.

Experiment-1: Evaluation using cross validation on training dataset

The first phase of the experiment evaluated the performance of models trained while varying the following factors:

  • vocabulary size: the number of most frequent words (and their embeddings) retained in Phase-1. Vocabulary sizes of 10,000, 50,000, and 100,000 were used.
  • classifier: Gradient Boosting vs. Fasttext
  • training samples: balanced vs. imbalanced
  • size of the training dataset: ~64,000 and ~256,000 items were used.
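One simple way to build a balanced training sample is to downsample every class to the size of the rarest one. The sketch below shows that general technique, not the project's exact sampling code, and it sidesteps the extra complication that topic labels are actually multilabel:

```python
# Downsample each class to the size of the smallest class (a generic
# balancing technique; illustrative, not the project's sampling code).
import random
from collections import defaultdict

def balance(samples, seed=0):
    """Return a class-balanced subsample of (features, label) pairs."""
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append((features, label))
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# Toy imbalance: 10 rare-class items vs. 90 common-class items.
samples = [("x", "STEM.Mathematics")] * 10 + [("x", "Culture.Sports")] * 90
balanced = balance(samples)
labels = [label for _, label in balanced]
print(labels.count("STEM.Mathematics"), labels.count("Culture.Sports"))  # 10 10
```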

The statistics are aggregated over a five-fold cross-validation on the training dataset.
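Five-fold cross-validation with micro- and macro-averaged scores can be sketched with scikit-learn as follows. The data here is synthetic; the real evaluation runs through the project's own tooling and reports the metrics shown in the tables below.

```python
# Five-fold cross-validation reporting micro- and macro-averaged F1
# (synthetic stand-in data; illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # stand-in article embeddings
y = rng.integers(0, 2, size=100)  # stand-in topic labels

clf = GradientBoostingClassifier(n_estimators=50, max_depth=5,
                                 max_features="log2", learning_rate=0.1)
scores = cross_validate(clf, X, y, cv=5, scoring=["f1_micro", "f1_macro"])

# Each entry holds one score per fold; the reported statistic is the aggregate.
print({k: scores[k].mean().round(3) for k in ("test_f1_micro", "test_f1_macro")})
```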

  1. Classes with a higher population rate, e.g. Culture.Biography.Biography*, performed well with all the models and had very high precision/recall. Classes with rare occurrences, e.g. STEM.Mathematics, performed poorly throughout.
  2. Increasing the vocabulary size from 10k to 50k improved performance somewhat (2-3% overall) for both the balanced and imbalanced datasets. Further increasing it to 100k didn't show a significant boost for the models trained on balanced datasets.
  3. The performance of models trained on a balanced dataset and scaled by the population rate of the classes seems poor compared to those trained on an imbalanced dataset. However, the next phase of the experiment shows that this contrast in the statistics might not be very reliable.
  4. Fasttext models train considerably faster than GBC (a few seconds vs. almost an hour) and seem to perform better, as shown by the statistics obtained after five-fold cross-validation.

Experiment-2: Evaluation using a separate imbalanced testing dataset

Next, we collected ~150k Wikidata items that weren't used in training. We then evaluated the performance of the following four models on this dataset:

  • Gradient Boosting model trained on a balanced dataset of ~64,000 items
  • Gradient Boosting model trained on an imbalanced dataset of ~64,000 items
  • Fasttext model trained on a balanced dataset of ~64,000 items
  • Fasttext model trained on an imbalanced dataset of ~64,000 items
Fasttext, trained on 63,961 items (imbalanced dataset):
 recall (micro=0.794, macro=0.655)
 precision (micro=0.813, macro=0.737)
 f1 (micro=0.801, macro=0.688)
 accuracy (micro=0.966, macro=0.985)
 roc_auc (micro=0.969, macro=0.959)
 pr_auc (micro=0.84, macro=0.69)

Fasttext, trained on 63,944 items (balanced dataset):
 recall (micro=0.791, macro=0.69)
 precision (micro=0.8, macro=0.681)
 f1 (micro=0.792, macro=0.675)
 accuracy (micro=0.965, macro=0.984)
 roc_auc (micro=0.967, macro=0.961)
 pr_auc (micro=0.833, macro=0.686)

Gradient Boosting, trained on 63,961 items (imbalanced dataset):
 recall (micro=0.775, macro=0.614)
 precision (micro=0.83, macro=0.725)
 f1 (micro=0.798, macro=0.66)
 accuracy (micro=0.968, macro=0.985)
 roc_auc (micro=0.966, macro=0.951)
 pr_auc (micro=0.83, macro=0.642)

Gradient Boosting, trained on 63,944 items (balanced dataset):
 recall (micro=0.789, macro=0.674)
 precision (micro=0.805, macro=0.7)
 f1 (micro=0.792, macro=0.679)
 accuracy (micro=0.964, macro=0.984)
 roc_auc (micro=0.966, macro=0.962)
 pr_auc (micro=0.828, macro=0.664)

It was observed that all four models performed similarly; no single model did considerably better than the others. The other factor weighed in was training time, where Fasttext was far ahead of Gradient Boosting. However, using Fasttext would require substantial additions to the existing revscoring architecture. Since we didn't have a strong performance reason to prefer Fasttext over Gradient Boosting, we decided to stick with Gradient Boosting and reuse the existing training utilities, instead of updating them for Fasttext.

A Gradient Boosting classifier with the following parameters was used to train the final model:

Hyperparameters:
  • n_estimators = 150
  • max_depth = 5
  • max_features = log2
  • learning_rate = 0.1
Size of training dataset: 63,944 (balanced samples)
Vocabulary size: 10,000
Embedding dimension: 50

Statistics for the Gradient Boosting model after five-fold cross-validation on the balanced training dataset

Overall performance (scaled by population rates):

recall (micro=0.719, macro=0.621)
precision (micro=0.7, macro=0.554)
f1 (micro=0.703, macro=0.571)
accuracy (micro=0.978, macro=0.99)
roc_auc (micro=0.956, macro=0.951)
pr_auc (micro=0.721, macro=0.543)


Label-wise statistics (scaled by population rates):

topic n TP FP FN TN recall precision f1 pr_auc
Culture.Biography.Biography* 16670 15762 464 908 46810 0.946 0.929 0.937 0.952
Culture.Biography.Women 4110 3125 679 985 59155 0.76 0.504 0.606 0.589
Culture.Food and drink 1318 613 126 705 62500 0.465 0.371 0.413 0.352
Culture.Internet culture 2966 1948 140 1018 60838 0.657 0.517 0.578 0.549
Culture.Linguistics 1466 934 56 532 62422 0.637 0.852 0.729 0.656
Culture.Literature 5367 3996 404 1371 58173 0.745 0.619 0.676 0.726
Culture.Media.Books 1974 1560 136 414 61834 0.79 0.609 0.688 0.659
Culture.Media.Entertainment 1733 857 162 876 62049 0.495 0.429 0.459 0.433
Culture.Media.Films 2295 1896 122 399 61527 0.826 0.829 0.828 0.813
Culture.Media.Media* 14383 11572 1135 2811 48426 0.805 0.67 0.731 0.813
Culture.Media.Music 2583 2027 247 556 61114 0.785 0.807 0.795 0.818
Culture.Media.Radio 1156 857 44 299 62744 0.741 0.711 0.726 0.741
Culture.Media.Software 1750 685 307 1065 61887 0.391 0.094 0.152 0.094
Culture.Media.Television 2230 1510 176 720 61538 0.677 0.68 0.678 0.664
Culture.Media.Video games 2147 1758 54 389 61743 0.819 0.732 0.773 0.801
Culture.Performing arts 1334 741 116 593 62494 0.555 0.478 0.514 0.414
Culture.Philosophy and religion 2702 1074 285 1628 60957 0.397 0.472 0.431 0.339
Culture.Sports 5925 5186 249 739 57770 0.875 0.929 0.901 0.933
Culture.Visual arts.Architecture 2648 1867 230 781 61066 0.705 0.672 0.688 0.673
Culture.Visual arts.Comics and Anime 1508 1007 140 501 62296 0.668 0.416 0.513 0.558
Culture.Visual arts.Fashion 1199 669 98 530 62647 0.558 0.242 0.338 0.215
Culture.Visual arts.Visual arts* 6070 4131 554 1939 57320 0.681 0.566 0.618 0.666
Geography.Geographical 3464 2226 359 1238 60121 0.643 0.7 0.67 0.698
Geography.Regions.Africa.Africa* 6449 4664 414 1785 57081 0.723 0.462 0.564 0.639
Geography.Regions.Africa.Central Africa 1145 697 83 448 62716 0.609 0.244 0.349 0.321
Geography.Regions.Africa.Eastern Africa 1114 704 56 410 62774 0.632 0.262 0.371 0.253
Geography.Regions.Africa.Northern Africa 1280 774 108 506 62556 0.605 0.321 0.42 0.343
Geography.Regions.Africa.Southern Africa 1244 859 81 385 62619 0.691 0.411 0.515 0.514
Geography.Regions.Africa.Western Africa 1142 774 75 368 62727 0.678 0.297 0.413 0.277
Geography.Regions.Americas.Central America 1331 707 87 624 62526 0.531 0.569 0.55 0.494
Geography.Regions.Americas.North America 7625 5064 1169 2561 55150 0.664 0.682 0.673 0.726
Geography.Regions.Americas.South America 1532 1082 142 450 62270 0.706 0.681 0.693 0.691
Geography.Regions.Asia.Asia* 11647 8432 835 3215 51462 0.724 0.715 0.719 0.756
Geography.Regions.Asia.Central Asia 1086 671 70 415 62788 0.618 0.306 0.41 0.462
Geography.Regions.Asia.East Asia 2717 1727 241 990 60986 0.636 0.665 0.65 0.625
Geography.Regions.Asia.North Asia 2076 1336 163 740 61705 0.644 0.579 0.609 0.55
Geography.Regions.Asia.South Asia 2366 1612 135 754 61443 0.681 0.839 0.752 0.708
Geography.Regions.Asia.Southeast Asia 1721 1059 119 662 62104 0.615 0.668 0.641 0.557
Geography.Regions.Asia.West Asia 2160 1473 129 687 61655 0.682 0.794 0.734 0.662
Geography.Regions.Europe.Eastern Europe 3533 2472 234 1061 60177 0.7 0.771 0.733 0.71
Geography.Regions.Europe.Europe* 12939 9372 1810 3567 49195 0.724 0.642 0.681 0.744
Geography.Regions.Europe.Northern Europe 4221 2571 601 1650 59122 0.609 0.643 0.626 0.644
Geography.Regions.Europe.Southern Europe 2438 1565 268 873 61238 0.642 0.673 0.657 0.618
Geography.Regions.Europe.Western Europe 3076 1934 417 1142 60451 0.629 0.657 0.643 0.63
Geography.Regions.Oceania 2638 1859 138 779 61168 0.705 0.839 0.766 0.75
History and Society.Business and economics 3502 1544 569 1958 59873 0.441 0.315 0.367 0.248
History and Society.Education 2243 1113 255 1130 61446 0.496 0.489 0.493 0.4
History and Society.History 3172 1154 360 2018 60412 0.364 0.403 0.382 0.315
History and Society.Military and warfare 3238 1677 296 1561 60410 0.518 0.622 0.565 0.521
History and Society.Politics and government 4590 2406 329 2184 59025 0.524 0.731 0.611 0.603
History and Society.Society 2971 897 166 2074 60807 0.302 0.48 0.371 0.318
History and Society.Transportation 3629 2615 169 1014 60146 0.721 0.809 0.762 0.712
STEM.Biology 2916 2237 91 679 60937 0.767 0.948 0.848 0.816
STEM.Chemistry 1270 690 138 580 62536 0.543 0.294 0.382 0.27
STEM.Computing 1968 828 332 1140 61644 0.421 0.182 0.254 0.149
STEM.Earth and environment 1627 918 114 709 62203 0.564 0.594 0.579 0.522
STEM.Engineering 2195 1284 141 911 61608 0.585 0.596 0.591 0.51
STEM.Libraries & Information 1174 605 87 569 62683 0.515 0.203 0.291 0.238
STEM.Mathematics 1137 307 107 830 62700 0.27 0.068 0.109 0.125
STEM.Medicine & Health 1726 769 180 957 62038 0.446 0.499 0.471 0.398
STEM.Physics 1219 448 107 771 62618 0.368 0.168 0.23 0.126
STEM.STEM* 16449 12609 2766 3840 44729 0.767 0.477 0.588 0.768
STEM.Space 1365 932 47 433 62532 0.683 0.795 0.735 0.686
STEM.Technology 3648 1396 424 2252 59872 0.383 0.219 0.279 0.213

Statistics for the Gradient Boosting model on an imbalanced testing dataset

Overall performance:

recall (micro=0.789, macro=0.674)
precision (micro=0.805, macro=0.7)
f1 (micro=0.792, macro=0.679)
accuracy (micro=0.964, macro=0.984)
roc_auc (micro=0.966, macro=0.962)
pr_auc (micro=0.828, macro=0.664)


Label-wise statistics:

topic n TP FP FN TN recall precision f1 pr_auc
Culture.Biography.Biography* 47623 46074 1566 1549 100684 0.967 0.967 0.967 0.983
Culture.Biography.Women 5861 4753 2808 1108 141204 0.811 0.629 0.708 0.73
Culture.Food and drink 925 451 255 474 148693 0.488 0.639 0.553 0.462
Culture.Internet culture 1214 853 193 361 148466 0.703 0.815 0.755 0.565
Culture.Linguistics 2638 1821 162 817 147073 0.69 0.918 0.788 0.772
Culture.Literature 4860 3618 1017 1242 143996 0.744 0.781 0.762 0.817
Culture.Media.Books 1557 1309 229 248 148087 0.841 0.851 0.846 0.817
Culture.Media.Entertainment 1373 669 336 704 148164 0.487 0.666 0.563 0.542
Culture.Media.Films 4515 4025 249 490 145109 0.891 0.942 0.916 0.895
Culture.Media.Media* 19238 16841 3654 2397 126981 0.875 0.822 0.848 0.917
Culture.Media.Music 7302 6284 1378 1018 141193 0.861 0.82 0.84 0.888
Culture.Media.Radio 801 639 122 162 148950 0.798 0.84 0.818 0.822
Culture.Media.Software 412 163 308 249 149153 0.396 0.346 0.369 0.297
Culture.Media.Television 2777 1967 423 810 146673 0.708 0.823 0.761 0.785
Culture.Media.Video games 934 786 86 148 148853 0.842 0.901 0.87 0.862
Culture.Performing arts 1041 602 427 439 148405 0.578 0.585 0.582 0.594
Culture.Philosophy and religion 3837 1826 917 2011 145119 0.476 0.666 0.555 0.518
Culture.Sports 23849 21760 1540 2089 124484 0.912 0.934 0.923 0.953
Culture.Visual arts.Architecture 4261 3223 952 1038 144660 0.756 0.772 0.764 0.729
Culture.Visual arts.Comics and Anime 801 540 403 261 148669 0.674 0.573 0.619 0.628
Culture.Visual arts.Fashion 299 181 243 118 149331 0.605 0.427 0.501 0.349
Culture.Visual arts.Visual arts* 6876 5181 2255 1695 140742 0.753 0.697 0.724 0.78
Geography.Geographical 8125 5746 1673 2379 140075 0.707 0.774 0.739 0.792
Geography.Regions.Africa.Africa* 3154 2318 1530 836 145189 0.735 0.602 0.662 0.672
Geography.Regions.Africa.Central Africa 261 176 256 85 149356 0.674 0.407 0.508 0.499
Geography.Regions.Africa.Eastern Africa 183 129 149 54 149541 0.705 0.464 0.56 0.432
Geography.Regions.Africa.Northern Africa 487 325 266 162 149120 0.667 0.55 0.603 0.531
Geography.Regions.Africa.Southern Africa 498 352 333 146 149042 0.707 0.514 0.595 0.528
Geography.Regions.Africa.Western Africa 250 180 254 70 149369 0.72 0.415 0.526 0.365
Geography.Regions.Americas.Central America 1329 799 401 530 148143 0.601 0.666 0.632 0.583
Geography.Regions.Americas.North America 23096 16939 4747 6157 122030 0.733 0.781 0.757 0.835
Geography.Regions.Americas.South America 2712 2128 807 584 146354 0.785 0.725 0.754 0.77
Geography.Regions.Asia.Asia* 20162 15600 3425 4562 126286 0.774 0.82 0.796 0.849
Geography.Regions.Asia.Central Asia 274 173 207 101 149392 0.631 0.455 0.529 0.505
Geography.Regions.Asia.East Asia 4626 3321 938 1305 144309 0.718 0.78 0.748 0.715
Geography.Regions.Asia.North Asia 2128 1496 535 632 147210 0.703 0.737 0.719 0.705
Geography.Regions.Asia.South Asia 6412 4831 599 1581 142862 0.753 0.89 0.816 0.824
Geography.Regions.Asia.Southeast Asia 2353 1523 452 830 147068 0.647 0.771 0.704 0.661
Geography.Regions.Asia.West Asia 4534 3390 610 1144 144729 0.748 0.848 0.794 0.801
Geography.Regions.Europe.Eastern Europe 7013 5654 1049 1359 141811 0.806 0.844 0.824 0.814
Geography.Regions.Europe.Europe* 31173 25152 8462 6021 110238 0.807 0.748 0.776 0.852
Geography.Regions.Europe.Northern Europe 11246 7877 2617 3369 136010 0.7 0.751 0.725 0.797
Geography.Regions.Europe.Southern Europe 5424 3935 1431 1489 143018 0.725 0.733 0.729 0.774
Geography.Regions.Europe.Western Europe 7773 5941 1958 1832 140142 0.764 0.752 0.758 0.785
Geography.Regions.Oceania 6396 4901 610 1495 142867 0.766 0.889 0.823 0.841
History and Society.Business and economics 3785 1753 1144 2032 144944 0.463 0.605 0.525 0.479
History and Society.Education 3007 1790 932 1217 145934 0.595 0.658 0.625 0.594
History and Society.History 4139 1690 1042 2449 144692 0.408 0.619 0.492 0.49
History and Society.Military and warfare 5523 3219 922 2304 143428 0.583 0.777 0.666 0.69
History and Society.Politics and government 10973 6959 1247 4014 137653 0.634 0.848 0.726 0.767
History and Society.Society 3107 1011 355 2096 146411 0.325 0.74 0.452 0.445
History and Society.Transportation 5836 4529 567 1307 143470 0.776 0.889 0.829 0.843
STEM.Biology 13125 11974 300 1151 136448 0.912 0.976 0.943 0.949
STEM.Chemistry 511 269 312 242 149050 0.526 0.463 0.493 0.457
STEM.Computing 999 439 247 560 148627 0.439 0.64 0.521 0.452
STEM.Earth and environment 1745 1175 288 570 147840 0.673 0.803 0.733 0.683
STEM.Engineering 2164 1409 383 755 147326 0.651 0.786 0.712 0.66
STEM.Libraries & Information 207 108 160 99 149506 0.522 0.403 0.455 0.337
STEM.Mathematics 140 34 143 106 149590 0.243 0.192 0.215 0.104
STEM.Medicine & Health 1988 999 542 989 147343 0.503 0.648 0.566 0.518
STEM.Physics 326 128 226 198 149321 0.393 0.362 0.376 0.253
STEM.STEM* 22642 20834 12464 1808 114767 0.92 0.626 0.745 0.924
STEM.Space 853 639 72 214 148948 0.749 0.899 0.817 0.81
STEM.Technology 1753 624 529 1129 147591 0.356 0.541 0.429 0.401
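The per-label recall, precision, and f1 values follow directly from the TP/FP/FN counts. As a sanity check, recomputing them for the Culture.Biography.Biography* row of the table above reproduces the reported numbers:

```python
# Recompute recall/precision/f1 from the confusion counts of one row of the
# label-wise table (Culture.Biography.Biography*, imbalanced testing dataset).
tp, fp, fn = 46074, 1566, 1549

recall = tp / (tp + fn)                 # fraction of true Biography items found
precision = tp / (tp + fp)              # fraction of predicted Biography items correct
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 3), round(precision, 3), round(f1, 3))  # 0.967 0.967 0.967
```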


Although the statistics obtained on the balanced dataset don't look very good, the imbalanced dataset is a closer representation of the distribution of the actual data. The model's performance on the imbalanced data is considerably better, and this was the evaluation weighed most heavily when judging the model that will finally be used in production.