Generating Word2Vec Embeddings

  • Tried to generate a test set with countries and their continents using SPARQL, but found it easier to generate it from the Wikipedia article.
  • Evaluated all possible combinations of the form country1:continent1::country2:X, and got the raw accuracy as Hits@1: 14.4%, Hits@3: 26.41%, Hits@10: 41.71%. Since a country can have only one possible continent, the filtered accuracy is the same.
  • I also tried the reverse direction, checking continent1:country1::continent2:X, but given that each continent has about 50 countries, the correct answer to the analogy rarely appears within the first 10 predictions of X.
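The Hits@k evaluation used above can be sketched in a few lines; this is a minimal illustration with toy vectors (`hits_at_k` is a hypothetical helper, not the actual evaluation script):

```python
import numpy as np

def hits_at_k(vectors, questions, k):
    """Hits@k for analogy questions a:b::c:d -- is d among the k
    nearest neighbours (by cosine) of v(b) - v(a) + v(c)?"""
    words = list(vectors)
    mat = np.stack([vectors[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    hits = 0
    for a, b, c, d in questions:
        target = vectors[b] - vectors[a] + vectors[c]
        sims = mat @ (target / np.linalg.norm(target))
        for w in (a, b, c):            # never predict the inputs themselves
            sims[words.index(w)] = -np.inf
        hits += d in [words[i] for i in np.argsort(-sims)[:k]]
    return hits / len(questions)
```

For a filtered setting, one would additionally mask candidates known to be correct answers of other questions before ranking.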

Generating Word2Vec Embeddings

  • Tried to generate an analogy test set by querying country domain relations in SPARQL.
    • The query select distinct ?currency ?country where { ?currency dbo:usingCountry ?country } in a few instances returns a currency such as United_States_Dollar in the country column; for example, (United_States_Dollar, Afghan_afghani) is one such pair. Evaluation on this test set is still running.
  • Using only the capital-country section of the Google Analogy Test Set,
    • Without the article-entity tagged, it could guess analogies (of the form Hanoi : Vietnam :: Madrid : ?) with Hits@1: 86.28%, Hits@3: 92.54%, Hits@5: 93.79%.
    • With the article-entity tagged, Hits@1: 88.31%, Hits@3: 94.09%, Hits@5: 95.31%.
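The malformed pairs can be filtered out before evaluation; a sketch (hypothetical helper, assuming the SPARQL result is a list of (currency, country) tuples):

```python
def drop_currency_countries(pairs):
    """Drop rows whose 'country' value is itself a currency, e.g. the
    (United_States_Dollar, Afghan_afghani) pair returned by the query."""
    currencies = {currency for currency, _ in pairs}
    return [(c, k) for c, k in pairs if k not in currencies]
```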

Generating Word2Vec Embeddings

  • Trying to find a way to tag entities that are not linked in Wikipedia articles; for example, I often found ‘The New York Times’ without any link to it.
  • So far I have generated three sets of embeddings; these are the evaluation results I got on the Google Analogy Test Set.
    1. Tagging entities in an article using the surface forms local to the article; each paragraph, as presented on Wikipedia, is input as a sentence for training the Word2Vec embeddings.
      • Matching the most relevant entity, 68.59%
      • Checking for the answer in the top 3 most relevant entities, 79.92%
    2. Tagging the article-entity using globally obtained surface forms and additionally tagging pronouns; each paragraph is treated as a sentence.
      • 71.085% and 82.39% respectively.
    3. Same as the previous, but treating each article as a sentence.
      • 71.075% and 82.36% respectively.
  • The test set is a bit general: for example, in the currencies analogy the test set expects India to return Rupee as the answer, while my embeddings return Indian_Rupee. Additionally, most sections of the test set don’t really evaluate entity embeddings.
  • I intend to generate a test set using entity-relations extracted from DBpedia using SPARQL queries.
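Given (entity, related-entity) pairs extracted via SPARQL, analogy questions can be generated by pairing every row with every other row; a sketch (hypothetical helper name):

```python
from itertools import permutations

def analogy_questions(pairs):
    """Turn relation pairs like (India, Indian_Rupee) into analogy
    questions a:b::c:d, one per ordered pair of distinct rows."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]
```

n pairs yield n·(n-1) questions, which matches the "all possible combinations" evaluation described above.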

Generating Word2Vec Embeddings

  • Obtained 2 sets of embeddings so far.
    1. With the article entity tagged using a global dictionary, and the other entities in the article tagged using the locally found anchor text.
    2. The previous corpus with its pronouns tagged.
  • Determining the gender of all dbo:Person entities took around 6 hours.
  • Will try to finish tagging the set of most referenced entities and generate their embeddings as soon as possible.
  • I have treated each paragraph as a sentence while training the Word2Vec model; next I’ll try training by treating each article as a sentence.
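The two sentence granularities can be sketched as follows (hypothetical helper; assumes paragraphs in the cleaned text are separated by blank lines):

```python
def to_sentences(article_text, per='paragraph'):
    """Build Word2Vec training 'sentences' from a tagged article:
    one token list per paragraph, or the whole article as one list."""
    paragraphs = [p.split() for p in article_text.split('\n\n') if p.strip()]
    if per == 'article':
        return [[token for p in paragraphs for token in p]]
    return paragraphs
```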

Generating Word2Vec Embeddings


  • Tagging pronouns
  • Tagging a set of most referenced entities.


  • Pronoun tagging is in progress.
  • Wrote the code to update most referenced entities.


  • joblib workers would freeze without crashing while tagging.


  • Will post the evaluation for 3 sets of embeddings in the next blog post. The three sets being:
    1. Replacing all instances of anchor text obtained locally within an article, and replacing the article’s own entity using anchor text from the whole dump.
    2. With pronouns resolved.
    3. Tagging a set of most referenced entities.





Generating Word2Vec Embeddings

Tagging the article’s entity

python output/ reads all the anchor text in the clean text files present in the output directory and generates a dictionary from it. This dictionary is used to tag the surface forms of an article’s entity within the article.

Obama was born in <a href="Honolulu">Honolulu, Hawaii</a>, two years after the territory was admitted to the Union as the 50th state. Raised largely in <a href="Hawaii">Hawaii</a>, Obama also spent one year of his childhood in <a href="Washington%20%28state%29">Washington State</a> and four years in <a href="Indonesia">Indonesia</a>.

The following text is obtained after tagging.

resource/Barack_Obama was born in resource/Honolulu, two years
after the territory was admitted to the Union as the 50th state.
Raised largely in resource/Hawaii, resource/Barack_Obama also
spent one year of his childhood in resource/Washington_(state)
and four years in resource/Indonesia.

The string Obama was tagged as resource/Barack_Obama because the dictionary generated has the following entries.

{'44th President of the United States', 'Barack', 
'Barack "Hussein" Obama', 'Barack "Barry" Obama', 
'Barack H. Obama', 'Barack H. Obama II', 'Barack Hussein Obama', 
'Barack Hussein Obama II', 'Barack Obama', 'Barack Obama II',
'Former President Barack Obama', 'Mr Obama', 'Obama', ...}

The tagging is in progress and should take a bit longer than 3.5 hours. An issue I fixed: in yesterday’s tagged corpus a few surface forms were tagged incorrectly because Wiktionary tags were preserved along with the links.
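The replacement step can be sketched as follows (hypothetical helper; longest surface forms are tried first so that ‘Barack Obama’ wins over plain ‘Obama’):

```python
import re

def tag_entity(text, entity, surface_forms):
    """Replace each surface form of the article's entity, longest first.
    re.escape guards forms containing quotes or other regex metacharacters."""
    for form in sorted(surface_forms, key=len, reverse=True):
        text = re.sub(r"\b%s\b" % re.escape(form), entity, text)
    return text
```

Because `_` is a word character, the `\b` boundaries also prevent re-matching ‘Obama’ inside an already-inserted resource/Barack_Obama token.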

Tagging Pronouns in dbo:Person articles

In order to determine the gender of an entity I implemented a pronoun counter script with the following pronoun groups:

 mascForms = ['he', 'him', 'his']
 femiForms = ['she', 'her', 'hers']

Using the counts of these pronoun categories to determine the gender of the entity, I will update the anchor dictionary for dbo:Person entities so that the dictionary contains the appropriate pronouns. For example, 'he' will be appended to the anchor text dictionary of Barack_Obama.
The resulting corpus will be trained separately from the previously obtained corpus.
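A minimal sketch of the counter (hypothetical function name; counts whole-word pronoun occurrences and picks the dominant group):

```python
import re
from collections import Counter

mascForms = ['he', 'him', 'his']
femiForms = ['she', 'her', 'hers']

def guess_gender(article_text):
    """Guess the entity's gender from pronoun counts in its article."""
    counts = Counter(re.findall(r"[a-z]+", article_text.lower()))
    masc = sum(counts[w] for w in mascForms)
    femi = sum(counts[w] for w in femiForms)
    if masc == femi:
        return 'unknown'
    return 'male' if masc > femi else 'female'
```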

Generating Word2Vec Embeddings

Firstly, python WikiExtractor.py --links -o output wikipedia-dump.xml clears the XML markup while preserving the links and saves the plain text files in the output directory.

Following this, python WikiDetector.py output generates a local dictionary for each article, of the form dictionary[entity] = [anchorText, label], and tags the strings that match the anchorText or label with the corresponding entity name. For example, after running the WikiExtractor python script the following string is obtained.

Anarchism is a <a href="political%20philosophy">political philosophy</a> that advocates <a href="self-governance">self-governed</a> societies based on voluntary institutions. These are often described as <a href="stateless%20society">stateless societies</a>, although several authors have defined them more specifically as institutions based on non-<a href="Hierarchy">hierarchical</a> <a href="Free%20association%20%28communism%20and%20anarchism%29">free associations</a>. Anarchism holds the <a href="state%20%28polity%29">state</a> to be undesirable, unnecessary, and harmful.

After the WikiDetector python script is run we obtain,

resource/Anarchism is a resource/Political_philosophy that advocates resource/Self-governance societies based on voluntary institutions. These are often described as resource/Stateless_society, although several authors have defined them more specifically as institutions based on non-resource/Hierarchy resource/Free_association_(communism_and_anarchism). resource/Anarchism holds the resource/State_(polity) to be undesirable, unnecessary, and harmful.

Initially I had intended for WikiDetector to process the Wikipedia dump file directly, but the python script processed only about 2 articles/s on average. On @tsoru’s suggestion I parallelized the code using the Parallel and delayed methods implemented in the Python joblib module; after modifying the code to run on the clean text produced by WikiExtractor, the whole dump was tagged within 3.5 hours. Using this dump as the baseline, I am training Word2Vec embeddings on the tagged corpus.
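The parallelization can be sketched as below (tag_file is a hypothetical stand-in for the actual per-file tagging work):

```python
from joblib import Parallel, delayed

def tag_file(path):
    # stand-in for reading one clean-text file and tagging its entities
    return path.upper()

def tag_all(paths, n_jobs=4):
    """Tag the extracted files in parallel; results keep input order."""
    return Parallel(n_jobs=n_jobs)(delayed(tag_file)(p) for p in paths)
```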

Following this, I intend to generate Word2Vec embeddings on the corpus with pronoun resolution implemented, as discussed in the previous blog post.


Increasing the coverage further

Tagging Pronouns

An entity is often referenced by its pronouns in its Wikipedia article; in order to increase the coverage of the entities, we would like to replace the pronouns in the article with the appropriate entity names. The pronouns are categorized as male (he, his), female (she, her) and plural (they, their). The pronoun group with the highest occurrence can be used to determine the gender of the entity in question.

Name                 He + his %   She + her %   They + their %
Abraham Lincoln      2.42         0.11          0.3
Albert Einstein      3.47         0.13          0.45
Barack Obama         2.08         0.09          0.21
Chelsea FC           0.22         0             1.02
Elephant             0.19         0.14          1.47
Elizabeth II         0.53         4.16          0.4
Game of Thrones      0.26         0.19          0.39
Mother Teresa        0.46         3.88          0.22
Causality            0.25         0.01          0.25
Pink Floyd           1.32         0.02          1.43
Edsger W. Dijkstra   2.99         0.02          0.22

After determining the gender of the entity, the corresponding pronoun group can be tagged with the entity name of the title.

Checking for dbo:Person

The pronoun should be replaced if the entity is a member of the dbo:Person class. The articles for which the substitution needs to occur are selected from a list of all entities classified as dbo:Person, this list is generated using SPARQL.

Generating Word2Vec Embeddings [5/5]

Text Replacement

text = re.sub(r"\b%s\b" % surfaceForm, candidateEntity, text, re.IGNORECASE)        # buggy
text = re.sub(r"\b%s\b" % surfaceForm, candidateEntity, text, flags=re.IGNORECASE)  # fixed

The lower one fixed the issue: re.IGNORECASE was being passed as the positional count argument instead of as flags. Text replacement works properly now; all the surface forms are replaced by their corresponding entities.
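The failure mode is easy to reproduce: re.sub’s signature is re.sub(pattern, repl, string, count=0, flags=0), so a positionally passed re.IGNORECASE (integer value 2) lands in the count slot and matching stays case-sensitive:

```python
import re

text = "obama met Obama; OBAMA spoke."
# re.IGNORECASE becomes count=2: case-sensitive, at most 2 replacements
buggy = re.sub(r"\bobama\b", "X", text, re.IGNORECASE)
# flags passed by keyword: case-insensitive replacement everywhere
fixed = re.sub(r"\bobama\b", "X", text, flags=re.IGNORECASE)
```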

Entity Tagging

For each article in the Wikipedia Dump, a new dictionary is generated by using the labels and anchor texts of the article.

Using this dictionary, all the surface forms within the article are replaced by their entity names.

A complete ‘global’ dictionary is generated using all the anchor text in the Wikipedia dump. This dictionary is used to replace the surface forms of the article’s title within its own article. For example, in the article for Barack Obama, after a few mentions he is simply referred to as Obama, but ‘Obama’ is the anchor text for Barack_Obama in some other article. The global dictionary captures this, and ‘Obama’ can be correctly tagged as entity/Barack_Obama in the Barack Obama article.

But the pronouns are not covered: places in the article where Barack Obama is referred to as ‘he’ remain untagged. If an article can be checked for whether it is about a person, then the pronouns can be tagged accordingly as well.

Clearing xml

The remaining XML markup is then cleared from the tagged Wikipedia dump.

Training Word2Vec embeddings

The Word2Vec embeddings for Wikipedia entities can be trained using