Random Vector Accumulator

  • The RVA code took 1898 seconds to run on 12,487,761 words (0.67% of the corpus), while Word2Vec took only 200 seconds.
    • The method of generating sparse random vectors needs to be made more efficient.
    • When I checked CPU usage while generating the Word2Vec embeddings, it was at 100% across all the workers, but for RVA it never crossed 30%. For some reason Python’s multiprocessing library is unable to utilize 100% of the CPU.
  • Visualizing the RVA embeddings with t-SNE from sklearn shows that RVA was able to place a few similar concepts together. The embeddings in the figure are of dimension 200 and were generated using window size 10.

[Figure 1: t-SNE visualization of the 200-dimensional RVA embeddings]
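Since generating the sparse random vectors is the bottleneck named above, here is a minimal sketch of one way to speed that step up with vectorized NumPy; the dimension matches the 200 used above, but the number of non-zeros and the helper name are illustrative assumptions, not the values from my code.

    import numpy as np

    def sparse_ternary_vector(dim=200, nnz=10, rng=None):
        # Hypothetical helper: `nnz` randomly placed non-zeros,
        # half +1 and half -1, everything else zero.
        rng = rng or np.random.default_rng()
        vec = np.zeros(dim, dtype=np.int8)
        pos = rng.choice(dim, size=nnz, replace=False)  # distinct positions
        vec[pos[: nnz // 2]] = 1
        vec[pos[nnz // 2:]] = -1
        return vec

Storing only the positions and signs, instead of a dense array per word, would cut memory further and turn the accumulation step into a few indexed additions.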

Random Indexing

If I’m focusing only on entity embeddings, can I simply skip generating embeddings for the words that aren’t entities?

Random Indexing

  • Completed a vanilla random indexing implementation.
  • Will need to parallelize the code and tune the hyperparameters.
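For concreteness, a minimal sketch of what a vanilla random-indexing pass can look like, assuming sparse ternary index vectors as in the helper sketched earlier; the window size and dimensions here are placeholders, not my tuned values.

    from collections import defaultdict
    import numpy as np

    def random_index(sentences, dim=200, nnz=10, window=2, seed=0):
        # For every word, accumulate the fixed random index vectors of its
        # neighbours within +/- `window` positions.
        rng = np.random.default_rng(seed)
        index_vecs = {}
        context_vecs = defaultdict(lambda: np.zeros(dim, dtype=np.float32))

        def index_vec(word):
            # Lazily assign each word a fixed sparse ternary vector.
            if word not in index_vecs:
                v = np.zeros(dim, dtype=np.int8)
                pos = rng.choice(dim, size=nnz, replace=False)
                v[pos[: nnz // 2]] = 1
                v[pos[nnz // 2:]] = -1
                index_vecs[word] = v
            return index_vecs[word]

        for sent in sentences:
            for i, word in enumerate(sent):
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        context_vecs[word] += index_vec(sent[j])
        return dict(context_vecs)

Parallelizing this is mostly a matter of sharding sentences across workers and summing the per-shard context vectors afterwards, since vector addition is associative.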

Increasing Coverage

  • Testing only the embeddings of the entities tagged by WikiDetector.py, since entities are tagged only when an explicit link to the entity is present in the article, or when the entity is the one the article itself represents.

Raw Text Embeddings

  • Trying to modify the code to work on raw text, such that the end result of tagging is the same as it was for the Wikipedia dump.

Generating Word2Vec Embeddings

  • Tried to generate a test set of countries and their continents using SPARQL, but found it easier to generate from the Wikipedia article.
  • Evaluated all possible combinations of the form country1:continent1::country2:X, and got raw accuracies of Hits@1: 14.4%, Hits@3: 26.41%, Hits@10: 41.71%. Since a country can belong to only one continent, the filtered accuracy is the same.
  • I also tried a filtered evaluation of the reverse analogy, continent1:country1::continent2:X, but given that each continent has about 50 countries, the correct answer barely comes within the first 10 predictions of X.
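For reference, a minimal sketch of how such raw Hits@k numbers can be computed with gensim; the KeyedVectors instance and the analogy quadruples are placeholders. A filtered variant would additionally remove other known-correct answers from the ranking before checking.

    from gensim.models import KeyedVectors

    def hits_at_k(kv, quads, ks=(1, 3, 10)):
        # quads: (a, b, c, d) tuples read as a:b::c:d; check whether d
        # appears among the top-k predictions for b - a + c.
        hits, total = {k: 0 for k in ks}, 0
        for a, b, c, d in quads:
            try:
                preds = kv.most_similar(positive=[b, c], negative=[a], topn=max(ks))
            except KeyError:
                continue  # skip analogies with out-of-vocabulary terms
            total += 1
            ranked = [w for w, _ in preds]
            for k in ks:
                if d in ranked[:k]:
                    hits[k] += 1
        # max(total, 1) avoids division by zero if everything was OOV
        return {k: hits[k] / max(total, 1) for k in ks}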

Generating Word2Vec Embeddings

  • Tried to generate an analogy test set by querying country-domain relations in SPARQL.
    • The query select distinct ?currency, ?country where { ?currency dbo:usingCountry ?country } in a few instances returns a currency such as United_States_Dollar in the country column; United_States_Dollar / Afghan_afghani is one such pair. Evaluation on this test set is still running (a sketch of running the query with SPARQLWrapper appears after this list).
  • Using only the capital-country section of the Google Analogy Test Set:
    • Without the article-entity being tagged, it could solve analogies (of the form Hanoi : Vietnam :: Madrid : ?) with Hits@1: 86.28%, Hits@3: 92.54%, and Hits@5: 93.79%.
    • With the article-entity tagged: Hits@1: 88.31%, Hits@3: 94.09%, and Hits@5: 95.31%.
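A sketch of running the currency-country query against the public DBpedia endpoint with SPARQLWrapper; the prefix declaration and the URI-to-label conversion are my assumptions about how the pairs would be post-processed.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT DISTINCT ?currency ?country
        WHERE { ?currency dbo:usingCountry ?country }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Keep only the last URI segment, e.g. .../resource/Afghan_afghani -> Afghan_afghani
    pairs = [(row["currency"]["value"].rsplit("/", 1)[-1],
              row["country"]["value"].rsplit("/", 1)[-1])
             for row in results["results"]["bindings"]]

Rows whose country column is itself a currency, like the United_States_Dollar cases above, would still need to be filtered out before using the pairs for evaluation.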

Generating Word2Vec Embeddings

  • Trying to find a way to tag entities that aren’t linked in Wikipedia articles; for example, I often found ‘The New York Times’ mentioned without any link to it (a sketch of this kind of surface-form tagging and training pipeline appears after this list).
  • So far I have generated three sets of embeddings, and these are the evaluation results I got on the Google Analogy Test Set.
    1. Tagging entities in an article using the surface forms local to the article; each paragraph as presented on Wikipedia is input as a sentence for training the Word2Vec embeddings.
      • Matching the most relevant entity: 68.59%
      • Checking for the answer among the top 3 most relevant entities: 79.92%
    2. Tagging the article-entity using globally obtained surface forms and additionally tagging pronouns; each paragraph is treated as a sentence.
      • 71.085% and 82.39% respectively.
    3. Same as the previous, but treating each article as a sentence.
      • 71.075% and 82.36% respectively.
  • The test set is a bit general for this purpose: for example, the currency analogies expect India to return Rupee as the answer, while my embeddings return Indian_Rupee. Additionally, most sections don’t really evaluate entity embeddings.
  • I intend to generate a test set using entity relations extracted from DBpedia via SPARQL queries.
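To make the three setups above concrete, here is a hedged sketch of the overall pipeline shape: tag entity mentions via a surface-form dictionary, then train Word2Vec with each tagged paragraph as one sentence. The dictionary, tokenization, and hyperparameters are illustrative placeholders, not my actual code.

    from gensim.models import Word2Vec

    # Hypothetical surface-form dictionary: mention text -> entity id.
    SURFACE_FORMS = {
        "The New York Times": "The_New_York_Times",
        "New York": "New_York_City",
    }

    def tag_entities(paragraph, surface_forms):
        # Replace known surface forms with entity ids, longest form first,
        # so multi-word mentions survive whitespace tokenization as one token.
        for form in sorted(surface_forms, key=len, reverse=True):
            paragraph = paragraph.replace(form, surface_forms[form])
        return paragraph.split()

    paragraphs = ["I read about it in The New York Times yesterday."]
    sentences = [tag_entities(p, SURFACE_FORMS) for p in paragraphs]

    # Each tagged paragraph is one "sentence" for Word2Vec, as in setups 1 and 2;
    # treating a whole article as one sentence (setup 3) just concatenates them.
    model = Word2Vec(sentences, vector_size=200, window=10, min_count=1, workers=4)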