Random Vector Accumulator

  • Tried making the code more memory efficient.
  • Trained only location-based embeddings.
  • Created a pipeline that outputs the evaluation given the hyperparameters as input.
  • Adding functions to find the nearest vector for the analogy implementation (see the sketch after this list).
  • Began working on the NN implementation.
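As a rough sketch of how the nearest-vector lookup for the analogy task could look: the function names, the cosine-similarity scoring and the vector-offset formulation below are assumptions for illustration, not necessarily what the actual implementation does.

```python
import numpy as np

def nearest_vectors(query, embeddings, top_n=5):
    """Return the top_n entries whose memory vectors are closest
    to `query` under cosine similarity."""
    names = list(embeddings)
    matrix = np.stack([embeddings[name] for name in names])   # shape (V, d)
    # Normalise rows and the query so a dot product equals cosine similarity.
    matrix = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-12)
    query = query / (np.linalg.norm(query) + 1e-12)
    sims = matrix @ query
    best = np.argsort(-sims)[:top_n]
    return [(names[i], float(sims[i])) for i in best]

def analogy(a, b, c, embeddings, top_n=5):
    """Solve a : b :: c : ? with the usual vector-offset formulation."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the query words themselves from the candidates.
    candidates = {k: v for k, v in embeddings.items() if k not in (a, b, c)}
    return nearest_vectors(query, candidates, top_n)
```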

How the code works

WikiDetector and WikiExtractor are used to strip the wikicode from the Wikipedia dump, preserving only the plain text and the article titles; in addition, the entities are tagged with 'resource/'. Each word has two vectors associated with it: an index vector (a hyperdimensional sparse vector) and a lexical memory vector (the sum of the index vectors of its context words).

RVA.py searches for the entities tagged with 'resource/'. First, the index vector of the title entity is multiplied by a weight and added to the lexical memory vector of the entity, which is initialized as a zero vector. A context window is also defined: words within this window on either side of the entity are assigned index vectors, which are added to the entity's lexical memory vector. The index vectors of entities lying within a larger context window are multiplied by a weight and added as well.

The index vectors are implemented using the Sparse data structure, which stores only the non-zero values of the vector. As the number of non-zero entries increases, however, its size surpasses that of an equivalent numpy.array, so the memory vectors, which aren't necessarily sparse, are stored as numpy.arrays.
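A minimal sketch of this accumulation step follows. The dimensionality, window sizes and weights are illustrative placeholders rather than the values used in RVA.py, and dense numpy arrays are used for both vector types here for simplicity, whereas the real code keeps the index vectors in a sparse structure.

```python
import numpy as np

DIM = 500            # dimensionality of index and memory vectors (assumed)
NNZ = 10             # non-zero (+1/-1) entries per sparse index vector (assumed)
WINDOW = 2           # narrow window: every word contributes (assumed)
WIDE_WINDOW = 5      # wider window: only entities contribute (assumed)
TITLE_WEIGHT = 5.0   # weight for the article-title entity (assumed)
ENTITY_WEIGHT = 0.5  # weight for entities in the wider window (assumed)

rng = np.random.default_rng(0)
index_vectors = {}    # word   -> sparse random index vector
memory_vectors = {}   # entity -> dense lexical memory vector

def index_vector(word):
    """Lazily create a sparse ternary index vector for `word`."""
    if word not in index_vectors:
        v = np.zeros(DIM)
        positions = rng.choice(DIM, size=NNZ, replace=False)
        v[positions] = rng.choice([-1.0, 1.0], size=NNZ)
        index_vectors[word] = v
    return index_vectors[word]

def accumulate(tokens, entity_pos, title):
    """Update the memory vector of the entity at `entity_pos` in `tokens`."""
    entity = tokens[entity_pos]
    mem = memory_vectors.setdefault(entity, np.zeros(DIM))
    # The title entity's index vector is weighted and added to the memory vector.
    mem += TITLE_WEIGHT * index_vector(title)
    for offset in range(-WIDE_WINDOW, WIDE_WINDOW + 1):
        j = entity_pos + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        word = tokens[j]
        if abs(offset) <= WINDOW:
            mem += index_vector(word)                   # narrow window: any word
        elif word.startswith("resource/"):
            mem += ENTITY_WEIGHT * index_vector(word)   # wider window: entities only
```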

Since two sets of vectors are generated and the dictionary of these embeddings keeps growing, memory usage rises quickly at first but slows down as new words occur less and less often. On a limited subset of the Wikipedia corpus, the following t-SNE plots were obtained.

[Figure: (Repost) Embeddings of the first 0.67% of the corpus.]
[Figure: Zooming into a cluster.]
[Figure: 500-dimensional embeddings of capital-country pairs.]

The summation of index vectors still requires normalization. The parallel implementation of the RVA takes longer than the single-threaded version, possibly because the dictionary proxy provided by the multiprocessing module is slower than a regular dictionary.
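One way to check the dictionary-proxy hypothesis would be a quick timing comparison along these lines; this is only a sketch, and `time_updates` and the write count are made up for illustration.

```python
import time
from multiprocessing import Manager

def time_updates(d, n=200_000):
    """Time n dictionary writes, the dominant operation while accumulating."""
    start = time.perf_counter()
    for i in range(n):
        d[i % 1000] = i
    return time.perf_counter() - start

if __name__ == "__main__":
    with Manager() as manager:
        print("plain dict :", time_updates({}))
        print("proxy dict :", time_updates(manager.dict()))
```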
