Random Vector Accumulator

  • Tried making the code more memory efficient.
  • Trained only location-based embeddings.
  • Created a pipeline that takes the hyperparameters as input and outputs the evaluation.
  • Adding functions to find the nearest vector for the analogy implementation.
  • Began working on the NN implementation.
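A minimal sketch of what a nearest-vector lookup for the analogy task could look like. It assumes cosine similarity and the usual vector-offset formulation (b - a + c); the function names `nearest` and `analogy` and the toy vectors are hypothetical and not taken from the project code.

```python
import numpy as np

def nearest(query, embeddings, exclude=(), top_k=1):
    """Return the top_k keys whose vectors are most cosine-similar to query."""
    q = query / (np.linalg.norm(query) or 1.0)
    scored = []
    for key, vec in embeddings.items():
        if key in exclude:
            continue
        norm = np.linalg.norm(vec)
        if norm == 0:
            continue  # skip empty vectors, cosine is undefined for them
        scored.append((float(np.dot(q, vec) / norm), key))
    scored.sort(reverse=True)
    return [key for _, key in scored[:top_k]]

def analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    return nearest(target, embeddings, exclude={a, b, c})[0]

# Toy vectors (made up for illustration): capital = country + shared offset.
toy = {
    "France": np.array([1.0, 0.0, 0.0]),
    "Paris": np.array([1.0, 0.0, 1.0]),
    "Japan": np.array([0.0, 1.0, 0.0]),
    "Tokyo": np.array([0.0, 1.0, 1.0]),
    "Banana": np.array([-1.0, 0.0, 0.5]),
}
answer = analogy("France", "Paris", "Japan", toy)
```

The three query words are excluded from the candidates, which is the usual convention in analogy evaluation; whether the actual evaluation does this is an assumption here.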

How the code works.

WikiDetector and WikiExtractor are used to strip the wikicode from the Wikipedia dump, preserving only the regular text and the article titles. Additionally, the entities are tagged with the suffix ‘resource/’. Each word has two vectors associated with it: an index vector (a hyperdimensional sparse vector) and a lexical memory vector (the sum of the index vectors of its context words).

RVA.py searches for the entities tagged with the suffix ‘resource/’. First, the index vector of the title entity is multiplied by a weight and added to the lexical memory vector of the entity, which is initialized as a zero vector. A context window is also defined: words within this window on either side of the entity are assigned an index vector, and these index vectors are added to the lexical memory vector of the entity. The index vectors of entities lying within a larger context window are multiplied by a weight and added to the lexical memory vector as well.

The sparse vectors are implemented using the Sparse data structure, which stores only the non-zero values of the sparse vector. However, as the number of non-zero entries increases, its size surpasses that of an equivalent numpy.array, so the memory vectors are stored as numpy.arrays, since they aren’t necessarily sparse.
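The accumulation step described above can be sketched roughly as follows. This is not the code in RVA.py: the constants, the +1/-1 values of the index vectors, the position of the ‘resource/’ tag at the start of a token, and adding the title vector once per entity occurrence are all assumptions made for illustration. The larger, weighted window for neighbouring entities is left out to keep the sketch short.

```python
import numpy as np

DIM = 1000          # embedding dimensionality (assumed)
NONZERO = 10        # non-zero entries per index vector (assumed)
WINDOW = 2          # words considered on either side of an entity
TITLE_WEIGHT = 3.0  # weight for the title entity's index vector
TAG = "resource/"   # entity tag; assumed to sit at the start of the token

rng = np.random.default_rng(0)
index_vectors = {}   # token -> (positions, values), stored sparsely
memory_vectors = {}  # entity -> dense numpy array

def index_vector(token):
    """Create (once) and return a sparse random +1/-1 index vector."""
    if token not in index_vectors:
        positions = rng.choice(DIM, size=NONZERO, replace=False)
        values = rng.choice([-1.0, 1.0], size=NONZERO)
        index_vectors[token] = (positions, values)
    return index_vectors[token]

def add_sparse(dense, sparse, weight=1.0):
    """Add weight * sparse into the dense vector in place."""
    positions, values = sparse
    dense[positions] += weight * values  # positions are unique, so += is safe

def accumulate(tokens, title):
    """Update the memory vector of every tagged entity in one article."""
    for i, token in enumerate(tokens):
        if not token.startswith(TAG):
            continue
        memory = memory_vectors.setdefault(token, np.zeros(DIM))
        add_sparse(memory, index_vector(title), TITLE_WEIGHT)
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                add_sparse(memory, index_vector(tokens[j]))

tokens = ["the", "capital", "resource/Paris", "is", "large"]
accumulate(tokens, title="resource/France")
```

After this call, `memory_vectors["resource/Paris"]` holds three times the index vector of the title plus the index vectors of the four surrounding words.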

Since two sets of vectors are generated and the dictionary of these embeddings keeps growing, memory usage grows fast initially but slows down as new words occur less often. On a limited subset of the Wikipedia corpus, the following t-SNE plots were obtained.

[Figure: (Repost) Embeddings of the first 0.67% of the corpus.]
[Figure: Zooming into a cluster.]
[Figure: 500 dimensional embeddings of capital-country pairs.]

The summation of index vectors still requires normalization. The parallel implementation of the RVA takes longer than the single-threaded one, possibly because the dictionary proxy in the multiprocessing module is slower than a regular dictionary.
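The post does not say which normalization is intended. One common choice, shown here purely as an assumption, is to scale each memory vector to unit length so that frequent entities do not dominate similarity comparisons. The helper name `l2_normalize` is hypothetical.

```python
import numpy as np

def l2_normalize(vector):
    """Return the vector scaled to unit Euclidean length (zero stays zero)."""
    norm = np.linalg.norm(vector)
    return vector if norm == 0 else vector / norm

unit = l2_normalize(np.array([3.0, 4.0]))
```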

 

 


RVA

  • Saving embeddings in a dictionary and updating the shelve database in batches seems to work best.
  • After many trials, the code to generate the embeddings finally works on the whole corpus without running out of memory.
  • I’ll upload the analogy evaluation results once all the embeddings are generated.
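A rough sketch of the batching idea from the first bullet, assuming the embeddings are numpy arrays keyed by strings: updates are summed in an ordinary dictionary and merged into the shelve database whenever the dictionary reaches a chosen size. The names `flush` and `accumulate_in_batches`, the batch size, and the toy updates are hypothetical.

```python
import os
import shelve
import tempfile

import numpy as np

def flush(batch, db_path):
    """Merge the in-memory batch into the on-disk shelf, then empty it."""
    with shelve.open(db_path) as db:
        for key, vector in batch.items():
            db[key] = db[key] + vector if key in db else vector
    batch.clear()

def accumulate_in_batches(updates, db_path, batch_size):
    """Sum (key, vector) updates in memory and write them out in batches."""
    batch = {}
    for key, vector in updates:
        batch[key] = batch[key] + vector if key in batch else vector.copy()
        if len(batch) >= batch_size:
            flush(batch, db_path)
    flush(batch, db_path)  # write whatever is left over

db_path = os.path.join(tempfile.mkdtemp(), "embeddings")
updates = [("a", np.ones(3)), ("b", np.ones(3)),
           ("a", np.ones(3)), ("c", np.ones(3))]
accumulate_in_batches(updates, db_path, batch_size=2)

with shelve.open(db_path) as db:
    total_a = db["a"]  # sum of both updates for "a"
```

Because the embeddings are plain sums, merging a batch into the stored vector gives the same result as accumulating everything in memory, while only one batch has to fit in RAM at a time.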

Random Vector Accumulator

  • Tried running RVA on a single processor: it processed 0.67% of the corpus in 25 seconds, but it runs out of memory as the corpus size grows.
  • Using the Python module shelve to store and access the index dictionary on disk. RVA_single.py uses only one processor and takes 180 seconds, which is faster than Word2Vec, which takes 200 seconds when run on 8 processors.
  • The code is currently processing the whole corpus. I think the difference in time between RVA_single.py and RVA.py is due to the multiprocessing proxy dict being slower than the regular Python dictionary.

Random Vector Accumulator

  • Running out of memory when run on the whole corpus.
  • First generating the index vectors and saving them to disk using dbm, and then generating the embeddings, might take a bit longer but should use less memory.
  • Since the embeddings are weighted sums of index vectors, generating the lexical memory vectors in batches and adding the batch embeddings later on should also fix the issue.

Random Vector Accumulator

  • RVA.py took 780 seconds to generate 5000 dimensional entity embeddings on 0.67% of the corpus when executed using Cython; Word2Vec takes 200 seconds to generate 300 dimensional embeddings.
  • Storing only the position and value of non-zero components of the embeddings should be more memory efficient than storing individual components.
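To illustrate the memory argument, here is a small sketch that stores a vector as a pair of arrays (positions, values) and compares its size with the dense form. The helper names and the choice of 4-byte positions and 8-byte values are assumptions, not the layout of the project's Sparse structure.

```python
import numpy as np

def to_sparse(dense):
    """Keep only the positions and values of the non-zero components."""
    positions = np.flatnonzero(dense).astype(np.int32)
    return positions, dense[positions]

def to_dense(positions, values, dim):
    """Rebuild the full vector from its sparse representation."""
    dense = np.zeros(dim, dtype=values.dtype)
    dense[positions] = values
    return dense

def sparse_nbytes(positions, values):
    """Bytes used by the two arrays of the sparse representation."""
    return positions.nbytes + values.nbytes

dense = np.zeros(5000)
dense[[3, 42, 4999]] = [1.0, -1.0, 1.0]
positions, values = to_sparse(dense)
sparse_size = sparse_nbytes(positions, values)
dense_size = dense.nbytes
```

Under these assumptions each non-zero entry costs 12 bytes against 8 bytes per dense component, so the sparse form only pays off while fewer than about two thirds of the components are non-zero.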

Random Vector Accumulator

  • Giving a weight to the title-entity index vector results in better clustering of similar concepts. Comparison of the first 1000 embeddings with weight=0 and weight=3. [Figures: figure_1, figure_2]
  • Using a faster random number generating method, 5000 dimensional vectors with window size 2, running on 8 cores, took 1623 seconds on 12,487,761 words (0.67% of the corpus). This is the plot of the first 4000 entities. [Figure: figure_1-2]
  • Zooming in on a few clusters. [Figures: figure_1-1, figure_1-9, figure_1-6]
  • Still haven’t been able to solve the issue of CPU usage.
  • It takes Word2Vec 18 seconds to generate 2000 dimensional embeddings on 110,922 words, while the RVA takes 10 seconds, both running on 8 cores. At least with respect to the dimension of the embeddings, the RVA scales better in processing time.
  • Trying to get the code to execute using Cython or PyPy to improve the run time.
    • When trying to execute in Cython I always get “[Errno 21] Is a directory”, though the code runs fine using Python.