Giving a weight to the title-entity index vector results in better clustering of similar concepts. Comparison of the first 1000 embeddings with weight=0 and weight=3.
Using a faster random number generation method, computing 5000-dimensional vectors with window size 2 on 8 cores took 1623 seconds on 12,487,761 words (0.67% of the corpus). This is a plot of the first 4000 entities.
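For reference, here is a minimal sketch of the random vector accumulation idea described above: each word gets a fixed random index vector, a word's embedding is the sum of the index vectors of its window neighbours, and a weighted copy of a title-entity index vector can be folded into every embedding. The function name, the choice of dense Gaussian index vectors, and the use of the first token as the title entity are all assumptions for illustration, not the actual implementation.

```python
import numpy as np

def rva_embed(tokens, dim=5000, window=2, title_weight=0.0, seed=0):
    """Sketch of random vector accumulation (random indexing).

    Each word in the vocabulary is assigned a fixed random index
    vector; a word's embedding accumulates the index vectors of the
    words inside its context window. `title_weight` optionally adds a
    weighted copy of a title-entity index vector (here assumed to be
    the first token's vector) to every embedding.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    # one fixed random index vector per word
    index = {w: rng.standard_normal(dim) for w in vocab}
    emb = {w: np.zeros(dim) for w in vocab}
    for i, w in enumerate(tokens):
        # accumulate index vectors of neighbours within the window
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                emb[w] += index[tokens[j]]
    if title_weight:
        title_vec = index[tokens[0]]  # assumption: title entity = first token
        for w in vocab:
            emb[w] += title_weight * title_vec
    return emb
```

Since there is no gradient training, a pass like this is a single accumulation over the corpus, which is consistent with it scaling more gracefully with dimension than word2vec.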
Zooming in on a few clusters.
Still haven’t been able to solve the issue of CPU usage.
It takes word2vec 18 seconds to generate 2000-dimensional embeddings on 110,922 words, while the RVA takes 10 seconds, both running on 8 cores. At least in terms of embedding dimensionality, the RVA scales better in processing time.
Trying to get the code to execute using Cython or PyPy to improve the run time.
When trying to execute in Cython I always get "[Errno 21] Is a directory", though the code runs fine using Python.
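"[Errno 21] Is a directory" is the OSError raised when a directory path is opened where a file is expected, so it often means a folder was passed where the build expects a .pyx file. A minimal Cython build script would look like the sketch below; the module filename `rva.pyx` is hypothetical, not the actual file in this project.

```python
# setup.py -- minimal Cython build sketch (module name assumed)
from setuptools import setup
from Cython.Build import cythonize

setup(
    # pass the .pyx file itself, not the directory containing it
    ext_modules=cythonize("rva.pyx"),
)
```

Built with `python setup.py build_ext --inplace`, after which the compiled module imports like any other Python module.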