This excellent article by Omer Levy, Yoav Goldberg and Ido Dagan gives an amazingly detailed view of what is really behind the success of Word2Vec and GloVe. I found it while reading the comments on a post by Radim Rehurek about Word2Vec.
Word2Vec is one of the methods that compete with our proprietary approach at eXenSa (the NCISC algorithm we use in eXenGine), even if it’s limited to words and documents. It produces very good embeddings.
What is truly impressive in the Levy et al. article is that they have dissected the algorithms to separate preprocessing choices (such as the word window, or the count-to-weight transformation, which is basically Pointwise Mutual Information), algorithm parameters (for SVD, they show that removing the eigenvalue weighting is always a good choice), and post-processing choices (for instance, the distance computation for analogy tasks).
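To make two of these choices concrete, here is a minimal sketch, not the authors' code, of the count-to-weight transformation (positive PMI) followed by a truncated SVD in which the eigenvalue weighting can be turned off. The function names, the `eig_power` parameter and the toy count matrix are my own assumptions for illustration; a real pipeline would use sparse matrices and a real corpus.

```python
import numpy as np


def ppmi(counts):
    """Positive PMI weights from a word-by-context co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    word_p = counts.sum(axis=1, keepdims=True) / total  # P(word)
    ctx_p = counts.sum(axis=0, keepdims=True) / total   # P(context)
    joint_p = counts / total                            # P(word, context)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint_p / (word_p * ctx_p))
    pmi[~np.isfinite(pmi)] = 0.0   # unseen pairs get a weight of 0
    return np.maximum(pmi, 0.0)    # keep only positive associations


def svd_embeddings(matrix, dim, eig_power=0.0):
    """Word vectors U_d * S_d**eig_power from a truncated SVD.

    eig_power=1.0 is the textbook factorization, eig_power=0.0 drops the
    eigenvalue weighting entirely (hypothetical parameter name).
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * (s[:dim] ** eig_power)


# Toy 3-word by 3-context count matrix, made up for the example.
counts = np.array([[10, 0, 2],
                   [0, 8, 1],
                   [3, 1, 6]])
weights = ppmi(counts)
vectors = svd_embeddings(weights, dim=2, eig_power=0.0)
```

Changing `eig_power` from 1.0 to 0.5 or 0.0 is a one-line change, which is exactly why it makes sense to treat it as a hyperparameter separate from the choice of SVD itself.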
The only parameter they didn’t test is the number of target features. This is a bit odd, since that’s one of the selling points of NCISC, which performs very well even with 1/10 of the features required for Word2Vec, but it is easily explained: they already had to test something like 1000 different combinations.
As far as I can tell, the most important piece of information from this article is that the choice of method (raw PPMI, SVD, Word2Vec or Gradient Descent) does not play the biggest role in the final performance. It’s not really a surprise, since the main advantage of neural networks lies in their non-linearity, and with text data you basically already have too many dimensions to need dimensionality expansion.