Does SGNS (Word2Vec) encode word frequency?
I’ve been wondering for a while about the behaviour of SGNS vs. SVD/PMI with regard to their difference in performance on analogy questions. (Have a look at Omer Levy’s TACL article for a thorough comparison of the different methods.)
One of my hypotheses is that SGNS somehow encodes word frequency in its embeddings. I think (I haven’t tested it yet) that using frequency could help with many analogy questions, since one would expect, on average, the frequency ratio between A and B to be the same as between C and D.
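To make the hypothesis concrete, here is a rough sketch of the check I have in mind (the counts are made up for illustration, since I haven’t run this yet): compare the two pairs’ frequency ratios on a log scale, where a small gap would support the hypothesis.

```python
import math

# Hypothetical counts for an analogy quadruple A:B :: C:D.
counts = {"king": 5000, "queen": 1800, "man": 9000, "woman": 4000}

def log_ratio(a, b, counts):
    """Log of the frequency ratio between two words."""
    return math.log(counts[b] / counts[a])

# If SGNS encodes frequency, analogy pairs should tend to have
# similar ratios: count(B)/count(A) ~ count(D)/count(C).
gap = abs(log_ratio("king", "queen", counts) - log_ratio("man", "woman", counts))
```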
One of the things that made me think SGNS could encode frequency is what happens when I look at the neighbours of rare words.
With SVD (in parentheses, the corpus count, with context window size = 2):
1.0 | iffy | (279) |
0.809346555885 | nitpicky | (125) |
0.807352614797 | miffed | (69) |
0.804748781838 | shaky | (934) |
0.804201693021 | sketchy | (527) |
0.797617651846 | clunky | (372) |
0.794685053096 | dodgy | (797) |
0.792423714522 | fishy | (494) |
0.7876010544 | listy | (211) |
0.786528559774 | picky | (397) |
0.78497044497 | underexposed | (73) |
0.784392371301 | unsharp | (130) |
0.77507297907 | choppy | (691) |
0.770029271436 | nit-picky | (90) |
0.763106516724 | fiddly | (43) |
0.762200309444 | muddled | (369) |
0.761961572477 | wonky | (196) |
0.761783868043 | disconcerting | (226) |
0.760421351856 | neater | (121) |
0.759557240261 | dissapointed | (32) |
Variation: 0.1 – 3× the query count
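For reference, lists like the one above are produced by ranking the vocabulary by cosine similarity to the query vector. A minimal sketch (with a tiny vocabulary and random stand-in embeddings, not the real SVD vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["iffy", "nitpicky", "shaky", "sketchy", "clunky"]
emb = rng.normal(size=(len(vocab), 50))  # one row per word

def neighbours(query, vocab, emb, k=3):
    """Return the k words most cosine-similar to `query` (query itself first)."""
    m = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
    sims = m @ m[vocab.index(query)]                      # cosine similarities
    order = np.argsort(-sims)[:k]
    return [(round(float(sims[i]), 3), vocab[i]) for i in order]

top = neighbours("iffy", vocab, emb)
```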
With SGNS:
1.0 | iffy | (279) |
0.970626 | nitpicky | (125) |
0.968823 | bitey | (265) |
0.968336 | rediculous | (104) |
0.967125 | far-fetched | (262) |
0.964707 | counter-intuitive | (185) |
0.964179 | presumptuous | (126) |
0.963679 | disputable | (179) |
0.963537 | usefull | (183) |
0.96175 | clunky | (372) |
0.96166 | counterintuitive | (203) |
0.961654 | un-encyclopedic | (101) |
0.961331 | worrisome | (213) |
0.960878 | self-explanatory | (156) |
0.960516 | unecessary | (143) |
0.960142 | nit-picky | (80) |
0.959044 | wordy | (413) |
0.958482 | disconcerting | (226) |
0.958218 | disingenuous | (534) |
0.958188 | off-putting | (104) |
Variation: 0.3 – 2× the query count
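The “Variation” figure is simply the spread of the neighbours’ counts relative to the query’s count, i.e. the min and max of count(neighbour) / count(query). With the SGNS counts listed above:

```python
# Counts taken from the SGNS neighbour list above; query is "iffy" (279).
query_count = 279
neighbour_counts = [125, 265, 104, 262, 185, 126, 179, 183, 372,
                    203, 101, 213, 156, 143, 80, 413, 226, 534, 104]

ratios = [c / query_count for c in neighbour_counts]
lo, hi = min(ratios), max(ratios)  # roughly 0.3 and 2, the "0.3 – 2x" above
```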
That could actually be quite understandable. Since the update procedure of SGNS is online, the time of an update matters: vectors at time t in training are not the same as at time t + s. So two rarely occurring words, if they occur at roughly the same time in the corpus, will tend to be updated by a few similar vectors (both positively and negatively, by the way), and will thus be pulled in the same direction by the data.
In other words, the rarer the word, the more its direction “lags behind” the rest of the embedding space; in a way, the embedding slightly encodes the “age” of the words, and thus their frequency.
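To make the “online” point concrete, here is a sketch of a single SGNS gradient step in the standard formulation (random stand-in vectors): a word’s vector only moves when the word occurs, towards or away from whatever its context vectors happen to look like at that moment in training.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=50)  # target word vector
c = rng.normal(scale=0.1, size=50)  # context word vector

def sgns_step(w, c, label, lr=0.025):
    """One gradient step on the SGNS objective:
    label=1 for a real (word, context) pair, 0 for a negative sample."""
    sigma = 1.0 / (1.0 + np.exp(-w @ c))  # predicted probability of the pair
    g = lr * (label - sigma)
    return w + g * c, c + g * w

# A positive update pulls the two vectors together.
w2, c2 = sgns_step(w, c, label=1)
```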
Here are the results of NCISC on the same query, with even more variation than SVD, and with some apparently good high-frequency candidates (debatable, unclear, unsatisfactory, questionable, and maybe wrong):
1.0 | iffy | (279) |
0.840267098681 | debatable | (711) |
0.835281154019 | keepable | (67) |
0.83013430458 | disputable | (179) |
0.821933628995 | sketchy | (527) |
0.813930181981 | unsatisfactory | (1176) |
0.812291787551 | unclear | (4445) |
0.804978094441 | nitpicky | (125) |
0.804038235737 | us-centric | (170) |
0.802013089227 | dodgy | (797) |
0.798211347285 | salvagable | (118) |
0.797837578971 | shaky | (934) |
0.797040400162 | counter-intuitive | (185) |
0.79600914212 | ambigious | (41) |
0.791063976199 | offputting | (33) |
0.789877149245 | questionable | (6541) |
0.789543971526 | notible | (78) |
0.786470266284 | unconvincing | (567) |
0.781320751203 | wrong | (12041) |
0.779762440213 | clunky | (372) |
Variation: 0.1 – 43× the query count
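A more direct test of the frequency hypothesis (which I haven’t run on the real embeddings; the data below is synthetic, with frequency deliberately leaked into one dimension) would be to fit a linear probe predicting log frequency from the embedding and check how much variance it explains:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 50
logfreq = rng.uniform(1, 10, size=n)   # synthetic log frequencies
emb = rng.normal(size=(n, d))          # synthetic embeddings
emb[:, 0] += 0.5 * logfreq             # leak frequency into dimension 0

# Least-squares linear probe: emb @ beta ~ logfreq (with a bias column).
X = np.hstack([emb, np.ones((n, 1))])
beta, *_ = np.linalg.lstsq(X, logfreq, rcond=None)
pred = X @ beta

# R^2 of the probe: high R^2 means frequency is linearly readable
# from the embedding.
r2 = 1 - np.sum((logfreq - pred) ** 2) / np.sum((logfreq - logfreq.mean()) ** 2)
```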