Does SGNS (Word2Vec) encode word frequency?
I’ve been wondering for a while about the behaviour of SGNS vs. SVD/PMI with regard to their difference in performance on analogy questions. (Have a look at Omer Levy’s TACL article for a thorough comparison of the different methods.)
One of my hypotheses is that SGNS somehow encodes word frequency in its embeddings. I think (I haven’t tested it yet) that using frequency could help with many analogy questions, since one would expect, on average, the frequency ratio between A and B to be the same as between C and D.
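To make the hypothesis concrete, here is a rough sketch of the check I have in mind (the counts are made up for illustration, since I haven’t run this yet): compare the two pairs’ frequency ratios on a log scale, where a small gap would support the hypothesis.

```python
import math

# Hypothetical counts for an analogy quadruple A:B :: C:D.
counts = {"king": 5000, "queen": 1800, "man": 9000, "woman": 4000}

def log_ratio(a, b, counts):
    """Log of the frequency ratio between two words."""
    return math.log(counts[b] / counts[a])

# If SGNS encodes frequency, analogy pairs should tend to have
# similar ratios: count(B)/count(A) ~ count(D)/count(C).
gap = abs(log_ratio("king", "queen", counts) - log_ratio("man", "woman", counts))
```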
One of the things that made me think SGNS could encode frequency is what happens when I look at the neighbours of rare words.
With SVD (in parentheses, the corpus count, with context window size = 2):
1.0 | iffy | (279) |
0.809346555885 | nitpicky | (125) |
0.807352614797 | miffed | (69) |
0.804748781838 | shaky | (934) |
0.804201693021 | sketchy | (527) |
0.797617651846 | clunky | (372) |
0.794685053096 | dodgy | (797) |
0.792423714522 | fishy | (494) |
0.7876010544 | listy | (211) |
0.786528559774 | picky | (397) |
0.78497044497 | underexposed | (73) |
0.784392371301 | unsharp | (130) |
0.77507297907 | choppy | (691) |
0.770029271436 | nit-picky | (90) |
0.763106516724 | fiddly | (43) |
0.762200309444 | muddled | (369) |
0.761961572477 | wonky | (196) |
0.761783868043 | disconcerting | (226) |
0.760421351856 | neater | (121) |
0.759557240261 | dissapointed | (32) |
Variation: 0.1 – 3× the query count
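For reference, lists like the one above are produced by ranking the vocabulary by cosine similarity to the query vector. A minimal sketch (with a tiny vocabulary and random stand-in embeddings, not the real SVD vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["iffy", "nitpicky", "shaky", "sketchy", "clunky"]
emb = rng.normal(size=(len(vocab), 50))  # one row per word

def neighbours(query, vocab, emb, k=3):
    """Return the k words most cosine-similar to `query` (query itself first)."""
    m = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise rows
    sims = m @ m[vocab.index(query)]                      # cosine similarities
    order = np.argsort(-sims)[:k]
    return [(round(float(sims[i]), 3), vocab[i]) for i in order]

top = neighbours("iffy", vocab, emb)
```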
With SGNS:
1.0 | iffy | (279) |
0.970626 | nitpicky | (125) |
0.968823 | bitey | (265) |
0.968336 | rediculous | (104) |
0.967125 | far-fetched | (262) |
0.964707 | counter-intuitive | (185) |
0.964179 | presumptuous | (126) |
0.963679 | disputable | (179) |
0.963537 | usefull | (183) |
0.96175 | clunky | (372) |
0.96166 | counterintuitive | (203) |
0.961654 | un-encyclopedic | (101) |
0.961331 | worrisome | (213) |
0.960878 | self-explanatory | (156) |
0.960516 | unecessary | (143) |
0.960142 | nit-picky | (80) |
0.959044 | wordy | (413) |
0.958482 | disconcerting | (226) |
0.958218 | disingenuous | (534) |
0.958188 | off-putting | (104) |
Variation: 0.3 – 2× the query count
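The “Variation” figure is simply the spread of the neighbours’ counts relative to the query’s count, i.e. the min and max of count(neighbour) / count(query). With the SGNS counts listed above:

```python
# Counts taken from the SGNS neighbour list above; query is "iffy" (279).
query_count = 279
neighbour_counts = [125, 265, 104, 262, 185, 126, 179, 183, 372,
                    203, 101, 213, 156, 143, 80, 413, 226, 534, 104]

ratios = [c / query_count for c in neighbour_counts]
lo, hi = min(ratios), max(ratios)  # roughly 0.3 and 2, the "0.3 – 2x" above
```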
That could actually be quite understandable. Since the update procedure of SGNS is online, the time of an update matters: vectors at time t in training are not the same as at time t + s. So two rarely occurring words, if they occur at roughly the same time in the corpus, will tend to be updated by a few similar vectors (both positively and negatively, by the way), and will thus be pulled in the same direction by the data.
In other words, the rarer the word, the more its direction “lags behind” the rest of the embedding space; in a way, the embedding slightly encodes the “age” of the words, and thus their frequency.
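To make the “online” point concrete, here is a sketch of a single SGNS gradient step in the standard formulation (random stand-in vectors): a word’s vector only moves when the word occurs, towards or away from whatever its context vectors happen to look like at that moment in training.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=50)  # target word vector
c = rng.normal(scale=0.1, size=50)  # context word vector

def sgns_step(w, c, label, lr=0.025):
    """One gradient step on the SGNS objective:
    label=1 for a real (word, context) pair, 0 for a negative sample."""
    sigma = 1.0 / (1.0 + np.exp(-w @ c))  # predicted probability of the pair
    g = lr * (label - sigma)
    return w + g * c, c + g * w

# A positive update pulls the two vectors together.
w2, c2 = sgns_step(w, c, label=1)
```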
Here are the results of NCISC on the same query, with even more variation than SVD, and with some apparently good high-frequency candidates (debatable, unclear, unsatisfactory, questionable, and maybe wrong):
1.0 | iffy | (279) |
0.840267098681 | debatable | (711) |
0.835281154019 | keepable | (67) |
0.83013430458 | disputable | (179) |
0.821933628995 | sketchy | (527) |
0.813930181981 | unsatisfactory | (1176) |
0.812291787551 | unclear | (4445) |
0.804978094441 | nitpicky | (125) |
0.804038235737 | us-centric | (170) |
0.802013089227 | dodgy | (797) |
0.798211347285 | salvagable | (118) |
0.797837578971 | shaky | (934) |
0.797040400162 | counter-intuitive | (185) |
0.79600914212 | ambigious | (41) |
0.791063976199 | offputting | (33) |
0.789877149245 | questionable | (6541) |
0.789543971526 | notible | (78) |
0.786470266284 | unconvincing | (567) |
0.781320751203 | wrong | (12041) |
0.779762440213 | clunky | (372) |
Variation: 0.1 – 43× the query count
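A more direct test of the frequency hypothesis (which I haven’t run on the real embeddings; the data below is synthetic, with frequency deliberately leaked into one dimension) would be to fit a linear probe predicting log frequency from the embedding and check how much variance it explains:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 50
logfreq = rng.uniform(1, 10, size=n)   # synthetic log frequencies
emb = rng.normal(size=(n, d))          # synthetic embeddings
emb[:, 0] += 0.5 * logfreq             # leak frequency into dimension 0

# Least-squares linear probe: emb @ beta ~ logfreq (with a bias column).
X = np.hstack([emb, np.ones((n, 1))])
beta, *_ = np.linalg.lstsq(X, logfreq, rcond=None)
pred = X @ beta

# R^2 of the probe: high R^2 means frequency is linearly readable
# from the embedding.
r2 = 1 - np.sum((logfreq - pred) ** 2) / np.sum((logfreq - logfreq.mean()) ** 2)
```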