novembre 2014

Count-Min-Log : Strange effect

Posted on 2014/11/30

For those who followed the previous steps, I was mentionning in my previous post that there was a very odd behaviour with Count-Min-Log with MAX sampling when the number of layers in the sketch was high.

This is even more obvious when looking at relative errors by quantile with a fixed width (I draw it with a width of 64 bits in 8 8-bits counters) because it’s where the effect is the most visible) :

Mean Relative Error per Quantile (lowest frequency on the left) with fixed width

Relative Error curves shows a down bump starting to be visible on the 18th quantile, with a sketch height of 9 and then move to quantiles for lower frequency.

For now I only suspect this phenomenon appears with MAX sampling, but I need to rerun the tests without MAX sampling to be sure. From there, two solutions : either the MAX sampling is completely bogus, or the idea of sampling with another value is good, but it needs to be modified to correctly estimate the overrepresentation due to the position of an element in the distribution.

On a side note, it’s clear to me that it shouldn’t be too difficult to adapt MAX sampling to classical linear counting. So I’m probably going to give it a try in the next couple of weeks.

And I’m still looking for help to compute the bounds of the precision and the probability. I don’t have a clue where to start. Also, I could use some help to find the right way to compute the overestimation factor from the distribution of the sketch values for a given element.

Count-Min-Log : Relative effects of Sketch width and height

Posted on 2014/11/28

I’ve had a particularly exciting and busy week, but I did manage to find some time to pursue the empirical evaluation of the idea of using log approximate counting to improve performance of the Count-Min Sketch on the Relative Error, particulary on the low frequency items.

Last week after writing my previous post I wrote an email to Ted Dunning, asking him for some advice and feedback about my idea. He very kindly replied, and advised me to perform some verification on the behaviour of my algorithm along the axis of the width and height of the sketch.

He also politely suggested that my graph were really crappy, thanks to my inappropriate use of MS Excel to produce them. I apologize to everyone I may have hurt by doing such an awful thing, but I was in a hury and so excited that I had to do with what was available to me at the time. This week I learned enough R and ggplot to present you some fantastically interesting data in an eye-pleasing.

A few words about the setting : this time I’ve used a super tiny Zipf(1.1) synthetic corpus : vocabulary = 1024 , words = 1048576 .

Indeed, I’ve switched from a handmade Zipf sampler to the Apache Commons ZipfDistribution to eliminate a part of the possible glitches in the data, and this distribution is much slower than my implementation (but it’s probably way more accurate)

I’ve tested extensively with sketch widths ranging from 2 to 65536, and heights from 1 to 32.

Mean Relative Error of Count-Min vs. Count-Min-Log

The first obvious observation is that Count-Min-Log is only interesting under high pressure (i.e.) the total size of the sketch is much smaller that the ideal perfect counts storage size that one would need to count the frequency of the 1024 terms occuring in the corpus (supposing you have a mapping between terms and an index).

The maximum gain is apparently obtained when the size of the sketch is 32 times smaller than the vocabulary size, and the height is clearly a very important factor in Count-Min-Log. As a consequence, if it is right to project these results to bigger settings (I don’t know if it’s right), you would get a Mean Relative Error about 350 times smaller using a Count-Min-Log Sketch of 125MB to count 1 billion elements (for instance trigrams), than using a Count-Min of the same size. To obtain the same result with a Count-Min, you would have to use a 4GB sketch.

Mean Relative Error : quantile with lowest frequency items

Focusing on the first quantile (the vocabulary with the least occurrences) show a very interesting picture of the behaviour of the Count-Min-Log. The effect of the MAX sampling that underestimates the increases in the count to compensate for the collisions seems to be maximum with a given width/height ratio, which seems to be very low.

MAX sampling plays an important role in the gain seen using Count-Min-Log. Without it, the gain would be probably about 5 or 10 in the best situations. The idea of MAX sampling comes from the fact that the impact of hash collision is much more important for low frequency items than for high frequency items, by using the value of the cell with the maximum value when flipping the coin to decide whether to increment the Log counter. If an item has a low frequency, then it is very likely that the maximum value is much bigger than the min value, whereas for high frequency item, the min and the max should be very close. With MAX sampling, the increment is skipped more often, wich underestimates the count for items which would be overestimated.

So it can be understood that having more layers in the sketch is a good thing because it helps to estimate the position of the item in the distribution. However, it is maybe only true if the collisions are numerous enough.

Next post will be a test of more heights and more fine grained widths, let’s say 2,3,4…100 ?

Count-Min-Log : some graphs

Posted on 2014/11/22

I’ve made a few experiments to have a look at the behaviour of the Count-Min-Log Sketch with Max sampling. Here are some interesting comparisons between classical Count-Min (the graphs show both the usual CM algorithm and the Conservative Update version), and Count-Min-Log.

For the sake of simplicity, I’ve chosen to limit to one setting in terms of memory use : in every sketch we use the same number of bits (here, 4 vectors of 4000*32 bits). I have also just shown the result of 8 bits Count-Min-Log (i.e. with base 1.095) with or without the progressive base (PRG).

Let’s start by the results under heavy pressure. Here we have 8 millions events in an initial vocabulary of 640,000 elements. The values are plotted for each decile of the vocabulary (i.e. on the left is the first 10th of the vocabulary with the lowest number of occurrences).

Average Relative Error per decile for high pressure setting.

Root mean square error per decile for high pressure setting.

Now for the low pressure setting, still with 4 vectors of 4000*32bits for everyone, but this time the initial vocabulary is only 10000 and the number of events is about 100k.

Average Relative Error per decile for low pressure setting.

Root mean square error per decile for low pressure setting.

Conclusion, for equal memory footprint, Count-Min-Log seems to clearly outperform classical Count-Min for high pressure settings, both for RMSE and ARE, with the exception of the RMSE for the top 10% highest frequency events. Under low pressure, it is clear that top 20-30% highest frequency items counts are significantly degraded.

I think that for any NLP-related task, Count-Min-Log is always a clear winner. For other applications, where the highest frequency are the most interesting, the gain may be negative, given the results on the high pressure setting. It will depend whether you care about ARE or RMSE.

Count-Min-Log Sketch for improved Average Relative Error on low frequency events

Posted on 2014/11/20

Count-Min sketch is a great algorithm, but it has a tendency to overestimate low frequency items when dealing with highly skewed data, at least it’s the case on zipfian data. Amit Goyal had some nice ideas to work around this, but I’m not that fund of the whole scheme he sets up to reduce the frequency of the rarest cells.

I’m currently thinking about a whole new and radical way to deal with text mining, and I’ve been hitting repeatedly a wall while trying to figure out how I could be able to count the occurrences of a very very high number of different elements. The point is, what I’m interested in is abolutely not to count accurately the number of occurrences of high frequency items, but to get the order of magnitude.

I’ve come to realize that, using Count-Min (or, for that matter, Count-Min with conservative update), I obtain an almost exact count for high frequency items, but a completely off value when looking at low frequency (or zero-frequency) items.

To deal with that, I’ve realized that would should try to weight our available bits. In Count-Min, counting from 0 to 1 is valued as much as counting from 2147483647 to 2147483648, while, frankly, no one cares whether it’s 2147483647 or just 2000000000, right ?

So I’ve introduced an interesting variation to the standard Count-Min Sketch, a variation which could, I think, be of high value for many uses, including mine. The idea is very simple, instead of using 32bits counters to count from 0 to 4294967295, we can use 8 bits counters to count from 0 to (approximately) 4294967295. This way, we can put 4 times more counters with the same number of bits, the risk of collisions is four times less, and thus, low frequency items are much less overestimated.

If you’re wondering how you can use a 8 bit counter to count from 0 to (approximately) 4294967295, just read this. It’s a log count using random sampling to increment the counter.

But it’s not all, first we can use the same very clever improvement to the standard Count-Min sketch which is called conservative update.

And additionaly, we can add a sample bias to the log-increment rule, which basically state that, if we’re at the bottom of the distribution, then probably we are going to be overestimated, then we should undersample.

Finally, we can have a progressive log base (for instance when the raw counter value is 1, the next value will be 1+1^1.1, but when the raw counter is 2, the next value will be 1+1^1.1+1^1.2). Zipfian distribution is not a power law, so why not go one step further..

The results are quite interesting. Obviously the absolute error is way worst than the absolute error of a standard Count-Min. The Average Relative Error, however, is about 10 times less with different settings.

Here is the code :

import scala.util.Random
import scala.util.hashing.MurmurHash3

/**
 * Created by guillaume on 10/29/14.
 */
class CountMinLogSketch(w:Int,k:Int,conservative:Boolean, exp:Double, maxsample:Boolean, progressive:Boolean, nBits:Int) {
  val store = Array.fill(k)(Array.fill(w)(0.toChar))
  var totalCount = 0.0
  val cMax = math.pow(2.0,nBits.toDouble) - 1.0

  def value(c:Int, exp:Double):Double = {
    // For exp = 2, 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 4,
    if (c == 0)
      0.0
    else
      math.pow(exp,c.toDouble-1.0)
  }

  def fullValue(c:Int, exp: Double):Double = {
    // For exp = 2, 0 -> 0, 1 -> 1, 2 -> 3, 3 -> 7
    if (c <= 1)
      value(c, exp)
    else
      (1.0 - value(c+1, exp)) / (1.0 - exp)
  }

  def randomLog(c:Int, exp:Double) : Boolean = {
    // Return true with probability 1/(exp^(c+1))
    // For exp = 2, 0 -> 1, 1 -> 1/2, 2 -> 1/4
    val pIncrease = 1.0/(fullValue(c+1, getExp(c+1))-fullValue(c, getExp(c)))
    Random.nextDouble() < pIncrease
  }

  def getExp(c:Int) : Double = {
    if (progressive)
      1.0 + ((exp - 1.0)*(c.toDouble - 1.0) / cMax)
    else
      exp
  }

  def getCount(s:String):Double = {
    val cl = (0 until k).map { i =>
      store(i)(hash(s, i, w))
    }
    val c = cl.min
    fullValue(c,getExp(c))
  }

  def getProbability(s:String):Double = {
    val v = getCount(s)
    if (v > 0)
      getCount(s) / totalCount
    else
      0
  }

  def increaseCount(s:String):(Boolean,Double) = {
    totalCount += 1
    val v = (0 until k).map { i =>
      store(i)(hash(s, i, w))
    }
    val vmin = v.min
    val c = if (maxsample) v.max else vmin
    assert(c <= cMax)
    val increase = randomLog(c, 0.0)
    if (increase) {
      (0 until k).map { i =>
        val nc = v(i)
        if (!conservative || (vmin == nc)) {
          store(i)(hash(s, i, w)) = (nc + 1).toChar
        }
      }
      (increase, fullValue(vmin + 1, getExp(vmin + 1)))
    } else (false,fullValue(vmin, getExp(vmin)))
  }
}

import scala.util.Random

import scala.util.hashing.MurmurHash3

/**

* Created by guillaume on 10/29/14.

class CountMinLogSketch(w:Int,k:Int,conservative:Boolean, exp:Double, maxsample:Boolean, progressive:Boolean, nBits:Int) {

val store = Array.fill(k)(Array.fill(w)(0.toChar))

var totalCount = 0.0

val cMax = math.pow(2.0,nBits.toDouble) - 1.0

def value(c:Int, exp:Double):Double = {

// For exp = 2, 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 4,

if (c == 0)

0.0

else

math.pow(exp,c.toDouble-1.0)

}

def fullValue(c:Int, exp: Double):Double = {

// For exp = 2, 0 -> 0, 1 -> 1, 2 -> 3, 3 -> 7

if (c <= 1)

value(c, exp)

else

(1.0 - value(c+1, exp)) / (1.0 - exp)

}

def randomLog(c:Int, exp:Double) : Boolean = {

// Return true with probability 1/(exp^(c+1))

// For exp = 2, 0 -> 1, 1 -> 1/2, 2 -> 1/4

val pIncrease = 1.0/(fullValue(c+1, getExp(c+1))-fullValue(c, getExp(c)))

Random.nextDouble() < pIncrease

}

def getExp(c:Int) : Double = {

if (progressive)

1.0 + ((exp - 1.0)*(c.toDouble - 1.0) / cMax)

else

exp

}

def getCount(s:String):Double = {

val cl = (0 until k).map { i =>

store(i)(hash(s, i, w))

}

val c = cl.min

fullValue(c,getExp(c))

}

def getProbability(s:String):Double = {

val v = getCount(s)

if (v > 0)

getCount(s) / totalCount

else

}

def increaseCount(s:String):(Boolean,Double) = {

totalCount += 1

val v = (0 until k).map { i =>

store(i)(hash(s, i, w))

}

val vmin = v.min

val c = if (maxsample) v.max else vmin

assert(c <= cMax)

val increase = randomLog(c, 0.0)

if (increase) {

(0 until k).map { i =>

val nc = v(i)

if (!conservative || (vmin == nc)) {

store(i)(hash(s, i, w)) = (nc + 1).toChar

}

(increase, fullValue(vmin + 1, getExp(vmin + 1)))

} else (false,fullValue(vmin, getExp(vmin)))

}

And the results (1.1 million events, 60K elements, zipfian 1.01, the sketches are 8 vectors of 5000*32 bits)


Sketch	Conservative	Max Sample	Progressive	Base	RMSE	ARE
Count-Min	NO	-	-	-	0.160	11.930
Count-Min	YES	-	-	-	0.072	6.351
Count-MinLog	YES	YES	NO	1.045 (9 bits)	0.252	0.463
Count-MinLog	YES	YES	YES	1.045 (9 bits)	0.437	0.429
Count-MinLog	YES	YES	NO	1.095 (8 bits)	0.449	0.524
Count-MinLog	YES	YES	YES	1.095 (8 bits)	0.245	0.372
Count-MinLog	YES	YES	NO	1.19 (7 bits)	0.582	0.616
Count-MinLog	YES	YES	YES	1.19 (7 bits)	0.677	0.405

Mostly linguistically computational

Adventure in collaborative filtering, information retrieval, matrix factorization and other stuff

Month: novembre 2014

Count-Min-Log : Strange effect

Count-Min-Log : Relative effects of Sketch width and height

Count-Min-Log : some graphs

Count-Min-Log Sketch for improved Average Relative Error on low frequency events