Spark trick shot: executor-level accumulator

At eXenSa, we are currently working on a super fast way to preprocess large amounts of documents. It’s based on a combination of our new-generation Count-Min Sketch (not the one I’ve posted about before, but an even better version) and a new trick we devised just this weekend.

Before I explain this trick, here is how counting is usually done with standard data structures in map-reduce, and why we have chosen to try another way. Here is what usually happens:

  1. Take a huge collection of documents
  2. Split it into thousands of partitions
  3. In a map step, convert each document into a collection of counts
  4. Reduce the counts (sum by word)
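
The four steps above can be sketched in plain Python, without Spark, to make the data flow concrete. This is only an illustration of the standard approach, not eXenSa’s code: the function names, the round-robin partitioning, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(documents, num_partitions):
    """Step 2: deal the documents round-robin into partitions (assumed scheme)."""
    return [documents[i::num_partitions] for i in range(num_partitions)]


def count_words(document):
    """Step 3 (map): turn one document into a collection of word counts."""
    # Naive whitespace tokenizer, chosen only to keep the sketch short.
    return Counter(document.lower().split())


def merge_counts(left, right):
    """Step 4 (reduce): sum two collections of counts, word by word."""
    return left + right


def word_count(documents, num_partitions=4):
    partitions = split_into_partitions(documents, num_partitions)
    # Every document yields its own Counter before anything is merged.
    per_document = [count_words(doc) for part in partitions for doc in part]
    return reduce(merge_counts, per_document, Counter())


docs = ["the cat sat", "the dog sat", "the cat ran"]
totals = word_count(docs, num_partitions=2)
print(totals.most_common(3))
```

Note that each document produces its own collection of counts before the reduce step merges them, so the number of intermediate objects grows with the number of documents.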

From Earth to Heaven

Since my first attempt to compute “similarity paths” between Wikipedia pages on wikinsights.org, we’ve made some progress.

But first, a word about this “similarity path”, because a lot of people completely miss the point about it :

Usually, people use the Wikipedia graph to find paths between pages. Unfortunately, Wikipedia is so well linked that most paths are just two or three hops long. You can, for instance, play with wikidistrict.com to find the shortest path between two articles: from Earth to Heaven, wikidistrict gives a short path: Earth -> Deity -> Heaven. Of course there are other paths (a huge number of them, actually).

Using the latent models built by NCISC, however, we can rebuild the graph any way we want. You just have to decide:

  • the model (semantic model, in-link model, out-link model, in-out-link model, out-in-link model)
  • the depth of the neighbours for each node
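
As a rough sketch of the idea, assume each page has a vector in some latent model, keep only the k most similar pages as neighbours of each node, and then search for a shortest path in that rebuilt graph. Everything here is hypothetical: the page names, the 2-D toy vectors, the use of cosine similarity, and the reading of “depth” as a neighbour count k are assumptions for illustration, not a description of how NCISC works.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_neighbour_graph(vectors, k):
    """Link every page to its k most similar pages in the latent space."""
    graph = {}
    for page, vec in vectors.items():
        others = [other for other in vectors if other != page]
        others.sort(key=lambda other: cosine(vec, vectors[other]), reverse=True)
        graph[page] = others[:k]
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of pages, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = previous[page]
            return path[::-1]
        for neighbour in graph[page]:
            if neighbour not in previous:
                previous[neighbour] = page
                queue.append(neighbour)
    return None


# Toy latent vectors: five hypothetical pages spread along an arc, 20 degrees apart.
angles = {"A": 0, "B": 20, "C": 40, "D": 60, "E": 80}
vectors = {
    name: (math.cos(math.radians(deg)), math.sin(math.radians(deg)))
    for name, deg in angles.items()
}

graph = build_neighbour_graph(vectors, k=2)
path = shortest_path(graph, "A", "E")
print(" -> ".join(path))
```

Because each node keeps only a few close neighbours, there are no long-range shortcuts, so a path has to move through intermediate pages instead of jumping across in two or three hops.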

With this, we are now able to find our way in a much finer-grained, rewritten version of Wikipedia. For instance, Earth to Heaven gives:

Earth -> Moon -> Mercury (planet) -> Jupiter -> Sun -> Aurora -> Atmospheric diffraction -> Atmospheric optics -> Earth’s shadow -> Night sky -> Phases of Venus -> Counter-Earth -> History of the Center of the Universe -> Pythagorean astronomical system -> Astrological age -> History of astrology -> Zoroaster -> Asha -> Amesha Spenta -> Creator deity -> Heaven

The interesting thing here is the moment where we switch from astronomy to astrology. It happens with “Counter-Earth”, a hypothesis about another planet orbiting the sun in counter-phase with the Earth.

What I show here is not in Wikinsights yet; you’ll have to wait a bit before it goes live, but don’t worry, it won’t be long. We have a few novelties to add to our Wikinsights demo, with a set of improved models.

Some other paths:

  • Harry Potter to Jacques Chirac

Harry Potter -> Harry Potter and the Deathly Hallows -> Cerebus the Aardvark -> Anarky (comic book) -> Publication history of Anarky -> Anarky -> Operation Mindfuck -> Political strategy -> Political privacy -> Yellow Ribbon campaign (Fiji) -> Reaction to the 2005–06 Fijian political crisis -> Sitiveni Rabuka -> 1987 Fijian coups d’état -> Presidential Council (Benin) -> Hubert Maga -> Haitian general election, 2006 -> Jacques Chirac

  • Charleston, South Carolina to Greece

Charleston, South Carolina -> History of Charleston, South Carolina -> History of the Southern United States -> History of Georgia (U.S. state) -> History of the United States -> United Kingdom–United States relations -> Decolonization -> International relations of the Great Powers (1814–1919) -> History of Europe -> Ottoman Greece -> Background of the Greek War of Independence -> Megali Idea -> Draft:Island of Cyprus -> Cyprus -> Greece