Amazing results of NCISC on Netflix

Last week I played around a bit with the Netflix Prize dataset and NC-ISC. The results are amazingly better than the current state of the art.

Short version

The Netflix dataset contains ratings of movies by users. We try to predict unseen good ratings: a few ratings are kept aside in a separate probe dataset, and we check whether we can predict that a given user would have rated those particular movies.

I've compared state-of-the-art results with my method, NC-ISC. With the best settings, I can predict 24% of the unseen ratings in the first 50 results, compared to 19% for the current best methods. Considering that a non-personalized prediction, based on popularity alone, scores 10%, this is a 55% improvement over the current state of the art in lift above the popularity baseline, since (24-10)/(19-10) ≈ 1.55.

This is a huge gain.

And it could be applied to other recommendation problems: music, jobs, social networks, products in e-commerce and social shopping, etc.

A bit of context

The Netflix Prize was an open competition that concluded in 2009. The dataset consists of 100M ratings (1-5) given by 480K+ users on 17K+ movies. The original goal was to predict “unseen” ratings with the highest possible accuracy, the measure being the RMSE (root mean square error) on the hidden ratings.

[Figure: Netflix recommendations]

The RMSE, however, is a very incomplete evaluation measure (and it cannot be estimated on the output of NC-ISC, which is one of the reasons we hadn't spent much time comparing our results with the state of the art). Indeed, the usual task of a recommender system is not to predict the rating you would give a given movie, but to provide you with a list of K movies you could potentially like.

As a result, more and more effort has been put into the so-called top-K recommendation task, and, as a side effect, into learning to rank instead of learning to rate.
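Concretely, the top-K measure used in the comparisons below is a recall: for each user, rank every movie not seen in the training set, and count how many of the held-out ratings land in the first K results. Here is a minimal sketch of that evaluation; the function name and data layout are my own illustration, not taken from any paper:

```python
import numpy as np

def recall_at_k(scores, train_mask, probe, k=50):
    """scores: (n_users, n_movies) predicted scores;
    train_mask: boolean (n_users, n_movies), True where the rating is in train;
    probe: dict mapping user id -> set of held-out movie ids."""
    hits = total = 0
    for user, held_out in probe.items():
        s = scores[user].copy()
        s[train_mask[user]] = -np.inf  # never recommend already-rated movies
        top_k = set(np.argpartition(-s, k)[:k].tolist())
        hits += len(held_out & top_k)
        total += len(held_out)
    return hits / total  # e.g. ~0.24 for NC-ISC at K=50 on Netflix
```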

I recently stumbled upon an interesting paper, Alternating Least Squares for Personalized Ranking by Gábor Takács and Domonkos Tikk, from RecSys 2012 (which has been covered here), and was happy to find a paper with results I could compare to (often, either the metric or the dataset is at odds with what I can produce or access).

The experiment

So I ran NC-ISC on the same dataset they used in the paper (keeping only the top ratings), which reduces the training set to 22M ratings and the probe set to 400K ratings. At first the results were very disappointing. The main reason is that NC-ISC inherently discards any popularity bias, and ranking movies is a very popularity-biased problem (below, we can see that 15,000 movies have very few ratings in the test set, while a few movies have 12,000 ratings):

[Figure: distribution of ratings per movie in the Netflix test set]

However, while NC-ISC cannot natively perform rating prediction, and discards popularity in both its learning and its ranking, it is quite easy to put popularity back into both. Putting it back into the ranking simply means multiplying each movie's NC-ISC score by the movie's popularity. Putting it back into the learning means decreasing the unbiasing factor in NC-ISC.
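In the ranking, that re-biasing is just an elementwise reweighting. A minimal sketch of the idea; the exponent alpha is my own illustrative knob for dosing the bias, not an official NC-ISC parameter:

```python
import numpy as np

def rebias_scores(scores, ratings_count, alpha=1.0):
    """scores: (n_users, n_movies) popularity-free NC-ISC scores;
    ratings_count: (n_movies,) number of training ratings per movie."""
    popularity = ratings_count / ratings_count.sum()
    # alpha = 0 keeps the pure NC-ISC ranking; alpha = 1 fully re-biases it
    return scores * popularity ** alpha
```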

The results

Here we go (I've added my results as an overlay on the original paper's figure). Just a disclaimer: while I've reproduced the results of the popularity baseline, I didn't have time to check the ImplicitALS results (my Octave implementation takes hours to run), so I'm not 100% confident I haven't missed something in the reproduction of the experiment.

[Figure: Recall@50 on Netflix vs. number of features, NC-ISC overlaid on the original figure]

I've added a mark at 100 features, which wasn't in the original paper, probably because their results tended to stop improving.

Performance

A short note about the performance of NCISC versus ImplicitALS, RankALS and RankSGD. First, while SGD takes on average 30 iterations to converge, ALS and NCISC take between 5 and 10.

Regarding iteration time, Takács and Tikk give their timings on the Y!Music data, while I could only work on the top-ratings Netflix dataset. However, the two datasets have approximately the same number of ratings (22M) and approximately the same total number of elements (17,700 + 480,189 = 497,889 for Netflix; 249,012 + 296,111 = 545,123 for Y!Music).

Finally, I ran my unoptimized algorithm on 1 core of a Core i7 980 (CPU Mark score 8823 for 6 cores, about 1471 per core), while they used 1 core of a Xeon E5405 (2894 for 4 cores, about 724 per core) to run their unoptimized RankALS and ImplicitALS and optimized RankSGD algorithms. The per-core rescaling factor to account for the processor difference is thus almost exactly 2 (!)

Here are my rescaled timings alongside the original timings of the other methods (remember, mine is not an optimized version):

method        #5   #10   #20   #50   #100
NC-ISC         1     2     4    10     25
ImplicitALS   82   102   152   428   1404
RankALS      167   190   243   482   1181
RankSGD       21    25    30    42     65
RankSGD2      26    32    40    61    102

Time (seconds) per iteration for various numbers of features.

Speedups range from 3x to 167x per iteration (for SGD, you should also account for the larger number of iterations required). I've generally observed that my optimized implementations of NC-ISC run 10 to 20 times faster than my Matlab implementation.

eXenSa / Pivot / eXenGine

I just realized that I haven't posted a single word about the fate of our SalesAdviser product (SaaS recommendations for e-commerce). I had planned to publish a video of me explaining why we failed at it, but I won't have time to devote to that for quite a while. So let's write it here.

We decided to stop the SalesAdviser product in July 2014 because it never took off commercially. The main reasons for this failure are:

  • Too much focus on the technology for a business that wasn't that technological: NC-ISC is great stuff, no doubt about that, but even if you have the best engine in the world, maybe you shouldn't try to sell cars to elderly people. They don't care about the technology, they don't understand it, and they won't use it to its full potential.
  • No beta-test customer, and no customer in our direct network. We had a few indirect connections with e-commerce people, but nobody directly involved in using and testing our service to help us drive it in the right direction.
  • A business model that seemed to hurt our customers' feelings: one of the great ideas (we thought) behind SalesAdviser was to be paid a percentage of the sales made through the recommendations. And it turns out that “percentage of the sales” is a repellent for people in e-commerce.

Whatever the real reasons, we didn't make it with this project. So we decided to pivot toward an “engine maker” business model, and we started by doing some consulting and writing a complete data processing workflow with NCISC at its core.

It took us from mid-October to mid-February to reach beta status for eXenGine and deploy a demo with the Wikipedia analysis.

Now we have our first customers, and we are looking forward to new ones who want to use a better engine than all their competitors.

New blog

Dotclear has served me well for a few years, but this last month I have been unable to log into the admin section, which is a bit of a problem…

So I’ve moved to the great WordPress, and I’m quite happy with that.

More features, more results on Wikipedia

While we are working to make eXenGine more efficient, more expressive and, above all, more sellable, I keep playing with the results of NCISC on Wikipedia. Here is an example of the neighbours of GenSim (a well-known Python toolkit for text mining / natural language processing by Radim Řehůřek), this time on the document/word model with 90 features:

  • OpenNLP : 0.9343325
  • Semantically-Interlinked Online Communities : 0.92996
  • GML Application Schemas : 0.9280249
  • XML Metadata Interchange : 0.926392
  • Languageware : 0.924556
  • Visual modeling : 0.92413026
  • GeoSPARQL : 0.9231725
  • Clairlib : 0.92301136
  • Heurist : 0.92240715
  • VisTrails : 0.9223076
  • Embedded RDF : 0.92183596
  • NetworkX : 0.9217739
  • UIMA : 0.9217461
  • Software Ideas Modeler : 0.92173994
  • List of Unified Modeling Language tools : 0.9215104
  • UModel : 0.92115307
  • SmartQVT : 0.9210423
  • Linked data : 0.9207388
  • Natural Language Toolkit : 0.92064124

It's far from perfect, one major cause being that we still don't use bigrams or skip-grams in the input transformation (i.e. we use plain old-style bag-of-words), but it clearly shows the power of NCISC.
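For reference, a neighbour list like the one above can be read off any embedding with a plain cosine-similarity scan. A minimal sketch, under my assumption that NCISC yields one vector per document and that the scores shown are cosine similarities:

```python
import numpy as np

def neighbours(doc_vectors, query, top_n=20):
    """doc_vectors: (n_docs, n_features); query: row index of the query document."""
    v = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = v @ v[query]    # cosine similarity to every document
    sims[query] = -np.inf  # exclude the query document itself
    best = np.argsort(-sims)[:top_n]
    return [(i, sims[i]) for i in best]
```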

You can compare with the results provided by LSA using GenSim itself: in this post, a comment gives the top 10 results of a 500-feature model trained on Wikipedia:

  • Snippet (programming)
  • Prettyprint
  • Smalltalk
  • Plagiarism detection
  • D (programming language)
  • Falcon (programming language)
  • Intelligent code completion
  • OCaml
  • Explicit semantic analysis

When using the document/document interlink model (90 features), we obtain different, but also very good, results:

  • Jubatus : 0.98519164
  • Scikit-learn : 0.98126435
  • Feature Selection Toolbox : 0.97841007
  • Structure mining : 0.97700834
  • ADaMSoft : 0.9755426
  • Mallet (software project) : 0.97395116
  • Shogun (toolbox) : 0.9718502
  • CRM114 (program) : 0.96751755
  • Weka (machine learning) : 0.96635485
  • Clairlib : 0.9659551
  • Document retrieval : 0.96506816
  • Oracle Data Mining : 0.96350515
  • Approximate string matching : 0.96212476
  • Bayesian spam filtering : 0.96208096
  • Dlib : 0.9620419
  • GSP Algorithm : 0.96116054
  • Discounted cumulative gain : 0.9606682
  • ELKI : 0.9604578
  • NeuroSolutions : 0.96015286
  • Waffles (machine learning) : 0.9597046
  • Information extraction : 0.9588822
  • Latent semantic mapping : 0.95838064
  • ScaLAPACK : 0.9563968
  • Learning Based Java : 0.95608145
  • Relevance feedback : 0.9559279
  • Web search query : 0.9558407
  • Grapheur : 0.9556832
  • LIBSVM : 0.95526296
  • Entity linking : 0.95325243

These are much better than the 40-feature version displayed on our demo site:

  • Java GUI for R : 0.995283
  • Decision table : 0.99500793
  • Programming domain : 0.9946934
  • Scikit-learn : 0.994619
  • Feature model : 0.9941954
  • Dlib : 0.9938446
  • Pattern directed invocation programming language : 0.9937961
  • Compiler correctness : 0.9937197
  • On the Cruelty of Really Teaching Computer Science : 0.99280906
  • Structure mining : 0.99274904
  • Hidden algebra : 0.99265593
  • OMDoc : 0.9925574
  • Institute for System Programming : 0.99253553
  • Radial tree : 0.9924036
  • Partial evaluation : 0.9922968
  • Jubatus : 0.99167436
  • Query optimization : 0.9916388
  • Effect system : 0.99141824


Fun (and relevant) results on Wikipedia for multi-document queries

We are setting things up to improve on our prototype demo site, http://demowiki.exensa.net, and we're looking for ideas.

While playing around with our system, and wondering how to add a fun activity (like on Wikidistrict) to the website, I tried to see what would come out of a multi-document query, and the results are very nice and fun:
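I won't claim this is how the engine combines pages internally, but the simplest plausible reading of a multi-document query is to average the normalized page vectors and search neighbours of the result; a hedged sketch:

```python
import numpy as np

def multi_doc_query(doc_vectors, query_ids, top_n=30):
    """query_ids: row indices of the query pages, e.g. 'Debian' and 'Religion'."""
    v = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = v[query_ids].mean(axis=0)  # combine the query pages
    q /= np.linalg.norm(q)
    sims = v @ q
    sims[query_ids] = -np.inf      # exclude the query pages themselves
    return np.argsort(-sims)[:top_n]
```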

Debian + Religion neighbours (70-feature documents/words model):

  • Ten Commandments of Computer Ethics : 0.8062684
  • Internet linguistics : 0.79933107
  • Translation memory : 0.7954829
  • Free software movement : 0.7887893
  • Machine translation : 0.7826216
  • E-Sword : 0.77988267
  • Technical translation : 0.7798575
  • New English Translation : 0.7767798
  • Posting style : 0.77614117
  • Digital artifactual value : 0.77253044
  • Multimodality : 0.7708764
  • Computer ethics : 0.77002776
  • The user-subjective approach : 0.77001673
  • Virtual community : 0.76979464
  • Computer-assisted translation : 0.76953083
  • Sharing : 0.76945585
  • Unity (user interface) : 0.7686337
  • Text annotation : 0.7677352
  • Separation of presentation and content : 0.7676085
  • Internet forum : 0.7660768
  • Comment (computer programming) : 0.7638492
  • List of email subject abbreviations : 0.7634484
  • Alternative terms for free software : 0.76130253
  • Single source publishing : 0.7607195
  • Open-source religion : 0.76062995
  • Hacker (term) : 0.76004744
  • Hacker ethic : 0.75634766
  • Internet-related prefixes : 0.7556772
  • New media : 0.755563
  • The Word Bible Software : 0.75490427

We're going to introduce this fun activity on the website in the next few weeks, together with a search engine.


NCISC as a search engine: now it works too!

We've had mixed results so far when using the output of our algorithm for search-engine-like applications, i.e. finding the documents relevant to a given term. The symmetric neighbourhoods (documents similar to a document, or words similar to a word) have always shown great results, but the results in the asymmetric case were much less relevant.
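For context, both query types reduce to dot products if words and documents live in the same latent space; here is a minimal sketch under that assumption, and only of the query itself, not of our actual fix. How each side is normalized is exactly the kind of detail that can ruin the asymmetric case:

```python
import numpy as np

def docs_for_word(doc_vectors, word_vectors, word_idx, top_n=30):
    """Asymmetric query: rank documents for a single word, e.g. 'turing'."""
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    w = word_vectors[word_idx] / np.linalg.norm(word_vectors[word_idx])
    return np.argsort(-(d @ w))[:top_n]  # the transposed query ranks words for a document
```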

We have finally found the problem, and here are some results from an analysis of the English Wikipedia (70 features, no extra information added):

Wikipedia pages relevant to the word “turing” (titles are completely ignored, as is page popularity):

  • Hypercomputation
  • Solomonoff’s theory of inductive inference
  • Unbounded nondeterminism
  • Super-recursive algorithm
  • Theory of computation
  • Automated theorem proving
  • List of important publications in theoretical computer science
  • Turing machine
  • Turing completeness
  • Sheila Greibach
  • Church–Turing thesis
  • Algorithmic information theory
  • Universal Turing machine
  • Wolfram’s 2-state 3-symbol Turing machine
  • Digital physics
  • Logical framework
  • Fuzzy logic
  • Denotational semantics
  • List of machine learning algorithms
  • Parallel computation thesis
  • Computational geometry
  • Algorithm
  • Logic in computer science
  • Operational semantics
  • Satisfiability Modulo Theories
  • Automated reasoning
  • Information theory
  • Denotational semantics of the Actor model
  • Oracle machine
  • Indeterminacy in concurrent computation
  • Baum–Welch algorithm
  • List of books in computational geometry
  • John V. Tucker
  • List of PSPACE-complete problems
  • Power domains
  • List of computability and complexity topics

Similarly, getting the most relevant words for a given page, “Turing machine”, gives pretty good results:

  • turing
  • recursive
  • arithmetic
  • boolean
  • mappings
  • recursion
  • deterministic
  • automata
  • algorithmically
  • algorithm
  • melzak
  • iterating
  • computes
  • compute
  • leiserson
  • provably
  • iterate
  • cormen
  • computations
  • subproblems
  • definable
  • embedding
  • automaton
  • subtraction
  • satisfiability
  • undecidable
  • iteratively
  • constraint
  • deterministically
  • tuples
  • tuple
  • associative
  • pseudocode
  • pseudorandom
  • reachability
  • completeness
  • recursively
  • iterative
  • unary
  • logic


Is our universe a simulation?

Slashdot features an article today about a rebound of the argument over the likelihood that we live in a computer simulation. It made me react, because I remember having used the very opposite argument to show that our universe being ruled by a god is less likely than our universe not being ruled by one (in my definition, a god is very close to the “we live in a computer simulation” idea).

To keep it short, Bostrom argues that at least one of these propositions is true:

  1. the human species is very likely to go extinct before reaching a “posthuman” stage;
  2. any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof);
  3. we are almost certainly living in a computer simulation.

He then reaches the conclusion that we are more likely to be living in a subworld simulation of the evolutionary history of a species, one that probably emerged from a very similar environment, than to be living in the real thing.

I remember pondering the probability of our world being bootstrapped by a superpower vs. not being bootstrapped. Basically, for an intelligent species to emerge from nothing, you have to wait a long time, and the probability is low: let's say N after a given time. But then, if the universe is not restricted to a single environment, you multiply your chances by the number of environments E produced by the universe (if our visible universe is a good image of a real universe, this number is quite huge).

Then you have to take into account the probability that such an intelligent species can transition to a species able to run simulations of the universe big enough for the simulation to be perceived as the real thing by its inhabitants. And also the probability that the energy required to run such a simulation is available, and that the simulation is interesting enough to be run in a way compatible with what we can observe.

To answer this, we should think seriously about what could be interesting enough for us to spend a large amount of computing power on running a simulation of our world: predicting the future? having infinite TV programs of primitives doing primitive things (remember, we would be in a post-human state by then)? providing a placeholder for post-human brains to play in?

I don't know how much processing power is currently spent on video games vs. universe simulation vs. small-scale life simulation vs. spying on people, but my guess is that if we reach a post-human state, people will crave fun, because everything that could be understood will have been understood, and predicting the future in a post-human era would require simulating a post-human society. So either we are a simulation for a TV network, or we are in a video game. But I don't think a post-human civilization would want to spend energy simulating many human worlds, unless it's for the fun of playing god in a society of the past.

Now, factor all that in: E·N is the number of human-level intelligent societies in a big universe after a given time. Multiply that by the probability P that such a society can transition to a post-human one with enough computing power and energy to run simulations. Then multiply by the number S of simulations such a post-human society would run that have the characteristics of a subpart of the universe without any visible intervention from almighty actors. The question is thus: is E·N·P·S > E·N, i.e. is P·S > 1? I'm not so sure.

Actually, my personal conclusion is: since we have not observed god-like people making fun of us, I would bet that we're living in a non-simulated world. Because seriously, if you could play god in a simulation of 20th-century Earth, wouldn't you? Or would you rather just watch primitives live their boring lives?

Talk at Paris Machine Learning Meetup

I'll give a talk at the fantastic Meetup organized by Igor Carron and Franck Bardol (link to the meetup here) about NCISC and the demo we are setting up on Wikipedia data.

It's tomorrow, Wednesday the 12th, in Paris.

I'll present NCISC, with some results on a few standard datasets (Reuters, 20 Newsgroups, OHSUMED) as well as a comparison with Deep Learning-inspired methods. I'll also present the pipeline we've set up for analyzing Wikipedia from the text, link and category perspectives.

There's a demo (with relatively old and buggy data) here: http://demowiki.exensa.net/

The talk will also be live-streamed via Google+ Hangout, and the slides will be available here.

Spark + MLLib

[Image: Apache Spark logo]

I'm still digging into Spark and MLLib, but I can already say that this combination has some serious strong points:

  • It can easily replace Pig, with the added benefit that you don't have to mix several technologies for your UDFs
  • It's really efficient
  • It's made for iterative methods (which is the case for NC-ISC) and definitely better suited for ML than pure Hadoop/MRv1 (I don't know whether YARN performs better, but I don't really think so); see the sketch after this list
  • Scala is a great language
  • Spark is at least as scalable as Hadoop, and can use several cluster management systems: Hadoop YARN, EC2, Mesos, and Spark's own standalone manager
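
As promised above, here is a minimal PySpark sketch of the iterative pattern (the file path, input format and model size are illustrative): the dataset is cached in memory once and re-scanned at every pass of a gradient loop, which is the access pattern of NC-ISC-style training.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each input line: feature values followed by a -1/+1 label.
# The RDD is cached once and re-scanned at every iteration; plain
# MapReduce would re-read it from disk on each pass.
data = sc.textFile("hdfs:///data/points.txt") \
         .map(lambda line: np.fromstring(line, sep=" ")) \
         .map(lambda row: (row[:-1], row[-1])) \
         .cache()

w = np.zeros(10)  # illustrative number of features
for _ in range(10):
    # Logistic-regression gradient, summed across the cluster
    grad = data.map(lambda p: -p[1] * p[0] / (1.0 + np.exp(p[1] * w.dot(p[0])))) \
               .reduce(lambda a, b: a + b)
    w -= 0.1 * grad
```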

My opinion is that Spark is the near-perfect framework to efficiently write efficient MapReduce jobs (see the double efficiency there?).

Also, it's funny to see that Twitter's Scalding partially shares Spark's goal (i.e. helping you write concise MapReduce jobs), except that Spark adds a much more powerful execution engine that keeps intermediate results in memory when possible. Probably not something really big datasets can afford, but at least it's good to have as an option.