Amazing results of NCISC on Netflix

Last week I played around a bit with the Netflix Prize dataset and NC-ISC. The results are amazingly better than the current state of the art.

Short version

The Netflix dataset contains ratings of movies by users. We try to predict unseen good ratings: a few ratings are kept aside in a separate probe dataset, and we check whether we can predict that a given user would have rated those particular movies.

I've compared state-of-the-art results with my method, NC-ISC. With the best settings, I can predict 24% of the unseen ratings in the first 50 results, compared to 19% for the current best methods. Considering that a non-personalized prediction, based on popularity alone, scores 10%, this is a 55% improvement over the current state of the art in lift above the popularity baseline, since (24-10)/(19-10) ≈ 1.55.

This is a huge gain.

And it could be applied to other recommendation problems: music, jobs, social networks, products in e-commerce and social shopping, etc.

A bit of context

The Netflix Prize was an open competition that concluded in 2009. The dataset consists of 100M ratings (1-5) given by 480K+ users on 17K+ movies. The original goal was to predict “unseen” ratings with the highest possible accuracy, the measure being the RMSE (root mean square error) on the hidden ratings.

[Figure: Netflix recommendations]

The RMSE, however, is a very incomplete evaluation measure (and it cannot be estimated on the output of NC-ISC, which is one of the reasons we hadn't spent much time comparing our results with the state of the art). Indeed, the usual task of a recommender system is not to predict the rating you would give a given movie, but to provide you with a list of K movies you could potentially like.

As a result, more and more effort has been put into the so-called top-K recommendation task, and, as a side effect, into learning to rank instead of learning to rate.
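Concretely, the top-K measure used in the comparisons below is a recall: for each user, rank every movie not seen in the training set, and count how many of the held-out ratings land in the first K results. Here is a minimal sketch of that evaluation; the function name and data layout are my own illustration, not taken from any paper:

```python
import numpy as np

def recall_at_k(scores, train_mask, probe, k=50):
    """scores: (n_users, n_movies) predicted scores;
    train_mask: boolean (n_users, n_movies), True where the rating is in train;
    probe: dict mapping user id -> set of held-out movie ids."""
    hits = total = 0
    for user, held_out in probe.items():
        s = scores[user].copy()
        s[train_mask[user]] = -np.inf  # never recommend already-rated movies
        top_k = set(np.argpartition(-s, k)[:k].tolist())
        hits += len(held_out & top_k)
        total += len(held_out)
    return hits / total  # e.g. ~0.24 for NC-ISC at K=50 on Netflix
```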

I recently stumbled upon an interesting paper, Alternating Least Squares for Personalized Ranking by Gábor Takács and Domonkos Tikk, from RecSys 2012 (which has been covered here), and was happy to find a paper with results I could compare to (often, either the metric or the dataset is at odds with what I can produce or access).

The experiment

So I ran NC-ISC on the same dataset they used in the paper (keeping only the top ratings), which reduces the training set to 22M ratings and the probe set to 400K ratings. At first the results were very disappointing. The main reason is that NC-ISC inherently discards any popularity bias, and ranking movies is a very popularity-biased problem (below, we can see that 15,000 movies have very few ratings in the test set, while a few movies have 12,000 ratings):

[Figure: distribution of ratings per movie in the Netflix test set]

However, while NC-ISC cannot natively perform rating prediction, and discards popularity in both its learning and its ranking, it is quite easy to put popularity back into both. Putting it back into the ranking simply means multiplying each movie's NC-ISC score by the movie's popularity. Putting it back into the learning means decreasing the unbiasing factor in NC-ISC.
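In the ranking, that re-biasing is just an elementwise reweighting. A minimal sketch of the idea; the exponent alpha is my own illustrative knob for dosing the bias, not an official NC-ISC parameter:

```python
import numpy as np

def rebias_scores(scores, ratings_count, alpha=1.0):
    """scores: (n_users, n_movies) popularity-free NC-ISC scores;
    ratings_count: (n_movies,) number of training ratings per movie."""
    popularity = ratings_count / ratings_count.sum()
    # alpha = 0 keeps the pure NC-ISC ranking; alpha = 1 fully re-biases it
    return scores * popularity ** alpha
```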

The results

Here we go (I've added my results as an overlay on the original paper's figure). Just a disclaimer: while I've reproduced the results of the popularity baseline, I didn't have time to check the ImplicitALS results (my Octave implementation takes hours to run), so I'm not 100% confident I haven't missed something in the reproduction of the experiment.

[Figure: Recall@50 on Netflix vs. number of features, NC-ISC overlaid on the original figure]

I've added a mark at 100 features, which wasn't in the original paper, probably because their results tended to stop improving.

Performance

A short note about the performance of NCISC versus ImplicitALS, RankALS and RankSGD. First, while SGD takes on average 30 iterations to converge, ALS and NCISC take between 5 and 10.

Regarding iteration time, Takács and Tikk give their timings on the Y!Music data, while I could only work on the top-ratings Netflix dataset. However, the two datasets have approximately the same number of ratings (22M) and approximately the same total number of elements (17,700 + 480,189 = 497,889 for Netflix; 249,012 + 296,111 = 545,123 for Y!Music).

Finally, I ran my unoptimized algorithm on 1 core of a Core i7 980 (CPU Mark score 8823 for 6 cores, about 1471 per core), while they used 1 core of a Xeon E5405 (2894 for 4 cores, about 724 per core) to run their unoptimized RankALS and ImplicitALS and optimized RankSGD algorithms. The per-core rescaling factor to account for the processor difference is thus almost exactly 2 (!)

Here are my rescaled timings alongside the original timings of the other methods (remember, mine is not an optimized version):

method        #5   #10   #20   #50   #100
NC-ISC         1     2     4    10     25
ImplicitALS   82   102   152   428   1404
RankALS      167   190   243   482   1181
RankSGD       21    25    30    42     65
RankSGD2      26    32    40    61    102

Time (seconds) per iteration for various numbers of features.

Speedups range from 3x to 167x per iteration (for SGD, you should also account for the larger number of iterations required). I've generally observed that my optimized implementations of NC-ISC run 10 to 20 times faster than my Matlab implementation.

eXenSa / Pivot / eXenGine

I just realized that I haven't posted a single word about the fate of our SalesAdviser product (SaaS recommendations for e-commerce). I had planned to publish a video of me explaining why we failed at it, but I won't have time to devote to that for quite a while. So let's write it here.

We decided to stop the SalesAdviser product in July 2014 because it never took off commercially. The main reasons for this failure are:

  • Too much focus on the technology for a business that wasn't that technological: NC-ISC is great stuff, no doubt about that, but even if you have the best engine in the world, maybe you shouldn't try to sell cars to elderly people. They don't care about the technology, they don't understand it, and they won't use it to its full potential.
  • No beta-test customer, and no customer in our direct network. We had a few indirect connections with e-commerce people, but nobody directly involved in using and testing our service to help us drive it in the right direction.
  • A business model that seemed to hurt our customers' feelings: one of the great ideas (we thought) behind SalesAdviser was to be paid a percentage of the sales made through the recommendations. And it turns out that “percentage of the sales” is a repellent for people in e-commerce.

Whatever the real reasons, we didn't make it with this project. So we decided to pivot toward an “engine maker” business model, and we started by doing some consulting and writing a complete data processing workflow with NCISC at its core.

It took us from mid-October to mid-February to reach beta status for eXenGine and deploy a demo with the Wikipedia analysis.

Now we have our first customers, and we are looking forward to new ones who want to use a better engine than all their competitors.

New blog

Dotclear has served me well for a few years, but this last month I have been unable to log into the admin section, which is a bit of a problem…

So I’ve moved to the great WordPress, and I’m quite happy with that.

More features, more results on Wikipedia

While we are working to make eXenGine more efficient, more expressive and, above all, more sellable, I keep playing with the results of NCISC on Wikipedia. Here is an example of the neighbours of GenSim (a well-known Python toolkit for text mining / natural language processing by Radim Řehůřek), this time on the document/word model with 90 features:

  • OpenNLP : 0.9343325
  • Semantically-Interlinked Online Communities : 0.92996
  • GML Application Schemas : 0.9280249
  • XML Metadata Interchange : 0.926392
  • Languageware : 0.924556
  • Visual modeling : 0.92413026
  • GeoSPARQL : 0.9231725
  • Clairlib : 0.92301136
  • Heurist : 0.92240715
  • VisTrails : 0.9223076
  • Embedded RDF : 0.92183596
  • NetworkX : 0.9217739
  • UIMA : 0.9217461
  • Software Ideas Modeler : 0.92173994
  • List of Unified Modeling Language tools : 0.9215104
  • UModel : 0.92115307
  • SmartQVT : 0.9210423
  • Linked data : 0.9207388
  • Natural Language Toolkit : 0.92064124

It's far from perfect, one major cause being that we still don't use bigrams or skip-grams in the input transformation (i.e. we use plain old-style bag-of-words), but it clearly shows the power of NCISC.
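For reference, a neighbour list like the one above can be read off any embedding with a plain cosine-similarity scan. A minimal sketch, under my assumption that NCISC yields one vector per document and that the scores shown are cosine similarities:

```python
import numpy as np

def neighbours(doc_vectors, query, top_n=20):
    """doc_vectors: (n_docs, n_features); query: row index of the query document."""
    v = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = v @ v[query]    # cosine similarity to every document
    sims[query] = -np.inf  # exclude the query document itself
    best = np.argsort(-sims)[:top_n]
    return [(i, sims[i]) for i in best]
```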

You can compare with the results provided by LSA using GenSim itself: in this post, a comment gives the top 10 results of a 500-feature model trained on Wikipedia:

  • Snippet (programming)
  • Prettyprint
  • Smalltalk
  • Plagiarism detection
  • D (programming language)
  • Falcon (programming language)
  • Intelligent code completion
  • OCaml
  • Explicit semantic analysis

When using the document/document interlink model (90 features), we obtain different, but also very good, results:

  • Jubatus : 0.98519164
  • Scikit-learn : 0.98126435
  • Feature Selection Toolbox : 0.97841007
  • Structure mining : 0.97700834
  • ADaMSoft : 0.9755426
  • Mallet (software project) : 0.97395116
  • Shogun (toolbox) : 0.9718502
  • CRM114 (program) : 0.96751755
  • Weka (machine learning) : 0.96635485
  • Clairlib : 0.9659551
  • Document retrieval : 0.96506816
  • Oracle Data Mining : 0.96350515
  • Approximate string matching : 0.96212476
  • Bayesian spam filtering : 0.96208096
  • Dlib : 0.9620419
  • GSP Algorithm : 0.96116054
  • Discounted cumulative gain : 0.9606682
  • ELKI : 0.9604578
  • NeuroSolutions : 0.96015286
  • Waffles (machine learning) : 0.9597046
  • Information extraction : 0.9588822
  • Latent semantic mapping : 0.95838064
  • ScaLAPACK : 0.9563968
  • Learning Based Java : 0.95608145
  • Relevance feedback : 0.9559279
  • Web search query : 0.9558407
  • Grapheur : 0.9556832
  • LIBSVM : 0.95526296
  • Entity linking : 0.95325243

These are much better than the 40-feature version displayed on our demo site:

  • Java GUI for R : 0.995283
  • Decision table : 0.99500793
  • Programming domain : 0.9946934
  • Scikit-learn : 0.994619
  • Feature model : 0.9941954
  • Dlib : 0.9938446
  • Pattern directed invocation programming language : 0.9937961
  • Compiler correctness : 0.9937197
  • On the Cruelty of Really Teaching Computer Science : 0.99280906
  • Structure mining : 0.99274904
  • Hidden algebra : 0.99265593
  • OMDoc : 0.9925574
  • Institute for System Programming : 0.99253553
  • Radial tree : 0.9924036
  • Partial evaluation : 0.9922968
  • Jubatus : 0.99167436
  • Query optimization : 0.9916388
  • Effect system : 0.99141824


Fun (and relevant) results on Wikipedia for multi-document queries

We are setting things up to improve on our prototype demo site, http://demowiki.exensa.net, and we're looking for ideas.

While playing around with our system, and wondering how to add a fun activity (like on Wikidistrict) to the website, I tried to see what would come out of a multi-document query, and the results are very nice and fun:
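I won't claim this is how the engine combines pages internally, but the simplest plausible reading of a multi-document query is to average the normalized page vectors and search neighbours of the result; a hedged sketch:

```python
import numpy as np

def multi_doc_query(doc_vectors, query_ids, top_n=30):
    """query_ids: row indices of the query pages, e.g. 'Debian' and 'Religion'."""
    v = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = v[query_ids].mean(axis=0)  # combine the query pages
    q /= np.linalg.norm(q)
    sims = v @ q
    sims[query_ids] = -np.inf      # exclude the query pages themselves
    return np.argsort(-sims)[:top_n]
```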

Debian + Religion neighbours (70-feature documents/words model):

  • Ten Commandments of Computer Ethics : 0.8062684
  • Internet linguistics : 0.79933107
  • Translation memory : 0.7954829
  • Free software movement : 0.7887893
  • Machine translation : 0.7826216
  • E-Sword : 0.77988267
  • Technical translation : 0.7798575
  • New English Translation : 0.7767798
  • Posting style : 0.77614117
  • Digital artifactual value : 0.77253044
  • Multimodality : 0.7708764
  • Computer ethics : 0.77002776
  • The user-subjective approach : 0.77001673
  • Virtual community : 0.76979464
  • Computer-assisted translation : 0.76953083
  • Sharing : 0.76945585
  • Unity (user interface) : 0.7686337
  • Text annotation : 0.7677352
  • Separation of presentation and content : 0.7676085
  • Internet forum : 0.7660768
  • Comment (computer programming) : 0.7638492
  • List of email subject abbreviations : 0.7634484
  • Alternative terms for free software : 0.76130253
  • Single source publishing : 0.7607195
  • Open-source religion : 0.76062995
  • Hacker (term) : 0.76004744
  • Hacker ethic : 0.75634766
  • Internet-related prefixes : 0.7556772
  • New media : 0.755563
  • The Word Bible Software : 0.75490427

We're going to introduce this fun activity on the website in the next few weeks, together with a search engine.


NCISC as a search engine: now it works too!

We've had mixed results so far when using the output of our algorithm for search-engine-like applications, i.e. finding the documents relevant to a given term. The symmetric neighbourhoods (documents similar to a document, or words similar to a word) have always shown great results, but the results in the asymmetric case were much less relevant.
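For context, both query types reduce to dot products if words and documents live in the same latent space; here is a minimal sketch under that assumption, and only of the query itself, not of our actual fix. How each side is normalized is exactly the kind of detail that can ruin the asymmetric case:

```python
import numpy as np

def docs_for_word(doc_vectors, word_vectors, word_idx, top_n=30):
    """Asymmetric query: rank documents for a single word, e.g. 'turing'."""
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    w = word_vectors[word_idx] / np.linalg.norm(word_vectors[word_idx])
    return np.argsort(-(d @ w))[:top_n]  # the transposed query ranks words for a document
```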

We have finally found the problem, and here are some results from an analysis of the English Wikipedia (70 features, no extra information added):

Wikipedia pages relevant to the word “turing” (titles are completely ignored, as is page popularity):

  • Hypercomputation
  • Solomonoff’s theory of inductive inference
  • Unbounded nondeterminism
  • Super-recursive algorithm
  • Theory of computation
  • Automated theorem proving
  • List of important publications in theoretical computer science
  • Turing machine
  • Turing completeness
  • Sheila Greibach
  • Church–Turing thesis
  • Algorithmic information theory
  • Universal Turing machine
  • Wolfram’s 2-state 3-symbol Turing machine
  • Digital physics
  • Logical framework
  • Fuzzy logic
  • Denotational semantics
  • List of machine learning algorithms
  • Parallel computation thesis
  • Computational geometry
  • Algorithm
  • Logic in computer science
  • Operational semantics
  • Satisfiability Modulo Theories
  • Automated reasoning
  • Information theory
  • Denotational semantics of the Actor model
  • Oracle machine
  • Indeterminacy in concurrent computation
  • Baum–Welch algorithm
  • List of books in computational geometry
  • John V. Tucker
  • List of PSPACE-complete problems
  • Power domains
  • List of computability and complexity topics

Similarly, getting the most relevant words for a given page, “Turing machine”, gives pretty good results:

  • turing
  • recursive
  • arithmetic
  • boolean
  • mappings
  • recursion
  • deterministic
  • automata
  • algorithmically
  • algorithm
  • melzak
  • iterating
  • computes
  • compute
  • leiserson
  • provably
  • iterate
  • cormen
  • computations
  • subproblems
  • definable
  • embedding
  • automaton
  • subtraction
  • satisfiability
  • undecidable
  • iteratively
  • constraint
  • deterministically
  • tuples
  • tuple
  • associative
  • pseudocode
  • pseudorandom
  • reachability
  • completeness
  • recursively
  • iterative
  • unary
  • logic


Is our universe a simulation?

Slashdot features an article today about a rebound of the argument over the likelihood that we live in a computer simulation. It made me react, because I remember having used the very opposite argument to show that our universe being ruled by a god is less likely than our universe not being ruled by one (in my definition, a god is very close to the “we live in a computer simulation” idea).

To keep it short, Bostrom argues that at least one of these propositions is true:

  1. the human species is very likely to go extinct before reaching a “posthuman” stage;
  2. any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof);
  3. we are almost certainly living in a computer simulation.

He then reaches the conclusion that we are more likely to be living in a subworld simulation of the evolutionary history of a species, one that probably emerged from a very similar environment, than to be living in the real thing.

I remember pondering the probability of our world being bootstrapped by a superpower vs. not being bootstrapped. Basically, for an intelligent species to emerge from nothing, you have to wait a long time, and the probability is low: let's say N after a given time. But then, if the universe is not restricted to a single environment, you multiply your chances by the number of environments E produced by the universe (if our visible universe is a good image of a real universe, this number is quite huge).

Then you have to take into account the probability that such an intelligent species can transition to a species able to run simulations of the universe big enough for the simulation to be perceived as the real thing by its inhabitants. And also the probability that the energy required to run such a simulation is available, and that the simulation is interesting enough to be run in a way compatible with what we can observe.

To answer this, we should think seriously about what could be interesting enough for us to spend a large amount of computing power on running a simulation of our world: predicting the future? having infinite TV programs of primitives doing primitive things (remember, we would be in a post-human state by then)? providing a placeholder for post-human brains to play in?

I don't know how much processing power is currently spent on video games vs. universe simulation vs. small-scale life simulation vs. spying on people, but my guess is that if we reach a post-human state, people will crave fun, because everything that could be understood will have been understood, and predicting the future in a post-human era would require simulating a post-human society. So either we are a simulation for a TV network, or we are in a video game. But I don't think a post-human civilization would want to spend energy simulating many human worlds, unless it's for the fun of playing god in a society of the past.

Now, factor all that in: E·N is the number of human-level intelligent societies in a big universe after a given time. Multiply that by the probability P that such a society can transition to a post-human one with enough computing power and energy to run simulations. Then multiply by the number S of simulations such a post-human society would run that have the characteristics of a subpart of the universe without any visible intervention from almighty actors. The question is thus: is E·N·P·S > E·N, i.e. is P·S > 1? I'm not so sure.

Actually, my personal conclusion is: since we have not observed god-like people making fun of us, I would bet that we're living in a non-simulated world. Because seriously, if you could play god in a simulation of 20th-century Earth, wouldn't you? Or would you rather just watch primitives live their boring lives?

Talk at Paris Machine Learning Meetup

I'll give a talk at the fantastic Meetup organized by Igor Carron and Franck Bardol (link to the meetup here) about NCISC and the demo we are setting up on Wikipedia data.

It's tomorrow, Wednesday the 12th, in Paris.

I'll present NCISC, with some results on a few standard datasets (Reuters, 20 Newsgroups, OHSUMED) as well as a comparison with Deep Learning-inspired methods. I'll also present the pipeline we've set up for analyzing Wikipedia from the text, link and category perspectives.

There's a demo (with relatively old and buggy data) here: http://demowiki.exensa.net/

The talk will also be live-streamed via Google+ Hangout, and the slides will be available here.

Spark + MLLib

[Image: Apache Spark logo]

I'm still digging into Spark and MLLib, but I can already say that this combination has some serious strong points:

  • It can easily replace Pig, with the added benefit that you don't have to mix several technologies for your UDFs
  • It's really efficient
  • It's made for iterative methods (which is the case for NC-ISC) and definitely better suited for ML than pure Hadoop/MRv1 (I don't know whether YARN performs better, but I don't really think so); see the sketch after this list
  • Scala is a great language
  • Spark is at least as scalable as Hadoop, and can use several cluster management systems: Hadoop YARN, EC2, Mesos, and Spark's own standalone manager
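
As promised above, here is a minimal PySpark sketch of the iterative pattern (the file path, input format and model size are illustrative): the dataset is cached in memory once and re-scanned at every pass of a gradient loop, which is the access pattern of NC-ISC-style training.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Each input line: feature values followed by a -1/+1 label.
# The RDD is cached once and re-scanned at every iteration; plain
# MapReduce would re-read it from disk on each pass.
data = sc.textFile("hdfs:///data/points.txt") \
         .map(lambda line: np.fromstring(line, sep=" ")) \
         .map(lambda row: (row[:-1], row[-1])) \
         .cache()

w = np.zeros(10)  # illustrative number of features
for _ in range(10):
    # Logistic-regression gradient, summed across the cluster
    grad = data.map(lambda p: -p[1] * p[0] / (1.0 + np.exp(p[1] * w.dot(p[0])))) \
               .reduce(lambda a, b: a + b)
    w -= 0.1 * grad
```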

My opinion is that Spark is the near-perfect framework to efficiently write efficient MapReduce jobs (see the double efficiency there?).

Also, it's funny to see that Twitter's Scalding partially shares Spark's goal (i.e. helping you write concise MapReduce jobs), except that Spark adds a much more powerful execution engine that keeps intermediate results in memory when possible. Probably not something really big datasets can afford, but at least it's good to have as an option.