Scikit-Learn, GraphLab, MLBase, Mahout

There’s a flurry of machine learning platforms/languages/libraries/systems that all implement almost the same algorithms. Have you tried them? I’m wondering which one is best for expressing a new algorithm quickly, efficiently, and at scale.

From what I’ve seen, SciKit and MLBase sound like the best choices from a usability point of view; GraphLab, MLBase and Mahout are great on scalability; and GraphLab is the most efficient, with SciKit and MLBase right on its heels.
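To give a concrete sense of the usability side: in scikit-learn, a full train/evaluate cycle fits in a few lines. This is a generic sketch with a current scikit-learn, not tied to any of the comparisons above:

```python
# Minimal scikit-learn round trip: load data, split, fit, score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # accuracy on held-out data
```

Expressing a *new* algorithm is another matter, of course; the estimator contract (fit/predict/transform) is what you would have to implement.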

Also it seems that GraphLab is not super easy to deploy.

GraphLab and (obviously) SciKit can interface with the IPython notebook. What about MLBase? Spark at least has Scala and Python bindings, so it should be able to connect.

  • SciKit-learn: http://scikit-learn.org
  • GraphLab: http://graphlab.org/
  • MLBase: http://www.mlbase.org/
  • Mahout: http://mahout.apache.org/

Any other solutions?

Couchbase 1.8.0 -> 1.8.1 upgrade and rebalance freeze issue

We’ve stumbled on a (relatively) important issue in version 1.8.0 of Couchbase: during an online upgrade to 1.8.1, the rebalance process stops/hangs/freezes. Various causes have been proposed for this problem (for instance, the existence of empty vbuckets).

I’ve found that there are several mundane reasons why the upgrade process can fail. After five unsuccessful attempts at a smooth upgrade, I’ve finally come up with a procedure that seems to work:

(This is a procedure for an online upgrade: for each node in turn, you remove it and rebalance, stop and upgrade it, then add it back and rebalance again. It uses the Debian package.)

  1. backup with cbbackup.
  2. remove the node and rebalance (it has never failed me at this point)
  3. stop the couchbase-server
  4. make sure that epmd is no longer running (it was still running in at least 3 of my previous upgrade attempts); if it is, kill it!
  5. wipe /opt/couchbase (OK, maybe this is overkill, but at least once I HAD to do it in order to continue the upgrade)
  6. dpkg -i couchbase-server...1.8.1.deb
  7. at this point you can edit /opt/couchbase/etc/couchbase_init.d and add the line “export HOME=/tmp” at the beginning, otherwise you won’t be able to stop the server with “service couchbase-server stop”
  8. add the node back to the cluster / rebalance
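For reference, steps 3 to 7 can be collected into a small per-node script. This is only a sketch of the list above: the .deb path is a placeholder, and by default the script just prints what it would do (set DRY_RUN=0 at your own risk):

```shell
#!/bin/sh
# Sketch of steps 3-7 above, for one node. The .deb path is a placeholder.
# With DRY_RUN=1 (the default) commands are only printed, not executed.
DRY_RUN=${DRY_RUN:-1}
PKG=${PKG:-/path/to/couchbase-server-1.8.1.deb}   # placeholder path

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run service couchbase-server stop                  # step 3
run pkill epmd                                     # step 4: kill leftover epmd
run rm -rf /opt/couchbase                          # step 5: the overkill wipe
run dpkg -i "$PKG"                                 # step 6
run sed -i '1a export HOME=/tmp' /opt/couchbase/etc/couchbase_init.d  # step 7
```

The sed line inserts the export right after the shebang rather than literally at line 1, which amounts to the same thing for the init script.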

eXenSa : new site, video teaser, we have moved

We’ve made a lot of progress at eXenSa, and I’ve had less and less time to publish our results (also, since we’ve moved to a more sensitive kind of data, I can hardly publish results about it).

On the entrepreneurship front, the news is excellent:

  • Brand new website with a video teaser (English version here)
  • We’ve moved into a new place
  • First customers are beginning to knock at the door (the product is ready, but I want to make sure it’s perfect before opening registration to everyone, so we’re starting a pilot program with chosen customers)

On the science front, we’ve made a lot of progress, thanks to the help of the Centre Francilien de l’Innovation and the famous Crédit Impôt Recherche, which allowed us to do our research with relative peace of mind. Major improvements have been made, mainly on two points:

  • the core embedding computation methods have been made more robust under some conditions
  • the knowledge injection mechanism has been vastly improved, especially from the inference point of view

That’s all for now; I’ve still got a lot to do to make this a business success :)

Reuters 21578 dataset : inconsistent results

I have recently decided to take a quick look back at a problem I haven’t spent much time solving, namely my results on the Reuters dataset. One of my problems was inconsistent results between my experiments and the results published by Ranzato and Szummer.

I have already talked about their paper in previous posts; NC-ISC outperforms their method on the 20 newsgroups dataset (they haven’t published any precision/recall curve on Ohsumed). On the small Reuters dataset, though, they have very good results, and they also publish some material that I have a hard time reproducing. Figure 4 of their paper “Semi-supervised Learning of Compact Document Representations with Deep Networks”, in particular, compares results from LSI with results from their method. The problem is that my own experiments show much better results for LSI.

Their results:

[Figure: precision/recall curves from “Semi-supervised Learning of Compact Document Representations with Deep Networks”]

And my results (just LSI):

[Figure: my LSI precision/recall curves on Reuters]

To sum up: my results for LSI-2 start at a precision of 0.35 while theirs start at 0.06; LSI-10 starts at 0.57 versus 0.35, and LSI-40 at 0.68 versus 0.56.

What is really surprising (even though technically not completely impossible) is that their results for LSI-2 and LSI-3 are below the baseline one would obtain by choosing documents at random. My TF-IDF curve is relatively close to theirs (though not exactly the same).

Since they took the data from the same source I did, I really don’t understand how we can get such different results.

Another puzzling fact is that, on the Reuters dataset, TF-IDF actually performs worse than cosine distance on the raw counts (at least for recall below 1%). This is, I think, the only dataset with such behaviour.
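For reference, the kind of LSI evaluation I’m running can be sketched as follows, using scikit-learn as a stand-in for my actual pipeline (the six toy documents and the precision-at-1 measure are only illustrative):

```python
# Sketch of an LSI-k retrieval evaluation: TF-IDF -> truncated SVD ->
# cosine ranking. Toy corpus; real runs use the full Reuters collection.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["grain wheat exports", "oil crude prices", "wheat harvest grain",
        "crude oil market", "interest rates bank", "bank lending rates"]
labels = np.array([0, 1, 0, 1, 2, 2])

X = TfidfVectorizer().fit_transform(docs)
Z = normalize(TruncatedSVD(n_components=2, random_state=0).fit_transform(X))

# Precision at rank 1: does each doc's nearest neighbour share its label?
sims = Z @ Z.T
np.fill_diagonal(sims, -np.inf)
hits = labels[sims.argmax(axis=1)] == labels
print(hits.mean())
```

Sweeping `n_components` over 2, 10, 40 and recording precision at each recall level gives the LSI-k curves discussed above.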


Improvement and correction over last post

I made a big mistake in my last post about our improved results: the precision/recall curves of our experiments were in “gain” mode, meaning that each curve tends toward zero as it approaches the precision of a random selection. So the curves finish at zero instead of at about 5% precision (which is what one should obtain for a 20-class problem with balanced classes). As a result, our actual results are even slightly better than what you saw in my previous post.

More importantly, I’ve included here another variant of semantic knowledge injection, this time using bigrams and trigrams in addition to normal words (for instance, “mother board” is now considered a single vocabulary item); the bigrams and trigrams are selected automatically.

The vocabulary size climbs from about 60K to 95K, with many grammatical constructs (“but if” and such), and the results are quite amazing: a 4-5% improvement in precision on the left of the curve, and about +10% precision toward the 0.1 recall mark.
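The automatic selection can be done with simple count and association thresholds. A toy sketch (the PMI scoring and the thresholds here are illustrative, not our actual criterion):

```python
# Toy bigram selection by pointwise mutual information (PMI).
# Thresholds and scoring are illustrative, not our actual criterion.
from collections import Counter
from math import log

tokens = ("the mother board of the computer and the mother board "
          "of the server").split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(pair):
    a, b = pair
    return (log(bigrams[pair] / (n - 1))
            - log(unigrams[a] / n) - log(unigrams[b] / n))

# Keep bigrams that are frequent enough and strongly associated.
selected = [p for p, c in bigrams.items() if c >= 2 and pmi(p) > 1.0]
print(selected)
```

On real corpora one would also filter grammatical constructs (“but if” and such), or keep them, as we did, and let the embedding sort them out.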

This is probably a more important result than what we’ve achieved with class knowledge injection, since class knowledge is rarely available in real-world problems (in product recommendation, for instance, you don’t know the class of your users).

The fact that our method, using only intrinsic knowledge, can outperform the deep autoencoders that use class knowledge up to the 2% recall mark is probably more important than any result I could obtain using class information only.

[Figure: NC-ISC precision/recall curves with the n-gram variant]

Recent NC-ISC method improvement

Despite the already good results of our method, we have improved it further. The most notable improvement comes from a better handling of the numeric compatibility between the two final embeddings (for instance, the word and document embeddings).

Here are the results on 20newsgroups-bydate (60% training, 40% testing; the test data are the most recent documents).

This was also the occasion to show how simply injecting “a priori” knowledge about word semantics can significantly improve the results (this “a priori” knowledge is computed by running NC-ISC on a cooccurrence matrix, using a 20-word window over a mix of the 20newsgroups and Reuters RCV1 corpora).
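The cooccurrence counting itself is straightforward. A sketch with a symmetric sliding window (the toy sentence and window=2 stand in for the 20-word window over 20newsgroups + RCV1):

```python
# Build symmetric word-word cooccurrence counts with a sliding window.
# The toy sentence and window size are placeholders for the real setup.
from collections import defaultdict

def cooccurrences(tokens, window=2):
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        # Pair each token with the tokens in the window behind it,
        # counting both directions so the matrix is symmetric.
        for j in range(max(0, i - window), i):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1
    return counts

toy = "the cat sat on the mat".split()
cooc = cooccurrences(toy, window=2)
print(cooc[("cat", "sat")])
```

NC-ISC is then run on this (word × word) matrix exactly as it is run on a (word × document) matrix.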

The curves are classical precision/recall results. For comparison, I have included what I think are the best results in the state of the art (if I have missed more recent results, please tell me), from Ranzato & Szummer (ICML 2008), taken from the figure showing the best results of their shallow method (I have no clue whether their deep method can do better, but I doubt it, since they mainly argue that deep methods can perform as well as shallow methods but with fewer features).

[Figure: NC-ISC precision/recall curves on 20newsgroups-bydate]

One interesting thing is that Ranzato and Szummer use a preprocessed 20 newsgroups corpus that reduces the vocabulary to 10,000 words (using Rainbow), while we use pure raw words (including uppercase and lowercase variants). It would be interesting to see whether their method could perform better on the whole unprocessed vocabulary. Also, our first experiments using frequent pairs and triplets of words as new vocabulary items have shown an unexpected global performance decrease (this concerns raw TF-IDF as well as all NC-ISC variants). Comments on this are warmly welcome.

eXenSa is born !

I’ve had a few busy days lately: eXenSa has officially existed since the 3rd of April. As I’ve quickly sketched before, eXenSa will offer automated product recommendations to e-commerce sites. The most important innovation is that, by using and learning semantics from several e-shops, we will be able to recommend items even with little data on user actions. As a consequence, our target market will include much smaller e-shops than other recommendation providers can serve (this in addition to the excellent quality of our recommendation engine, of course).

Competition in the recommendation area

Recommendation engines are nothing new. Many algorithms can be used, with results of variable quality, and a lot of people have jumped on the ship (including myself). Now that the e-commerce market is really mature, and anyone with a bit of technical knowledge and a lot of will can start an e-shop, everybody is going to look for systems they can plug into their shop to get better product recommendations.

Here is a list of recommendation service providers I’ve found so far. Obviously some are taken from the Wikipedia page.

Recommendation system : a quick overview of the problem and its relation with NC-ISC

As a follow-up to my research on characterizing documents and/or vocabulary, we (some friends and myself) are preparing to launch a new recommendation service. Our first targets are e-commerce sites, which can greatly benefit from improvements to the products they show users (whether in the recommended-items list or in the ordering of search results). Amazon reportedly makes 30% of its sales directly from recommended items (and their recommendations are not unanimously acclaimed). Now the question will be: why will my algorithm be better than the others, and why should you use it?

As stated in this article, the data you collect is more important than the algorithm you use. That will always be true: with no data at all, the best algorithm will always perform worse than a clumsy algorithm working on tons of excellent, clean data. That’s almost tautological, but sadly true; no miracle can come from the algorithm alone. Now, if you do your best at collecting data, the algorithm can make a real difference (take for example my article about the 20 newsgroups experiments: while most high-performance existing algorithms can only take into account a few words, mine does much better because it seamlessly handles all the available data). Also, in my previous note about Ohsumed, I showed that with exactly the same data, my algorithm NC-ISC can largely outperform other state-of-the-art methods.

Now that you’re persuaded that my algorithm works for document retrieval, I must convince you that it will also work elsewhere. Recommendation / collaborative filtering is a fairly standard task in the machine learning community, and it has already borrowed a lot from computational linguistics methods (see RSVD for instance). The idea is this: in both cases (information retrieval and recommendation systems) we have sparse relational information between type A objects and type B objects.

In IR, type A objects are words and type B objects are documents, the relation being that a word appears in a document. In recommendation systems, users are type A objects and products type B; the relations can be “view”, “buy”, “put in basket”, etc.

In both cases, the idea is to fill in the blanks in the matrix, that is, to guess the strength of the relation between any A object and any B object, whether it has been observed before or not.
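The blank-filling view can be sketched as a plain low-rank factorization fitted by SGD on the observed entries, in the spirit of the regularized-SVD methods mentioned above (toy data and made-up hyperparameters):

```python
# Toy low-rank matrix completion by SGD on observed (user, item, value)
# triples -- the spirit of regularized SVD; hyperparameters are made up.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
            (2, 3, 2.0), (3, 3, 1.0), (3, 4, 4.0)]

U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_items, k))
lr, reg = 0.05, 0.02

for _ in range(200):
    for u, i, r in observed:
        err = r - U[u] @ V[i]          # residual on this observed entry
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# Predicted relation strength for a pair never observed:
print(U[0] @ V[2])
```

NC-ISC plays the same role as the factorization here: it produces compatible embeddings for A objects and B objects, from which missing relation strengths can be read off.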

Alternative to mail, blogging, micro-blogging, multi-facets publishing, chatting, etc.

So after yesterday’s post about an open protocol that could cover all of our social needs while preserving privacy, I’ve been thinking a lot. One thing that struck me is how much the concept of blogging/publishing/chatting is the opposite of how the mail system works. This brings me to some interesting ideas (or not, you tell me).

Internet mail is a very good image of the physical mail system: you’ve got

  • post offices that collect departing messages near you (SMTP),
  • mail agents that carry the messages from post office to post office (until they reach the post office near the receiver),
  • and finally mailboxes that store the messages for which you’re the receiver. It’s a model based on minimizing distances, and that’s probably not the best model for the internet (even though it has some good properties).

Now how would it work if email followed the publishing way of sharing information? S wants to send a message to R, so S publishes it on its own server (which also serves for blogging, sharing photos with family, etc.). S is the only one on earth who can talk to its server. When the server receives the publication, it contacts R’s server, telling it that there’s a message from S. During that short contact, serverS sends serverR a key that will serve to retrieve the content of the message. There is no reason for serverR to retrieve the message right away; only when R comes online will she see that S has sent her a message, and she may then decide to download it (or it may be done automatically if R knows S and the message is small enough). The advantages:

  • S can discard its message if R hasn’t read it yet
  • No duplication of data (even when sending a mail to many people)
  • Spamming would be much harder
  • Adding/removing trusted sources would be under the user’s control; no need to write rules to send undesired mail to the trash
  • Side effect: your social network IS your set of trusted sources
  • Double side effect: every publication could be advertised in the same way (to your subscribers/followers, for instance)

The drawbacks

  • How R receives the message would depend on serverS’s ping and bandwidth, so the quality of your email service would actually impact your receivers, not you… (that may be a good thing, actually)
  • A lot of private/public keys and symmetric keys to manage. I don’t see this as a real drawback either
  • You complete…
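The notify-then-fetch flow above can be simulated in a few lines. All names here (Server, publish_to, fetch_pending) are illustrative, not a protocol spec:

```python
# Toy simulation of the notify-then-fetch flow: the message stays on the
# sender's server; only a notification (with a key) travels immediately.
import secrets

class Server:
    def __init__(self, owner):
        self.owner = owner
        self.published = {}   # key -> message, stays on the sender's side
        self.inbox = []       # (sender_server, key) notifications

    def publish_to(self, recipient_server, message):
        key = secrets.token_hex(8)
        self.published[key] = message          # not copied to the recipient
        recipient_server.inbox.append((self, key))
        return key

    def fetch_pending(self):
        # The recipient decides when (and whether) to pull the content;
        # popping on fetch is what lets the sender discard unread messages.
        msgs = [src.published.pop(key) for src, key in self.inbox]
        self.inbox.clear()
        return msgs

server_s, server_r = Server("S"), Server("R")
server_s.publish_to(server_r, "hello R")
print(server_r.fetch_pending())  # ['hello R']
```

Until R fetches, S can simply delete the entry from `published`, which is exactly the first advantage listed above.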

So, could we:

  • Get rid of spam?
  • Have an open, privacy-respecting, distributed, secure social networking system?
  • Have email service providers that compete on quality of service, not on keeping their customers prisoner?

Sure we could :)