Competition in the recommendation area

Recommendation engines are nothing new. Many algorithms can be used to obtain results of variable quality, and a lot of people are jumping on board (including myself). Now that the eCommerce market is really mature, and now that anyone with a bit of technical knowledge and a lot of will can start an eShop, everybody is going to look for systems they can plug into their shop to obtain better product recommendations.

Here is a list of the recommendation service providers I’ve found so far. Obviously, some are taken from the Wikipedia page.

Recommendation systems: a quick overview of the problem and its relation to NC-ISC

As a follow-up to my research on document and/or vocabulary characterization, some friends and I are preparing to launch a new recommendation service. Our first targets are eCommerce sites, which can greatly benefit from an improvement in the products they show to their users (whether in the recommended-items list or in the display order of a search query). Amazon reportedly derives 30% of its sales directly from recommended items (and its recommendations are not unanimously acclaimed). Now the question is: why would my algorithm be better than the others, and why should you use it?

As stated in this article, the data you collect is more important than the algorithm you use. That will always be true: with no data at all, the best algorithm will always perform worse than a clumsy algorithm working on tons of excellent, clean data. That’s almost tautological, but sadly true; no miracle can come from the algorithm alone. Now, if you do your best at collecting data, the algorithm can make a real difference. Take for example my article about the 20 Newsgroups experiments: while most existing high-performance algorithms can only take a few words into account, mine does much better because it seamlessly handles all the available data. Also, in my previous note about Ohsumed, I show that with exactly the same data, my algorithm NC-ISC can largely outperform other state-of-the-art methods.

Now that you’re persuaded that my algorithm works for document retrieval, I must convince you that it will also work in other areas: recommendation / collaborative filtering is a fairly standard task in the machine learning community, and it has already borrowed a lot from computational linguistics methods (see RSVD for instance). The idea is this: in both cases (information retrieval and recommendation systems), we have some relational information between type A objects and type B objects.

In IR, the type A objects are words and the type B objects are documents, the relation being the fact that a word appears in a document. In recommendation systems, we have users as type A objects and products as type B objects. The relations can be « view », « buy », « put in basket », etc.

In both cases, the idea is to fill in the blanks of the matrix, or in other words, to guess the strength of the relation between any A object and any B object, whether that pair has been observed before or not.
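To make the analogy concrete, here is a minimal sketch of the most standard way of « filling the blanks » once the A/B relations are arranged in a sparse matrix: plain matrix factorization trained by stochastic gradient descent. To be clear, this is not NC-ISC, and the toy matrix, latent dimension and learning rates below are arbitrary choices for illustration only.

```python
import numpy as np

# Toy user x product relation matrix: rows are type A objects (users),
# columns are type B objects (products); 0 means "relation not observed yet".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

n_a, n_b = R.shape
k = 2                                            # size of the latent space
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_a, k))         # latent factors for type A objects
V = rng.normal(scale=0.1, size=(n_b, k))         # latent factors for type B objects

lr, reg = 0.01, 0.02
observed = [(i, j) for i in range(n_a) for j in range(n_b) if R[i, j] > 0]

for epoch in range(2000):
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]              # error on an observed relation
        u_i = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])   # gradient step on the A factor
        V[j] += lr * (err * u_i - reg * V[j])    # gradient step on the B factor

# The "filled" matrix: a predicted strength for every (A, B) pair,
# including the pairs that were never observed.
print(np.round(U @ V.T, 2))
```

The printed matrix contains a predicted strength for every (user, product) pair, which is exactly what a recommender sorts on when choosing what to display.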

Alternative to mail, blogging, micro-blogging, multi-faceted publishing, chatting, etc.

So, after yesterday’s post about an open protocol that could cover all of our social needs while preserving privacy, I’ve been thinking a lot. One thing that struck me is how much the concept of blogging/publishing/chatting is the opposite of how the mail system is built. This brings me to some interesting ideas (or not, you tell me).

Internet mail is a very good image of the physical mail system: you’ve got

  • post offices that collect departing messages near you (SMTP),
  • mail agents that carry the messages from post office to post office (until they reach the post office near the receiver),
  • and finally mailboxes that store the messages for which you are the receiver.

It’s a model based on minimizing distances, and that’s probably not the best model for the internet (even though it has some good sides).

Now, how would it work if emails followed the publishing way of sharing information? S wants to send a message to R, so S publishes it on her own server (which also serves for blogging, sharing photos with family, etc.). S is the only one on earth who can talk to her server. When the server receives the publication, it contacts R’s server, telling it that there’s a message from S waiting for R. During that short contact, serverS sends serverR a key that will later serve to retrieve the content of the message. There is no reason for serverR to fetch the message right away; only when R goes online will she see that S has sent her something, and she may then decide to download it (or it may be done automatically if R knows S and the message is small enough). A minimal sketch of this flow is given at the end of this post. The advantages:

  • S can discard her message if R hasn’t read it yet
  • No duplication of data (even when sending a mail to many people)
  • Spamming would be much harder
  • Adding/removing trusted sources would be under the user’s control, with no need to write rules that send undesired mail to the trash
  • Side effect: your social network IS your set of trusted sources
  • Double side effect: every publication could be advertised in the same way (to your subscribers/followers, for instance)

The drawbacks:

  • How R receives the message would depend on serverS’s ping and bandwidth, so the quality of your email service would actually impact your recipients, not you… (that may be a good thing, actually)
  • A lot of private/public keys and symmetric keys to manage. I don’t see this as a real drawback either
  • You complete…

So, could we

  • Get rid of SPAM?
  • Have an open, privacy-respecting, distributed, secure social networking system?
  • Have email service providers that compete on quality of service, not on keeping their customers captive?

Sure we could :)
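For the curious, here is a tiny, purely illustrative simulation of the flow described above. Everything here (the class, the in-memory dictionaries, the hex token used as retrieval key) is my own assumption rather than an existing protocol; a real system would obviously add authentication, encryption and an actual network transport.

```python
import secrets

class Server:
    """Hypothetical publication server; one per user (serverS, serverR, ...)."""

    def __init__(self, owner):
        self.owner = owner
        self.published = {}      # key -> content, held on the sender's side
        self.notifications = []  # (sender, origin_server, key), held on the receiver's side

    # --- sender side -------------------------------------------------------
    def publish(self, content, recipient_server):
        """S publishes the message on her own server; only a retrieval key
        travels to R's server."""
        key = secrets.token_hex(16)
        self.published[key] = content
        recipient_server.notify(sender=self.owner, origin=self, key=key)
        return key

    def retract(self, key):
        """S can still discard the message as long as R hasn't fetched it."""
        self.published.pop(key, None)

    def serve(self, key):
        """Called by R's server when R actually decides to read the message."""
        return self.published.get(key)

    # --- receiver side -----------------------------------------------------
    def notify(self, sender, origin, key):
        """Record that a message is waiting on a remote server."""
        self.notifications.append((sender, origin, key))

    def fetch_pending(self):
        """When R goes online, her server pulls the content she wants to read."""
        messages = []
        for sender, origin, key in self.notifications:
            content = origin.serve(key)
            if content is not None:       # the sender may have retracted it
                messages.append((sender, content))
        self.notifications.clear()
        return messages


serverS, serverR = Server("S"), Server("R")
serverS.publish("Hello R, lunch tomorrow?", serverR)
print(serverR.fetch_pending())   # [('S', 'Hello R, lunch tomorrow?')]
```

Note that retract() works precisely because the content never leaves serverS until R asks for it, which is the property behind the first two advantages listed above.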

Diaspora, the distributed social networking system

After I wrote my last post, I immediately thought « Darn, I’ve forgotten to talk about Diaspora ». Fortunately, nobody has flamed me yet…

So here it is. Diaspora is still in its infancy, and actually still in closed beta, but it has already been acclaimed by many as a decent alternative to Facebook. However, it is hard to believe that it will really take off: yesterday I stumbled on a post while looking for information about their protocol, and they seemingly started the project the wrong way, writing the implementation before thinking about an interoperability protocol. It’s certainly not completely bad. Writing an implementation of a protocol is a mandatory step to find its weaknesses. From a company’s point of view, actually, this is quite a good approach: designing a protocol is a cumbersome and expensive thing to do, and if interoperability and openness are not your primary goals, then the Diaspora way is probably the fastest way. Going fast is not the best way to go far, though…

Standard protocol for Social Networking

I was recently wondering about the fact that we were at the very beginning of the social networking era. And I was thinking to myself that, like HTTP vs. HyperCard, the only way to deal openly with social groups’ real-time sharing needs is to develop an open protocol that would allow the creation of social accounts (in the very same way that you have mail accounts) that you could easily create and control. I believe we really are at the pre-standard age of the phenomenon (or should I say, the completion of the phenomenon, since email and IRC were already bits of our social life).

I’m not the only one to think that way, of course: Adrian Thurston has developed the distributed social networking protocol, which may well be the right way to do it. Will he be the next Tim Berners-Lee? Today it’s probably not as easy as in the good ol’ times. When TBL came up with the HTTP protocol, its most prominent private competitor was HyperCard, a $50 Apple application that actually shipped with all new Macs (that was 1987). On the bad side, today, Mr. Thurston would have to fight against one big company (FB) that is thought to handle a significant amount of the social traffic (this is not true, though: blogs, emails, chats, phone calls and face-to-face meetings still hold the largest part of that social traffic). On the bright side, quite a few people are a bit pissed off by the success of FB. Google’s Orkut and Buzz are attempts to jump into the SN/micro-blogging phenomenon, and Apple’s Ping is a miserable attempt to sneak iTunes users into a social network. So the field is open for anyone who proposes a way to remove the « forerunner advantage » effect that currently benefits FB.

A question arises, then: while Google almost always chooses to undermine competitors’ positions by pushing open standards efforts, why did it not do it this time for social networking? It would be quite simple to allow deep interop with social networking sites if you had social identities held by your email addresses. Just transform mail servers into identity servers, add private/public keys for publication/sharing and a real-time push/pull API, and you’re almost done! It would probably not even destroy Facebook, which would remain a great contender because of what it really brings to its users: an open environment for application integration. But it would also open the field for a lot of new competitors bringing interesting innovations.
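To give a rough idea of what « social identities held by your email addresses » could look like in code, here is a hypothetical sketch of such an identity server interface. Every class, field and method name below is invented for illustration, and the signature verification is only hinted at in a comment (a real implementation would use something like Ed25519 key pairs).

```python
from dataclasses import dataclass, field

@dataclass
class Identity:
    """A social identity anchored to an email address, with the public key
    that followers use to verify what this identity publishes."""
    address: str          # e.g. "s@example.org"
    public_key: bytes     # verification key exposed by the identity server

@dataclass
class IdentityServer:
    """Hypothetical identity server: the mail server answering for
    "s@example.org" also answers for her social identity."""
    identities: dict = field(default_factory=dict)  # address -> Identity
    followers: dict = field(default_factory=dict)   # address -> list of remote servers
    inbox: list = field(default_factory=list)       # publications received for local users

    def register(self, identity: Identity):
        self.identities[identity.address] = identity
        self.followers.setdefault(identity.address, [])

    def lookup(self, address: str) -> Identity:
        """Pull API: anyone can resolve an address to its public identity."""
        return self.identities[address]

    def follow(self, address: str, follower_server: "IdentityServer"):
        """A remote server subscribes to one of our local identities."""
        self.followers[address].append(follower_server)

    def publish(self, address: str, content: str, signature: bytes):
        """Push API: a signed publication is pushed in real time to every
        follower's server."""
        author = self.identities[address]
        for server in self.followers[address]:
            server.receive(author, content, signature)

    def receive(self, author: Identity, content: str, signature: bytes):
        # A real implementation would verify `signature` against
        # `author.public_key` (e.g. an Ed25519 check) before accepting.
        self.inbox.append((author.address, content))


server_s, server_r = IdentityServer(), IdentityServer()
server_s.register(Identity("s@example.org", public_key=b"<S's public key>"))
server_s.follow("s@example.org", server_r)
server_s.publish("s@example.org", "New photo album online", signature=b"<sig>")
print(server_r.inbox)   # [('s@example.org', 'New photo album online')]
```

The point is simply that the pieces (address lookup, key publication, push to followers) map naturally onto what mail servers already do.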