Improvements and a correction to my last post

I made a big mistake in my last post about our improved results: the precision/recall curves of our experiments were plotted in "gain" mode, that is, a curve tends toward zero as it approaches the precision of a random selection. As a result, the curves finish at 0% instead of at about 5% precision (which is what a random selection should obtain on a 20-class problem with balanced classes). So our actual results are even slightly better than what you saw in my previous post.
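To make the correction concrete, here is a minimal sketch of what I mean by "gain" mode (my reading of it; the actual plotting code isn't shown here): the plotted value is precision minus the precision of a random ranking, so recovering absolute precision just means adding the random baseline back.

```python
# Convert a "gain"-mode precision curve back to absolute precision.
# Assumption: gain = precision - random_baseline, where the baseline
# for a balanced 20-class problem is 1/20 = 5%.

def gain_to_precision(gain_curve, num_classes=20):
    baseline = 1.0 / num_classes  # precision of a random selection
    return [g + baseline for g in gain_curve]

# A curve that "finishes at 0" in gain mode...
gain = [0.45, 0.20, 0.05, 0.0]
print(gain_to_precision(gain))  # ...actually finishes at 5% precision
```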

More importantly, I've included here another variant of semantic knowledge injection, this time using bigrams and trigrams in addition to normal words (for instance, "mother board" is now considered a single vocabulary item); the bigrams and trigrams are selected automatically, as sketched below.
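I won't detail our exact selection criterion here, but a common automatic approach is to keep n-grams whose parts co-occur far more often than chance, for example with a pointwise mutual information (PMI) score plus a frequency cutoff. A minimal sketch of that idea (the thresholds are illustrative, not necessarily the values we used):

```python
import math
from collections import Counter

def select_bigrams(tokens, min_count=10, min_pmi=3.0):
    """Keep bigrams whose pointwise mutual information exceeds a threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    selected = []
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue
        # PMI = log( P(w1, w2) / (P(w1) * P(w2)) )
        pmi = math.log((n / total) / ((unigrams[w1] / total) * (unigrams[w2] / total)))
        if pmi >= min_pmi:
            selected.append((w1, w2))
    return selected
```

Selected bigrams such as ("mother", "board") can then be merged into single vocabulary items like "mother_board"; a second pass over the merged token stream yields trigrams the same way.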

The vocabulary size climbs from about 60K to 95K, with many grammatical constructs ("but if" and such), and the results are quite amazing: a precision improvement of 4-5% at the low-recall end (the left of the curve), and about +10% precision around the 0.1 recall mark.

This is probably a more important result than what we achieved with class knowledge injection, since class knowledge is rarely available in real-world problems (in product recommendation, for instance, you don't know the class of your users).

The fact that our method, using only knowledge intrinsic to the corpus, can outperform the deep autoencoders that use class knowledge up to the 2% recall mark is probably more important than any result I could obtain using class information alone.

[Figure: NC-ISC-small-new (precision/recall curves)]

Recent NC-ISC method improvements

Despite the already good results of our method, we have further improved it. The most notable improvement comes from better handling of the numeric compatibility between the two final embeddings (for instance, the word and document embeddings).
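To illustrate the kind of issue involved (I won't detail our actual fix here): when word and document embeddings come out of a factorization, the scale of each dimension can be absorbed entirely by one side, which makes distances in the two spaces numerically incomparable. A classic remedy, shown as a hypothetical sketch with a plain SVD, is to split the scales symmetrically between the two sides:

```python
import numpy as np

# Toy term-document matrix: rows = words, columns = documents.
X = np.random.rand(1000, 200)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 50  # embedding dimension

# Asymmetric split: words absorb all of the scale (U * s), documents
# none (V).  Word-word similarities then weight dimensions by s**2
# while document-document similarities ignore s entirely, so the two
# spaces are not numerically compatible.
words_skewed, docs_skewed = U[:, :k] * s[:k], Vt[:k].T

# Symmetric split: each side takes sqrt(s); word-document dot products
# still reproduce X's structure, but both spaces now share the same
# per-dimension scale, making distances comparable across them.
words = U[:, :k] * np.sqrt(s[:k])
docs = Vt[:k].T * np.sqrt(s[:k])
```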

Here are the results on 20newsgroups-bydate (60% training, 40% testing; the test data corresponds to the most recent documents in time).
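For reference, the bydate split is the standard one shipped with scikit-learn, and the precision/recall curves here are retrieval-style: each test document queries the training set, and a retrieved document counts as relevant when it shares the query's newsgroup. A rough sketch of that evaluation with a plain TF-IDF baseline (our NC-ISC embeddings would take the place of the TF-IDF vectors; the rest is unchanged):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The "bydate" version: ~60% train / 40% test, split chronologically.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vec = TfidfVectorizer()
Xtr = vec.fit_transform(train.data)
Xte = vec.transform(test.data)

# Rank training documents by cosine similarity to each test query and
# check how many of the top hits share the query's newsgroup
# (one point on the precision/recall curve).
sims = cosine_similarity(Xte[:100], Xtr)       # subsample for speed
ranked = np.argsort(-sims, axis=1)[:, :100]    # top 100 per query
relevant = train.target[ranked] == test.target[:100][:, None]
print("precision@100:", relevant.mean())
```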

This was also the occasion for us to show how simply injecting "a priori" knowledge about word semantics can significantly improve the results (this "a priori" knowledge is computed by running NC-ISC on a co-occurrence matrix built with a 20-word window over a mix of the 20newsgroups and Reuters RCV1 corpora).
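Concretely, the matrix behind that prior is the usual sliding-window co-occurrence construction. A minimal sketch (the symmetric window and unit weights are assumptions for illustration; only the 20-word window size is fixed above):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=20):
    """Count word pairs appearing within `window` words of each other."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1  # symmetric window
    return counts
```

NC-ISC is then run on this matrix (built over the mixed 20newsgroups + Reuters RCV1 token stream) to produce the word-semantics prior.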

The curves are classical precision/recall results. For comparison, I have included what I think are the best results in the state of the art (if I have missed more recent results, please tell me), taken from Ranzato & Szummer (ICML 2008), specifically from the figure showing the best results of their shallow method (I have no clue whether their deep method can do any better, but I doubt it, since they mainly argue that deep methods can perform as well as shallow ones but with fewer features).

[Figure: NC-ISC-small-old (precision/recall curves)]

One interesting thing is that Ranzato and Szummer use a preprocessed 20newsgroups corpus that reduces the vocabulary to 10,000 words (using Rainbow), while we use pure raw words (including uppercase and lowercase variants). It would be interesting to see whether their method could perform better on the whole unprocessed vocabulary. Also, our first experiments using frequent pairs and triplets of words as new vocabulary items showed an unexpected global performance decrease (this concerns raw TF-IDF as well as all NC-ISC variants). Comments on this are warmly welcome.