From Earth to Heaven

Since my first attempt to compute “similarity paths” between Wikipedia pages on wikinsights.org, we’ve made some progress.

But first, a word about this “similarity path”, because a lot of people completely miss the point of it:

Usually, people use the Wikipedia graph to find paths between pages. Unfortunately, Wikipedia is so well linked that most paths are just two or three hops long. You can, for instance, play with wikidistrict.com to find the shortest path between two articles: from Earth to Heaven, wikidistrict gives a short path: Earth -> Deity -> Heaven. Of course there are other paths (a huge number of paths, actually).

Using the latent models built by NCISC, however, we can recreate the graph any way we want. You just have to decide:

  • the model (semantic model, in-link model, out-link model, in-out-link model, out-in-link model)
  • the depth of the neighbours for each node

With this, we are now able to find our way in a much finer-grained, rewritten version of Wikipedia. For instance, Earth to Heaven gives:

Earth -> Moon -> Mercury (planet) -> Jupiter -> Sun -> Aurora -> Atmospheric diffraction -> Atmospheric optics -> Earth’s shadow -> Night sky -> Phases of Venus -> Counter-Earth -> History of the Center of the Universe -> Pythagorean astronomical system -> Astrological age -> History of astrology -> Zoroaster -> Asha -> Amesha Spenta -> Creator deity -> Heaven

The interesting thing here is the moment where we switch from astronomy to astrology. It happens with “Counter-Earth”, a hypothesis about another planet orbiting the sun in counter-phase with the earth.
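To make the idea concrete, here is a minimal sketch. It assumes the rebuilt graph simply links each page to its `depth` nearest neighbours by cosine similarity in the latent space, and that a path is a plain breadth-first shortest path. The toy vectors, the page list and both function names are made up for illustration; this is not the NCISC implementation.

```python
from collections import deque

import numpy as np


def knn_graph(names, vectors, depth):
    """Link every page to its `depth` most similar pages (cosine similarity)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # a page is not its own neighbour
    graph = {}
    for i, name in enumerate(names):
        nearest = np.argsort(-sims[i])[:depth]
        graph[name] = [names[j] for j in nearest]
    return graph


def similarity_path(graph, start, goal):
    """Breadth-first search: shortest chain of neighbour hops, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            path = []
            while page is not None:
                path.append(page)
                page = previous[page]
            return path[::-1]
        for neighbour in graph[page]:
            if neighbour not in previous:
                previous[neighbour] = page
                queue.append(neighbour)
    return None


# Toy 2-D "latent vectors" spread along an arc, so similar pages are adjacent.
names = ["Earth", "Moon", "Sun", "Night sky", "Heaven"]
angles = np.linspace(0.0, np.pi / 2, len(names))
vectors = np.column_stack([np.cos(angles), np.sin(angles)])

graph = knn_graph(names, vectors, depth=2)
path = similarity_path(graph, "Earth", "Heaven")
print(" -> ".join(path))
```

With a small `depth`, the path is forced through intermediate pages instead of jumping straight to the target, which is the effect described above.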

What I show here is not in wikinsights yet, you’ll have to wait a bit before it comes live, but don’t worry it won’t be too long. We have a few novelties to add to our Wikinsights demo, with a set of improved models.

Some other paths:

  • Harry Potter to Jacques Chirac

Harry Potter -> Harry Potter and the Deathly Hallows -> Cerebus the Aardvark -> Anarky (comic book) -> Publication history of Anarky -> Anarky -> Operation Mindfuck -> Political strategy -> Political privacy -> Yellow Ribbon campaign (Fiji) -> Reaction to the 2005–06 Fijian political crisis -> Sitiveni Rabuka -> 1987 Fijian coups d’état -> Presidential Council (Benin) -> Hubert Maga -> Haitian general election, 2006 -> Jacques Chirac

  • Charleston, South Carolina -> Greece

Charleston, South Carolina -> History of Charleston, South Carolina -> History of the Southern United States -> History of Georgia (U.S. state) -> History of the United States -> United Kingdom–United States relations -> Decolonization -> International relations of the Great Powers (1814–1919) -> History of Europe -> Ottoman Greece -> Background of the Greek War of Independence -> Megali Idea -> Draft:Island of Cyprus -> Cyprus -> Greece

 

 

Amazing results of NCISC on Netflix

Last week, I played a little bit with the Netflix Prize dataset and NC-ISC. The results are amazingly better than the current state of the art.

Short version

The Netflix dataset contains ratings of movies by users. We try to predict unseen good ratings (a few ratings are kept aside in a separate dataset, and we try to see if we can predict that a given user would have rated those particular movies).

I’ve compared state-of-the-art results with my method, NC-ISC. With the best settings, I can predict 24% of the unseen ratings in the first 50 results, compared to 19% for the current best methods. Considering that a non-personalized prediction based on popularity alone gives a 10% score, our gain over that baseline is 14 points versus 9 points for the current best methods, which is a 55% improvement over the current state of the art.
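The arithmetic behind that figure can be checked in a few lines. The three recall values are the ones quoted above; the variable names are only for illustration.

```python
# Recall@50 figures quoted in the text (fraction of unseen ratings recovered).
popularity_baseline = 0.10
best_published = 0.19
nc_isc = 0.24

# Gain of each method over the non-personalized popularity baseline.
published_gain = best_published - popularity_baseline
nc_isc_gain = nc_isc - popularity_baseline

# Improvement measured on the gain over the baseline (the 55% figure).
relative_improvement = nc_isc_gain / published_gain - 1
# Improvement measured on the raw recall values, for comparison.
raw_improvement = nc_isc / best_published - 1

print(f"on the gain over the baseline: {relative_improvement:.1%}")
print(f"on raw recall: {raw_improvement:.1%}")
```

Measured on raw recall instead of on the gain over the popularity baseline, the same numbers give a smaller figure, so it matters which one is being quoted.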

This is a huge gain.

And it could be applied to other recommendation problems : music, jobs, social network, products in e-commerce and social shopping, etc.

A bit of context

The Netflix Prize was an open competition that ran from 2006 to 2009. The dataset consists of 100M ratings (1-5) given by 400K+ users on 17K+ movies. The original goal was to predict “unseen” ratings with the highest possible accuracy. The measure used was the RMSE (root mean square error) on the hidden ratings.
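For readers who have not met the measure, here is a minimal sketch of the RMSE on a handful of made-up ratings (the values are purely illustrative, not taken from the dataset).

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Three hidden ratings and the ratings a model predicted for them.
hidden = [5, 3, 3]
guessed = [4.0, 3.0, 5.0]
error = rmse(guessed, hidden)
print(round(error, 3))
```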


The RMSE, however, is a very incomplete evaluation measure (it also cannot be estimated on the results of NC-ISC, which is one of the reasons we haven’t spent much time comparing our results with the state of the art). Indeed, the common task for a recommender system is not to predict the rating you would give a given movie, but to provide you with a list of K movies that you could potentially like.

As a result, more and more effort has been put on the so-called top-K recommender task, and as a side effect on learning to rank instead of learning to rate.
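A top-K evaluation of this kind can be sketched as follows for a single user. The function name and the movie identifiers are hypothetical, and how the per-user values are aggregated in a given paper is not covered here.

```python
def recall_at_k(ranked_items, held_out_items, k=50):
    """Fraction of held-out items that appear in the first k recommendations."""
    if not held_out_items:
        raise ValueError("need at least one held-out item")
    top_k = set(ranked_items[:k])
    hits = sum(1 for item in held_out_items if item in top_k)
    return hits / len(held_out_items)


# Recommendations for one user, best first, and the movies kept aside for them.
ranked = ["m7", "m2", "m9", "m4", "m1"]
held_out = ["m2", "m1", "m8"]
score = recall_at_k(ranked, held_out, k=3)
print(score)
```

Only the order of the list matters here, which is why a method that cannot predict ratings can still be evaluated this way.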

I recently stumbled on an interesting paper: Alternating Least Squares for Personalized Ranking by Gabor Takacs and Domonkos Tikk, from RecSys 2012 (which has been covered here), and was happy to find a paper with results I could compare to (often, either the metric or the dataset is at odds with what I can produce/access).

The experiment

So I ran NC-ISC on the same dataset they used in the paper (keeping only the top ratings), which reduces the training set to 22M ratings and the probe set to 400K ratings. At first the results were very disappointing. The main reason for this is that NC-ISC inherently discards any popularity bias. And the fact is, ranking movies is a very popularity-biased problem (below we can see that 15000 movies have very few ratings in the test set, while a few movies have 12000 ratings):

[Figure: distribution of the number of ratings per movie in the Netflix test set]

However, while NC-ISC cannot perform rating prediction natively, and discards popularity in both its learning and its ranking, it’s quite easy to put popularity back into the learning, and back into the ranking. Putting it back into the ranking simply means multiplying the score produced by NC-ISC by each movie’s popularity. Putting popularity back into the learning means decreasing the unbiasing factor in NC-ISC.
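The ranking half of this can be sketched in a few lines. Here “popularity” is assumed to be the number of training ratings per movie, which the text does not specify, and the scores are invented stand-ins for what NC-ISC would produce.

```python
import numpy as np


def rerank_with_popularity(scores, rating_counts, top_k):
    """Multiply each score by the movie's popularity, then return the top_k indices."""
    weighted = np.asarray(scores, dtype=float) * np.asarray(rating_counts, dtype=float)
    return np.argsort(-weighted)[:top_k]


# Made-up unbiased scores for four movies, and how often each was rated.
scores = [0.9, 0.5, 0.8, 0.1]
rating_counts = [10, 1200, 300, 12000]

top = rerank_with_popularity(scores, rating_counts, top_k=3)
print(top.tolist())
```

With counts this skewed, the popularity term dominates the product, which illustrates how strongly popularity-biased the ranking problem is.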

The results

And here we are (I’ve added my results as an overlay on the original paper’s figure). Just a disclaimer: while I’ve reproduced the results of the popularity baseline, I didn’t have the time to check the implicit ALS results (I’ve got an implementation here in Octave that takes hours to run), so I’m not 100% confident I haven’t missed something in the reproduction of the experiment.

[Figure: Recall@50 on Netflix, original paper figure with NC-ISC results overlaid]

I’ve added a mark for 100 features, which wasn’t in the original paper, probably because their results tended to stop improving.

Performance

A short note about NCISC performance versus ImplicitALS, RankALS and RankSGD. First, while SGD takes on average 30 iterations to converge, ALS and NCISC take between 5 and 10 iterations.

Regarding iteration time, while Takacs and Tikk give their timings for Y!Music data, I could only work with the top-ratings Netflix dataset. However, the two datasets have approximately the same number of ratings (22M) and approximately the same total number of elements (17700 + 480189 = 497889 for Netflix, 249012 + 296111 = 545123 for Y!Music).

Finally, I’ve run my unoptimized algorithm using 1 core of a Core i7 980 (cpumark score 8823 for 6 cores), while they used 1 core of a Xeon E5405 (2894 for 4 cores) to run their unoptimized RankALS and ImplicitALS and optimized RankSGD algorithms. The rescaling factor to account for the processor difference is thus almost exactly 2 (!)
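The rescaling factor follows from the two benchmark scores, under the assumption (mine, for the sake of the estimate) that the score divides evenly across cores.

```python
# Benchmark scores and core counts quoted in the text.
i7_980_score, i7_980_cores = 8823, 6
xeon_e5405_score, xeon_e5405_cores = 2894, 4

# Assumption: the multi-core score splits evenly across cores.
i7_per_core = i7_980_score / i7_980_cores
xeon_per_core = xeon_e5405_score / xeon_e5405_cores

rescaling_factor = i7_per_core / xeon_per_core
print(f"{i7_per_core:.1f} / {xeon_per_core:.1f} = {rescaling_factor:.2f}")
```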

Here are my rescaled timings and the original timings of the other methods (remember it’s not an optimized version, though):

method        #5   #10   #20   #50   #100
NC-ISC         1     2     4    10     25
ImplicitALS   82   102   152   428   1404
RankALS      167   190   243   482   1181
RankSGD       21    25    30    42     65
RankSGD2      26    32    40    61    102

 Time (seconds) per iteration for various # of features.

Speedups range from x3 to x167 per iteration (for SGD you should account for the increased number of iterations required). I’ve generally observed that my optimized implementations of NC-ISC run 10 to 20 times faster than my Matlab implementation.
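The per-iteration speedups can be recomputed directly from the timing table. The dictionary below just restates the table; nothing else is assumed.

```python
features = [5, 10, 20, 50, 100]
seconds_per_iteration = {
    "NC-ISC": [1, 2, 4, 10, 25],
    "ImplicitALS": [82, 102, 152, 428, 1404],
    "RankALS": [167, 190, 243, 482, 1181],
    "RankSGD": [21, 25, 30, 42, 65],
    "RankSGD2": [26, 32, 40, 61, 102],
}

reference = seconds_per_iteration["NC-ISC"]
speedups = {
    method: [other / ours for other, ours in zip(times, reference)]
    for method, times in seconds_per_iteration.items()
    if method != "NC-ISC"
}

for method, ratios in speedups.items():
    row = ", ".join(f"#{f}: x{r:.1f}" for f, r in zip(features, ratios))
    print(f"{method}: {row}")

all_ratios = [r for ratios in speedups.values() for r in ratios]
print(f"range: x{min(all_ratios):.1f} to x{max(all_ratios):.0f}")
```

The smallest ratio comes from RankSGD at 100 features and the largest from RankALS at 5 features.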

eXenSa / Pivot / eXenGine

I just realized that I haven’t posted a single word about the fate of our SalesAdviser product (SaaS recommendations for e-commerce). I had planned to publish a video of me explaining why we failed at it, but I don’t think I’ll have time to devote to this anytime soon. So let’s write it here.

We decided to stop the SalesAdviser product in July 2014 because it didn’t catch on commercially. The main reasons for this failure are:

  • Too much focus on the technology for a business that wasn’t so technological: NC-ISC is great stuff, no doubt about that, but if you have the best engine in the world, maybe you shouldn’t try to sell cars to elderly people. They don’t care about the technology, they don’t understand it, and they won’t use it to its potential.
  • No beta-test customer / no customer in the direct network. We had a few indirect connections with e-commerce people, but nobody directly involved in using/testing our service to help us drive it in the right direction.
  • A business model that seemed to hurt our customers’ feelings: one of the great ideas (we thought) for SalesAdviser was to be paid a percentage of the sales made through the recommendations. And I think “percentage of the sales” is a repellent for people in e-commerce.

Whatever the real reasons are, we didn’t make it with this project. So we decided to pivot toward an “engine maker” business model, and we started by doing some consulting, and writing a complete data processing workflow with NCISC at the core.

It took us between mid-October and mid-February to reach the beta status for eXenGine and deploy a demo with the wikipedia analysis.

Now we have our first customers, and we are looking forward to new ones who want to use a better engine than all their competitors.

More features, more results on wikipedia

While we are working to make eXenGine more efficient, more expressive and, more specifically, more sellable, I keep playing with the results of NCISC on Wikipedia. Here is an example of the neighbours of GenSim (a well-known Python toolkit for text mining/natural language processing by Radim Řehůřek), this time on the document/word model with 90 features:

  • OpenNLP : 0.9343325
  • Semantically-Interlinked Online Communities : 0.92996
  • GML Application Schemas : 0.9280249
  • XML Metadata Interchange : 0.926392
  • Languageware : 0.924556
  • Visual modeling : 0.92413026
  • GeoSPARQL : 0.9231725
  • Clairlib : 0.92301136
  • Heurist : 0.92240715
  • VisTrails : 0.9223076
  • Embedded RDF : 0.92183596
  • NetworkX : 0.9217739
  • UIMA : 0.9217461
  • Software Ideas Modeler : 0.92173994
  • List of Unified Modeling Language tools : 0.9215104
  • UModel : 0.92115307
  • SmartQVT : 0.9210423
  • Linked data : 0.9207388
  • Natural Language Toolkit : 0.92064124

It’s far from perfect, one major cause being that we still don’t use bigrams or skip-grams in the input transformation (i.e. we use plain old style bag of words), but it clearly shows the power of NCISC.
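For readers who want to produce this kind of neighbour list from their own latent vectors, here is a minimal sketch. It assumes the scores are cosine similarities between latent vectors, which the lists above suggest but do not state; the page labels and vectors are invented.

```python
import numpy as np


def nearest_neighbours(query, names, vectors, top_n):
    """Return the top_n (name, cosine similarity) pairs closest to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[names.index(query)]
    order = [i for i in np.argsort(-sims) if names[i] != query][:top_n]
    return [(names[i], float(sims[i])) for i in order]


# Four made-up pages with 2-D latent vectors (a real model would use e.g. 90 features).
names = ["A", "B", "C", "D"]
vectors = np.array([[3.0, 4.0], [6.0, 8.0], [4.0, 3.0], [-4.0, 3.0]])

neighbours = nearest_neighbours("A", names, vectors, top_n=2)
for name, similarity in neighbours:
    print(f"{name} : {similarity:.4f}")
```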

You can compare with the results provided by LSA using GenSim itself: in this post, a comment gives the top 10 results of a 500-feature model trained on Wikipedia:

  • Snippet (programming)
  • Prettyprint
  • Smalltalk
  • Plagiarism detection
  • D (programming language)
  • Falcon (programming language)
  • Intelligent code completion
  • OCaml
  • Explicit semantic analysis

When using the document/document interlink model (90 features), we obtain different, very good results:

  • Jubatus : 0.98519164
  • Scikit-learn : 0.98126435
  • Feature Selection Toolbox : 0.97841007
  • Structure mining : 0.97700834
  • ADaMSoft : 0.9755426
  • Mallet (software project) : 0.97395116
  • Shogun (toolbox) : 0.9718502
  • CRM114 (program) : 0.96751755
  • Weka (machine learning) : 0.96635485
  • Clairlib : 0.9659551
  • Document retrieval : 0.96506816
  • Oracle Data Mining : 0.96350515
  • Approximate string matching : 0.96212476
  • Bayesian spam filtering : 0.96208096
  • Dlib : 0.9620419
  • GSP Algorithm : 0.96116054
  • Discounted cumulative gain : 0.9606682
  • ELKI : 0.9604578
  • NeuroSolutions : 0.96015286
  • Waffles (machine learning) : 0.9597046
  • Information extraction : 0.9588822
  • Latent semantic mapping : 0.95838064
  • ScaLAPACK : 0.9563968
  • Learning Based Java : 0.95608145
  • Relevance feedback : 0.9559279
  • Web search query : 0.9558407
  • Grapheur : 0.9556832
  • LIBSVM : 0.95526296
  • Entity linking : 0.95325243

Much better than the 40-feature version displayed on our demo site:

  • Java GUI for R : 0.995283
  • Decision table : 0.99500793
  • Programming domain : 0.9946934
  • Scikit-learn : 0.994619
  • Feature model : 0.9941954
  • Dlib : 0.9938446
  • Pattern directed invocation programming language : 0.9937961
  • Compiler correctness : 0.9937197
  • On the Cruelty of Really Teaching Computer Science : 0.99280906
  • Structure mining : 0.99274904
  • Hidden algebra : 0.99265593
  • OMDoc : 0.9925574
  • Institute for System Programming : 0.99253553
  • Radial tree : 0.9924036
  • Partial evaluation : 0.9922968
  • Jubatus : 0.99167436
  • Query optimization : 0.9916388
  • Effect system : 0.99141824

 

Improvement and correction over last post

I made a big mistake in my last post about our improved results: the precision/recall curves of our experiments were in “gain” mode, that is, the curves tend toward zero as they approach the precision of a random selection. As a result, they finished near zero instead of finishing at about 5% precision (which is what one should obtain for a 20-class problem with balanced classes). So our actual results are even slightly better than what you saw in my previous post.

More importantly, I’ve included here another variant of semantic knowledge injection, this time using bigrams and trigrams in addition to normal words (for instance, “mother board” is now considered a single vocabulary item) – the bigrams and trigrams are selected automatically.

The vocabulary size climbs from about 60K to 95K, with many grammatical constructs (“but if” and such), and the results are quite amazing : an improvement of precision of 4-5% on the left, and about +10% precision toward the 0.1 recall mark.
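How the n-grams are selected is not described in this post. As a stand-in only, the sketch below uses a plain frequency threshold on adjacent word pairs and merges the selected pairs into single vocabulary items; the function names, threshold and toy documents are all hypothetical, and only bigrams are handled.

```python
from collections import Counter


def frequent_bigrams(documents, min_count):
    """Adjacent word pairs seen at least min_count times across all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Greedy left-to-right pass that replaces selected pairs by one item."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


documents = [
    ["the", "mother", "board", "failed"],
    ["replace", "a", "mother", "board"],
    ["the", "board", "met"],
]
selected = frequent_bigrams(documents, min_count=2)
merged = merge_bigrams(documents[0], selected)
print(merged)
```

A purely frequency-based rule like this one would also pick up very common grammatical pairs, which is consistent with the “but if” constructs mentioned above.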

This is probably a more important result than what we’ve achieved with class knowledge injection, since class knowledge is rarely available in real world problems (in the case of product recommendation, you don’t know the class of your users, for instance).

The fact that our method, using only inner knowledge, can outperform, up to the 2% recall mark, the Deep Autoencoders using class knowledge, is probably more important than any result I could have obtained using only class information.

[Figure: precision/recall curves for NC-ISC]

Performance of NC-ISC on Ohsumed

Ohsumed is a well-known dataset consisting of 34389 documents classified among 23 classes (each document can belong to several classes). The documents use a vocabulary of 30689 words. The dataset split is a random 80/20%. I’ve used the dataset provided on Alessandro Moschitti’s corpora webpage, and more exactly the preprocessed data available on the Rate Adapting Poisson model site. Please note that I have not managed to reproduce the results (for LSI and TF-IDF) from The Rate Adapting Poisson (RAP) model for Information Retrieval and Object Recognition – Peter V. Gehler, Alex D. Holub and Max Welling, ICML 2006. A short email discussion with P. Gehler did not help much to understand the source of the difference. Anyway, the most interesting part resides in the difference between the raw data (TF-IDF) and the processed data.

Here are my results :

[Figure: results on Ohsumed]

If you look at the results of the RAP paper, you’ll see that the precision of TF-IDF tops out at 0.84 and goes down to 0.66, while mine tops out at 0.88 and goes down to 0.59. My methodology for counting an “answer” as good or bad is this: good if at least one class of the candidate is also a class of the query document, bad otherwise. It is either one or zero, no in-between. Comments about this are really welcome.
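The judgment rule described above can be written down in a few lines. The class labels and the candidate list are made up for illustration, and the function names are mine.

```python
def is_good_answer(query_classes, candidate_classes):
    """1 if the candidate shares at least one class with the query, else 0."""
    return 1 if set(query_classes) & set(candidate_classes) else 0


def precision(query_classes, ranked_candidates):
    """Share of good answers among the retrieved candidates."""
    if not ranked_candidates:
        raise ValueError("need at least one retrieved candidate")
    judgments = [is_good_answer(query_classes, c) for c in ranked_candidates]
    return sum(judgments) / len(judgments)


# One query document with two classes, and the classes of four retrieved documents.
query = {"C04", "C14"}
retrieved = [{"C14"}, {"C23"}, {"C04", "C06"}, {"C10"}]
value = precision(query, retrieved)
print(value)
```

Because documents can belong to several classes, this rule is more lenient than requiring an exact match of the class sets, which may be one source of difference between published numbers.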

My other neighbors

Ok, I’ve got a lot of things to show you, but unfortunately, not much time to prepare the data. Here is another comparison between LSA and my method (NC-ISC), for the neighbors of the word “200”. I know it’s not really a word, but at least it is easier to sort out the good neighbors from the bad ones with this kind of word, right?
So here is the list: on the left the neighbors found by my method, on the right those found by LSI (or more precisely my faster implementation of LSI). I’ve highlighted the indisputable numbers/quantities in green (I’ve been pretty liberal, though), and the indisputably unrelated words in red. Related but unsure words were left unhighlighted. You can dispute any word, however, if you think I was wrong: my judgment is far worse than my system’s.

For LSI, the first completely unrelated word appears at 165, versus 427 for NC-ISC. In the last 100 neighbors, 11 are good candidates for LSI (some of them are really borderline, you judge), versus 32 for NC-ISC.

These neighbors were obtained with the exact same method as in my previous post. No internal knowledge of the words’ composition was used (that would be cheating for such a problem).

By the way, thanks to my method, I’ve learned a new word : umpteen. It seems that LSI didn’t find it as a potential neighbor for “200”.

NC-ISC LSI (ISC)
1 200 200
2 500 300
3 300 500
4 250 400
5 600 250
6 400 600
7 100 100
8 150 150
9 700 800
10 350 700
11 800 1000
12 120 350
13 1000 120
14 900 5000
15 60 900
16 450 450
17 140 1500
18 80 0
19 5000 750
20 70 140
21 50 4000
22 240 3000
23 850 650
24 750 160
25 40 850
26 550 7000
27 0 550
28 160 170
29 hundred 2500
30 170 50
31 180 180
32 thousand 240
33 130 1300
34 4000 sq
35 1500 80
36 230 thousand
37 fifty 130
38 30 1200
39 3000 hundred
40 720 60
41 7000 1400
42 1200 8000
43 125 3500
44 280 70
45 8000 million
46 2500 hundreds
47 560 125
48 90 40
49 75 560
50 35 110
51 220 2200
52 110 260
53 hundreds 90
54 forty 1600
55 fifteen 2400
56 680 280
57 65 thousands
58 seventy 680
59 270 230
60 ten 75
61 million millions
62 sixty 220
63 tens 10000
64 twenty 210
65 20 175
66 thirty cubic
67 50 460
68 420 480
69 225 2300
70 1300 billions
71 480 100 000
72 260 30
73 eighty 370
74 3500 950
75 960 360
76 100 000 275
77 360 380
78 570 1700
79 seven 1150
80 370 270
81 175 billion
82 330 225
83 275 330
84 eight 490
85 twelve fifty
86 45 50
87 five 320
88 460 820
89 210 960
90 10000 580
91 nine 390
92 1100 530
93 190 430
94 six 1100
95 sixteen 20
96 430 520
97 thirteen 135
98 950 620
99 650 6000
100 fourteen 190
101 ninety 770
102 millions trillion
103 375 310
104 440 860
105 15 440
106 660
107 1400 pmol
108 eleven 2000
109 48 9000
110 640 11m
111 224 tens
112 thousands 340
113 Fifteen 760
114 114 ½
115 10 µmol
116 dozens 730
117 880 per
118 Six
119 36 928
120 Thirty 625
121 620 980
122 530
123 156 315
124 930
125 770 880
126 Sixteen 25
127 Ten 265
128 980 610
129 72 ng
130 Five 640
131 2400 475
132 820 dozens
133 135 10
134 185 65
135 eighteen 670
136 172 690
137 340 375
138 seventeen 920
139 Fourteen 1800
140 Fifty forty
141 64 cu
142 540 40m
143 290 UNKNOWN
144 25 ¼
145 four 10m
146 470 2100
147 490 710
148 42 45
149 320 720
150 Seventeen
151 35
152 132 95
153 Thirteen Million
154 38 420
155 47 500m
156 66 Hundred
157 Four 325
158 115 twenty
159 55 675
160 660 1250
161 890
162 Three 425
163 63 60m
164 690 145
165 144 Tsu
166 Sixty 630
167 780 thirty
168 Eleven 185
169 Eighteen 250m
170 108 395
171 Eight pg
172 165
173 3 000 3 000
174 24 195
175 dozen 1m
176 Two 156
177 625 seventy
178 105 470
179 323 830
180 580 236
181 255 410
182 85 IU
183 18 sixty
184 134 kilo
185 Twenty 990
186 325 15m
187 155 ten
188 syllabic
189 three 478
190 106 µg
191 sq 165
192 186 570
193 422 212
194 265 numbers
195 825 970
196 154 725
197 112 510
198 cubic frac13
199 12
200 Twelve 290
201 37 pesetas
202 145 Macoumba
203 33 Fundacio
204 14 Alessandria
205 Nine 840
206 34 429
207 84 fifteen
208 58 5m
209 168 GREENPEACE
210 Forty 875
211 46 24
212 102 100000
213 121 Ludovit
214 740 FRINGE
215 630 bakings
216 Fewer 6m
217 13 MILLION
218 billions 132
219 trillion 647
220 57 yuan
221 380 Pilmé
222 39 Getachew
223 16 Deployment
224 410 780
225 22 114
226 32 mg
227 327 25m
228 174 eighty
229 umpteen francs
230 710 Neally
231 365 FEARFUL
232 28 burble
233 333 115
234 billion Varsha
235 920 Scotches
236 432 McClanahan
237 670 reducibility
238 195 Familiale
239 1250 Accounts…
240 273 combo
241 number 450g
242 41 roack
243 118 objectifications
244 couple kukri
245 208 355
246 Seven Habara
247 131 foody
248 2200 Aeroengines
249 77 Hatu
250 62 Elicitation
251 128 Arigoni
252 17 silicite
253 990 Rodent
254 760 16 March 2005
255 610 174
256 78 164
257 171 schoolchilden
258 52 neuropsychologically
259 369 Mrs Hall
260 349 scorch
261 two Worpington
262 754 Electroacupuncture
263 ½ Checkers
264 Several Kanezo
265 297 Feakle
266 520 Curtlingside
267 196 623
268 54 343
269 123 15
270 136 PURPOSEFUL
271 88 Kjellstrand
272 100000 ILPEC
273 545 17m
274 few mmol
275 453 540
276 116 DRS
277 334 dollars
278 157 3bn
279 326 heah
280 68 bodyplans
281 488 Aggressive
282 940 ‘legislative
283 146 Judicature
284 266 Delight…
285 204 CFRS
286 830 benserazide
287 73 2m
288 970 reconaissance
289 95 MASSINGBERD
290 fewer Azinger
291 56 lizzies
292 930 Hooraydom
293 ¾ bovril
294 395 549
295 860 Theognidea
296 890 Monstrosities
297 67 Democratically
298 Ninety Per
299 295 MPhys
300 cu l919
301 Eighty 12VDC
302 53 actio
303 256 SHE…
304 390 PUERTO
305 381 ‘Restructuring
306 104 WILLNER
307 51 unrebellious
308 74 Ciriaco
309 Hundred steaming…
310 178 DEPC
311 425 CUTE
312 324 niobium
313 several Eicken
314 465 DEPLOYMENT
315 268 Poulsen
316 810 NED
317 875 farme
318 549 Fanelli
319 122 contemporaneo
320 478 litres
321 614 Sutker
322 2300 reconfirmation
323 310 DDSI
324 385 105
325 43 Fewer
326 124 Kulturmuhle
327 192 compo
328 69 Aberration
329 19 353
330 82 Trivulziana
331 26 Oultram
332 227 Hurdles
333 475 NIPPV
334 164 Kalmu
335 many fighting for
336 317 Djer
337 575 slctd
338 138 pbk
339 21 HCo
340 27 GNCTU
341 61 copyrights
342 336 pp69
343 259 overrepresentation
344 numbers ingrafted
345 76 ¾
346 per Quand
347 44 nonhegemon
348 512 Ings
349 732 Caesar…
350 142 WeeKly
351 86 vexes
352 1600 swingy
353 126 Novgorodsev
354 999 heteromorphy
355 169 Bouffant
356 handful Wyre
357 113 Twentysomething
358 2 Mondi
359 445 Fastnets
360 49 Skadar
361 485 seacliffs
362 590 QualityTeam
363 79 Maystadt
364 184 FOZZ
365 525 forminant
366 139 ‘Trust
367 107 productivity…
368 11 Emptor
369 Many Bergstrand
370 31 organum
371 103 Fortunes
372 456 21sst
373 372 614
374 148 newletter
375 29 happiness…
376 pesetas 430905
377 198 Elongated
378 ¼ Charolais…
379 153 Carolana
380 167 WCM
381 243 poireau
382 frac13 HRBMOA
383 nineteen arabicum
384 443 synnoetics
385 840 heatedly
386 215 Davros
387 235 Bundesnachrichtendienst
388 345 300m
389 355 wicca
390 87 Schiemann
391 267 Assante
392 1150 Panitch
393 152 imparters
394 Seventy Hueber
395 133 Staart
396 147 Ponce
397 grams BUTLIN
398 257 45753
399 Few urophthalmus
400 231 LIMAVADY
401 188 Crina
402 Some Diprotodon
403 461 Clune
404 474 Caradog
405 870 WONKY
406 287 posessing
407 288 horseshoes
408 quarter Thornett
409 137 decolonisation
410 tenths 506bn
411 All 131
412 81 RAIS
413 399 Parozzo
414 247 Camesi
415 thirds ASLEF
416 161 Stockmayer
417 countless opera…
418 1m ALton
419 462 MONSTROUS
420 127 archduke
421 9000 40m²
422 0 Munding
423 11m Troup
424 59 hatinhensis
425 313 Brunmayr
426 101 Shiban
427 evacuate Masheke
428 557 discographer
429 366 1340s
430 358 unsold
431 477 Marleigh
432 283 IRONIC
433 143 depletes
434 109 warthog
435 99 Phillies
436 francs 142g
437 294 Wodemed
438 149 Quix
439 244 l906
440 458 quadrathon
441 SEVEN Eddings
442 6m discorporate
443 441 Mr Yáñez
444 285 mesenteric
445 98 Lumley…
446 83 businesswise
447 23 RSPP
448 counting Defreyn
449 5 37318
450 among Patches
451 94 the new
452 472 MultiWorks
453 111 institutionalization
454 8 cretaceous
455 675 Dosedel
456 183 Bornfreund
457 209 Azeri
458 548 ‘bridge’
459 amongst Scoope
460 119 sacrementality
461 majority Kibble
462 307 megawatts
463 141 740
464 5m 940
465 pmol neurobiology
466 205 Killermont
467 60m Jehu
468 75p vookodlaks
469 129 Codmans
470 159 climsy
471 unarmed JS
472 228 877
473 412 Versus
474 FOUR Medlicot
475 202 MARGATE
476 218 Informalism
477 318 EUR 307
478 162 Esbjerg
479 189 vis
480 666 Yetminster
481 795 Periglacial
482 181 criminal…
483 500m Tamaris
484 269 activi
485 kilo Ache
486 261 zig
487 Most HP15
488 9 Contravention
489 8m 7368
490 495 pigdog
491 315 Munchkin
492 467 HyperStream
493 633 327
494 316 586
495 468 WES
496 301 sailskiffs
497 underpaid Daaad
498 455 55mm
499 429 Gudent
500 277 GenBank
501 262 CONTRIBUTES
502 241 Upadhyay
503 216 Safranski
504 389 CAPITALS
505 499 Carbofuran
506 359 apostasy
507 335 198741
508 222 ‘undertake
509 308 432
510 plus Nescafe
511 remaining Jaen
512 cents Dacorum
513 298 COMPOTE
514 Those Stegs
515 Nineteen Petunia
516 litres gergasites
517 585 Poetry…
518 647 Discussants
519 309 795
520 92 towndwellers
521 877 safetly
522 2100 Bodenham
523 217 p64
524 centimetres ligate
525 211 IR81
526 442 Lasts
527 FIVE egoistical
528 89 Callosum
529 245 Willkommen
530 other Sesto
531 254 Frederikstraat
532 312 Satcoms
533 Other Khubthi
534 435 Engwirda
535 pence Wierdsma
536 353 KEF
537 364 deathshead
538 recruit managerialism
539 203 Lamarckenis
540 24 23220
541 registered Wilcocks
542 unsold proabortion
543 538 bussed
544 236 pregnancies…
545 6000 Partan
546 446 viewcam
547 487 TANNETT
548 282 PREZIOSE
549 925 M67
550 212 hymophylus
551 fewest Claro
552 253 1 000 amendments
553 innumerable palmettes
554 436 enchantments…
555 605 aqualungs
556 96 Litchet
557 45p Hobor
558 296 822937
559 1800 GUMS
560 623 DISCO
561 3m Clobbered
562 displaced gamey
563 THREE Stimson
564 individual Molua
565 6 LIDBA
566 626 Swingland
567 278 superspy
568 7 forwardness
569 15p pence
570 consecutive 999
571 449 Taraki
572 2m sovkhoz
573 237 despairingly…
574 187 grams
575 367 Wallacean
576 guineas sellability
577 272 Lymphomatous
578 176 impersonate
579 7m vrroom
580 sending Nekemte
581 extra Egyptologists
582 173 punishingly
583 pg metaborate
584 219 Arnolds
585 proportion Won
586 1700 Perikli
587 331 overmorphinisation
588 928 Buitoni
589 plight metric
590 229 Urchin
591 numerous Daiichi
592 705 Benumbed
593 Among Maravilla
594 Individual hakweed
595 574 Aloisia
596 158 Thoc
597 exodus Schönborn
598 91 Borringdon
599 Whadcoat riot…
600 half premonitionary
601 shillings Llanfairpwll
602 393 Schantz
603 213 Sagna
604 536 brittling
605 454 RLL
606 those UNFPA
607 3bn brare
608 221 Anghiari
609 531 1666mph
610 TWO RNAs
611 CFA Modulyo
612 Numerous Irreverent
613 510 ABORTIVE
614 179 countless
615 milligrams Turbocharging
616 influx steins
617 343 NEUDESK
618 massacred 5
619 421 Rearmament
620 civilian faoil
621 millionth 16H
622 337 Spinnakered
623 repatriation Joyrider
624 382 636
625 7p Rps40
626 995 peintre
627 166 dramatherapy
628 all Strandhill
629 151 Bahriye
630 246 52085
631 MgCl weightiness
632 metric UNRG
633 60p Orace
634 hiring Currency
635 163 Narses
636 555 Melanocytes
637 730 autorepression
638 older 1470
639 pounds 295
640 slaughtering scatterbrained
641 319 naturalisation
642 selected Incise
643 117 chromomere
644 outnumbered SEDISI
645 562 Morua
646 419 maximands
647 311 Abbs
648 414 Clanwilliam
649 assorted Chickens’
650 Generally Ayra
651 DM1 propitiated
652 304 Esdec
653 457 Dabo
654 354 shamateur
655 4 Panathanaikos
656 attract chirp
657 shilling Keetmanshoep
658 715 Huttwil
659 405 delevopment
660 258 semidarkness
661 426 piffle
662 sell Dourif
663 kilowatt Nemerov
664 314 Mr Mejdahl
665 289 proprietorships
666 276 Jonglei
667 586 Characterized
668 scores 1550
669 MILLION Sisterson
670 glut Rowlinson
671 recruiting Kennaugh
672 274 Radically
673 417 enameller
674 14m Emmons
675 assembled Worksong
676 cent UNDCP
677 206 Coproduction
678 ordinary Bradypus
679 disabled lolly
680 248 Glase’s
681 outnumber Cinncinatti
682 388 Bayerischer
683 451 Slovaks
684 ferrying thoroughfares
685 517 Lothiah
686 µmol concious
687 699 Beltelecom
688 these Discordiale
689 293 Attanasio
690 x antisepsis
691 398 Luttman
692 surviving legitimising
693 339 36650
694 kg ‘standards’
695 shortlisted Secrétaire
696 97 Forgandenny
697 4m Ailsi
698 marshalling throughways
699 775 Pilce
700 444 kagoul
701 These Parasite
702 322 Blucher
703 250m 5kyr
704 send 538
705 kilograms Sylvio
706 EIGHT Pics
707 601 Albuhera
708 252 Kaufmann’s
709 199 ductile
710 71 Compañia
711 40m Bancario
712 376 Vigils
713 Older Chromex
714 338 Casely
715 424 RENNIE
716 shortage inclusivism
717 µg ESAG
718 418 salbutamol
719 starving price’
720 freeing polychromism
721 735 Angeleno
722 264 nematomorphs
723 344 Mr Santoro
724 232 Elandre
725 shelters Belau
726 448 Sagep
727 Interested ILLEGAL
728 employing ‘way
729 391 Woodeson
730 mistreatment biquaternions
731 0d 71000
732 533 7m
733 25p Spotless
734 347 laserballs
735 279 anteater
736 286 Senneh
737 expelling competitive – the
738 5 840555
739 413 Sartilly
740 fellow Lochassynt
741 lots 9094
742 involving Rande
743 Thousand LU1
744 8bn enuretics
745 617 delighteth
746 2p crick
747 725 colourations
748 263 499
749 xx viceless
750 attracted Gyrene
751 368 crutchless
752 paid Papple
753 15m Issachar
754 3 Ginevra
755 ranking Favart
756 smaller Enough…
757 protect Beechey
758 foreign 365
759 Kosovar SPP
760 299 Packed
761 Ugandan gesetzlicher
762 weighs cp16
763 409 Klinck
764 529 Burnout
765 40s 130pc
766 194 Seahorse
767 197 folastre
768 1450 floristics
769 576 tempeh
770 30m Shucard
771 hire Hicolour
772 separate woodfiller
773 disaffected TODD
774 selling Hellesdon
775 Younger ADPRT
776 allowing Tickel
777 attracting ripon
778 insuring Raarsa
779 25m Voorhies
780 various Duosome
781 accommodate DOONEY
782 685 zonkered
783 193 Dol
784 201 4PR
785 fifths languages’
786 equipping garda
787 massacre ALUM
788 2 RODDY
789 3 McNair
790 284 exception – and
791 bogus arcmin
792 supply 215
793 242 simpatico
794 forcing Luckhaus
795 378 400m
796 migrant indicción
797 371 Beschantnikh
798 374 190E
799 dollars Spen
800 thousandth saxed
801 TEN Kabilia
802 561 Viccei
803 Both georgoi
804 489 Eurom
805 Previous Sach
806 2000 resussitated
807 300m Atrypa
808 carrying Weighbridge
809 624 Glenann
810 devalued Disengagement
811 some MexTech
812 9m Ghandour
813 342 dehumanises
814 seamen Stetchworth
815 4d Laxolt
816 505 autoflight
817 75m 35m
818 successive Qila
819 Per Previati
820 636 class – the
821 volunteer Kutubu
822 allow frölich
823 enable CASCADE
824 637 Retainer
825 768 aid…
826 mostly 14282
827 239 ravage
828 403 Quadraphony
829 Experienced Emiliani
830 generation adpressed
831 pou Oxygene
832 363 Outlaws
833 µ greyscales
834 km scheme’s
835 counterfeit francaise
836 249 coreceptors
837 placing take’
838 508 Rhie
839 SIX Pentathlon
840 urging unlatch
841 35m oozy
842 50p Bilancetta
843 nought RS
844 handicapped square
845 luring freedwomen
846 ng Entertain
847 397 Chairmanships
848 buying rupees
849 226 Teachta
850 303 multinucleate
851 433 720
852 fours TEXTB
853 177 INR1
854 including bladdered
855 educate Supercard
856 fide spirtu
857 628 aîné
858 FEW subdeacon
859 refit iubebunt
860 30p 2567
861 needy Wordmarc
862 preponderance Radomiro
863 Chinese Mynours
864 Sending Ferrovie
865 between DG Enterprise
866 rupees 8931977
867 younger STUN
868 escorting JURA
869 expulsion Arrigo
870 476 Thinkabout
871 overseas Sobers
872 enables Petrakis
873 eligible beruhrt
874 slaughter Directive 89
875 Smaller Astakakalij
876 911 Ahs
877 sold patronize
878 widows mulattas
879 approximately Go1
880 stealing ‘stealth
881 348 Vanbutchel
882 DM CONCUSSION
883 10m 75 million
884 498 20m
885 list Narrations
886 includes Forehead
887 defenceless Ayissi
888 incoming Uhlig
889 452 Hospices
890 281 disipal
891 93 Weenie
892 alternate Hibiya
893 0 clothworkers
894 ohms 5×3
895 buys rearfoot
896 748 PNS
897 quarters chloridel
898 fives unaspiring
899 352 r7A
900 unemployed King Gyanendra
901 harassing Honiton
902 stateless meeker
903 millimetres instantiates
904 local encamping
905 both tryst
906 battleground Phromnia
907 missing Jacopi
908 retrain 60
909 mega spiffingly
910 GBP Pleated
911 Sinhalese Girod
912 milliseconds fireproof
913 416 Stoelting
914 NINE Matsuura
915 minority ignoring…
916 351 disparage
917 taxpayers’ Konalyck
918 imprisoning fretwire…
919 428 39071
920 394 Thurlby
921 range Manjil
922 available Huskey
923 532 5Mbps
924 whereby Kosovo – and
925 destitute Béatrix
926 271 Alave
927 excluding unled
928 untrained NgendabanyikwaPublic
929 411 7F
930 swap Schneidau
931 423 Haemorrhoids
932 indigenous Chenal
933 employ Truffe
934 sample anticlericalism
935 832
936 generations 37936
937 BEd km
938 series rights – physical
939 fined DISORDERS
940 Wolds Antapologia
941 214 Urania
942 transporting Polymethylmethacrylate
943 comprising Häider
944 bringing reimbursements
945 10p Manzù
946 292 donfirm
947 urged migrating
948 putting tosea
949 Currency Kettner
950 British Intaglio
951 524 Saez
952 homeless Ladyland
953 unoccupied Handcrafts
954 Certain C40103
955 intimidate Byne
956 404 archways
957 Potential Moorby
958 Asian kindergartener
959 400m Behaved
960 wealthier premelting
961 troublesome JULIO
962 493 brownings
963 attracts Hooded
964 377 yomp
965 24m Sherleys
966 sells esn
967 charging Juvenus
968 329 GARRY
969 sized Refolding
970 luckier Hrachya
971 4p Extensification
972 lists Tramezaïgues
973 Hooded Sahlin
974 615 enslave
975 innocent COLOURCODE
976 inundated shearography
977 recalcitrant Parabroteas
978 scattering Melocomic
979 penny SDR87
980 collection Kuttner
981 631 erinacea
982 additional fewer
983 402 NIMCO
984 hauling Minister Zenawi
985 visiting biggish
986 priced 3000lb
987 closing 4m
988 790 512
989 group mistreatment
990 young 1680
991 costing Nobber
992 387 Wishbone
993 567 reformlet
994 listing DNASTAR
995 bribed Balcombe
996 638 saccharine
997 471 Monteviot
998 sledges aulacogens
999 enabling 259p
1000 units superpoint
1001 686 manganiferous
1002 lire bovis
1003 Croats Pak
1004 buy productson
1005 intercepting oxidised
1006 Leading Koki
1007 sixes 9mJy
1008 729 Ovocný
1009 Separate Affane
1010 multiple abdicates
1011 17m Risi
1012 trained MARTON
1013 687 Knappertsbusch
1014 collected Concord
1015 707 carbocations
1016 Sahrawi mastics
1017 Larger Curit
1018 504 alleluias
1019 release Swankers
1020 pound DAYLIGHTS
1021 deter Coupons
1022 patrolling potsh
1023 persecuting Nedbank
1024 191 Bimbim

 

My (1024) neighbours are nicer than yours :)

In this post I show you a few (sort of) neighbours of “yellow”. I compare the neighbourhoods of the word vectors obtained with LSI (or, more precisely, my fast variant of LSI) and with my method, “NC-ISC”. The original cooccurrence matrix was obtained from a merge of the Europarl corpus and the BNC. No preprocessing has been applied to the corpus, not even lowercasing.

The vocabulary size is about 520000, and the matrix has been transformed with positive pointwise mutual information (PPMI).
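As an illustration of that transformation, here is a minimal sketch of PPMI on a tiny sparse count table. The dict-of-pairs representation and the function name `ppmi` are my assumptions for the example, not the code actually used for the 520000-word matrix.

```python
import math
from collections import defaultdict

def ppmi(counts):
    """Map {(row_word, col_word): count} to {(row_word, col_word): PPMI}.

    PMI = log( n(r, c) * N / (n(r) * n(c)) ); only strictly positive
    values are kept, which is what keeps the result sparse.
    """
    total = sum(counts.values())
    row_sum = defaultdict(float)
    col_sum = defaultdict(float)
    for (r, c), n in counts.items():
        row_sum[r] += n
        col_sum[c] += n
    out = {}
    for (r, c), n in counts.items():
        pmi = math.log(n * total / (row_sum[r] * col_sum[c]))
        if pmi > 0:
            out[(r, c)] = pmi
    return out

# Toy counts (invented for the example).
toy = {("yellow", "car"): 3, ("yellow", "the"): 1, ("of", "the"): 4}
weights = ppmi(toy)
```

In the toy table, ("yellow", "the") is dropped because the pair occurs less often than chance would predict, while ("yellow", "car") keeps a positive weight.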

The cooccurrence relation is simply “the next word”. Other relations could be used (including classical word windows), but I have chosen to show you this relation because it seems to be the most basic possible.
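To make the “next word” relation concrete, here is a small sketch that counts (word, next word) pairs. Whitespace tokenisation and the name `next_word_counts` are assumptions made for the illustration; the only thing taken from the text above is that nothing is preprocessed, so “yellow” and “Yellow” stay distinct.

```python
from collections import Counter

def next_word_counts(sentences):
    """Count (word, next_word) pairs, with no lowercasing or other preprocessing."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: plain whitespace tokenisation
        for left, right in zip(tokens, tokens[1:]):
            counts[(left, right)] += 1
    return counts

# Toy corpus (invented for the example).
counts = next_word_counts(["the yellow car", "the Yellow car", "the yellow bus"])
```

Each pair is one cell of the cooccurrence matrix: the row is the word, the column is the word that follows it.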

To sum up, the 520000*520000 cooccurrence matrix, containing about 25M entries, has been reduced to a dictionary of 192 features per word.

Here are the (long, sorry) lists obtained:

  • on the left, using my method (NC-ISC)
  • on the right, using the equivalent of a simple LSI (i.e. a thin SVD)

Things get a bit weird with LSI from approximately the 100th row. On the other hand, NC-ISC keeps listing colors and aspects (visual, then physical) for quite a long time. I’m not a native English speaker, so if you want to sort the “acceptable” neighbours from the bad ones, you’re welcome!
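For readers who want to reproduce this kind of list, here is a minimal sketch of ranking neighbours from a word-to-vector dictionary. Using cosine similarity for the ranking is my assumption for the example, and the three 2-dimensional vectors are invented; the real vectors have 192 features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def neighbours(word, vectors, k):
    """Return the k words whose vectors are most similar to the vector of `word`."""
    query = vectors[word]
    ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:k]

# Toy dictionary (invented values).
vectors = {"yellow": [1.0, 0.1], "red": [0.9, 0.2], "Hedgecock": [0.0, 1.0]}
top = neighbours("yellow", vectors, 3)
```

As in the lists below, the query word itself comes out at rank 1, since nothing is closer to a vector than itself.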

Rank NC-ISC ISC (LSI)
1 yellow yellow
2 red red
3 pink pink
4 blue blue
5 green purple
6 purple green
7 brown orange
8 white white
9 crimson brown
10 scarlet scarlet
11 coloured crimson
12 grey coloured
13 black grey
14 orange black
15 mauve golden
16 turquoise rainbow
17 shiny violet
18 golden saffron
19 lilac mauve
20 thick maroon
21 russet Pink
22 waxy bright
23 dyed pale
24 yellowish multicoloured
25 stained buff
26 greyish reddish
27 tawny peach
28 beige auburn
29 fluffy russet
30 magenta rosy
31 violet whitish
32 burgundy candy
33 striped emerald
34 faded waxy
35 maroon gold
36 amber fluorescent
37 emerald pearly
38 pale brownish
39 mottled gaudy
40 indigo dark
41 brownish cream
42 silky lilac
43 reddish ochre
44 peach amber
45 pearly darker
46 translucent Coloured
47 bright neon
48 trimmed silver
49 greenish Red
50 pinkish turquoise
51 silvery dun
52 wore tawny
53 patterned pearl
54 buff creamy
55 auburn flickering
56 scuffed striped
57 saffron incandescent
58 ochre pastel
59 rainbow Purple
60 shining spotted
61 satin magenta
62 gaudy apricot
63 dun gray
64 silken hyacinth
65 gold pinkish
66 flecked beige
67 bluish rufous
68 tattered faded
69 velvet sprouting
70 shaggy trailing
71 pleated grease
72 thin khaki
73 wearing White
74 whitish sooty
75 khaki lavender
76 dark peacock
77 beaded fawn
78 candy filigree
79 wears agarose
80 greasy sepia
81 downy ruby
82 woven Yellow
83 tulip colour
84 purplish burgundy
85 glistening sticky
86 silver Modotti
87 luminous deglamorised
88 soft Alesis
89 multicoloured starry
90 floppy luminous
91 veined flaming
92 iridescent chestnut
93 loose Silk
94 burnished fading
95 creamy ruddy
96 pastel silvery
97 embroidered ginger
98 hyacinth yellowish
99 crumpled zebra
100 lacy silk
101 edged eye
102 wear pencil
103 worn Blue
104 peacock Solenodon
105 silk Huntershill
106 thinning 3450
107 matching tinsel
108 lustrous iridescent
109 billowing Black
110 Pale silken
111 shades navy
112 tousled Vingoleuse
113 flowered Treve
114 frayed Airway
115 gloss plum
116 plaited reggaeish
117 lavender ignorers
118 gleaming ×30
119 navy lotus
120 tinted taffeta
121 oily pleated
122 fawn iris
123 rosy patterned
124 tinsel Smoltine
125 sported Informatics
126 necked Glengowan
127 Thick NUMEROSO
128 streaked Hedgecock
129 tan Chiropodist
130 ruby rust
131 speckled colouring
132 wavy indelible
133 brimmed TPS
134 waxed Scadplus
135 livid PORC
136 stripe AVERAGES
137 feathery indigo
138 fondant variegated
139 baggy spots
140 waterproof Nikši
141 spiky Funshineland
142 glossy Fencot
143 dusky lime
144 frothy tired…
145 bloodstained Hopgood
146 curly Claredon
147 swathed musk
148 trailing Bright
149 bushy beaded
150 taffeta purplish
151 brocade immunostaining
152 glowing Chalwyn
153 spattered 6U
154 pearl lemon
155 wispy youcallems
156 rufous Uists
157 fleshy slipcase
158 gray 350bhp
159 tinged toffee
160 chunky brighter
161 darker rayon
162 grubby glowing
163 smudged matt
164 toning Whiggism
165 frilly Savas
166 garish Bogakovsky
167 donned Pale
168 creased singlehandedly
169 quilted SHU
170 dripping cigarettes…
171 bleached Sherland
172 cream Shantytown
173 mulberry Ribman
174 wet embroidered
175 fluorescent Najaar
176 leathery fez
177 matt DELIMITING
178 sticky peaches
179 whiter halogen
180 ragged Leaves
181 shapeless Springvale
182 chestnut FFr37
183 cotton 37344
184 coarse Sponge
185 jewelled LCpl
186 dainty 95b
187 woolly shaped
188 shaped Saffron
189 enamelled ites
190 encrusted INTELLIGENTSIA
191 lace Southleaze
192 cropped peneplanation
193 leather kickers
194 soaked Sautoy
195 velvety Hardyman
196 ribbed Batcave
197 greying brayes
198 rumpled olive
199 plaid Milou
200 oatmeal FLICKERED
201 curling Dorada
202 sleeveless Jegsy
203 straw Grenside
204 frilled Conversative
205 collared Paraty
206 delicate Gartnavel
207 starched 12
208 oiled WENBAN
209 feather hobbyhorse
210 torn comest
211 damp lapis
212 smeared flame
213 flowery wheelback
214 gorgeous trunkal
215 buttoned Galatsaray
216 crepe BLIND
217 crinkled fiery
218 sleeved bluish
219 inky Nayakkar
220 knitted Yidjaru
221 spiked videotelephony
222 chequered Skerneside
223 sooty Woodall
224 paler RIOTING
225 tartan Focker
226 plain Assaf
227 enveloping Progressional
228 folds Chapmans
229 beading Brochwel
230 dappled satin
231 woollen uncleanly
232 clean Lemonhead
233 Wearing BOWN
234 matted paler
235 soiled oily
236 fringed Grey
237 underparts Vaches
238 tatty stridently
239 towelling Alternation
240 crisp Æthelingadene
241 reds whiter
242 ivory Jechan
243 denim harakiri
244 streaks Dominelli
245 blouses whol
246 pinks theistic
247 suede Channelling
248 milky Turania
249 necklaces OESD
250 heavy Gorbal
251 floral Manningtree
252 olive Huskison
253 beneath Ambrumenil
254 coats Sciskalsi
255 nylon methylguanine
256 shimmering governmentally
257 protective WOODARD
258 powdered patchiness
259 voluminous llah
260 hooded foliage
261 linen Yant
262 tweed CLIT
263 furry 9Sinclair
264 apricot Flint…
265 yellowing doubleheader
266 unbuttoned 4252
267 chipped quartz
268 thicker mulberry
269 feathered Lingerie
270 painted Cessnock
271 rubber 24X
272 opaque blond
273 soled showjumps
274 stiff colors
275 tufted Chencellor
276 flimsy downy
277 sodden Krankoor
278 ruddy HUH
279 belted Boehmer
280 wool modest…
281 bulbous Learner
282 tiny KATAHDIN
283 blond Caterall
284 ruffled Thoroughbred
285 chiffon Jangpura
286 threadbare cathexes
287 gauze Areca
288 polyester currant
289 dense dappled
290 muslin blackcurrant
291 drape POPULARITY
292 patchwork MAXCompress
293 padded Biz
294 scented dabbed
295 oversized unislamic
296 shirts insensiblement
297 ripped 40dB
298 Coloured uncultivatable
299 lightweight tonner
300 banded Tianawa
301 wax Photoionization
302 thickest hatchways
303 damask erth
304 scorched pumpkin
305 balaclava shining
306 dresses ovule
307 flashing COMMENTARIES
308 parchment Chegwin
309 gloves scented
310 wrinkled Weidhaas
311 corduroy Thornycroft
312 flecks Gulbot
313 scaly tunny
314 braided MAVRA
315 Thin 2844186
316 camouflage regularised
317 festooned Phonoctonus
318 length Mr Dmitri Rogozin
319 dusty Leskov
320 fur RELEVANCY
321 jersey RDP
322 wrapped 172ff
323 tangled interclonal
324 crochet horsehair
325 sleek FRAK
326 cardigan brushy
327 socks Pole…
328 straggly fundraisers
329 swirling 2872
330 jade rainbows
331 flaming wisps
332 rust laryngitis
333 patched Hyett
334 sepia BYRD
335 trousers xerophilous
336 sapphire Voxes
337 mourning Anrep
338 coated Hromada
339 fuzzy Heterosexism
340 dirty Atharva
341 pinned Dewell
342 flared BF397
343 fluted 42m
344 mandarin feather
345 gowns Shakeshaft
346 pubic DUNROSSIL
347 washable carless
348 peeling lacy
349 washed HCF
350 smooth Brenbachhof
351 breasted Axioms
352 tight Sarasvati
353 icing Macilwain
354 eggshell Arent
355 primrose TSh22
356 peaked Plus’
357 outsize Lackham
358 piece bloom
359 draped quantititative
360 cashmere Luczak
361 textured
362 plucked Leucadia
363 moss finmarchicus
364 stripes Campanella
365 glinting Spp
366 patches Rustenburg
367 wiped 601222
368 tights 34s
369 skirt tenetur
370 resplendent Munchmallows
371 coral Antenna
372 polished outcoming
373 tablecloths Gaeumannomyces
374 spotted Culross
375 wrinkle Thanquo
376 fine panopticon
377 breeches Multitude
378 stockings Porthmoina
379 oval OLLIER
380 greaseproof Adamek
381 pyjama weasel
382 calico SPECTROMETER
383 hues Shielding
384 weave dunnos
385 blotched Luddesdowne
386 oblong Colvern
387 drooping ‘State
388 plastic jewelled
389 jeans Alphatronic
390 colour 198os
391 longish ×11
392 rayon waives
393 slanting R150
394 lacquered FRENCHAY
395 bridal Shepperdson
396 fake Mr Victorino
397 fitting Midas
398 discoloured mGal
399 light garish
400 puffy wax
401 ebony ripstop
402 clinging INITIAL
403 Soft homotransplants
404 warm ‘enlargement
405 latex Fashionable
406 flannel connotated
407 flickering BARLASTON
408 crested unphysiologic
409 gilt recalcitrance
410 glittering overwhelmimg
411 slim Assimularum
412 tufts shaggy
413 neat procedurely
414 dampened plasmic
415 holly Huddersfield…
416 skimpy 357
417 napkins discoloured
418 covered traversers
419 azure Debriefed
420 quilts Chakri
421 battered Yeager
422 leaf Tanichthys
423 sheepskin shine…
424 foliage maintance
425 hessian kwoon
426 sparkling Dated
427 metallic Mallei
428 marzipan Grassmarket
429 wisps aparthied
430 knit pusilla
431 springy commmunity
432 tailored 30287
433 coat Pinguli
434 pitted Kouyate
435 blazer Atwater
436 ashen soul…
437 frosted reinstalls
438 hugging Esser
439 fading STUTTARD
440 bald RICO
441 underneath 15billion
442 cloudy TINGAY
443 rustling putrefaction
444 leggings mandrills
445 perfumed ameasure
446 Bright picking
447 neon overpriecd
448 filament aftershocks
449 caked 33min
450 straggling Rusu
451 cut reinterprets
452 shabby cetane
453 pair backslaps
454 bead 36403
455 cloth 35845
456 vests MDA
457 knotted holdful
458 smoothing Benzel
459 quartered noronhianum
460 cardigans Création
461 zipped airbrushing
462 mop 48
463 laced cherry
464 sequins Taglialatela
465 bulging McKENZIE
466 pile Franjiyah
467 pendants thicker
468 pullover Playle
469 sweaters Meanwood
470 ginger asymptotically
471 flowing Sephton
472 redness Perring
473 jerseys Gawn
474 twinkling imploded
475 globular consultat
476 yellows bossiest
477 bulky reinvesting
478 brushed Quarryman
479 weathered 9BY
480 patchy brioche
481 shirt Biceps
482 unsightly Athelstone
483 Matching milky
484 colours Labden
485 stuffed HAPPENING
486 rimmed 35006
487 bonnets Descents
488 shreds carburettors
489 drab ANNULLED
490 variegated stabilizers…
491 folded SRPS
492 grimy Shatalin
493 glint Frolunda
494 unwashed Precipitation
495 discarded Dyli
496 smoky becos
497 moistened Watson’s
498 pearls reigneth
499 supple cruisy
500 misty unprejudiced
501 laundered monologuing
502 bristling Arlingham
503 charred SAFARI
504 canvas Mulder report
505 bronze Leavism
506 plump trala
507 wallpaper Souchong
508 viscose pinniped
509 large Hubner
510 dabbed GERBER
511 waistband Eginhard
512 Leather repetivity
513 moist ple
514 tunic DosFax
515 pallid Nettos
516 anorak Cardmember
517 lighter 300µ
518 marbled polyester
519 heeled Wisk
520 beads Meyssac
521 sprouting Menial
522 earrings zest
523 beautiful korm
524 diamond Crumbleholme
525 fleece countryside…
526 soak hooded
527 blazing ILE
528 blonde gooks
529 Silk Burrells
530 unkempt clotted
531 sheets Vick
532 burnt Padric
533 furs inefficiency’
534 bikini TH
535 acrylic Gontcharova
536 polish fleecewear
537 hats Rougement
538 canary McMahons
539 dishevelled COWBOY
540 dove candle
541 glassy Osley
542 soaking gluconeogenic
543 daisies Abernant
544 tarred p603
545 luxuriant Musicale
546 Colours Creamstick
547 sprayed Sepolcro
548 proofed Isireli
549 greased Gomolzig
550 ribbons Tantric
551 ripping pitchshifter
552 ruff A698
553 petticoat Sellick
554 studded Kucera
555 slender forestalling
556 stray artound
557 dry blossom
558 borrowed Virginiamycin
559 lime dideoxynucleoside
560 dull Ablington
561 powdery Laniel
562 jackets introverts
563 fingering barne
564 impregnated wryed
565 charcoal Tablecloth
566 pencil shall…
567 lemon Luciers
568 fabrics sallow
569 lumpy XZ
570 tubers recommit
571 handmade Brueghelesque
572 fragrant NIA
573 tarnished colas
574 ringed Angouleme
575 christening Moston
576 plum heifers…
577 spiny Benthamism
578 rippling TURBOCHARGER
579 crushed tailplane
580 Tiny pristina
581 conical Matterson
582 spotless txt
583 scraps prizewinning
584 Nylon Phelps…
585 chintz SUSPECT
586 lotus RA2
587 rough Dauson
588 Armani Vistec
589 walnut differention
590 embossed Aaronis
591 scratchy sighed…
592 blackened recheck
593 clumps Conference…
594 daisy seablite
595 feathers Kalff
596 pierced Jalil
597 lily Reserved
598 rosette FITZ
599 moulded burns…
600 plumes Imlay
601 cherry Grind
602 buttons CRICKLADE
603 scrubbed Chironomids
604 tipped 35800
605 bog 35440
606 lined Menear
607 garlands Yonis
608 long Unaccustomed
609 flower Filtertrons
610 skins tangent…
611 sweater ScanWorX
612 wreaths Perón
613 big Shomron
614 tapered counterpointed
615 sparse birdfeeder
616 trails Watsonians
617 flaking Loofah
618 hoop Heihachiró0
619 looping CHOLANGITIS
620 flaring Spray
621 chrysanthemums plucked
622 jagged HAYES
623 puff COWGILL
624 framing Affreca
625 indelible stockyard
626 shoes Sew
627 glistened oppin
628 crescent equids
629 pelmet Pesth
630 aprons 829934
631 autumnal 38285
632 brighter Wrighting
633 shrivelled Sima
634 blossom Almeyda
635 incandescent reedbed
636 handkerchiefs Chis
637 jacket 0X14
638 sickly MiL
639 Pink homesteads
640 leaden aeriological
641 lurid brigged
642 spare tartan
643 softer Orgasmic
644 hazy Bosher
645 cloaks 3 billion
646 lipstick sunbronzed
647 ply M2000
648 hairy 4000rpm
649 heather monorails
650 filigree
651 flattened 6340
652 polishing 6308
653 uncut mee
654 sheen Chasserieau
655 flawless calabrese
656 mustard NYMR
657 carat Cionnaith
658 sandalwood 14g
659 greys preconditions…
660 lovely Jigsaw…
661 smart ciliary
662 colouring uncontestable
663 sleeve Lightup
664 small Hunterson
665 puffs colossuses
666 sagging Aviv11
667 browns ALUMNI
668 complexion prospering
669 sweatshirt Coenzyme
670 balding BRAC1
671 toffee Boeschen
672 trouser UV
673 decorated lilies
674 lilies Quaranta
675 berries ORGANIZATION
676 pants Discocam
677 paint Cuj
678 gleam mint
679 muddy nystatin
680 bodice Gerad
681 ubiquitous Cyzicus
682 shaven VILES
683 brittle Spaceway
684 cardboard Leff
685 skirts Glenochil
686 scruffy Brunell
687 splashed benzhydryl
688 jumper Urquiza
689 unopened Retrovirus
690 dusting Danielles
691 fresh Lorrazo
692 sew caringly
693 flamed antic
694 Liberty gns
695 ironed Eworp
696 slimy Choleitis
697 gilded flecks
698 vest lapel
699 tortoiseshell Waldachquelle
700 watery pensioned
701 sock Lotery
702 burning depanments
703 branched semaphore
704 winged fluffy
705 ivy cryonic
706 huge Biomimetic
707 stubby AEROBIC
708 petals VW
709 gnarled IL6b
710 pasty Glitterbest
711 iris enjoying…
712 flowers PERSPEX
713 checked hoes
714 slacks 930410
715 broken Vremya
716 embroidery Peneva
717 detachable storybook
718 mossy Cherryh
719 limp Balmbra
720 whiteness gorse
721 conspicuous Glossatella
722 bristle camps…
723 rubies Baverstock
724 prettiest Priester
725 tall Partlett
726 withered destabilized
727 shawls daisies
728 neckline spankers
729 emeralds skunktail
730 fiery cropmarks
731 collar transients
732 foil stedefast
733 toned coxite
734 stitched Batteux
735 filler gloss
736 ribbon swished
737 fat interdefinable
738 braid CATI
739 beard biomagnification
740 antique dab
741 honeyed velvet
742 fancy Value………………………
743 nondescript NDEP
744 spotty Gencer
745 sallow Tomme
746 jewel forties…
747 boots fleurs
748 rustle Républicain
749 scarf Kuldip
750 skin Gradu
751 threaded Emigration
752 Choose pea
753 wood Placid
754 horny Pfarrkirche
755 rusty Lidia
756 shedding Roncagliolo
757 nice Remeliik
758 bloodied Koupetschesky
759 waving wreaths
760 arching Bare
761 soggy towelling
762 squat snowbridges
763 flat Carita
764 spurs Boèce
765 proverbial Yasa
766 misshapen Sleeve
767 outer schlemiel
768 zebra Biddics
769 vinyl Ll
770 stout HepG2
771 grit Gwynion
772 squashed TYRANT
773 cuffs tossed…
774 jacquard Mikoian
775 rosewood poppies
776 trim interactants
777 skirted DECIDING
778 irises 9186
779 layers Micah
780 smelling BAZALGEE
781 inch Anu
782 thinner pencilled
783 brush LIPA
784 honey Kutnick
785 swarthy Catalyst
786 prickly Investronica
787 garland himeslf
788 sized Clarkes
789 thickness chiffon
790 claret ultramarine
791 sleeves Moraes’
792 fluttering Brantôme
793 dress Biella
794 chrome shiksas
795 tasteful Rosehearty
796 nape Neutron
797 blouse Ceterum
798 fetching Arabesque
799 shorn Altitudes
800 Cotton Classing
801 caps Basalts
802 snowy 6346
803 raincoat Skjelstad
804 absorbent Alken
805 sage 2174
806 wiry leopard
807 strawberry Nimr
808 elasticated crisis…
809 fabric cogitated
810 spindly Puanga
811 expanse McLeary
812 ornamented bel…
813 coiled vibromassage
814 Huge METABOLIC
815 snowdrops 70D
816 sunglasses SaO
817 paper KAZAN
818 stemmed Ikons
819 powder veche
820 snaking Munroists
821 halo lebanon
822 lazuli cruise…
823 pulling CONSUMPTION
824 bales chemcials
825 garnet tungsten
826 lapel Strijd
827 brushes OFI
828 silvered beylerbeyi
829 dislodged Lowdens
830 splashes Crossraguel
831 dim Anaxis
832 distinctive unornamented
833 worsted Jubbergate
834 bearing 1342
835 pieces grape
836 interlaced Strandley
837 smouldering Ns
838 elm HENCHER
839 spinning Kneecap
840 wreath flybar
841 blobs Toner
842 picking proscuting
843 bluebells 5569
844 shredded PERSHORE
845 thickening miocenia
846 Leaves Briancon
847 strip Terrorize
848 square praying…
849 spongy COSATU
850 clutching Mibenge
851 streaming hollies
852 frock Cerrej
853 foam Thirlmere
854 barred Rendition
855 helmet Brokes
856 succulent honeyed
857 roses theres
858 boned sweden
859 copper priately
860 humped 8TR
861 pristine Penylan
862 turban incon
863 wrap Bhadra
864 bunches hambre
865 patch divebombing
866 rims absi
867 hazel roses
868 untidy crackle
869 scarves obrigado
870 bare hea
871 poppy Fortrey
872 musk LIVID
873 wash Creamware
874 grained sunclocks
875 leaves Peep
876 coconut FILLON
877 murky Gerlaud
878 upturned Cotingidae
879 bottoms Connecting
880 straightening argu
881 finer persons’
882 tinder LLC
883 deep indistinctness
884 spray smellies
885 shouldered Convene
886 solid CLARIS
887 mahogany Sneh
888 neck schematized
889 plated Maplins
890 wizened Jain
891 tearing E408
892 double Upottery
893 layered Redskin
894 waistcoat Cuchinan
895 Dark lentoid
896 corkscrew Bowery
897 cone ARTscribble
898 colourless uncomfy
899 mane Oddfacts
900 cool famille
901 puckered Paritsa
902 invisible midis
903 thread ESSE
904 hair centaury
905 ink ‘stopping
906 submerged Palafox
907 giant Indegestion
908 sewing biffos
909 gilding chilli
910 rubbed oyster
911 straight Collingbourne
912 bamboo Chambord
913 quartz aec
914 pine Pectinid
915 decorative maximisers
916 tuck fatalist
917 miniature Tavola
918 pith morosus
919 polyurethane matriculation
920 tips ‘SMEs’
921 ghostly Yat
922 parched Mr Guardans
923 immaculate Bytlm
924 folding hinter
925 flinging Godowski
926 soapy Cottesmore
927 little puffer
928 threads HELLINCKX
929 frocks genitors
930 pressed specks
931 carved woven
932 porous Skeppargatan
933 bark Desforges
934 laurel certitude
935 inlaid GB5OWA
936 rosettes Delbigot
937 corrugated ANTONIO
938 bits Parishenors
939 polypropylene idiocies
940 raven Cassian
941 blob ruralists
942 vivid meva
943 gorse Diag
944 tooth mobius
945 silks McEnery
946 sprouted Gandini
947 Smooth tallow
948 darned TAKA
949 brass PERSPIRANT
950 pewter Brochure
951 starry hagiographical
952 blurred DSDL
953 ash blackamoor…
954 edges THERESIA
955 tucking Kominsky
956 cap congratters
957 dusted Plexaurelia
958 encased Mayrhofen
959 rubbery boyu
960 resemble Rzeszow
961 frills playsafe
962 strips Birkin
963 berry 540534865
964 hat flash
965 spikes unsat…
966 vibrating dyssynergic
967 colourful budwood
968 wrapping Cantley
969 Grey Arcadiae
970 shorts 35690
971 rounded streamers
972 blacker Watadar
973 tapering Mortimers
974 pear 2500km
975 pared loaning
976 beaver Hutchens
977 blackcurrant Blake…
978 cascading terbutaline
979 tops preoccupied…
980 wide messiah
981 curved BACKBITING
982 lettered De
983 stitch tilling
984 dried Jubb
985 flapping gem…
986 crusty patrimonies
987 triangular Nevsky
988 imitation FORESKIN
989 mink Tiaan
990 polythene neckroll
991 glass 5390
992 chandeliers sialyl
993 scratches Coughing
994 twirling Chemsafe
995 slit poppin
996 hanging Bocharov
997 shell 109G
998 oyster QUINTIN
999 sycamore phenomenalist
1000 reversible Hillyards
1001 suffused dim
1002 violets Narcolepsy
1003 whipped ELECTROCUTION
1004 lump Breakage
1005 streamers XL70
1006 brushing Awad
1007 sculpted 6049
1008 sheet Misérables
1009 pointed Clubturf
1010 hot Bashkirria
1011 maple coriander
1012 mellow McKeller
1013 twisted Lydie
1014 clipped Iddon
1015 shaved Realisation
1016 rosemary prehybridized
1017 upholstered Platonov
1018 blossoming Serei
1019 carpets intelligen
1020 Bare Capers
1021 pairs traffics
1022 cupped originali
1023 almond Lahej
1024 honeysuckle stalky

Visualize methods performance vs NCISC on 20newsgroups

As I wrote in my last post, my method improves things a lot on the datasets I’ve tried. The most impressive result is on 20 newsgroups, mainly because I can easily use the full, unaltered original data (and so that’s what I do).

I’ve used a classical random 80/20 split. The results are very consistent between various splits, but the results published in the Precision/Recall graph here are from a single split (I’m too lazy to change my code to average the results over several splits).
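A random 80/20 split of this kind can be sketched as follows. The function name `split_80_20` and the explicit `seed` parameter are assumptions for the illustration; varying the seed is one simple way to get the several splits one would average over.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a copy of `items` and return (80% training part, 20% held-out part)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator: reproducible per seed
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy example on ten document ids.
train, held_out = split_80_20(range(10))
```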

[Figure: Precision/Recall curves on 20 newsgroups (20ng)]

The format of the legend is as follows : Method name (Number of final features when relevant)(Time for dimension reduction) [ Regularization ]: Precision/Recall curve evaluation time

The methods :

  • none means doing nothing, and just using the raw regularized data (distance for the 1-NN is cosine)
  • LSI means that I used MATLAB’s svds to compute U and V (computing just U is not faster)
  • ISC is my fast LSI method (Precision/Recall are the same for both methods).
  • NC-ISC-2 is my little secret method
  • NC-ISC-2+SS is the same but with semi-supervised class information added
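Whatever the method, the evaluation relies on a 1-NN with cosine distance. Here is a minimal sketch of that classifier on dense toy vectors; the function names and the data are invented for the example, and the real evaluation runs on the reduced (or raw TFIDF) document vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_1nn(train_vectors, train_labels, query):
    """Return the label of the training vector most similar to `query`."""
    best = max(range(len(train_vectors)),
               key=lambda i: cosine(train_vectors[i], query))
    return train_labels[best]

# Toy training set (invented values).
train_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
train_labels = ["sci.space", "rec.autos", "misc.forsale"]
label = predict_1nn(train_vectors, train_labels, [0.9, 0.1])
```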

The initial learning matrix is 12590 documents * 57900 words, with 1.6 million values. My computer is a Q6600 (quad core) running at 2.4 GHz, with 8 GB of memory. BTW, it seems that I was wrong about the time taken to compute the TFIDF: it takes less than a second, not three seconds. Apart from the precision/recall improvement, I’m really focusing on speed and, as I wrote in my first post, my work on GPGPU is really paying off in making my method faster. Doing better than MATLAB is not difficult, but the performance I reach is this: a 500000×500000 sparse matrix containing approximately 20 million values is reduced in 30 sec with a GTX 280.
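For reference, here is a minimal sketch of a TFIDF weighting on tokenised documents. The exact variant used for the experiments is not stated above, so the formula here (raw term frequency times log(N / document frequency)) and the name `tfidf` are assumptions for the illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each word once per document
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weighted

# Toy corpus (invented for the example).
weights = tfidf([["gpu", "gpu", "matrix"], ["matrix", "svd"]])
```

With this variant, a word that appears in every document (here “matrix”) gets a weight of zero. Each document is processed in a single pass over its tokens, which is consistent with the step being cheap compared to the dimension reduction.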