More features, more results on Wikipedia

While we are working to make eXenGine more efficient, more expressive and, above all, more sellable, I keep playing with the results of NCISC on Wikipedia. Here is an example of the neighbours of Gensim (a well-known Python toolkit for text mining/natural language processing by Radim Řehůřek), this time on the document/word model with 90 features (a small sketch of how such a neighbour list is read follows the list):

  • OpenNLP : 0.9343325
  • Semantically-Interlinked Online Communities : 0.92996
  • GML Application Schemas : 0.9280249
  • XML Metadata Interchange : 0.926392
  • Languageware : 0.924556
  • Visual modeling : 0.92413026
  • GeoSPARQL : 0.9231725
  • Clairlib : 0.92301136
  • Heurist : 0.92240715
  • VisTrails : 0.9223076
  • Embedded RDF : 0.92183596
  • NetworkX : 0.9217739
  • UIMA : 0.9217461
  • Software Ideas Modeler : 0.92173994
  • List of Unified Modeling Language tools : 0.9215104
  • UModel : 0.92115307
  • SmartQVT : 0.9210423
  • Linked data : 0.9207388
  • Natural Language Toolkit : 0.92064124
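
As a point of reference, here is a minimal sketch of how a neighbour list like the one above is read off an embedding model: each article gets a 90-dimensional vector and neighbours are ranked by cosine similarity. This is only an illustration with random placeholder vectors and a handful of titles, not the NCISC implementation itself.

```python
# Minimal sketch (not the NCISC code): rank articles by cosine similarity
# to a query article, given one 90-feature vector per article.
import numpy as np

rng = np.random.default_rng(0)
titles = ["Gensim", "OpenNLP", "NetworkX", "UIMA", "Linked data"]
embeddings = rng.normal(size=(len(titles), 90))   # placeholder 90-feature vectors

def nearest_neighbours(query_index, embeddings, titles, k=3):
    # Normalise rows so a plain dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ normed[query_index]
    order = np.argsort(-scores)
    return [(titles[i], float(scores[i])) for i in order if i != query_index][:k]

for title, score in nearest_neighbours(titles.index("Gensim"), embeddings, titles):
    print(f"{title} : {score:.7f}")
```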

It’s far from perfect; one major cause is that we still don’t use bigrams or skip-grams in the input transformation (i.e. we use a plain old bag of words), but it clearly shows the power of NCISC.
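
To make the limitation concrete, here is a small sketch of the difference between a plain bag-of-words representation and one that also includes bigrams. It uses scikit-learn's CountVectorizer purely as an illustration; it is not the input transformation NCISC actually uses.

```python
# Plain bag of words vs. a representation that also keeps bigrams.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["natural language processing toolkit",
        "language modelling toolkit for text mining"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))   # single tokens only
print(sorted(bigrams.vocabulary_))    # also keeps pairs like 'natural language', 'text mining'
```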

You can compare this with the results provided by LSA using Gensim itself: in this post, a comment gives the top 10 results of a 500-feature model trained on Wikipedia (a rough sketch of this kind of Gensim query follows the list):

  • Snippet (programming)
  • Prettyprint
  • Smalltalk
  • Plagiarism detection
  • D (programming language)
  • Falcon (programming language)
  • Intelligent code completion
  • OCaml
  • Explicit semantic analysis
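
For readers unfamiliar with that workflow, here is a rough sketch of the kind of LSI similarity query the linked comment describes, using Gensim's own API on a toy corpus. The toy corpus, the query text, and the tiny number of topics are placeholders; the real comparison used a 500-feature model trained on the full Wikipedia dump.

```python
# Toy LSI similarity query with Gensim (illustration only).
from gensim import corpora, models, similarities

texts = [doc.lower().split() for doc in [
    "gensim topic modelling toolkit",
    "natural language toolkit for python",
    "code completion in smalltalk",
]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)   # 500 on real Wikipedia data
index = similarities.MatrixSimilarity(lsi[corpus])

query = lsi[dictionary.doc2bow("topic modelling in python".split())]
print(sorted(enumerate(index[query]), key=lambda x: -x[1]))       # (document id, similarity)
```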

When using the document/document interlink model (90 features), we obtain different, but also very good, results:

  • Jubatus : 0.98519164
  • Scikit-learn : 0.98126435
  • Feature Selection Toolbox : 0.97841007
  • Structure mining : 0.97700834
  • ADaMSoft : 0.9755426
  • Mallet (software project) : 0.97395116
  • Shogun (toolbox) : 0.9718502
  • CRM114 (program) : 0.96751755
  • Weka (machine learning) : 0.96635485
  • Clairlib : 0.9659551
  • Document retrieval : 0.96506816
  • Oracle Data Mining : 0.96350515
  • Approximate string matching : 0.96212476
  • Bayesian spam filtering : 0.96208096
  • Dlib : 0.9620419
  • GSP Algorithm : 0.96116054
  • Discounted cumulative gain : 0.9606682
  • ELKI : 0.9604578
  • NeuroSolutions : 0.96015286
  • Waffles (machine learning) : 0.9597046
  • Information extraction : 0.9588822
  • Latent semantic mapping : 0.95838064
  • ScaLAPACK : 0.9563968
  • Learning Based Java : 0.95608145
  • Relevance feedback : 0.9559279
  • Web search query : 0.9558407
  • Grapheur : 0.9556832
  • LIBSVM : 0.95526296
  • Entity linking : 0.95325243

These are much better than the 40-feature version displayed on our demo site:

  • Java GUI for R : 0.995283
  • Decision table : 0.99500793
  • Programming domain : 0.9946934
  • Scikit-learn : 0.994619
  • Feature model : 0.9941954
  • Dlib : 0.9938446
  • Pattern directed invocation programming language : 0.9937961
  • Compiler correctness : 0.9937197
  • On the Cruelty of Really Teaching Computer Science : 0.99280906
  • Structure mining : 0.99274904
  • Hidden algebra : 0.99265593
  • OMDoc : 0.9925574
  • Institute for System Programming : 0.99253553
  • Radial tree : 0.9924036
  • Partial evaluation : 0.9922968
  • Jubatus : 0.99167436
  • Query optimization : 0.9916388
  • Effect system : 0.99141824
