More features, more results on Wikipedia

While we are working to make eXenGine more efficient, more expressive and, above all, more sellable, I keep playing with the results of NCISC on Wikipedia. Here is an example of the neighbours of Gensim (a well-known Python toolkit for text mining/natural language processing by Radim Řehůřek), this time on the document/word model with 90 features (a small sketch of how such a neighbour list is read follows the list):

  • OpenNLP : 0.9343325
  • Semantically-Interlinked Online Communities : 0.92996
  • GML Application Schemas : 0.9280249
  • XML Metadata Interchange : 0.926392
  • Languageware : 0.924556
  • Visual modeling : 0.92413026
  • GeoSPARQL : 0.9231725
  • Clairlib : 0.92301136
  • Heurist : 0.92240715
  • VisTrails : 0.9223076
  • Embedded RDF : 0.92183596
  • NetworkX : 0.9217739
  • UIMA : 0.9217461
  • Software Ideas Modeler : 0.92173994
  • List of Unified Modeling Language tools : 0.9215104
  • UModel : 0.92115307
  • SmartQVT : 0.9210423
  • Linked data : 0.9207388
  • Natural Language Toolkit : 0.92064124
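
As a point of reference, here is a minimal sketch of how a neighbour list like the one above is read off an embedding model: each article gets a 90-dimensional vector and neighbours are ranked by cosine similarity. This is only an illustration with random placeholder vectors and a handful of titles, not the NCISC implementation itself.

```python
# Minimal sketch (not the NCISC code): rank articles by cosine similarity
# to a query article, given one 90-feature vector per article.
import numpy as np

rng = np.random.default_rng(0)
titles = ["Gensim", "OpenNLP", "NetworkX", "UIMA", "Linked data"]
embeddings = rng.normal(size=(len(titles), 90))   # placeholder 90-feature vectors

def nearest_neighbours(query_index, embeddings, titles, k=3):
    # Normalise rows so a plain dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ normed[query_index]
    order = np.argsort(-scores)
    return [(titles[i], float(scores[i])) for i in order if i != query_index][:k]

for title, score in nearest_neighbours(titles.index("Gensim"), embeddings, titles):
    print(f"{title} : {score:.7f}")
```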

It’s far from perfect; one major cause is that we still don’t use bigrams or skip-grams in the input transformation (i.e. we use a plain old bag of words), but it clearly shows the power of NCISC.
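
To make the limitation concrete, here is a small sketch of the difference between a plain bag-of-words representation and one that also includes bigrams. It uses scikit-learn's CountVectorizer purely as an illustration; it is not the input transformation NCISC actually uses.

```python
# Plain bag of words vs. a representation that also keeps bigrams.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["natural language processing toolkit",
        "language modelling toolkit for text mining"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))   # single tokens only
print(sorted(bigrams.vocabulary_))    # also keeps pairs like 'natural language', 'text mining'
```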

You can compare this with the results provided by LSA using Gensim itself: in this post, a comment gives the top 10 results of a 500-feature model trained on Wikipedia (a rough sketch of this kind of Gensim query follows the list):

  • Snippet (programming)
  • Prettyprint
  • Smalltalk
  • Plagiarism detection
  • D (programming language)
  • Falcon (programming language)
  • Intelligent code completion
  • OCaml
  • Explicit semantic analysis
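
For readers unfamiliar with that workflow, here is a rough sketch of the kind of LSI similarity query the linked comment describes, using Gensim's own API on a toy corpus. The toy corpus, the query text, and the tiny number of topics are placeholders; the real comparison used a 500-feature model trained on the full Wikipedia dump.

```python
# Toy LSI similarity query with Gensim (illustration only).
from gensim import corpora, models, similarities

texts = [doc.lower().split() for doc in [
    "gensim topic modelling toolkit",
    "natural language toolkit for python",
    "code completion in smalltalk",
]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)   # 500 on real Wikipedia data
index = similarities.MatrixSimilarity(lsi[corpus])

query = lsi[dictionary.doc2bow("topic modelling in python".split())]
print(sorted(enumerate(index[query]), key=lambda x: -x[1]))       # (document id, similarity)
```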

When using the document/document interlink model (90 features), we obtain different, but also very good, results:

  • Jubatus : 0.98519164
  • Scikit-learn : 0.98126435
  • Feature Selection Toolbox : 0.97841007
  • Structure mining : 0.97700834
  • ADaMSoft : 0.9755426
  • Mallet (software project) : 0.97395116
  • Shogun (toolbox) : 0.9718502
  • CRM114 (program) : 0.96751755
  • Weka (machine learning) : 0.96635485
  • Clairlib : 0.9659551
  • Document retrieval : 0.96506816
  • Oracle Data Mining : 0.96350515
  • Approximate string matching : 0.96212476
  • Bayesian spam filtering : 0.96208096
  • Dlib : 0.9620419
  • GSP Algorithm : 0.96116054
  • Discounted cumulative gain : 0.9606682
  • ELKI : 0.9604578
  • NeuroSolutions : 0.96015286
  • Waffles (machine learning) : 0.9597046
  • Information extraction : 0.9588822
  • Latent semantic mapping : 0.95838064
  • ScaLAPACK : 0.9563968
  • Learning Based Java : 0.95608145
  • Relevance feedback : 0.9559279
  • Web search query : 0.9558407
  • Grapheur : 0.9556832
  • LIBSVM : 0.95526296
  • Entity linking : 0.95325243

These are much better than the 40-feature version displayed on our demo site:

  • Java GUI for R : 0.995283
  • Decision table : 0.99500793
  • Programming domain : 0.9946934
  • Scikit-learn : 0.994619
  • Feature model : 0.9941954
  • Dlib : 0.9938446
  • Pattern directed invocation programming language : 0.9937961
  • Compiler correctness : 0.9937197
  • On the Cruelty of Really Teaching Computer Science : 0.99280906
  • Structure mining : 0.99274904
  • Hidden algebra : 0.99265593
  • OMDoc : 0.9925574
  • Institute for System Programming : 0.99253553
  • Radial tree : 0.9924036
  • Partial evaluation : 0.9922968
  • Jubatus : 0.99167436
  • Query optimization : 0.9916388
  • Effect system : 0.99141824
