More features, more results on Wikipedia

While we are working to make eXenGine more efficient, more expressive and, especially, more sellable, I keep playing with the results of NCISC on Wikipedia. Here is an example of the neighbours of GenSim (a well-known Python toolkit for text mining and natural language processing by Radim Řehůřek), this time on the document/word model with 90 features:

  • OpenNLP : 0.9343325
  • Semantically-Interlinked Online Communities : 0.92996
  • GML Application Schemas : 0.9280249
  • XML Metadata Interchange : 0.926392
  • Languageware : 0.924556
  • Visual modeling : 0.92413026
  • GeoSPARQL : 0.9231725
  • Clairlib : 0.92301136
  • Heurist : 0.92240715
  • VisTrails : 0.9223076
  • Embedded RDF : 0.92183596
  • NetworkX : 0.9217739
  • UIMA : 0.9217461
  • Software Ideas Modeler : 0.92173994
  • List of Unified Modeling Language tools : 0.9215104
  • UModel : 0.92115307
  • SmartQVT : 0.9210423
  • Linked data : 0.9207388
  • Natural Language Toolkit : 0.92064124

It’s far from perfect, one major cause being that we still don’t use bigrams or skip-grams in the input transformation (i.e. we use a plain old bag of words), but it clearly shows the power of NCISC.
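To make the difference between these input transformations concrete, here is a minimal sketch in plain Python. It is not eXenGine's actual pipeline: the function names and the `max_skip` parameter are assumptions chosen for illustration. It only shows what a bag of words discards (word order) and what bigrams or skip-grams would keep.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count single tokens; word order is discarded."""
    return Counter(tokens)

def bigrams(tokens):
    """Count pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def skip_grams(tokens, max_skip=1):
    """Count ordered pairs separated by at most `max_skip` intervening tokens."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # j stops after the adjacent token plus `max_skip` skipped tokens
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            counts[(left, tokens[j])] += 1
    return counts

tokens = "natural language processing toolkit".split()
unigram_counts = bag_of_words(tokens)
bigram_counts = bigrams(tokens)
skip_gram_counts = skip_grams(tokens, max_skip=1)
```

With a bag of words, "natural language" and "language natural" produce the same features; the bigram and skip-gram counts keep the pair ("natural", "language") as a feature of its own.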

You can compare these with the results of LSA computed with GenSim itself: in this post, a comment gives the top 10 results for a 500-feature model trained on Wikipedia:

  • Snippet (programming)
  • Prettyprint
  • Smalltalk
  • Plagiarism detection
  • D (programming language)
  • Falcon (programming language)
  • Intelligent code completion
  • OCaml
  • Explicit semantic analysis

When using the document/document interlink model (90 features), we obtain different, very good results:

  • Jubatus : 0.98519164
  • Scikit-learn : 0.98126435
  • Feature Selection Toolbox : 0.97841007
  • Structure mining : 0.97700834
  • ADaMSoft : 0.9755426
  • Mallet (software project) : 0.97395116
  • Shogun (toolbox) : 0.9718502
  • CRM114 (program) : 0.96751755
  • Weka (machine learning) : 0.96635485
  • Clairlib : 0.9659551
  • Document retrieval : 0.96506816
  • Oracle Data Mining : 0.96350515
  • Approximate string matching : 0.96212476
  • Bayesian spam filtering : 0.96208096
  • Dlib : 0.9620419
  • GSP Algorithm : 0.96116054
  • Discounted cumulative gain : 0.9606682
  • ELKI : 0.9604578
  • NeuroSolutions : 0.96015286
  • Waffles (machine learning) : 0.9597046
  • Information extraction : 0.9588822
  • Latent semantic mapping : 0.95838064
  • ScaLAPACK : 0.9563968
  • Learning Based Java : 0.95608145
  • Relevance feedback : 0.9559279
  • Web search query : 0.9558407
  • Grapheur : 0.9556832
  • LIBSVM : 0.95526296
  • Entity linking : 0.95325243

This is much better than the 40-feature version displayed on our demo site:

  • Java GUI for R : 0.995283
  • Decision table : 0.99500793
  • Programming domain : 0.9946934
  • Scikit-learn : 0.994619
  • Feature model : 0.9941954
  • Dlib : 0.9938446
  • Pattern directed invocation programming language : 0.9937961
  • Compiler correctness : 0.9937197
  • On the Cruelty of Really Teaching Computer Science : 0.99280906
  • Structure mining : 0.99274904
  • Hidden algebra : 0.99265593
  • OMDoc : 0.9925574
  • Institute for System Programming : 0.99253553
  • Radial tree : 0.9924036
  • Partial evaluation : 0.9922968
  • Jubatus : 0.99167436
  • Query optimization : 0.9916388
  • Effect system : 0.99141824

 

Fun (and relevant) results on Wikipedia for multi-documents queries

We are setting things up to improve our prototype demo site, http://demowiki.exensa.net, and we’re looking for ideas.

While playing around with our system and wondering how to offer a fun activity (like on Wikidistrict) on the website, I tried to see what would come out of a multi-document query. The results are very nice and fun:

Debian + Religion neighbours (70 neighbours documents/words model):

  • Ten Commandments of Computer Ethics : 0.8062684
  • Internet linguistics : 0.79933107
  • Translation memory : 0.7954829
  • Free software movement : 0.7887893
  • Machine translation : 0.7826216
  • E-Sword : 0.77988267
  • Technical translation : 0.7798575
  • New English Translation : 0.7767798
  • Posting style : 0.77614117
  • Digital artifactual value : 0.77253044
  • Multimodality : 0.7708764
  • Computer ethics : 0.77002776
  • The user-subjective approach : 0.77001673
  • Virtual community : 0.76979464
  • Computer-assisted translation : 0.76953083
  • Sharing : 0.76945585
  • Unity (user interface) : 0.7686337
  • Text annotation : 0.7677352
  • Separation of presentation and content : 0.7676085
  • Internet forum : 0.7660768
  • Comment (computer programming) : 0.7638492
  • List of email subject abbreviations : 0.7634484
  • Alternative terms for free software : 0.76130253
  • Single source publishing : 0.7607195
  • Open-source religion : 0.76062995
  • Hacker (term) : 0.76004744
  • Hacker ethic : 0.75634766
  • Internet-related prefixes : 0.7556772
  • New media : 0.755563
  • The Word Bible Software : 0.75490427

We’re going to introduce this fun activity on the website in the next few weeks, together with a search engine.
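For readers wondering what a multi-document query could look like in practice, here is a hedged sketch. The post does not say how NCISC combines several documents, so this assumes the simplest option: sum the length-normalized feature vectors of the query documents and rank the other documents by cosine similarity. The function names and the toy 3-feature vectors are invented for illustration and are not real model output.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def multi_document_neighbours(query_ids, vectors, top_k=3):
    """Rank documents by cosine similarity to the sum of the normalized query vectors."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for doc_id in query_ids:
        for i, x in enumerate(normalize(vectors[doc_id])):
            query[i] += x
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in vectors.items()
        if doc_id not in query_ids  # do not return the query documents themselves
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy vectors, invented for illustration only.
toy_vectors = {
    "topic_a": [1.0, 0.0, 0.0],
    "topic_b": [0.0, 1.0, 0.0],
    "mixes_a_and_b": [0.7, 0.7, 0.1],
    "only_a": [0.9, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
ranking = multi_document_neighbours(["topic_a", "topic_b"], toy_vectors)
```

Under this assumption, a document that mixes both query topics ranks above one that matches only a single topic, which is the behaviour the Debian + Religion list suggests.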

 

NCISC as a search engine : now it works too !

We’ve had mixed results so far when using the output of our algorithm for search-engine-like applications, i.e. finding the documents relevant for a given term. The symmetric neighbourhood (documents similar to a document, or words similar to a word) has always given great results, but the results in the asymmetric case (documents for a word, or words for a document) were much less relevant.

We have finally found the problem, and here are some results from an analysis of the English Wikipedia (70 features, no information added) :

Wikipedia pages relevant for the word « turing » (titles are completely ignored, as well as page popularity):

  • Hypercomputation
  • Solomonoff’s theory of inductive inference
  • Unbounded nondeterminism
  • Super-recursive algorithm
  • Theory of computation
  • Automated theorem proving
  • List of important publications in theoretical computer science
  • Turing machine
  • Turing completeness
  • Sheila Greibach
  • Church–Turing thesis
  • Algorithmic information theory
  • Universal Turing machine
  • Wolfram’s 2-state 3-symbol Turing machine
  • Digital physics
  • Logical framework
  • Fuzzy logic
  • Denotational semantics
  • List of machine learning algorithms
  • Parallel computation thesis
  • Computational geometry
  • Algorithm
  • Logic in computer science
  • Operational semantics
  • Satisfiability Modulo Theories
  • Automated reasoning
  • Information theory
  • Denotational semantics of the Actor model
  • Oracle machine
  • Indeterminacy in concurrent computation
  • Baum–Welch algorithm
  • List of books in computational geometry
  • John V. Tucker
  • List of PSPACE-complete problems
  • Power domains
  • List of computability and complexity topics

Similarly, retrieving the most relevant words for a given page, here « Turing machine », gives pretty good results:

  • turing
  • recursive
  • arithmetic
  • boolean
  • mappings
  • recursion
  • deterministic
  • automata
  • algorithmically
  • algorithm
  • melzak
  • iterating
  • computes
  • compute
  • leiserson
  • provably
  • iterate
  • cormen
  • computations
  • subproblems
  • definable
  • embedding
  • automaton
  • subtraction
  • satisfiability
  • undecidable
  • iteratively
  • constraint
  • deterministically
  • tuples
  • tuple
  • associative
  • pseudocode
  • pseudorandom
  • reachability
  • completeness
  • recursively
  • iterative
  • unary
  • logic
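The symmetric and asymmetric queries discussed in this post can be illustrated with a small sketch. It assumes that words and documents are embedded in one shared feature space and compared by cosine similarity. The post does not describe what the problem was or how it was fixed, so this only shows what each query direction means. All names and vectors below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, candidates, top_k=2):
    """Return the names of the `top_k` candidates closest to `query_vec`."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

# Toy 2-feature vectors, invented for illustration only.
word_vectors = {"word_x": [1.0, 0.1], "word_y": [0.1, 1.0]}
doc_vectors = {
    "doc_about_x": [0.9, 0.2],
    "doc_about_y": [0.2, 0.9],
    "doc_mixed": [0.6, 0.6],
}

# Asymmetric: documents for a word, and words for a document.
docs_for_word = rank(word_vectors["word_x"], doc_vectors)
words_for_doc = rank(doc_vectors["doc_about_y"], word_vectors, top_k=1)

# Symmetric: documents similar to a document (excluding the query itself).
other_docs = {k: v for k, v in doc_vectors.items() if k != "doc_about_x"}
similar_docs = rank(doc_vectors["doc_about_x"], other_docs, top_k=1)
```

The same `rank` helper serves both directions here; the only difference is which set of vectors supplies the query and which supplies the candidates.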