More features, more results on wikipedia

Posted on 2014/03/24

While we are working to make eXenGine more efficient, more expressive and, more specifically, more sellable, I keep playing with the results of NCISC on Wikipedia. Here are the neighbours of GenSim (a well-known Python toolkit for text mining and natural language processing by Radim Řehůřek), this time on the document/word model with 90 features:

OpenNLP : 0.9343325
Semantically-Interlinked Online Communities : 0.92996
GML Application Schemas : 0.9280249
XML Metadata Interchange : 0.926392
Languageware : 0.924556
Visual modeling : 0.92413026
GeoSPARQL : 0.9231725
Clairlib : 0.92301136
Heurist : 0.92240715
VisTrails : 0.9223076
Embedded RDF : 0.92183596
NetworkX : 0.9217739
UIMA : 0.9217461
Software Ideas Modeler : 0.92173994
List of Unified Modeling Language tools : 0.9215104
UModel : 0.92115307
SmartQVT : 0.9210423
Linked data : 0.9207388
Natural Language Toolkit : 0.92064124

It’s far from perfect. One major cause is that we still don’t use bigrams or skip-grams in the input transformation (i.e. we use a plain old bag of words), but it clearly shows the power of NCISC.
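To make the difference concrete, here is a minimal sketch of the three input transformations mentioned above. It is not the eXenGine preprocessing: the whitespace tokenizer and the function names (`tokenize`, `bag_of_words`, `bigrams`, `skip_grams`) are assumptions chosen for illustration only.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(tokens):
    """Plain bag of words: counts of single tokens, word order discarded."""
    return Counter(tokens)

def bigrams(tokens):
    """Counts of adjacent token pairs, so local word order is kept."""
    return Counter(zip(tokens, tokens[1:]))

def skip_grams(tokens, max_skip=1):
    """Counts of ordered token pairs separated by at most `max_skip` tokens."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # j ranges over the next (max_skip + 1) positions after i
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs[(left, tokens[j])] += 1
    return pairs

tokens = tokenize("natural language processing toolkit")
print(bag_of_words(tokens))
print(bigrams(tokens))
print(skip_grams(tokens, max_skip=1))
```

With a bag of words, "natural language processing" contributes three unrelated tokens. Bigrams and skip-grams add pair features such as ("natural", "language"), which is the kind of information the current input transformation leaves out.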

You can compare these with the results that LSA produces using GenSim itself. In this post, a comment gives the top 10 results for a 500-feature model trained on Wikipedia:

Snippet (programming)
Prettyprint
Smalltalk
Plagiarism detection
D (programming language)
Falcon (programming language)
Intelligent code completion
OCaml
Explicit semantic analysis

When using the document/document interlink model (90 features), we obtain different, very good results:

Jubatus : 0.98519164
Scikit-learn : 0.98126435
Feature Selection Toolbox : 0.97841007
Structure mining : 0.97700834
ADaMSoft : 0.9755426
Mallet (software project) : 0.97395116
Shogun (toolbox) : 0.9718502
CRM114 (program) : 0.96751755
Weka (machine learning) : 0.96635485
Clairlib : 0.9659551
Document retrieval : 0.96506816
Oracle Data Mining : 0.96350515
Approximate string matching : 0.96212476
Bayesian spam filtering : 0.96208096
Dlib : 0.9620419
GSP Algorithm : 0.96116054
Discounted cumulative gain : 0.9606682
ELKI : 0.9604578
NeuroSolutions : 0.96015286
Waffles (machine learning) : 0.9597046
Information extraction : 0.9588822
Latent semantic mapping : 0.95838064
ScaLAPACK : 0.9563968
Learning Based Java : 0.95608145
Relevance feedback : 0.9559279
Web search query : 0.9558407
Grapheur : 0.9556832
LIBSVM : 0.95526296
Entity linking : 0.95325243

This is much better than the 40-feature version displayed on our demo site:

Java GUI for R : 0.995283
Decision table : 0.99500793
Programming domain : 0.9946934
Scikit-learn : 0.994619
Feature model : 0.9941954
Dlib : 0.9938446
Pattern directed invocation programming language : 0.9937961
Compiler correctness : 0.9937197
On the Cruelty of Really Teaching Computer Science : 0.99280906
Structure mining : 0.99274904
Hidden algebra : 0.99265593
OMDoc : 0.9925574
Institute for System Programming : 0.99253553
Radial tree : 0.9924036
Partial evaluation : 0.9922968
Jubatus : 0.99167436
Query optimization : 0.9916388
Effect system : 0.99141824

Fun (and relevant) results on Wikipedia for multi-documents queries

Posted on 2014/03/14

We are setting things up to improve our prototype demo site, http://demowiki.exensa.net, and we’re looking for ideas.

While playing around with our system and wondering how to offer a fun activity (like on Wikidistrict) on the website, I tried to see what would come out of a multi-document query. The results are very nice and fun:

Debian + Religion neighbours (70 neighbours documents/words model):

Ten Commandments of Computer Ethics : 0.8062684
Internet linguistics : 0.79933107
Translation memory : 0.7954829
Free software movement : 0.7887893
Machine translation : 0.7826216
E-Sword : 0.77988267
Technical translation : 0.7798575
New English Translation : 0.7767798
Posting style : 0.77614117
Digital artifactual value : 0.77253044
Multimodality : 0.7708764
Computer ethics : 0.77002776
The user-subjective approach : 0.77001673
Virtual community : 0.76979464
Computer-assisted translation : 0.76953083
Sharing : 0.76945585
Unity (user interface) : 0.7686337
Text annotation : 0.7677352
Separation of presentation and content : 0.7676085
Internet forum : 0.7660768
Comment (computer programming) : 0.7638492
List of email subject abbreviations : 0.7634484
Alternative terms for free software : 0.76130253
Single source publishing : 0.7607195
Open-source religion : 0.76062995
Hacker (term) : 0.76004744
Hacker ethic : 0.75634766
Internet-related prefixes : 0.7556772
New media : 0.755563
The Word Bible Software : 0.75490427

We’re going to introduce this fun activity on the website in the next few weeks, together with a search engine.
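For readers wondering what a multi-document query could look like in practice, here is a minimal sketch. It assumes each page is represented by a dense feature vector and that the query is the sum of the unit vectors of the query pages, ranked by cosine similarity. This is one common way to combine documents, not necessarily what NCISC does. The function names are hypothetical, and the 3-feature vectors are invented toy values, not real model output.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def multi_document_query(vectors, query_titles, top_k=3):
    """Rank pages by cosine similarity to the sum of the query pages' unit vectors."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for title in query_titles:
        for i, x in enumerate(normalize(vectors[title])):
            query[i] += x
    scores = [
        (cosine(query, vec), title)
        for title, vec in vectors.items()
        if title not in query_titles  # do not return the query pages themselves
    ]
    return [(title, score) for score, title in sorted(scores, reverse=True)[:top_k]]

# Toy vectors, invented for illustration only.
vectors = {
    "Debian": [1.0, 0.0, 0.1],
    "Religion": [0.0, 1.0, 0.1],
    "Computer ethics": [0.7, 0.7, 0.2],
    "Linux kernel": [1.0, 0.1, 0.0],
    "Theology": [0.1, 1.0, 0.0],
}
results = multi_document_query(vectors, ["Debian", "Religion"])
for title, score in results:
    print(title, ":", round(score, 4))
```

Normalizing each query page before summing keeps a long or popular page from dominating the combined query. Pages that sit between the two topics, like the toy "Computer ethics" entry, then rank above pages close to only one of them.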

NCISC as a search engine : now it works too !

Posted on 2014/03/05

We’ve had mixed results so far when using the output of our algorithm for search-engine-like applications, i.e. finding the documents relevant to a given term. The symmetric neighbourhoods (documents similar to a document, or words similar to a word) have always shown great results, but the results in the asymmetric case (documents for a word, or words for a document) were much less relevant.

We have finally found the problem, and here are some results from an analysis of the English Wikipedia (70 features, no information added) :

Wikipedia pages relevant for the word « turing » (titles are completely ignored, as well as page popularity):

Hypercomputation
Solomonoff’s theory of inductive inference
Unbounded nondeterminism
Super-recursive algorithm
Theory of computation
Automated theorem proving
List of important publications in theoretical computer science
Turing machine
Turing completeness
Sheila Greibach
Church–Turing thesis
Algorithmic information theory
Universal Turing machine
Wolfram’s 2-state 3-symbol Turing machine
Digital physics
Logical framework
Fuzzy logic
Denotational semantics
List of machine learning algorithms
Parallel computation thesis
Computational geometry
Algorithm
Logic in computer science
Operational semantics
Satisfiability Modulo Theories
Automated reasoning
Information theory
Denotational semantics of the Actor model
Oracle machine
Indeterminacy in concurrent computation
Baum–Welch algorithm
List of books in computational geometry
John V. Tucker
List of PSPACE-complete problems
Power domains
List of computability and complexity topics

Similarly, retrieving the most relevant words for a given page, here « Turing machine », gives pretty good results:

turing
recursive
arithmetic
boolean
mappings
recursion
deterministic
automata
algorithmically
algorithm
melzak
iterating
computes
compute
leiserson
provably
iterate
cormen
computations
subproblems
definable
embedding
automaton
subtraction
satisfiability
undecidable
iteratively
constraint
deterministically
tuples
tuple
associative
pseudocode
pseudorandom
reachability
completeness
recursively
iterative
unary
logic
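The two asymmetric queries above (pages for a word, words for a page) can be sketched as follows. This assumes that words and pages have vectors in one shared feature space and that relevance is scored by cosine similarity across the two sets. That is an assumption made for illustration, not a description of NCISC internals. The `rank` helper and the 2-feature vectors are hypothetical toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_vector, candidates, top_k=2):
    """Return the names of the top_k candidates most similar to the query vector."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

# Toy vectors in a shared 2-feature space, invented for illustration only.
word_vectors = {
    "turing": [0.9, 0.1],
    "recursive": [0.8, 0.3],
    "bible": [0.1, 0.9],
}
doc_vectors = {
    "Turing machine": [1.0, 0.2],
    "Hypercomputation": [0.9, 0.0],
    "New English Translation": [0.0, 1.0],
}

# Asymmetric case 1: pages relevant to a word (search-engine style).
docs_for_word = rank(word_vectors["turing"], doc_vectors)
# Asymmetric case 2: words relevant to a page.
words_for_doc = rank(doc_vectors["Turing machine"], word_vectors)
print(docs_for_word)
print(words_for_doc)
```

The symmetric case uses the same `rank` call with the query and the candidates taken from the same set. Only the asymmetric case compares vectors of different kinds, which is where the relevance problem described in this post showed up.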

Mostly linguistically computational

Adventure in collaborative filtering, information retrieval, matrix factorization and other stuff

Month: March 2014