Fun (and relevant) results on Wikipedia for multi-documents queries

We are setting things up to improve on our prototype demo site: http://demowiki.exensa.net and we're looking for ideas.

While playing around with our system and wondering how to add a fun activity (like on Wikidistrict) to the website, I tried to see what would come out of a multi-document query, and the results are very nice and fun:

Nearest neighbours of Debian + Religion (70-feature documents/words model):

  • Ten Commandments of Computer Ethics : 0.8062684
  • Internet linguistics : 0.79933107
  • Translation memory : 0.7954829
  • Free software movement : 0.7887893
  • Machine translation : 0.7826216
  • E-Sword : 0.77988267
  • Technical translation : 0.7798575
  • New English Translation : 0.7767798
  • Posting style : 0.77614117
  • Digital artifactual value : 0.77253044
  • Multimodality : 0.7708764
  • Computer ethics : 0.77002776
  • The user-subjective approach : 0.77001673
  • Virtual community : 0.76979464
  • Computer-assisted translation : 0.76953083
  • Sharing : 0.76945585
  • Unity (user interface) : 0.7686337
  • Text annotation : 0.7677352
  • Separation of presentation and content : 0.7676085
  • Internet forum : 0.7660768
  • Comment (computer programming) : 0.7638492
  • List of email subject abbreviations : 0.7634484
  • Alternative terms for free software : 0.76130253
  • Single source publishing : 0.7607195
  • Open-source religion : 0.76062995
  • Hacker (term) : 0.76004744
  • Hacker ethic : 0.75634766
  • Internet-related prefixes : 0.7556772
  • New media : 0.755563
  • The Word Bible Software : 0.75490427

We're going to introduce this fun activity on the website in the next few weeks, together with a search engine.
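The idea behind a multi-document query can be sketched in a few lines. This is purely illustrative (it is not the actual NCISC code, and averaging the query embeddings is just one plausible way to combine several documents), assuming each document is represented by a dense feature vector:

```python
from math import sqrt

def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def multi_doc_neighbours(doc_vecs, query_ids, k=5):
    """Combine several query documents by averaging their embeddings,
    then rank every other document by cosine similarity to that average.

    Illustrative sketch only: NCISC itself may combine and score
    documents differently.
    """
    dim = len(doc_vecs[0])
    # Average the embeddings of the query documents.
    q = [sum(doc_vecs[i][d] for i in query_ids) / len(query_ids)
         for d in range(dim)]
    exclude = set(query_ids)
    # Score all remaining documents and keep the k best.
    scored = [(i, _cosine(q, v)) for i, v in enumerate(doc_vecs)
              if i not in exclude]
    scored.sort(key=lambda t: -t[1])
    return scored[:k]
```

For a « Debian + Religion » query, `query_ids` would hold the indices of those two pages, and the returned scores would play the role of the similarity values in the list above.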

 

NCISC as a search engine: now it works too!

We've had mixed results so far when using the output of our algorithm for search-engine-like applications, i.e. finding the documents relevant to a given term. Symmetric neighbourhood queries (documents similar to a document, or words similar to a word) have always shown great results, but the results in the asymmetric case were much less relevant.

We have finally found the problem, and here are some results from an analysis of the English Wikipedia (70 features, no information added):

Wikipedia pages relevant to the word « turing » (titles are completely ignored, as is page popularity):

  • Hypercomputation
  • Solomonoff’s theory of inductive inference
  • Unbounded nondeterminism
  • Super-recursive algorithm
  • Theory of computation
  • Automated theorem proving
  • List of important publications in theoretical computer science
  • Turing machine
  • Turing completeness
  • Sheila Greibach
  • Church–Turing thesis
  • Algorithmic information theory
  • Universal Turing machine
  • Wolfram’s 2-state 3-symbol Turing machine
  • Digital physics
  • Logical framework
  • Fuzzy logic
  • Denotational semantics
  • List of machine learning algorithms
  • Parallel computation thesis
  • Computational geometry
  • Algorithm
  • Logic in computer science
  • Operational semantics
  • Satisfiability Modulo Theories
  • Automated reasoning
  • Information theory
  • Denotational semantics of the Actor model
  • Oracle machine
  • Indeterminacy in concurrent computation
  • Baum–Welch algorithm
  • List of books in computational geometry
  • John V. Tucker
  • List of PSPACE-complete problems
  • Power domains
  • List of computability and complexity topics

Similarly, getting the most relevant words for a given page, « Turing machine », gives pretty good results:

  • turing
  • recursive
  • arithmetic
  • boolean
  • mappings
  • recursion
  • deterministic
  • automata
  • algorithmically
  • algorithm
  • melzak
  • iterating
  • computes
  • compute
  • leiserson
  • provably
  • iterate
  • cormen
  • computations
  • subproblems
  • definable
  • embedding
  • automaton
  • subtraction
  • satisfiability
  • undecidable
  • iteratively
  • constraint
  • deterministically
  • tuples
  • tuple
  • associative
  • pseudocode
  • pseudorandom
  • reachability
  • completeness
  • recursively
  • iterative
  • unary
  • logic

 

Is our universe a simulation?

Slashdot features an article today about a revival of the argument about the likelihood that we live in a computer simulation. It made me react because I remember having used the very opposite argument to show that a universe ruled by a god is less likely than one not ruled by a god (by my definition of a god, this is very close to the « we live in a computer simulation » idea).

To keep it short, Bostrom assumes that at least one of these propositions is true:

  1. the human species is very likely to go extinct before reaching a “posthuman” stage;
  2. any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof);
  3. we are almost certainly living in a computer simulation.

He then reaches the conclusion that we are more likely to live in a subworld simulation of the evolutionary history of a species which probably emerged from a very similar environment than we are to live in the real stuff.

I remember pondering the probability that our world was bootstrapped by a superpower vs. not being bootstrapped. Basically, for an intelligent species to emerge from nothing, you have to wait a long time, and the probability is low: let's say N after a given time. But if the universe is not restricted to a single environment, you multiply your chances by the number of environments E produced by the universe (if our visible universe is a good image of the real universe, this number is quite huge).

Then you have to take into account the probability that such an intelligent species can transition to one able to run simulations of the universe at a scale large enough for the simulation to be perceived as the real thing by its inhabitants. And also the probability that the energy required to run such a simulation is available, and that it is interesting enough to be run in a way compatible with what we can observe.

To answer this, we should seriously think about what could be interesting enough for us to spend a large amount of computing power running a simulation of our world: predict the future? have infinite TV programmes of primitives doing primitive things (remember that we would be in a post-human state by then)? provide a playground for post-human brains?

I don't know the processing power currently spent on video games vs. universe simulation vs. small-scale life simulation vs. spying on people, but my guess is that if we reach a post-human state, people will crave fun, because everything that could be understood will have been understood, and predicting the future in a post-human era would require simulating a post-human society. So either we are a simulation for a TV network, or we are in a video game. But I don't think a post-human civilization would want to spend energy simulating many human worlds, unless it's for the fun of playing gods in a society of the past.

Now, factor all that in: E.N is the number of human-level intelligent societies in a big universe after a given time. Multiply that by the probability P that such a society can transition to a post-human one with enough computing power and energy to run simulations. Then multiply by the number S of simulations such a post-human society could run that have the characteristics of a subpart of the universe without any visible intervention from almighty actors. The question is thus: is E.N.P.S > E.N, i.e. is P.S > 1? I'm not so sure.
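This back-of-the-envelope comparison can be made explicit. Since the E.N factor appears on both sides, the odds of being simulated depend only on the product P.S; a tiny sketch (the numbers in the usage note are purely illustrative assumptions, not estimates):

```python
def fraction_simulated(p_transition, sims_per_society):
    """Fraction of human-level societies that are simulated.

    There are E.N real societies and E.N * P * S simulated ones;
    the E.N factor cancels, leaving P.S simulated societies per real one.
    """
    simulated = p_transition * sims_per_society  # P.S per real society
    return simulated / (1.0 + simulated)
```

With P = 0.5 and S = 2, P.S = 1 and the odds are even: `fraction_simulated(0.5, 2)` returns 0.5. With a pessimistic P = 1e-6 and only S = 1000 simulations per post-human society, the fraction drops below 0.1%, which is the sense in which P.S > 1 is the whole question.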

Actually my personal conclusion is: since we have not observed god-like people making fun of us, I would vote that we're living in a non-simulated world. Because seriously, if you could play god in a simulation of the Earth in the 20th century, wouldn't you? Or would you rather just watch primitives live their boring lives?

Talk at Paris Machine Learning Meetup

I'll give a talk at the fantastic Meetup organized by Igor Carron and Franck Bardol, about NCISC and the demo we are setting up on Wikipedia data.

It's tomorrow, Wednesday the 12th, in Paris.

I'll present NCISC and some results on a few standard datasets (Reuters, 20 Newsgroups, OHSUMED), as well as a comparison with Deep-Learning-inspired methods. I'll also present the pipeline we've set up for analyzing Wikipedia from the text, link and category perspectives.

There's a demo (with relatively old and buggy data) here: http://demowiki.exensa.net/

The talk will also be live-streamed with Google+ Hangout, and the slides will be available here.

Scikit-Learn, GraphLab, MLBase, Mahout

There's a flurry of Machine Learning platforms/languages/libraries/systems that all implement almost the same algorithms. Have you tried them? I'm wondering which one is best for expressing a new algorithm quickly, efficiently and scalably.

From what I've seen, SciKit and MLBase sound like the best choices from a usability point of view; GraphLab, MLBase and Mahout are great on scalability; and GraphLab is the most efficient, with SciKit and MLBase just on its tail.

Also it seems that GraphLab is not super easy to deploy.

GraphLab and (obviously) SciKit can interface with the IPython notebook. What about MLBase? At least Spark has Scala and Python bindings, so it should be able to connect.

  • SciKit-learn : http://scikit-learn.org
  • GraphLab : http://graphlab.org/
  • MLBase : http://www.mlbase.org/
  • Mahout : http://mahout.apache.org/

Any other solutions ?

Couchbase 1.8.0 -> 1.8.1 upgrade and rebalance freeze issue

We've stumbled on a (relatively) important issue in version 1.8.0 of Couchbase: during an online upgrade to 1.8.1, the rebalance process stops/hangs/freezes. Various reasons have been proposed for this problem (for instance, the existence of empty vbuckets).

I've figured out that there are good old-fashioned reasons why the upgrade process can fail. So after 5 unsuccessful attempts at a smooth upgrade, I've finally managed to find a procedure that seems to work:

(This is a procedure for an online upgrade, where for each node you successively remove it and rebalance, stop and upgrade it, then add it back and rebalance again. It uses the Debian package.)

  1. backup with cbbackup.
  2. remove the node and rebalance (this has never failed me at this point)
  3. stop the couchbase-server
  4. make sure that epmd is no longer running (it was still running after at least 3 of my previous upgrade attempts); otherwise, kill it!
  5. wipe /opt/couchbase (OK, maybe this is overkill, but at least once I HAD to do it in order to continue the upgrade)
  6. dpkg -i couchbase-server...1.8.1.deb
  7. at this point you can edit /opt/couchbase/etc/couchbase_init.d and add the line « export HOME=/tmp » at the beginning, otherwise you won't be able to stop the server using « service couchbase-server stop »
  8. add the node back to the cluster / rebalance
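The steps above can be strung together per node. A hedged sketch only: the hostnames, credentials and package filename are placeholders, and the `couchbase-cli` and `cbbackup` invocations are assumptions to be checked against your version's documentation. The script echoes each command rather than executing it, so you can review the plan before replacing the `run` indirection with real execution:

```shell
#!/bin/sh
# Sketch of the per-node online-upgrade procedure (steps 1-8 above).
# CLUSTER, NODE, the credentials and DEB are placeholder values.
CLUSTER="cluster-host:8091"
NODE="node-to-upgrade:8091"
AUTH="-u Administrator -p password"
DEB="couchbase-server_1.8.1.deb"

# Dry-run helper: print each command instead of running it.
run() { echo "+ $*"; }

# 1. backup
run cbbackup http://$CLUSTER /backup/couchbase $AUTH
# 2. remove the node and rebalance
run couchbase-cli rebalance -c $CLUSTER $AUTH --server-remove=$NODE
# 3. stop the server on the node
run service couchbase-server stop
# 4. make sure epmd (the Erlang port mapper) is really gone
run pkill epmd
# 5. wipe the old install (overkill, but sometimes required)
run rm -rf /opt/couchbase
# 6. install the new package
run dpkg -i "$DEB"
# 7. make the init script stoppable by exporting HOME=/tmp at its top
run sed -i '2i export HOME=/tmp' /opt/couchbase/etc/couchbase_init.d
# 8. add the node back and rebalance
run couchbase-cli rebalance -c $CLUSTER $AUTH --server-add=$NODE
```

Step 7 is the workaround from the list above; the `sed` one-liner is just one hypothetical way to insert the « export HOME=/tmp » line at the top of the init script.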