Fun (and relevant) results on Wikipedia for multi-documents queries

We are setting things up to improve on our prototype demo site: http://demowiki.exensa.net and we're looking for ideas.

While playing around with our system and wondering how to add a fun activity (like on Wikidistrict) to the website, I tried to see what would come out of a multi-document query, and the results are very nice and fun:

Nearest neighbours of Debian + Religion (70-feature documents/words model):

  • Ten Commandments of Computer Ethics : 0.8062684
  • Internet linguistics : 0.79933107
  • Translation memory : 0.7954829
  • Free software movement : 0.7887893
  • Machine translation : 0.7826216
  • E-Sword : 0.77988267
  • Technical translation : 0.7798575
  • New English Translation : 0.7767798
  • Posting style : 0.77614117
  • Digital artifactual value : 0.77253044
  • Multimodality : 0.7708764
  • Computer ethics : 0.77002776
  • The user-subjective approach : 0.77001673
  • Virtual community : 0.76979464
  • Computer-assisted translation : 0.76953083
  • Sharing : 0.76945585
  • Unity (user interface) : 0.7686337
  • Text annotation : 0.7677352
  • Separation of presentation and content : 0.7676085
  • Internet forum : 0.7660768
  • Comment (computer programming) : 0.7638492
  • List of email subject abbreviations : 0.7634484
  • Alternative terms for free software : 0.76130253
  • Single source publishing : 0.7607195
  • Open-source religion : 0.76062995
  • Hacker (term) : 0.76004744
  • Hacker ethic : 0.75634766
  • Internet-related prefixes : 0.7556772
  • New media : 0.755563
  • The Word Bible Software : 0.75490427

We're going to introduce this fun activity on the website in the next few weeks, together with a search engine.
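The idea behind a multi-document query can be sketched in a few lines. This is purely illustrative (it is not the actual NCISC code, and averaging the query embeddings is just one plausible way to combine several documents), assuming each document is represented by a dense feature vector:

```python
from math import sqrt

def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def multi_doc_neighbours(doc_vecs, query_ids, k=5):
    """Combine several query documents by averaging their embeddings,
    then rank every other document by cosine similarity to that average.

    Illustrative sketch only: NCISC itself may combine and score
    documents differently.
    """
    dim = len(doc_vecs[0])
    # Average the embeddings of the query documents.
    q = [sum(doc_vecs[i][d] for i in query_ids) / len(query_ids)
         for d in range(dim)]
    exclude = set(query_ids)
    # Score all remaining documents and keep the k best.
    scored = [(i, _cosine(q, v)) for i, v in enumerate(doc_vecs)
              if i not in exclude]
    scored.sort(key=lambda t: -t[1])
    return scored[:k]
```

For a « Debian + Religion » query, `query_ids` would hold the indices of those two pages, and the returned scores would play the role of the similarity values in the list above.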

 

NCISC as a search engine: now it works too!

We've had mixed results so far when using the output of our algorithm for search-engine-like applications, i.e. finding the documents relevant to a given term. Symmetric neighbourhood queries (documents similar to a document, or words similar to a word) have always shown great results, but the results in the asymmetric case were much less relevant.

We have finally found the problem, and here are some results from an analysis of the English Wikipedia (70 features, no information added):

Wikipedia pages relevant to the word « turing » (titles are completely ignored, as is page popularity):

  • Hypercomputation
  • Solomonoff’s theory of inductive inference
  • Unbounded nondeterminism
  • Super-recursive algorithm
  • Theory of computation
  • Automated theorem proving
  • List of important publications in theoretical computer science
  • Turing machine
  • Turing completeness
  • Sheila Greibach
  • Church–Turing thesis
  • Algorithmic information theory
  • Universal Turing machine
  • Wolfram’s 2-state 3-symbol Turing machine
  • Digital physics
  • Logical framework
  • Fuzzy logic
  • Denotational semantics
  • List of machine learning algorithms
  • Parallel computation thesis
  • Computational geometry
  • Algorithm
  • Logic in computer science
  • Operational semantics
  • Satisfiability Modulo Theories
  • Automated reasoning
  • Information theory
  • Denotational semantics of the Actor model
  • Oracle machine
  • Indeterminacy in concurrent computation
  • Baum–Welch algorithm
  • List of books in computational geometry
  • John V. Tucker
  • List of PSPACE-complete problems
  • Power domains
  • List of computability and complexity topics

Similarly, getting the most relevant words for a given page, « Turing machine », gives pretty good results:

  • turing
  • recursive
  • arithmetic
  • boolean
  • mappings
  • recursion
  • deterministic
  • automata
  • algorithmically
  • algorithm
  • melzak
  • iterating
  • computes
  • compute
  • leiserson
  • provably
  • iterate
  • cormen
  • computations
  • subproblems
  • definable
  • embedding
  • automaton
  • subtraction
  • satisfiability
  • undecidable
  • iteratively
  • constraint
  • deterministically
  • tuples
  • tuple
  • associative
  • pseudocode
  • pseudorandom
  • reachability
  • completeness
  • recursively
  • iterative
  • unary
  • logic

 

Is our universe a simulation?

Slashdot features an article today about a revival of the argument about the likelihood that we live in a computer simulation. It made me react because I remember having used the very opposite argument to show that a universe ruled by a god is less likely than one not ruled by a god (by my definition of a god, this is very close to the « we live in a computer simulation » idea).

To keep it short, Bostrom assumes that at least one of these propositions is true:

  1. the human species is very likely to go extinct before reaching a “posthuman” stage;
  2. any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof);
  3. we are almost certainly living in a computer simulation.

He then reaches the conclusion that we are more likely to live in a subworld simulation of the evolutionary history of a species which probably emerged from a very similar environment than we are to live in the real stuff.

I remember pondering the probability that our world was bootstrapped by a superpower vs. not being bootstrapped. Basically, for an intelligent species to emerge from nothing, you have to wait a long time, and the probability is low: let's say N after a given time. But if the universe is not restricted to a single environment, you multiply your chances by the number of environments E produced by the universe (if our visible universe is a good image of the real universe, this number is quite huge).

Then you have to take into account the probability that such an intelligent species can transition to one able to run simulations of the universe at a scale large enough for the simulation to be perceived as the real thing by its inhabitants. And also the probability that the energy required to run such a simulation is available, and that it is interesting enough to be run in a way compatible with what we can observe.

To answer this, we should seriously think about what could be interesting enough for us to spend a large amount of computing power running a simulation of our world: predict the future? have infinite TV programmes of primitives doing primitive things (remember that we would be in a post-human state by then)? provide a playground for post-human brains?

I don't know the processing power currently spent on video games vs. universe simulation vs. small-scale life simulation vs. spying on people, but my guess is that if we reach a post-human state, people will crave fun, because everything that could be understood will have been understood, and predicting the future in a post-human era would require simulating a post-human society. So either we are a simulation for a TV network, or we are in a video game. But I don't think a post-human civilization would want to spend energy simulating many human worlds, unless it's for the fun of playing gods in a society of the past.

Now, factor all that in: E.N is the number of human-level intelligent societies in a big universe after a given time. Multiply that by the probability P that such a society can transition to a post-human one with enough computing power and energy to run simulations. Then multiply by the number S of simulations such a post-human society could run that have the characteristics of a subpart of the universe without any visible intervention from almighty actors. The question is thus: is E.N.P.S > E.N, i.e. is P.S > 1? I'm not so sure.
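This back-of-the-envelope comparison can be made explicit. Since the E.N factor appears on both sides, the odds of being simulated depend only on the product P.S; a tiny sketch (the numbers in the usage note are purely illustrative assumptions, not estimates):

```python
def fraction_simulated(p_transition, sims_per_society):
    """Fraction of human-level societies that are simulated.

    There are E.N real societies and E.N * P * S simulated ones;
    the E.N factor cancels, leaving P.S simulated societies per real one.
    """
    simulated = p_transition * sims_per_society  # P.S per real society
    return simulated / (1.0 + simulated)
```

With P = 0.5 and S = 2, P.S = 1 and the odds are even: `fraction_simulated(0.5, 2)` returns 0.5. With a pessimistic P = 1e-6 and only S = 1000 simulations per post-human society, the fraction drops below 0.1%, which is the sense in which P.S > 1 is the whole question.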

Actually my personal conclusion is: since we have not observed god-like people making fun of us, I would vote that we're living in a non-simulated world. Because seriously, if you could play god in a simulation of the Earth in the 20th century, wouldn't you? Or would you rather just watch primitives live their boring lives?

Talk at Paris Machine Learning Meetup

I'll give a talk at the fantastic Meetup organized by Igor Carron and Franck Bardol, about NCISC and the demo we are setting up on Wikipedia data.

It's tomorrow, Wednesday the 12th, in Paris.

I'll present NCISC and some results on a few standard datasets (Reuters, 20 Newsgroups, OHSUMED), as well as a comparison with Deep-Learning-inspired methods. I'll also present the pipeline we've set up for analyzing Wikipedia from the text, link and category perspectives.

There's a demo (with relatively old and buggy data) here: http://demowiki.exensa.net/

The talk will also be live-streamed with Google+ Hangout, and the slides will be available here.

Scikit-Learn, GraphLab, MLBase, Mahout

There's a flurry of Machine Learning platforms/languages/libraries/systems that all implement almost the same algorithms. Have you tried them? I'm wondering which one is best for expressing a new algorithm quickly, efficiently and scalably.

From what I've seen, SciKit and MLBase sound like the best choices from a usability point of view; GraphLab, MLBase and Mahout are great on scalability; and GraphLab is the most efficient, with SciKit and MLBase just on its tail.

Also it seems that GraphLab is not super easy to deploy.

GraphLab and (obviously) SciKit can interface with the IPython notebook. What about MLBase? At least Spark has Scala and Python bindings, so it should be able to connect.

  • SciKit-learn : http://scikit-learn.org
  • GraphLab : http://graphlab.org/
  • MLBase : http://www.mlbase.org/
  • Mahout : http://mahout.apache.org/

Any other solutions ?

Couchbase 1.8.0 -> 1.8.1 upgrade and rebalance freeze issue

We've stumbled on a (relatively) important issue in version 1.8.0 of Couchbase: during an online upgrade to 1.8.1, the rebalance process stops/hangs/freezes. Various reasons have been proposed for this problem (for instance, the existence of empty vbuckets).

I've figured out that there are good old-fashioned reasons why the upgrade process can fail. So after 5 unsuccessful attempts at a smooth upgrade, I've finally managed to find a procedure that seems to work:

(This is a procedure for an online upgrade, where for each node you successively remove it and rebalance, stop and upgrade it, then add it back and rebalance again. It uses the Debian package.)

  1. backup with cbbackup.
  2. remove the node and rebalance (this has never failed me at this point)
  3. stop the couchbase-server
  4. make sure that epmd is no longer running (it was still running after at least 3 of my previous upgrade attempts); otherwise, kill it!
  5. wipe /opt/couchbase (OK, maybe this is overkill, but at least once I HAD to do it in order to continue the upgrade)
  6. dpkg -i couchbase-server...1.8.1.deb
  7. at this point you can edit /opt/couchbase/etc/couchbase_init.d and add the line « export HOME=/tmp » at the beginning, otherwise you won't be able to stop the server using « service couchbase-server stop »
  8. add the node back to the cluster / rebalance
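The steps above can be strung together per node. A hedged sketch only: the hostnames, credentials and package filename are placeholders, and the `couchbase-cli` and `cbbackup` invocations are assumptions to be checked against your version's documentation. The script echoes each command rather than executing it, so you can review the plan before replacing the `run` indirection with real execution:

```shell
#!/bin/sh
# Sketch of the per-node online-upgrade procedure (steps 1-8 above).
# CLUSTER, NODE, the credentials and DEB are placeholder values.
CLUSTER="cluster-host:8091"
NODE="node-to-upgrade:8091"
AUTH="-u Administrator -p password"
DEB="couchbase-server_1.8.1.deb"

# Dry-run helper: print each command instead of running it.
run() { echo "+ $*"; }

# 1. backup
run cbbackup http://$CLUSTER /backup/couchbase $AUTH
# 2. remove the node and rebalance
run couchbase-cli rebalance -c $CLUSTER $AUTH --server-remove=$NODE
# 3. stop the server on the node
run service couchbase-server stop
# 4. make sure epmd (the Erlang port mapper) is really gone
run pkill epmd
# 5. wipe the old install (overkill, but sometimes required)
run rm -rf /opt/couchbase
# 6. install the new package
run dpkg -i "$DEB"
# 7. make the init script stoppable by exporting HOME=/tmp at its top
run sed -i '2i export HOME=/tmp' /opt/couchbase/etc/couchbase_init.d
# 8. add the node back and rebalance
run couchbase-cli rebalance -c $CLUSTER $AUTH --server-add=$NODE
```

Step 7 is the workaround from the list above; the `sed` one-liner is just one hypothetical way to insert the « export HOME=/tmp » line at the top of the init script.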