Spark + MLlib

I’m still digging into Spark and MLlib, but I can already say that this combination has some serious strengths.

  • It can easily replace Pig, with the added benefit that you don’t have to mix several technologies for your UDFs
  • It’s really efficient
  • It’s made for iterative methods (NC-ISC is one) and is definitely better suited to ML than pure Hadoop/MRv1 (I don’t know whether YARN performs better, but I doubt it)
  • Scala is a great language
  • Spark is at least as scalable as Hadoop, and can use several cluster management systems: Hadoop YARN, EC2, Mesos and Spark’s own standalone mode

In my opinion, Spark is a near-perfect framework for efficiently writing efficient MapReduce jobs (see the double efficiency there?)
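To give an idea of what "concise" means here, this is a toy word count in plain Python. It is not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers I wrote for illustration, and the input lines are made up. They only mimic the shape of the job, which in Spark would be a chain of the RDD operations `flatMap`, `map` and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce


# Hypothetical helpers in plain Python (not the Spark API).
def flat_map(fn, records):
    """Apply fn to every record and flatten the resulting lists."""
    return [out for rec in records for out in fn(rec)]


def reduce_by_key(fn, pairs):
    """Group (key, value) pairs by key, then fold each group with fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(fn, values) for key, values in groups.items()}


# Made-up input standing in for lines read from a file.
lines = ["spark and pig", "spark and scalding", "spark"]

words = flat_map(str.split, lines)                 # "map" side: split into words
pairs = [(word, 1) for word in words]              # emit (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)  # "reduce" side: sum per word

print(counts["spark"], counts["and"])  # prints "3 2"
```

The whole job is three short steps, with no separate mapper and reducer classes to write.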

Also, it’s funny to see that Twitter’s Scalding partly shares Spark’s goal (i.e. helping you write concise MapReduce jobs), except that Spark adds a much more powerful execution engine that keeps intermediate results in memory when possible. That’s probably not something really big datasets can afford, but it’s good to have it as an option.
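Here is a toy model of why keeping data in memory matters for iterative methods. Again, this is plain Python and not the Spark API: `ToyDataset`, `expensive_load` and `iterate_mean` are names I made up, and the "expensive" load is just a stand-in for re-reading input from disk at every iteration. In Spark, the equivalent switch would be calling `cache()` on an RDD.

```python
# Toy model only: plain Python, not the Spark API.
# ToyDataset, expensive_load and iterate_mean are hypothetical names.


class ToyDataset:
    """Wraps a loader function and optionally keeps its result in memory."""

    def __init__(self, loader, cached=False):
        self._loader = loader
        self._cached = cached
        self._data = None
        self.loads = 0  # how many times the loader actually ran

    def collect(self):
        if self._cached and self._data is not None:
            return self._data  # served from memory, no reload
        self.loads += 1
        data = self._loader()
        if self._cached:
            self._data = data
        return data


def expensive_load():
    # Stand-in for reading and parsing the input from disk.
    return [1.0, 2.0, 3.0, 4.0]


def iterate_mean(dataset, steps=50, lr=0.1):
    """Gradient descent on sum((x - m)^2) / (2n): one full data pass per step."""
    m = 0.0
    for _ in range(steps):
        xs = dataset.collect()
        grad = sum(m - x for x in xs) / len(xs)
        m -= lr * grad
    return m


uncached = ToyDataset(expensive_load, cached=False)
cached = ToyDataset(expensive_load, cached=True)
m_uncached = iterate_mean(uncached)
m_cached = iterate_mean(cached)

print(uncached.loads, cached.loads)  # prints "50 1"
```

Both runs compute the same estimate; the only difference is that the uncached one pays for the load at every iteration, which is roughly what a chain of separate MapReduce jobs does.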


Scikit-Learn, GraphLab, MLBase, Mahout

There’s a flurry of Machine Learning platforms/languages/libraries/systems that all implement almost the same algorithms. Have you tried them? I’m wondering which one is best for expressing a new algorithm quickly, efficiently and scalably.

From what I’ve seen, SciKit and MLBase seem like the best choices for usability; GraphLab, MLBase and Mahout are great on scalability; and GraphLab is the most efficient, with SciKit and MLBase close behind.

Also, it seems that GraphLab is not super easy to deploy.

GraphLab and (obviously) SciKit can interface with the IPython Notebook. What about MLBase? At least Spark has Scala and Python bindings, so it should be able to connect.

Any other solutions?