I’m still digging into Spark and MLlib, but for now, I can clearly say that this combination has some serious strengths.
- It can easily replace Pig, with the added benefit that you don’t have to mix several technologies for your UDFs (a quick sketch follows the list)
- It’s really efficient
- It’s made for iterative methods (which is the case for NC-ISC) and is definitely better suited for ML than pure Hadoop/MRv1 (I don’t know whether YARN performs better, but I doubt it)
- Scala is a great language
- Spark is at least as scalable as Hadoop, and can run on several cluster management systems: Hadoop YARN, EC2, Mesos, and Spark’s own standalone mode
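To make the Pig comparison concrete, here is a minimal sketch of the classic GROUP BY / COUNT job in Spark’s Scala API. The input path and the tab-separated (user, page) layout are hypothetical; the point is that the “UDF” is just a plain Scala function sitting next to the job, with no separate language or registration step:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object PigStyleJob {
  // A "UDF" is just an ordinary Scala function.
  def normalize(s: String): String = s.trim.toLowerCase

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "PigStyleJob")
    // Hypothetical input: tab-separated (user, page) view records.
    val views = sc.textFile("hdfs:///logs/views.tsv")
      .map(_.split("\t"))
      .map(fields => (normalize(fields(1)), 1))
    // GROUP BY page + COUNT, the bread and butter of Pig scripts.
    val counts = views.reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///out/page_counts")
    sc.stop()
  }
}
```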
My opinion is that Spark is the near-perfect framework to efficiently write efficient MapReduce jobs (see the double efficiency here?)
Also, it’s funny to see that Twitter’s Scalding partially shares Spark’s goal (i.e. helping you write concise MapReduce jobs), except that Spark adds a much more powerful execution engine that can keep intermediate computations in memory when possible. That’s probably not something really big datasets can afford, but it’s good to have as an option.
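That in-memory option is also what makes the iterative methods from the list above cheap. Here is a minimal sketch of the pattern: a toy gradient descent over hypothetical (x, y) pairs, where a single `.cache()` call means every pass after the first reads from memory instead of re-reading HDFS (the path, learning rate, and iteration count are all made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "IterativeSketch")
    // Hypothetical input: one "x y" pair per line, parsed once...
    val data = sc.textFile("hdfs:///data/xy.txt")
      .map { line =>
        val Array(x, y) = line.split(" ").map(_.toDouble)
        (x, y)
      }
      .cache() // ...then kept in memory across iterations

    // Toy gradient descent for y = w * x, minimizing squared error.
    var w = 0.0
    for (_ <- 1 to 20) {
      // Each pass scans the cached RDD, not the raw files on HDFS,
      // which is exactly what pure MRv1 jobs cannot do.
      val grad = data.map { case (x, y) => 2 * (w * x - y) * x }.mean()
      w -= 0.1 * grad
    }
    println("final w = " + w)
    sc.stop()
  }
}
```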