Scalding + Cascading + TEZ = ♥


Some things in technology and Big Data are overhyped. In other areas, progress is incremental and continuous, and the information needs to be shared. This article covers work in progress on integrating Cascading/Scalding with Tez, along with some initial results.

Quoting the announcement: It’s been a fun ride, and the news is that there are results. While the unboxing experience isn’t yet totally pleasant, these results are now very promising.

The really good part is that apart from build.sbt, we needed no changes to application code to run with the local, hadoop (1.x API on a 2.6.0 cluster), hadoop2-mr1, and hadoop2-tez back-ends.
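As a rough illustration of what a build-only switch between back-ends might look like, here is a minimal build.sbt sketch. It is not the build file used for the benchmark. The `fabric` system property is a hypothetical name introduced here, the artifact names assume the Cascading 3.0 module layout published on Conjars, and the patched Scalding and Tez builds mentioned in this article are not reproduced.

```scala
// build.sbt (sketch): choose the Cascading back-end without touching application code.
// Assumption: module names follow the Cascading 3.0 layout on Conjars.
// Assumption: the benchmark relied on patched Scalding/Tez builds, not shown here.

val cascadingVersion = "3.0.0-wip-115" // version string quoted in this article

// Hypothetical switch, e.g. `sbt -Dfabric=hadoop2-tez package`
val fabric = sys.props.getOrElse("fabric", "local")

// Map the four back-ends named in the announcement to a Cascading module.
val fabricModule = fabric match {
  case "local"       => "cascading-local"
  case "hadoop"      => "cascading-hadoop"
  case "hadoop2-mr1" => "cascading-hadoop2-mr1"
  case "hadoop2-tez" => "cascading-hadoop2-tez"
  case other         => sys.error(s"unknown fabric: $other")
}

resolvers += "Conjars" at "http://conjars.org/repo"

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core"  % "0.13.1",
  "cascading"   %  "cascading-core" % cascadingVersion,
  "cascading"   %  fabricModule     % cascadingVersion
)
```

The point of the sketch is only that the choice of execution fabric lives in one place in the build, which matches the claim that application code stayed unchanged.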

Numbers:

  • Full dataset: about 116M lines in 6 distinct CSV inputs
    • hadoop: about 18 hours (pretty much busy all the time, maxing out either the LAN, disk bandwidth OR CPU depending on phases)
    • tez: about 8 hours (with no LAN and few disk saturation periods, and apparent room for improvement in CPU/task allocation — confident a couple hours could be shaved).
  • Reduced dataset (integration testing dataset): about 2.3M lines in 6 distinct inputs
    • hadoop: 112 minutes
    • tez: 6.25 minutes 
  • In common:
    • the job is a cascade made of 20 Flows, which compile into about 420 Cascading steps (Hadoop) or 20 DAG (TEZ)
    • about 10K lines of Scala code
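To put the numbers above side by side, here is a small sketch that computes the speedup as baseline wall-clock time divided by Tez wall-clock time. The object and method names are made up for illustration, and the inputs are the approximate figures quoted in the list.

```scala
// Speedup = baseline wall-clock time / Tez wall-clock time.
// Inputs are the approximate figures quoted above, expressed in minutes.
object Speedup {
  def speedup(baselineMinutes: Double, tezMinutes: Double): Double =
    baselineMinutes / tezMinutes

  def main(args: Array[String]): Unit = {
    val full    = speedup(18 * 60, 8 * 60) // full dataset: about 18 h vs about 8 h
    val reduced = speedup(112, 6.25)       // reduced dataset: 112 min vs 6.25 min
    println(f"full dataset:    $full%.2fx")
    println(f"reduced dataset: $reduced%.2fx")
  }
}
```

With these inputs the ratios come out at roughly 2.25 for the full dataset and roughly 17.9 for the reduced one, so the relative gain is far larger on the small dataset.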

In the small-dataset experiment, hadoop suffers a lot from the zillions of step setup ceremonials it has to perform with YARN, whereas TEZ apps are higher-level and tend to stay much longer from the ResourceManager’s point of view.

Results appear identical so far (still busy comparing and ensuring we’ve covered all code paths, which we haven’t yet, but this looks really good).

I am grateful for everyone who had the patience to sift through the huge haystacks of logs and graphs I sent, and for the time spent writing patches in the dark for me to test.

Chris, I have no idea how much time I tied you up on this, but wow, thanks!


The announcement above appeared on the mailing list in early April 2015. A month later, a follow-up article presented the regression tests run against Tez, along with a number of bugs that had been fixed.

With reported execution times up to 15x faster, Tez and Scalding seem to be really delivering from Cascading's perspective. New execution fabrics are showing that Cascading/Scalding is enterprise-ready, on a path of continuous improvement, and able to withstand the exciting claims from the growing community of Spark users.

For the tech-savvy: the results above were achieved using a patched tez-0.6.0, a patched scalding-0.13.1, and cascading-3.0.0-wip, currently cascading 3.0.0-wip-115(+).

Thanks to Sylvain Veyrié & Cyrille Chépélov for all their preliminary work and their cooperation with Chris K Wensel.
