Posts Tagged with: cascading
Some things in technology and Big Data are overhyped. Others progress incrementally and steadily, and that progress deserves to be shared. This article covers work in progress on integrating Cascading/Scalding with Tez, along with some initial results. Quoting the announcement: It’s been a fun ride, and the news is that there are results. While the unboxing experience isn’t yet totally pleasant, these results are now very promising. The really good part is that apart from build.sbt, we needed no changes to application code to run with the local, hadoop (1.x API on a 2.6.0 cluster), hadoop2-mr1, and hadoop2-tez back-ends.
In our team, much of our data is stored as a set of folders acquired daily and accessed through Hive. Some of our Scalding jobs needed to process the content of these files. Since the files live on the file system, it is possible to create a MultiSourceTap that combines all the partitions making up the table, but that implies a certain amount of work. We therefore looked for a more practical solution to the problem. It turned out that the Cascading project contains a (very basic) Hive connector called cascading-hive. It can access Hive storage files and query HCatalog to retrieve all the files in which a Hive table's data is stored. We decided to add a Scalding wrapper around this connector so that we could use it in our jobs. The result is the HiveSource tap, available in the scalding-taps project. As I mentioned before, the code of the Scalding tap and of the underlying cascading-hive adapter is quite basic, but it has proven very useful for my team.
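To give an idea of the manual work that the multi-path approach implies, here is a minimal sketch in plain Scala that enumerates the daily partition folders of a table. The `DailyPartitions` object, the `dt=YYYY-MM-DD` folder naming, and the example paths are assumptions made for illustration; they are not taken from the post or from any library, and a real table may be laid out differently.

```scala
import java.time.LocalDate

// Hypothetical helper: builds the list of daily partition folders by hand.
// Assumes one folder per day, named <tableRoot>/dt=YYYY-MM-DD (an assumption,
// not a layout described in the post).
object DailyPartitions {
  def paths(tableRoot: String, from: LocalDate, to: LocalDate): List[String] = {
    val root = tableRoot.stripSuffix("/")
    Iterator
      .iterate(from)(_.plusDays(1))        // from, from + 1 day, ...
      .takeWhile(day => !day.isAfter(to))  // inclusive upper bound
      .map(day => s"$root/dt=$day")        // LocalDate prints as yyyy-MM-dd
      .toList
  }
}
```

The resulting paths would then have to be handed to a multi-path source (for example Scalding's `MultipleTextLineFiles`, or a Cascading MultiSourceTap built from one tap per folder), and the job would still need to know the file format and handle missing days itself. Asking HCatalog where the table's files are stored, as the connector does, avoids this bookkeeping.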