Reading Data from an External Partitioned Hive Table in Scalding


Our team stores much of its data as a set of folders, one acquired per day, that we access through Hive. Some of our Scalding jobs needed to process the contents of these files. Since the files live on the file system, it is possible to create a MultiSourceTap that joins all the partitions making up the table, but that implies a certain amount of work. So we looked for a more practical solution to the problem. It turned out that the Cascading project contains a (very basic) Hive connector called cascading-hive. It can read Hive storage files and can also query HCatalog to retrieve all the files where a Hive table's data is stored. We decided to write a Scalding wrapper around this connector so that we could use it in our jobs. The result is the HiveSource tap, available in the scalding-taps project. The code of the Scalding tap and of the underlying cascading-hive adapter is quite basic, but it has proven very useful for my team.
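To give an idea of the manual work involved in the MultiSourceTap route, here is a minimal sketch in plain Scala of just the first step: listing the daily folders to read. The `DailyPartitionPaths` object, the `partitionPaths` helper, and the `dt=YYYY-MM-DD` folder naming are all assumptions made for illustration; they are not part of scalding-taps or cascading-hive, and a real table may use a different layout.

```scala
import java.time.LocalDate

object DailyPartitionPaths {
  // Assumed (hypothetical) layout: <tableRoot>/dt=YYYY-MM-DD, one folder per day.
  // Returns the folder path for every day from `first` to `last`, inclusive.
  def partitionPaths(tableRoot: String, first: LocalDate, last: LocalDate): List[String] =
    Iterator
      .iterate(first)(_.plusDays(1))      // first, first + 1 day, ...
      .takeWhile(day => !day.isAfter(last))
      .map(day => s"$tableRoot/dt=$day")  // LocalDate prints as ISO yyyy-MM-dd
      .toList
}
```

Each resulting path would still need its own tap before the taps could be combined, and a hand-built list like this has to be kept in step with the table's actual layout. That bookkeeping is what looking the files up through HCatalog avoids.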
