HCatalog and Parquet

I'm trying to use Sqoop to import Teradata tables into Impala's Parquet tables. Because Sqoop doesn't support writing Parquet files directly, using Sqoop's HCatalog integration to write Parquet tables seemed very promising. Unfortunately, after several days of trial and error, I realized that it does not work at this time (CDH 5.1.0 with Hive 0.12).

"Should never be used", you will see this error like this page. Hive 0.13 won't help too. Check out this Jira. This is a HCatalog problem.

My current solution is:

  • Dump the data to an ORCFile table using sqoop hcatalog.
  • Run a Hive query to insert the data into the Parquet table (both steps are sketched below).

So I need 30 minutes to dump a big table to my small cluster and another 7 minutes for the Hive insert. Because Impala doesn't support ORCFile, converting the data to Parquet afterwards is the only option.
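
For the record, here is roughly what the two steps look like. The host, credentials, and table names are made-up placeholders; the --hcatalog-* options are Sqoop's standard HCatalog flags.

```bash
# Step 1: dump the Teradata table into an ORC-backed Hive table.
# --create-hcatalog-table builds the staging table from the source
# schema using the storage stanza, so no manual DDL is needed.
sqoop import \
  --connect jdbc:teradata://td-host/DATABASE=sales \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl_user -P \
  --table ORDERS \
  --hcatalog-database default \
  --hcatalog-table orders_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'STORED AS ORC'

# Step 2: copy the rows into a Parquet table that Impala can read.
hive -e 'INSERT OVERWRITE TABLE orders_parquet SELECT * FROM orders_orc;'
```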

I hate this solution!!! That is why I studied the HCatalog code to see why Parquet tables fail. I figured out a hack that lets me use HCatalog to dump data into Parquet directly.

The basic idea is to extend MapredParquetOutputFormat to support getRecordWriter. I will post the full details soon.
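
In the meantime, here is a minimal sketch of the idea. It is written against the Hive 0.13-era classes (package names and the ParquetRecordWriterWrapper constructor vary between versions), and HCatParquetOutputFormat is a name I made up:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
import org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.util.Progressable;

// Placeholder name for the subclass that un-breaks getRecordWriter.
public class HCatParquetOutputFormat extends MapredParquetOutputFormat {

  @Override
  public RecordWriter<Void, ArrayWritable> getRecordWriter(
      final FileSystem ignored, final JobConf job, final String name,
      final Progressable progress) throws IOException {
    // The stock implementation throws RuntimeException("Should never be used")
    // because Hive itself only calls getHiveRecordWriter. HCatalog's output
    // container goes through this plain mapred API instead, so hand back the
    // same wrapper that getHiveRecordWriter builds around realOutputFormat.
    return new ParquetRecordWriterWrapper(realOutputFormat, job, name, progress);
  }
}
```

Pointing the table's outputformat at a class like this, instead of the stock MapredParquetOutputFormat, should then let HCatalog obtain a writer rather than hitting the exception.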

