Channel: My Tech Notes

Parquet Schema Incompatible between Pig and Hive

When you use Pig to process data and write it into a Hive table, you need to be careful about Pig namespaces. A complex Pig script may produce columns whose names carry a namespace prefix (e.g. after a group-by), and parquet.pig.ParquetStorer keeps that prefix in the Parquet schema. Unfortunately, Hive's Parquet reader maps table columns to Parquet columns by simple string comparison. See DataWritableReadSupport.java:

@Override
public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
    final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
  final String columns = configuration.get(IOConstants.COLUMNS);
  final Map<String, String> contextMetadata = new HashMap<String, String>();
  if (columns != null) {
    final List<String> listColumns = getColumns(columns);

    final List<Type> typeListTable = new ArrayList<Type>();
    for (final String col : listColumns) {
      // listColumns contains partition columns which are metadata only
      if (fileSchema.containsField(col)) { // containsField returns false because col has no namespace prefix
        typeListTable.add(fileSchema.getType(col));
      } else {
        // below allows schema evolution
        typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
      }
    }
and GroupType.java:

public boolean containsField(String name) {
  return indexByName.containsKey(name);
}
There is no name resolution at all: the lookup is an exact string match. If you want Pig to generate Parquet files that Hive can read, rename each column to a plain, lower-case name (with no namespace prefix) before storing. I haven't tried whether writing into an HCatalog table works, but HCatalog cannot write Parquet tables in CDH 5.1.0 anyway; see my blog post on how to fix that.
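A minimal sketch of the workaround in Pig; the input path, relation names, and columns here are hypothetical, but the pattern is the one described above: strip the namespace with explicit AS aliases before storing.

```pig
logs = LOAD 'input/logs' USING PigStorage('\t') AS (user_id:chararray, bytes:long);
grouped = GROUP logs BY user_id;

-- Without the AS aliases, the output schema would contain names like
-- 'group' and 'logs::bytes', which Hive's exact string match cannot find.
counts = FOREACH grouped GENERATE group AS user_id, SUM(logs.bytes) AS total_bytes;

STORE counts INTO 'output/counts' USING parquet.pig.ParquetStorer();
```

The AS clauses in the FOREACH are what matter: they replace the namespaced field names with plain lower-case ones that match the Hive table columns exactly.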
