When you use Pig to process data and load it into a Hive table, you need to be careful about Pig namespaces. In a complex Pig script, column names may carry a namespace prefix (for example data::col) after a group-by, and parquet.pig.ParquetStorer keeps that prefix in the Parquet schema. Unfortunately, Hive's Parquet support maps a table column to a Parquet column by simple string comparison. See DataWritableReadSupport.java and GroupType.java:
@Override
public parquet.hadoop.api.ReadSupport.ReadContext init(final Configuration configuration,
    final Map<String, String> keyValueMetaData, final MessageType fileSchema) {
  final String columns = configuration.get(IOConstants.COLUMNS);
  final Map<String, String> contextMetadata = new HashMap<String, String>();
  if (columns != null) {
    final List<String> listColumns = getColumns(columns);
    final List<Type> typeListTable = new ArrayList<Type>();
    for (final String col : listColumns) {
      // listColumns contains partition columns which are metadata only
      if (fileSchema.containsField(col)) { // returns false because col doesn't carry the namespace prefix
        typeListTable.add(fileSchema.getType(col));
      } else {
        // below allows schema evolution
        typeListTable.add(new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.BINARY, col));
      }
    }
    // ...
There is no name resolution at all: containsField in GroupType.java is a plain map lookup, so a Hive column named col never matches a Parquet column stored as data::col.

public boolean containsField(String name) {
  return indexByName.containsKey(name);
}

So if you want Pig to generate Hive-readable Parquet files, give each column a plain lower-case name without the namespace prefix before storing (see the sketch below). I haven't tried whether writing into an HCatalog table works; in any case, HCatalog cannot write Parquet tables in CDH 5.1.0, see my blog for how to fix it.
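Here is a minimal sketch of that rename step (the paths, aliases, and column names are hypothetical). Projecting each field with AS strips the namespace prefix that FLATTEN introduces, before the STORE:

raw       = LOAD 'input.tsv' AS (userid:chararray, amount:double);
grouped   = GROUP raw BY userid;
-- DESCRIBE flattened shows raw::userid and raw::amount;
-- ParquetStorer would write those prefixed names verbatim into the Parquet schema.
flattened = FOREACH grouped GENERATE FLATTEN(raw);
-- Renaming with AS yields plain lower-case names that Hive's string comparison can match.
clean     = FOREACH flattened GENERATE raw::userid AS userid, raw::amount AS amount;
STORE clean INTO '/warehouse/mytable' USING parquet.pig.ParquetStorer();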