Hive Metastore Configuration

Recently I wrote a post, "Bad performance of Hive metastore for tables with a large number of partitions." I ran tests in our environment. Here is what I found:

  • Don't configure a Hive client to access the remote MySQL database directly, as in the following hive-site.xml snippet. The performance is really bad, especially when you query a table with a large number of partitions.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql_server/hive_meta</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive_user</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

  • You must start the Hive metastore service on the same server where the Hive MySQL database is running.
    • On the database server, use the same configuration as above.
    • Start the Hive metastore service:
    hive --service metastore
    # If using CDH
    yum install hive-metastore
    /sbin/service hive-metastore start
  • On the Hive client machine, use the following configuration in hive-site.xml:

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://mysql_server:9083</value>
    </property>
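
    With this configuration in place, the client talks to the metastore over Thrift instead of opening its own JDBC connection to MySQL. As a rough illustration (not from the original post), here is a minimal Java sketch using HiveMetaStoreClient; the database name "default" and table name "my_table" are placeholders, and the exact constructor varies slightly between Hive versions.

    import java.util.List;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

    public class MetastoreClientExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the remote metastore service, not at MySQL
            HiveConf conf = new HiveConf();
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://mysql_server:9083");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                // Catalog queries now run inside the metastore service next to
                // MySQL; only Thrift calls cross the network.
                // "default" and "my_table" are placeholder names.
                List<String> partNames =
                        client.listPartitionNames("default", "my_table", (short) -1);
                System.out.println("Partitions: " + partNames.size());
            } finally {
                client.close();
            }
        }
    }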

  • Don't worry if you see this error message:
    ERROR conf.HiveConf: Found both hive.metastore.uris and javax.jdo.option.ConnectionURL Recommended to have exactly one of those config key in configuration
    The reason is this: when Hive does partition pruning, it reads a list of partitions. The current metastore implementation uses JDO to query the metastore database:
    1. Get a list of partition names using db.getPartitionNames().
    2. Then call db.getPartitionsByNames(List<String> partNames). If the list is too large, it is loaded in multiple batches, 300 per batch by default. The JDO calls look like this:
      • For each MPartition object:
      • Send 1 query to retrieve the MPartition's basic fields.
      • Send 1 query to retrieve its MStorageDescriptor.
      • Send 1 query to retrieve data from PART_PARAMS.
      • Send 1 query to retrieve data from PARTITION_KEY_VALS.
      • ...
      • In total, about 10 queries for one MPartition. Because each MPartition is converted into a Partition before being sent back, all of its fields must be populated.
    3. A single query takes about 40 ms in my environment, so for 1,000 partitions that is roughly 1,000 × 10 × 40 ms ≈ 400 seconds.
    4. With a remote Hive metastore service, all of those queries happen locally on the database server, so each one takes far less time and performance improves significantly. But there are still a lot of queries; see the sketch after this list.
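
    To make the two-step pattern above concrete, here is a hedged Java sketch of the same names-then-objects flow, reusing a HiveMetaStoreClient like the one created earlier. The explicit 300-per-batch loop only mirrors the default batching described above; the metastore does this batching internally.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;

    public class PartitionFetchSketch {

        // Default batch size described above
        private static final int BATCH_SIZE = 300;

        // Step 1: fetch all partition names; step 2: fetch full Partition
        // objects batch by batch, mirroring what the metastore does internally.
        static List<Partition> fetchAllPartitions(HiveMetaStoreClient client,
                                                  String db, String table) throws Exception {
            List<String> names = client.listPartitionNames(db, table, (short) -1);
            List<Partition> result = new ArrayList<>(names.size());
            for (int i = 0; i < names.size(); i += BATCH_SIZE) {
                List<String> batch =
                        names.subList(i, Math.min(i + BATCH_SIZE, names.size()));
                result.addAll(client.getPartitionsByNames(db, table, batch));
            }
            return result;
        }
    }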

    I also wrote a version of ObjectStore using EclipseLink JPA with @BatchFetch. Here are the test results: at larger partition counts it is about six times faster than the remote metastore service, and the gap over JDO against remote MySQL is even larger. A sketch of the kind of @BatchFetch mapping involved follows the table.

    Partitions   JDO Remote MySQL   Remote Service   EclipseLink Remote MySQL
    10           6,142              353              569
    100          57,076             3,914            940
    200          116,216            5,254            1,211
    500          287,416            21,385           3,711
    1000         574,606            39,846           6,652
    3000         -                  132,645          19,518
    (times in milliseconds)
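
    For reference, this is the kind of mapping @BatchFetch enables. The entity and field names below are hypothetical stand-ins for Hive's MPartition model, not the actual classes; the point is that EclipseLink loads the related rows for a whole batch of partitions with one IN(...) query per relationship instead of one query per partition.

    import java.util.List;
    import java.util.Map;

    import javax.persistence.ElementCollection;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.ManyToOne;

    import org.eclipse.persistence.annotations.BatchFetch;
    import org.eclipse.persistence.annotations.BatchFetchType;

    // Hypothetical mapping, not Hive's real MPartition class.
    @Entity
    public class PartitionEntity {

        @Id
        private long id;

        private String partitionName;

        // One IN(...) query loads the storage descriptors for a whole batch
        // of partitions instead of one query per partition.
        @ManyToOne
        @BatchFetch(BatchFetchType.IN)
        private StorageDescriptorEntity storageDescriptor;

        // PART_PARAMS-style key/value pairs, also batch fetched.
        @ElementCollection
        @BatchFetch(BatchFetchType.IN)
        private Map<String, String> parameters;

        // PARTITION_KEY_VALS-style partition values, also batch fetched.
        @ElementCollection
        @BatchFetch(BatchFetchType.IN)
        private List<String> values;
    }

    // Minimal stub so the sketch compiles on its own.
    @Entity
    class StorageDescriptorEntity {
        @Id
        private long id;
        private String location;
    }

    With a mapping like this, loading a batch of 300 partitions issues a handful of queries rather than roughly 10 x 300.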
