EXPLAIN
Official wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain. The output of the explain command consists of two parts: STAGE DEPENDENCIES and STAGE PLANS. The ABSTRACT SYNTAX TREE section was removed in 2.1.0 and is planned to return as a standalone `explain ast` command in 4.0.0 (related issue: https://issues.apache.org/jira/browse/HIVE-15932).
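For reference, the general shape of the command as documented on the wiki looks roughly like the following; the exact set of options varies by Hive version (e.g. VECTORIZATION only appears in newer releases):

```sql
-- Not runnable as-is: this is the grammar, not a concrete statement.
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query;
```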
The following uses `explain select sum(number) from onecol where number>=4` as an example to walk through the explain output.
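The examples assume a small test table `onecol` with a single int column named `number` (this matches the table properties in the extended output below: `columns number`, `columns.types int`, `numRows 3`). A minimal sketch of how such a table could be created; the row values are an assumption:

```sql
-- Hypothetical recreation of the onecol test table:
-- one INT column; the three values here are illustrative.
CREATE TABLE onecol (number INT);
INSERT INTO onecol VALUES (3), (4), (5);
```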
Stage Dependencies (STAGE DEPENDENCIES)
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Two stages are shown here: Stage-1 is the root stage, and Stage-0 depends on Stage-1, so it runs after Stage-1 completes. More complex queries chain more stages, as in the sketch below.
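As a hypothetical illustration (stage numbering and count depend on the query, the execution engine, and the Hive version), a query that both aggregates and globally sorts typically plans as two chained map-reduce stages plus the fetch stage, e.g. Stage-2 depending on Stage-1 and Stage-0 depending on Stage-2:

```sql
-- Hypothetical query: group-by aggregation followed by a global sort.
-- On the MR engine this usually needs one map-reduce stage for the
-- aggregation and a second one for the ORDER BY, before the fetch stage.
EXPLAIN
SELECT number, count(*) AS cnt
FROM onecol
GROUP BY number
ORDER BY cnt;
```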
Stage Plans (STAGE PLANS)
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: onecol
Statistics: Num rows: 3 Data size: 3 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (number >= 4) (type: boolean)
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(number)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
In this example there is one map-reduce stage (Stage-1), plus a Fetch Operator through which the client retrieves the results.
Some common operators:
- TableScan: reads the table; common attribute: alias
- Select Operator: projects columns
- Group By Operator: grouping and aggregation; common attributes: aggregations, mode; when there is no keys attribute, all rows fall into a single group
- Reduce Output Operator: emits results to the reduce phase; common attribute: sort order
- Fetch Operator: client-side retrieval of results; common attribute: limit
A map-reduce stage consists of two parts:
- Map Operator Tree: a mapping from table aliases to the operator tree run on the map side
- Reduce Operator Tree: the operator tree that processes all rows on the reducer side of the map-reduce job. In this example, Stage-1's Reduce Operator Tree performs the sum aggregation. A sketch of a query that exercises more of these operators follows.
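As a hypothetical example (statistics and stage numbering will differ, but the plan shape is typical for the MR engine), a query that projects and groups by a key adds a Select Operator to the map-side tree and a keys attribute to the Group By Operator:

```sql
-- Hypothetical query to surface more of the operators listed above:
-- the Filter Operator applies the WHERE predicate, a Select Operator
-- projects the grouping column, and the Group By Operator now carries
-- a keys attribute (keys: number), so the Reduce Output Operator gets
-- a non-empty sort order (+) instead of collapsing all rows into a
-- single group.
EXPLAIN
SELECT number, sum(number)
FROM onecol
WHERE number >= 4
GROUP BY number;
```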
EXPLAIN EXTENDED
Adding the extended keyword to explain produces additional output, such as file path information. Below is part of the output of `explain extended select sum(number) from onecol where number>=4`.
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: onecol
Statistics: Num rows: 3 Data size: 3 Basic stats: COMPLETE Column stats: NONE
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (number >= 4) (type: boolean)
Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(number)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
null sort order:
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
tag: -1
value expressions: _col0 (type: bigint)
auto parallelism: false
Path -> Alias:
hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol [onecol]
Path -> Partition:
hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol
Partition
base file name: onecol
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
bucket_count -1
column.name.delimiter ,
columns number
columns.comments
columns.types int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol
name default.onecol
numFiles 3
numRows 3
rawDataSize 3
serialization.ddl struct onecol { i32 number}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 6
transient_lastDdlTime 1556243893
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
bucket_count -1
column.name.delimiter ,
columns number
columns.comments
columns.types int
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://localhost:9000/Users/ruzzzz/devtools/apache-hive-2.3.4-bin/warehouse/onecol
name default.onecol
numFiles 3
numRows 3
rawDataSize 3
serialization.ddl struct onecol { i32 number}
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 6
transient_lastDdlTime 1556243893
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.onecol
name: default.onecol
Truncated Path -> Alias:
/onecol [onecol]
Needs Tagging: false
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
GlobalTableId: 0
directory: hdfs://localhost:9000/tmp/hive/ruzzzz/a1b20a04-24ad-4ea9-8285-3f438e8dd133/hive_2019-04-26_11-17-27_657_3718859854493336655-1/-mr-10001/.hive-staging_hive_2019-04-26_11-17-27_657_3718859854493336655-1/-ext-10002
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Stats Publishing Key Prefix: hdfs://localhost:9000/tmp/hive/ruzzzz/a1b20a04-24ad-4ea9-8285-3f438e8dd133/hive_2019-04-26_11-17-27_657_3718859854493336655-1/-mr-10001/.hive-staging_hive_2019-04-26_11-17-27_657_3718859854493336655-1/-ext-10002/
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
columns _col0
columns.types bigint
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
LIMIT
In many cases a LIMIT query still has to execute the entire query and only then return a subset of the results. This is usually wasteful, so it should be avoided where possible. Hive has a property that, when enabled, lets LIMIT sample the source data instead:
<property>
<name>hive.limit.optimize.enable</name>
<value>true</value>
</property>
Once hive.limit.optimize.enable is set to true, two further parameters become relevant:
set hive.limit.row.max.size=100000;    -- minimum amount of data guaranteed per row when LIMIT samples a subset of the data; default 100000
set hive.limit.optimize.limit.file=10; -- maximum number of files that can be sampled for a simple LIMIT query; default 10
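A minimal sketch of applying these settings in a session (the values shown are just the defaults, not tuning advice):

```sql
-- Enable LIMIT sampling for this session and bound the sample size.
SET hive.limit.optimize.enable=true;
SET hive.limit.row.max.size=100000;
SET hive.limit.optimize.limit.file=10;

-- With the optimization enabled, this LIMIT can be answered from a
-- sample of the input files instead of a full table scan.
SELECT number FROM onecol LIMIT 2;
```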
The drawback of this feature is that useful input data may never be processed: any query that requires a reduce step, such as a join or a group by, and most calls to aggregate functions, can produce very different results. For instance, a sum computed over a sampled subset of the rows will undercount the true total.