Wrong count(*) result in a Hive table

This article covers how to handle an incorrect count(*) result on a Hive table. It may be a useful reference for anyone running into the same problem.

Problem description

I created a table in Hive:

CREATE TABLE IF NOT EXISTS daily_firstseen_analysis (
    firstSeen         STRING,
    category          STRING,
    circle            STRING,
    specId            STRING,
    language          STRING,
    osType            STRING,
    count             INT)
    PARTITIONED BY  (day STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS orc;

count(*) does not return the correct result for this table:

hive> select count(*) from daily_firstseen_analysis;
OK
75
Time taken: 0.922 seconds, Fetched: 1 row(s)

However, the table actually contains 959 rows:

hive> select * from daily_firstseen_analysis;
....
Time taken: 0.966 seconds, Fetched: 959 row(s)

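The query returns 959 rows. ANALYZE with noscan reports the following per-partition statistics, whose numRows values also add up to 959: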

hive> ANALYZE TABLE daily_firstseen_analysis PARTITION(day) COMPUTE STATISTICS noscan;
    Partition logdata.daily_firstseen_analysis{day=20140521} stats: [numFiles=6, numRows=70, totalSize=4433, rawDataSize=37202]
    Partition logdata.daily_firstseen_analysis{day=20140525} stats: [numFiles=6, numRows=257, totalSize=4937, rawDataSize=136385]
    Partition logdata.daily_firstseen_analysis{day=20140523} stats: [numFiles=6, numRows=211, totalSize=5059, rawDataSize=112140]
    Partition logdata.daily_firstseen_analysis{day=20140524} stats: [numFiles=6, numRows=280, totalSize=5257, rawDataSize=148808]
    Partition logdata.daily_firstseen_analysis{day=20140522} stats: [numFiles=6, numRows=141, totalSize=4848, rawDataSize=74938]
    OK
    Time taken: 5.098 seconds
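
To narrow down where the counts diverge, a per-partition count can be compared against the numRows values reported above. This is a minimal sketch, assuming the grouped query is computed by reading the data rather than answered from stored statistics (not verified on this Hive build):

```sql
-- Count rows per partition and compare each value with numRows
-- from the ANALYZE output for the same day.
SELECT day, COUNT(*) AS row_count
FROM daily_firstseen_analysis
GROUP BY day
ORDER BY day;
```

If every row_count matches its numRows, the data itself is consistent and only the un-grouped count(*) is off.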

I am using Hive version 0.13.0.2.1.2.0-402.

NOTE: I see this count(*) issue only when data has been inserted into the table more than once. Tables populated with a single insert do not have this issue.

Solution

I had the same problem, and using ANALYZE fixed it. Running these commands in order should give you the correct count:

hive> ANALYZE TABLE daily_firstseen_analysis PARTITION(day) COMPUTE STATISTICS;
hive> SELECT COUNT(*) FROM daily_firstseen_analysis;

That is, you have to run the ANALYZE command before the count. Note that the command above omits noscan, unlike the one in your question, so your question already contains half of the answer.
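
One plausible explanation, not confirmed in the original thread, is that Hive answers a plain count(*) from statistics stored in the metastore instead of scanning the data, a behavior controlled by the hive.compute.query.using.stats setting. A sketch under that assumption: disabling the setting for the session should force the count to read the table, and DESCRIBE FORMATTED shows the statistics currently stored for one partition.

```sql
-- Assumption: the wrong count comes from stale metastore statistics.
-- Turn off stats-based answers for this session so count(*) scans the data.
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM daily_firstseen_analysis;

-- Inspect the stored statistics (numRows, numFiles, ...) for a single partition.
DESCRIBE FORMATTED daily_firstseen_analysis PARTITION (day='20140521');
```

If the count is correct with the setting disabled, refreshing the statistics with ANALYZE after each insert, as in the answer above, keeps the faster stats-based count usable.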

08-04 16:02