What exactly does countOfRowsFiltered in ScanMetrics represent when scanning HBase?

Problem description

I have a table that is subject to heavy insert and delete activity, and I need to scan it frequently (by row key only, with no column values).

I noticed that Scan latency increases as the amount of data in the table grows. After closer inspection of ScanMetrics, I noticed that for most higher-latency scans, ScanMetrics.countOfRowsFiltered is MUCH higher than the number of rows I actually request to scan (which I specify both via .setLimit() on the Scan and via the PageFilter() in the FilterList that I set on the scan).

What exactly does countOfRowsFiltered represent? In my test environments I can never reproduce a situation where the number of rows scanned exceeds the limit I set, so countOfRowsFiltered is always zero there. But in the production environment it is frequently quite high (and, by my calculations, it may be the cause of the gradual increase in overall scan latency).

I can't find any description of this metric anywhere. Does anyone have experience with it, and with how to minimize it?

I set up the Scan as follows:

// Assumes HBase 2.x client imports: org.apache.hadoop.hbase.client.*,
// org.apache.hadoop.hbase.client.Scan.ReadType, org.apache.hadoop.hbase.filter.*,
// org.apache.hadoop.hbase.client.metrics.ScanMetrics.
Scan scan = new Scan().withStartRow(rowKeyStart).withStopRow(rowKeyStop);
scan.setCaching(scanCache);
FilterList filterList = new FilterList(
        FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(),
        new KeyOnlyFilter(),
        new PrefixFilter(myPrefix),
        new PageFilter(limit));

scan.setFilter(filterList);
scan.setCacheBlocks(false);
scan.setLimit(limit);
scan.setReadType(ReadType.PREAD);

scan.setScanMetricsEnabled(true);
ResultScanner scanner = myTable.getScanner(scan);

int processed = 0;
for (Result row : scanner.next(limit))
{
    // do something with this row
    if (++processed >= limit)
        break;
}

ScanMetrics sm = scanner.getScanMetrics();

long scanned = sm.countOfRowsScanned.get();
long filtered = sm.countOfRowsFiltered.get(); // WHAT IS THIS???

scanner.close();
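For debugging, it can help to dump every counter ScanMetrics collects, not just the two read above. A minimal sketch, assuming the HBase 2.x client, where ScanMetrics exposes getMetricsMap():

// Print every counter the scan collected, countOfRowsFiltered included.
static void dumpScanMetrics(ScanMetrics sm) {
    for (java.util.Map.Entry<String, Long> e : sm.getMetricsMap().entrySet()) {
        System.out.println(e.getKey() + " = " + e.getValue());
    }
}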

Recommended answer

I believe I have found the answer:

I was performing Deletes by specifying only the rowKey (even though the row has only one column). In that case a delete marker is placed on the row and the row is excluded from all Scans and Gets, BUT it remains physically present in the underlying storage even after major compactions. The Scan therefore spends extra time iterating over those deleted rows and filtering them out to prepare a final result that excludes them.

It looks like a row only gets removed from the underlying storage if the Delete is fully qualified by the RowKey, ColumnFamily, ColumnName, AND TimeStamp of ALL of its columns.
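For illustration, a minimal sketch of the two Delete styles discussed above, assuming the HBase 2.x client API (org.apache.hadoop.hbase.client.{Table, Delete}, org.apache.hadoop.hbase.util.Bytes); the family, qualifier, and timestamp are hypothetical placeholders:

// Row-key-only delete: places a row-level delete marker on the whole row
// (this is what the question's code was doing).
static void deleteByRowKeyOnly(Table table, byte[] rowKey) throws IOException {
    table.delete(new Delete(rowKey));
}

// Fully qualified delete: names the family, qualifier, and exact cell
// timestamp, so only that specific cell version is targeted.
static void deleteFullyQualified(Table table, byte[] rowKey, long ts) throws IOException {
    Delete d = new Delete(rowKey);
    d.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), ts); // hypothetical names
    table.delete(d);
}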

FURTHERMORE: it seems it is not sufficient to just run a Major Compaction. The table first needs to be Flushed and THEN major-compacted; only then are the deleted rows fully gone, and the Scan no longer spends extra time filtering them out.
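A minimal sketch of that flush-then-major-compact sequence through the Admin API (the table name is a hypothetical placeholder; the HBase shell equivalents are flush and major_compact):

// Assumes org.apache.hadoop.hbase.TableName and
// org.apache.hadoop.hbase.client.{Connection, Admin}.
static void flushThenMajorCompact(Connection conn, String table) throws IOException {
    TableName tn = TableName.valueOf(table);
    try (Admin admin = conn.getAdmin()) {
        admin.flush(tn);        // persist memstore contents (and delete markers) to HFiles
        admin.majorCompact(tn); // asynchronous request; compaction runs in the background
    }
}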

This is harder than I thought...
