Problem Description
I have a table that is subject to heavy insert and delete action, and I need to scan it frequently with Scans (only by row-key, no column values).
I noticed that Scan latency increases as the amount of data in the table grows. After closer inspection of ScanMetrics, I noticed that for most higher-latency scans, ScanMetrics.countOfRowsFiltered is MUCH higher than the number of rows that I'm actually requesting to scan (a limit I specify both with .setLimit() on the Scan and with a PageFilter in the FilterList that I set on the scan).
What exactly does countOfRowsFiltered represent? In my testing environments, I can never reproduce a situation where the number of rows scanned is higher than the limit I set, so countOfRowsFiltered is always zero. But in the real environment it is frequently quite high (and according to my calculations, this may be the reason for the gradual increase in overall scan latency).
I can't find any description of this measure out there. Any experience with it, and how to minimize it?
I set up the scan as follows:
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Scan.ReadType;
import org.apache.hadoop.hbase.client.metrics.ScanMetrics;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;

Scan scan = new Scan().withStartRow(rowKeyStart).withStopRow(rowKeyStop);
scan.setCaching(scanCache);
FilterList filterList = new FilterList(
        FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(),   // return only the first KV of each row
        new KeyOnlyFilter(),        // strip values, keep keys only
        new PrefixFilter(myPrefix),
        new PageFilter(limit));
scan.setFilter(filterList);
scan.setCacheBlocks(false);
scan.setLimit(limit);
scan.setReadType(ReadType.PREAD);
scan.setScanMetricsEnabled(true);

ResultScanner scanner = myTable.getScanner(scan);
int processed = 0;
for (Result row : scanner.next(limit)) {
    // do something with this row
    if (++processed >= limit)
        break;
}
ScanMetrics sm = scanner.getScanMetrics();
long scanned = sm.countOfRowsScanned.get();
long filtered = sm.countOfRowsFiltered.get(); // WHAT IS THIS???
scanner.close();
Recommended Answer
I believe I have found the answer:
I was performing Deletes by specifying only the rowKey (even though I only have one column in the row). In this case, a delete marker is put on the row and the row is excluded from all scans and gets, BUT it remains physically present in the underlying infrastructure even after major compactions. This way the Scan spends extra time iterating through those deleted rows and filtering them out to prepare the final result that excludes them.
It looks like the row only gets removed from the underlying infrastructure if the Delete is fully qualified by the RowKey, ColumnFamily, ColumnName, AND TimeStamp of ALL of its columns.
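For illustration, here is a minimal sketch of the difference, assuming a Connection and a single column cf:q (the table, family, and qualifier names are placeholders, not from the original post). The extra Get fetches the cell's timestamp so the Delete can name it exactly:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

static void deleteFullyQualified(Connection conn, byte[] rowKey) throws IOException {
    byte[] family = Bytes.toBytes("cf");   // placeholder column family
    byte[] qualifier = Bytes.toBytes("q"); // placeholder qualifier
    try (Table table = conn.getTable(TableName.valueOf("my_table"))) {
        // Fetch the cell's timestamp so the Delete can be fully qualified.
        Result r = table.get(new Get(rowKey).addColumn(family, qualifier));
        if (r.isEmpty()) return;
        long ts = r.getColumnLatestCell(family, qualifier).getTimestamp();

        // RowKey + ColumnFamily + ColumnName + TimeStamp:
        table.delete(new Delete(rowKey).addColumn(family, qualifier, ts));

        // By contrast, table.delete(new Delete(rowKey)) only places a
        // row-level delete marker on the row.
    }
}

Note the extra read adds a round-trip per delete; if you already know the write timestamp, you can skip the Get.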
FURTHERMORE: it seems that it's not sufficient to just run a major compaction. First the table needs to be flushed, and THEN major-compacted; only then are the deleted rows fully gone, and the Scan doesn't spend extra time filtering them out.
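A minimal sketch of that flush-then-compact sequence with the Admin API (the table name is a placeholder; note that majorCompact only requests the compaction asynchronously, so you may have to wait for it to complete):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

static void flushThenMajorCompact(Connection conn) throws IOException {
    TableName table = TableName.valueOf("my_table"); // placeholder
    try (Admin admin = conn.getAdmin()) {
        admin.flush(table);        // first persist MemStore contents to HFiles
        admin.majorCompact(table); // then request the major compaction (async)
    }
}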
This is harder than I thought...