Problem description
We're currently working with Cassandra on a single-node cluster to test application development on it. Right now, we have a really huge data set consisting of approximately 70M lines of text that we would like to dump into Cassandra.
We have tried all of the following:
- Line-by-line insertion using the Python Cassandra driver
- Cassandra's COPY command
- Setting sstable compression to none
We have explored the option of the sstable bulk loader, but we don't have an appropriate .db format for this. Our text file to be loaded has 70M lines that look like:
2f8e4787-eb9c-49e0-9a2d-23fa40c177a4 the magnet programs succeeded in attracting applicants and by the mid-1990s only #about a #third of students who #applied were accepted.
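Each line in that format can be split into the UUID key and the post text on the first whitespace. A minimal parsing sketch, assuming the layout of the sample line above (`parse_line` is a hypothetical helper name, not from the question):

```python
import uuid

def parse_line(line):
    """Split a raw input line into (postid, posttext) on the first space."""
    key, _, text = line.rstrip("\n").partition(" ")
    return uuid.UUID(key), text

# Example using a truncated copy of the sample line:
postid, posttext = parse_line(
    "2f8e4787-eb9c-49e0-9a2d-23fa40c177a4 the magnet programs succeeded in attracting applicants"
)
```

`uuid.UUID` also validates the key as a side effect, so malformed lines fail loudly instead of being silently inserted.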
The column family that we're intending to insert into has this creation syntax:
CREATE TABLE post (
postid uuid,
posttext text,
PRIMARY KEY (postid)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={};
Problem: Loading the data into even a simple column family is taking forever -- about 5 hours for the 30M lines inserted so far. We were wondering if there is any way to expedite this, since loading the same 70M lines of data into MySQL takes approximately 6 minutes on our server.
Have we missed something? Or could someone point us in the right direction?
Many thanks in advance!
The sstableloader is the fastest way to import data into Cassandra. You have to write the code to generate the sstables, but if you really care about speed this will give you the most bang for your buck.
This article is a bit old, but the basics still apply to how you generate the SSTables.
If you really don't want to use the sstableloader, you should be able to go faster by doing the inserts in parallel. A single node can handle multiple connections at once, and you can scale out your Cassandra cluster for increased throughput.
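A minimal sketch of the parallel-insert approach using the DataStax Python driver's `execute_concurrent_with_args` with a prepared statement. The contact point, keyspace name (`ks`), chunk size, and concurrency level below are assumptions to adjust for your cluster; the input file is assumed to have the `uuid<space>text` layout shown in the question.

```python
import uuid
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def load(path):
    # Driver imports kept inside the function so the parsing/chunking
    # helpers are usable without cassandra-driver installed.
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    cluster = Cluster(["127.0.0.1"])       # assumed contact point
    session = cluster.connect("ks")        # assumed keyspace name
    insert = session.prepare(
        "INSERT INTO post (postid, posttext) VALUES (?, ?)"
    )
    with open(path) as f:
        rows = ((uuid.UUID(key), text)
                for key, _, text in
                (line.rstrip("\n").partition(" ") for line in f))
        for chunk in chunks(rows, 500):
            # Keep ~100 requests in flight per call; enough to keep a
            # single node busy without overwhelming it.
            execute_concurrent_with_args(session, insert, chunk,
                                         concurrency=100)
    cluster.shutdown()
```

Prepared statements avoid re-parsing the CQL on every insert, and `execute_concurrent_with_args` pipelines many asynchronous requests over the same connections, which is usually where the biggest win over line-by-line `session.execute` calls comes from.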