I have a small Scala script that is meant to load a MongoDB instance with 100,000,000 sample records. The idea is to get the DB fully loaded, and then do some performance testing (and tune/re-load if necessary).
The problem is that the load-time per 100,000 records increases pretty linearly. At the beginning of my load process it took only 4 seconds to load those records. Now, at nearly 6,000,000 records, it's taking between 300 and 400 seconds to load the same amount (100,000)! That's two orders of magnitude slower! Queries are still snappy, but at this rate, I'll never be able to load the amount of data that I'd like.
Will this work faster if I write a file out with all of my records (all 100,000,000!), and then use mongoimport to import the whole thing? Or are my expectations too high and I'm using the DB beyond what it's supposed to handle?
import java.util.Date

import com.mongodb.casbah.Imports._
import com.mongodb.casbah.commons.MongoDBObject

object MongoPopulateTest {
  // Not shown in the original listing; value assumed from the name.
  val ONE_MILLION = 1000000

  // Fixed seed so every run generates the same sample data.
  val random = new scala.util.Random(12345)
  val connection = MongoConnection()
  val db = connection("mongoVolumeTest")
  val collection = db("testData")

  val INDEX_KEYS = List("A", "G", "E", "F")

  def main(args: Array[String]) {
    populateCoacs(ONE_MILLION * 100)
  }

  def populateCoacs(count: Int) {
    println("Creating indexes: " + INDEX_KEYS.mkString(", "))
    INDEX_KEYS.foreach(key => collection.ensureIndex(MongoDBObject(key -> 1)))

    println("Adding " + count + " records to DB.")
    val start = (new Date()).getTime()
    var lastBatch = start

    for(i <- 0 until count) {
      // The insert call is missing from the original listing; save() is
      // assumed here because the answers refer to "save".
      collection.save(makeCoac())
      // Report the time taken by each block of 100,000 records.
      if(i % 100000 == 0 && i != 0) {
        println(i + ": " + (((new Date()).getTime() - lastBatch) / 1000.0) + " seconds (" + (new Date()).toString() + ")")
        lastBatch = (new Date()).getTime()
      }
    }

    val elapsedSeconds = ((new Date).getTime() - start) / 1000
    println("Done. " + count + " COAC rows inserted in " + elapsedSeconds + " seconds.")
  }

  // Builds one sample document with 18 random fields.
  def makeCoac(): MongoDBObject = {
    MongoDBObject(
      "A" -> random.nextPrintableChar().toString(),
      "B" -> scala.math.abs(random.nextInt()),
      "C" -> makeRandomPrintableString(50),
      "D" -> (if(random.nextBoolean()) { "Cd" } else { "Cc" }),
      "E" -> makeRandomPrintableString(15),
      "F" -> makeRandomPrintableString(15),
      "G" -> scala.math.abs(random.nextInt()),
      "H" -> random.nextBoolean(),
      "I" -> (if(random.nextBoolean()) { 41 } else { 31 }),
      "J" -> (if(random.nextBoolean()) { "A" } else { "B" }),
      "K" -> random.nextFloat(),
      "L" -> makeRandomPrintableString(15),
      "M" -> makeRandomPrintableString(15),
      "N" -> scala.math.abs(random.nextInt()),
      "O" -> random.nextFloat(),
      "P" -> (if(random.nextBoolean()) { "USD" } else { "GBP" }),
      "Q" -> (if(random.nextBoolean()) { "PROCESSED" } else { "UNPROCESSED" }),
      "R" -> scala.math.abs(random.nextInt())
    )
  }

  def makeRandomPrintableString(length: Int): String = {
    var result = ""
    for(i <- 0 until length) {
      result += random.nextPrintableChar().toString()
    }
    result
  }
}
Creating indexes: A, G, E, F
Adding 100000000 records to DB.
100000: 4.456 seconds (Thu Jul 21 15:18:57 EDT 2011)
200000: 4.155 seconds (Thu Jul 21 15:19:01 EDT 2011)
300000: 4.284 seconds (Thu Jul 21 15:19:05 EDT 2011)
400000: 4.32 seconds (Thu Jul 21 15:19:10 EDT 2011)
500000: 4.597 seconds (Thu Jul 21 15:19:14 EDT 2011)
600000: 4.412 seconds (Thu Jul 21 15:19:19 EDT 2011)
700000: 4.435 seconds (Thu Jul 21 15:19:23 EDT 2011)
800000: 5.919 seconds (Thu Jul 21 15:19:29 EDT 2011)
900000: 4.517 seconds (Thu Jul 21 15:19:33 EDT 2011)
1000000: 4.483 seconds (Thu Jul 21 15:19:38 EDT 2011)
1100000: 4.78 seconds (Thu Jul 21 15:19:43 EDT 2011)
1200000: 9.643 seconds (Thu Jul 21 15:19:52 EDT 2011)
1300000: 25.479 seconds (Thu Jul 21 15:20:18 EDT 2011)
1400000: 30.028 seconds (Thu Jul 21 15:20:48 EDT 2011)
1500000: 24.531 seconds (Thu Jul 21 15:21:12 EDT 2011)
1600000: 18.562 seconds (Thu Jul 21 15:21:31 EDT 2011)
1700000: 28.48 seconds (Thu Jul 21 15:21:59 EDT 2011)
1800000: 29.127 seconds (Thu Jul 21 15:22:29 EDT 2011)
1900000: 25.814 seconds (Thu Jul 21 15:22:54 EDT 2011)
2000000: 16.658 seconds (Thu Jul 21 15:23:11 EDT 2011)
2100000: 24.564 seconds (Thu Jul 21 15:23:36 EDT 2011)
2200000: 32.542 seconds (Thu Jul 21 15:24:08 EDT 2011)
2300000: 30.378 seconds (Thu Jul 21 15:24:39 EDT 2011)
2400000: 21.188 seconds (Thu Jul 21 15:25:00 EDT 2011)
2500000: 23.923 seconds (Thu Jul 21 15:25:24 EDT 2011)
2600000: 46.077 seconds (Thu Jul 21 15:26:10 EDT 2011)
2700000: 104.434 seconds (Thu Jul 21 15:27:54 EDT 2011)
2800000: 23.344 seconds (Thu Jul 21 15:28:17 EDT 2011)
2900000: 17.206 seconds (Thu Jul 21 15:28:35 EDT 2011)
3000000: 19.15 seconds (Thu Jul 21 15:28:54 EDT 2011)
3100000: 14.488 seconds (Thu Jul 21 15:29:08 EDT 2011)
3200000: 20.916 seconds (Thu Jul 21 15:29:29 EDT 2011)
3300000: 69.93 seconds (Thu Jul 21 15:30:39 EDT 2011)
3400000: 81.178 seconds (Thu Jul 21 15:32:00 EDT 2011)
3500000: 93.058 seconds (Thu Jul 21 15:33:33 EDT 2011)
3600000: 168.613 seconds (Thu Jul 21 15:36:22 EDT 2011)
3700000: 189.917 seconds (Thu Jul 21 15:39:32 EDT 2011)
3800000: 200.971 seconds (Thu Jul 21 15:42:53 EDT 2011)
3900000: 207.728 seconds (Thu Jul 21 15:46:21 EDT 2011)
4000000: 213.778 seconds (Thu Jul 21 15:49:54 EDT 2011)
4100000: 219.32 seconds (Thu Jul 21 15:53:34 EDT 2011)
4200000: 241.545 seconds (Thu Jul 21 15:57:35 EDT 2011)
4300000: 193.555 seconds (Thu Jul 21 16:00:49 EDT 2011)
4400000: 190.949 seconds (Thu Jul 21 16:04:00 EDT 2011)
4500000: 184.433 seconds (Thu Jul 21 16:07:04 EDT 2011)
4600000: 231.709 seconds (Thu Jul 21 16:10:56 EDT 2011)
4700000: 243.0 seconds (Thu Jul 21 16:14:59 EDT 2011)
4800000: 310.156 seconds (Thu Jul 21 16:20:09 EDT 2011)
4900000: 318.421 seconds (Thu Jul 21 16:25:28 EDT 2011)
5000000: 378.112 seconds (Thu Jul 21 16:31:46 EDT 2011)
5100000: 265.648 seconds (Thu Jul 21 16:36:11 EDT 2011)
5200000: 295.086 seconds (Thu Jul 21 16:41:06 EDT 2011)
5300000: 297.678 seconds (Thu Jul 21 16:46:04 EDT 2011)
5400000: 329.256 seconds (Thu Jul 21 16:51:33 EDT 2011)
5500000: 336.571 seconds (Thu Jul 21 16:57:10 EDT 2011)
5600000: 398.64 seconds (Thu Jul 21 17:03:49 EDT 2011)
5700000: 351.158 seconds (Thu Jul 21 17:09:40 EDT 2011)
5800000: 410.561 seconds (Thu Jul 21 17:16:30 EDT 2011)
5900000: 689.369 seconds (Thu Jul 21 17:28:00 EDT 2011)
Do not create the indexes before inserting: every insert then also has to update each index, which adds overhead. Insert everything first, then create the indexes.
Instead of "save", use MongoDB's batch insert, which can insert many records in one operation. Insert around 5000 documents per batch and you should see a remarkable performance gain.
See method #2 of insert here: it takes an array of documents to insert instead of a single document. Also see the discussion in this thread.
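The batching idea can be sketched without touching the driver at all. In the block below, `loadInBatches` and its `insertBatch` parameter are hypothetical names made up for illustration; `insertBatch` stands in for whatever multi-document insert call your driver version offers (with Casbah that is probably a varargs `insert` on the collection, but treat that as an assumption and check it against your version). The `main` method only records batch sizes so the sketch runs without a database.

```scala
object BatchedLoadSketch {
  // Hypothetical helper: splits `docs` into groups of at most `batchSize`
  // and hands each group to `insertBatch`, which stands in for the driver's
  // multi-document insert. Returns the number of documents handed over.
  def loadInBatches[A](docs: Iterator[A], batchSize: Int)(insertBatch: Seq[A] => Unit): Long = {
    require(batchSize > 0, "batchSize must be positive")
    var total = 0L
    docs.grouped(batchSize).foreach { batch =>
      insertBatch(batch)   // one call per batch instead of one per document
      total += batch.size
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the database: just remember how large each batch was.
    val batchSizes = scala.collection.mutable.ArrayBuffer.empty[Int]
    val total = loadInBatches(Iterator.range(0, 12000), 5000) { batch =>
      batchSizes += batch.size
    }
    // prints "inserted 12000 documents in batches of 5000, 5000, 2000"
    println(s"inserted $total documents in batches of ${batchSizes.mkString(", ")}")
  }
}
```

Because the documents come from an `Iterator`, only one batch is held in memory at a time, which matters at 100,000,000 records. In the question's script the iterator could be something like `Iterator.fill(count)(makeCoac())`.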
And if you want to benchmark more: this is just a guess, but try using a capped collection of a predefined large size to store all your data. A capped collection without an index has very good insertion performance.