Problem description
I have a CSV file that's about 1.3 GB in size:
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 10/4/2021 1:23 PM 1397998768 XBTUSD.csv
This is the complete list of trading data for Bitcoin on the Kraken exchange.
Here's what the data in the CSV looks like:
> Get-Content .\XBTUSD.csv | Select-Object -First 10
1381095255,122.00000,0.10000000
1381179030,123.61000,0.10000000
1381201115,123.91000,1.00000000
1381201115,123.90000,0.99160000
1381210004,124.19000,1.00000000
1381210004,124.18000,1.00000000
1381311039,124.01687,1.00000000
1381311093,124.01687,1.00000000
1381311094,123.84000,0.82300000
1381431835,125.85000,1.00000000
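Each row is a Unix timestamp in seconds, a price, and a volume. For instance, using the same interpretation as the import code further down:

// 1381095255 seconds since the Unix epoch is 2013-10-06 (UTC),
// which matches the first batch date in the timing output later in this post.
var ts = DateTimeOffset.FromUnixTimeSeconds(1381095255).UtcDateTime;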
More information about the file is available here:
The file can be downloaded from here:
https://drive.google.com/drive/folders/1jI3mZvrPbInNAEaIOoMbWvFfgRDZ44TT
See the file XBT.zip. Inside that archive is XBTUSD.csv.
If I create the following table in sqlite:
CREATE TABLE CsvTrades (
    "TimeStamp" TEXT NOT NULL,
    "Price"     TEXT NOT NULL,
    "Volume"    TEXT NOT NULL
);
and run the following to import the CSV (as well as time how long it takes):
$a = Get-Date
sqlite3.exe .\kraken-trades.db -cmd '.mode csv' '.import C:/Users/dharm/XBTUSD.csv CsvTrades'
$b = Get-Date
($b - $a).TotalMinutes
I get the following:
1.56595191666667
1.5 minutes. Not bad!
In the code below, I'm using the CsvHelper package:
https://joshclose.github.io/CsvHelper/getting-started/
Here's a class for the CSV file rows:
public class CsvRow
{
    [CsvHelper.Configuration.Attributes.Index(0)]
    public long TimeStamp { get; set; }

    [CsvHelper.Configuration.Attributes.Index(1)]
    public decimal Price { get; set; }

    [CsvHelper.Configuration.Attributes.Index(2)]
    public decimal Quantity { get; set; }
}
Here's the class for the Trade entity:
// [Index] is Microsoft.EntityFrameworkCore.IndexAttribute (EF Core 5+);
// it creates an index on TimeStamp in the generated table.
[Index(nameof(TimeStamp))]
public class Trade
{
    public int Id { get; set; }
    public decimal Price { get; set; }
    public decimal Quantity { get; set; }
    public DateTime TimeStamp { get; set; }
}
The DbContext is straightforward:
public class AppContext : DbContext
{
    public DbSet<Trade> Trades { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        var folder = Environment.SpecialFolder.LocalApplicationData;
        var path = Environment.GetFolderPath(folder);
        var db_path = $"{path}{System.IO.Path.DirectorySeparatorChar}kraken-trades.db";
        optionsBuilder.UseSqlite($"Data Source={db_path}");
    }
}
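The snippets here don't show how the database file and the Trades table get created; presumably the full program does that before importing. A minimal sketch, assuming EnsureCreated rather than migrations:

// Hypothetical setup step (not shown in the snippets above): create the SQLite
// file and the Trades table, including the TimeStamp index, if they don't
// exist yet. A real project might apply migrations instead.
using (var db = new AppContext())
{
    db.Database.EnsureCreated();
}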
And finally, the function that performs the import:
void initialize_from_csv()
{
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    };

    using (var reader = new StreamReader(@"C:\Users\dharm\XBTUSD.csv"))
    using (var csv = new CsvReader(reader, config))
    {
        // Lazily project each CSV row into a Trade entity.
        var records = csv.GetRecords<CsvRow>().Select(row => new Trade()
        {
            Price = row.Price,
            Quantity = row.Quantity,
            TimeStamp = DateTimeOffset.FromUnixTimeSeconds(row.TimeStamp).UtcDateTime
        });

        using (var db = new AppContext())
        {
            Console.WriteLine(DateTime.Now);

            while (true)
            {
                // Each Take pulls the next chunk from the forward-only CSV reader.
                //var items = records.Take(10_000).ToList();
                var items = records.Take(100_000).ToList();

                if (items.Any() == false) break;

                Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);

                db.AddRange(items);
                db.SaveChanges();
            }

            Console.WriteLine(DateTime.Now);
        }
    }
}
The question
When I let this run, it does indeed proceed to add the items to the database. However, it's pretty slow; I haven't timed it to completion but I could see it taking over an hour.
Is there a good way to make this faster while still using EF Core?
The code referenced above is available here in a single file:
It's a .NET 6 project. Let me know if you have any issues building and running it.
I added some code to time the batch adds. It looks like it's taking around 7 seconds per 100,000 records.
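The timing wrapper itself isn't shown above; a minimal sketch of what it might look like (hypothetical, using a Stopwatch around the same batch loop as in initialize_from_csv), producing output in the shape shown below:

// Hypothetical timing wrapper around the batch loop from initialize_from_csv;
// records and db are the same variables as in that function.
var sw = new System.Diagnostics.Stopwatch();

while (true)
{
    var items = records.Take(100_000).ToList();
    if (items.Any() == false) break;

    Console.Write("Starting batch at {0:yyyy-MM-dd}. ", items[0].TimeStamp);
    sw.Restart();

    db.AddRange(items);
    db.SaveChanges();

    Console.WriteLine("Batch took {0}.", sw.Elapsed);
}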
Starting batch at 2013-10-06. Batch took 00:00:08.7689932.
Starting batch at 2015-12-08. Batch took 00:00:06.7453421.
Starting batch at 2016-04-19. Batch took 00:00:06.7833506.
Starting batch at 2016-06-25. Batch took 00:00:06.7083806.
Starting batch at 2016-08-22. Batch took 00:00:06.7826717.
Starting batch at 2016-11-20. Batch took 00:00:06.4212123.
wc says there are 41,695,261 lines:
$ wc -l XBTUSD.csv
41695261 XBTUSD.csv
So at that rate, it would take around 48 minutes.
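(41,695,261 rows ÷ 100,000 rows per batch ≈ 417 batches; at roughly 7 seconds per batch that's about 2,900 seconds, or around 48 minutes.)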
Some folks have asked, why use EF Core for this? Why not just a direct import?
The example above was intentionally simplified to focus on the speed of the import.
I have more elaborate versions where there are relationships with other entities. In that case:
Setting up the other tables and foreign key properties is more straightforward with EF Core.
I can more easily switch between database backends (SQL Server, PostgreSQL, sqlite); a sketch of what that might look like follows the branch link below.
See for example this branch where multiple symbols are imported. There's a relationship between Trade and Symbol there. There could be other relationships as well.
https://github.com/dharmatech/kraken-trades-database/blob/006/KrakenTradesDatabase/Program.cs
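To illustrate the backend-switching point above, OnConfiguring could pick a provider conditionally. This is only a hypothetical sketch: the environment variable name and connection strings are made up, and UseNpgsql / UseSqlServer require the corresponding EF Core provider packages.

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    // Hypothetical: choose a provider at runtime; only one is used per run.
    var backend = Environment.GetEnvironmentVariable("TRADES_DB_BACKEND") ?? "sqlite";

    switch (backend)
    {
        case "postgres":
            optionsBuilder.UseNpgsql("Host=localhost;Database=kraken_trades;Username=postgres");
            break;
        case "sqlserver":
            optionsBuilder.UseSqlServer(@"Server=(localdb)\mssqllocaldb;Database=kraken-trades");
            break;
        default:
            optionsBuilder.UseSqlite("Data Source=kraken-trades.db");
            break;
    }
}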
Recommended answer
Using EFCore.BulkExtensions:
https://github.com/borisdj/EFCore.BulkExtensions
and changing this line:
db.AddRange(items);
to:
db.BulkInsert(items);
makes the import go from 48 minutes to 5.7 minutes.
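For context, here is roughly what the batch loop from initialize_from_csv looks like with that change applied. This is a sketch based on the code above; BulkInsert is an extension method from the EFCore.BulkExtensions namespace.

// using EFCore.BulkExtensions;

while (true)
{
    var items = records.Take(100_000).ToList();
    if (items.Any() == false) break;

    Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);

    // BulkInsert bypasses the EF Core change tracker and writes each batch
    // in far fewer round trips than AddRange + SaveChanges.
    db.BulkInsert(items);
}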
This version of the project is available here:
Thanks to Caius Jard who suggested EFCore.BulkExtensions in a comment above.