问题描述
我们每年生成 20.000.000 个文本文件,每个平均大小约为 250 Kb(35 Kb 压缩).
We have 20.000.000 generated textfiles every year, average size is approx 250 Kb each (35 Kb zipped).
我们必须将这些文件放入某种存档中 10 年.不需要在文本文件中搜索,但我们必须能够通过搜索 5-10 个元数据字段(例如productname"、creationdate"等)来找到一个 texfile.
We must put these files in some kind of archive for 10 years. No need to search inside textfiles, but we must be able to find one texfile by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
我正在考虑压缩每个文件并将它们存储在 SQL Server 数据库中,该数据库具有 5-10 个可搜索(索引)列和一个用于压缩文件数据的 varbinary(MAX) 列.
I'm considering zipping each file and storing them in a SQL Server database with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
数据库会随着时间的推移变得庞大;5-10 TB.所以我认为我们需要对数据进行分区,例如每年保留一个数据库.
The database will be grow huge over the years; 5-10 Tb. So I think we need to partition data for example by keeping one database per year.
我一直在研究在 SQL Server 中对包含数据的 varbinary 列使用 FILESTREAM,但似乎这更适合大于 1 Mb 的 blob?
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is more suitable for blobs > 1 Mb?
有关如何管理此类数据量的任何其他建议?
Any other suggestions on how to manage such data volumes?
推荐答案
Filestream 绝对更适合更大的 blob (750kB-1MB),因为打开外部文件所需的开销开始影响读写性能 vs. vb(max) 小文件的 blob 存储.如果这不是什么大问题(即,在初始写入后读取 blob 数据的频率很低,并且 blob 实际上是不可变的),那么这绝对是一个选择.
Filestream is definitely more suited to larger blobs (750kB-1MB) as the overhead required to open the external file begins to impact read and write performance vs. vb(max) blob storage for small files. If this is not so much of an issue (ie. reads of blob data after the initial write are infrequent, and the blobs are effectively immutable) then it's definitely an option.
我可能会建议将文件直接保存在 vb(max) 列中,如果您可以保证它们的大小不会变大,但是使用 TEXTIMAGE_ON 选项将此表存储在单独的文件组中,这将允许您如有必要,将其从元数据的其余部分移至不同的存储.此外,请确保设计您的架构,以便可以使用分区或通过某些多表方案将 blob 的实际存储拆分到多个文件组,以便您可以在将来必要时扩展到不同的磁盘.
I would probably suggest keeping the files directly in a vb(max) column if you can guarantee they won't get much larger in size, but have this table stored in a seperate filegroup using the TEXTIMAGE_ON option which would allow you to move it to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so the actual storage of blobs can be split over multiple filegroups either using partitions or via some multiple table scheme so you can scale to different disks if necessary in the future.
通过 Filestream 或直接 vb(max) 存储使 blob 与 SQL 元数据直接相关比处理文件系统/SQL 不一致具有许多优势,不仅限于易于备份和其他管理操作.
Keeping the blobs directly associated with the SQL metadata either via Filestream or direct vb(max) storage has many advantages over dealing with filesystem / SQL inconsistencies not limited to ease of backup and other management operations.
这篇关于庞大的 SQL Server 数据库中的 Blob 数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!