So the scenario is the following:
I have multiple instances of a web service that write blobs of data to Azure Storage. I need to be able to group blobs into a container (or a virtual directory) depending on when each one was received. Once in a while (every day at worst), older blobs will get processed and then deleted.
I have two options:
Option 1
I make one container called "blobs" (for example) and then store all the blobs in that container. Each blob will use a directory-style name, with the directory name being the time it was received (e.g. "hr0min0/data.bin", "hr0min0/data2.bin", "hr0min30/data3.bin", "hr1min45/data.bin", ... , "hr23min0/dataN.bin", etc.; a new directory every X minutes). The thing that processes these blobs will process the hr0min0 blobs first, then hr0minX and so on (and blobs are still being written while processing is under way).
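As a rough illustration of the naming scheme in Option 1, the sketch below derives the directory-style blob name from the receive time. The helper names `bucket_prefix` and `blob_name` and the 15-minute default for "X" are assumptions made for this example, not part of the original setup.

```python
from datetime import datetime


def bucket_prefix(received: datetime, bucket_minutes: int = 15) -> str:
    """Return the virtual-directory prefix for the time bucket holding `received`."""
    if not 0 < bucket_minutes <= 60 or 60 % bucket_minutes != 0:
        raise ValueError("bucket_minutes must divide 60 evenly")
    # Round the minute down to the start of its bucket (e.g. 37 -> 30).
    bucket_start = (received.minute // bucket_minutes) * bucket_minutes
    return f"hr{received.hour}min{bucket_start}/"


def blob_name(received: datetime, filename: str, bucket_minutes: int = 15) -> str:
    """Build a directory-style blob name such as 'hr0min30/data3.bin'."""
    return bucket_prefix(received, bucket_minutes) + filename
```

One thing to keep in mind with unpadded names like these: blob listings come back in lexicographic order, so "hr10min0/" sorts before "hr2min0/". Zero-padding the hour and minute would make name order match time order, if the processor relies on listing order.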
Option 2
I have many containers, each with a name based on the arrival time (so first will be a container called blobs_hr0min0, then blobs_hr0minX, etc.), and all the blobs in a container are those that arrived at the named time. The thing that processes these blobs will process one container at a time.
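For Option 2, a minimal sketch of deriving one container name per time bucket might look like the following. Azure container names may contain only lowercase letters, digits and hyphens (3 to 63 characters, no consecutive hyphens), so this sketch uses a hyphen where the names above use an underscore. The helpers `is_valid_container_name` and `container_name` and the 15-minute bucket are assumptions for illustration.

```python
import re
from datetime import datetime

# 3-63 characters, lowercase letters/digits/hyphens, starting and ending
# with a letter or digit, and no two hyphens in a row.
_CONTAINER_NAME = re.compile(r"^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def is_valid_container_name(name: str) -> bool:
    """Check a candidate name against the container naming rules above."""
    return _CONTAINER_NAME.match(name) is not None


def container_name(received: datetime, bucket_minutes: int = 15,
                   base: str = "blobs") -> str:
    """Return the per-bucket container name, e.g. 'blobs-hr0min30'."""
    bucket_start = (received.minute // bucket_minutes) * bucket_minutes
    name = f"{base}-hr{received.hour}min{bucket_start}"
    if not is_valid_container_name(name):
        raise ValueError(f"invalid container name: {name!r}")
    return name
```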
So my question is, which option is better? Does option 2 give me better parallelization (since containers can be on different servers), or is option 1 better because many containers can cause other unknown issues?
I don't think it really matters (from a scalability/parallelization perspective), because partitioning in Windows Azure blob storage is done at the blob level, not the container level. Reasons to spread blobs out across different containers have more to do with access control (e.g. SAS) or total storage size.
See here for more details: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/05/10/windows-azure-storage-abstractions-and-their-scalability-targets.aspx
(Scroll down to "Partitions").
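Since partitioning does not depend on the container, a processor for the single-container layout only needs to list blobs by name prefix. The sketch below assumes the `ContainerClient` from the current `azure-storage-blob` Python SDK (which postdates this discussion) and uses its `list_blobs(name_starts_with=...)`, `download_blob` and `delete_blob` methods; `process_bucket` and the `handle` callback are hypothetical names for this example.

```python
def process_bucket(container_client, prefix, handle):
    """Process, then delete, every blob whose name starts with `prefix`.

    `container_client` is assumed to behave like
    azure.storage.blob.ContainerClient; `handle(name, data)` is a
    caller-supplied callback that does the actual processing.
    """
    processed = []
    # The service filters by prefix, so only this time bucket is listed.
    for blob in container_client.list_blobs(name_starts_with=prefix):
        data = container_client.download_blob(blob.name).readall()
        handle(blob.name, data)
        # Delete only after the callback returned without raising.
        container_client.delete_blob(blob.name)
        processed.append(blob.name)
    return processed
```

Called as `process_bucket(client, "hr0min0/", handle)`, this would work through one time bucket while writers keep adding blobs under later prefixes in the same container.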
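Quoting:
Blobs – Since the partition key is down to the blob name, we can load balance access to different blobs across as many servers in order to scale out access to them. This allows the containers to grow as large as you need (within the storage account space limit). The tradeoff is that we don't provide the ability to do atomic transactions across multiple blobs.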