Datastore for large astrophysical simulation data

Problem description

I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so.
For examples of these codes, you can check out gadget (http://www.mpa-garching.mpg.de/gadget/) and enzo (http://code.google.com/p/enzo/). Those are definitely the two most mature codes (they use different methods).

The outputs from these simulations are huge. Depending on your code, your data is a bit different, but it's always big data. You usually need billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.

Currently, it seems that the best way to read and write this kind of data is to use HDF5 (http://www.hdfgroup.org/HDF5/), which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (those still give me nightmares), but I can't help but think there could be a better way to do this.

I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?

If it helps, we typically store data columnwise. That is, you have a block of all particle IDs, a block of all particle positions, a block of particle velocities, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.

edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.

edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was all the headaches of reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that, and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).

The new issues probably come down to data structure.
HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is and depending on what you are doing with it, it might be a good thing to work at a low level for performance's sake.

Solution

MongoDB: http://www.mongodb.org/
Netezza Products: http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's List of Distributed File Systems: http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems

EDIT

Rationale for my lack of explanation / etc.:

OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."

What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue - so maybe that's what he means by better. If so, he'll need something with some structure - hence the first couple of suggestions.

OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"

Yes, there are several, both structured and "unstructured". Does he want structure, or is he happy to leave the data in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some distributed file systems.

OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle IDs, a block of all particle positions, a block of particle velocities, etc.
It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."

Again, does the OP want better structure, or doesn't he? It seems like he wants both - better structure AND faster... maybe scaling OUT will give him this. This further reinforces the first few options I listed.

OP says (in comments): "I don't know if we can take the hit on io though."

Are there IO requirements? Cost restrictions? What are they?

We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about his current solution other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so...?
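
To make the columnwise layout and the "give me all particles with velocity over x" query from the question more concrete, here is a rough sketch using only the Python standard library. It is not how gadget, enzo, or HDF5 actually store data. It assumes each column sits in its own raw binary file (64-bit integer IDs, and velocities as interleaved vx, vy, vz doubles), and the names `write_column`, `read_chunks`, `ids_with_speed_over`, and `demo` are made up for illustration. The point is only that a column layout lets a query stream through the two columns it needs in fixed-size chunks, without touching positions or loading a whole snapshot into memory.

```python
import os
import tempfile
from array import array


def write_column(path, typecode, values):
    """Write one column as a raw, contiguous binary block (no header)."""
    with open(path, "wb") as f:
        array(typecode, values).tofile(f)


def read_chunks(path, typecode, chunk_items):
    """Yield arrays of at most chunk_items values read from a column file."""
    itemsize = array(typecode).itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * itemsize)
            if not raw:
                break
            chunk = array(typecode)
            chunk.frombytes(raw)
            yield chunk


def ids_with_speed_over(id_path, vel_path, threshold, chunk_particles=1024):
    """Return the IDs of particles whose speed exceeds threshold.

    Assumed layout (hypothetical): IDs are signed 64-bit integers and
    velocities are doubles stored as vx, vy, vz per particle, in ID order.
    """
    hits = []
    limit_sq = threshold * threshold
    id_chunks = read_chunks(id_path, "q", chunk_particles)
    vel_chunks = read_chunks(vel_path, "d", 3 * chunk_particles)
    # Both columns are read in step, so chunk k of each file covers the
    # same particles.
    for ids, vel in zip(id_chunks, vel_chunks):
        for i, pid in enumerate(ids):
            vx, vy, vz = vel[3 * i], vel[3 * i + 1], vel[3 * i + 2]
            if vx * vx + vy * vy + vz * vz > limit_sq:
                hits.append(pid)
    return hits


def demo():
    """Write a four-particle toy snapshot and query it in chunks of two."""
    with tempfile.TemporaryDirectory() as tmp:
        id_path = os.path.join(tmp, "ids.bin")
        vel_path = os.path.join(tmp, "vel.bin")
        write_column(id_path, "q", [10, 11, 12, 13])
        write_column(vel_path, "d", [
            0.0, 0.0, 1.0,   # particle 10, speed 1
            3.0, 4.0, 0.0,   # particle 11, speed 5
            0.5, 0.5, 0.5,   # particle 12, speed below 1
            0.0, 6.0, 8.0,   # particle 13, speed 10
        ])
        return ids_with_speed_over(id_path, vel_path, threshold=2.0,
                                   chunk_particles=2)


if __name__ == "__main__":
    print(demo())
```

With real snapshots the same chunked scan would more likely go through an HDF5 library, reading slices of the ID and velocity datasets. The trade-off the question describes stays the same: the query is a full pass over the relevant columns, so it is bounded by IO rather than by the interface used to express it.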
09-05 19:21