Problem description
I work in biology, specifically with DNA, and the size of the data that comes from sequencing a genome is often a problem.
For those of you who don't have a background in biology, I'll give a quick overview of DNA sequencing. DNA consists of four letters, A, T, G, and C, and their specific order determines what happens in the cell.
A major problem with DNA sequencing technology, however, is the size of the resulting data (for a whole genome, often many gigabytes).
I know that the size of an int in C varies from computer to computer, but it still has far more storage capacity than four choices need. Is there a way to define a type for a 'base' that takes up only 2 or 3 bits? I've searched for how to define a structure, but I'm afraid that isn't what I'm looking for. Thanks.
Also, would this work better in other languages (maybe a higher-level one like Java)?
Recommended answer
Can't you just stuff two ATGC sets into one byte, then? Like:
0 1 0 1 1 0 0 1
A T G C A T G C
So this one byte would represent TC, AC?
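Here is a rough sketch of that layout in C, in case it helps. All the names (FLAG_A, pack_flags, pack_bases, get_base) are made up for illustration and are not from any library. The first function builds the byte from the diagram, with one flag bit per letter in each 4-bit half. The rest assumes every position holds exactly one known base, in which case 2 bits per base are enough and four bases fit in a byte; the mapping A=0, T=1, G=2, C=3 is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout from the diagram: within each 4-bit half, one flag per letter
   in the order A T G C (A is the highest bit of the half). */
enum { FLAG_A = 8, FLAG_T = 4, FLAG_G = 2, FLAG_C = 1 };

/* Combine two 4-bit flag sets into one byte (first set in the high half). */
static uint8_t pack_flags(unsigned first, unsigned second)
{
    return (uint8_t)(((first & 0xFu) << 4) | (second & 0xFu));
}

/* Assumed 2-bit code for a single base; the assignment is arbitrary. */
static unsigned base_to_code(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'T': return 1;
    case 'G': return 2;
    case 'C': return 3;
    default:  assert(!"unexpected base"); return 0;
    }
}

/* Pack n bases from seq into out, which must hold (n + 3) / 4 bytes. */
static void pack_bases(const char *seq, size_t n, uint8_t *out)
{
    memset(out, 0, (n + 3) / 4);
    for (size_t i = 0; i < n; i++) {
        /* The first base of each group of four goes in the top two bits. */
        unsigned shift = 6 - 2 * (unsigned)(i % 4);
        out[i / 4] |= (uint8_t)(base_to_code(seq[i]) << shift);
    }
}

/* Read base i back out of a packed buffer. */
static char get_base(const uint8_t *packed, size_t i)
{
    unsigned shift = 6 - 2 * (unsigned)(i % 4);
    return "ATGC"[(packed[i / 4] >> shift) & 3u];
}
```

With this, pack_flags(FLAG_T | FLAG_C, FLAG_A | FLAG_C) gives 0x59, which is the 0 1 0 1 1 0 0 1 byte above. The 2-bit version stores a sequence in a quarter of the space of one char per base, but it has no room for anything other than the four letters, so whether it fits depends on your data.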