c++ - Fast counting of nucleotide types in a large number of sequences

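First, some background on my problem.
I am a bioinformatician, which means that I use computational processing to try to answer biological questions. For my problem, I have to manipulate a file called a FASTA file, which looks like this: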

>Header 1
ATGACTGATCGNTGACTGACTGTAGCTAGC
>Header 2
ATGCATGCTAGCTGACTGATCGTAGCTAGC
ATCGATCGTAGCT

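So a FASTA file is basically just a header line, introduced by a '>' character, followed by a sequence of nucleotides spread over one or more lines. A nucleotide is a character that can take one of 5 possible values: A, T, C, G or N.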

What I would like to do is count the number of times each nucleotide type appears. So, given this dummy FASTA file:

>Header 1
ATTCGN

the result should be: A:1 T:2 C:1 G:1 N:1
Here is what I have so far:

ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        nucleotide_counts['A'] = boost::count(sequence, 'A');
        nucleotide_counts['T'] = boost::count(sequence, 'T');
        nucleotide_counts['C'] = boost::count(sequence, 'C');
        nucleotide_counts['G'] = boost::count(sequence, 'G');
        nucleotide_counts['N'] = boost::count(sequence, 'N');
        sequence = "";
    }
}

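So it reads the file line by line, and when it encounters a '>' as the first character of a line, it knows that the sequence is complete and starts counting. The problem I am facing is that I have millions of sequences with several billion nucleotides in total. I can see that my method is not optimized, because I call boost::count five times on the same sequence.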

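Other things I have tried:

Parsing the sequence and incrementing a counter for each nucleotide type. I used a map<char, double> to map each nucleotide to a value, but this was slower than the boost solution.

Using std::count from the algorithm library, but this was too slow as well.

I searched the internet for solutions, but every solution I found works well only when the number of sequences is small, which is not my case. Do you have any ideas that could help me speed things up?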

Edit 1:
I also tried this version, but it was 2 times slower than the boost one:

ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        for(int i = 0; i < sequence.size(); i++) {
           nucleotide_counts[sequence[i]]++;
        }
        sequence = "";
    }
}

Edit 2: Thanks to everyone in this thread, I was able to obtain about a 30x speed-up compared to the original boost solution. Here is the code:

#include <array> // std::array
#include <fstream> // std::ifstream
#include <string> // std::string

// Takes the sequence by const reference to avoid copying it on every call
void count_nucleotides(std::array<double, 26> &nucleotide_counts, const std::string &sequence) {
    for(unsigned int i = 0; i < sequence.size(); i++) {
        ++nucleotide_counts[sequence[i] - 'A'];
    }
}

std::ifstream sequence_file(input_file.c_str());
std::string line;
std::string sequence = "";
std::array<double, 26> nucleotide_counts{}; // zero-initialize all counters

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        count_nucleotides(nucleotide_counts, sequence);
        sequence = "";
    }
}
// The last sequence has no '>' after it, so count it once the loop ends
count_nucleotides(nucleotide_counts, sequence);

Best answer

If you want speed and can use an array, don't use a map. Also, std::getline can take a custom delimiter (instead of \n).

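ifstream sequence_file(input_file.c_str());
string sequence = "";
std::array<int, 26> nucleotide_counts{}; // value-initialized so every counter starts at 0

// For one sequence
// Note: the chunk read up to the next '>' still contains the header line and
// the newlines, so the header has to be skipped separately.
getline(sequence_file, sequence, '>');
for(auto&& c : sequence) {
    // Ignore newlines and anything else outside 'A'..'Z' so the index stays in range
    if(c >= 'A' && c <= 'Z') {
        ++nucleotide_counts[c-'A'];
    }
}

// nucleotide_counts['X'-'A'] contains the count of nucleotide X in the sequence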


The original question, "c++ - Fast counting of nucleotide types in a large number of sequences", can be found on Stack Overflow: https://stackoverflow.com/questions/53156250/