c++ - (Qt and C++) Reading a file (~25 MB) and comparing strings is slower than in Python

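I have two files, "test" and "sample". Each file contains "rs-numbers" followed by "genotypes".

The test file is smaller than the sample file. It has only about 150 rs-numbers and their genotypes.

The sample file, however, contains more than 900k rs-numbers and their genotypes.

readTest() opens "test.tsv", reads the file line by line, and returns a vector of tuples. Each tuple holds (rs-number, genotype).

analyze() takes the result from readTest(), opens the sample file, reads it line by line, and does the comparison.

Example lines from the sample file:

rs12124811\t1\t776546\tAA\r\n

rs11240777\t1\t798959\tGG\r\n

rs12124811 and rs11240777 are rs-numbers. AA and GG are their genotypes.

The running time is O(n * m), where n is the number of test entries and m is the number of lines in the sample file. The C++ version of my program takes 30 seconds, whereas the Python version takes only 15 seconds, and 5 seconds with multiprocessing.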

vector<tuple<QString, QString>> readTest(string test){
// readTest can be done instantly
return tar_gene;
}

// tar_gene is the result from readTest()
// now call analyze. it reads through the sample.txt by line and does
// comparison.
QString analyze(string sample_name,
                vector<tuple<QString, QString>> tar_gene
                ){

    QString data_matches;

    QFile file(QString::fromStdString(sample_name));
    file.open(QIODevice::ReadOnly);
    //skip first 20 lines
    for(int i= 0; i < 20; i++){
        file.readLine();
    }

    while(!file.atEnd()){ // O(m)
        const QByteArray line = file.readLine();
        const QList<QByteArray> tokens = line.split('\t');
        // tar_gene is the result from readTest()
        for (auto i: tar_gene){ // O(n*m)
            // check if two rs-numbers are matched
            if (get<0>(i) == tokens[0]){
                QString i_rs = get<0>(i);
                QString i_geno = get<1>(i);
                QByteArray cur_geno = tokens[3].split('\r')[0];
                // check if their genotypes are matched
                if(cur_geno.length() == 2){
                    if (i_geno == cur_geno.at(0) || i_geno == cur_geno.at(1)){
                        data_matches += i_rs + '-' + i_geno + '\n';
                        break; // rs-numbers are unique. we can safely break
                               // the for loop
                    }
                }
                // check if their genotypes are matched
                else if (cur_geno.length() == 1) {
                    if (i_geno == cur_geno.at(0)){
                        data_matches += i_rs + '-' + i_geno + '\n';
                        break; // rs-numbers are unique. we can safely break
                               // the for loop
                    }
                }
            }
        }
    }
    return data_matches; // QString data_matches will be used in main() and
                         // printed out in text browser
}

Here is the full source code:

#include "mainwindow.h"
#include "ui_mainwindow.h"

MainWindow::MainWindow(QWidget *parent) :
    QMainWindow(parent),
    ui(new Ui::MainWindow)
{
    ui->setupUi(this);
}

MainWindow::~MainWindow()
{
    delete ui;
}

QString analyze(string sample_name,
             vector<tuple<QString, QString>> tar_gene,
             int start, int end){

    QString rs_matches, data_matches;

    QFile file(QString::fromStdString(sample_name));
    file.open(QIODevice::ReadOnly);
    //skip first 20 lines
    for(int i= 0; i < 20; i++){
        file.readLine();
    }

    while(!file.atEnd()){
        const QByteArray line = file.readLine();
        const QList<QByteArray> tokens = line.split('\t');
        for (auto i: tar_gene){
            if (get<0>(i) == tokens[0]){
                QString i_rs = get<0>(i);
                QString i_geno = get<1>(i);
                QByteArray cur_geno = tokens[3].split('\r')[0];
                if(cur_geno.length() == 2){
                    if (i_geno == cur_geno.at(0) || i_geno == cur_geno.at(1)){
                        data_matches += i_rs + '-' + i_geno + '\n';
                        break;
                    }
                }
                else if (cur_geno.length() == 1) {
                    if (i_geno == cur_geno.at(0)){
                        data_matches += i_rs + '-' + i_geno + '\n';
                        break;
                    }
                }
            }
        }
    }
    return data_matches;
}

vector<tuple<QString, QString>> readTest(string test){
    vector<tuple<QString, QString>> tar_gene;
    QFile file(QString::fromStdString(test));
    file.open(QIODevice::ReadOnly);
    file.readLine(); // skip first line
    while(!file.atEnd()){
        QString line = file.readLine();
        QStringList templist;
        templist.append(line.split('\t')[20].split('-'));
        tar_gene.push_back(make_tuple(templist.at(0),
                                      templist.at(1)));
    }
    return tar_gene;
}

void MainWindow::on_pushButton_analyze_clicked()
{
    if(ui->comboBox_sample->currentIndex() == 0){
        ui->textBrowser_rs->setText("Select a sample.");
        return;
    }
    if(ui->comboBox_test->currentIndex() == 0){
        ui->textBrowser_rs->setText("Select a test.");
        return;
    }
    string sample = (ui->comboBox_sample->currentText().toStdString()) + ".txt";
    string test = ui->comboBox_test->currentText().toStdString() + ".tsv";

    vector<tuple<QString, QString>> tar_gene;

    QFile file_test(QString::fromStdString(test));
    if (!file_test.exists()) {
        ui->textBrowser_rs->setText("The test file doesn't exist.");
        return;
    }

    tar_gene = readTest(test);

    QFile file_sample(QString::fromStdString(sample));
    if (!file_sample.exists()) {
        ui->textBrowser_rs->setText("The sample file doesn't exist.");
        return;
    }
    clock_t t1,t2;

    t1=clock();
    QString result = analyze(sample, tar_gene, 0, 0);
    t2=clock();
    float diff ((float)t2-(float)t1);

    float seconds = diff / CLOCKS_PER_SEC;
    qDebug() << seconds;
    ui->textBrowser_rsgeno->setText(result);
}

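How can I make it run faster? I rewrote the program in C++ because I expected to see better performance than the Python version!

With @Felix's help, my program now takes 15 seconds. I will try multithreading later.

Here are examples of the source data files:

(test.tsv)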
rs17760268-C
rs10439884-A
rs4911642-C
rs157640-G
... and more. They are not sorted.

(sample.txt)
rs12124811\t1\t776546\tAA\r\n
rs11240777\t1\t798959\tGG\r\n
... and more. They are not sorted.

Any suggestions for a better data structure or algorithm?

Best answer

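There are actually quite a few things you can do to optimize this code.

Make sure the Qt build you are using is optimized for speed, not size.
Build your application in release mode. Make sure you set the right compiler and linker flags to optimize for speed, not size.
Try to use references more. This avoids unnecessary copy operations. For example, use for(const auto &i : tar_gene). There are many more instances. Basically, try to avoid anything that is not a reference. This also means using std::move and rvalue references wherever possible.
Enable QStringBuilder. It optimizes string concatenation. To do so, add DEFINES += QT_USE_QSTRINGBUILDER to your pro file.
Use either QString or QByteArray throughout the code. Mixing them means Qt has to convert every time you compare them.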
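
To illustrate the point about references, here is a minimal sketch in standard C++ rather than Qt. The type `Counted` and the two helper functions are hypothetical names invented for this illustration. It counts how many copies a range-based for loop makes when it iterates by value and when it iterates by const reference.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical instrumented type: counts how often it is copy-constructed.
struct Counted {
    std::string value;
    static inline std::size_t copies = 0;  // requires C++17

    explicit Counted(std::string v) : value(std::move(v)) {}
    Counted(const Counted& other) : value(other.value) { ++copies; }
    Counted(Counted&&) noexcept = default;
    Counted& operator=(const Counted&) = default;
    Counted& operator=(Counted&&) noexcept = default;
};

// Iterating by value: each element is copied into `item` on every step.
std::size_t totalLengthByValue(const std::vector<Counted>& items) {
    std::size_t total = 0;
    for (auto item : items) {
        total += item.value.size();
    }
    return total;
}

// Iterating by const reference: no element is copied.
std::size_t totalLengthByReference(const std::vector<Counted>& items) {
    std::size_t total = 0;
    for (const auto& item : items) {
        total += item.value.size();
    }
    return total;
}
```

In the question's code the by-value loop runs once per line of the sample file, so the copies add up. QString is implicitly shared, so such a copy is cheaper than a deep copy, but it is still not free.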

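These are only the most basic and simple things you can do. Try them and see how much speed you gain. If that is not enough, you can try to further optimize the mathematical algorithm you are implementing here, or dig deeper into C/C++ to learn all the small tricks that improve speed.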
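
One possible algorithmic improvement, sketched here as an assumption and not as part of the original answer, is to replace the inner loop over all test entries with a hash lookup. That brings the work down from O(n * m) to roughly O(n + m) on average. The sketch uses the standard library (`std::unordered_map`) instead of Qt types; `TargetMap` and `matchLine` are hypothetical names, and the line layout (rs-number, two more columns, genotype, separated by tabs) is taken from the sample lines in the question.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical container: rs-number from the test file -> target genotype letter.
using TargetMap = std::unordered_map<std::string, std::string>;

// Returns "rs-genotype\n" if the sample line matches a target, else an empty string.
std::string matchLine(std::string_view line, const TargetMap& targets) {
    constexpr auto npos = std::string_view::npos;

    // Strip the trailing "\r\n".
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r')) {
        line.remove_suffix(1);
    }

    // Split off the first four tab-separated fields without copying.
    std::string_view fields[4];
    std::size_t count = 0;
    std::size_t start = 0;
    while (count < 4) {
        const std::size_t tab = line.find('\t', start);
        fields[count++] = line.substr(start, tab == npos ? npos : tab - start);
        if (tab == npos) break;
        start = tab + 1;
    }
    if (count < 4) return {};  // malformed line

    // Average O(1) lookup instead of scanning every test entry.
    const auto it = targets.find(std::string(fields[0]));
    if (it == targets.end()) return {};

    // Same rule as the question's code: a one-letter target must equal
    // one of the (one or two) genotype letters.
    const std::string& target = it->second;
    const std::string_view genotype = fields[3];
    if (target.size() != 1 || genotype.empty() || genotype.size() > 2) return {};
    if (genotype.find(target[0]) == npos) return {};

    return it->first + '-' + target + '\n';
}
```

In a Qt code base, the same idea could presumably be expressed with a QHash keyed by QByteArray, which would also keep the types consistent with what QFile::readLine() returns.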

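Edit: You can also try to gain speed by splitting the work across multiple threads. Doing this by hand is not a good idea; if you are interested, take a look at QtConcurrent.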
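
As a rough illustration of the splitting idea, the sketch below uses `std::async` from the standard library as a stand-in. It is not QtConcurrent, whose map/reduce helpers play a similar role in Qt. The function name `processInParallel` and its parameters are hypothetical. It assumes that the lines are already in memory and that the per-line callable is safe to call from several threads at once.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Splits `lines` into contiguous chunks, processes each chunk in its own
// asynchronous task, and concatenates the results in the original order.
template <typename LineFn>
std::string processInParallel(const std::vector<std::string>& lines,
                              std::size_t workerCount, LineFn perLine) {
    if (workerCount == 0) workerCount = 1;
    // Ceiling division so that every line lands in exactly one chunk.
    const std::size_t chunk = (lines.size() + workerCount - 1) / workerCount;

    std::vector<std::future<std::string>> parts;
    for (std::size_t begin = 0; begin < lines.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, lines.size());
        parts.push_back(std::async(std::launch::async,
                                   [&lines, &perLine, begin, end] {
            std::string out;
            for (std::size_t i = begin; i < end; ++i) {
                out += perLine(lines[i]);
            }
            return out;
        }));
    }

    // Collect in submission order so the output order matches the input.
    std::string result;
    for (auto& part : parts) {
        result += part.get();
    }
    return result;
}
```

Whether this helps depends on how much of the time goes into the comparison itself and how much into reading the file, so it is worth measuring before and after.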

Regarding c++ - (Qt and C++) Reading a file (~25 MB) and comparing strings is slower than in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54118145/