c++ - Iterate over a string, but keep the delimiters in the substrings, with some additional rules

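I'm trying to iterate over a string and break it into substrings that are appended to the end of a vector. I'm also trying to apply a few additional rules: an apostrophe counts as alphanumeric; a ',' is allowed if it appears between digits; and a '.' is allowed if it appears before a digit/space or between digits.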

For example:

This'.isatest!!!!andsuch .1,00,0.011#$%@

would produce:

myvector[This'][.][isatest][!!!!][andsuch][.1,00,0.011][#$%@]

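I can easily split on non-alphanumeric characters (and the apostrophe), with if statements for ',' and '.', but I'm having trouble keeping runs of delimiters together. Currently, I get something more like: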

myvector[This'][.][isatest][!][!][!][!][andsuch][.1,00,0.011][#][$][%][@]

Any useful tips?

Best answer

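Because I might be a bit cuckoo, in addition to the previous answer that uses Boost Spirit to generate the parser, I took the time to write another parser by hand.

As you can see, it is not exactly straightforward. It's tedious, error-prone, hard to maintain and not very generic. Your pick!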

Live On Coliru

#include <string>
#include <iterator>
#include <algorithm>
#include <iostream>
#include <cctype>   // std::isdigit, std::isalpha
#include <cstdlib>  // std::strtod

template <typename Out>
Out smart_split(char const* first, char const* last, Out out) {
    auto it = first;
    std::string token;

    auto emit = [&] {
        if (!token.empty())
            *out++ = token;
        token.clear();
        return out;
    };

    enum { NUMBER_LIST, OTHER } state = OTHER;

    while (it != last) {
#ifndef NDEBUG
        std::cout << std::string(it - first, ' ') << std::string(it, last) << " (token: '" << token << "')\n";
#endif

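        // Cast to unsigned char: passing a negative char to std::isdigit is undefined behavior
        if (std::isdigit(static_cast<unsigned char>(*it)) || *it == '-' || *it == '+' || *it == '.') {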
            if (state != NUMBER_LIST)
                emit();

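            // std::strtod expects a null-terminated buffer; that holds here because
            // the range comes from std::string::data() (C++11 and later)
            char* e;
            std::strtod(it, &e);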
            if (it < e) {
                token.append(it, static_cast<char const*>(e));
                it = e;

                if (it != last && *it == ',') {
                    token += *it++;
                    state = NUMBER_LIST;
                }
            }
            else {
                token += *it++;
            }
        }
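        else if (std::isalpha(static_cast<unsigned char>(*it)) || *it == '\'') {
            state = OTHER;
            emit();

            // Consume the whole word (letters and apostrophes) as one token
            while (it != last && (std::isalpha(static_cast<unsigned char>(*it)) || *it == '\'')) {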
                token += *it++;
            }

            emit();
        }
        else {
            if (state == NUMBER_LIST)
                emit();
            state = OTHER;
            token += *it++;
        }
    }

    return emit();
}

#include <vector>

typedef std::vector<std::string> Tokens;

int main()
{
    std::string const input = "This'.isatest!!!!andsuch.1,00,0.11#$%@";

    Tokens actual;
    smart_split(input.data(), input.data() + input.size(), std::back_inserter(actual));

    for (auto& token : actual)
        std::cout << token << "\n";
}

Prints:

This'
.
isatest
!!!!
andsuch
.1,00,0.11
#$%@

In a debug build (that is, when NDEBUG is not defined), it also traces progress through the loop:

This'.isatest!!!!andsuch.1,00,0.11#$%@ (token: '')
     .isatest!!!!andsuch.1,00,0.11#$%@ (token: '')
      isatest!!!!andsuch.1,00,0.11#$%@ (token: '.')
             !!!!andsuch.1,00,0.11#$%@ (token: '')
              !!!andsuch.1,00,0.11#$%@ (token: '!')
               !!andsuch.1,00,0.11#$%@ (token: '!!')
                !andsuch.1,00,0.11#$%@ (token: '!!!')
                 andsuch.1,00,0.11#$%@ (token: '!!!!')
                        .1,00,0.11#$%@ (token: '')
                           00,0.11#$%@ (token: '.1,')
                              0.11#$%@ (token: '.1,00,')
                                  #$%@ (token: '.1,00,0.11')
                                   $%@ (token: '#')
                                    %@ (token: '#$')
                                     @ (token: '#$%')

Regarding "c++ - Iterate over a string, but keep the delimiters in the substrings, with some additional rules", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46294973/