c# - Remove the control character sequence EOT comma ETX from a string

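I have some XML files in which control sequences are embedded in the text: EOT, comma, ETX, (another char).
The extra character after the EOT-comma-ETX sequence is not always present and not always the same.
Actual example: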

<FatturaElettronicaHeader xmlns="">
</F<EOT>‚<ETX>èatturaElettronicaHeader>

Here <EOT> is the character 0x04 and <ETX> is 0x03. Since I have to parse the XML, this is a real problem.
Is this some kind of encoding I have never heard of?

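I tried removing all control characters from the string, but that leaves the comma behind, which is still unwanted.
If I use Encoding.ASCII.GetString(file);, the unwanted characters are replaced with '?', which is easy to remove, but some unwanted characters still remain and break parsing: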

<BIC></WBIC> or something like this.

string xml = Encoding.ASCII.GetString(file);
xml = new string(xml.Where(cc => !char.IsControl(cc)).ToArray());

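So I need to remove every such control-character sequence before I can parse these files, and I am not sure how to check programmatically whether a character is part of one.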

Best answer

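I found that the files contain two bad patterns: the first is the one in the title, and the second is EOT followed by <.
To get it working, I looked at this thread: Remove substring that starts with SOT and ends EOT, from string

and modified the code a little: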

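private static string RemoveInvalidCharacters(string input)
        {
            while (true)
            {
                var start = input.IndexOf('\u0004');
                if (start == -1) break;

                // Pattern 2: EOT immediately followed by '<'; both are removed.
                if (start + 1 < input.Length && input[start + 1] == '<')
                {
                    input = input.Remove(start, 2);
                    continue;
                }

                // Pattern 1: EOT, one character, ETX, plus one trailing character if present.
                if (start + 2 < input.Length && input[start + 2] == '\u0003')
                {
                    int count = start + 3 < input.Length ? 4 : 3;
                    input = input.Remove(start, count);
                    continue;
                }

                // Fallback (not in the original snippet): drop a lone EOT,
                // otherwise the loop would find the same EOT forever.
                input = input.Remove(start, 1);
            }
            return input;
        }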
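The same two patterns can also be written as one regular expression. This is only a sketch, not the code from the answer: the names ControlSequenceCleaner and RemoveControlSequences and the sample string are made up for illustration. Like the loop above, it assumes the character after ETX is always junk when present, and it makes a single pass, so it may behave differently on inputs where removing one sequence creates another.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlSequenceCleaner
{
    // Alternatives are tried in order at each EOT (U+0004):
    //   1. EOT followed by '<'
    //   2. EOT, any one character, ETX (U+0003), and one optional trailing character
    //   3. a lone EOT
    // Singleline lets '.' match line breaks as well.
    private static readonly Regex Junk = new Regex(
        "\u0004<|\u0004.\u0003.?|\u0004",
        RegexOptions.Singleline);

    // Hypothetical helper: removes every match of the patterns above.
    public static string RemoveControlSequences(string input)
    {
        return Junk.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Sample modelled on the closing tag from the question.
        string broken = "</F\u0004\u201A\u0003\u00E8atturaElettronicaHeader>";
        Console.WriteLine(RemoveControlSequences(broken));
        // prints "</FatturaElettronicaHeader>"
    }
}
```

Because the trailing character is removed unconditionally, a real character that directly follows ETX would be lost as well, so it is worth checking the output against a few known-good files.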

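After removing those sequences, I ran a further cleanup with the following code, which keeps only printable ASCII (code points 32 to 126) and therefore also drops line breaks, tabs, and accented letters such as è: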

static string StripExtended(string arg)
        {
            StringBuilder buffer = new StringBuilder(arg.Length); //Max length
            foreach (char ch in arg)
            {
                UInt16 num = Convert.ToUInt16(ch);//In .NET, chars are UTF-16
                //The basic characters have the same code points as ASCII, and the extended characters are bigger
                if ((num >= 32u) && (num <= 126u)) buffer.Append(ch);
            }
            return buffer.ToString();
        }

Now everything seems to parse fine.
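StripExtended keeps only code points 32 to 126, so it also removes tabs, line breaks, and legitimate accented letters. If that is too aggressive for a given file, one possible alternative is to drop only the characters that XML 1.0 does not allow. The sketch below is not part of the original answer: the names XmlCharFilter and StripInvalidXmlChars are hypothetical, and it relies on XmlConvert.IsXmlChar from System.Xml. It does not remove the junk characters around EOT and ETX by itself, so it would still run after the sequence removal.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Hypothetical helper: keeps every character XML 1.0 allows
    // (tab, CR, LF, accented letters, ...) and drops the rest, such as EOT and ETX.
    public static string StripInvalidXmlChars(string arg)
    {
        var buffer = new StringBuilder(arg.Length); // upper bound on the result length
        for (int i = 0; i < arg.Length; i++)
        {
            char ch = arg[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                buffer.Append(ch);
            }
            else if (i + 1 < arg.Length && char.IsSurrogatePair(ch, arg[i + 1]))
            {
                // A valid high/low surrogate pair encodes a character above U+FFFF; keep both halves.
                buffer.Append(ch).Append(arg[i + 1]);
                i++;
            }
            // Anything else (control characters, unpaired surrogates) is skipped.
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        string cleaned = StripInvalidXmlChars("a\u0004b\tc\u00E8\n");
        // EOT is dropped; the tab, the accented letter and the line feed are kept.
        Console.WriteLine(cleaned == "ab\tc\u00E8\n"); // prints "True"
    }
}
```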

For more on c# - Remove the control character sequence EOT comma ETX from a string, see the similar question on Stack Overflow: https://stackoverflow.com/questions/54168995/