Question
I have a large ANSI text file. The file contains many entries (millions to billions). Each entry has 4 lines, like this:
@Instrument:6:73:941:1973#0/1
other stuff2
other stuff3
other stuff4
I am interested in the first line. From the first line I need to extract its content (numbers and strings). I am using StringReplace to replace ':' and space with #13, then I split the line into a record like this:
type
  RBlock = record             // @Instrument:6:73:941:1973#0/1
    Instrument: String;       // Instrument
    Lane: Integer;            // 6
    TileNo: Integer;          // 73
    X: Integer;               // 941
    Y: Integer;               // 1973
    Pair: Byte;               // could be 1 or 2
    MultiplexID: AnsiString;  // #0 <---- I need it as AnsiString
  end;
Using StrToInt to convert the text to numbers may be slow because it first converts the AnsiString to string.
Any ideas on how I could read it faster would be appreciated.
Update: the line could also have an alternative format: @Instrument:136:FC6:2:2104:15343:197393 1:Y:18:TACA
Recommended answer
You need to examine your data and check what sort of data could occur. Personally I would probably do something like this (for the first example):
procedure ParseLine(const aLine: RawByteString; var aInstrument: string;
  var aLane, aTileNo, aX, aY: Integer; var aMultiplexID: AnsiString;
  var aPair: Byte);
var
  arrayIndex: Integer;
  index: Integer;
  lineLength: Integer;
  NumList: array[0..3] of Integer;
  I: Integer;
  multiEnd: Integer;
begin
  lineLength := Length(aLine);
  // Get the aInstrument: the text between the leading '@' and the first ':'
  index := Pos(':', aLine);
  SetLength(aInstrument, index - 2);
  for I := 2 to index - 1 do
    aInstrument[I-1] := Char(aLine[I]);
  // Get the integers: four decimal fields, each terminated by ':' or '#'
  arrayIndex := 0;
  FillMemory(@NumList, SizeOf(NumList), 0);
  while (index < lineLength) and (arrayIndex < 4) do
  begin
    Inc(index);
    if (aLine[index] = ':') or (aLine[index] = '#') then
      Inc(arrayIndex)
    else
      NumList[arrayIndex] := NumList[arrayIndex] * 10 + Ord(aLine[index]) - Ord('0');
  end;
  aLane := NumList[0];
  aTileNo := NumList[1];
  aX := NumList[2];
  aY := NumList[3];
  // Get the Multiplex: the text between '#' and '/'
  // (three-argument Pos; older Delphi versions offer PosEx instead)
  multiEnd := Pos('/', aLine, index);
  SetLength(aMultiplexID, multiEnd - index - 1);
  Inc(index);
  for I := index to multiEnd - 1 do
    aMultiplexID[I-index+1] := aLine[I];
  // Get the aPair: the single digit after '/', which may be the last character
  if (multiEnd > 0) and (multiEnd < lineLength) then
    aPair := Ord(aLine[multiEnd+1]) - Ord('0')
  else
    aPair := 0;
end;
This could be optimized further, but that would start to really hurt readability. The issue here is going to be whether the data is valid for this routine. It will handle a string that is too short, although it won't return an error in that case, but it will not handle invalid values in the text. Negative numbers would also be a problem. What you need to look at is your data: what it looks like, how likely corrupt or invalid data is, and how important speed is to you. It's a balancing act. You could remove all of the checks and make it faster, or add a lot more checks, which would slow it down.
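To illustrate the "more checks" end of that trade-off, here is a minimal sketch of a validating integer reader. TryReadUInt is a hypothetical helper, not part of the answer's code. It assumes the fields are unsigned decimal numbers, and it rejects an empty field or a value that would overflow Integer instead of silently producing garbage.

```pascal
program ValidatedParseSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper (an assumption for illustration only).
// Reads an unsigned decimal integer from S starting at Index and stops
// at the first non-digit. Returns False if no digit is found or if the
// value would exceed MaxInt. On success, Index is left on the character
// that ended the number (or one past the end of the string).
function TryReadUInt(const S: RawByteString; var Index: Integer;
  out Value: Integer): Boolean;
var
  Start, Digit: Integer;
begin
  Value := 0;
  Start := Index;
  while (Index <= Length(S)) and (S[Index] in ['0'..'9']) do
  begin
    Digit := Ord(S[Index]) - Ord('0');
    // Check for overflow before multiplying
    if Value > (MaxInt - Digit) div 10 then
    begin
      Result := False;
      Exit;
    end;
    Value := Value * 10 + Digit;
    Inc(Index);
  end;
  // At least one digit must have been consumed
  Result := Index > Start;
end;

var
  Line: RawByteString;
  Idx, N: Integer;
begin
  Line := '@Instrument:6:73:941:1973#0/1';
  Idx := 13; // first character after '@Instrument:'
  if TryReadUInt(Line, Idx, N) then
    WriteLn('Lane = ', N)
  else
    WriteLn('Invalid lane field');
end.
```

Each extra comparison in the loop costs a little time per character, so whether this is worth it depends on how much you trust the input file.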