I have a CSV file like this that I need to parse with Java.
2012-11-01 00, 1106, 2194.1971066908
2012-11-01 01, 760, 1271.8460526316
.
.
.
2012-11-30 21, 1353, 1464.0014781966
2012-11-30 22, 1810, 1338.8331491713
2012-11-30 23, 1537, 1222.7826935589
720 rows selected.
Elapsed: 00:37:00.23
This is the Java code I wrote to split the columns and store them in lists.
public void extractFile(String fileName){
    try{
        BufferedReader bf = new BufferedReader(new FileReader(fileName));
        try {
            String readBuff = bf.readLine();
            while (readBuff!=null){
                Pattern checkData = Pattern.compile("[a-zA-Z]");
                Matcher match = checkData.matcher(readBuff);
                if (match.find()){
                    readBuff = null;
                }
                else if (!match.find()){
                    String[] splitReadBuffByComma = new String[3];
                    splitReadBuffByComma = readBuff.split(",");
                    for (int x=0; x<splitReadBuffByComma.length; x++){
                        if (x==0){
                            dHourList.add(splitReadBuffByComma[x]);
                        }
                        else if (x==1){
                            throughputList.add(splitReadBuffByComma[x]);
                        }
                        else if (x==2){
                            avgRespTimeList.add(splitReadBuffByComma[x]);
                        }
                    }
                }
                readBuff = bf.readLine();
            }
        }
        finally{
            bf.close();
        }
    }
    catch(FileNotFoundException e){
        System.out.println("File not found dude: "+ e);
    }
    catch(IOException e){
        System.out.println("Error Exception dude: "+e);
    }
}
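The problem is that the regular expression I wrote is not quite right: it still lets the text "720 rows selected." through and stores it in dHourList. dHourList should only store the date column, in the form "2012-11-01 00", etc.; throughputList = "1106, 760, ... etc."; avgResponseTime = "2194.192, 1271.846, ... etc." What should the correct regular expression be?
Update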
2012-11-30 21
2012-11-30 22
2012-11-30 23
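720 rows selected.
Elapsed: 00:37:00.23
Size of date-hour: 724 size of throughput: 720 size of avg resp time: 720
I used the following for the checkData regex. I doubled the backslash in \d because with a single backslash the compiler reports an invalid escape sequence: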
Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$");
But the output still shows "720 rows selected." and the other line that should not be there.
Update 2
Working code:
while (readBuff!=null){
    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
    Matcher match = checkData.matcher(readBuff);
    if (!match.find()){
        readBuff = null;
    }
    else{
        String[] splitReadBuffByComma = new String[3];
        splitReadBuffByComma = readBuff.split(",");
        for (int x=0; x<splitReadBuffByComma.length; x++){
            if (x==0){
                dHourList.add(splitReadBuffByComma[x]);
            }
            else if (x==1){
                throughputList.add(splitReadBuffByComma[x]);
            }
            else if (x==2){
                avgRespTimeList.add(splitReadBuffByComma[x]);
            }
        }
    }
    readBuff = bf.readLine();
}
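I removed the condition on the second branch, turning it into a plain else, and used the regular expression Cylian suggested. Now I get this output: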
2012-11-30 21
2012-11-30 22
2012-11-30 23
Size of date-hour: 720 size of throughput: 720 size of avg resp time: 720
Thanks a lot!
Best answer
Try this [your code, slightly modified]:
// Needs: java.io.BufferedReader, java.io.FileReader, java.io.FileNotFoundException,
// java.io.IOException, java.util.regex.Matcher, java.util.regex.Pattern
public void extractFile(String fileName){
    try{
        BufferedReader bf = new BufferedReader(new FileReader(fileName));
        try {
            String readBuff = bf.readLine();
            while (readBuff!=null){
                Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                Matcher match = checkData.matcher(readBuff);
                if (!match.find()){
                    // Not a data row: skip it.
                    readBuff = null;
                }
                else {
                    // Plain else on purpose: calling find() a second time would search
                    // for a further match after the first one and return false.
                    String[] splitReadBuffByComma = readBuff.split(",");
                    for (int x=0; x<splitReadBuffByComma.length; x++){
                        if (x==0){
                            dHourList.add(splitReadBuffByComma[x]);
                        }
                        else if (x==1){
                            throughputList.add(splitReadBuffByComma[x]);
                        }
                        else if (x==2){
                            avgRespTimeList.add(splitReadBuffByComma[x]);
                        }
                    }
                }
                readBuff = bf.readLine();
            }
        }
        finally{
            bf.close();
        }
    }
    catch(FileNotFoundException e){
        System.out.println("File not found dude: "+ e);
    }
    catch(IOException e){
        System.out.println("Error Exception dude: "+e);
    }
}
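One thing to watch with this kind of if / else-if test: Matcher.find() is stateful. Each call resumes searching after the previous match, so asking the same Matcher twice about the same line can give two different answers. The sketch below illustrates this with the pattern above; the class and method names (FindTwiceDemo, findTwice) are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindTwiceDemo {
    // Same date-prefixed pattern as in the answer above.
    static final Pattern DATE_LINE = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    // Returns the results of two consecutive find() calls on one Matcher.
    static boolean[] findTwice(String line) {
        Matcher m = DATE_LINE.matcher(line);
        boolean first = m.find();   // matches the whole line
        boolean second = m.find();  // resumes after that match: nothing is left
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] r = findTwice("2012-11-01 00, 1106, 2194.1971066908");
        System.out.println(r[0] + " " + r[1]); // prints "true false"
    }
}
```

Testing the line once and branching on that single result avoids the problem.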
Regex anatomy
# ^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$
#
# Options: ^ and $ match at line breaks
#
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference number 1 «(19|20)»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «19»
#       Match the characters “19” literally «19»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «20»
#       Match the characters “20” literally «20»
# Match a single digit 0..9 «\d»
# Match a single digit 0..9 «\d»
# Match the regular expression below and capture its match into backreference number 2 «([-/.])»
#    Match a single character present in the list “-/.” «[-/.]»
# Match the regular expression below and capture its match into backreference number 3 «(0[1-9]|1[012])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «1[012]»
#       Match the character “1” literally «1»
#       Match a single character present in the list “012” «[012]»
# Match the same text as most recently matched by capturing group number 2 «\2»
# Match the regular expression below and capture its match into backreference number 4 «(0[1-9]|[12][0-9]|3[01])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[12][0-9]»
#       Match a single character present in the list “12” «[12]»
#       Match a single character in the range between “0” and “9” «[0-9]»
#    Or match regular expression number 3 below (the entire group fails if this one fails to match) «3[01]»
#       Match the character “3” literally «3»
#       Match a single character present in the list “01” «[01]»
# Assert position at a word boundary «\b»
# Match any single character that is not a line break character «.+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»
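When this pattern is written as a Java string literal, every backslash has to be doubled, including the ones in \2 and \b. Unlike \d, those two do not trigger a compiler error with a single backslash, because "\2" is a valid octal escape (the character U+0002) and "\b" is a valid escape for backspace (U+0008). The pattern then compiles but looks for those control characters, which would explain a version of the regex that never matches the data rows. A minimal sketch, with a made-up class name (EscapeDemo):

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Correct: "\\2" and "\\b" reach the regex engine as \2 and \b.
    static final Pattern DOUBLED = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    // Broken: "\2" becomes the character U+0002 and "\b" a backspace (U+0008).
    static final Pattern SINGLE = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$");

    public static void main(String[] args) {
        String line = "2012-11-01 00, 1106, 2194.1971066908";
        System.out.println(DOUBLED.matcher(line).find()); // prints "true"
        System.out.println(SINGLE.matcher(line).find());  // prints "false"
        System.out.println((int) "\2".charAt(0));         // prints "2"
        System.out.println((int) "\b".charAt(0));         // prints "8"
    }
}
```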
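Update
As I understand it, your input contains a number of lines that start with a date but contain no commas. To exclude those as well, change the previous pattern to the following: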
^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\s+\d+,[^,]+,[^,]+$
or, escaped for use in a Java string literal:
^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$
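As a rough illustration of how the stricter pattern behaves, the sketch below runs it over a few sample lines: one real data row, a date-only line, and the two trailer lines from the file. The class and method names (StrictRowFilter, keepDataRows) are invented for the example, and it uses matches() rather than find(), which is equivalent here because the pattern is anchored at both ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class StrictRowFilter {
    // The stricter pattern from above, escaped for a Java string literal.
    static final Pattern DATA_ROW = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$");

    // Keeps only lines made of a date, an hour and exactly two more comma-separated fields.
    static List<String> keepDataRows(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (DATA_ROW.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "2012-11-30 23, 1537, 1222.7826935589", // real data row
            "2012-11-30 23",                        // date only, no commas
            "720 rows selected.",                   // trailer line
            "Elapsed: 00:37:00.23");                // trailer line
        System.out.println(keepDataRows(sample).size()); // prints "1"
    }
}
```

Note that the fields produced by split(",") keep the space that follows each comma, so calling trim() on them before storing may be worthwhile.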