Scanner's nextLine() only fetching part of a line

Problem description

So, using something like the following:

for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory() && files[i].canRead()) {
        try {
            Scanner scan = new Scanner(files[i]);
            System.out.println("Generating Categories for " + files[i].toPath());
            while (scan.hasNextLine()) {
                count++;
                String line = scan.nextLine();
                System.out.println("  ->" + line);
                line = line.split("\t", 2)[1];
                System.out.println("!- " + line);
                JsonParser parser = new JsonParser();
                JsonObject object = parser.parse(line).getAsJsonObject();
                Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                exploreSet(entrySet);
            }
            scan.close();
            // System.out.println(keyset);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}

As the loop goes over a Hadoop output file, one of the JSON objects in the middle breaks, because scan.nextLine() is not fetching the whole line before the line is passed to split. That is, the output is:

  ->0   {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{   ...  "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{   ...  "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~

Most of the above data has been sanitized (though not the URL, for the most part), and in the file the URL continues as: $(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007

So it's slightly miffing.

This is also entry #112, and I have had other files parse without errors, but this one is screwing with my mind, mostly because I don't see how scan.nextLine() could fail to work.

According to the debug output, the JSON error is caused by the string not being split properly.

And I almost forgot: it also works just fine if I put the offending line in its own file and parse just that.

It also blows up in about the same place if I remove the offending line.

Attempted with JVM 1.6 and 1.7.

Workaround solution: use BufferedReader scan = new BufferedReader(new FileReader(files[i])); instead of Scanner.
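
The workaround can be sketched roughly as follows. This is a minimal illustration rather than the asker's actual code: the class name BufferedReaderWorkaround, the helper readPayloads, and the sample input are made up for the example, the JSON parsing step is left out, and Java 8 or later is assumed. BufferedReader.readLine() treats only a line feed, a carriage return, or a carriage return followed by a line feed as a line terminator, so any other character stays inside the line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BufferedReaderWorkaround {

    // Returns the text after the first tab of every line, mirroring
    // line.split("\t", 2)[1] from the question. Lines without a tab are skipped.
    static List<String> readPayloads(Reader source) {
        List<String> payloads = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    payloads.add(parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Hypothetical input: two records, the first with U+0085 inside its value.
        String sample = "0\t{\"url\":\"a~~\u0085b\"}\n1\t{\"url\":\"c\"}\n";
        List<String> payloads = readPayloads(new StringReader(sample));
        System.out.println("payloads read: " + payloads.size());
    }
}
```

For a real file, the reader would be built as in the workaround, with new FileReader(files[i]) passed in place of the StringReader. Note that FileReader decodes the file with the default charset.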

Recommended answer

Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().

The criteria for end-of-line are:

  • Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]"
  • The end of the input stream

You say that the file continues after the "~~", so let's put EOF aside and look at the regex. It will match any of the following:

The usual line separators:

  • <CR>
  • <NL>
  • <CR><NL>

... and three unusual forms of line separator that Scanner also recognizes:

  • 0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
  • 0x2028 is the Unicode "line separator" character
  • 0x2029 is the Unicode "paragraph separator" character
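
The effect of these three characters on Scanner can be checked with a small experiment. The sketch below is illustrative only: the class name ScannerSeparatorDemo, the helper scannerLines, and the record contents are invented for the demonstration, and Java 7 or later is assumed. It feeds Scanner a single tab-separated record with one of the unusual separators embedded in the value and counts how many lines nextLine() returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerSeparatorDemo {

    // Collects everything Scanner.nextLine() returns for the given text.
    static List<String> scannerLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] separators = {"\u0085", "\u2028", "\u2029"};
        for (String sep : separators) {
            // Made-up record: a key, a tab, then a value with the separator inside.
            String record = "0\thead~~" + sep + "tail";
            List<String> lines = scannerLines(record);
            System.out.printf("U+%04X -> %d line(s)%n", (int) sep.charAt(0), lines.size());
        }
    }
}
```

Each of the three records comes back as two lines, with the first one ending at "~~", which is the same shape as the truncated output in the question.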

My theory is that you've got one of the "unusual" forms in your input file, and that it is not showing up in whatever tool you are using to examine the files.

I suggest that you examine the input file using a tool that can show you the actual bytes of the file, e.g. the od utility on a Linux / Unix system. Also, check that this isn't caused by some kind of character encoding mismatch, or by trying to read or write binary data as text.
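
If a byte-level tool is not at hand, a similar check can be done from Java. The following is a rough sketch, not a replacement for od: the class name SeparatorFinder and its two helpers are hypothetical, the sample text is made up, and the file is assumed to be UTF-8, which may not match the real Hadoop output. It reports where U+0085, U+2028, or U+2029 occur in the decoded text and can print bytes as hex.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SeparatorFinder {

    // Describes each unusual line separator found in the decoded text.
    static List<String> findUnusualSeparators(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u0085' || c == '\u2028' || c == '\u2029') {
                hits.add(String.format("U+%04X at char offset %d", (int) c, i));
            }
        }
        return hits;
    }

    // Renders bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, inspect that file; otherwise use a made-up sample.
        byte[] raw = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "left~~\u2028right".getBytes(StandardCharsets.UTF_8);
        String text = new String(raw, StandardCharsets.UTF_8); // assumed encoding
        for (String hit : findUnusualSeparators(text)) {
            System.out.println(hit);
        }
        if (args.length == 0) {
            System.out.println(toHex(raw));
        }
    }
}
```

In UTF-8, U+0085 is encoded as the bytes c2 85, U+2028 as e2 80 a8, and U+2029 as e2 80 a9, so those are the sequences to look for in a hex dump of a UTF-8 file.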

If these don't help, then the next step should be to run your application using your IDE's Java debugger, and single-step it through the Scanner.hasNextLine() and nextLine() calls to find out what the code is actually doing.

As for the observation that the offending line parses fine when it is put in its own file: that's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.

