Reading a file with a leading BOM in Java


Problem description

I am reading a file containing keywords line by line and ran into a strange problem. I expect consecutive lines with the same content to be handled only once. Like

sony
sony

only the first one should be processed. But the problem is that Java doesn't treat them as equal:

INFO: [, s, o, n, y]
INFO: [s, o, n, y]
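The symptom can be reproduced with a minimal sketch, assuming the invisible leading character is the Unicode BOM (U+FEFF): the first string is neither equal to the plain one, nor does `trim()` remove the extra character.

```java
public class BomEqualsDemo {
    public static void main(String[] args) {
        // hypothetical: the first line as decoded from a UTF-8 file saved with a BOM
        String firstLine = "\uFEFFsony";
        String secondLine = "sony";

        System.out.println(firstLine.equals(secondLine));        // false
        // trim() only strips characters <= U+0020, so U+FEFF survives it
        System.out.println(firstLine.trim().equals(secondLine)); // false
    }
}
```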

My code looks like the following; where's the problem?

    FileReader fileReader = new FileReader("some_file.txt");
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String prevLine = "";
    String strLine;
    while ((strLine = bufferedReader.readLine()) != null) {
        logger.info(Arrays.toString(strLine.toCharArray()));
        if(strLine.contentEquals(prevLine)){
            logger.info("Skipping the duplicate lines " + strLine);
            continue;
        }
        prevLine = strLine;
    }

Update:

It seems like there's a leading space in the first line, but there actually isn't, and trim() doesn't work for me. They're not the same:

INFO: [, s, o, n, y]
INFO: [ , s, o, n, y]

I don't know what the first char added by Java is.

Solved: the problem was solved with BalusC's solution. Thanks for pointing out that it's a BOM problem, which helped me find the solution quickly.

Answer

The Byte Order Mark (BOM) is a Unicode character. You can get characters like `` at the start of a text stream, because use of the BOM is optional and, if used, it should appear at the very start of the text stream.

  • Microsoft compilers and interpreters, and much software on Microsoft Windows such as Notepad, treat the BOM as a required magic number rather than using heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain-text file for download.
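That magic number is easy to observe: a file saved as "UTF-8 with BOM" starts with the three bytes `0xEF 0xBB 0xBF`. A small sketch (using a temporary file, not the asker's actual file) that writes and inspects such a header:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomHeaderDemo {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("bom", ".txt");
        // simulate a file saved by Notepad as "UTF-8 with BOM"
        Files.write(p, new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 's', 'o', 'n', 'y'});

        byte[] head;
        try (InputStream in = Files.newInputStream(p)) {
            head = in.readNBytes(3); // the three BOM bytes
        }
        System.out.printf("%02X %02X %02X%n", head[0], head[1], head[2]); // EF BB BF
        Files.delete(p);
    }
}
```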
File file = new File( csvFilename );
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

We can resolve this by explicitly specifying the charset as UTF-8 when constructing the InputStreamReader. In UTF-8, the byte sequence `0xEF 0xBB 0xBF` then decodes to a single character, U+FEFF (the BOM).
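Applied to the loop from the question, one lightweight fix (a sketch under those assumptions, not the full answer below) is to read with an explicit UTF-8 reader and strip a leading U+FEFF from the first line; `some_file.txt` is the hypothetical input file from the question.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StripBomReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("some_file.txt"), StandardCharsets.UTF_8))) {
            String prevLine = "";
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // drop the decoded BOM character
                }
                first = false;
                if (line.equals(prevLine)) {
                    continue; // skip consecutive duplicates, as intended
                }
                prevLine = line;
                System.out.println(line);
            }
        }
    }
}
```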

Using Google Guava's CharMatcher, you can remove any non-printable characters and then retain only ASCII characters (dropping any accents) like this:

String printable = CharMatcher.INVISIBLE.removeFrom( input );
String clean = CharMatcher.ASCII.retainFrom( printable );
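If Guava isn't on the classpath, a plain-JDK approximation of the same cleanup (a substitute I'm suggesting, not part of the original answer) uses a regex character class; note it also strips tabs, unlike CharMatcher.INVISIBLE, which is fine for lines from readLine().

```java
public class PlainJdkCleanup {
    public static void main(String[] args) {
        String input = "\uFEFFsony";
        // drop everything outside printable ASCII (this removes U+FEFF as well)
        String clean = input.replaceAll("[^\\x20-\\x7E]", "");
        System.out.println(clean); // sony
    }
}
```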

Full example reading data from a CSV file into JSON objects:

// Assumes json-simple's JSONObject and Guava's CharMatcher are on the classpath
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.stream.Stream;

import org.json.simple.JSONObject;

import com.google.common.base.CharMatcher;

public class CSV_FileOperations {
    static List<HashMap<String, String>> listObjects = new ArrayList<HashMap<String,String>>();
    protected static List<JSONObject> jsonArray = new ArrayList<JSONObject >();

    public static void main(String[] args) {
        String csvFilename = "D:/Yashwanth/json2Bson.csv";

        csvToJSONString(csvFilename);
        String jsonData = jsonArray.toString();
        System.out.println("File JSON Data : \n"+ jsonData);
    }

    @SuppressWarnings("deprecation")
    public static String csvToJSONString( String csvFilename ) {
        try {
            File file = new File( csvFilename );
            FileInputStream inputStream = new FileInputStream(file);

            String fileExtensionName = csvFilename.substring(csvFilename.indexOf("."));
            System.out.println("File Extension : "+ fileExtensionName);

            // Decode as UTF-8 so the BOM bytes decode to a single U+FEFF character
            InputStreamReader inputStreamReader = new InputStreamReader( inputStream, "UTF-8" );

            BufferedReader buffer = new BufferedReader( inputStreamReader );
            Stream<String> readLines = buffer.lines();
            boolean headerStream = true;

            List<String> headers = new ArrayList<String>();
            for (String line : (Iterable<String>) () -> readLines.iterator()) {
                String[] columns = line.split(",");
                if (headerStream) {
                    System.out.println(" ===== Headers =====");

                    for (String keys : columns) {
                        // Strip the BOM and other non-printable characters
                        // https://stackoverflow.com/a/11021401/5081877
                        String printable = CharMatcher.INVISIBLE.removeFrom( keys );
                        String clean = CharMatcher.ASCII.retainFrom(printable);
                        String key = clean.replaceAll("\\P{Print}", "");
                        headers.add( key );
                    }
                    headerStream = false;
                    System.out.println(" ===== ----- Data ----- =====");
                } else {
                    addCSVData(headers, columns );
                }
            }
            inputStreamReader.close();
            buffer.close();


        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    @SuppressWarnings("unchecked")
    public static void addCSVData( List<String> headers, String[] row ) {
        if( headers.size() == row.length ) {
            HashMap<String,String> mapObj = new HashMap<String,String>();
            JSONObject jsonObj = new JSONObject();
            for (int i = 0; i < row.length; i++) {
                mapObj.put(headers.get(i), row[i]);
                jsonObj.put(headers.get(i), row[i]);
            }
            jsonArray.add(jsonObj);
            listObjects.add(mapObj);
        } else {
            System.out.println("Avoiding the Row Data...");
        }
    }
}

Data in the json2Bson.csv file:

Key1,Key2,Key3
11,21,31
12,22,32
13,23,33
