Problem description
I have an assignment where we read text files and count the occurrences of each word (ignoring punctuation). We don't have to use streams, but I want to practice using them.
So far I can read a text file, put each line in a string, and collect all the strings in a list like this:
try (Stream<String> p = Files.lines(FOLDER_OF_TEXT_FILES)) {
    list = p.map(line -> line.replaceAll("[^A-Za-z0-9 ]", ""))
            .collect(Collectors.toList());
}
However, this only turns each line into a single String, so each element of the list is a line rather than a word. Is there a way, using streams, to make each element a single word, for example with something like String's split method and a regex? Or do I have to handle this outside the stream?
Recommended answer
One could use Pattern.splitAsStream to split the string efficiently and strip the non-word characters around each word in the same pass, before building a map of occurrence counts:
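// Requires Java 11+ for Files.readString; FOLDER_OF_TEXT_FILES is the Path from the question.
// groupingBy and counting are static imports from java.util.stream.Collectors.
Pattern splitter = Pattern.compile("(\\W*\\s+\\W*)+");
String fileStr = Files.readString(FOLDER_OF_TEXT_FILES);
Map<String, Long> wordCounts = splitter.splitAsStream(fileStr)
        .collect(groupingBy(Function.identity(), counting()));
System.out.println(wordCounts);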
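For experimenting without a file, the same idea can be sketched as a self-contained class that counts words in an in-memory string. The class name WordCountSketch, the helper countWords, the empty-token filter, and the TreeMap (chosen only to get a stable print order) are assumptions added for illustration and are not part of the original answer.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordCountSketch {
    // Same delimiter pattern as in the answer above.
    private static final Pattern SPLITTER = Pattern.compile("(\\W*\\s+\\W*)+");

    // Hypothetical helper: counts how often each word occurs in the given text.
    static Map<String, Long> countWords(String text) {
        return SPLITTER.splitAsStream(text)
                // A delimiter match at the very start of the text yields an empty first token.
                .filter(word -> !word.isEmpty())
                // TreeMap is an illustrative choice so the result prints in sorted key order.
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords("the cat, the dog; the end.\n"));
        // prints {cat=1, dog=1, end=1, the=3}
    }
}
```

The counts are case-sensitive, as in the answer; if "The" and "the" should be treated as the same word, a .map(String::toLowerCase) step before collecting would be one way to do that.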
To split the text and drop non-word characters in one step, we use the pattern (\W*\s+\W*)+, which matches optional non-word characters, then one or more whitespace characters, then optional non-word characters again. The outer + lets several such runs merge into a single delimiter.
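To see what this pattern does and does not remove, a small sketch can print the raw tokens for a couple of sample strings. The class name SplitPatternDemo, the helper tokens, and the sample inputs are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitPatternDemo {
    private static final Pattern SPLITTER = Pattern.compile("(\\W*\\s+\\W*)+");

    // Hypothetical helper: returns the raw tokens produced by the pattern.
    static List<String> tokens(String text) {
        return SPLITTER.splitAsStream(text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Punctuation next to whitespace becomes part of the delimiter.
        // The apostrophe and the final period stay, because no whitespace touches them.
        System.out.println(tokens("Hello, world -- it's fine."));
        // prints [Hello, world, it's, fine.]

        // A delimiter match at the very start yields an empty first token.
        System.out.println(tokens(" (leading) text"));
        // prints [, leading, text]
    }
}
```

Because every delimiter must contain at least one whitespace character, punctuation that is not adjacent to whitespace (inside a word, or at the very end of input that lacks a trailing newline) survives in the tokens. If that matters for the assignment, the tokens may need an extra cleanup step, and the empty first token can be filtered out before counting.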