问题描述
假设,我正在编写一个函数来在文本文件中查找重复的单词".例如,在aaa aaa bb cc cc bb dd
中重复的单词是aaa
和cc
,但不是bb
,因为两个bb
实例不会彼此相邻出现.
Suppose, I'm writing a function to find "repeated words" in a text file. For example, in aaa aaa bb cc cc bb dd
repeated words are aaa
and cc
but not bb
, because two bb
instances don't appear next to each other.
该函数接收一个迭代器,并返回如下迭代器:
The function receives an iterator and returns iterator like that:
def foo(in: Iterator[String]): Iterator[String] = ???
foo(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb")) // Iterator("aaa", "cc")
foo(Iterator("a", "a", "a", "b", "c", "b")) // Iterator("a")
您将如何编写foo
?请注意,输入量很大,所有单词都不适合存储(但是重复单词的数量相对较小).
How would you write foo
? Note that the input is huge and all words do not fit in memory (but the number of repeated words is relatively small).
P.S.我还要在以后增强foo
以返回重复单词的位置,重复次数等.
P.S. I would like also to enhance foo
later to return also positions of the repeated words, the number of repetitions, etc.
推荐答案
更新:
然后确定.让我们指定您想要的内容:
OK then. Let specify bit what you want:
input | expected
|
a |
aa | a
abc |
aabc | a
aaabbbbbbc | ab
aabaa | aa
aabbaa | aba
aabaa | aa
是真的吗?如果是这样,这是可行的解决方案.不确定性能,但至少它是惰性的(不要将所有内容加载到内存中).
Is it true? If so this is working solution. Not sure about performance but at least it is lazy (don't load everything into memory).
//assume we have no nulls in iterator.
def foo[T >: Null](it:Iterator[T]) = {
(Iterator(null) ++ it).sliding(3,1).collect {
case x @ Seq(a,b,c) if b == c && a != b => c
}
}
我们需要这个难看的Iterator(null) ++
,因为我们正在寻找3个元素,并且需要一种方法来查看前两个元素是否相同.
We need this ugly Iterator(null) ++
because we are looking for 3 elements and we need a way to see if first two are the same.
这是纯实现,与命令式命令相比有一些优势(例如,在其他答案中).最重要的是它很懒:
This is pure implementation and it has some advantages over imperative one (eg. in other answers). Most important one is that it is lazy:
//infinite iterator!!!
val it = Iterator.iterate('a')(s => (s + (if(Random.nextBoolean) 1 else 0)).toChar)
//it'll take only as much as needs to take this 10 items.
//should not blow up
foo(it).take(10)
//imperative implementation will blow up in such situation.
fooImp(it).take(10)
这是此主题中的所有实现以及本主题中其他帖子的所有实现: https://scalafiddle.io/sf/w5yozTA/15
here are all implementations from this and other posts seen in this topic:https://scalafiddle.io/sf/w5yozTA/15
索引和位置
在评论中,您询问添加重复单词的数量及其索引是否容易.我想了一会儿,我做了这样的事情.不知道它是否具有出色的性能,但是应该是懒惰的(例如,应该适用于大文件).
In comment you have asked if it would be easy to add the number of repeated words and their indices. I thought about it for a while and i've made something like this. Not sure if it has great performance but it should be lazy (eg. should work for big files).
/** returns Iterator that replace consecutive items with (item, index, count).
It contains all items from orginal iterator. */
def pack[T >: Null](it:Iterator[T]) = {
//Two nulls, each for one sliding(...)
(Iterator(null:T) ++ it ++ Iterator(null:T))
.sliding(2,1).zipWithIndex
//skip same items
.filter { case (x, _) => x(0) != x(1) }
//calculate how many items was skipped
.sliding(2,1).collect {
case Seq((a, idx1), (b, idx2)) => (a(1), idx1 ,idx2-idx1)
}
}
def foo[T >: Null](it:Iterator[T]) = pack(it).filter(_._3 > 1)
旧答案(在更新问题之前)
另一个(更简单的)解决方案可能是这样的:
Another (simpler) solution could be something like this:
import scala.collection.immutable._
//Create new iterator each time we'll print it.
def it = Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "dd", "ee", "ee", "ee", "ee", "ee", "aaa", "aaa", "ff", "ff", "zz", "gg", "aaa", "aaa")
//yep... this is whole implementation :)
def foo(it:Iterator[String]) = it.sliding(2,1).collect { case Seq(a,b) if a == b => a }
println(foo(it).toList) //dont care about duplication
//List(aaa, cc, dd, ee, ee, ee, ff)
println(foo(it).toSet) //throw away duplicats but don't keeps order
//Set(cc, aaa, ee, ff, dd)
println(foo(it).to[ListSet]) //throw away duplicats and keeps order
//ListSet(aaa, cc, dd, ee, ff)
//oh... and keep result longer than 5 items while testing.
//Scala collections (eg: Sets) behaves bit diffrently up to this limit (they keeps order)
//just test with bit bigger Sequences :)
https://scalafiddle.io/sf/w5yozTA/1
(如果答案有帮助,请投票)
(if answer is helpful up-vote please)
这篇关于文件中重复单词的迭代器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!