This article explains how to replace bigrams based on their frequency in Scala and Spark. It should be a useful reference for anyone facing the same problem; read on to learn more.
Problem Description
I want to replace every bigram whose frequency count is greater than a threshold with the pattern word1.concat("-").concat(word2), and I have tried:
import org.apache.spark.{SparkConf, SparkContext}

object replace {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("replace")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("data/ddd.txt")
    val threshold = 2
    val searchBigram = rdd.map {
      _.split('.').map { substrings =>
        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').
          // Remove non-alphanumeric characters and convert to lowercase
          map {
            _.replaceAll("""\W""", "").toLowerCase()
          }.
          sliding(2)
      }.flatMap {
        identity
      }
      .map {
        _.mkString(" ")
      }
      .groupBy {
        identity
      }
      .mapValues {
        _.size
      }
    }.flatMap {
      identity
    }.reduceByKey(_ + _).collect
      .sortBy(-_._2)
      .takeWhile(_._2 >= threshold)
      .map(x => x._1.split(' '))
      .map(x => (x(0), x(1))).toVector

    val sample1 = sc.textFile("data/ddd.txt")
    val sample2 = sample1.map(s => s.split(" ") // split on space
      .sliding(2) // take continuous pairs
      .map { case Array(a, b) => (a, b) }
      .map(elem => if (searchBigram.contains(elem)) (elem._1.concat("-").concat(elem._2), " ") else elem)
      .map { case (e1, e2) => e1 }.mkString(" "))
    sample2.foreach(println)
  }
}
But this code removes the last word of every document, and it throws errors when I run it on a file that contains many documents.
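The last word disappears because sliding(2) produces the consecutive pairs (w1, w2), (w2, w3), ..., (wn-1, wn), and the final .map { case (e1, e2) => e1 } keeps only the first element of each pair, so wn is never emitted. A minimal sketch of the effect on a single token sequence from the sample data (plain Scala, no Spark involved):

val tokens = "cables volts cables finally".split(" ")
val pairs = tokens.sliding(2).map { case Array(a, b) => (a, b) }.toList
// pairs: List((cables,volts), (volts,cables), (cables,finally))
val kept = pairs.map { case (e1, _) => e1 }
// kept: List(cables, volts, cables) -- "finally" is never emitted
println(kept.mkString(" "))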
Suppose my input file contains these documents:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring two thousand issue moody audio mortgage backed.
omg left gotta wrap review order asap . understand issue moody hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
buffered lightning two thousand volts cables burned revivification place .
cables volts cables finally able hear auditory issue moody gem long rumored music .
And my desired output is:
surprise heard thump opened door small-man clasping package wrapped.
upgrading system found review spring two-thousand issue-moody audio mortgage backed.
omg left gotta wrap review order asap . understand issue-moody hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long small-man .
buffered lightning two-thousand volts-cables burned revivification place .
cables volts-cables finally able hear auditory issue-moody gem long rumored music .
Can anybody help me?
Solution
from operator import add          # used by reduceByKey(add)
import LocalSparkContext          # helper module used in the original answer (not part of PySpark)

def getNgrams(sentence):
    # Return the list of consecutive word pairs (bigrams) in the sentence.
    out = []
    sen = sentence.split(" ")
    for k in range(len(sen) - 1):
        out.append((sen[k], sen[k + 1]))
    return out

if __name__ == '__main__':
    try:
        lsc = LocalSparkContext.LocalSparkContext("Recommendation", "spark://BigData:7077")
        sc = lsc.getBaseContext()
        ssc = lsc.getSQLContext()
        inFile = "bigramstxt.txt"
        sen = sc.textFile(inFile, 1)
        v = 1
        brv = sc.broadcast(v)
        # Count every bigram and keep those whose count exceeds the broadcast threshold.
        wordgroups = sen.flatMap(getNgrams).map(lambda t: (t, 1)).reduceByKey(add).filter(lambda t: t[1] > brv.value)
        bigrams = wordgroups.collect()
        sc.stop()

        inp = open(inFile, 'r').read()
        print(inp)
        # Replace each frequent "w1 w2" with "w1-w2" in the raw text.
        for b in bigrams:
            print(b)
            inp = inp.replace(" ".join(b[0]), "-".join(b[0]))
        print(inp)
    except:
        raise
    sc.stop()
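Since the question is about Scala and Spark, a minimal Scala sketch of the same idea may also help. It mirrors the answer's strategy: count the bigrams, collect those that reach the threshold on the driver, and rewrite each line with plain string replacement. The file path and threshold are the ones from the question; the object name is arbitrary, and this is an illustrative sketch rather than a verified solution:

import org.apache.spark.{SparkConf, SparkContext}

object ReplaceFrequentBigrams {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("replace")
    val sc = new SparkContext(conf)

    val threshold = 2
    val lines = sc.textFile("data/ddd.txt")

    // Count consecutive word pairs across all lines and keep the frequent ones.
    val frequentBigrams = lines
      .flatMap(_.split(" ").sliding(2).filter(_.length == 2).map(_.mkString(" ")))
      .map(bigram => (bigram, 1))
      .reduceByKey(_ + _)
      .filter(_._2 >= threshold)
      .keys
      .collect()

    // Rewrite each line: every frequent "w1 w2" becomes "w1-w2"; all other words stay.
    val replaced = lines.map { line =>
      frequentBigrams.foldLeft(line) { (acc, bigram) =>
        acc.replace(bigram, bigram.replace(' ', '-'))
      }
    }

    replaced.foreach(println)
    sc.stop()
  }
}

Because the frequent-bigram list is collected to the driver and captured in the closure, a broadcast variable (as the Python answer uses for its threshold) would be the more scalable choice if the list grows large.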
This concludes the article on replacing bigrams based on their frequency in Scala and Spark. We hope the answer above is helpful, and thank you for your continued support!