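Learning Spark's MLlib Component: Basic Concepts
1. Explanation
I won't go over the concept of stratified sampling itself; here is how to do it in practice:
RDDs provide operations that sample directly, such as sampleByKey and sample; this post covers these two.
(1) Strings of length 2 are assigned to stratum 2 and strings of length 3 to stratum 1; strata 1 and 2 are then sampled with different probabilities.
Data: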

aa
bb
cc
dd
ee
aaa
bbb
ccc
ddd
eee

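For example:
val fractions: Map[Int, Double] = List((1, 0.2), (2, 0.8)).toMap // sampling fraction for each stratum
sampleByKey(withReplacement = false, fractions, 0)
fractions sets a sampling fraction of 0.2 for stratum 1 and 0.8 for stratum 2. Each fraction is a per-element sampling rate, so the resulting sample sizes are approximate rather than exact.
withReplacement = false means sampling without replacement, so no element is drawn more than once.
0 is the seed for the random number generator.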
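
To make the roles of the three parameters concrete, here is a minimal sketch in plain Scala, without Spark. The helper sampleByKeySketch is hypothetical and is not Spark's implementation; it assumes that sampling without replacement keeps each element independently with the probability assigned to its key, which is why the sample size is only approximately the stratum size times the fraction.

```scala
import scala.util.Random

object StratifiedSketch {
  // Hypothetical helper (not part of Spark): keep each (key, value) pair
  // independently with the probability that `fractions` assigns to its key.
  // A key that is missing from `fractions` raises NoSuchElementException.
  def sampleByKeySketch[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    require(fractions.values.forall(_ >= 0.0), "Negative sampling rates.")
    val rng = new Random(seed) // the same seed reproduces the same sample
    data.filter { case (key, _) => rng.nextDouble() < fractions(key) }
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("aa", "bb", "cc", "dd", "ee", "aaa", "bbb", "ccc", "ddd", "eee")
    // Length 3 -> stratum 1, length 2 -> stratum 2, as described in the text.
    val data = words.map(w => (if (w.length == 3) 1 else 2, w))
    val fractions = Map(1 -> 0.2, 2 -> 0.8)
    sampleByKeySketch(data, fractions, seed = 0L).foreach(println)
  }
}
```

Because this sketch uses its own random number generator, its output for seed 0 will generally differ from what Spark returns for the same seed.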

Source code:

/**
 * Return a subset of this RDD sampled by key (via stratified sampling).
 *
 * Create a sample of this RDD using variable sampling rates for different keys as specified by
 * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
 * RDD, to produce a sample of size that's approximately equal to the sum of
 * math.ceil(numItems * samplingRate) over all key values.
 *
 * @param withReplacement whether to sample with or without replacement
 * @param fractions map of specific keys to sampling rates
 * @param seed seed for the random number generator
 * @return RDD containing the sampled subset
 */
def sampleByKey(withReplacement: Boolean,
    fractions: Map[K, Double],
    seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {

  require(fractions.values.forall(v => v >= 0.0), "Negative sampling rates.")

  val samplingFunc = if (withReplacement) {
    StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed)
  } else {
    StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed)
  }
  self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
}
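
The source above picks a Poisson sampling function when withReplacement is true and a Bernoulli one otherwise. The difference can be sketched in plain Scala as follows. All names here (ReplacementSketch, bernoulliCopies, poissonCopies) are hypothetical, and the code only illustrates the idea, assuming that each element is emitted either 0 or 1 times (Bernoulli) or a Poisson-distributed number of times whose mean is the sampling rate; it is not the code in StratifiedSamplingUtils.

```scala
import scala.util.Random

object ReplacementSketch {
  // Without replacement: each element appears 0 or 1 times.
  def bernoulliCopies(fraction: Double, rng: Random): Int =
    if (rng.nextDouble() < fraction) 1 else 0

  // With replacement: each element appears 0, 1, 2, ... times, drawn from a
  // Poisson distribution with mean `fraction` (Knuth's multiplication method,
  // which is adequate for small rates like the ones used in this post).
  def poissonCopies(fraction: Double, rng: Random): Int = {
    val limit = math.exp(-fraction)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  def sample[T](data: Seq[T], fraction: Double, withReplacement: Boolean, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.flatMap { x =>
      val copies =
        if (withReplacement) poissonCopies(fraction, rng)
        else bernoulliCopies(fraction, rng)
      Seq.fill(copies)(x) // repeat the element `copies` times
    }
  }

  def main(args: Array[String]): Unit = {
    val a = 1 to 10 // hypothetical input, chosen only for illustration
    println("with replacement    : " + sample(a, 0.8, withReplacement = true, 0L).mkString(" , "))
    println("without replacement : " + sample(a, 0.8, withReplacement = false, 0L).mkString(" , "))
  }
}
```

With replacement, the same element can show up several times in the output; without replacement, every element shows up at most once.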

2. Code:

import org.apache.spark.{SparkConf, SparkContext}

object StratifiedSamplingLearning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass.getSimpleName.filter(!_.equals('$')))
    val sc = new SparkContext(conf)

    println("First:")
    val data = sc.textFile("D:\\TestData\\StratifiedSampling.txt") // read the data
      .map(row => {
        // assign each string to a stratum based on its length
        if (row.length == 3) (row, 1) // length 3 -> stratum 1
        else (row, 2) // length 2 -> stratum 2
      }).map(each => (each._2, each._1)) // swap to (stratum, string) so the stratum is the key
    data.foreach(println)

    println("sampleByKey:")
    val fractions: Map[Int, Double] = List((1, 0.2), (2, 0.8)).toMap // sampling fraction for each stratum
    val approxSample = data.sampleByKey(withReplacement = false, fractions, 0) // draw the stratified sample
    approxSample.foreach(println)

    println("Second:")
    val randRDD = sc.parallelize(List((, "cat"), (, "mouse"), (, "cup"), (, "book"), (, "tv"), (, "screen"), (, "heater")))
    val sampleMap = List((, 0.4), (, 0.8)).toMap
    val sample2 = randRDD.sampleByKey(false, sampleMap, ).collect
    sample2.foreach(println)

    println("Third:")
    val a = sc.parallelize( to , )
    val b = a.sample(true, 0.8, ) // sample with replacement
    val c = a.sample(false, 0.8, ) // sample without replacement
    println("RDD a : " + a.collect().mkString(" , "))
    println("RDD b : " + b.collect().mkString(" , "))
    println("RDD c : " + c.collect().mkString(" , "))
    sc.stop()
  }
}

3. Results:

First:
(2,aa)
(1,bbb)
(2,bb)
(1,ccc)
(2,cc)
(1,ddd)
(2,dd)
(1,eee)
(2,ee)
(1,aaa)
sampleByKey:
(2,aa)
(2,bb)
(2,cc)
(2,ee)
Second:
(,cat)
(,mouse)
(,book)
(,screen)
(,heater)
Third:
RDD a : , , , , , , , , , , , , , , , , , , ,
RDD b : , , , , , , ,
RDD c : , , , , , , , , , , , , , ,
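
The sampleByKey output above contains four length-2 strings and no length-3 strings. That is consistent with the fractions used: each stratum holds 5 strings, so the expected counts are 5 × 0.2 = 1 for stratum 1 and 5 × 0.8 = 4 for stratum 2. The small sketch below works through this arithmetic, assuming each element is kept independently with its stratum's fraction. Under that assumption, the chance that stratum 1 contributes nothing at all is 0.8^5, about 0.33, so an empty stratum 1 is not unusual. The object and function names are hypothetical.

```scala
object ExpectedCounts {
  // Expected number of sampled items per stratum: stratum size times fraction.
  def expected(sizes: Map[Int, Int], fractions: Map[Int, Double]): Map[Int, Double] =
    sizes.map { case (stratum, n) => stratum -> n * fractions(stratum) }

  // Probability that a stratum of n items contributes no item at all when
  // every item is kept independently with probability `fraction`.
  def probEmpty(n: Int, fraction: Double): Double =
    math.pow(1.0 - fraction, n)

  def main(args: Array[String]): Unit = {
    val sizes = Map(1 -> 5, 2 -> 5) // five length-3 strings, five length-2 strings
    val fractions = Map(1 -> 0.2, 2 -> 0.8)
    println("expected counts: " + expected(sizes, fractions))
    println("P(stratum 1 is empty): " + probEmpty(5, 0.2))
  }
}
```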