相关问题
这个问题很相关,但是已经有2岁了:In memory OLAP engine in Java
背景
我想在内存中根据给定的表格数据集创建类似于数据透视表的矩阵
例如按婚姻状况计数的年龄(行是年龄,列是婚姻状况)。
输入:具有年龄和某些布尔属性(例如已婚)的人员列表,
所需的输出:按年龄(行)和已婚(列)的人数
我尝试过的(Scala)
case class Person(val age:Int, val isMarried:Boolean)
...
val people:List[Person] = ... //
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
我天真地做到了,首先按年龄分组,然后按婚姻状态按
map
分组,并输出结果,然后按count
进行汇总TreeMap(peopleByAge.toSeq: _*).map(x => {
val age = x._1
val rows = x._2
val numMarried = rows.count(_.isMarried())
val numNotMarried = rows.length - numMarried
(age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]())(row,list) => {
val cumMarried = row._2+
(if (list.isEmpty) 0 else list.last.cumMarried)
val cumNotMarried = row._3 +
(if (list.isEmpty) 0 else l.last.cumNotMarried)
list :+ new FinalResult(row._1, row._2, row._3, cumMarried,cumNotMarried)
}.reverse
我不喜欢上面的代码,它效率不高,难以阅读,而且我相信还有更好的方法。
问题
如何按“两者”分组?以及如何对每个子组进行计数,例如
到底有多少人已经30岁结婚?
另一个问题是,我该如何总计,以回答该问题:
30岁以上有多少人结婚?
编辑:
感谢您提供的所有出色答案。
只是为了澄清一下,我希望输出包含一个带有以下各列的“表”
年龄(升序)
结婚人数
没结婚
已婚总人数
未结婚总人数
不仅回答这些特定的查询,而且要生成一份报告,以允许回答所有此类问题。
最佳答案
这是一个较为冗长的选项,但是以通用方式执行,而不是使用严格的数据类型。您当然可以使用泛型来使其变得更好,但我认为您明白了。
/** Creates a new pivot structure by finding correlated values
* and performing an operation on these values
*
* @param accuOp the accumulator function (e.g. sum, max, etc)
* @param xCol the "x" axis column
* @param yCol the "y" axis column
* @param accuCol the column to collect and perform accuOp on
* @return a new Pivot instance that has been transformed with the accuOp function
*/
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
// create list of indexes that correlate to x, y, accuCol
val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))
// group by x and y, sending the resulting collection of
// accumulated values to the accuOp function for post-processing
val data = body.groupBy(row => {
(row(colsIdx(0)), row(colsIdx(1)))
}).map(g => {
(g._1, accuOp(g._2.map(_(colsIdx(2)))))
}).toMap
// get distinct axis values
val xAxis = data.map(g => {g._1._1}).toList.distinct
val yAxis = data.map(g => {g._1._2}).toList.distinct
// create result matrix
val newRows = yAxis.map(y => {
xAxis.map(x => {
data.getOrElse((x,y), "")
})
})
// collect it with axis labels for results
Pivot(List((yCol + "/" + xCol) +: xAxis) :::
newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
}
我的Pivot类型非常基本:
class Pivot(val rows: List[List[String]]) {
val headers = rows.head.zipWithIndex.toMap
val body = rows.tail
...
}
要测试它,您可以执行以下操作:
val marriedP = Pivot(
List(
List("Name", "Age", "Married"),
List("Bill", "42", "TRUE"),
List("Heloise", "47", "TRUE"),
List("Thelma", "34", "FALSE"),
List("Bridget", "47", "TRUE"),
List("Robert", "42", "FALSE"),
List("Eddie", "42", "TRUE")
)
)
def accum(values: List[String]) = {
values.map(x => {1}).sum.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
产生:
Name Age Married
Bill 42 TRUE
Heloise 47 TRUE
Thelma 34 FALSE
Bridget 47 TRUE
Robert 42 FALSE
Eddie 42 TRUE
Married/Age 47 42 34
TRUE 2 2
FALSE 1 1
令人高兴的是,您可以像使用传统excel数据透视表一样使用curring传递任何函数的值。
可在此处找到更多信息:https://github.com/vinsonizer/pivotfun