scala - Java/Scala的内存olap/数据透视表中是否有数据结构/库要做？

相关问题

这个问题很相关，但是已经有2岁了：In memory OLAP engine in Java

背景

我想在内存中根据给定的表格数据集创建类似于数据透视表的矩阵

例如按婚姻状况计数的年龄（行是年龄，列是婚姻状况）。

输入：具有年龄和某些布尔属性（例如已婚）的人员列表，
所需的输出：按年龄（行）和已婚（列）的人数

我尝试过的（Scala）

case class Person(val age:Int, val isMarried:Boolean)

...
val people:List[Person] = ... //

val peopleByAge = people.groupBy(_.age)  //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status

我天真地做到了，首先按年龄分组，然后按婚姻状态按map分组，并输出结果，然后按count进行汇总

TreeMap(peopleByAge.toSeq: _*).map(x => {
    val age = x._1
    val rows = x._2
    val numMarried = rows.count(_.isMarried())
    val numNotMarried = rows.length - numMarried
    (age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]())(row,list) => {
     val cumMarried = row._2+
        (if (list.isEmpty) 0 else list.last.cumMarried)
     val cumNotMarried = row._3 +
        (if (list.isEmpty) 0 else l.last.cumNotMarried)
     list :+ new FinalResult(row._1, row._2, row._3, cumMarried,cumNotMarried)
}.reverse

我不喜欢上面的代码，它效率不高，难以阅读，而且我相信还有更好的方法。

问题

如何按“两者”分组？以及如何对每个子组进行计数，例如

到底有多少人已经30岁结婚？

另一个问题是，我该如何总计，以回答该问题：

30岁以上有多少人结婚？

编辑：

感谢您提供的所有出色答案。

只是为了澄清一下，我希望输出包含一个带有以下各列的“表”

年龄（升序）
结婚人数
没结婚
已婚总人数
未结婚总人数

不仅回答这些特定的查询，而且要生成一份报告，以允许回答所有此类问题。

最佳答案

这是一个较为冗长的选项，但是以通用方式执行，而不是使用严格的数据类型。您当然可以使用泛型来使其变得更好，但我认为您明白了。

/** Creates a new pivot structure by finding correlated values
  * and performing an operation on these values
  *
  * @param accuOp the accumulator function (e.g. sum, max, etc)
  * @param xCol the "x" axis column
  * @param yCol the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))

  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap

  // get distinct axis values
  val xAxis = data.map(g => {g._1._1}).toList.distinct
  val yAxis = data.map(g => {g._1._2}).toList.distinct

  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x,y), "")
    })
  })

 // collect it with axis labels for results
 Pivot(List((yCol + "/" + xCol) +: xAxis) :::
   newRows.zip(yAxis).map(x=> {x._2 +: x._1}))
}

我的Pivot类型非常基本：

class Pivot(val rows: List[List[String]]) {

  val headers = rows.head.zipWithIndex.toMap
  val body    = rows.tail
  ...
}

要测试它，您可以执行以下操作：

val marriedP = Pivot(
  List(
    List("Name", "Age", "Married"),
    List("Bill", "42", "TRUE"),
    List("Heloise", "47", "TRUE"),
    List("Thelma", "34", "FALSE"),
    List("Bridget", "47", "TRUE"),
    List("Robert", "42", "FALSE"),
    List("Eddie", "42", "TRUE")

  )
)

def accum(values: List[String]) = {
    values.map(x => {1}).sum.toString
}
println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))

产生：

Name     Age      Married
Bill     42       TRUE
Heloise  47       TRUE
Thelma   34       FALSE
Bridget  47       TRUE
Robert   42       FALSE
Eddie    42       TRUE

Married/Age  47           42           34
TRUE         2            2
FALSE                     1            1

令人高兴的是，您可以像使用传统excel数据透视表一样使用curring传递任何函数的值。

可在此处找到更多信息：https://github.com/vinsonizer/pivotfun