Problem description
My dataset is as follows:
salary number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How can I calculate the median of this dataset? Here's what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
Here, you can see the half-way frequency is 2400/2 = 1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:
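$$\text{Median} = L + \frac{n/2 - cf}{f} \times h$$
where:
- L is the lower class boundary of the median class
- h is the size of the median class (upper class boundary minus lower class boundary)
- f is the frequency of the median class
- cf is the cumulative frequency of the class before the median class
- n/2 is the total number of observations divided by 2
Alternatively, median class is defined by the following method:
Locate n/2 in the cumulative frequency column and get the class in which it lies.
In code: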
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
Now what I want to do is to make the above expression more elegant, i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I achieve this?
Recommended answer
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
- frequencies: A vector of frequencies ("number" in your first example).
- intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly trim) to have the function automatically create the required matrix for you.
- sep: The separator character in your "intervals" column in your data.frame.
- trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
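To make the roles of sep and trim concrete, here is a minimal sketch of the parsing step on its own. The labels vector is hypothetical (made up to mimic what cut() produces); the gsub/strsplit/sapply chain is the same one the function below uses.

```r
# Hypothetical interval labels in the style produced by cut()
labels <- c("(1.9,11.7]", "(11.7,21.5]", "(21.5,31.4]")

# trim = "cut" corresponds to this pattern: strip (, ), [ and ]
pattern <- "\\[|\\]|\\(|\\)"
stripped <- gsub(pattern, "", labels)

# sep = "," splits each label into its lower and upper boundary;
# sapply() then binds the pieces into a 2-row numeric matrix
intervals <- sapply(strsplit(stripped, ","), as.numeric)
intervals
#      [,1] [,2] [,3]
# [1,]  1.9 11.7 21.5
# [2,] 11.7 21.5 31.4
```

Row 1 holds the lower class boundaries and row 2 the upper ones, which is the layout the intervals argument expects.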
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  # cumulative frequency of the class before the median class
  # (0 when the median class is the first class)
  cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0
  n_2 <- max(cf)/2               # total observations divided by 2
  unname(L + (n_2 - cf2)/f * h)
}
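To see what each intermediate value looks like, the steps inside the function can be traced by hand on the salary data from the question. This is only a sketch: building the 2-row intervals matrix with rbind() and seq() is my own assumption about how one might supply that form directly, and the variable names simply mirror those in the function.

```r
# Frequencies from the question's salary table
frequencies <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)

# 2-row matrix: row 1 = lower boundaries, row 2 = upper boundaries
intervals <- rbind(seq(1500, 2400, by = 100),
                   seq(1600, 2500, by = 100))

cf     <- cumsum(frequencies)
n_2    <- max(cf) / 2                   # 1200
Midrow <- findInterval(n_2, cf) + 1     # 5, i.e. the 1900-2000 class
L      <- intervals[1, Midrow]          # 1900
h      <- diff(intervals[, Midrow])     # 100
f      <- frequencies[Midrow]           # 850
cf2    <- cf[Midrow - 1]                # 1070

result <- L + (n_2 - cf2) / f * h
result
# [1] 1915.294
```

findInterval() returns the index of the last cumulative frequency that does not exceed n/2, so adding 1 gives the row of the median class.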
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
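As a cross-check on this result, the grouped median can also be seen as linear interpolation of the cumulative counts at n/2. The sketch below uses base R's approx() rather than GroupedMedian; it is a different route to the same number, not part of the function above, and the names freq, breaks and cum_at_break are made up for the illustration.

```r
# Same data as in the question
freq   <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
breaks <- seq(1500, 2500, by = 100)   # 11 class boundaries for 10 classes

# Cumulative count reached at each boundary (0 at the lowest boundary)
cum_at_break <- c(0, cumsum(freq))

# Interpolate linearly to find the salary at which the count reaches n/2
check <- approx(x = cum_at_break, y = breaks, xout = sum(freq) / 2)$y
check
# [1] 1915.294
```

This assumes observations are spread evenly within each class, which is the same assumption the grouped-median formula makes.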
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, in the sample data that you provided, I think there was a mistake in one of your ranges (all were separated by dashes except one, which was separated by a comma). Since strsplit uses a regular expression by default to split on, you can use the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
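The reason sep = "-|," works is that strsplit() treats its split argument as a regular expression, so the alternation matches either separator. A small sketch on two hypothetical labels:

```r
# Hypothetical labels: one separated by a dash, one by a comma
ranges <- c("1900-2000", "2000,2100")

# "-|," is a regular expression meaning "a dash or a comma"
parts <- strsplit(ranges, "-|,")
parts
# [[1]]
# [1] "1900" "2000"
#
# [[2]]
# [1] "2000" "2100"
```

Both labels split into a lower and an upper boundary, so the mixed separators no longer matter once the pieces are converted with as.numeric.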