Problem description
Hi, I want to use filter in R to keep all rows with the selected country codes. The data has continuous years from 1950 to 2014 and looks like this:
countrycode country currency_unit year rgdpe rgdpo pop emp avh
1 USA United States US Dollar 1950 2279787 2274197 155.5635 62.83500 1983.738
2 USA United States US Dollar 1951 2440076 2443820 158.2269 65.08094 2024.002
3 USA United States US Dollar 1952 2530524 2526412 160.9597 65.85582 2020.183
4 USA United States US Dollar 1953 2655277 2642977 163.6476 66.78711 2014.500
5 USA United States US Dollar 1954 2640868 2633803 166.5511 65.59514 1991.019
6 USA United States US Dollar 1955 2844098 2834914 169.5189 67.53133 1997.761
My code is:
dat_10 <- filter(data_all_country,countrycode == c("USA","CHN","GBR","IND","JPN","BRA","ZAF","FRA","DEU","ARG"))
Surprisingly, dat_10 looks like this:
countrycode country currency_unit year rgdpe rgdpo pop emp
1 ARG Argentina Argentine Peso 1954 51117.46 51031.80 18.58889 6.970472
2 ARG Argentina Argentine Peso 1964 69836.62 68879.08 21.95909 7.962999
3 ARG Argentina Argentine Peso 1974 113038.73 110358.46 25.64450 9.135211
4 ARG Argentina Argentine Peso 1984 148994.61 149928.59 29.92091 10.345933
5 ARG Argentina Argentine Peso 1994 379470.19 372903.00 34.55811 12.075872
6 ARG Argentina Argentine Peso 2004 517308.94 499958.94 38.72878 14.669195
Even the valid time-series data is kept only once every 10 years, which is exactly the number of countries I put in the condition.
How does this happen, and is there a way to fix it?
Why Should We Use %in% Instead of ==?
Let's look at the difference between == and %in% in more detail.
Assume we have a vector that looks like this.
sample_vec <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR")
We want to flag every USA, CHN, and GBR in the vector. The desired output looks like this, which would be useful for subsetting or filtering.
#[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
If we use == with c("USA", "CHN", "GBR"), we get the following.
sample_vec == c("USA", "CHN", "GBR")
#[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
Looks good? Wait, it is not doing what we think.
Let's test this code after appending one more country code to the original vector.
# Add one more country
sample_vec2 <- c(sample_vec, "IND")
sample_vec2 == c("USA", "CHN", "GBR")
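#[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
# Warning message:
# In sample_vec2 == c("USA", "CHN", "GBR") :
#   longer object length is not a multiple of shorter object length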
The result may look good, but pay attention to the warning message. It turns out that when == compares two vectors of different lengths, R recycles the shorter vector to match the length of the longer one. The code above effectively does the following, evaluating each pair of strings separately.
Position 1 2 3 4 5 6 7 8 9 10
Vector1 "USA" "CHN" "GBR" "IND" "JPN" "BRA" "USA" "CHN" "GBR" "IND"
Vector2 "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA" "CHN" "GBR" "USA"
Result TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
R checks whether the strings from Vector1 and Vector2 at Position 1 are the same. If they are, it returns TRUE; otherwise it returns FALSE. It then moves to Position 2, and so on. This is why there is a warning message: the length of sample_vec2 is 10, while the length of the target vector is only 3, and 10 is not a multiple of 3. R thus has to recycle the elements of the target vector, cutting the last cycle short, to perform the one-to-one comparison.
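The recycling described above can be made explicit. As a small sketch (target, recycled, manual, and automatic are names introduced here for illustration only), rep() with length.out builds the recycled vector by hand, and comparing against it should give the same result as the bare == call:

```r
sample_vec2 <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "USA", "CHN", "GBR", "IND")
target <- c("USA", "CHN", "GBR")

# Repeat the 3 codes until there are 10 of them; the last cycle is cut short.
recycled <- rep(target, length.out = length(sample_vec2))

# Same lengths on both sides, so there is no recycling and no warning.
manual <- sample_vec2 == recycled

# R recycles `target` implicitly; the length warning is silenced here on purpose.
automatic <- suppressWarnings(sample_vec2 == target)

print(recycled)
print(identical(manual, automatic))
```

Printing recycled reproduces the Vector2 row of the table above, which is a quick way to see which code each position is actually compared against.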
Once we realize that == recycles and compares element by element, it is clear that it is not suitable for filtering the elements of a vector against a set of values. Let's see the following example.
sample_vec == c("CHN", "GBR", "USA")
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The code is almost the same as sample_vec == c("USA", "CHN", "GBR"), except that I changed the order of the country codes. But it returns all FALSE! This is because, after recycling, the one-to-one comparison finds no position where the two strings are the same. This is probably not the result we want.
However, consider the following code.
sample_vec %in% c("CHN", "GBR", "USA")
#[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
It returns the expected result. This is because %in% is an interface to the match function in R: for each element of the left-hand vector, it returns TRUE if a match exists in the right-hand vector and FALSE otherwise, regardless of order.
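To tie this back to the original question, here is a minimal sketch on a made-up data frame. The names toy, wanted, kept_eq, and kept_in are hypothetical and not from the question, the data values are invented, and base R subsetting stands in for dplyr's filter so the block runs without extra packages. With ==, the 10 codes are recycled down the rows, so a row survives only when its code happens to line up with the recycled position; with %in%, every row whose code is in the set is kept.

```r
# Hypothetical toy data: two countries, 20 consecutive years each.
toy <- data.frame(
  countrycode = rep(c("ARG", "USA"), each = 20),
  year        = rep(1950:1969, times = 2),
  stringsAsFactors = FALSE
)

wanted <- c("USA", "CHN", "GBR", "IND", "JPN", "BRA", "ZAF", "FRA", "DEU", "ARG")

# Buggy version: `wanted` is recycled along the 40 rows, so row i is compared
# only with wanted[(i - 1) %% 10 + 1].
kept_eq <- toy[toy$countrycode == wanted, ]

# Fixed version: each row is tested for membership in `wanted`.
kept_in <- toy[toy$countrycode %in% wanted, ]

print(kept_eq)        # only a few rows, spaced 10 years apart per country
print(nrow(kept_in))  # all rows, since both countries are in `wanted`

# %in% is defined in base R in terms of match():
print(identical(toy$countrycode %in% wanted,
                match(toy$countrycode, wanted, nomatch = 0L) > 0L))

# With dplyr loaded, the corresponding change to the question's code would be:
# dat_10 <- filter(data_all_country, countrycode %in% wanted)
```

The 10-year spacing in kept_eq mirrors the pattern in the question: the gap between surviving rows equals the length of the vector on the right-hand side of ==.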