Problem description
I'm trying to sort out a bunch of data that I have about dinosaurs and their age ranges. So far, my data consists of a column of names, and then two columns of maximum and minimum dates in millions of years in the past, as you can see here:
GENUS ma_max ma_min ma_mid
Abydosaurus 109 94.3 101.65
Achelousaurus 84.9 70.6 77.75
Acheroraptor 70.6 66.043 68.3215
Geological time is split into different periods (such as the Jurassic and Cretaceous), and these are subdivided into stages. These stages have specific age ranges, and I have made a data frame to display them:
Stage ma_max ma_min ma_mid
Hettangian 201.6 197.0 199.30
Sinemurian 197.0 190.0 193.50
Pliensbachian 190.0 183.0 186.50
Toarcian 183.0 176.0 179.50
Aalenian 176.0 172.0 174.00
Bajocian 172.0 168.0 170.00
Bathonian 168.0 165.0 166.50
Callovian 165.0 161.0 163.00
Oxfordian 161.0 156.0 158.50
Kimmeridgian 156.0 151.0 153.50
Tithonian 151.0 145.5 148.25
Berriasian 145.5 140.0 142.75
Valanginian 140.0 136.0 138.00
Hauterivian 136.0 130.0 133.00
Barremian 130.0 125.0 127.50
Aptian 125.0 112.0 118.50
Albian 112.0 99.6 105.80
Cenomanian 99.6 93.5 96.55
Turonian 93.5 89.3 91.40
Coniacian 89.3 85.8 87.55
Santonian 85.8 83.5 84.65
Campanian 83.5 70.6 77.05
Maastrichtian 70.6 66.5 68.05
I'm trying to find out how many genera are in each stage. The problem is the range: for example, a genus can have a range that spans 3 or more stages, and I want each of those stages to record the presence of that genus. Is there a simple way to do this? I thought about using 'shingle' from the lattice package, as suggested in a similar discussion on here, but I'm very new to R and not sure whether it can be applied to data that has ranges.
Recommended answer
Assuming your data frames are called genus and stage, first create a list that contains, for each Stage, the names of the genera that lived during that Stage. Then we'll add that to the stage data frame and also add another column that counts the number of genera living during each Stage.
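To make the steps that follow reproducible, here is a minimal sketch that rebuilds the two data frames from the values shown in the question. Only the stages from Albian to Maastrichtian are included to keep it short. Two things are assumptions rather than part of the original post: stringsAsFactors = FALSE (so the names are stored as plain character strings) and the ma_mid columns, which are recomputed from the bounds instead of being copied.

```r
# Genus age ranges, copied from the question (millions of years ago)
genus <- data.frame(
  GENUS  = c("Abydosaurus", "Achelousaurus", "Acheroraptor"),
  ma_max = c(109, 84.9, 70.6),
  ma_min = c(94.3, 70.6, 66.043),
  stringsAsFactors = FALSE
)
# Midpoint of each range, recomputed from the bounds
genus$ma_mid <- (genus$ma_max + genus$ma_min) / 2

# A subset of the stage table from the question (Albian to Maastrichtian)
stage <- data.frame(
  Stage  = c("Albian", "Cenomanian", "Turonian", "Coniacian",
             "Santonian", "Campanian", "Maastrichtian"),
  ma_max = c(112.0, 99.6, 93.5, 89.3, 85.8, 83.5, 70.6),
  ma_min = c(99.6, 93.5, 89.3, 85.8, 83.5, 70.6, 66.5),
  stringsAsFactors = FALSE
)
stage$ma_mid <- (stage$ma_max + stage$ma_min) / 2
```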
In the code below, sapply takes each value of Stage in turn and tests which values of GENUS have a time range that overlaps that Stage's time range, by comparing the Stage's ma_max and ma_min with the ma_max and ma_min for each GENUS.
# List of genera that lived during each Stage
stages.genus = sapply(as.character(stage$Stage), function(x) {
  s = stage[stage$Stage == x, ]
  # Two ranges overlap when each one starts no later than the other ends.
  # This also catches a genus whose whole range lies inside a single Stage.
  genus$GENUS[s$ma_min <= genus$ma_max & s$ma_max >= genus$ma_min]
}, simplify = FALSE)  # always return a list, one element per Stage
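The comparison in the code above is essentially an interval-overlap test. As a rough sketch, it can be pulled out into a small helper (the name overlaps is made up for illustration): two closed ranges share at least one point exactly when each one starts no later than the other ends. Boundaries are treated as inclusive here, which is an assumption; it means a genus whose range ends exactly on a stage boundary is counted in both neighbouring stages. The 80 to 75 range in the last call is an invented value, used only to show a range lying entirely inside one stage.

```r
# Hypothetical helper: TRUE when the closed ranges [min1, max1] and
# [min2, max2] share at least one point (ages in millions of years)
overlaps <- function(max1, min1, max2, min2) {
  min1 <= max2 & max1 >= min2
}

# Abydosaurus (109 to 94.3) against the Albian (112.0 to 99.6)
in_albian <- overlaps(109, 94.3, 112.0, 99.6)

# Abydosaurus against the Turonian (93.5 to 89.3)
in_turonian <- overlaps(109, 94.3, 93.5, 89.3)

# Invented range (80 to 75) lying entirely inside the Campanian (83.5 to 70.6)
inside_campanian <- overlaps(80, 75, 83.5, 70.6)
```

Because the helper uses only vectorised comparisons, it can be called with whole columns (for example genus$ma_max and genus$ma_min) against a single stage's bounds.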
For each element of stages.genus, paste together all values of GENUS that apply to that Stage, separated by a comma, giving us a vector containing the genera that go with each value of Stage. Assign that vector as a new column of stage that we'll call genera.
# Add list of genera by stage to the stage data frame
# (collapse, not sep, joins the elements of one vector into a single string)
stage$genera = sapply(stages.genus, paste, collapse=", ")
To get a count of the number of genera in each Stage, just count the number of genera in each element of stages.genus and assign that to a new column of stage that we'll call Ngenera:
# Add count of genera for each Stage to the stage data frame
# (sapply returns a plain integer vector rather than a list)
stage$Ngenera = sapply(stages.genus, length)
Here's the result:
> stage
Stage ma_max ma_min ma_mid genera Ngenera
1 Hettangian 201.6 197.0 199.30 0
2 Sinemurian 197.0 190.0 193.50 0
...
16 Aptian 125.0 112.0 118.50 0
17 Albian 112.0 99.6 105.80 Abydosaurus 1
18 Cenomanian 99.6 93.5 96.55 Abydosaurus 1
19 Turonian 93.5 89.3 91.40 0
20 Coniacian 89.3 85.8 87.55 0
21 Santonian 85.8 83.5 84.65 Achelousaurus 1
22 Campanian 83.5 70.6 77.05 Achelousaurus, Acheroraptor 2
23 Maastrichtian 70.6 66.5 68.05 Achelousaurus, Acheroraptor 2
An additional option is to create a column in stage for each GENUS and set the value to 1 if the GENUS lived during that stage or zero otherwise:
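stage[, as.character(genus$GENUS)] = lapply(as.character(genus$GENUS), function(x) {
  # 1 if genus x is listed for the Stage, 0 otherwise (exact match, not substring)
  as.integer(sapply(stages.genus, function(g) x %in% g))
})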
Here are the additional columns we just added:
> stage[ , c(1,7:9)] # Just show the Stage plus the three new GENUS columns
Stage Abydosaurus Achelousaurus Acheroraptor
1 Hettangian 0 0 0
2 Sinemurian 0 0 0
...
16 Aptian 0 0 0
17 Albian 1 0 0
18 Cenomanian 1 0 0
19 Turonian 0 0 0
20 Coniacian 0 0 0
21 Santonian 0 1 0
22 Campanian 0 1 1
23 Maastrichtian 0 1 1
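As an alternative sketch (not part of the original answer), the same kind of 0/1 table can be built in one step with outer(), which evaluates the overlap test for every stage-genus pair. The small data frames here are a subset of the question's data, the object names present, presence and Ngenera are chosen for this example, and boundaries are again assumed to be inclusive.

```r
genus <- data.frame(
  GENUS  = c("Abydosaurus", "Achelousaurus", "Acheroraptor"),
  ma_max = c(109, 84.9, 70.6),
  ma_min = c(94.3, 70.6, 66.043),
  stringsAsFactors = FALSE
)
stage <- data.frame(
  Stage  = c("Cenomanian", "Turonian", "Santonian", "Campanian"),
  ma_max = c(99.6, 93.5, 85.8, 83.5),
  ma_min = c(93.5, 89.3, 83.5, 70.6),
  stringsAsFactors = FALSE
)

# Rows are stages, columns are genera; TRUE where the two ranges overlap
present <- outer(stage$ma_min, genus$ma_max, "<=") &
           outer(stage$ma_max, genus$ma_min, ">=")
dimnames(present) <- list(stage$Stage, genus$GENUS)

presence <- present * 1L        # logical matrix -> 0/1 integer matrix
Ngenera  <- rowSums(presence)   # number of genera per stage
```

If the columns are wanted inside the stage data frame, cbind(stage, presence) would attach them; the matrix form is kept here because rowSums() and colSums() work on it directly.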
The last step will also set you up for a nice visualization of genera by stage. For example:
library(reshape2)
library(ggplot2)
# Melt data into long format (Stage and the three age columns are identifiers)
stage.m = melt(stage[, c(1:4, 7:9)], id.vars=1:4)
# Tile plot where height of each Stage is proportional to how long it lasted
ggplot(stage.m, aes(variable, ma_mid, fill=factor(value))) +
  geom_tile(aes(height=ma_max - ma_min), colour="grey20", lwd=0.2) +
  scale_fill_manual(values=c("white","blue")) +
  scale_y_continuous(breaks=stage$ma_mid, labels=stage$Stage) +
  xlab("Genus") + ylab("Stage") +
  theme_bw(base_size=15) +
  guides(fill="none")  # hide the legend (fill=FALSE is deprecated in newer ggplot2)
The previous code can also be modified to use time ranges from both the stage and genus data frames if you want the blue coloring to cover only the time range when each GENUS lived, rather than the full range of each Stage in which they lived.
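One possible way to sketch that modification: for every genus-stage pair, clip the genus range to the stage boundaries with pmin() and pmax(), and keep only the pairs where the clipped range has positive length. The data is a subset of the question's tables, and the names pairs, clipped, top and bottom are invented for this example. Note that the strict comparison drops pairs that only touch at a boundary, which differs from an inclusive overlap test.

```r
genus <- data.frame(
  GENUS  = c("Abydosaurus", "Achelousaurus"),
  ma_max = c(109, 84.9),
  ma_min = c(94.3, 70.6),
  stringsAsFactors = FALSE
)
stage <- data.frame(
  Stage  = c("Albian", "Cenomanian", "Santonian", "Campanian"),
  ma_max = c(112.0, 99.6, 85.8, 83.5),
  ma_min = c(99.6, 93.5, 83.5, 70.6),
  stringsAsFactors = FALSE
)

# All genus-stage combinations (cross join); suffixes tell the columns apart
pairs <- merge(genus, stage, by = NULL, suffixes = c(".genus", ".stage"))

# Clip each genus range to the stage it is paired with
pairs$top    <- pmin(pairs$ma_max.genus, pairs$ma_max.stage)  # older limit
pairs$bottom <- pmax(pairs$ma_min.genus, pairs$ma_min.stage)  # younger limit

# Keep only pairs where the clipped range has positive length
clipped <- pairs[pairs$top > pairs$bottom,
                 c("GENUS", "Stage", "top", "bottom")]
```

Each row of clipped then holds the part of a stage during which the genus was present, so the rows could be drawn as rectangles (for example with geom_rect()) instead of full-height tiles.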