Subsetting tidy data from a vector


Question

I'm using R to analyse antibiotic usage data from a number of hospitals.

I've imported this data into a data frame, following the tidy data principles.

>head(data)
        date   antibiotic  usage  hospital
1 2006-01-01   amikacin 0.000000 hospital1
2 2006-02-01   amikacin 0.000000 hospital1
3 2006-03-01   amikacin 0.000000 hospital1
4 2006-04-01   amikacin 0.000000 hospital1
5 2006-05-01   amikacin 0.937119 hospital1
6 2006-06-01   amikacin 1.002961 hospital1

(The data set is monthly data x 5 hospitals x 40 antibiotics.)

The first thing I would like to do is aggregate the antibiotics into classes.

> head(distinct(select(data, antibiotic)), 8)
                antibiotic
1                 amikacin
2  amoxicillin-clavulanate
3              amoxycillin
4               ampicillin
5             azithromycin
6         benzylpenicillin
7                cefalotin
8                cefazolin

> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> ceph1 <- c("cefalotin", "cefazolin")

I would then like to subset the data based on these antibiotic class vectors:

filter(data, antibiotic = (any one of the values in the vector "penicillins"))

Thanks to thelatemail for pointing out that the way to do this is:

d <- filter(data, antibiotic %in% penicillins)
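
To see what `%in%` is doing in that call, here is a minimal sketch on a made-up data frame. The `toy` rows and usage values are invented for illustration and are not taken from the real data set. `%in%` returns one TRUE/FALSE per row, so `filter()` keeps every row whose antibiotic appears anywhere in the vector.

```r
library(dplyr)

# Toy data invented for illustration; not the real hospital data.
toy <- data.frame(
  antibiotic = c("amikacin", "ampicillin", "cefazolin", "amoxycillin"),
  usage      = c(0.5, 2.0, 1.5, 3.0),
  stringsAsFactors = FALSE
)

penicillins <- c("amoxicillin-clavulanate", "amoxycillin",
                 "ampicillin", "benzylpenicillin")

# One logical value per row: is this row's antibiotic in the vector?
keep <- toy$antibiotic %in% penicillins

# filter() keeps exactly the rows where that logical is TRUE.
toy_pen <- filter(toy, antibiotic %in% penicillins)
print(toy_pen)
```

Using `==` against a vector of several names would instead compare element by element with recycling, which is why it does not give the intended subset.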

I would like to analyse the data in a number of ways:

The key analysis (and ggplot output) is:

x = date

y = usage of antibiotic(s) stratified by (drug | class), filtered by hospital

What I'm not clear on now is how to aggregate the data for this sort of analysis.

Example:
I want to analyse the use of class "ceph1" across all the hospitals in the district, resulting in (apologies, I know this is not proper code):

   x         y
Jan-2006   sum over all hospitals of (usage of cefazolin + usage of cefalotin)
Feb-2006   sum over all hospitals of (usage of cefazolin + usage of cefalotin)
etc.

And, in the long run, I'd like to be able to pass arguments to a function that lets me select which hospitals and which antibiotic or class of antibiotics to include.

Thanks again. I know this is an order of magnitude more complicated than the original question!

Recommended answer

So after lots of trial and error and heaps of reading, I've managed to sort it out.

>str(data)
'data.frame':   23360 obs. of  4 variables:
 $ date      : Date, format: "2007-09-01" "2012-06-01" ...
 $ antibiotic: Factor w/ 41 levels "amikacin","amoxicillin-clavulanate",..: 17 3 19 30 38 20 20 20 7 25 ...
 $ usage     : num  21.368 36.458 7.226 3.671 0.917 ...
 $ hospital  : Factor w/ 5 levels "hospital1","hospital2",..: 1 3 2 1 4 1 4 3 5 1 ...

So I can subset the data first:

>library(dplyr)
>penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
>d <- filter(data, antibiotic %in% penicillins)

And then make the summary using more of dplyr (thanks, Hadley!):

>d1 <- summarise(group_by(d, date), total = sum(usage))
>d1
Source: local data frame [122 x 2]

         date    total
       (date)    (dbl)
1  2006-01-01 1669.177
2  2006-02-01 1901.749
3  2006-03-01 2311.008
4  2006-04-01 1921.436
5  2006-05-01 1594.781
6  2006-06-01 2150.997
7  2006-07-01 2052.517
8  2006-08-01 2132.501
9  2006-09-01 1959.916
10 2006-10-01 1751.667
..        ...      ...
> library(ggplot2)
> qplot(date, total, data = d1) + geom_smooth()
[scatterplot as desired!]
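
The same summary can be stratified by class rather than computed for one class at a time. The following is only a sketch under assumptions: the `toy` data frame and its numbers are invented, and `class_lookup` is a hypothetical helper name, not something from the original data. It maps each drug to a class label with a named vector, then groups by date and class.

```r
library(dplyr)

# Toy data invented for illustration; not the real hospital data.
toy <- data.frame(
  date = as.Date(c("2006-01-01", "2006-01-01", "2006-01-01",
                   "2006-02-01", "2006-02-01", "2006-02-01")),
  antibiotic = c("ampicillin", "cefazolin", "cefalotin",
                 "ampicillin", "cefazolin", "cefalotin"),
  usage = c(1, 2, 3, 4, 5, 6),
  hospital = c("hospital1", "hospital1", "hospital2",
               "hospital1", "hospital2", "hospital2"),
  stringsAsFactors = FALSE
)

penicillins <- c("amoxicillin-clavulanate", "amoxycillin",
                 "ampicillin", "benzylpenicillin")
ceph1 <- c("cefalotin", "cefazolin")

# Hypothetical lookup: names are drugs, values are class labels.
class_lookup <- c(
  setNames(rep("penicillins", length(penicillins)), penicillins),
  setNames(rep("ceph1", length(ceph1)), ceph1)
)

by_class <- toy %>%
  # as.character() guards against antibiotic being stored as a factor
  mutate(class = unname(class_lookup[as.character(antibiotic)])) %>%
  group_by(date, class) %>%
  summarise(total = sum(usage)) %>%
  ungroup() %>%
  as.data.frame()

print(by_class)
```

Drugs that are missing from the lookup end up with a class of `NA`, so they would form their own group unless filtered out first.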

The next step will be to try to build it all into a function and/or do the subsetting in-line, building on what I've worked out here.
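
One way that function might look, as a rough sketch rather than a finished solution: `monthly_usage` is a hypothetical name, its argument names are assumptions, and the `toy` data is invented. It assumes the data frame has the `date`, `antibiotic`, `usage` and `hospital` columns shown by `str(data)` above, and does the subsetting in-line before summarising.

```r
library(dplyr)

# Hypothetical helper: total usage per month for the chosen drugs,
# restricted to the chosen hospitals (all hospitals by default).
monthly_usage <- function(data, antibiotics, hospitals = unique(data$hospital)) {
  data %>%
    filter(antibiotic %in% antibiotics, hospital %in% hospitals) %>%
    group_by(date) %>%
    summarise(total = sum(usage)) %>%
    ungroup() %>%
    as.data.frame()
}

# Toy data invented for illustration; not the real hospital data.
toy <- data.frame(
  date = as.Date(c("2006-01-01", "2006-01-01", "2006-01-01",
                   "2006-02-01", "2006-02-01", "2006-02-01")),
  antibiotic = c("ampicillin", "cefazolin", "cefalotin",
                 "ampicillin", "cefazolin", "cefalotin"),
  usage = c(1, 2, 3, 4, 5, 6),
  hospital = c("hospital1", "hospital1", "hospital2",
               "hospital1", "hospital2", "hospital2"),
  stringsAsFactors = FALSE
)

ceph1 <- c("cefalotin", "cefazolin")

all_sites <- monthly_usage(toy, ceph1)               # every hospital
one_site  <- monthly_usage(toy, ceph1, "hospital2")  # a single hospital
print(all_sites)
print(one_site)
```

The result has the same two columns as `d1`, so it could be passed to the same plotting call. Passing a class vector such as `ceph1` or a single drug name both work, because `%in%` accepts a vector of any length.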
