R: Applying a cumulative sum function and filling data gaps with NA for plotting

Problem description

I have a dataframe which looks like this and I am trying to calculate the cumulative sum for the column VALUE. The input file can also be found here: https://dl.dropboxusercontent.com/u/16277659/input.csv

    df <- read.csv("input.csv", sep=";", header=TRUE)

    NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
    SAMPLE1; 253; 1880; 1879; 14
    SAMPLE1; 253; 1881; 1880; -10
    SAMPLE1; 253; 1882; 1881; 4
    SAMPLE1; 253; 1883; 1882; 10
    SAMPLE1; 253; 1884; 1883; 10
    SAMPLE1; 253; 1885; 1884; 12
    SAMPLE1; 253; 1889; 1888; 11
    SAMPLE1; 253; 1890; 1889; 12
    SAMPLE1; 253; 1911; 1910; -16
    SAMPLE1; 253; 1913; 1911; -11
    SAMPLE1; 253; 1914; 1913; -8
    SAMPLE2; 261; 1992; 1991; -19
    SAMPLE2; 261; 1994; 1992; -58
    SAMPLE2; 261; 1995; 1994; -40
    SAMPLE2; 261; 1996; 1995; -21
    SAMPLE2; 261; 1997; 1996; -50
    SAMPLE2; 261; 1998; 1997; -60
    SAMPLE2; 261; 2005; 2004; -34
    SAMPLE2; 261; 2006; 2005; -23
    SAMPLE2; 261; 2007; 2006; -19
    SAMPLE2; 261; 2008; 2007; -29
    SAMPLE2; 261; 2009; 2008; -89
    SAMPLE2; 261; 2013; 2009; -14
    SAMPLE2; 261; 2014; 2013; -16

The end product I am aiming for is a plot for each SAMPLE, with SURVEY_YEAR on the x axis and the later calculated cumulative sum CUMSUM of VALUE on the y axis.

My code so far to sort out the data:

    # Filter out all values with less than 3 measurements by group
    # (in this case does nothing, but is important with the rest of my data)
    df <- read.csv("input.csv", sep=";", header=TRUE)
    rowsn <- with(df, by(VALUE, ID, function(xx) sum(!is.na(xx))))
    names(which(rowsn >= 3))
    dat <- df[df$ID %in% names(which(rowsn >= 3)), ]

    # Add a new first row per group (split by ID) that defines the beginning
    # of the group and the starting value for the cumsum function (= 0)
    dat <- do.call(rbind, lapply(split(dat, dat$ID), function(x) {
      x <- rbind(x[1, ], x)
      x[1, "VALUE"] <- 0
      x[1, "SURVEY_YEAR"] <- x[1, "SURVEY_YEAR"] - 1
      return(x)
    }))
    rownames(dat) <- seq_len(nrow(dat))

    # Write dat to csv file for inspection
    write.table(dat, "dat.csv", sep=";", row.names=FALSE)

This results in the following dataframe, which is the starting point for the calculation of the cumulative sum of the column VALUE.

    NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
    SAMPLE1; 253; 1879; 1879; 0
    SAMPLE1; 253; 1880; 1879; 14
    SAMPLE1; 253; 1881; 1880; -10
    SAMPLE1; 253; 1882; 1881; 4
    SAMPLE1; 253; 1883; 1882; 10
    SAMPLE1; 253; 1884; 1883; 10
    SAMPLE1; 253; 1885; 1884; 12
    SAMPLE1; 253; 1889; 1888; 11
    SAMPLE1; 253; 1890; 1889; 12
    SAMPLE1; 253; 1911; 1910; -16
    SAMPLE1; 253; 1913; 1911; -11
    SAMPLE1; 253; 1914; 1913; -8
    SAMPLE2; 261; 1991; 1991; 0
    SAMPLE2; 261; 1992; 1991; -19
    SAMPLE2; 261; 1994; 1992; -58
    SAMPLE2; 261; 1995; 1994; -40
    SAMPLE2; 261; 1996; 1995; -21
    SAMPLE2; 261; 1997; 1996; -50
    SAMPLE2; 261; 1998; 1997; -60
    SAMPLE2; 261; 2005; 2004; -34
    SAMPLE2; 261; 2006; 2005; -23
    SAMPLE2; 261; 2007; 2006; -19
    SAMPLE2; 261; 2008; 2007; -29
    SAMPLE2; 261; 2009; 2008; -89
    SAMPLE2; 261; 2013; 2009; -14
    SAMPLE2; 261; 2014; 2013; -16

The problem now is that I would like to calculate the cumulative sum of the column VALUE for each year.

As you can see, I have gaps between certain years (for example in SAMPLE1 between 1890 and 1911 and in SAMPLE2 between 1998 and 2005). I would like to fill the gaps with NA values for each year in between, so that I can plot with plot type='b' (points and lines) and so that the segments on either side of a gap are not connected. What is important is that if there are multiple NA values in a row, the last NA value in the CUMSUM column should be replaced with the last numerical value before the gap.

The normal case is that the difference between the REFERENCE_YEAR and the SURVEY_YEAR equals 1 (e.g. for the first example of SAMPLE1 from 1880 to 1881), but in some cases there are varying periods between the REFERENCE_YEAR and the SURVEY_YEAR (e.g. in SAMPLE1 from 1911 to 1913 and in SAMPLE2 from 2009 to 2013). In that case the cumulative sum should only be applied once and the value should stay the same for the period indicated (in the plot this results in a straight, connected line).

It's difficult to explain everything in detail, and maybe it's easier if I provide an example of what the result should look like:

    NAME; ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE; CUMSUM
    SAMPLE1; 253; 1879; 1879; 0; 0
    SAMPLE1; 253; 1880; 1879; 14; 14
    SAMPLE1; 253; 1881; 1880; -10; 4
    SAMPLE1; 253; 1882; 1881; 4; 8
    SAMPLE1; 253; 1883; 1882; 10; 18
    SAMPLE1; 253; 1884; 1883; 10; 28
    SAMPLE1; 253; 1885; 1884; 12; 40
    SAMPLE1; 253; 1886; 1885; NA; NA
    SAMPLE1; 253; 1887; 1886; NA; NA
    SAMPLE1; 253; 1888; 1887; NA; 40
    SAMPLE1; 253; 1889; 1888; 11; 51
    SAMPLE1; 253; 1890; 1889; 12; 63
    SAMPLE1; 253; 1891; 1890; NA; NA
    SAMPLE1; 253; 1892; 1891; NA; NA
    SAMPLE1; 253; 1893; 1892; NA; NA
    SAMPLE1; 253; 1894; 1893; NA; NA
    SAMPLE1; 253; 1895; 1894; NA; NA
    SAMPLE1; 253; 1896; 1895; NA; NA
    SAMPLE1; 253; 1897; 1896; NA; NA
    SAMPLE1; 253; 1898; 1897; NA; NA
    SAMPLE1; 253; 1899; 1898; NA; NA
    SAMPLE1; 253; 1900; 1899; NA; NA
    SAMPLE1; 253; 1901; 1900; NA; NA
    SAMPLE1; 253; 1902; 1901; NA; NA
    SAMPLE1; 253; 1903; 1902; NA; NA
    SAMPLE1; 253; 1904; 1903; NA; NA
    SAMPLE1; 253; 1905; 1904; NA; NA
    SAMPLE1; 253; 1906; 1905; NA; NA
    SAMPLE1; 253; 1907; 1906; NA; NA
    SAMPLE1; 253; 1908; 1907; NA; NA
    SAMPLE1; 253; 1909; 1908; NA; NA
    SAMPLE1; 253; 1910; 1909; NA; 63
    SAMPLE1; 253; 1911; 1910; -16; 47
    SAMPLE1; 253; 1912; 1911; -11; 36
    SAMPLE1; 253; 1913; 1912; -11; 36
    SAMPLE1; 253; 1914; 1913; -8; 28
    SAMPLE2; 253; 1991; 1991; 0; 0
    SAMPLE2; 253; 1992; 1991; -19; -19
    SAMPLE2; 253; 1993; 1992; -58; -77
    SAMPLE2; 253; 1994; 1993; -58; -135
    SAMPLE2; 253; 1995; 1994; -40; -175
    SAMPLE2; 253; 1996; 1995; -21; -196
    SAMPLE2; 253; 1997; 1996; -50; -246
    SAMPLE2; 253; 1998; 1997; -60; -306
    SAMPLE2; 253; 1999; 1998; NA; NA
    SAMPLE2; 253; 2000; 1999; NA; NA
    SAMPLE2; 253; 2001; 2000; NA; NA
    SAMPLE2; 253; 2002; 2001; NA; NA
    SAMPLE2; 253; 2003; 2002; NA; NA
    SAMPLE2; 253; 2004; 2003; NA; -306
    SAMPLE2; 253; 2005; 2004; -34; -340
    SAMPLE2; 253; 2006; 2005; -23; -363
    SAMPLE2; 253; 2007; 2006; -19; -382
    SAMPLE2; 253; 2008; 2007; -29; -411
    SAMPLE2; 253; 2009; 2008; -89; -500
    SAMPLE2; 253; 2010; 2009; -14; -514
    SAMPLE2; 253; 2011; 2010; -14; -514
    SAMPLE2; 253; 2012; 2011; -14; -514
    SAMPLE2; 253; 2013; 2012; -14; -514
    SAMPLE2; 253; 2014; 2013; -16; -530

Help with this rather complicated case would be very much appreciated! Thank you!

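For context on why the gaps need special handling: base R's cumsum() propagates NA, so once a gap year holds NA, every later running total becomes NA as well. The following is only a minimal sketch with made-up numbers; the vector `v` is hypothetical and not taken from input.csv.

```r
# Hypothetical values: two measurements, a two-year gap, one more measurement
v <- c(14, -10, NA, NA, 4)

# cumsum() propagates NA from the first gap year onwards
naive <- cumsum(v)

# Counting gap years as 0 keeps the running total alive across the gap
filled <- cumsum(ifelse(is.na(v), 0, v))

print(naive)
print(filled)
```

This is why any approach has to compute the running total on zero-filled values first and only afterwards decide which positions should show NA in the plot.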
Solution

BIG EDIT: Posted code, added correct library calls

    library(dplyr)
    df = read.csv("input.csv", sep=";", stringsAsFactors=FALSE)

    # find min/max year for each SAMPLE
    df_minmax = df %>%
      group_by(NAME) %>%
      summarise(min_year = min(SURVEY_YEAR), max_year = max(SURVEY_YEAR))

    # create an empty dataframe with what we want
    df2 = data.frame(NAME = "", ID = 0,
                     SURVEY_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR),
                     REFERENCE_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR) - 1,
                     VALUE = NA, stringsAsFactors=FALSE)

    # fill in the NAMES dataframe - there's probably a better way to do this
    for(i in 1:nrow(df_minmax)) {
      min_year = df_minmax[i, ]$min_year
      max_year = df_minmax[i, ]$max_year
      df2[df2$SURVEY_YEAR %in% min_year:max_year, ]$NAME = df_minmax[i, ]$NAME
    }

    # fill in the values
    # this line is a bit dangerous -- it relies on the fact that df and df2 have the same relative ordering
    # don't change the ordering of df and df2 before this line.
    df2[df2$SURVEY_YEAR %in% df$SURVEY_YEAR, ]$VALUE = df$VALUE

    # in this example there is a long period between sample1 and sample2; we can filter those out
    df2 = df2 %>% filter(NAME != "")

    # Now we can do all the cumulative stuff
    # for purposes of cumulative sums, set NA to 0
    temp = df2$VALUE
    df2[is.na(df2)] = 0
    df2 = df2 %>% group_by(NAME) %>% mutate(csum = cumsum(VALUE))

    # get back the NA values -- in case the NA values are useful to you
    df2$VALUE = temp

Here are the first 20 rows of df2:

          NAME ID SURVEY_YEAR REFERENCE_YEAR VALUE csum
    1  SAMPLE1  0        1880           1879    14   14
    2  SAMPLE1  0        1881           1880   -10    4
    3  SAMPLE1  0        1882           1881     4    8
    4  SAMPLE1  0        1883           1882    10   18
    5  SAMPLE1  0        1884           1883    10   28
    6  SAMPLE1  0        1885           1884    12   40
    7  SAMPLE1  0        1886           1885    NA   40
    8  SAMPLE1  0        1887           1886    NA   40
    9  SAMPLE1  0        1888           1887    NA   40
    10 SAMPLE1  0        1889           1888    11   51
    11 SAMPLE1  0        1890           1889    12   63
    12 SAMPLE1  0        1891           1890    NA   63
    13 SAMPLE1  0        1892           1891    NA   63
    14 SAMPLE1  0        1893           1892    NA   63
    15 SAMPLE1  0        1894           1893    NA   63
    16 SAMPLE1  0        1895           1894    NA   63
    17 SAMPLE1  0        1896           1895    NA   63
    18 SAMPLE1  0        1897           1896    NA   63
    19 SAMPLE1  0        1898           1897    NA   63
    20 SAMPLE1  0        1899           1898    NA   63

Here's the outline of the steps I did above as a quick summary:

1. Find the min/max year for each group in NAME.
2. Create an empty dataframe that has the total range of all the years we want.
3. Fill in the NAMES in the correct place in the new empty dataframe.
4. Fill in the VALUES in the correct place in the new empty dataframe.
5. Set NAs to 0 for purposes of cumulative sums.
6. Find cumulative sums by group.
7. Replace the 0s with NAs again.

It's a bit hackish with the for loop. I'm hoping no one strings me up for it.
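
As a possible alternative to the for loop, the same steps can be sketched per group in base R with split() and merge(). This is only a sketch under assumptions, not the answer's method: the helper name `fill_gaps` and the shortened SAMPLE1 data are hypothetical, and the sketch does not handle the multi-year periods between REFERENCE_YEAR and SURVEY_YEAR described in the question. Unlike the csum column above, it also blanks the running total inside a gap except in the last gap year, which is the pattern the question's expected CUMSUM column shows.

```r
# Hypothetical, shortened subset of the question's data (SAMPLE1 only)
df <- data.frame(
  NAME = "SAMPLE1",
  SURVEY_YEAR = c(1880, 1881, 1885, 1889, 1890),
  VALUE = c(14, -10, 12, 11, 12),
  stringsAsFactors = FALSE
)

# fill_gaps is a hypothetical helper name, applied to one group at a time
fill_gaps <- function(d) {
  # One row per year between the first and last survey of this group
  full <- data.frame(
    NAME = d$NAME[1],
    SURVEY_YEAR = seq(min(d$SURVEY_YEAR), max(d$SURVEY_YEAR)),
    stringsAsFactors = FALSE
  )
  # Left join: years without a measurement get VALUE = NA
  out <- merge(full, d, by = c("NAME", "SURVEY_YEAR"), all.x = TRUE)
  out <- out[order(out$SURVEY_YEAR), ]
  # Running total with gap years counted as 0
  out$csum <- cumsum(ifelse(is.na(out$VALUE), 0, out$VALUE))
  # Blank the total inside a gap, except in the last gap year, so that
  # a type = "b" plot breaks the line and restarts from the carried value
  gap <- is.na(out$VALUE)
  next_gap <- c(gap[-1], FALSE)
  out$csum[gap & next_gap] <- NA
  out
}

res <- do.call(rbind, lapply(split(df, df$NAME), fill_gaps))
rownames(res) <- NULL
print(res)
```

Plotting one group with plot(res$SURVEY_YEAR, res$csum, type = "b") should then leave the gap years unconnected, because plot() does not draw line segments to or from NA values.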