R: dynamic splitting/subsetting of a data frame by selected row numbers - parsing a Praat TextGrid

Problem description
I am trying to process a "segmentation file" called a .TextGrid (generated by the Praat program). The original format looks like this:

    File type = "ooTextFile"
    Object class = "TextGrid"

    xmin = 0
    xmax = 243.761375
    tiers? <exists>
    size = 17
    item []:
        item [1]:
            class = "IntervalTier"
            name = "phones"
            xmin = 0
            xmax = 243.761
            intervals: size = 2505
            intervals [1]:
                xmin = 0
                xmax = 0.4274939687384032
                text = "_"
            intervals [2]:
                xmin = 0.4274939687384032
                xmax = 0.472
                text = "v"
            intervals [3]:
    [...]

(This is then repeated to EOF, with intervals [3 to n] for each of the n items (layers of annotation) in a file.)

Somebody proposed a solution using the rPython R package.
Unfortunately, I don't have a good knowledge of Python, and rPython is not available for R 3.0.2 (which I am using). My aim is to develop this parser for my analysis exclusively within the R environment.

Right now my aim is to segment this file into multiple data frames. Each data frame should contain one item (layer of annotation).

    library(stringr)

    # Load the data
    txtgrid <- read.delim("./xxx_01_xx.textgrid", sep=c("=","\n"), dec=".", header=FALSE)

    # Erase white space (uses the stringr package)
    txtgrid[,1] <- str_trim(txtgrid[,1])

    # Convert row.names to numeric
    num.row <- as.numeric(row.names(txtgrid))

    # Redefine the original textgrid and add those row numbers
    # (I want to keep them for later processing)
    txtgrid <- data.frame(num.row, txtgrid)
    colnames(txtgrid) <- c("num.row", "object", "value")
    head(txtgrid)

The output of head(txtgrid) is very raw, so here are the first 20 lines of the textgrid, txtgrid[1:20,]:

       num.row          object               value
    1        1       File type          ooTextFile
    2        2    Object class            TextGrid
    3        3            xmin                   0
    4        4            xmax          243.761375
    5        5          tiers?            <exists>
    6        6            size                  17
    7        7        item []:
    8        8       item [1]:
    9        9           class        IntervalTier
    10      10            name              phones
    11      11            xmin                   0
    12      12            xmax             243.761
    13      13 intervals: size                2505
    14      14  intervals [1]:
    15      15            xmin                   0
    16      16            xmax  0.4274939687384032
    17      17            text                   _
    18      18  intervals [2]:
    19      19            xmin  0.4274939687384032
    20      20            xmax               0.472

Now that I have pre-processed it, I can:

    # Find the rows where I want to split (i.e. the items)
    tier.begining <- txtgrid[grep("item", txtgrid$object, perl=TRUE), ]
    # And save those row numbers in a variable
    x <- as.numeric(row.names(tier.begining))

This variable x gives me the row numbers at which my data should be split into several data frames. It has 18 entries, i.e. 17 actual items, because the first match is item [], which contains all the other items.
So vector x is:

    > x
     [1]     7     8 10034 14624 19214 22444 25674 28904 31910 35140 38146 38156 38566 39040 39778 40222 44800
    [18] 45018

How can I tell R to segment this data frame into multiple data frames textgrids$nameoftheItem, so that I get as many data frames as I have items? For example:

    textgrid$phones
        item [1]:
            class = "IntervalTier"
            name = "phones"
            xmin = 0
            xmax = 243.761
            intervals: size = 2505
            intervals [1]:
                xmin = 0
                xmax = 0.4274939687384032
                text = "_"
            intervals [2]:
                xmin = 0.4274939687384032
                xmax = 0.472
                text = "v"
            [...]
            intervals [n]:

    textgrid$syllable
        item [2]:
            class = "IntervalTier"
            name = "syllable"
            xmin = 0
            xmax = 243.761
            intervals: size = 1200
            intervals [1]:
                xmin = 0
                xmax = 0.500
                text = "ve"
            intervals [2]:
            [...]
            intervals [n]:

    textgrid$item[n]

I wanted to use

    txtgrid.new <- split(txtgrid, f=x)

But I get this message:

    Warning message:
    In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
      data length is not a multiple of split variable

I don't get the desired output; it seems that the row numbers don't follow each other and that the file is all mixed up.

I have also tried which, daply (from plyr) and subset, but never got them to work properly.

I welcome any ideas for structuring this data properly and efficiently. Ideally I should be able to link items (layers of annotation) to each other (xmin and xmax of different layers), as well as across multiple TextGrid files; this is just the beginning.

Solution

The length of the split vector should be equal to the number of rows in the data.frame.
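To illustrate the point about lengths, here is a minimal sketch on a toy data frame. The data and the names toy, starts and group are made up for illustration and are not the asker's file. It shows that a vector of boundary positions (like x in the question) has one entry per tier, whereas split() needs one group id per row; findInterval() is one possible way to convert the former into the latter.

```r
# Toy stand-in for the parsed TextGrid (hypothetical data)
toy <- data.frame(
  object = c("item [1]:", "class", "name",
             "item [2]:", "class", "name", "xmin"),
  value  = c("", "IntervalTier", "phones",
             "", "IntervalTier", "syllable", "0"),
  stringsAsFactors = FALSE
)

# Row positions where a tier starts: one entry per tier (here 1 and 4).
# Passing this directly as f to split() would recycle it across the rows.
starts <- grep("item", toy$object)

# One group id per row: for each row, the index of the last start at or before it
group <- findInterval(seq_len(nrow(toy)), starts)

# Now f has exactly nrow(toy) entries, so the split follows the tier boundaries
pieces <- split(toy, group)
```

With this toy data, pieces holds two data frames of 3 and 4 rows. Rows that precede the first start would get group 0, so a file header would end up in its own piece.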
Try the following:

    # Drop everything up to and including the first "item" row (the container line "item []:")
    txtgrid.sub <- txtgrid[-(1:grep("item", txtgrid$object)[1]), ]
    grep("item", txtgrid.sub$object)[-1]
    # Repeat each tier index once per row of that tier, giving one group id per row
    splits <- unlist(mapply(rep, seq_along(grep("item", txtgrid.sub$object)), diff(c(grep("item", txtgrid.sub$object), nrow(txtgrid.sub) + 1))))
    df.list <- split(txtgrid.sub, list(splits))

EDIT: You could then simplify the data by doing something like this:

    # Turn each tier into a single row: column 3 (value) becomes the data,
    # column 2 (object) becomes the column names
    l <- lapply(df.list, function(x) {
        tmp <- as.data.frame(t(x[, 3, drop=FALSE]), stringsAsFactors=FALSE)
        names(tmp) <- make.unique(make.names(x[, 2]))
        tmp
    })
    library(plyr)
    do.call(rbind.fill, l)

      item..1..        class     name xmin    xmax intervals..size
    1      <NA> IntervalTier   phones    0 243.761            2505
    2      <NA> IntervalTier syllable    0 243.761            2505
      intervals..1.. xmin.1             xmax.1 text intervals..2..
    1           <NA>      0 0.4274939687384032    _           <NA>
    2           <NA>      0 0.4274939687384032    _           <NA>
                  xmin.2 xmax.2
    1 0.4274939687384032  0.472
    2               <NA>   <NA>

NB: I've used dummy data for the above.
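Since the question asks for pieces addressable as textgrid$phones, textgrid$syllable and so on, the following is a possible variant, not the answer's own code. It swaps the mapply/rep construction for a running count with cumsum() and then names each list element after the value on its "name" row. The miniature data and the names is_tier, tier_id, body and tiers are assumptions for illustration, using the object/value layout from the question.

```r
# Hypothetical miniature of the parsed file, in the question's object/value layout
txtgrid <- data.frame(
  object = c("File type", "Object class", "item []:",
             "item [1]:", "class", "name", "intervals [1]:", "xmin", "xmax", "text",
             "item [2]:", "class", "name", "intervals [1]:", "xmin", "xmax", "text"),
  value  = c("ooTextFile", "TextGrid", "",
             "", "IntervalTier", "phones", "", "0", "0.43", "_",
             "", "IntervalTier", "syllable", "", "0", "0.5", "ve"),
  stringsAsFactors = FALSE
)

# A tier header looks like "item [1]:"; the container line "item []:" has no digit
is_tier <- grepl("^item \\[[0-9]+\\]", txtgrid$object)

# Running count of tier headers = tier id for every row (0 = file header rows)
tier_id <- cumsum(is_tier)

# Drop the file header rows, then split the rest by tier id
body  <- txtgrid[tier_id > 0, ]
tiers <- split(body, tier_id[tier_id > 0])

# Name each piece after the value found on its "name" row
names(tiers) <- vapply(
  tiers,
  function(d) d$value[d$object == "name"][1],
  character(1)
)
```

After this, tiers$phones and tiers$syllable are ordinary data frames. The sketch assumes every tier has a "name" row and that tier names are unique; duplicate names would make access with $ ambiguous.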