




I am trying to use the xml2 package to read many podcast feeds. I want to calculate the 75th percentile of episode duration for each podcast in a series, along with many similar metrics (e.g. frequency of episodes). I use data.table a lot and want to carry on using it. Every time I call the read_xml function on the URLs in a column I get this error:

Error: `x` must be a string of length 1


I can get it to work if I process just one row but that defeats the purpose.


Here is a simple example. This is the list of just my statistics podcasts; in real life I subscribe to more than 100 podcasts across many fields.

library(xml2)
library(data.table)
# One <outline> node per subscribed feed; keep its title and feed URL
statml.opml <- read_xml(x = "https://player.fm/farrelbuch/statistics-ml.opml")
statml.items <- xml_find_all(x = statml.opml, "/opml/body/outline")
statml.dt <- data.table(podcast = xml_attr(statml.items, "text"), url = xml_attr(statml.items, "xmlUrl"))

I start by reading the OPML file that my podcast aggregator provides (thank you, player.fm). Then I get a listing of each feed, and by looking at the structure I can see what I need to extract from each one. I end up with a data.table that has the name of each podcast and its URL.

statml.dt[1, url]
pod1 <- read_xml(x = "https://podcasts.files.bbci.co.uk/p02nrss1.rss")


xml_find_all(x = pod1, "/rss/channel/item/itunes:duration")
xml_text(xml_find_all(x = pod1, "/rss/channel/item/itunes:duration"))

list(xml_text(xml_find_all(x = read_xml(x = "https://podcasts.files.bbci.co.uk/p02nrss1.rss"), "/rss/channel/item/itunes:duration")))


So I can easily display a single URL and read the XML at that URL. xml_find_all gets all the nodes tagged itunes:duration, and xml_text extracts the duration text and drops the tags. Wrapping the result in a list should make it possible to store it in a data.table column.


Look what happens when I try these simple lines of code to add a column by reference using :=. Everything works if I set i = 1 (in other words, I operate on the first row only). But if I leave i blank so that it operates on all the rows, or even if I set i to 1:2, the operation fails with the error that `x` must be a string of length 1.

statml.dt[,times:=list(xml_text(xml_find_all(x = read_xml(url), "/rss/channel/item/itunes:duration")))]
statml.dt[1,times:=list(xml_text(xml_find_all(x = read_xml(url), "/rss/channel/item/itunes:duration")))]


How do I get a function to work on every row of a data.table when it expects a single value rather than a column of values?


Vectorize(somefunc) wraps a non-vectorized function somefunc, one that accepts only a length-1 value for an argument, in a function that accepts a vector and calls somefunc once per element.

Vectorize(somefunc) returns a function, which you then call. You can either vectorize the function ahead of time or wrap it inline.

library(data.table)
func1 <- function(x) { stopifnot(length(x) == 1L); 2*x; }

data.table(a=1:2)[, b := func1(a) ]
# Error in func1(a) : length(x) == 1L is not TRUE

data.table(a=1:2)[, b := Vectorize(func1)(a) ][]
#    a b
# 1: 1 2
# 2: 2 4

func1_n <- Vectorize(func1)
data.table(a=1:2)[, b := func1_n(a) ][]
#    a b
# 1: 1 2
# 2: 2 4
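
One caveat, relevant when the wrapped function returns more than one value per input (as a feed reader would): Vectorize is built on mapply, and by default it simplifies equal-length results into a matrix rather than a list. A small sketch, where first_n is a made-up stand-in function rather than anything from the question:

```r
# Hypothetical non-vectorized function that returns several values per input
first_n <- function(n) { stopifnot(length(n) == 1L); seq_len(n) }

# Default SIMPLIFY = TRUE: two results of equal length are bound into a matrix
m <- Vectorize(first_n)(c(2, 2))

# SIMPLIFY = FALSE: the results stay a list, one element per input
l <- Vectorize(first_n, SIMPLIFY = FALSE)(c(2, 2))

is.matrix(m)
is.list(l)
```

Passing SIMPLIFY = FALSE gives a list with one element per row, which is the shape a list column needs.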

data.table(a=1:2)[, b := lapply(a, func1)][]
#    a b
# 1: 1 2
# 2: 2 4
str(data.table(a=1:2)[, b := lapply(a, func1)])
# Classes ‘data.table’ and 'data.frame':    2 obs. of  2 variables:
#  $ a: int  1 2
#  $ b:List of 2
#   ..$ : num 2
#   ..$ : num 4
#  - attr(*, ".internal.selfref")=<externalptr>
str(data.table(a=1:2)[, b := sapply(a, func1)])
# Classes ‘data.table’ and 'data.frame':    2 obs. of  2 variables:
#  $ a: int  1 2
#  $ b: num  2 4
#  - attr(*, ".internal.selfref")=<externalptr>

Note that the lapply version looks like it generates a "simple column", but lapply always returns a list; it just happens to print the way one would expect. If you know that your function will always return a "scalar" (which in R is actually a vector of length 1), then you can use sapply, or vapply(a, func1, numeric(1)), which also checks the type and length of each result.
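
Applied to the question, a minimal sketch might look like the following. It assumes each feed declares the itunes namespace and that durations arrive as HH:MM:SS, MM:SS, or plain seconds. The inline XML strings, the feeds table, and the helpers make_feed, get_durations and to_seconds are all made up for illustration so that the example runs offline; with the real table you would pass the url column instead, since read_xml accepts a URL as well as literal XML.

```r
library(xml2)
library(data.table)

# Hypothetical helper: build a tiny RSS document as a literal XML string
make_feed <- function(durations) {
  ns <- 'xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"'
  items <- paste0("<item><itunes:duration>", durations,
                  "</itunes:duration></item>", collapse = "")
  paste0("<rss ", ns, "><channel>", items, "</channel></rss>")
}

# Stand-in for statml.dt; the xml column plays the role of url
feeds <- data.table(
  podcast = c("A", "B"),
  xml = c(make_feed(c("00:10:00", "00:20:00", "00:30:00")),
          make_feed(c("05:30", "90")))
)

# Takes ONE source (length-1 string) and returns all its duration strings
get_durations <- function(src) {
  xml_text(xml_find_all(read_xml(src), "/rss/channel/item/itunes:duration"))
}

# Convert "HH:MM:SS", "MM:SS" or "SS" strings to seconds
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(parts) {
    parts <- as.numeric(parts)
    sum(parts * 60^rev(seq_along(parts) - 1))
  }, numeric(1))
}

# lapply calls get_durations once per row, giving a list column
feeds[, times := lapply(xml, get_durations)]

# One number per row: the 75th percentile of duration in seconds
feeds[, p75_sec := vapply(times, function(d) {
  quantile(to_seconds(d), 0.75, names = FALSE)
}, numeric(1))]
```

With more than 100 real feeds, wrapping the body of get_durations in tryCatch is probably worthwhile, so that one unreachable or malformed feed does not abort the whole column.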

