Problem description
I am studying text mining with the online book http://tidytextmining.com/. In the fifth chapter (http://tidytextmining.com/dtm.html#financial), the following code:
library(tm.plugin.webmining)
library(purrr)
library(dplyr)  # provides data_frame(), mutate() and %>%

company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
             "Twitter", "IBM", "Yahoo", "Netflix")
symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB", "TWTR", "IBM", "YHOO", "NFLX")

# Download the Google Finance articles for one ticker symbol
download_articles <- function(symbol) {
  WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
}

# One row per company, with the downloaded corpus stored in a list column
stock_articles <- data_frame(company = company,
                             symbol = symbol) %>%
  mutate(corpus = map(symbol, download_articles))
gives me this error:
StartTag: invalid element name
Extra content at the end of the document
Error: 1: StartTag: invalid element name
2: Extra content at the end of the document
Any hints? Someone suggested removing the company and symbol for "Twitter", but it still doesn't work and returns the same error. Many thanks in advance.
Recommended answer
The problem is that the tm.plugin.webmining package is out of date.
Only YahooFinanceSource and YahooNewsSource are still alive at the time of this reply.
Here is a quick reference and test.
According to the vignette written by the package author, there should be 8 possible source sites:
- GoogleBlogSearchSource
- GoogleFinanceSource
- GoogleNewsSource
- NYTimesSource
- ReutersNewsSource
- YahooFinanceSource
- YahooInplaySource
- YahooNewsSource
But according to the GitHub page, the first one, "GoogleBlogSearchSource", has already been discontinued. For the remaining 7 sources, I ran a simple test to see whether they work:
library(tm)
library(tm.plugin.webmining)
googlefinance <- WebCorpus(GoogleFinanceSource("A"))
googlenews <- WebCorpus(GoogleNewsSource("A"))
nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
reutersnews <- WebCorpus(ReutersNewsSource("A"))
yahoofinance <- WebCorpus(YahooFinanceSource("A"))
yahooinplay <- WebCorpus(YahooInplaySource())
yahoonews <- WebCorpus(YahooNewsSource("M"))
The results show that all of the Yahoo sources are technically still running, but YahooInplaySource returns 0 documents no matter what parameter I choose.
> googlefinance <- WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlefinance <- WebCorpus(GoogleFinanceSource("A"))
StartTag: invalid element name
Extra content at the end of the document
Error in inherits(x, "WebSource") : 1: StartTag: invalid element name
2: Extra content at the end of the document
> googlenews <- WebCorpus(GoogleNewsSource("A"))
Unknown IO errorfailed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
Error in inherits(x, "WebSource") :
1: Unknown IO error2: failed to load external entity "http://news.google.com/news?hl=en&q=A&ie=utf-8&num=100&output=rss"
> nytimes <- WebCorpus(NYTimesSource("A", appid = nytimes_appid))
Error in inherits(x, "WebSource") : object 'nytimes_appid' not found
> reutersnews <- WebCorpus(ReutersNewsSource("A"))
Entity 'ldquo' not defined
Entity 'rdquo' not defined
Opening and ending tag mismatch: div line 60 and body
Opening and ending tag mismatch: body line 59 and html
Premature end of data in tag html line 1
Error in inherits(x, "WebSource") : 1: Entity 'ldquo' not defined
2: Entity 'rdquo' not defined
3: Opening and ending tag mismatch: div line 60 and body
4: Opening and ending tag mismatch: body line 59 and html
5: Premature end of data in tag html line 1
> yahoofinance <- WebCorpus(YahooFinanceSource("A"))
> yahoofinance
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 16
> yahooinplay <- WebCorpus(YahooInplaySource())
> yahooinplay
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("A"))
> yahoonews
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 0
> yahoonews <- WebCorpus(YahooNewsSource("M"))
> yahoonews
<<WebCorpus>>
Metadata: corpus specific: 3, document level (indexed): 0
Content: documents: 10
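The manual checks above could also be wrapped in tryCatch so that one broken source does not stop the rest of the test. The following is only a minimal sketch: probe_sources, demo_probes and demo_report are hypothetical names that do not come from the package, and the demo uses stand-in functions so that it runs without network access or the package installed. It assumes that length() on a WebCorpus gives the number of documents, as it does for other tm corpora.

```r
# Probe a set of corpus constructors and record which ones still work.
# `probes` is a named list of zero-argument functions (hypothetical helper);
# each one should return a corpus-like object or fail with an error.
probe_sources <- function(probes) {
  results <- lapply(names(probes), function(name) {
    outcome <- tryCatch(
      list(ok = TRUE,
           n_docs = length(probes[[name]]()),
           message = NA_character_),
      error = function(e) {
        list(ok = FALSE,
             n_docs = NA_integer_,
             message = conditionMessage(e))
      }
    )
    data.frame(source = name, ok = outcome$ok, n_docs = outcome$n_docs,
               message = outcome$message, stringsAsFactors = FALSE)
  })
  do.call(rbind, results)
}

# Stand-in probes (assumptions for illustration, not real sources).
demo_probes <- list(
  working = function() list("doc1", "doc2"),
  empty   = function() list(),
  broken  = function() stop("StartTag: invalid element name")
)
demo_report <- probe_sources(demo_probes)
print(demo_report)

# With tm and tm.plugin.webmining loaded, real probes might look like:
# probes <- list(
#   googlefinance = function() WebCorpus(GoogleFinanceSource("A")),
#   yahoofinance  = function() WebCorpus(YahooFinanceSource("A")),
#   yahoonews     = function() WebCorpus(YahooNewsSource("M"))
# )
# probe_sources(probes)
```

A source that raises an error shows up with ok = FALSE and its error message, while a source that runs but returns nothing shows up with n_docs = 0, which matches the two failure modes seen in the output above.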
It is also worth mentioning that even though YahooFinanceSource is working, it does not return content similar to what GoogleFinanceSource was supposed to return. If you want to play with the book's examples, I think you can use YahooNewsSource with a customized list of queries.
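As a rough illustration of the "customized list of queries" idea, the sketch below downloads one corpus per query and drops any query that fails or comes back empty. The names download_all, queries and fake_fetch are hypothetical, the query strings are only examples, and the demo uses a stand-in fetcher so that it runs offline. The commented call at the end shows how YahooNewsSource might be plugged in, assuming it is still reachable.

```r
# Download one corpus per query. A failing source yields NULL instead of
# aborting the whole run. `fetch` is any function mapping a query string
# to a corpus-like object (hypothetical helper, not part of the package).
download_all <- function(queries, fetch) {
  corpora <- lapply(queries, function(q) {
    tryCatch(fetch(q), error = function(e) NULL)
  })
  # Keep only queries that succeeded and returned at least one document.
  Filter(function(corpus) !is.null(corpus) && length(corpus) > 0, corpora)
}

# Example query list (an assumption): search terms named by company.
queries <- c(Microsoft = "Microsoft", Apple = "Apple", IBM = "IBM")

# Stand-in fetcher for an offline demo: it fails for one query and
# returns an empty result for another.
fake_fetch <- function(q) {
  if (q == "Apple") stop("feed unavailable")
  if (q == "IBM") return(list())
  list(paste(q, "article 1"), paste(q, "article 2"))
}
demo_corpora <- download_all(queries, fake_fetch)
print(names(demo_corpora))

# With tm.plugin.webmining loaded, the real call might be:
# corpora <- download_all(queries, function(q) WebCorpus(YahooNewsSource(q)))
```

Filtering out empty results matters here because, as the test output shows, YahooNewsSource can return 0 documents for some queries without raising an error.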