Problem description
I would like to display the parse (POS tagging) from openNLP as a tree visualization. Below I provide the parse tree from openNLP, but I cannot plot it as the kind of visual tree that is common for parses in Python.
install.packages(
"http://datacube.wu.ac.at/src/contrib/openNLPmodels.en_1.5-1.tar.gz",
repos=NULL,
type="source"
)
library(NLP)
library(openNLP)
x <- 'Scroll bar does not work the best either.'
s <- as.String(x)
## Annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
parse_annotator <- Parse_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
p <- parse_annotator(s, a2)
ptext <- sapply(p$features, `[[`, "parse")
ptext
Tree_parse(ptext)
## > ptext
## [1] "(TOP (S (NP (NNP Scroll) (NN bar)) (VP (VBZ does) (RB not) (VP (VB work) (NP (DT the) (JJS best)) (ADVP (RB either))))(. .)))"
## > Tree_parse(ptext)
## (TOP
## (S
## (NP (NNP Scroll) (NN bar))
## (VP (VBZ does) (RB not) (VP (VB work) (NP (DT the) (JJS best)) (ADVP (RB either))))
## (. .)))
The tree structure should look similar to this: [image not shown]
Is there a way to display this tree visualization?
I found this related tree viz question for plotting numeric expressions, which may be of use, but I could not generalize it to sentence-parse visualization.
Recommended answer
Here is an igraph version. This function takes the result from Parse_Annotator as its input, so ptext in your example. NLP::Tree_parse already creates a nice tree structure, so the idea here is to traverse it recursively and create an edgelist to plug into igraph. The edgelist is just a 2-column matrix of head->tail values.
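To make the edgelist format concrete, here is a minimal sketch that writes one by hand for the subtree (NP (NNP Scroll) (NN bar)). The node names with appended integers are illustrative; they mirror the unique-id convention used in the function further down. The plotting step is guarded, on the assumption that igraph may not be installed.

```r
# Hand-written edges for the subtree (NP (NNP Scroll) (NN bar)).
# Column 1 is the head (parent), column 2 is the tail (child).
# Integers are appended so that every node name is unique.
edges <- matrix(
  c("NP1",  "NNP2",
    "NNP2", "Scroll3",
    "NP1",  "NN4",
    "NN4",  "bar5"),
  ncol = 2, byrow = TRUE
)
print(edges)

# Build and plot the graph only if igraph is available.
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_edgelist(edges)
  plot(g, layout = igraph::layout_as_tree(g))
}
```

A tree with n nodes has n - 1 edges, which is why five node names give four rows here.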
In order for igraph to create edges between the proper nodes, the nodes need to have unique identifiers. I did this by appending a sequence of integers (using regmatches<-) to the words in the text prior to using Tree_parse.
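The id-appending step can be tried on its own with base R. This sketch uses a made-up bracketed string in which "DT", "the" and "NN" each occur twice, to show why the ids are needed: without them, repeated tags or words would collapse into a single graph node.

```r
# A made-up parse string with repeated tokens (DT, the, NN).
ptext <- "(NP (DT the) (NN cat) (CC and) (DT the) (NN dog))"

# Match every run of characters that is not a bracket or a space.
# This picks up the tags as well as the words.
ms <- gregexpr("[^() ]+", ptext)
tokens <- regmatches(ptext, ms)[[1]]

# Append a running integer to each token and write the result back in place.
tagged <- ptext
regmatches(tagged, ms) <- list(paste0(tokens, seq_along(tokens)))

tagged
# [1] "(NP1 (DT2 the3) (NN4 cat5) (CC6 and7) (DT8 the9) (NN10 dog11))"
```

The brackets and spaces are untouched, so the tagged string can still be handed to Tree_parse.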
The internal function edgemaker traverses the tree, filling in edgelist as it goes. There are options to color the leaves separately from the rest of the nodes, but if you pass the option vertex.label.color it will color them all the same.
## Make a graph from Tree_parse result
parse2graph <- function(ptext, leaf.color='chartreuse4', label.color='blue4',
                        title=NULL, cex.main=.9, ...) {
    stopifnot(require(NLP) && require(igraph))

    ## Replace tokens with unique versions
    ms <- gregexpr("[^() ]+", ptext)     # match everything except spaces and brackets
    words <- regmatches(ptext, ms)[[1]]  # the tokens (tags and words)
    regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words))))  # add id to tokens

    ## Going to construct an edgelist and pass that to igraph
    ## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
    edgelist <- matrix('', nrow=length(words)-2, ncol=2)

    ## Function to fill in edgelist in place
    edgemaker <- (function() {
        i <- 0                                       # row counter
        g <- function(node) {                        # the recursive function
            if (inherits(node, "Tree")) {            # only recurse subtrees
                if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
                    for (child in node$children) {
                        childval <- if(inherits(child, "Tree")) child$value else child
                        i <<- i+1
                        edgelist[i,1:2] <<- c(val, childval)
                    }
                }
                invisible(lapply(node$children, g))
            }
        }
    })()

    ## Create the edgelist from the parse tree
    edgemaker(Tree_parse(ptext))

    ## Make the graph, add options for coloring leaves separately
    g <- graph_from_edgelist(edgelist)
    vertex_attr(g, 'label.color') <- label.color  # non-leaf colors
    vertex_attr(g, 'label.color', V(g)[!degree(g, mode='out')]) <- leaf.color
    V(g)$label <- sub("\\d+$", '', V(g)$name)     # remove the trailing ids for labels
    plot(g, layout=layout_as_tree, ...)           # Reingold-Tilford tree layout
    if (!missing(title)) title(title, cex.main=cex.main)
}
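To see the traversal in isolation, without NLP or igraph, here is a sketch of the same closure pattern on a toy tree. The nested lists with value and children fields are stand-ins I am assuming for the Tree objects that Tree_parse returns, and the names toy, edges, walk and leaves are hypothetical. The last line finds the leaves from the edgelist alone: they are the nodes that occur as a tail but never as a head, which is the same set the function above selects with an out-degree of zero.

```r
# Toy stand-in for a parsed tree: subtrees are lists, leaves are strings.
toy <- list(value = "S1", children = list(
  list(value = "NP2", children = list("bar3")),
  list(value = "VP4", children = list("works5"))
))

# Five nodes, so four edges; preallocate the 2-column edgelist.
edges <- matrix("", nrow = 4, ncol = 2)

# The closure keeps a private row counter and fills `edges` with `<<-`.
walk <- (function() {
  i <- 0
  g <- function(node) {
    if (is.list(node)) {                       # only subtrees have children
      for (child in node$children) {
        childval <- if (is.list(child)) child$value else child
        i <<- i + 1
        edges[i, ] <<- c(node$value, childval)
      }
      invisible(lapply(node$children, g))      # then descend into each child
    }
  }
  g
})()

walk(toy)
edges

# Leaves appear as a tail (column 2) but never as a head (column 1).
leaves <- setdiff(edges[, 2], edges[, 1])
leaves
```

Each node records all of its child edges before the recursion descends, so the rows come out level by level within each subtree rather than in strict depth-first order. The order does not matter to igraph.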
So, using your example, the string x and its annotated version ptext look like this:
x <- 'Scroll bar does not work the best either.'
ptext
# [1] "(TOP (S (NP (NNP Scroll) (NN bar)) (VP (VBZ does) (RB not) (VP (VB work) (NP (DT the) (JJS best)) (ADVP (RB either))))(. .)))"
Create the graph by calling:
library(igraph)
library(NLP)
parse2graph(ptext, # plus optional graphing parameters
title = sprintf("'%s'", x), margin=-0.05,
vertex.color=NA, vertex.frame.color=NA,
vertex.label.font=2, vertex.label.cex=1.5, asp=0.5,
edge.width=1.5, edge.color='black', edge.arrow.size=0)