Problem description
What is the best tool that can do text simplification using Java?
Here is an example of text simplification:
John, who was the CEO of a company, played golf.
↓
John played golf. John was the CEO of a company.
Recommended answer
I see your problem as the task of converting a complex or compound sentence into simple sentences. Based on the literature on Sentence Types, a simple sentence is built from one independent clause, while a compound or complex sentence is built from at least two clauses. Each clause must have a subject and a verb.
So your task is to split the sentence into the clauses that form it.
Dependency parsing from Stanford CoreNLP is well suited to splitting compound and complex sentences into simple sentences. You can try the demo online.
From your sample sentence, we get the following parse result in Stanford typed dependency (SD) notation:
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
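If you only have the parse as text in the notation above, a small parser can turn each line into a structured value. This is a minimal sketch: the class TypedDep and its fields are hypothetical helpers written for illustration, not part of the CoreNLP API, and the regular expression assumes the exact relation(head-i, dependent-j) layout shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One typed dependency such as nsubj(CEO-6, John-1). Hypothetical helper, not a CoreNLP class. */
final class TypedDep {
    // Assumed layout: relation(head-index, dependent-index)
    private static final Pattern SD =
            Pattern.compile("(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    final String relation;
    final String head;
    final int headIndex;
    final String dependent;
    final int dependentIndex;

    TypedDep(String relation, String head, int headIndex,
             String dependent, int dependentIndex) {
        this.relation = relation;
        this.head = head;
        this.headIndex = headIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    /** Parses one line of SD notation, or throws if the line has another shape. */
    static TypedDep parse(String line) {
        Matcher m = SD.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not SD notation: " + line);
        }
        return new TypedDep(m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        TypedDep d = parse("prep_of(CEO-6, company-9)");
        System.out.println(d.relation + " " + d.head + "/" + d.headIndex
                + " " + d.dependent + "/" + d.dependentIndex);
    }
}
```

When you call CoreNLP from Java directly, you would normally read the dependencies from its own graph objects instead of parsing text; the sketch is only meant for output copied from the demo.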
A clause can be identified from any relation (in SD) whose category is subject, e.g. nsubj or nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses, i.e.
- John CEO
- John played
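The extraction of basic clauses can be sketched in a few lines of plain Java. This is an illustration under assumptions: the class BasicClauses and the {relation, head, dependent} array layout are made up for the example, and only the two subject relations named above are checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BasicClauses {
    // Subject relations named in the text; other SD subject relations are not covered here.
    private static final List<String> SUBJECT_RELATIONS = Arrays.asList("nsubj", "nsubjpass");

    /** Sample parse; each entry is {relation, head, dependent} with words written as word-index. */
    static final List<String[]> SAMPLE = Arrays.asList(new String[][] {
            {"nsubj", "CEO-6", "John-1"},
            {"nsubj", "played-11", "John-1"},
            {"cop", "CEO-6", "was-4"},
            {"det", "CEO-6", "the-5"},
            {"rcmod", "John-1", "CEO-6"},
            {"det", "company-9", "a-8"},
            {"prep_of", "CEO-6", "company-9"},
            {"root", "ROOT-0", "played-11"},
            {"dobj", "played-11", "golf-12"},
    });

    /** Returns one "subject head" string per subject relation, in input order. */
    static List<String> extract(List<String[]> deps) {
        List<String> clauses = new ArrayList<>();
        for (String[] d : deps) {
            if (SUBJECT_RELATIONS.contains(d[0])) {
                // dependent is the subject part, head is the verb (predicate) part
                clauses.add(word(d[2]) + " " + word(d[1]));
            }
        }
        return clauses;
    }

    /** Strips the trailing -index from a token such as John-1. */
    static String word(String token) {
        return token.substring(0, token.lastIndexOf('-'));
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE)); // [John CEO, John played]
    }
}
```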
After you get the basic clauses, you can add the remaining parts to make each clause a complete and meaningful sentence. To do so, please consult the Stanford Dependency Manual.
By the way, your question might be related to Finding meaningful sub-sentences from a sentence.
Once you have a subject-verb pair, e.g. nsubj(CEO-6, John-1), get all dependencies linked to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.
Using nsubj(CEO-6, John-1) as an example, if you start traversing from John-1, you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.
The next step is traversing from the CEO-6 part. You'll get
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
From the result above, you have new dependencies to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 as either head or dependent).
Now your dependencies are
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)
At this point, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from every head and dependent, then arrange them in ascending order by the number appended to each word. This number indicates the word's position in the original sentence.
John was the CEO a company
Our new sentence is missing one part, i.e. of. This part is hidden in prep_of(CEO-6, company-9). The Stanford Dependency Manual describes two kinds of SD, collapsed and non-collapsed. Please read about them to understand why this of is hidden and how to get the word order of this hidden part.
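As a rough illustration of recovering the hidden word, the sketch below looks up the preposition among the original tokens that lie between the head and the dependent. This is a heuristic assumed for the example, not the manual's method: the class CollapsedPrep and its method are hypothetical, it needs the tokenized sentence as extra input, and it does not handle multi-word prepositions or a preposition that sits outside that span.

```java
final class CollapsedPrep {
    /**
     * For a collapsed relation such as prep_of, returns the hidden word with its
     * 1-based sentence index (e.g. "of-7"), or null when nothing matches.
     */
    static String hiddenWord(String relation, int headIndex, int dependentIndex,
                             String[] tokens) {
        if (!relation.startsWith("prep_")) {
            return null; // not a collapsed preposition
        }
        String prep = relation.substring("prep_".length());
        int lo = Math.min(headIndex, dependentIndex);
        int hi = Math.max(headIndex, dependentIndex);
        // SD indices are 1-based, the token array is 0-based
        for (int i = lo + 1; i < hi; i++) {
            if (tokens[i - 1].equalsIgnoreCase(prep)) {
                return tokens[i - 1] + "-" + i;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] tokens = {"John", ",", "who", "was", "the", "CEO", "of", "a",
                "company", ",", "played", "golf", "."};
        System.out.println(hiddenWord("prep_of", 6, 9, tokens)); // of-7
    }
}
```

The alternative is to work from the non-collapsed representation, where the preposition stays a token of its own (here prep(CEO-6, of-7) and pobj(of-7, company-9)), so its index is available without any lookup.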
With the same approach, you'll get the second sentence
John played golf
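The traversal and ordering steps described above can be sketched in plain Java as follows. The class ClauseBuilder and the {relation, head, dependent} array layout are hypothetical, chosen for illustration. Skipping subject relations comes from the description above; additionally skipping rcmod and root is an assumption of this sketch, since otherwise the traversal for "John played" would walk through rcmod(John-1, CEO-6) into the relative clause, and root would pull in the artificial ROOT-0 token.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

final class ClauseBuilder {
    // Relations the traversal never follows. nsubj/nsubjpass are from the text;
    // rcmod and root are assumptions of this sketch (see the lead-in).
    private static final Set<String> BOUNDARY =
            new HashSet<>(Arrays.asList("nsubj", "nsubjpass", "rcmod", "root"));

    /** Sample parse; each entry is {relation, head, dependent} with words written as word-index. */
    static final List<String[]> SAMPLE = Arrays.asList(new String[][] {
            {"nsubj", "CEO-6", "John-1"},
            {"nsubj", "played-11", "John-1"},
            {"cop", "CEO-6", "was-4"},
            {"det", "CEO-6", "the-5"},
            {"rcmod", "John-1", "CEO-6"},
            {"det", "company-9", "a-8"},
            {"prep_of", "CEO-6", "company-9"},
            {"root", "ROOT-0", "played-11"},
            {"dobj", "played-11", "golf-12"},
    });

    /** Collects every word reachable from the subject-verb pair and joins them in sentence order. */
    static String build(List<String[]> deps, String subject, String head) {
        TreeMap<Integer, String> words = new TreeMap<>(); // index -> word, sorted by index
        Deque<String> queue = new ArrayDeque<>();
        visit(subject, words, queue);
        visit(head, words, queue);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String[] d : deps) {
                if (BOUNDARY.contains(d[0])) {
                    continue;
                }
                if (d[1].equals(current)) {
                    visit(d[2], words, queue);
                } else if (d[2].equals(current)) {
                    visit(d[1], words, queue);
                }
            }
        }
        return String.join(" ", words.values());
    }

    /** Records a token once and schedules it for traversal the first time it is seen. */
    private static void visit(String token, TreeMap<Integer, String> words, Deque<String> queue) {
        int dash = token.lastIndexOf('-');
        int index = Integer.parseInt(token.substring(dash + 1));
        if (words.putIfAbsent(index, token.substring(0, dash)) == null) {
            queue.add(token);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(SAMPLE, "John-1", "CEO-6"));     // John was the CEO a company
        System.out.println(build(SAMPLE, "John-1", "played-11")); // John played golf
    }
}
```

The first result still lacks "of" because the sample uses the collapsed relation prep_of, which carries the preposition in the relation name rather than as a word with an index.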