






I am crawling news websites and want to extract the News Title, the News Abstract (first paragraph), etc.

I plugged into the WebKit parser code so that I can easily navigate a webpage as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they have in common. This gives me the content minus the common navigation content, etc.
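For reference, the diff step described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions, not the code actually used: it works on plain-text pages that have already had their tags stripped, it compares line by line with the standard library's `difflib`, and the function name `unique_lines` and the two sample pages are made up for the example.

```python
import difflib


def unique_lines(page, other):
    """Return the lines of `page` that the diff does not match in `other`.

    Both arguments are plain-text pages from the same site (assumption:
    tags were already removed and one line holds roughly one block of text).
    """
    a = [ln.strip() for ln in page.splitlines() if ln.strip()]
    b = [ln.strip() for ln in other.splitlines() if ln.strip()]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    kept = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "equal" spans are shared boilerplate; "insert" spans exist only in `other`.
        if tag in ("replace", "delete"):
            kept.extend(a[i1:i2])
    return kept


# Hypothetical sample pages sharing a menu and a footer.
page_a = (
    "Home\nSports\nPolitics\n"
    "Mayor opens new bridge\nThe bridge opened on Monday.\n"
    "Copyright Example News"
)
page_b = (
    "Home\nSports\nPolitics\n"
    "Local team wins cup\nThe final ended 2-1.\n"
    "Copyright Example News"
)
article_lines = unique_lines(page_a, page_b)
```

A pairwise diff like this only removes text that is identical in both pages, so junk that varies from article to article (related-story links, for example) would survive it.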

Despite the above approach, I am still getting quite a lot of junk in my final text, which results in an incorrect News Abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in

Can you

  1. Suggest an alternative strategy for extraction of pure content,

  2. Would/Can learning Natural Language Processing help in extracting the correct abstract from these articles?

  3. How would you approach the above problem?

  4. Are there any research papers on the same problem?


Ankur Gupta


For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.

For question (2), automatic creation of abstracts is not a developed field. It is usually referred to as 'sentence selection', because the typical approach right now is to just select entire sentences.

For question (3), the basic way to create abstracts with machine learning would be to:

  1. Create a corpus of existing abstracts
  2. Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
  3. Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.

My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).
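To make step (3) a little more concrete, here is a minimal sketch in Python. It assumes a naive Bayes classifier over bag-of-words features, which is just one of many possible choices, picked here because it fits in a few lines. The class name, the tokenizer, and the four training sentences are all made up for illustration; a real corpus would need far more annotated sentences and probably richer features, such as sentence position.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


class NaiveBayesSentenceClassifier:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()            # label -> number of sentences
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, labelled_sentences):
        for sentence, label in labelled_sentences:
            self.class_counts[label] += 1
            for word in tokenize(sentence):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(sentence):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical annotated corpus: True means "sentence was chosen for the abstract".
training = [
    ("The mayor opened the new bridge on Monday.", True),
    ("The council approved the budget on Tuesday.", True),
    ("Click here to subscribe to our newsletter.", False),
    ("Follow us on social media for updates.", False),
]
clf = NaiveBayesSentenceClassifier()
clf.train(training)
keep = clf.classify("The council opened the bridge.")
```

The sentences of a new article would each be passed to `classify`, and those labelled `True` would be kept as abstract candidates.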

For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.


09-05 13:07