


For a homework assignment I have to write a program that scrapes HTML from a website and then somehow finds phrases within it. By phrases I mean some arbitrary way of organizing text so that words in close proximity to each other are put in the same group. I know this sounds really unclear, but the assignment states that how we do this is up to our own interpretation of how to find "phrases".


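import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("http://oracle.com/").get();
String text = doc.body().text(); // text() strips the tags; toString() would return the raw HTML instead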



This gives me a decent printout of all the words that appear on a webpage, with the HTML parsed out.


My main problem is that I can't think of a way to parse the HTML so that I can form these arbitrary groups, and I don't know what criteria I could use to decide which words belong in the same group.

I know this question sounds terrible, but I don't know how else to state it, and I am really out of ideas. The assignment I was given is extremely unclear, and when asked for clarification my professor just tells me to interpret it myself. Does anyone have ideas on how to parse the HTML so that words close to each other (maybe inside the same or similar HTML tags) are grouped together? Ideally the output would look like what I have now, except with a newline or some other delimiter I can parse after every "phrase".


Thanks for any ideas or advice.



What you are looking for is a concept called stemming. Per Wikipedia, stemming reduces inflected (or sometimes derived) words to their word stem, base, or root form.

You can provide a simple brute-force implementation for this. Also check out the stemming algorithm implementations from Lucene and OpenNLP.
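To give an idea of what a brute-force version might look like, here is a minimal sketch that strips a few common English suffixes and groups words sharing the same stem. The class name `SimpleStemmer`, the suffix list, and the minimum stem length are all assumptions made for illustration. This is not the Porter algorithm or what Lucene or OpenNLP do, and it will mis-stem plenty of words (for example, "devices" becomes "devic").

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: naive suffix stripping, not a real stemming algorithm.
final class SimpleStemmer {
    // Assumed suffix list, checked in order (longer suffixes first).
    private static final String[] SUFFIXES = {"ingly", "edly", "ing", "ed", "ly", "es", "s"};
    // Assumed lower bound so short words like "is" are left alone.
    private static final int MIN_STEM_LENGTH = 3;

    // Lowercases the word and removes the first matching suffix, if any.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Splits text on non-letters and groups the original words by their stem,
    // preserving the order in which stems first appear.
    static Map<String, List<String>> groupByStem(String text) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            groups.computeIfAbsent(stem(token), k -> new ArrayList<>()).add(token);
        }
        return groups;
    }

    public static void main(String[] args) {
        String text = "Connected devices connect quickly; connecting them is quick.";
        groupByStem(text).forEach((stem, words) -> System.out.println(stem + " -> " + words));
    }
}
```

You would feed this the plain text you already extract with Jsoup. For anything beyond a homework-sized experiment, the library implementations mentioned above are likely a better choice than a hand-rolled suffix list.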

