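First, download the Reuters-21578 corpus (a quick Baidu search will find it).
Then upload it to the Linux machine and extract it into a folder of your choice. Tip: I put the archive in /usr/hadoop/mahout/reutersTest/reuters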
tar -zxvf /usr/hadoop/mahout/reutersTest/reuters/reuters21578.tar.gz
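Next, convert the corpus format. The required steps are:
.sgm files ===> .txt files ===> sequence files ===> vector files
For the first step, write a small Java program that uses the org.apache.lucene.benchmark.utils.ExtractReuters class (available alongside Mahout) to convert the .sgm files into .txt files, one news article per document.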
/**
 * @author YangXin
 * @info Converts the Reuters corpus (.sgm) into .txt files, one per article.
 */
package unitEight;

import java.io.File;

import org.apache.lucene.benchmark.utils.ExtractReuters;

public class TestExtractReuters {
    public static void main(String[] args) {
        // Folder that holds the extracted .sgm files
        File inputFolder = new File("G:\\reuter");
        // Folder that receives the generated .txt files
        File outputFolder = new File("G:\\reuters-Text");
        // Create the output folder if it does not exist yet
        outputFolder.mkdirs();
        ExtractReuters extractor = new ExtractReuters(inputFolder, outputFolder);
        extractor.extract();
    }
}
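To make "one news article per document" more concrete, here is a rough, standard-library-only sketch of the splitting idea. It is not the Lucene implementation: the class ReutersSplitSketch and its methods are hypothetical names I made up, and it assumes each article sits between <REUTERS ...> and </REUTERS> tags with <TITLE> and <BODY> elements inside.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of Lucene or Mahout. */
final class ReutersSplitSketch {
    // Assumption: every article is wrapped in <REUTERS ...> ... </REUTERS>.
    private static final Pattern ARTICLE =
            Pattern.compile("<REUTERS[^>]*>(.*?)</REUTERS>", Pattern.DOTALL);
    private static final Pattern TITLE =
            Pattern.compile("<TITLE>(.*?)</TITLE>", Pattern.DOTALL);
    private static final Pattern BODY =
            Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    /** Returns one plain-text string per article: title, blank line, body. */
    static List<String> splitArticles(String sgm) {
        List<String> docs = new ArrayList<>();
        Matcher article = ARTICLE.matcher(sgm);
        while (article.find()) {
            String content = article.group(1);
            docs.add(firstGroup(TITLE, content) + "\n\n" + firstGroup(BODY, content));
        }
        return docs;
    }

    /** First capture group of the pattern, trimmed, or "" if the tag is absent. */
    private static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String sample =
                "<REUTERS NEWID=\"1\"><TEXT><TITLE>First</TITLE>"
              + "<BODY>Body one.</BODY></TEXT></REUTERS>\n"
              + "<REUTERS NEWID=\"2\"><TEXT><TITLE>Second</TITLE>"
              + "<BODY>Body two.</BODY></TEXT></REUTERS>";
        List<String> docs = splitArticles(sample);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("--- doc " + i + " ---");
            System.out.println(docs.get(i));
        }
    }
}
```

In a real run each returned string would be written to its own .txt file, which is the job ExtractReuters takes care of in the program above.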
The conversion produces a lot of data, so I only captured part of it:
Next, run:
mahout seqdirectory -c UTF-8 -i /usr/hadoop/mahout/reutersTest/reuters-Text -o reuters-seqfiles
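Conceptually, seqdirectory turns a folder of text files into key/value records, with a file name as the key and the file content as the value. The sketch below mimics that pairing with plain Java and an in-memory map. It is only an illustration under my own assumptions: SeqDirectorySketch and readAsKeyValue are hypothetical names, the "/relative/path" key format is my approximation, and no Hadoop SequenceFile is actually written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration only; does not use Hadoop or Mahout. */
final class SeqDirectorySketch {
    /** Maps "/relative/path" to the UTF-8 text of every regular file under inputDir. */
    static Map<String, String> readAsKeyValue(Path inputDir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(inputDir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> pairs = new TreeMap<>();
        for (Path file : files) {
            // Normalise Windows separators so keys look the same on every OS.
            String key = "/" + inputDir.relativize(file).toString().replace('\\', '/');
            // UTF-8 here plays the role of the -c UTF-8 option in the command.
            pairs.put(key, new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny throwaway folder instead of touching the real corpus.
        Path dir = Files.createTempDirectory("reuters-text-demo");
        Files.write(dir.resolve("doc-0.txt"), "Oil prices rise".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("doc-1.txt"), "Grain exports fall".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, String> e : readAsKeyValue(dir).entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}
```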
Once the command finishes, the output folder (reuters-seqfiles) appears on HDFS:
Next, run:
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
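To give a feel for what a "vector file" holds, here is a small standard-library sketch that builds a term dictionary and a sparse TF-IDF vector for each document. It is an assumption-laden illustration, not Mahout's code: SparseVectorSketch and its methods are hypothetical names, the tokenizer is a naive split on non-alphanumerics, and the weight is the textbook tf * ln(N / df), which may differ from the formula seq2sparse applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not Mahout's seq2sparse implementation. */
final class SparseVectorSketch {
    /** Lower-cases the text and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Gives each distinct term an integer index in order of first appearance. */
    static Map<String, Integer> buildDictionary(List<String> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String term : tokenize(doc)) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    /** Sparse vector: term index -> tf * ln(N / df); zero weights are left out. */
    static Map<Integer, Double> tfidf(String doc, List<String> docs, Map<String, Integer> dict) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : tokenize(doc)) tf.merge(term, 1, Integer::sum);
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = 0;  // number of documents that contain the term
            for (String other : docs) {
                if (tokenize(other).contains(e.getKey())) df++;
            }
            if (df == 0) continue;  // term unknown to the corpus, skip it
            double weight = e.getValue() * Math.log((double) docs.size() / df);
            if (weight != 0.0) vector.put(dict.get(e.getKey()), weight);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("Oil prices rise");
        docs.add("Oil exports fall");
        Map<String, Integer> dict = buildDictionary(docs);
        System.out.println("dictionary: " + dict);
        for (String doc : docs) {
            System.out.println(doc + " -> " + tfidf(doc, docs, dict));
        }
    }
}
```

A term that occurs in every document (here "oil") gets weight 0 and drops out of the vector, which is why the representation stays sparse.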
Finally, you can download the resulting reuters-vectors output from HDFS and inspect it.