linux - Determining word frequency of specific terms

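I'm a non-computer-science student writing a history thesis that involves determining the frequency of specific terms in a number of texts and then plotting those frequencies over time to identify changes and trends. I have figured out how to determine word frequencies for a single text file, but I am dealing with a (relatively, for me) large number of files (> 100), and for consistency's sake I would like to limit the words included in the frequency count to a specific set of terms (something like the opposite of a "stop list").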

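This should be kept very simple. In the end, all I need is the frequency of the specific words for each text file I process, preferably in spreadsheet format (a tab-delimited file), so that I can use the data to create graphs and visualizations.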

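I use Linux day to day, am comfortable on the command line, and would love an open-source solution (or something I could run under WINE). That is not a requirement, however.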

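I see two ways to solve this problem:

1. Find a way to strip out all the words in a text file except those in the pre-defined list, and then do the frequency count from there, or:

2. Find a way to do a frequency count using just the terms from the pre-defined list.

Any ideas?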

Best answer

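I would go with the second idea. Here is a simple Perl program that reads the list of words from the first file provided and prints, in tab-separated format, a count of each listed word as found in the second file provided. The list of words in the first file should be given one per line.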

#!/usr/bin/perl

use strict;
use warnings;

my $word_list_file = shift;
my $process_file = shift;

my %word_counts;

# Open the word list file, read a line at a time, remove the newline,
# add it to the hash of words to track, initialize the count to zero
open(my $words_fh, '<', $word_list_file)
  or die "Failed to open list file: $!\n";
while (<$words_fh>) {
  chomp;
  next unless length;  # Skip blank lines in the word list
  # Store words in lowercase for case-insensitive match
  $word_counts{lc($_)} = 0;
}
close($words_fh);

# Read the text file one line at a time, break the text up into words
# based on word boundaries (\b), iterate through each word incrementing
# the word count in the word hash if the word is in the hash
open(my $text_fh, '<', $process_file)
  or die "Failed to open process file: $!\n";

while (<$text_fh>) {
  chomp;
  # If the line ends in a hyphen, remove the hyphen and keep appending
  # lines until we find one that doesn't end in a hyphen
  while ( s/-$// ) {
    my $next_line = <$text_fh>;
    last unless defined $next_line;
    chomp $next_line;  # Drop the newline so a trailing hyphen is seen next pass
    $_ .= $next_line;
  }

  my @words = split /\b/, lc; # Split the lower-cased version of the string
  foreach my $word (@words) {
    $word_counts{$word}++ if exists $word_counts{$word};
  }
}
close($text_fh);

# Print each word in the hash in alphabetical order along with the
# number of times encountered, delimited by tabs (\t)
foreach my $word (sort keys %word_counts) {
  print "$word\t$word_counts{$word}\n";
}

If the file words.txt contains:

linux
frequencies
science
words

and the file text.txt contains the text of your post, the following command:

perl analyze.pl words.txt text.txt

will print:

frequencies     3
linux   1
science 1
words   3
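The program above handles one text file per run, while the question mentions more than 100 files. One possible way to get a single tab-separated table is sketched below, with one row per file and one column per term. This is only a sketch under assumptions: the script name analyze_all.pl and the helpers load_terms and count_file are hypothetical, and hyphen handling is left out to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the word list (one term per line), lower-cased.
sub load_terms {
  my ($path) = @_;
  open(my $fh, '<', $path) or die "Failed to open list file $path: $!\n";
  my @terms;
  while (my $line = <$fh>) {
    chomp $line;
    push @terms, lc $line if length $line;   # skip blank lines
  }
  close($fh);
  return @terms;
}

# Hypothetical helper: count the listed terms in a single text file.
# Hyphen handling is omitted here to keep the sketch short.
sub count_file {
  my ($path, @terms) = @_;
  my %counts;
  $counts{$_} = 0 for @terms;
  open(my $fh, '<', $path) or die "Failed to open $path: $!\n";
  while (my $line = <$fh>) {
    foreach my $word (split /\b/, lc $line) {
      $counts{$word}++ if exists $counts{$word};
    }
  }
  close($fh);
  return %counts;
}

# Assumed usage: perl analyze_all.pl words.txt file1.txt file2.txt ...
if (@ARGV >= 2) {
  my ($word_list_file, @text_files) = @ARGV;
  my @terms = load_terms($word_list_file);
  @terms = sort @terms;
  print join("\t", 'file', @terms), "\n";    # header row
  foreach my $file (@text_files) {
    my %counts = count_file($file, @terms);
    print join("\t", $file, map { $counts{$_} } @terms), "\n";
  }
}
```

Redirecting the output to a file (for example with > counts.tsv) should give something a spreadsheet program can open directly.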

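Note that splitting on word boundaries with \b may not work the way you want in all cases. For example, if your text files contain words that are hyphenated across lines, you will need to do something a little smarter to match them. In that case you can check whether the last character of a line is a hyphen and, if it is, remove the hyphen and read another line before splitting the line into words.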
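To see why hyphens matter, it can help to look at what split /\b/ actually returns. The sketch below uses a hypothetical helper, split_on_boundaries, that splits a line the same way the program does. The sample string is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lower-case a line and split it on word boundaries.
sub split_on_boundaries {
  my ($line) = @_;
  return split /\b/, lc $line;
}

my @pieces = split_on_boundaries("Well-known terms");
print join('|', @pieces), "\n";   # prints: well|-|known| |terms
```

The hyphen comes back as its own piece, so a hyphenated entry such as well-known in the word list would never match a single piece, and a word broken across two lines is seen as two separate fragments.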

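Edit: Updated version that matches words case-insensitively and handles words hyphenated across lines.

Note that if some hyphenated words are broken across lines and others are not, this will not find them all, because it only removes hyphens at the end of a line. In that case you may want to remove all hyphens and match words after the hyphens are gone. You can do this by adding the following line right before the split function: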

s/-//g;
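As a rough illustration of the effect, the sketch below counts a term in a single line after stripping every hyphen. The helper count_without_hyphens and the sample sentence are hypothetical. Note that with this approach the entries in the word list would also need to be written without hyphens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count listed terms in one line after dropping hyphens.
sub count_without_hyphens {
  my ($line, @terms) = @_;
  my %counts;
  $counts{lc $_} = 0 for @terms;
  $line =~ s/-//g;                         # "Co-operate" becomes "Cooperate"
  foreach my $word (split /\b/, lc $line) {
    $counts{$word}++ if exists $counts{$word};
  }
  return %counts;
}

my %counts = count_without_hyphens("Co-operate or cooperate?", "cooperate");
print "cooperate\t$counts{cooperate}\n";   # both spellings are counted: 2
```

One side effect to keep in mind: anything else joined by hyphens, such as a date range like 1914-1918, is merged into a single token as well.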

This Q&A on determining word frequency of specific terms on Linux comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/315667/