Problem description
I am working on text mining. I have extracted 23 sentences from a text file, along with 6 frequent words from the same file.
For the frequent words, I created a 1-D cell array that records, for each word, the sentences in which it occurs. I then took pairwise intersections to find the sentences in which each word occurs together with each of the other words:
OccursTogether = cell(length(Out1));
for ii=1:length(Out1)
    for jj=ii+1:length(Out1)
        OccursTogether{ii,jj} = intersect(Out1{ii},Out1{jj});
    end
end
celldisp(OccursTogether)
The output looks like this:
OccursTogether[1,1]= 4 3
OccursTogether[1,2]= 1 4 3
OccursTogether[1,3]= 4 3
Above, [1,1] shows that word 1 occurs with word 1 in sentences 4 and 3, [1,2] shows that word 1 and word 2 occur together in sentences 1, 4 and 3, and so on.
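As a minimal sketch of what the loop produces, the snippet below runs it on a made-up Out1 with three words (the sentence numbers are illustrative assumptions, not the data from the question). Because jj starts at ii+1, the loop as written fills only the cells above the diagonal; the diagonal and the lower triangle stay empty. intersect also returns its result sorted in ascending order. The variable name filled is hypothetical.

```matlab
% Hypothetical data: Out1{k} lists the sentences in which word k occurs.
Out1 = {[1 3 4], [1 2 3 4], [3 4 5]};

n = numel(Out1);
OccursTogether = cell(n);            % n-by-n cell array, every cell starts empty
for ii = 1:n
    for jj = ii+1:n                  % pairs above the diagonal only
        OccursTogether{ii,jj} = intersect(Out1{ii}, Out1{jj});
    end
end

filled = ~cellfun(@isempty, OccursTogether);   % logical mask of the filled cells
disp(filled)
celldisp(OccursTogether(filled))
```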
What I want to do is implement an element absorption technique that removes every cell that is a superset of another cell. As we can see above, the entries 4 and 3 in [1,1] are a subset of [1,2], so the OccursTogether[1,2] entry should be deleted and the output should be as follows:
occurs[1,1]= 4 3
occurs[1,3]= 4 3
Keep in mind that this should check all possible subset relationships among the entries.
Recommended answer
I think this does what you want:
[ii, jj] = ndgrid(1:numel(OccursTogether));
s = cellfun(@(x,y) all(ismember(x,y)), OccursTogether(ii), OccursTogether(jj));
s = triu(s,1); %// count each pair just once, and remove self-pairs
result = OccursTogether(~any(s,1));
Example 1:
OccursTogether{1,1} = [4 3]
OccursTogether{1,2} = [1 4 3]
OccursTogether{1,3} = [1 4 3 5];
OccursTogether{1,4} = [1 4 3 5];
gives
>> celldisp(result)
result{1} =
4 3
OccursTogether{1,2} is removed because it is a superset of OccursTogether{1,1}. OccursTogether{1,3} is removed because it is a superset of OccursTogether{1,2}. OccursTogether{1,4} is removed because it is a superset of OccursTogether{1,3}.
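To see why only the first set survives, it can help to print the matrix s for Example 1. The sketch below simply reruns the answer's code on that input. After triu, entry (i,j) of s is true when set i comes before set j and is a subset of it, so any column that contains a true value marks a set to be removed.

```matlab
OccursTogether = {[4 3], [1 4 3], [1 4 3 5], [1 4 3 5]};

[ii, jj] = ndgrid(1:numel(OccursTogether));
s = cellfun(@(x,y) all(ismember(x,y)), OccursTogether(ii), OccursTogether(jj));
s = triu(s,1);      % keep only pairs where set i comes before set j
disp(s)
%    0   1   1   1
%    0   0   1   1
%    0   0   0   1
%    0   0   0   0

result = OccursTogether(~any(s,1));   % columns 2 to 4 contain a true value
```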
Example 2:
OccursTogether{1,1} = [10 20 30]
OccursTogether{1,2} = [10 20 30]
gives
>> celldisp(result)
result{1} =
10 20 30
OccursTogether{1,2} is removed because it is a superset of OccursTogether{1,1}, but OccursTogether{1,1} is not removed even though it is also a superset of OccursTogether{1,2}. Each set is compared only with the sets that precede it (third line of code).
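If the order of the sets should not matter, one possible variation (not part of the answer above, and only a sketch) is to remove a set whenever some other set is a strict subset of it, and to keep just the first copy among identical sets. The sketch also drops empty cells first, on the assumption that they are unfilled placeholders: all(ismember([], y)) evaluates to true, so an empty cell would otherwise count as a subset of every set. The names sets, isSub, strict and dup are illustrative.

```matlab
% Example input (assumed): two empty placeholders plus four sets.
OccursTogether = {[], [1 4 3], [4 3], [1 4 3 5], [], [4 3]};

% Drop empty cells and arrange the remaining sets as a row.
sets = OccursTogether(~cellfun(@isempty, OccursTogether));
sets = sets(:).';

[ii, jj] = ndgrid(1:numel(sets));
% isSub(i,j) is true when sets{i} is a subset of sets{j}.
isSub = cellfun(@(x,y) all(ismember(x,y)), sets(ii), sets(jj));

strict = isSub & ~isSub.';           % sets{i} is a strict subset of sets{j}
dup    = triu(isSub & isSub.', 1);   % sets{j} repeats an earlier, identical set

result = sets(~(any(strict,1) | any(dup,1)));
celldisp(result)
```

With this input only [4 3] remains: [1 4 3] is removed even though it appears before [4 3], and the second copy of [4 3] is dropped as a duplicate.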