How to remove all cells that contain supersets of other cells?

Problem description

I am working on text mining. I have extracted 23 sentences from a text file, along with 6 frequent words from the same file.

For the frequent words, I created a 1D cell array, Out1, that lists for each word the sentences in which it occurs. I then took pairwise intersections to find the sentences in which each word occurs together with each of the other words:

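n = numel(Out1);            % number of frequent words
OccursTogether = cell(n);   % n-by-n cell array; only the upper triangle is filled
for ii = 1:n
    for jj = ii+1:n
        % sentences in which both word ii and word jj occur
        OccursTogether{ii,jj} = intersect(Out1{ii}, Out1{jj});
    end
end
celldisp(OccursTogether)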

The output looks like this:

OccursTogether[1,1]= 4 3
OccursTogether[1,2]= 1 4 3
OccursTogether[1,3]= 4 3

In the output above, [1,1] shows that word 1 occurs with word 1 in sentences 4 and 3, [1,2] shows that word 1 and word 2 occur together in sentences 1, 4 and 3, and so on.

What I want to do is implement an element absorption technique that removes every cell that is a superset of another cell. As shown above, the entries 4 and 3 in [1,1] are a subset of [1,2], so the OccursTogether[1,2] entry should be deleted and the output should be as follows:

occurs[1,1]= 4 3
occurs[1,3]= 4 3

Note that this should check every entry against all the other entries for subset relationships.

Recommended answer

I think this does what you want:

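[ii, jj] = ndgrid(1:numel(OccursTogether));   % all index pairs (ii,jj)
% s(ii,jj) is true when set ii is a subset of set jj
s = cellfun(@(x,y) all(ismember(x,y)), OccursTogether(ii), OccursTogether(jj));
s = triu(s,1);   % count each pair just once, and remove self-pairs
result = OccursTogether(~any(s,1));   % keep sets that have no earlier subset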
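One caveat worth noting: all(ismember([], y)) evaluates to true, so an empty cell counts as a subset of every other cell. The loop in the question fills only the upper triangle of OccursTogether, so the remaining cells stay empty and would absorb every cell that comes after them. The following is a minimal sketch, assuming empty cells should be ignored rather than treated as sets; the sample data and the name nonEmpty are illustrative and not part of the original answer.

```matlab
% Hypothetical data shaped like the question's upper-triangular cell array;
% cells the loop never assigns are left as [].
OccursTogether = cell(3);
OccursTogether{1,2} = [3 4];
OccursTogether{1,3} = [1 3 4];
OccursTogether{2,3} = [2 5];

% Drop empty cells first; otherwise all(ismember([], y)) is true and
% the first empty cell would mark every later cell as a superset.
nonEmpty = OccursTogether(~cellfun(@isempty, OccursTogether));

% Same absorption step as in the answer, applied to the non-empty sets only.
[ii, jj] = ndgrid(1:numel(nonEmpty));
s = cellfun(@(x,y) all(ismember(x,y)), nonEmpty(ii), nonEmpty(jj));
s = triu(s,1);
result = nonEmpty(~any(s,1));
```

Here [1 3 4] is dropped because it contains [3 4], while [2 5] is kept. Filtering flattens the array, so the original (row, column) positions are lost; if they matter, calling find on the same non-empty mask should recover them in the same order.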

Example 1:

OccursTogether{1,1} = [4 3]
OccursTogether{1,2} = [1 4 3]
OccursTogether{1,3} = [1 4 3 5];
OccursTogether{1,4} = [1 4 3 5];

gives

>> celldisp(result)
result{1} =
     4     3

OccursTogether{1,2} is removed because it is a superset of OccursTogether{1,1}. OccursTogether{1,3} is removed because it is a superset of OccursTogether{1,2}. OccursTogether{1,4} is removed because it is a superset of OccursTogether{1,3}.

Example 2:

OccursTogether{1,1} = [10 20 30]
OccursTogether{1,2} = [10 20 30]

gives

>> celldisp(result)
result{1} =
    10    20    30

OccursTogether{1,2} is removed because it is a superset of OccursTogether{1,1}, but OccursTogether{1,1} is not removed even though it is also a superset of OccursTogether{1,2}. Each set is compared only with the sets that precede it, which is what the triu call in the third line of code enforces.
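The effect of that ordering can be seen by swapping triu for a mask that removes only self-pairs. In the sketch below (the names sets, sOrdered, sAll, keptOrdered and keptAll are illustrative, not from the original answer), identical sets then eliminate each other and nothing is left:

```matlab
sets = {[10 20 30], [10 20 30]};
n = numel(sets);
[ii, jj] = ndgrid(1:n);
s = cellfun(@(x,y) all(ismember(x,y)), sets(ii), sets(jj));

% As in the answer: compare each set only with the sets before it.
sOrdered = triu(s,1);
keptOrdered = sets(~any(sOrdered,1));   % the first copy survives

% Variant: compare with every other set, excluding only self-pairs.
sAll = s & ~eye(n);
keptAll = sets(~any(sAll,1));           % both copies are removed
```

Keeping only the upper triangle is therefore what allows one representative of each group of duplicate sets to remain.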
