I want to create a document term matrix. In my case it is not like documents x words but it is sentences x words so the sentences will act as the documents. I am using 'l2' normalization post doc-term matrix creation.
The term count is important for me to create summarization using SVD in further steps.
My question is which axis is appropriate for the 'l2' normalization. From my research I understood:
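- Axis=1: gives me the importance of a word within a sentence (column-wise normalization)
- Axis=0: gives me the importance of a word across the whole document (row-wise normalization).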
Even knowing the theory, I am not able to decide which alternative to choose, and the choice will greatly affect my summarization results. Please suggest which axis to use, along with the reasoning behind it.
By L2 normalization, do you mean division by the total count? If you normalize along axis=0, then the value of $x_{i,j}$ is the probability of the word $j$ over all sentences $i$ (division by the global word count). That value depends on the length of the sentence: longer sentences can repeat some words over and over again and will have a much higher probability for those words, because they contribute a lot to the global word count. If you normalize along axis=1, then you're asking whether sentences have the same composition of words, since you normalize along the length of the sentence.
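To make the difference concrete, here is a minimal NumPy sketch. The count matrix and the helper `l2_normalize` are made up for illustration and are not from the question. It divides by the Euclidean norm, which is what 'l2' usually means, rather than by the total count. The axis convention is assumed to match scikit-learn's `normalize(X, norm='l2', axis=...)`, where axis=1 scales each row (sentence) and axis=0 scales each column (term).

```python
import numpy as np

# Hypothetical sentence x term count matrix: 3 sentences (rows), 4 terms (columns).
counts = np.array([
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [4.0, 2.0, 0.0, 2.0],  # same word mix as sentence 0, but twice as long
])


def l2_normalize(matrix, axis):
    """Divide by the Euclidean norm taken along `axis` (all-zero vectors stay zero)."""
    norms = np.linalg.norm(matrix, axis=axis, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return matrix / norms


# axis=1: every sentence (row) is scaled to unit length, so sentence length drops out.
by_sentence = l2_normalize(counts, axis=1)

# axis=0: every term (column) is scaled to unit length, so long sentences dominate.
by_term = l2_normalize(counts, axis=0)

print(np.allclose(by_sentence[0], by_sentence[2]))  # sentences 0 and 2 become identical
print(by_term[2, 0] > by_term[0, 0])                # the longer sentence gets more weight
```

With axis=1, sentences 0 and 2 end up with the same vector because only their word composition differs from the others, not their length. With axis=0, sentence 2 carries more weight for every term it shares with sentence 0, simply because it is longer.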