How to iterate over pandas rows containing lists of strings to check whether each word is English?

Problem description

I have a pandas dataframe that contains review texts. After text preprocessing, each row holds a list of strings. Now I want to iterate over each row's list to check whether each string is an English word. I want to count the non-English words in each row and store the counts in a new column, "Occurrences".

For the English check I will use the pyenchant library.

Something like the code below:



   review_text                                         sentiment  error_related
0  [simple, effective, way, new, word, kid]            1          NaN
1  [fh, fcfatgv]                                       1          NaN
2  [son, loved, easy, even, though, son, first, g...   1          NaN

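import enchant

english_dict = enchant.Dict("en_US")

def english_counter(df, df_text_column):
    # One count of non-English words per review
    number_of_non_english_words = []
    for review in df_text_column:
        a = 0  # reset the counter for each review, not for each word
        for word in review:
            if not english_dict.check(word):
                a = a + 1
        number_of_non_english_words.append(a)
    # Store the counts in the new column
    df["Occurrences"] = number_of_non_english_words
    return df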

Recommended answer

You didn't include example data, so I constructed some manually. Note that my dataframe format may differ from yours.

import pandas as pd
import enchant

english_dict = enchant.Dict("en_US")

# Construct the dataframe
words = ['up and vote', 'wet 0001f914 turtle 0001f602', 'thumbnailшщуй',
       'lobby', 'mods saffron deleted iâ', 'â', 'itâ donâ edit', 'thatâ',
       'didnâ canâ youâ'] 

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame({'text': words})

# Iterate over the text column
for text in df['text']:
    # Per-text counters
    eng_words = 0
    non_eng_words = 0
    # For every word in the text
    for word in text.split(' '):
        # check() returns True if the word is in the en_US dictionary
        if english_dict.check(word):
            eng_words += 1
        else:
            non_eng_words += 1
    # Print the result
    # NOTE: the counters are reset for each text, so nothing is stored
    print('EN: {}; NON-EN: {}'.format(eng_words, non_eng_words))
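
A side note on the tokenizing step: `text.split(' ')` splits on every single space, so repeated or trailing spaces produce empty strings, and as far as I know pyenchant's `Dict.check` raises a `ValueError` when it is given an empty string. The sample texts here contain no such spacing, so the loop works as written. A minimal sketch of the difference, using a made-up input string (an assumption for illustration, not taken from the data above):

```python
# Hypothetical input with a double space and a trailing space
text = "wet  0001f914 turtle "

# Splitting on a single space keeps empty strings between and after spaces
tokens_by_space = text.split(" ")

# Splitting with no argument collapses runs of whitespace and drops empty tokens
tokens_default = text.split()

print(tokens_by_space)  # ['wet', '', '0001f914', 'turtle', '']
print(tokens_default)   # ['wet', '0001f914', 'turtle']
```

If the real review texts can contain irregular spacing, either `split()` or an explicit `if word:` guard before the dictionary check would avoid passing empty strings to the checker.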


If you want to modify your dataset, you should wrap this code in a function:

def create_occurrences(df):
    # Collect one pair of counts per row, then attach them as new columns
    eng_words_list = []
    non_eng_words_list = []
    for text in df['text']:
        eng_words = 0
        non_eng_words = 0
        for word in text.split(' '):
            if english_dict.check(word):
                eng_words += 1
            else:
                non_eng_words += 1
        eng_words_list.append(eng_words)
        non_eng_words_list.append(non_eng_words)
    # The dataframe is modified in place
    df['eng_words'] = pd.Series(eng_words_list, index=df.index)
    df['non_eng_words'] = pd.Series(non_eng_words_list, index=df.index)

create_occurrences(df)
df
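
The answer's dataframe stores each text as one space-separated string, while the question's `review_text` column already holds a list of words per row. For that layout, the same count can probably be written with `Series.apply`. The sketch below is an illustration under assumptions, not the answerer's code: `count_non_english`, `stub_is_english`, `SAMPLE_VOCAB`, and the `reviews` dataframe are hypothetical names, and the tiny vocabulary stands in for pyenchant so the example runs without it. In real use, `enchant.Dict("en_US").check` could be passed as the checker instead.

```python
import pandas as pd

# Stand-in vocabulary (assumption) so the sketch runs without pyenchant installed
SAMPLE_VOCAB = {"simple", "effective", "way", "new", "word", "kid"}

def stub_is_english(word):
    """Hypothetical replacement for enchant.Dict("en_US").check."""
    return word.lower() in SAMPLE_VOCAB

def count_non_english(words, is_english):
    """Count the non-empty tokens in `words` that fail the `is_english` check."""
    return sum(1 for word in words if word and not is_english(word))

# First two rows of the question's sample data (the third row is truncated there)
reviews = pd.DataFrame({
    "review_text": [
        ["simple", "effective", "way", "new", "word", "kid"],
        ["fh", "fcfatgv"],
    ],
    "sentiment": [1, 1],
})

# One count per row, stored in the column the question asks for
reviews["Occurrences"] = reviews["review_text"].apply(
    lambda words: count_non_english(words, stub_is_english)
)
print(reviews[["review_text", "Occurrences"]])
```

Passing the checker as an argument keeps the counting logic independent of the dictionary, which also makes it easy to try a different language tag or a custom word list.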

