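I have a dictionary with information about strings found in text files, where the keys are the strings and the values are the file names.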

Dict1 = {'str1A':'file1', 'str1B':'file1', 'str1C':'file1', 'str1D':'file1',
         'str2A':'file2', 'str2B':'file2', 'str2C':'file2', 'str2D':'file2',
         'str3A':'file3', 'str3B':'file3', 'str3C':'file3', 'str3D':'file3',
         'str4A':'file4', 'str4B':'file4', 'str4C':'file4', 'str4D':'file4', 'str4E':'file4'}

A second dictionary holds the best match found in the text for each string.

Dict2 = {'str1A':'jump', 'str1B':'fly', 'str1C':'swim', 'str2A':'jump', 'str2B':'fly', 'str2C':'swim', 'str2D':'run', 'str3A':'jump', 'str3B':'fly', 'str3C':'swim', 'str3D':'run'}


A third dictionary holds the percentage of occurrence of each string in the text.

Dict3 = {'str1A':'90', 'str1B':'60', 'str1C':'30', 'str2A':'70', 'str2B':'30', 'str2C':'60', 'str2D':'40', 'str3A':'10', 'str3B':'90', 'str3C':'70', 'str3D':'90'}


My goal is to combine the information from these dictionaries into a data frame like this:

       jump     fly     swim    run
file1   90      60      30      NA
file2   70      30      60      40
file3   10      90      70      90


I started a script for this, but got stuck:

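import pandas as pd

# One row per string: the file it came from
col_file = ['str', 'file']
df_origin = pd.DataFrame(list(Dict1.items()), columns=col_file)
#print(df_origin)

# One row per string: its best match in the text
col_bmatch = ['str', 'text']
df_bmatch = pd.DataFrame(list(Dict2.items()), columns=col_bmatch)
#print(df_bmatch)

# One row per string: its percentage of occurrence
col_percent = ['str', 'percent']
df_percent = pd.DataFrame(list(Dict3.items()), columns=col_percent)
#print(df_percent)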


This block was removed from the script:


df_origin['text'] = df_origin['str'].map(df_bmatch.set_index('str')['text'])

df_origin['percent'] = df_origin['str'].map(df_percent.set_index('str')['percent'])



and replaced with:

data = {}
for k, col in Dict1.items():
    if k in Dict1 and k not in Dict3:
        data.setdefault(k, {})[col] = "NA"
    elif k in Dict1 and k in Dict3:
        data.setdefault(k, {})[col] = Dict3[k]

df = pd.DataFrame(data)

print(df)


But the final result is not what I want:

      str1A str1B str1C str1D str2A str2B str2C str2D str3A str3B  \
file1    90    60    30    NA   NaN   NaN   NaN   NaN   NaN   NaN
file2   NaN   NaN   NaN   NaN    70    30    60    40   NaN   NaN
file3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN    10    90
file4   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN

      str3C str3D str4A str4B str4C str4D str4E
file1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
file2   NaN   NaN   NaN   NaN   NaN   NaN   NaN
file3    70    90   NaN   NaN   NaN   NaN   NaN
file4   NaN   NaN    NA    NA    NA    NA    NA


whereas the expected table is:

         jump   fly    swim   run   sit
file1    90     60     30     NA    NA
file2    70     30     60     40    NA
file3    10     90     70     90    NA
file4    NA     NA     NA     NA    NA


where none of the strings in file4 were detected.

Output of the removed block:


print(df_origin)

#          str   file  text percent
#    0   str2B  file2   fly      30
#    1   str2C  file2  swim      60
#    2   str3C  file3  swim      70
#    3   str3B  file3   fly      90
#    4   str3D  file3   run      90
#    5   str2D  file2   run      40
#    6   str3A  file3  jump      10
#    7   str1D  file1   NaN     NaN
#    8   str1C  file1  swim      30
#    9   str1B  file1   fly      60
#    10  str1A  file1  jump      90
#    11  str2A  file2  jump      70



This is where the problem lies:

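# Note: the level= argument of max() was removed in pandas 2.0;
# groupby(level=0).max() is the newer spelling of the first reduction.
print(pd.get_dummies(df_origin.set_index('file')['text']).max(level=0).max(level=0, axis=1))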


But the only result I get is:

       fly  jump  run  swim
file
file2    1     1    1     1
file3    1     1    1     1
file1    1     1    0     1


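As far as I understand, pd.get_dummies groups df_origin by the 'file' field and uses 'text' only to check for presence.

How can I change the command so that the table is filled with the values of the 'percent' column of df_origin instead?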
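For reference, the presence-only behaviour can be reproduced on a small made-up sample. This is only a sketch: the three-row `sample` frame is hypothetical, and `groupby(level=0).max()` stands in for the `max(level=0)` call used above.

```python
import pandas as pd

# Hypothetical sample shaped like df_origin (not the real data).
sample = pd.DataFrame({
    'file': ['file1', 'file1', 'file2'],
    'text': ['jump', 'fly', 'jump'],
    'percent': ['90', '60', '70'],
})

# get_dummies only receives the 'text' labels, so each cell can only say
# whether a label occurs for a file; 'percent' never enters the computation.
indicators = pd.get_dummies(sample.set_index('file')['text'], dtype=int)

# Collapse the repeated 'file' index entries: 1 if the label appears at all.
presence = indicators.groupby(level=0).max()
print(presence)
```

The result has one 0/1 column per label, which is why the percentages never show up.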

Best answer

Try this:

import pandas as pd

Dict1 = {'str1A':'file1', 'str1B':'file1', 'str1C':'file1', 'str1D':'file1', 'str2A':'file2', 'str2B':'file2', 'str2C':'file2', 'str2D':'file2', 'str2D':'file2', 'str3A':'file3', 'str3B':'file3','str3C':'file3', 'str3D':'file3', 'str3D':'file3' , 'str4A':'file4', 'str4B':'file4', 'str4C':'file4', 'str4D':'file4', 'str4E':'file4'}
Dict2 = {'str1A':'jump', 'str1B':'fly', 'str1C':'swim', 'str2A':'jump', 'str2B':'fly', 'str2C':'swim', 'str2D':'run', 'str3A':'jump', 'str3B':'fly', 'str3C':'swim', 'str3D':'run'}
Dict3 = {'str1A':'90', 'str1B':'60', 'str1C':'30', 'str2A':'70', 'str2B':'30', 'str2C':'60', 'str2D':'40', 'str3A':'10', 'str3B':'90', 'str3C':'70', 'str3D':'90'}

data = {}
for k, col in Dict2.items():
    if k not in Dict1 or k not in Dict3:
        continue
    data.setdefault(col, {})[Dict1[k]] = Dict3[k]
df = pd.DataFrame(data, index=sorted(set(Dict1.values())), columns=sorted(set(Dict2.values())))

print(df)


Output:

       fly jump  run swim
file1   60   90  NaN   30
file2   30   70   40   60
file3   90   10   90   70
file4  NaN  NaN  NaN  NaN
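If you would rather stay within pandas, a possible alternative is to align the three dictionaries on their keys and pivot the result. This is a sketch rather than part of the answer above: the names `df_long` and `table` are made up for illustration, and the percentages are converted to numbers with `pd.to_numeric`, which the loop version does not do.

```python
import pandas as pd

Dict1 = {'str1A': 'file1', 'str1B': 'file1', 'str1C': 'file1', 'str1D': 'file1',
         'str2A': 'file2', 'str2B': 'file2', 'str2C': 'file2', 'str2D': 'file2',
         'str3A': 'file3', 'str3B': 'file3', 'str3C': 'file3', 'str3D': 'file3',
         'str4A': 'file4', 'str4B': 'file4', 'str4C': 'file4', 'str4D': 'file4',
         'str4E': 'file4'}
Dict2 = {'str1A': 'jump', 'str1B': 'fly', 'str1C': 'swim',
         'str2A': 'jump', 'str2B': 'fly', 'str2C': 'swim', 'str2D': 'run',
         'str3A': 'jump', 'str3B': 'fly', 'str3C': 'swim', 'str3D': 'run'}
Dict3 = {'str1A': '90', 'str1B': '60', 'str1C': '30',
         'str2A': '70', 'str2B': '30', 'str2C': '60', 'str2D': '40',
         'str3A': '10', 'str3B': '90', 'str3C': '70', 'str3D': '90'}

# One row per string; the Series are aligned on the string keys,
# so strings missing from Dict2 or Dict3 get NaN in those columns.
df_long = pd.DataFrame({
    'file': pd.Series(Dict1),
    'text': pd.Series(Dict2),
    'percent': pd.Series(Dict3),
})
df_long['percent'] = pd.to_numeric(df_long['percent'])

# Drop strings without a match, spread the matches into columns,
# then reindex so files with no detected strings (file4) still get a row.
table = (
    df_long.dropna(subset=['text'])
           .pivot(index='file', columns='text', values='percent')
           .reindex(sorted(set(Dict1.values())))
)
print(table)
```

Passing a `columns=[...]` list to `reindex` in the same way would add an empty column for a label that never occurs, such as the 'sit' column in the expected table.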
