Being more specific by including word boundaries in string modification


Problem description

Background

import pandas as pd

# Names to block in each row; an empty list means there is nothing to block
Names = [['ann'],
         [],
         ['elisabeth', 'lis'],
         ['his', 'he'],
         []]
df = pd.DataFrame({'Text': ['ann had an anniversery today',
                            'nothing here',
                            'I like elisabeth and lis 5 lists ',
                            'one day he and his cheated',
                            'same here'],
                   'P_ID': [1, 2, 3, 4, 5],
                   'P_Name': Names})

# rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
                  Text                P_ID  P_Name
0   ann had an anniversery today        1   [ann]
1   nothing here                        2   []
2   I like elisabeth and lis 5 lists    3   [elisabeth, lis]
3   one day he and his cheated          4   [his, he]
4   same here                           5   []

The code below works

m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True)

and does the following:

1) uses the name in P_Name to block the corresponding text in the Text column by placing **BLOCK**

2) produces a new column New

as shown below

   Text  P_ID P_Name  New
0                     **BLOCK** had an **BLOCK**iversery today
1                     NaN
2                     I like **BLOCK** and **BLOCK** 5 **BLOCK**ts
3                     one day **BLOCK** and **BLOCK** c**BLOCK**ated
4                     NaN

Problem

However, this code works a little "too well."

Using ['his','he'] from P_Name to block Text:

Example: one day he and his cheated becomes one day **BLOCK** and **BLOCK** c**BLOCK**ated

Desired: one day he and his cheated becomes one day **BLOCK** and **BLOCK** cheated

In this example, I would like cheated to stay as cheated and not become c**BLOCK**ated
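
The over-matching can be reproduced outside pandas. The following is a minimal sketch using only the standard `re` module. It assumes, for illustration only, that the names are combined into an alternation pattern such as `his|he`; this may not be how pandas builds the pattern internally. Without anchors, `he` also matches the substring inside `cheated`. With `\b` on both sides, only whole words match.

```python
import re

text = "one day he and his cheated"

# Unanchored pattern: "he" also matches the substring inside "cheated".
unanchored = re.sub(r"his|he", "**BLOCK**", text)

# Anchored with \b on both sides: only whole words match.
anchored = re.sub(r"\b(?:his|he)\b", "**BLOCK**", text)

print(unanchored)  # one day **BLOCK** and **BLOCK** c**BLOCK**ated
print(anchored)    # one day **BLOCK** and **BLOCK** cheated
```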

Desired output

    Text P_ID P_Name  New
0                     **BLOCK** had an anniversery today
1                     NaN
2                     I like **BLOCK** and **BLOCK** 5 lists
3                     one day **BLOCK** and **BLOCK** cheated
4                     NaN

Question

How do I achieve my desired output?

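Recommended answer

Sometimes a for loop is good practice: split each Text into words and replace only the words that exactly equal a name in P_Name.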

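# Split each Text into words, replace words that exactly match a name, then rejoin with spaces
df['New'] = [pd.Series(x).replace(dict.fromkeys(y, '**BLOCK**')).str.cat(sep=' ')
             for x, y in zip(df.Text.str.split(), df.P_Name)]
# Keep the result only for rows with at least one name; assign back instead of a chained inplace call
df['New'] = df['New'].where(df.P_Name.astype(bool))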
df
                                Text  ...                                  New
0       ann had an anniversery today  ...     **BLOCK** had an anniversery today
1                       nothing here  ...                                  NaN
2  I like elisabeth and lis 5 lists   ...   I like **BLOCK** and **BLOCK** 5 lists
3         one day he and his cheated  ...  one day **BLOCK** and **BLOCK** cheated
4                          same here  ...                                  NaN
[5 rows x 4 columns]
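
As an alternative to splitting on whitespace, the word boundaries mentioned in the title can be written directly into a regular expression. The sketch below is one possible way to do that, not part of the original answer. The helper name `block_names` and its `token` parameter are hypothetical. It builds one `\b(?:name1|name2)\b` pattern per row, escapes each name with `re.escape`, and leaves the original whitespace of Text untouched (so the trailing space in row 2 is kept, unlike the split/join approach).

```python
import re
import pandas as pd


def block_names(text, names, token="**BLOCK**"):
    """Replace whole-word occurrences of each name in `text` with `token`.

    Hypothetical helper for illustration; returns NaN when `names` is empty.
    """
    if not names:
        return float("nan")
    # Try longer names first so overlapping alternatives prefer the longest.
    ordered = sorted(names, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    # A function replacement avoids any escape handling in `token`.
    return re.sub(pattern, lambda _match: token, text)


df = pd.DataFrame({
    "Text": ["ann had an anniversery today",
             "nothing here",
             "I like elisabeth and lis 5 lists ",
             "one day he and his cheated",
             "same here"],
    "P_ID": [1, 2, 3, 4, 5],
    "P_Name": [["ann"], [], ["elisabeth", "lis"], ["his", "he"], []],
})

df["New"] = [block_names(t, n) for t, n in zip(df["Text"], df["P_Name"])]
print(df["New"].tolist())
```

Note that `\b` only marks a boundary between a word character and a non-word character, so this sketch assumes every name starts and ends with a letter, digit, or underscore.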
