I have a data frame containing phrases, and I want to extract only the hyphen-separated compound words from it and put them in another data frame.

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})


This is what I have so far:

import pandas as pd

df=pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green','Kim Jong-il was here', 'President Barack Obama', 'methyl-butane', 'Derp da-derp derp', 'Pok-e-mon'],})


new = df['Phrases'].str.extract("(?P<part1>.*?)-(?P<part2>.*)")


Result:

>>> new
            part1        part2
0  Trail 1 Yellow        Green
1        Kim Jong  il was here
2             NaN          NaN
3          methyl       butane
4         Derp da    derp derp
5             Pok        e-mon


What I want is only the compound word itself, so the result should be (note that Pok-e-mon shows up as NaN because it has 2 hyphens):

>>> new
            part1        part2
0          Yellow        Green
1             Jong          il
2             NaN          NaN
3          methyl       butane
4              da         derp
5             NaN          NaN

Best answer

You can use this regex:

(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)

(?:                 # non-capturing group
    [^-\w]|^        # a character that is neither a hyphen nor a word character, or the start of the string
)
(?P<part1>
    [a-zA-Z]+       # one or more letters
)-(?P<part2>
    [a-zA-Z]+       # one or more letters
)
(?:[^-\w]|$)        # a character that is neither a hyphen nor a word character, or the end of the string
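As a rough sketch, the annotated pattern can also be compiled with `re.VERBOSE`, which lets the comments live inside the pattern itself. The names `HYPHENATED` and `split_compound` are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout whitespace and the comments.
HYPHENATED = re.compile(
    r"""
    (?:[^-\w]|^)            # start of string, or a char that is neither a hyphen nor a word char
    (?P<part1>[a-zA-Z]+)    # first half: one or more ASCII letters
    -                       # the single hyphen joining the two halves
    (?P<part2>[a-zA-Z]+)    # second half: one or more ASCII letters
    (?:[^-\w]|$)            # end of string, or a char that is neither a hyphen nor a word char
    """,
    re.VERBOSE,
)


def split_compound(phrase):
    """Return (part1, part2) for the first two-part hyphenated word, or None."""
    match = HYPHENATED.search(phrase)
    if match is None:
        return None
    return match.group("part1"), match.group("part2")


for phrase in ["Trail 1 Yellow-Green", "Kim Jong-il was here", "Pok-e-mon"]:
    print(phrase, "->", split_compound(phrase))
```

Testing the pattern on plain strings first makes it easier to see which guard rejects a given phrase before moving on to pandas.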



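Your first problem is that nothing stops `.` from matching spaces. `[a-zA-Z]` matches only letters, which keeps the match from spilling over from one word into the next.
For the Pok-e-mon case, you need to check that there is no hyphen immediately before or after the match.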
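A minimal sketch of plugging this regex into the question's own code, assuming a reasonably recent pandas in which `Series.str.extract` returns a DataFrame by default and names its columns after the named groups. The variable `pattern` is introduced here only for readability.

```python
import pandas as pd

df = pd.DataFrame({'Phrases': ['Trail 1 Yellow-Green', 'Kim Jong-il was here',
                               'President Barack Obama', 'methyl-butane',
                               'Derp da-derp derp', 'Pok-e-mon']})

# The regex from the answer: letters on both sides of a single hyphen,
# with no hyphen or word character directly before or after the pair.
pattern = r"(?:[^-\w]|^)(?P<part1>[a-zA-Z]+)-(?P<part2>[a-zA-Z]+)(?:[^-\w]|$)"

# The named groups become the columns 'part1' and 'part2';
# rows without a match are filled with NaN.
new = df['Phrases'].str.extract(pattern)
print(new)
```

If only the rows that actually contain a compound word are wanted, calling `new.dropna()` afterwards should discard the NaN rows.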



Regarding python - extracting hyphenated words from cells with phrases in Python Pandas, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/23132227/
