Slice/split a string Series at different positions


Problem description

I'm looking to split a string Series at different points depending on the length of certain substrings:

In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])
In [48]: split_locations = df.group_class.str.rfind('class')
In [49]: split_locations
Out[49]:
0    6
1    7
2    7
dtype: int64
In [50]: df
Out[50]:
      group_class
0    group9class1
1   group10class2
2  group11class20

My output should look like this:

      group_class    group    class
0    group9class1   group9   class1
1   group10class2  group10   class2
2  group11class20  group11  class20

I half thought this might work:

In [56]: df.group_class.str[:split_locations]
Out[56]:
0   NaN
1   NaN
2   NaN

How can I slice my strings by the variable locations in split_locations?

Recommended answer

This works: selecting with double brackets [[]] returns a DataFrame, so apply with axis=1 passes each row as a Series whose name is that row's index value, which you can use to index into the split_locations series:

In [119]:
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:][0], x.str[:split_locations[x.name]][0]]), axis=1)
Out[119]:
         0        1
0   class1   group9
1   class2  group10
2  class20  group11
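
For comparison, the same variable-position slicing can be sketched without apply, by pairing each string with its own split position in a list comprehension. This is only an illustrative alternative and not part of the original answer; the column names group and class are assumptions taken from the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame(
    ['group9class1', 'group10class2', 'group11class20'],
    columns=['group_class'],
)

# Position where 'class' starts in each string.
split_locations = df['group_class'].str.rfind('class')

# Pair each string with its own split position and slice in plain Python.
pairs = zip(df['group_class'], split_locations)
parts = [(s[:pos], s[pos:]) for s, pos in pairs]

# Build the two new columns, reusing the original index so rows stay aligned.
result = pd.DataFrame(parts, columns=['group', 'class'], index=df.index)
out = df.join(result)
print(out)
```

Note that rfind returns -1 when the substring is absent, so rows without 'class' would need separate handling in a sketch like this.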

Or, as @ajcr suggested, you can use str.extract:

In [106]:

df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
     group    class
0   group9   class1
1  group10   class2
2  group11  class20

Edit

Regex explanation:

The regex came from @ajcr (thanks!). It uses str.extract to extract groups, and the groups become new columns.

So ?P<group> here assigns a name to a specific capture group, and that name becomes the column name. If the name is missing, an int is used for the column name instead.
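
The effect of naming the groups can be seen by running the same pattern with and without ?P<...>. This is a small sketch on the question's sample data; the variable names unnamed and named are only illustrative.

```python
import pandas as pd

s = pd.Series(['group9class1', 'group10class2', 'group11class20'])

# Unnamed groups: the resulting columns are labeled with integers.
unnamed = s.str.extract(r'(group[0-9]+)(class[0-9]+)')

# Named groups: the resulting columns take the group names.
named = s.str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')

print(list(unnamed.columns))
print(list(named.columns))
```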

The rest should be self-explanatory: group[0-9]+ looks for the string group followed by one or more digits. The [] denote a character class, here the range 0-9, and + means one or more. For this data, [0-9] is equivalent to \d, where \d means digit.

So it could be rewritten as:

df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
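
To get the layout asked for in the question, with the original column next to the two new ones, the extracted columns can be joined back onto df. A minimal sketch, assuming the sample data from the question; the names pattern, extracted and combined are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    ['group9class1', 'group10class2', 'group11class20'],
    columns=['group_class'],
)

pattern = r'(?P<group>group\d+)(?P<class>class\d+)'

# extract returns one column per named group, indexed like the input.
extracted = df['group_class'].str.extract(pattern)

# join aligns on the index, placing the new columns next to the original.
combined = df.join(extracted)
print(combined)
```

Rows that do not match the pattern come back as missing values in both new columns.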
