Problem description
I'm looking to split a string Series at different points depending on the length of certain substrings:
In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])
In [48]: split_locations = df.group_class.str.rfind('class')
In [49]: split_locations
Out[49]:
0 6
1 7
2 7
dtype: int64
In [50]: df
Out[50]:
group_class
0 group9class1
1 group10class2
2 group11class20
My output should look like this:
group_class group class
0 group9class1 group9 class1
1 group10class2 group10 class2
2 group11class20 group11 class20
I half thought this might work:
In [56]: df.group_class.str[:split_locations]
Out[56]:
0 NaN
1 NaN
2 NaN
How can I slice my strings by the variable locations in split_locations?
Recommended answer
This works: selecting with double brackets [[]] gives a one-column DataFrame, so apply with axis=1 receives each row, and x.name is the index value of the current element, which you can use to index into the split_locations Series:
In [119]:
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:][0], x.str[:split_locations[x.name]][0]]), axis=1)
Out[119]:
0 1
0 class1 group9
1 class2 group10
2 class20 group11
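As an alternative to the row-wise apply above, here is a minimal sketch that pairs each string with its own split position and slices in plain Python. It is not part of the original answer; the names pairs and result are chosen here for illustration, and it assumes every string contains 'class' (rfind returns -1 otherwise, which would give a wrong split).

```python
import pandas as pd

df = pd.DataFrame(
    ["group9class1", "group10class2", "group11class20"],
    columns=["group_class"],
)
split_locations = df["group_class"].str.rfind("class")

# Pair each string with its own split position and slice it in plain Python.
# Assumption: "class" occurs in every string, so no position is -1.
pairs = [
    (s[:pos], s[pos:])
    for s, pos in zip(df["group_class"], split_locations)
]

# Build the two new columns, reusing df's index so the rows stay aligned.
result = df.join(pd.DataFrame(pairs, columns=["group", "class"], index=df.index))
print(result)
```

This avoids building a new Series per row, which apply with axis=1 does, at the cost of leaving the vectorized string accessor.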
Or, as @ajcr has suggested, you can use extract:
In [106]:
df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
group class
0 group9 class1
1 group10 class2
2 group11 class20
Edit
Regex explanation:
The regex came from @ajcr (thanks!). It uses str.extract to extract groups, and the groups become new columns.
So ?P<group> here gives a name to a specific capture group, and that name becomes the column name. If it is missing, an int is used for the column name instead.
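To make the effect of the group names concrete, the following sketch runs the same pattern with and without the ?P<...> labels and compares the resulting column names. The variable names s, named and unnamed are illustrative only.

```python
import pandas as pd

s = pd.Series(["group9class1", "group10class2", "group11class20"])

# Named groups: the labels inside ?P<...> become the column names.
named = s.str.extract(r"(?P<group>group[0-9]+)(?P<class>class[0-9]+)")

# Unnamed groups: the columns are numbered by group position instead.
unnamed = s.str.extract(r"(group[0-9]+)(class[0-9]+)")

print(list(named.columns))
print(list(unnamed.columns))
```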
The rest should be self-explanatory: group[0-9] looks for the string group followed by a digit in the range [0-9], which is what the [] indicate, and the trailing + in the pattern means one or more of those digits. For this data it is equivalent to group\d, where \d means digit.
So it could be rewritten as:
df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
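As a quick check, the sketch below runs both spellings of the pattern on the sample data, confirms they give the same result, and joins the extracted columns back onto the original frame to get the layout asked for in the question. The names with_range, with_d, same and result are illustrative, and the equivalence is only checked for this ASCII sample.

```python
import pandas as pd

df = pd.DataFrame(
    ["group9class1", "group10class2", "group11class20"],
    columns=["group_class"],
)

# The same extraction written with [0-9] and with \d.
with_range = df["group_class"].str.extract(
    r"(?P<group>group[0-9]+)(?P<class>class[0-9]+)"
)
with_d = df["group_class"].str.extract(r"(?P<group>group\d+)(?P<class>class\d+)")

# Both patterns should produce identical frames for this data.
same = with_range.equals(with_d)
print(same)

# Attach the new columns next to the original one.
result = df.join(with_d)
print(result)
```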