Problem description
I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:
df[df[col].str.contains('test')]
But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that contains all rows that contain the word 'test'. What can I do?
Edit (to add an example):
data = pd.read_csv(/...csv)
data has 5 columns, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the BusinessDescription column, so I used:
filtered = data[data['BusinessDescription'].str.contains('dental')==True]
and I get an empty dataframe, with just the header names of the 5 columns.
Recommended answer
It seems you need the flags parameter of Series.str.contains, since the match is case sensitive by default:
import re
filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
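As a possible alternative to passing flags, str.contains also accepts case and na arguments. The sketch below is a minimal illustration, not part of the original answer: the sample values (including the missing one) are made up, and it assumes the column may contain missing entries that should count as non-matches.

```python
import pandas as pd

# Made-up sample data; the None stands in for a missing description.
df = pd.DataFrame(
    {"BusinessDescription": ["dental fluss", "DENTAL", "Dentist", None]}
)

# case=False makes the match case insensitive.
# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = df["BusinessDescription"].str.contains("dental", case=False, na=False)
filtered = df[mask]
print(filtered)
```

Without na=False, a missing value yields a missing entry in the mask, which can raise an error when the mask is used for indexing.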
Another solution, thanks to Anton vBR, is to convert to lowercase first:
filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]
Example:
For future programming I'd recommend using the name df instead of data when referring to dataframes. That notation is the common convention on SO.
import pandas as pd
data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]
BusinessDescription
0 dental fluss
1 DENTAL
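Note that str.contains matches substrings, so 'dental' would also match inside a longer word. If only the whole word should match, one option is a regular expression with word boundaries. This is a sketch under assumptions: the 'accidental damage' row is a made-up value added to show the difference.

```python
import pandas as pd

# Made-up sample data; "accidental" contains "dental" as a substring.
df = pd.DataFrame(
    {"BusinessDescription": ["dental fluss", "DENTAL", "Dentist", "accidental damage"]}
)
col = df["BusinessDescription"]

# Plain substring match: also picks up "accidental damage".
substring_hits = df[col.str.contains("dental", case=False)]

# \b marks a word boundary, so only the standalone word matches.
word_hits = df[col.str.contains(r"\bdental\b", case=False)]

print(substring_hits)
print(word_hits)
```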
Timings:
d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)
#print (data)
In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop
In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop
Caveat:
Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
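Because the result depends on the data, it may be worth measuring both variants on your own column outside IPython. The helper below is a hypothetical sketch (compare_timings is not a pandas function) that uses the standard timeit module; the sample series mirrors the one used in the timings above.

```python
import re
import timeit

import pandas as pd


def compare_timings(series, word, repeats=5):
    """Return average seconds per run for the flags and lowercase variants."""
    flags_total = timeit.timeit(
        lambda: series.str.contains(word, flags=re.IGNORECASE), number=repeats
    )
    lower_total = timeit.timeit(
        lambda: series.str.lower().str.contains(word), number=repeats
    )
    return flags_total / repeats, lower_total / repeats


# Same three values as in the answer, repeated to get a larger series.
sample = pd.Series(["dental fluss", "DENTAL", "Dentist"] * 10000)
flags_avg, lower_avg = compare_timings(sample, "dental")
print(f"flags=re.IGNORECASE: {flags_avg * 1000:.1f} ms per run")
print(f"str.lower() first:   {lower_avg * 1000:.1f} ms per run")
```

The numbers will vary with the machine, the pandas version, and the data, so treat them as a relative comparison only.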