How to filter a pandas DataFrame by a string?

Problem description

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

Edit (to add an example):

data = pd.read_csv(/...csv)

data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the Business Description col, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 cols.

Recommended answer

It seems you need the flags parameter of Series.str.contains, because the match is case-sensitive by default:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
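As a minimal sketch of the same idea: str.contains also accepts case=False, which should select the same rows as flags=re.IGNORECASE. The sample rows below are hypothetical (they are not the asker's CSV), and the NaN row is an assumption added to show na=False, which treats missing descriptions as non-matches so the mask stays boolean.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN row stands in for a missing description.
df = pd.DataFrame(
    {"BusinessDescription": ["dental fluss", "DENTAL", "Dentist", np.nan]}
)
col = df["BusinessDescription"]

# case=False is an alternative spelling of a case-insensitive match.
# na=False counts missing values as "no match" instead of propagating NaN.
mask_case = col.str.contains("dental", case=False, na=False)
mask_flags = col.str.contains("dental", flags=re.IGNORECASE, na=False)

filtered = df[mask_case]
print(filtered)
```

Without na=False, a column containing missing values can produce a mask with NaN entries, which pandas refuses to use for boolean indexing.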

Another solution, thanks to Anton vBR, is to convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]

Example:
For future programming I'd recommend using the name df instead of data when referring to DataFrames. That notation is the common convention on SO.

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
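To check this on your own data outside IPython, a rough sketch with the standard timeit module might look like the following. The helper names with_flags and with_lower are made up for illustration, the data is the same synthetic frame as in the timings, and the numbers it prints will vary by machine and pandas version.

```python
import re
import timeit

import pandas as pd

# Same synthetic data as in the timings: 3 rows repeated 10000 times.
d = dict(BusinessDescription=["dental fluss", "DENTAL", "Dentist"])
data = pd.concat([pd.DataFrame(d)] * 10000).reset_index(drop=True)
col = data["BusinessDescription"]


def with_flags():
    # Case-insensitive regex match.
    return data[col.str.contains("dental", flags=re.IGNORECASE)]


def with_lower():
    # Lowercase first, then a case-sensitive match.
    return data[col.str.lower().str.contains("dental")]


# Both approaches should select the same rows before timings are compared.
assert with_flags().equals(with_lower())

for fn in (with_flags, with_lower):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1000:.1f} ms per call")
```

Swapping in a real column in place of the synthetic one is the more meaningful test, since the share of matching rows and the string lengths both affect the result.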
