Question
I have a function find_country_from_connection_ip which takes an IP and, after some processing, returns a country, like below:
def find_country_from_connection_ip(ip):
    # Do some processing
    return country
I am using the function inside the apply method, like below:
df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1)
As it is pretty straightforward, what I want is to evaluate a new column from an existing column in the DataFrame, which has >400000 rows.
It runs, but is terribly slow, and raises a warning like the one below:
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
I understand the problem, but can't quite figure out how to use loc with apply and lambda.
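For reference, a minimal sketch of the .loc route (assuming df was sliced from another DataFrame, which is what typically triggers this warning; the copy() call is the part that actually silences it):

# The warning is about writing into a view of another frame;
# an explicit copy makes df independent, after which assignment is safe.
df = df.copy()
df.loc[:, 'Country'] = df['IP'].apply(find_country_from_ip)

The column-wise apply on df['IP'] also avoids the row-wise axis=1 lambda, which carries extra per-row overhead.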
N.B. If you have a more efficient alternative solution that can produce the same end result, please suggest it.
**** EDIT ****
The function is mainly a lookup against an mmdb database, like below:
import re
import subprocess

def find_country_from_ip(ip):
    # Shell out to mmdblookup and pull the quoted country name from its output.
    cmd = "mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip)
    result = subprocess.Popen(cmd.split(" "), stdout=subprocess.PIPE).stdout.read().decode()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1)
    # Fall back to the registered country when no country record exists.
    cmd = "mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip)
    final_output = subprocess.Popen(cmd.split(" "), stdout=subprocess.PIPE).stdout.read().decode()
    return re.search(r'\"(.+?)\"', final_output).group(1)
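A quick smoke test, assuming mmdblookup is on the PATH and GeoIP2-Country.mmdb sits in the working directory (the test IP and the expected output are illustrative):

# Hypothetical check; the exact string depends on the mmdb snapshot used.
print(find_country_from_ip('8.8.8.8'))  # e.g. 'United States'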
This is nevertheless a costly operation, and when you have a DataFrame with >400000 rows, it is bound to take time. But how much? That is the question. It takes about 2 hours, which I think is far too long.
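For a sense of where the two hours go: every row spawns a fresh mmdblookup process. A rough micro-benchmark sketch like the one below (the test IP is arbitrary; the measured figure will vary by machine) extrapolates the per-call cost, and at roughly 18 ms per process, 400,000 calls come to about two hours:

import timeit

# Average the cost of one subprocess-based lookup over 100 calls,
# then extrapolate to a 400,000-row DataFrame.
per_call = timeit.timeit(lambda: find_country_from_ip('8.8.8.8'), number=100) / 100
print('{:.1f} ms per lookup -> ~{:.1f} h for 400,000 rows'.format(
    per_call * 1e3, per_call * 400000 / 3600))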
Answer
I would use the maxminddb-geolite2 (GeoLite) module for that.
First, install the maxminddb-geolite2 module:
pip install maxminddb-geolite2
Python Code:
import numpy as np
import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    # geo.get() raises ValueError for malformed IPs; missing keys mean "unknown"
    try:
        x = geo.get(ip)
    except ValueError:
        return np.nan
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        return np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep=r'\s*\-\s*',
                header=None, names=['ip'], engine='python')

# get unique IPs
unique_ips = x['ip'].unique()
# make a Series out of them (index == values, so .map() can look them up)
unique_ips = pd.Series(unique_ips, index=unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()
Output:
In [90]: x
Out[90]:
ip country
0 000.000.000.000 NaN
1 001.002.004.000 NaN
2 001.002.008.000 NaN
3 001.009.096.105 NaN
4 001.009.102.251 NaN
5 001.009.106.186 NaN
6 001.016.000.000 NaN
7 001.055.241.140 NaN
8 001.093.021.147 NaN
9 001.179.136.040 NaN
10 001.179.138.224 Thailand
11 001.179.140.200 Thailand
12 001.179.146.052 NaN
13 001.179.147.002 Thailand
14 001.179.153.216 Thailand
15 001.179.164.124 Thailand
16 001.179.167.188 Thailand
17 001.186.188.000 NaN
18 001.202.096.052 NaN
19 001.204.179.141 China
20 002.051.000.165 NaN
21 002.056.000.000 NaN
22 002.095.041.202 NaN
23 002.135.237.106 Kazakhstan
24 002.135.237.250 Kazakhstan
... ... ...
Timing: for 171,884 unique IPs:
In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop
In [86]: unique_ips.shape
Out[86]: (171884,)
Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:
In [93]: 400000/171884*15
Out[93]: 34.90726303786274
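Applied back to the question's DataFrame (the IP and Country column names are taken from the question; the dedup-and-map trick is the same as above):

# Look up each distinct IP once, then broadcast the results to every row.
uniq = df['IP'].unique()
df['Country'] = df['IP'].map(pd.Series(uniq, index=uniq).apply(get_country))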