Question
I have a function find_country_from_connection_ip which takes an IP and, after some processing, returns a country, like below:
def find_country_from_connection_ip(ip):
    # Do some processing
    return country
I am using the function inside the apply method, like below:
df['Country'] = df.apply(lambda x: find_country_from_ip(x['IP']), axis=1)
As it is pretty straightforward, what I want is to evaluate a new column from an existing column in the DataFrame, which has >400000 rows.
It runs, but is terribly slow, and raises a warning like the one below:
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
I understand the problem, but can't quite figure out how to use loc with apply and lambda.
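For reference, a minimal sketch of the .loc route (assuming df was sliced from another DataFrame, which is what typically triggers this warning; the copy() call is the part that actually silences it):

# The warning is about writing into a view of another frame;
# an explicit copy makes df independent, after which assignment is safe.
df = df.copy()
df.loc[:, 'Country'] = df['IP'].apply(find_country_from_ip)

The column-wise apply on df['IP'] also avoids the row-wise axis=1 lambda, which carries extra per-row overhead.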
N.B. If you have a more efficient alternative solution that can produce the same end result, please suggest it.
**** EDIT ****
The function is mainly a lookup against an mmdb database, like below:
import re
import subprocess

def find_country_from_ip(ip):
    # Shell out to mmdblookup and pull the quoted country name from its output.
    cmd = "mmdblookup --file GeoIP2-Country.mmdb --ip {} country names en".format(ip)
    result = subprocess.Popen(cmd.split(" "), stdout=subprocess.PIPE).stdout.read().decode()
    if result:
        return re.search(r'\"(.+?)\"', result).group(1)
    # Fall back to the registered country when no country record exists.
    cmd = "mmdblookup --file GeoIP2-Country.mmdb --ip {} registered_country names en".format(ip)
    final_output = subprocess.Popen(cmd.split(" "), stdout=subprocess.PIPE).stdout.read().decode()
    return re.search(r'\"(.+?)\"', final_output).group(1)
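A quick smoke test, assuming mmdblookup is on the PATH and GeoIP2-Country.mmdb sits in the working directory (the test IP and the expected output are illustrative):

# Hypothetical check; the exact string depends on the mmdb snapshot used.
print(find_country_from_ip('8.8.8.8'))  # e.g. 'United States'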
This is nevertheless a costly operation, and when you have a DataFrame with >400000 rows, it is bound to take time. But how much? That is the question. It takes about 2 hours, which I think is far too long.
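For a sense of where the two hours go: every row spawns a fresh mmdblookup process. A rough micro-benchmark sketch like the one below (the test IP is arbitrary; the measured figure will vary by machine) extrapolates the per-call cost, and at roughly 18 ms per process, 400,000 calls come to about two hours:

import timeit

# Average the cost of one subprocess-based lookup over 100 calls,
# then extrapolate to a 400,000-row DataFrame.
per_call = timeit.timeit(lambda: find_country_from_ip('8.8.8.8'), number=100) / 100
print('{:.1f} ms per lookup -> ~{:.1f} h for 400,000 rows'.format(
    per_call * 1e3, per_call * 400000 / 3600))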
Answer
I would use the maxminddb-geolite2 (GeoLite) module for that.
First, install the maxminddb-geolite2 module:
pip install maxminddb-geolite2
Python Code:
import numpy as np
import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    # geo.get() raises ValueError for malformed IPs; missing keys mean "unknown"
    try:
        x = geo.get(ip)
    except ValueError:
        return np.nan
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        return np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep=r'\s*\-\s*',
                header=None, names=['ip'], engine='python')

# get unique IPs
unique_ips = x['ip'].unique()
# make a Series out of them (index == values, so .map() can look them up)
unique_ips = pd.Series(unique_ips, index=unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()
Output:
In [90]: x
Out[90]:
ip country
0 000.000.000.000 NaN
1 001.002.004.000 NaN
2 001.002.008.000 NaN
3 001.009.096.105 NaN
4 001.009.102.251 NaN
5 001.009.106.186 NaN
6 001.016.000.000 NaN
7 001.055.241.140 NaN
8 001.093.021.147 NaN
9 001.179.136.040 NaN
10 001.179.138.224 Thailand
11 001.179.140.200 Thailand
12 001.179.146.052 NaN
13 001.179.147.002 Thailand
14 001.179.153.216 Thailand
15 001.179.164.124 Thailand
16 001.179.167.188 Thailand
17 001.186.188.000 NaN
18 001.202.096.052 NaN
19 001.204.179.141 China
20 002.051.000.165 NaN
21 002.056.000.000 NaN
22 002.095.041.202 NaN
23 002.135.237.106 Kazakhstan
24 002.135.237.250 Kazakhstan
... ... ...
Timing: for 171,884 unique IPs:
In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop
In [86]: unique_ips.shape
Out[86]: (171884,)
Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:
In [93]: 400000/171884*15
Out[93]: 34.90726303786274
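Applied back to the question's DataFrame (the IP and Country column names are taken from the question; the dedup-and-map trick is the same as above):

# Look up each distinct IP once, then broadcast the results to every row.
uniq = df['IP'].unique()
df['Country'] = df['IP'].map(pd.Series(uniq, index=uniq).apply(get_country))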