Using KDTree/KNN to return the nearest neighbors

Question

I have two pandas DataFrames in Python. One contains the college football statistics of every NFL quarterback since 2007, along with a label for the type of player each one became (Elite, Average, Below Average). The other contains this season's statistics for all college football QBs, along with a predicted label.

I want to run some sort of analysis to find the two closest NFL comparisons for every college QB, restricted to NFL players with the same label. I'd like to add the two comparable QBs as two new columns in the second DataFrame.

The feature names are the same in both DataFrames. Here is what they look like:

Player     Year    Team    GP    Comp %   YDS    TD   INT     Label
Player A   2020     ASU    12     65.5    3053   25    6     Average

For the example above, I'd like to find the two nearest neighbors of Player A in the first DataFrame that also have the label "Average". My idea was to use SciPy's KDTree and query the tree:

tree = KDTree(nfl[features], leafsize=nfl[features].shape[0]+1)
closest = []

for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)

However, the print statement returned an empty list. Is this the right way to solve my problem?

Recommended answer

.iterrows() yields (index, Series) tuples, where index is the row's index label and the Series holds that row's values, indexed by the column names.

As written, row holds that whole tuple, so row[features] indexes the tuple rather than the Series and cannot select the feature columns. What you actually want is the Series with the feature names and values, i.e. row[1]. You can either index it directly, or unpack the tuple in the loop header with for idx, row in df.iterrows(): and then use row as the Series.
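To make the tuple behaviour concrete, here is a minimal sketch. The two-row frame df and its values are hypothetical and exist only to show what iterrows() yields and how unpacking exposes the Series.

```python
import pandas as pd

# Hypothetical two-row frame, used only to inspect what iterrows() yields.
df = pd.DataFrame({'YDS': [3053, 2217], 'TD': [25, 16]}, index=['a', 'b'])
features = ['YDS', 'TD']

# Without unpacking, each item is an (index, Series) tuple.
first = next(df.iterrows())
print(type(first))         # the container is a plain tuple
print(first[0])            # element 0: the row's index label
print(first[1][features])  # element 1: the Series with the feature values

# Unpacking in the loop header gives direct access to the Series.
rows = []
for idx, row in df.iterrows():
    # One query point per row, shaped (1, n_features) as a float array.
    rows.append(row[features].to_numpy(dtype=float).reshape(1, -1))
```

Each entry of rows is then a single 2D query point, which is the shape a KD-tree query expects.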

scikit-learn is a good package for this (it is built on SciPy, so the KDTree syntax will look familiar). You'll need to adapt the code to your own requirements: for example, filter to only the "Average" players, or, if you one-hot encode the category columns, add those to the features. To give you the idea, I made up the DataFrames below as an example (the nfl one is accurate, the college one is entirely invented). The code builds the KD-tree and then takes each row of the college DataFrame to find the 2 rows of the nfl DataFrame it is closest to. I print the player names, but as print(closest) shows, the raw index arrays are available too.

import pandas as pd

nfl = pd.DataFrame([['Tom Brady','1999','Michigan',11,61.0,2217,16,6,'Average'],
                   ['Aaron Rodgers','2004','California',12,66.1,2566,24,8,'Average'],
                   ['Peyton Manning','1997','Tennessee',12,60.2,3819,36,11,'Average'],
                   ['Drew Brees','2000','Purdue',12,60.4,3668,26,12,'Average'],
                   ['Dan Marino','1982','Pitt',12,58.5,2432,17,23,'Average'],
                   ['Joe Montana','1978','Notre Dame',11,54.2,2010,10,9,'Average']],
                    columns = ['Player','Year','Team','GP','Comp %','YDS','TD','INT','Label'])


college = pd.DataFrame([['Joe Smith','2019','Illinois',11,55.6,1045,15,7,'Average'],
                   ['Mike Thomas','2019','Wisconsin',11,67,2045,19,11,'Average'],
                   ['Steve Johnson','2019','Nebraska',12,57.3,2345,9,19,'Average']],
                    columns = ['Player','Year','Team','GP','Comp %','YDS','TD','INT','Label'])


features = ['GP','Comp %','YDS','TD','INT']

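from sklearn.neighbors import KDTree

# leaf_size above the row count: all points fit in a single leaf
tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0]+1)
closest = []

# Unpack each (index, Series) tuple that iterrows() yields
for idx, row in college.iterrows():

    # query() expects a 2D float array of shape (n_queries, n_features)
    X = row[features].values.astype(float).reshape(1, -1)
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)

    collegePlayer = college.loc[idx,'Player']
    # ndx holds row positions in nfl, so look the names up by position
    closestPlayers = [ nfl['Player'].iloc[x] for x in ndx[0] ]

    print ('%s closest to: %s' %(collegePlayer, closestPlayers))

print(closest)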

Output:

Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
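The answer leaves the label filter and the two new columns to the reader. One possible way to do both is sketched below, under stated assumptions: the column names Comp 1 and Comp 2 are hypothetical, the 'Elite' labels are invented here so that the filter has something to do, and the statistics reuse the numbers from the example above. It builds one tree per label and queries all college rows with that label in a single call.

```python
import pandas as pd
from sklearn.neighbors import KDTree

cols = ['Player', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label']
features = ['GP', 'Comp %', 'YDS', 'TD', 'INT']

# Stats reuse the answer's example; the 'Elite' labels are made up for this sketch.
nfl = pd.DataFrame([
    ['Tom Brady', 11, 61.0, 2217, 16, 6, 'Average'],
    ['Aaron Rodgers', 12, 66.1, 2566, 24, 8, 'Elite'],
    ['Peyton Manning', 12, 60.2, 3819, 36, 11, 'Elite'],
    ['Drew Brees', 12, 60.4, 3668, 26, 12, 'Elite'],
    ['Dan Marino', 12, 58.5, 2432, 17, 23, 'Average'],
    ['Joe Montana', 11, 54.2, 2010, 10, 9, 'Average'],
], columns=cols)

college = pd.DataFrame([
    ['Joe Smith', 11, 55.6, 1045, 15, 7, 'Average'],
    ['Mike Thomas', 11, 67.0, 2045, 19, 11, 'Elite'],
    ['Steve Johnson', 12, 57.3, 2345, 9, 19, 'Average'],
], columns=cols)

# Hypothetical output columns for the two nearest NFL comparisons.
college['Comp 1'] = None
college['Comp 2'] = None

for label, group in college.groupby('Label'):
    # Only NFL players that share the college player's label are candidates.
    pool = nfl[nfl['Label'] == label]
    if len(pool) < 2:
        continue  # not enough candidates to return two neighbors
    tree = KDTree(pool[features].to_numpy(dtype=float))
    # Query every college row with this label at once; ndx has shape (len(group), 2).
    _, ndx = tree.query(group[features].to_numpy(dtype=float), k=2)
    # ndx holds positions within pool, so map them to names by position.
    names = pool['Player'].to_numpy()[ndx]
    college.loc[group.index, 'Comp 1'] = names[:, 0]
    college.loc[group.index, 'Comp 2'] = names[:, 1]

print(college[['Player', 'Label', 'Comp 1', 'Comp 2']])
```

Note that with raw statistics the Euclidean distance is dominated by the largest-scale feature (YDS here), so standardizing the features before building the tree may give more balanced comparisons.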


09-05 20:04