Question
I need to classify some data with (I hope) a nearest-neighbour algorithm. I've googled this problem and found a lot of libraries (including PyML, mlPy and Orange), but I'm unsure where to start.
How should I go about implementing k-NN using Python?
Answer
Particularly given the technique (k-Nearest Neighbors) that you mentioned in your question, I would strongly recommend scikits.learn. [Note: after this answer was posted, the lead developer of the project informed me of a new homepage for the project.]
A few features that I believe distinguish this library from the others (at least the other Python ML libraries that I have used, which is most of them):
- an extensive diagnostics and testing library (including plotting modules, via Matplotlib), with feature-selection algorithms, confusion matrix, ROC, precision-recall, etc.;
- a nice selection of 'batteries-included' data sets (including handwritten digits, facial images, etc.) particularly suited for ML techniques;
- extensive documentation (a nice surprise given that the project is only about two years old), including tutorials and step-by-step example code that uses the supplied data sets.
Without exception (at least that I can think of at this moment), the Python ML libraries are superb. (See the PyMVPA homepage for a list of the dozen or so most popular Python ML libraries.)
In the past 12 months, for instance, I have used ffnet (for MLP), neurolab (also for MLP), PyBrain (Q-Learning), and PyMVPA (SVM), all available from the Python Package Index. These vary significantly from each other with respect to maturity, scope, and supplied infrastructure, but I found them all to be of very high quality.
Still, the best of these might be scikits.learn; for instance, I am not aware of any Python ML library, other than scikits.learn, that includes all three of the features I mentioned above (though a few have solid example code and/or tutorials, none that I know of integrates these with a library of research-grade data sets and diagnostic algorithms).
Second, given the technique you intend to use (k-nearest neighbors), scikits.learn is a particularly good choice. It includes kNN algorithms for both regression (returns a score) and classification (returns a class label), as well as detailed sample code for each.
Using the scikits.learn k-nearest neighbor module (literally) couldn't be any easier:
>>> # import NumPy and the relevant scikit-learn module
>>> import numpy as NP
>>> from sklearn import neighbors
>>> # load one of the sklearn-supplied data sets
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> # the call to load_iris() loaded both the data and the class labels, so
>>> # bind each to its own variable
>>> data = iris.data
>>> class_labels = iris.target
>>> # construct a classifier-builder by instantiating the kNN module's primary class
>>> # (in current scikit-learn the class is KNeighborsClassifier;
>>> # it was NeighborsClassifier in early releases)
>>> kNN1 = neighbors.KNeighborsClassifier()
>>> # now construct ('train') the classifier by passing the data and class labels
>>> # to the classifier-builder
>>> kNN1.fit(data, class_labels)
KNeighborsClassifier()
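Once fitted, the classifier can classify new observations and report its accuracy. A short sketch of the full round trip using the current scikit-learn API (in modern releases the class is named `KNeighborsClassifier`; the parameters shown are defaults, not requirements):

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# load the same bundled iris data set used above
iris = datasets.load_iris()

# fit a 5-nearest-neighbor classifier on the data and class labels
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(iris.data, iris.target)

# classify one observation and measure accuracy on the training data
pred = clf.predict(iris.data[:1])
acc = clf.score(iris.data, iris.target)
print(pred, round(acc, 3))
```

Note that scoring on the training data, as here, is only a sanity check; a held-out test set gives an honest accuracy estimate.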
What's more, unlike nearly all other ML techniques, the crux of k-nearest neighbors is not coding a working classifier builder; rather, the difficult step in building a production-grade k-nearest neighbors classifier/regressor is the persistence layer, i.e., storage and fast retrieval of the data points from which the nearest neighbors are selected. For the kNN data storage layer, scikits.learn includes an algorithm for a ball tree (which I know almost nothing about, other than that it is apparently superior to the kd-tree, the traditional data structure for k-NN, because its performance doesn't degrade in higher-dimensional feature spaces).
Additionally, k-nearest neighbors requires an appropriate similarity metric (Euclidean distance is the usual choice, though not always the best one). Scikits.learn includes a stand-alone module comprised of various distance metrics, as well as testing algorithms for selecting an appropriate one.
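In current scikit-learn, the metric can be swapped at construction time via the `metric` argument; a sketch comparing Euclidean and Manhattan distance on the iris data (the training-data scoring shortcut is illustrative only):

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# fit the same 5-NN model under two different similarity metrics
for metric in ('euclidean', 'manhattan'):
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    clf.fit(iris.data, iris.target)
    print(metric, round(clf.score(iris.data, iris.target), 3))
```

In practice, a proper comparison would use cross-validation rather than training-set accuracy to pick between metrics.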
Finally, there are a few libraries I have not mentioned, either because they are out of scope (PyML, Bayesian), because they are not primarily 'libraries' for developers but rather applications for end users (e.g., Orange), or because they have unusual or difficult-to-install dependencies (e.g., mlpy, which requires the GSL, which in turn must be built from source, at least on my OS, Mac OS X).
(Note: I am not a developer/committer for scikits.learn.)