When to use numpy vs. the statistics module

Question

While working on some statistical analysis tools, I discovered there are at least 3 Python methods to calculate mean and standard deviation (not counting the "roll your own" techniques):

np.mean(), np.std() (with ddof=0 or 1)
statistics.mean(), statistics.pstdev() (and/or statistics.stdev)
scipy.stats package
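
As a minimal sketch of how the first two options line up, the snippet below computes both flavors of standard deviation with each library. The sample values are made up for illustration; the point is only that np.std(ddof=0) corresponds to statistics.pstdev() (divide by N) and np.std(ddof=1) corresponds to statistics.stdev() (divide by N - 1).

```python
import math
import statistics

import numpy as np

# Made-up sample data (an assumption for illustration), chosen so that
# the mean is 5 and the population standard deviation is 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
arr = np.array(values)

# Population standard deviation: divide the squared deviations by N.
pop_np = float(np.std(arr, ddof=0))
pop_stats = statistics.pstdev(values)

# Sample standard deviation: divide the squared deviations by N - 1.
sample_np = float(np.std(arr, ddof=1))
sample_stats = statistics.stdev(values)

print("mean:        ", float(np.mean(arr)), statistics.mean(values))
print("population sd:", pop_np, pop_stats)
print("sample sd:    ", sample_np, sample_stats)
```

Note that np.std() defaults to ddof=0, so calling it without arguments gives the population value, not the sample value.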

That has me scratching my head. There should be one obvious way to do it, right? :-) I've found some older SO posts. One compares the performance advantages of np.mean() vs statistics.mean(). It also highlights differences in the sum operator. That post is here: why-is-statistics-mean-so-slow

I am working with numpy array data, and my values fall in a small range (-1.0 to 1.0, or 0.0 to 10.0), so the numpy functions seem the obvious answer for my application. They have a good balance of speed, accuracy, and ease of implementation for the data I will be processing.

It appears the statistics module is primarily for those that have data in lists (or other forms), or for widely varying ranges [1e+5, 1.0, 1e-5]. Is that still a fair statement? Are there any numpy enhancements that address the differences in the sum operator? Do recent developments bring any other advantages?
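
To make the "sum operator" difference concrete, here is a small sketch using contrived data (an assumption, not the asker's data): a tiny value sandwiched between two huge values that cancel. In CPython, statistics.mean() accumulates with exact rational arithmetic, and math.fsum() tracks the rounding error of each addition, so both recover the small term. np.mean() accumulates in float64; NumPy's documentation notes that it often uses pairwise summation, which reduces rounding error on long arrays but cannot rescue a case like this one.

```python
import math
import statistics

import numpy as np

# Contrived data (assumption): the 1.0 is lost if it is added to 1e20
# in ordinary float64 arithmetic before the -1e20 cancels it out.
data = [1e20, 1.0, -1e20]

# Exact internal arithmetic, so the result is (a float close to) 1/3.
exact_mean = statistics.mean(data)

# Error-compensated float summation from the standard library.
fsum_mean = math.fsum(data) / len(data)

# Plain float64 accumulation; on typical builds this loses the 1.0.
numpy_mean = float(np.mean(np.array(data)))

print("statistics.mean:", exact_mean)
print("math.fsum / n:  ", fsum_mean)
print("np.mean:        ", numpy_mean)
```

For values confined to a narrow range such as -1.0 to 1.0, this kind of cancellation is far less of a concern, which is consistent with preferring the NumPy functions for that data.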

Numerical algorithms generally have positive and negative aspects: some are faster, or more accurate, or require a smaller memory footprint. When faced with a choice of 3-4 ways to do a calculation, a developer's responsibility is to select the "best" method for his/her application. Generally this is a balancing act between competing priorities and resources.

My intent is to solicit replies from programmers experienced in statistical analysis to provide insights into the strengths and weaknesses of the methods above (or other/better methods). [I'm not interested in speculation or opinions without supporting facts.] I will make my own decision based on my design requirements.

Recommended answer

Why does NumPy duplicate features of SciPy?

From the SciPy FAQ What is the difference between NumPy and SciPy?:

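In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, etc. All numerical code would reside in SciPy. However, one of NumPy's important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors.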

It recommends using SciPy over NumPy:

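In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with Python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.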

When should I use the statistics library?

From the statistics library documentation:

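The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.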

Thus I would not use it for serious (i.e. resource intensive) computation.

What is the difference between statsmodels and SciPy?

From the statsmodels about page:

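The models module of scipy.stats was originally written by Jonathan Taylor. For some time it was part of scipy but was later removed. During the Google Summer of Code 2009, statsmodels was corrected, tested, improved and released as a new package. Since then, the statsmodels development team has continued to add new models, plotting tools, and statistical methods.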

Thus you may have a requirement that SciPy is not able to fulfill, or that is better fulfilled by a dedicated library. For example, the SciPy documentation for scipy.stats.probplot notes that:

Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot.

Thus in cases like these you will need to turn to statistical libraries beyond SciPy.
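
For reference, a minimal sketch of the SciPy side of that example is shown below. The normally distributed sample is synthetic (an assumption for illustration). With the default fit=True and no plot target, scipy.stats.probplot returns the theoretical quantiles, the ordered sample values, and the parameters of a least-squares line through them; statsmodels.api.ProbPlot is where the documentation points for more options.

```python
import numpy as np
from scipy import stats

# Synthetic, roughly standard-normal sample (an assumption for illustration).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Returns ((theoretical quantiles, ordered sample values),
#          (slope, intercept, correlation coefficient of the fit)).
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print("points:", len(osm))
print("slope, intercept, r:", slope, intercept, r)
```

An r value close to 1 suggests the sample is consistent with the chosen distribution.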
