Problem description
I have a dataframe in pandas that I'm using to produce a scatterplot, and want to include a regression line for the plot. Right now I'm trying to do this with polyfit.
Here is my code:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from numpy import *
table1 = pd.DataFrame.from_csv('upregulated_genes.txt', sep='\t', header=0, index_col=0)
table2 = pd.DataFrame.from_csv('misson_genes.txt', sep='\t', header=0, index_col=0)
table1 = table1.join(table2, how='outer')
table1 = table1.dropna(how='any')
table1 = table1.replace('#DIV/0!', 0)
# scatterplot
plt.scatter(table1['log2 fold change misson'], table1['log2 fold change'])
plt.ylabel('log2 expression fold change')
plt.xlabel('log2 expression fold change Misson et al. 2005')
plt.title('Root Early Upregulated Genes')
plt.axis([0,12,-5,12])
# this is the part I'm unsure about
regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1)
plt.show()
But I get the following error:
TypeError: cannot concatenate 'str' and 'float' objects
Does anyone know where I'm going wrong here? I'm also unsure how to add the regression line to my plot. Any other general comments on my code would also be hugely appreciated, I'm still a beginner.
Recommended answer
Instead of replacing '#DIV/0!' by hand, force the data to be numeric. This does two things at once: it ensures that the result is numeric type (not str), and it substitutes NaN for any entries that cannot be parsed as a number. Example:
In [5]: Series([1, 2, 'blah', '#DIV/0!']).convert_objects(convert_numeric=True)
Out[5]:
0 1
1 2
2 NaN
3 NaN
dtype: float64
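Note that convert_objects is no longer available in recent pandas releases. A minimal sketch of the same coercion using pd.to_numeric with errors='coerce', assuming a pandas version that provides it; the sample values mirror the example above, and the DataFrame column names are made up for illustration:

```python
import pandas as pd

# Same sample values as in the example above.
s = pd.Series([1, 2, 'blah', '#DIV/0!'])

# errors='coerce' turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(s, errors='coerce')
print(numeric.dtype)
print(numeric.isna().tolist())

# Applied to every column of a DataFrame (hypothetical column names).
df = pd.DataFrame({'a': ['1.5', '#DIV/0!'], 'b': ['2', '3']})
df_numeric = df.apply(pd.to_numeric, errors='coerce')
print(df_numeric.dtypes)
```

Coercing before calling dropna means the unparseable cells are dropped along with the genuinely missing ones.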
This should fix your error. But, on the general subject of fitting a line to data, I keep handy two ways of doing this that I like better than polyfit. The second of the two is more robust (and can potentially return much more detailed information about the statistics) but it requires statsmodels.
import pandas as pd
from scipy.stats import linregress

def fit_line1(x, y):
    """Return slope, intercept of best fit line."""
    # Remove entries where either x or y is NaN.
    clean_data = pd.concat([x, y], axis=1).dropna(axis=0)  # drop rows with any NaN
    x_clean, y_clean = clean_data.iloc[:, 0], clean_data.iloc[:, 1]
    slope, intercept, r, p, stderr = linregress(x_clean, y_clean)
    return slope, intercept  # could also return stderr

import statsmodels.api as sm

def fit_line2(x, y):
    """Return slope, intercept of best fit line."""
    X = sm.add_constant(x)  # prepends the intercept column
    model = sm.OLS(y, X, missing='drop')  # ignores entries where x or y is NaN
    fit = model.fit()
    intercept, slope = fit.params  # constant first, then the slope
    return slope, intercept  # could also return stderr in each via fit.bse
To plot it, do something like
import numpy as np

# x and y are the two numeric columns; call this before plt.show()
m, b = fit_line2(x, y)
N = 100 # could be just 2 if you are only drawing a straight line...
points = np.linspace(x.min(), x.max(), N)
plt.plot(points, m*points + b)
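If you would rather stay with polyfit, as in the question, the same idea applies once the columns are numeric. The following is only a sketch under stated assumptions: the data and the column names 'x' and 'y' are invented stand-ins for the two fold-change columns, and pd.to_numeric is used for the coercion step.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the two fold-change columns; one cell holds
# the spreadsheet error string that triggers the original TypeError.
df = pd.DataFrame({
    'x': ['0', '1', '2', '3', '#DIV/0!'],
    'y': ['1', '3', '5', '7', '9'],
})

# Coerce to numbers first, then drop rows where either value is missing.
clean = df.apply(pd.to_numeric, errors='coerce').dropna()

# A degree-1 polyfit returns the coefficients highest power first.
m, b = np.polyfit(clean['x'], clean['y'], 1)

# Two points are enough to draw a straight line.
points = np.linspace(clean['x'].min(), clean['x'].max(), 2)
line = m * points + b
print(m, b)
# With matplotlib: plt.scatter(clean['x'], clean['y']); plt.plot(points, line)
```

The row containing '#DIV/0!' is dropped rather than replaced with 0, so it does not pull the fitted line toward an artificial point.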