Question
I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:
data = pd.read_csv('xxxx.csv')
After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered
X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)
This resulted in the following error:
IndexError: tuple index out of range
What's wrong here? Also, I'd like to know how to:
- visualize the result
- make predictions based on the result
I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.
Can you please help? Thank you very much for your time.
PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.
Recommended answer
Let's assume your csv looks something like:
c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...
I generated the data like this:
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))
This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
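The save step itself is not shown here. A minimal sketch of one way to do it, assuming pandas' DataFrame.to_csv is used and that the columns are named c1 and c2 as in the header above (the seeded generator is also an assumption, added only so the file is reproducible):

```python
import numpy as np
import pandas as pd

length = 10
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Same construction as above, kept 1-D so each array becomes one column.
x = np.arange(length, dtype=float)
y = x + rng.random(length) * 10

# Column names c1 and c2 match the CSV header shown earlier.
frame = pd.DataFrame({"c1": x, "c2": y})
frame.to_csv("test.csv", index=False)  # index=False avoids writing a row-number column
```

Reading the file back then looks like this: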
import pandas as pd
data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print(x)  # prints something like: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
You need to take a look at the shape of the data you are feeding into .fit().
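Here x.shape = (10,) but sklearn expects a 2-D array of shape (n_samples, n_features), so we need it to be (10, 1); see the sklearn documentation. We treat y the same way (sklearn also accepts a 1-D y, but a column keeps both shapes consistent). So we reshape: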
x = x.reshape(length, 1)
y = y.reshape(length, 1)
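As an aside, the reshape does not have to spell out the length. The sketch below uses a small stand-in DataFrame (hypothetical values, same column names as above) to compare the 1-D array with two ways of getting a single-column 2-D input: reshape(-1, 1) and double-bracket column selection.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same column names as the CSV above.
data = pd.DataFrame({"c1": np.arange(10, dtype=float),
                     "c2": np.arange(10, dtype=float) * 2.0})

x_flat = data["c1"].values     # 1-D array, shape (10,)
x_col = x_flat.reshape(-1, 1)  # 2-D array, shape (10, 1); -1 lets NumPy infer the row count
x_frame = data[["c1"]]         # double brackets keep a one-column DataFrame, shape (10, 1)

print(x_flat.shape, x_col.shape, x_frame.shape)  # (10,) (10, 1) (10, 1)
```

Either 2-D form should be usable as the first argument to fit().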
Now we create the regression object and then call fit():
regr = linear_model.LinearRegression()
regr.fit(x, y)
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y, color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
See sklearn linear regression example.
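The question also asks about making predictions. A minimal sketch, assuming stand-in data built the same way as the generated example above (the seed and the new input values 10, 11, 12 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn import linear_model

# Stand-in data built the same way as the generated example above.
length = 10
rng = np.random.default_rng(0)  # arbitrary seed
x = np.arange(length, dtype=float).reshape(-1, 1)
y = x + rng.random((length, 1)) * 10

regr = linear_model.LinearRegression()
regr.fit(x, y)

# New inputs must also be 2-D: one row per sample, one column per feature.
x_new = np.array([[10.0], [11.0], [12.0]])
y_new = regr.predict(x_new)

print(regr.coef_, regr.intercept_)  # fitted slope and intercept
print(y_new.shape)                  # (3, 1), because y was fitted as a column
```

The fitted slope and intercept are available as regr.coef_ and regr.intercept_, so each prediction is just slope * x + intercept.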