Problem description
This question will unfortunately be a duplicate, but I could not fix the issue in my code, even after looking at the other similar questions and their answers. I need to split my dataset into a train set and a test set. However, I seem to be making a mistake when I add a new column for the predicted cluster. The error that I get is:
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
This is separate from the ipykernel package so we can avoid doing imports until
There are a few questions about this error, but I am probably doing something wrong, as I have not fixed the issue yet and I am still getting the same error as above. The dataset is the following:
Date Link Value
0 03/15/2020 https://www.bbc.com 1
1 03/15/2020 https://www.netflix.com 4
2 03/15/2020 https://www.google.com 10
...
I have split the dataset into train and test sets as follows:
import pandas as pd
import sklearn
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import string as st
train_data=df.Link.tolist()
df_train=pd.DataFrame(train_data, columns = ['Review'])
X = df_train
X_train, X_test = train_test_split(
    X, test_size=0.4).copy()
X_test, X_val = train_test_split(
    X_test, test_size=0.5).copy()
print(X_train.isna().sum())
print(X_test.isna().sum())
stop_words = stopwords.words('english')
def preprocessor(t):
    # Keep letters only; lowercase so tokens match the lowercase stop-word list
    t = re.sub(r"[^a-zA-Z]", " ", t.lower())
    words = word_tokenize(t)
    w_lemm = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return w_lemm
vect = TfidfVectorizer(tokenizer=preprocessor)
vectorized_text = vect.fit_transform(X_train['Review'])
kmeans = KMeans(n_clusters=3).fit(vectorized_text)
The lines of code that cause the error are:
cl=kmeans.predict(vectorized_text)
X_train['Cluster']=pd.Series(cl, index=X_train.index)
I think these two questions should have been able to help me with the code:
How to add k-means predicted clusters in a column to a dataframe in Python
How to deal with SettingWithCopyWarning in Pandas?
but something is still wrong in my code.
Could you please have a look at it and help me fix this issue before closing this question as a duplicate?
Recommended answer
IMHO, train_test_split gives you a list, and when you call copy() on it, that copy() is the list's operation, not pandas'. The DataFrames inside the list are left as they were, which is why pandas still raises its infamous SettingWithCopyWarning.
So you only create a shallow copy of the list, not of its elements. In other words,
X_train, X_test = train_test_split(X, test_size=0.4).copy()
is equivalent to:
train_test = train_test_split(X, test_size=0.4)
train_test_copy = train_test.copy()
X_train, X_test = train_test_copy[0], train_test_copy[1]
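The shallow-copy behaviour can be checked directly. The following is a minimal sketch, assuming pandas, NumPy, and scikit-learn are installed; the names parts and parts_copy and the random data are illustrative and not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data; any DataFrame would do.
X = pd.DataFrame(np.random.rand(10, 3))

# train_test_split returns a list holding the train and test DataFrames.
parts = train_test_split(X, test_size=0.4, random_state=0)

# list.copy() builds a new list but reuses the very same DataFrame objects.
parts_copy = parts.copy()

same_objects = all(a is b for a, b in zip(parts, parts_copy))
print(type(parts).__name__, same_objects)
```

If same_objects is true, calling copy() on the result of train_test_split did nothing to the DataFrames themselves.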
Since variables only hold references to DataFrame objects, X_train and X_test may or may not share data with X. If you want independent DataFrames, you should explicitly call copy() on each one:
X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()
or
X_train, X_test = [d.copy() for d in train_test_split(X, test_size=0.4)]
Then X_train and X_test are each a new DataFrame backed by its own data in memory.
Update: I tested the following code and it ran without any warnings:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
X = pd.DataFrame(np.random.rand(100, 3))
X_train, X_test = train_test_split(X, test_size=0.4)
X_train, X_test = X_train.copy(), X_test.copy()
X_train['abcd'] = 1
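The same fix can be applied to the clustering step from the question. This is only a sketch under some assumptions: the URL list is a made-up stand-in for the Link column, the default TfidfVectorizer tokenizer replaces the question's NLTK preprocessor to keep the example self-contained, and random_state and n_init are set only for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the question's Link column.
df = pd.DataFrame({"Review": [
    "https://www.bbc.com", "https://www.netflix.com",
    "https://www.google.com", "https://www.bbc.co.uk/news",
    "https://www.netflix.com/browse", "https://www.google.com/maps",
    "https://www.wikipedia.org", "https://www.python.org",
    "https://www.github.com", "https://www.stackoverflow.com",
]})

# Copy each DataFrame, not the list that train_test_split returns.
X_train, X_test = [d.copy() for d in
                   train_test_split(df, test_size=0.4, random_state=0)]

vect = TfidfVectorizer()
vectorized_text = vect.fit_transform(X_train["Review"])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectorized_text)

# X_train is an independent DataFrame, so adding a column is unambiguous.
X_train["Cluster"] = kmeans.predict(vectorized_text)
print(X_train)
```

Because X_train owns its data, the new Cluster column is written to X_train only and df is left unchanged.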