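Python newbie here. I'm working with a large dataset in PythonAnywhere. For some reason my CSV brought in "Year" as text. I was able to use pd.to_numeric to turn it into a number, but now it's a float and I'd like an integer. I tried .dropna().apply(np.int64), but it still comes through as a float. I need the dropna because there are apparently some missing values.
Code: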
import pandas as pd
import numpy as np
movies_df = pd.read_csv("movies_All.csv")
recentdf = movies_df.copy()
recentdf['Year'] = pd.to_numeric(recentdf['Year'], errors = 'coerce')
recentdf['Year'] = recentdf['Year'].dropna().apply(np.int64)
#recentdf = recentdf[recentdf['Year'] > 2000]
print(recentdf['Year'].head())
Output: Name: Year, dtype: float64
Best answer
I'm confused. With the input you've given, your code works for me:
import pandas as pd, numpy as np
from io import StringIO
input = """
movieId,title,Year
1,Toy Story (1995),1995.0
2,Jumanji (1995),1995.0
"""
df = pd.read_csv(StringIO(input))
df['Year'] = df['Year'].dropna().apply(np.int64)
print(df["Year"].head())
Output
0 1995
1 1995
Name: Year, dtype: int64
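One possible explanation for the difference, offered as a sketch rather than a diagnosis of your actual file: the sample above has no missing years. When some are missing, dropna() returns a shorter Series, and assigning it back to the column realigns it on the index, so the dropped rows come back as NaN and the column becomes float64 again. The sample data and the column names Year_roundtrip and Year_int below are made up for illustration; the nullable "Int64" dtype assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd
from io import StringIO

# Hypothetical sample: one missing year and one unparseable year.
csv_text = """movieId,title,Year
1,Toy Story (1995),1995
2,Jumanji (1995),
3,Heat (1995),unknown
"""
df = pd.read_csv(StringIO(csv_text))
years = pd.to_numeric(df["Year"], errors="coerce")

# dropna() removes rows 1 and 2; assigning the result back realigns on the
# index, so those rows reappear as NaN and the column is float64 again.
df["Year_roundtrip"] = years.dropna().apply(np.int64)
print(df["Year_roundtrip"].dtype)

# The nullable integer dtype stores missing values as <NA> and stays integer.
df["Year_int"] = years.astype("Int64")
print(df["Year_int"].dtype)
```

If you don't need the rows without a year at all, another option is to drop them from the whole DataFrame first (for example with dropna(subset=["Year"])) and then cast the column with astype("int64").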
Edit: following the discussion below.
import pandas as pd, numpy as np
from io import StringIO
input = """
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
11,"American President, The (1995)",Comedy|Drama|Romance
12,Dracula: Dead and Loving It (1995),Comedy|Horror
13,Balto (1995),Adventure|Animation|Children
14,Nixon (1995),Drama
"""
df = pd.read_csv(StringIO(input))
df["Year"] = df["title"].apply(lambda title: title[-5:-1])
df['Year'] = df['Year'].dropna().apply(np.int64)
print(df["Year"].head())
Output
0 1995
1 1995
2 1995
3 1995
4 1995
...
Name: Year, dtype: int64
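The slice title[-5:-1] relies on every title ending in "(YYYY)". If some titles in the real file lack a year or have trailing whitespace, a regex-based extraction may be more forgiving. This is only a sketch under that assumption; the sample titles are invented, and the pattern is one guess at the title format.

```python
import pandas as pd

# Hypothetical titles; the last one has no year in parentheses.
df = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "American President, The (1995) ",
        "Untitled Project",
    ]
})

# Capture four digits inside trailing parentheses; rows without a match become NaN.
year_text = df["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to numbers, then to the nullable integer dtype so missing years stay <NA>.
df["Year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(df["Year"])
```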