Datawhale Data Analysis Course, Chapter 1

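These notes are split into two parts: background knowledge and the project.

For more detailed background material, see:

pandas basics
See chapters 1 and 2 of: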
https://github.com/datawhalechina/joyful-pandas

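Background knowledge needed

1. Importing data
TSV: a text format whose fields are separated by tabs
CSV: a text format whose fields are separated by commas
For details, see chapter 6 of Python for Data Analysis: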
https://github.com/Knowledge-Discovery-in-Databases/team-learning/blob/master/%E7%AC%AC06%E7%AB%A0%20%E6%95%B0%E6%8D%AE%E5%8A%A0%E8%BD%BD%E3%80%81%E5%AD%98%E5%82%A8%E4%B8%8E%E6%96%87%E4%BB%B6%E6%A0%BC%E5%BC%8F.md
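The difference between the two formats can be sketched as follows. The sample text and column names are made up for illustration; the only thing that changes between the two calls is the sep argument of pd.read_csv.

```python
import io
import pandas as pd

# Hypothetical sample data: the same table written as CSV and as TSV
csv_text = "PassengerId,Name\n1,Braund\n2,Cumings\n"
tsv_text = "PassengerId\tName\n1\tBraund\n2\tCumings\n"

# read_csv splits on commas by default
df_csv = pd.read_csv(io.StringIO(csv_text))

# For tab-separated data, pass sep="\t"
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls produce the same DataFrame
print(df_csv.equals(df_tsv))
```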

# Import packages
import numpy as np
import pandas as pd
import os

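# Print the current working directory, change it, then print it again from the shell
# (!pwd is IPython/Jupyter syntax and does not work in a plain Python script)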
print(os.getcwd())
os.chdir('/Users/mofashipython')
!pwd

/Users/mofashipython/prog/p
/Users/mofashipython

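# Read a file from the current working directory
df = pd.read_csv('train.csv')
df.head()  # show the first 5 rows; pass n to change the count

# Read a file using an absolute path
df = pd.read_csv('/Users/mofashipython/train.csv')
df.tail()  # show the last 5 rows; pass n to change the count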


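# Show a summary of the DataFrame (column dtypes, non-null counts, memory usage)
df.info()

# Read the file in chunks of 1000 rows; this returns an iterator of DataFrames, not a DataFrame
chunker = pd.read_csv('train.csv', chunksize=1000)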
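A minimal sketch of how the chunked reader might be consumed. The in-memory data, the column names id and value, and the chunk size of 4 are assumptions chosen to keep the example small.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV held in memory instead of train.csv
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize set, read_csv yields DataFrames of at most 4 rows each
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
total = 0
for chunk in chunker:
    chunk_sizes.append(len(chunk))   # rows in this chunk
    total += chunk["value"].sum()    # aggregate without loading everything at once

print(chunk_sizes, total)
```

Processing one chunk at a time keeps memory use bounded, which matters for files too large to load in one go.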



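# Replace the column labels and choose the index column
# name1 is a list of new column names and 'name2' is one of them; header=0 discards the file's original header row
df = pd.read_csv('file', names=name1, index_col='name2', header=0)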
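A small self-contained sketch of the same call. The CSV text and the replacement names in new_names are hypothetical; they stand in for 'file' and name1 above.

```python
import io
import pandas as pd

# Hypothetical file contents with an English header row
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

# Replacement column names (an assumption for this example)
new_names = ["id", "survived", "pclass"]

# header=0 marks the first line as the old header so it is dropped,
# names supplies the new labels, and index_col picks one of them as the index
df_renamed = pd.read_csv(
    io.StringIO(csv_text),
    names=new_names,
    index_col="id",
    header=0,
)
print(df_renamed)
```

Without header=0, the original header line would be read as a row of data.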



# Check for missing values (True marks a missing cell)
df.isnull().head()

# View the column labels and the row index
df.columns
df.index


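# View common summary statistics
df.describe()
'''
count : number of non-null values
mean : mean of the values
std : standard deviation of the values
min : minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (median)
75% : 75th percentile (third quartile)
max : maximum value
'''

df['column_name'].describe()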
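To make the percentile rows concrete, here is a short sketch on a made-up Series. It checks that the 50% row matches median() and the 25% row matches quantile(0.25).

```python
import pandas as pd

# Hypothetical values, chosen so the quartiles are easy to verify by hand
s = pd.Series([1, 2, 3, 4, 5])

summary = s.describe()

# describe() returns a Series indexed by the statistic names
print(summary["count"])   # number of non-null values
print(summary["50%"])     # same as s.median()
print(summary["25%"])     # same as s.quantile(0.25)
```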


# Save the DataFrame to a file in the current directory
df.to_csv('train_chinese.csv')

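2. Basic pandas usage
pandas has two main data structures, DataFrame and Series.
A Series is a one-dimensional structure made up of an index and values.
A DataFrame is a two-dimensional structure with an index, columns, and values.
Each column of a DataFrame is a Series, so a DataFrame can be built from several Series.

# Create a Series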
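The relationship between the two structures can be sketched like this. The Series names Age and Fare and their values are made up for the example; pd.concat with axis=1 is one of several ways to combine Series into a DataFrame.

```python
import pandas as pd

# Two hypothetical Series sharing the same index
age = pd.Series([22, 38], index=["a", "b"], name="Age")
fare = pd.Series([7.25, 71.28], index=["a", "b"], name="Fare")

# Place the Series side by side; each Series name becomes a column label
frame = pd.concat([age, fare], axis=1)

# Selecting a single column gives back a Series
col = frame["Age"]
print(type(col).__name__)
```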
v=[1,2,3,4,5]
i=[2,3,4,5,6]
s = pd.Series(v,index = i)
s


2 1
3 2
4 3
5 4
6 5
dtype: int64



# Create a DataFrame
i =[1,2,3]
c =["one", "two", "three"]
v = np.random.rand(9).reshape(3,3)
d = pd.DataFrame(v, index = i, columns = c)
d


one    two    three
1    0.491216    0.826787    0.002878
2    0.751016    0.849535    0.738048
3    0.066599    0.268772    0.210717

# Values with matching row and column labels are added; positions without a match in both frames become NaN
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),columns=['a', 'b', 'c'],index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),columns=['a', 'e', 'c'],index=['first', 'one', 'two', 'second'])
frame1_a + frame1_b

          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN
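If the NaN values are not wanted, one option (not covered in the notes above, added here as a sketch) is DataFrame.add with fill_value. It treats a value missing from one frame as 0; cells missing from both frames still come out as NaN.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])

# Plain addition: any label missing from either frame gives NaN
plain = frame1_a + frame1_b

# add() with fill_value=0: a value missing from only one frame is treated as 0
filled = frame1_a.add(frame1_b, fill_value=0)

print(plain.loc['one', 'b'])     # missing in frame1_b, so NaN
print(filled.loc['one', 'b'])    # 1.0 from frame1_a plus the fill value 0
print(filled.loc['first', 'b'])  # missing in both frames, still NaN
```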

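3. Three common ways to index
df.iloc: position-based (integer) indexing
df.loc: label-based indexing
[]: slicing (also column selection and boolean filtering)
df.loc[row_labels, column_labels]
A comma separates the dimensions. loc slices include both endpoints, and loc also accepts boolean arrays (True/False).

# All rows of the Cabin column; show the first 5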
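The endpoint behaviour is easy to get wrong, so here is a minimal sketch on a made-up DataFrame. The column x and the index labels a to d are assumptions for the example.

```python
import pandas as pd

# Hypothetical data with string labels so labels and positions differ
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# loc slices by label and includes BOTH endpoints: rows a, b, c
by_label = df.loc["a":"c"]

# iloc slices by position and excludes the stop position: rows 0 and 1
by_pos = df.iloc[0:2]

# loc also accepts a boolean mask for the rows, plus a column label
mask = df.loc[df["x"] > 20, "x"]

print(len(by_label), len(by_pos), list(mask))
```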
df.loc[:,'Cabin'].head()

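# Reset the index to the default integer index
# Pass drop=True to discard the old index instead of keeping it as a column
df = df.reset_index(drop=True)


# Rows 100, 105, 108 and columns Pclass, Name, Sex (by label)
df.loc[[100,105,108],['Pclass','Name','Sex']]



# Rows 100, 105, 108 and columns 2, 3, 4 (by integer position)
df.iloc[[100,105,108],[2,3,4]]



# Conditional (boolean) selection
df[df["Age"] < 10].head(3)
midage = df[(df["Age"] > 10) & (df["Age"] < 50)]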

4. Sorting
sort_values: sort by values
sort_index: sort by index labels
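A short sketch of both methods on made-up data. The columns Fare and Age and the index values are assumptions for the example.

```python
import pandas as pd

# Hypothetical rows with a deliberately unordered index
df = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.25], "Age": [22, 38, 35]},
    index=[2, 3, 1],
)

# Sort by Fare, breaking ties by Age, both in descending order
by_value = df.sort_values(by=["Fare", "Age"], ascending=False)

# Sort by the index labels in ascending order
by_index = df.sort_index()

print(list(by_value.index))
print(list(by_index.index))
```

Both methods return a new DataFrame and leave the original unchanged unless inplace=True is passed.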


Titanic project

# Import packages
import numpy as np
import pandas as pd

# Read the file using an absolute path
df = pd.read_csv('/Users/mofashipython/train.csv')

# Rename the column labels to Chinese and use 乘客ID (passenger ID) as the index
df = pd.read_csv('train.csv', names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)

# Check for missing values
df.isnull().head()

# Show a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 是否幸存 891 non-null int64
1 仓位等级 891 non-null int64
2 姓名 891 non-null object
3 性别 891 non-null object
4 年龄 714 non-null float64
5 兄弟姐妹个数 891 non-null int64
6 父母子女个数 891 non-null int64
7 船票信息 891 non-null object
8 票价 891 non-null float64
9 客舱 204 non-null object
10 登船港口 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

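# Sort by 票价 (fare), then 年龄 (age), in descending order
df.sort_values(by=['票价', '年龄'], ascending=False).head(10)

Analysis: 8 of the 10 passengers with the highest fares (the relatively wealthy) survived, which suggests that wealthier passengers were more likely to survive.

# Common summary statistics for the 票价 (fare) column
df['票价'].describe()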

count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: 票价, dtype: float64

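Analysis:

The mean is about 32.20.
The standard deviation is about 49.69, so fares vary widely.

The median (14.45) is well below the mean, so the distribution is right-skewed: a few very high fares pull the mean up.
The 75th percentile (31.00) is close to the mean.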
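Why a median far below the mean points to a few very large values can be seen in a tiny sketch. The five fares below are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical fares: four ordinary tickets and one very expensive one
fares = pd.Series([7.0, 8.0, 14.0, 31.0, 512.0])

mean_fare = fares.mean()      # pulled upward by the single large value
median_fare = fares.median()  # middle value, unaffected by the outlier

print(mean_fare, median_fare)
```

The single large fare moves the mean far above the median, which is the same pattern the describe() output shows for the real data.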

May you gather plenty.
