The data used in the Pandas chapter can be downloaded from the following link:
https://files.cnblogs.com/files/AI-robort/Titanic_Data-master.zip
Pandas: a data analysis and processing library
In [1]:
import pandas as pd
In [4]:
df=pd.read_csv('./Titanic_Data-master/Titanic_Data-master/train.csv')
.head(): returns the first rows of the table. With no argument it returns 5 rows; pass a number to choose how many.
In [5]:
df.head(6)
Out[5]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
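As a minimal sketch of how the row count passed to `.head()` behaves, the snippet below uses a small hand-built table instead of the CSV file. The name `df_demo` and its three columns are assumptions made for illustration; `tail()` is shown only as the counterpart for the end of the table.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (a few columns, seven rows).
df_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7],
    "Survived": [0, 1, 1, 1, 0, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0],
})

first_default = df_demo.head()   # no argument: the first 5 rows
first_three = df_demo.head(3)    # explicit row count
last_two = df_demo.tail(2)       # counterpart of head() for the last rows

print(first_three)
```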
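.info(): prints a summary of the DataFrame: the index range, each column with its non-null count and dtype, and the memory usage.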
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
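The non-null counts reported by `.info()` show which columns have missing values (here `Age`, `Cabin` and `Embarked`). A possible way to get the missing counts directly is `isna().sum()`. The sketch below runs on a small hypothetical table, `df_demo`, not on the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing entries.
df_demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```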
Inspecting the table's attributes and details
In [7]:
df.index  # the row index
Out[7]:
RangeIndex(start=0, stop=891, step=1)
In [8]:
df.columns  # the name of each column
Out[8]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
In [9]:
df.dtypes  # the data type of each column
Out[9]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [10]:
df.values  # the data as a NumPy array, one inner array per row
Out[10]:
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
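The four attributes above can be tried on any DataFrame. The sketch below uses a two-row hypothetical table (`df_demo`, with invented column names) and also shows `to_numpy()`, which the pandas documentation recommends over `.values` for getting the underlying array.

```python
import pandas as pd

# Hypothetical two-row table for illustration.
df_demo = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

print(df_demo.index)            # row labels
print(list(df_demo.columns))    # column names
print(df_demo.dtypes)           # one dtype per column

values = df_demo.to_numpy()     # same data as a NumPy array
print(values.shape)             # (2, 2): 2 rows, 2 columns
```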
Creating a DataFrame yourself
In [11]:
data = {'country': ['aaa', 'bbb', 'ccc'], 'population': [10, 12, 14]}
df_data = pd.DataFrame(data)
df_data
Out[11]:
| | country | population |
|---|---|---|
| 0 | aaa | 10 |
| 1 | bbb | 12 |
| 2 | ccc | 14 |
In [12]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
country       3 non-null object
population    3 non-null int64
dtypes: int64(1), object(1)
memory usage: 128.0+ bytes
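The dictionary above maps each column name to a list of values. The same table can also be built from a list of row records, which may be more convenient when data arrives one row at a time. This is a sketch of that alternative, not part of the original notebook; the names `records`, `df_records` and `df_dict` are illustrative.

```python
import pandas as pd

# Row-oriented input: one dict per row.
records = [
    {"country": "aaa", "population": 10},
    {"country": "bbb", "population": 12},
    {"country": "ccc", "population": 14},
]
df_records = pd.DataFrame(records)

# Column-oriented input, as in the cell above.
df_dict = pd.DataFrame({"country": ["aaa", "bbb", "ccc"],
                        "population": [10, 12, 14]})

# equals() compares shape, values and dtypes.
same = df_records.equals(df_dict)
print(same)
```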
In [15]:
age = df['Age']  # select a single column by name
age[:5]  # show the first 5 rows
Out[15]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
Series: a single row or column of a DataFrame
In [16]:
age.index
Out[16]:
RangeIndex(start=0, stop=891, step=1)
In [17]:
age.values[:5]
Out[17]:
array([22., 38., 26., 35., 35.])
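To illustrate that both a column and a row come back as a Series, here is a small sketch on a hypothetical two-column table (`df_demo`). Selecting a row by position with `.iloc` is an addition for illustration; the notebook itself only selects a column.

```python
import pandas as pd

# Hypothetical sample table.
df_demo = pd.DataFrame({"Age": [22.0, 38.0, 26.0],
                        "Fare": [7.25, 71.2833, 7.925]})

col = df_demo["Age"]    # a column: Series indexed by the row labels
row = df_demo.iloc[0]   # a row: Series indexed by the column names

print(type(col).__name__, col.name)
print(type(row).__name__, list(row.index))
```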
In [18]:
df.head()
Out[18]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [19]:
df['Age'][:5]
Out[19]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
Changing the index
In [20]:
df = df.set_index('Name')
df.head()
Out[20]:
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
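`set_index` returns a new DataFrame and moves the chosen column out of the regular columns, which is why the notebook assigns the result back to `df`. The sketch below shows this on a hypothetical two-row table, together with `reset_index`, which reverses the change. The names `df_demo`, `by_name` and `restored` are illustrative.

```python
import pandas as pd

# Hypothetical two-row sample.
df_demo = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"],
    "Age": [22.0, 35.0],
    "Pclass": [3, 3],
})

by_name = df_demo.set_index("Name")   # new DataFrame; df_demo is unchanged
restored = by_name.reset_index()      # moves the index back into a column

print(list(by_name.columns))
print(list(restored.columns))
```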
In [21]:
df['Age'][:5]
Out[21]:
Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
Allen, Mr. William Henry                               35.0
Name: Age, dtype: float64
In [25]:
age = df['Age']
age[:5]
Out[25]:
Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
Allen, Mr. William Henry                               35.0
Name: Age, dtype: float64
In [26]:
age['Allen, Mr. William Henry']  # look up the value for an index label
Out[26]:
35.0
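Once the index holds names, values can be looked up by label. The sketch below shows the explicit accessors `.loc` (by label) and `.iloc` (by position) on a hand-built Series. The three entries and their shortened labels are assumptions for illustration, not rows read from the file.

```python
import pandas as pd

# Hypothetical Series indexed by passenger name.
age = pd.Series(
    [22.0, 38.0, 35.0],
    index=["Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley",
           "Allen, Mr. William Henry"],
    name="Age",
)

one = age.loc["Allen, Mr. William Henry"]      # single label: a scalar
several = age.loc[["Braund, Mr. Owen Harris",
                   "Allen, Mr. William Henry"]]  # list of labels: a Series
by_position = age.iloc[0]                        # first element by position

print(one, by_position)
print(several)
```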
In [27]:
age = age + 10
age[:5]
Out[27]:
Name
Braund, Mr. Owen Harris                                32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    48.0
Heikkinen, Miss. Laina                                 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           45.0
Allen, Mr. William Henry                               45.0
Name: Age, dtype: float64
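Adding a number to a Series applies the operation to every element and produces a new Series. The assignment `age = age + 10` only rebinds the name `age`, so the `Age` column inside the DataFrame keeps its original values. A minimal sketch on a hypothetical three-row table:

```python
import pandas as pd

# Hypothetical sample column.
df_demo = pd.DataFrame({"Age": [22.0, 38.0, 26.0]})

age = df_demo["Age"]
age = age + 10   # element-wise addition; `age` now names a new Series

print(list(age))             # shifted values
print(list(df_demo["Age"]))  # the DataFrame column is unchanged
```

Any statistic computed on `age` after this step reflects the shifted values, not the raw column.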
Summary statistics of the values
In [28]:
age.mean()  # age was increased by 10 above, so this is 10 more than the raw mean
Out[28]:
39.69911764705882
In [29]:
age.max()
Out[29]:
90.0
In [30]:
age.min()
Out[30]:
10.42
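The `Age` column contains missing values (714 non-null out of 891), and by default `mean()`, `max()` and `min()` skip them. The sketch below demonstrates the `skipna` parameter on a hypothetical four-element Series.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
age = pd.Series([22.0, 38.0, np.nan, 35.0])

mean_skipping = age.mean()            # NaN ignored: (22 + 38 + 35) / 3
mean_strict = age.mean(skipna=False)  # NaN propagates to the result
non_missing = age.count()             # number of non-NaN values

print(mean_skipping, mean_strict, non_missing)
```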
In [31]:
df.describe()  # compute the basic summary statistics of every numeric column at once
Out[31]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
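By default `describe()` summarizes only the numeric columns, and its `count` row excludes missing values, which is why `Age` shows 714 rather than 891. The result is itself a DataFrame, so a single statistic can be read with `.loc`. A sketch on a hypothetical four-row table (`df_demo`):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one text column and one missing age.
df_demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Sex": ["male", "female", "female", "female"],
})

summary = df_demo.describe()              # the text column "Sex" is left out
age_count = summary.loc["count", "Age"]   # non-missing ages only
fare_max = summary.loc["max", "Fare"]

print(summary)
```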