My Kaggle Journey, Part 1
Preface
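Kaggle is an online platform built around data science competitions. It hosts a community where data scientists and machine learning practitioners can exchange ideas, learn, and compete. It offers a large collection of datasets for challenges, research, and practice. Users submit solutions, then compare and discuss them with other users, and a leaderboard shows how each solution scores and ranks. Beyond datasets and competitions, Kaggle provides tutorials and other learning resources for building data science skills, plus a community forum for asking questions, getting help, and sharing experience.
Many data scientists and machine learning enthusiasts regard Kaggle as a valuable resource for learning and exchange. It gives users a chance to compete and collaborate with top data scientists from around the world on real-world problems.
Starting today I am beginning my Kaggle journey, taking notes as I learn.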
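I. Goals
Learning something new needs a goal. This week's goals:
- Get comfortable using Kaggle and take some of the built-in courses that interest me
- Practice exploring and analyzing the Chess Game Dataset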
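II. Course 1: pandas
1. Study and practice
Course link: https://www.kaggle.com/learn/pandas
The course has 6 topics:
- Creating, reading, and writing: if you cannot read data in, you cannot work with it.
- Indexing, selecting, and assigning
- Renaming and combining data from multiple sources
- Summary functions and maps
- Grouping and sorting
- Data types and missing values
Each topic ends with a set of exercises, which work very well for reinforcing the material.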
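As a quick illustration of the first topic (creating, reading, and writing), here is a minimal sketch of a CSV round trip. The table contents are made up for illustration, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"Bob": ["I liked it.", "It was awful."], "Sue": ["Pretty good.", "Bland."]},
    index=["Product A", "Product B"],
)

# Write the table as CSV text into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)

# Rewind and read it back; index_col=0 turns the first CSV column into the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

With a real file, the same two calls take a path instead of the buffer.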
2. Key takeaways
The course text is in fairly simple English, so I quote it as is.
- two core objects in pandas: the DataFrame and the Series.
- The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
- If a DataFrame is a table, a Series is a list.
- A Series is, in essence, a single column of a DataFrame. A Series does not have a column name; it only has one overall name:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
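To see the "a Series is a single column of a DataFrame" idea in action, here is a small sketch that pulls one column out of a table. The data is the same kind of toy review table as above, and the variable names are my own.

```python
import pandas as pd

# Toy table for illustration.
df = pd.DataFrame(
    {"Bob": ["I liked it.", "It was awful."], "Sue": ["Pretty good.", "Bland."]},
    index=["Product A", "Product B"],
)

# Selecting a single column returns a Series.
bob = df["Bob"]

print(type(bob).__name__)  # the class name of the selected column
print(bob.name)            # the Series name is the column label it came from
print(list(bob.index))     # the Series keeps the DataFrame's row labels
```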
- So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.
- we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.
- index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.
- Both loc and iloc are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.
reviews.iloc[0]
reviews.iloc[:, 0]
reviews.iloc[-5:]
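# Of the three calls above, the first selects the first row (all of its columns);
# the second selects the first column;
# the third selects the last five rows.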
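The three iloc calls can be tried on a small stand-in table. This is a sketch under assumptions: the `reviews` DataFrame here is a made-up six-row substitute for the course's wine reviews data, with only a few columns.

```python
import pandas as pd

# Hypothetical stand-in for the course's `reviews` table.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US", "France", "Spain"],
        "points": [87, 87, 87, 88, 90, 86],
        "price": [20.0, 15.0, 14.0, 13.0, 27.0, 15.0],
    }
)

first_row = reviews.iloc[0]     # Series holding every column of the first row
first_col = reviews.iloc[:, 0]  # Series holding the first column ("country")
last_five = reviews.iloc[-5:]   # DataFrame holding the last five rows

print(first_row)
print(first_col)
print(last_five)
```

Note that `reviews.iloc[0]` returns the whole row, not a single cell; a single cell by position would be something like `reviews.iloc[0, 1]`.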
- The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.
reviews.loc[0, 'country']
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
# Use loc to select data by row and column labels
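The same two loc calls on a toy table look like this. The column names match the ones quoted from the course, but the rows and taster values are invented placeholders.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the course example.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US"],
        "taster_name": ["Taster A", "Taster B", "Taster C"],
        "taster_twitter_handle": ["@taster_a", "@taster_b", "@taster_c"],
        "points": [87, 87, 88],
    }
)

# Row label 0, column label "country": a single cell.
cell = reviews.loc[0, "country"]

# All rows, three columns chosen by label: a smaller DataFrame.
subset = reviews.loc[:, ["taster_name", "taster_twitter_handle", "points"]]

print(cell)
print(subset)
```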
- Choosing between loc and iloc: the two methods use slightly different indexing schemes.
- iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,…,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,…,10.
Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, …, Potatoes, …, and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,…,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
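The slicing difference is easy to check directly. A minimal sketch, assuming a throwaway DataFrame with the default 0,…,1000 index and a second made-up table with string labels:

```python
import pandas as pd

# Default integer index 0..1000, i.e. 1001 rows.
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])     # end position excluded
n_loc = len(df.loc[0:1000])       # end label included
n_loc_fixed = len(df.loc[0:999])  # go one lower to match iloc

print(n_iloc, n_loc, n_loc_fixed)

# Label slicing with strings: both endpoints are included.
fruit = pd.DataFrame(
    {"stock": [5, 2, 7, 3]},
    index=["Apples", "Bananas", "Potatoes", "Radishes"],
)
selected = list(fruit.loc["Apples":"Potatoes"].index)
print(selected)
```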
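This point deserves a bit more explanation. loc stands for location, and the i in iloc stands for integer. The difference is this: loc looks rows up by index label, so if the DataFrame you read in defines an index, loc finds rows through that index. iloc ignores the index labels and looks rows up by position, which starts at 0 and increases by 1 per row. This article helps with understanding: https://zhuanlan.zhihu.com/p/129898162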
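The label-versus-position distinction shows up most clearly when the index is made of integers that are not in order. A small sketch with a made-up table:

```python
import pandas as pd

# The index labels are integers, but deliberately not in positional order.
df = pd.DataFrame({"fruit": ["apple", "banana", "cherry"]}, index=[2, 0, 1])

by_label = df.loc[0, "fruit"]  # the row whose index label is 0
by_position = df.iloc[0, 0]    # the first row, whatever its label is

print(by_label)
print(by_position)
```

Here the row labeled 0 is the second row, so the two lookups return different fruits.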
Summary
That is my record of today's Kaggle learning. [To be continued]