Question
I can do some simple machine learning with the scikit-learn and NLTK modules in Python, but I run into problems when training with multiple features that have different value types (number, list of strings, yes/no, etc.). In the following data, I have a word/phrase column from which I extract information to create the other columns (for example, the length column is the character length of 'word/phrase'). The Label column is the label.
Word/phrase   Length   '2-letter substring'                                'First letter'   'With space?'   Label
take action   10       ['ta', 'ak', 'ke', 'ac', 'ct', 'ti', 'io', 'on']    t                Yes             A
sure          4        ['su', 'ur', 're']                                  s                No              A
That wasn't   10       ['th', 'ha', 'at', 'wa', 'as', 'sn', 'nt']          t                Yes             B
simply        6        ['si', 'im', 'mp', 'pl', 'ly']                      s                No              C
a lot of      6        ['lo', 'ot', 'of']                                  a                Yes             D
said          4        ['sa', 'ai', 'id']                                  s                No              B
Should I combine them into one dictionary and then use sklearn's DictVectorizer to hold them in working memory? And then treat these features as one X vector when training the ML algorithms?
Recommended answer
The majority of machine learning algorithms work with numbers, so you have to transform your categorical values and strings into numbers.
The popular Python machine-learning library scikit-learn has a whole chapter dedicated to data preprocessing. With 'yes/no' everything is easy: just put 0/1 in its place.
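As a minimal sketch of the 0/1 replacement, the 'With space?' values from the question can be mapped with a plain dictionary. The names `with_space` and `yes_no_to_int` are made up for this illustration.

```python
# Values of the 'With space?' column, in the same row order as the question.
with_space = ["Yes", "No", "Yes", "No", "Yes", "No"]

# Hypothetical lookup table: 1 for "Yes", 0 for "No".
yes_no_to_int = {"Yes": 1, "No": 0}

# Replace every answer with its numeric code.
with_space_encoded = [yes_no_to_int[value] for value in with_space]

print(with_space_encoded)
```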
Among many other important things, it explains how to preprocess categorical data using its OneHotEncoder.
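For instance, the 'First letter' column could be one-hot encoded roughly like this. It is only a sketch: the variable names are illustrative, and `handle_unknown="ignore"` is an assumed choice so that letters not seen during fitting become all-zero rows instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# 'First letter' column from the question; the encoder expects a 2D input
# with one row per sample and one column per categorical feature.
first_letters = [["t"], ["s"], ["t"], ["s"], ["a"], ["s"]]

# Assumption: ignore unseen categories at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; densify it for display.
first_letter_encoded = encoder.fit_transform(first_letters).toarray()

print(encoder.categories_)   # categories learned for each input column
print(first_letter_encoded)  # one row per sample, one column per category
```

Each distinct letter gets its own 0/1 column, so the model does not read an artificial ordering into the letters.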
When you work with text, you also have to transform your data in a suitable way. One common feature extraction strategy for text is the tf-idf score, and I wrote a tutorial here.
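One possible way to apply tf-idf to the question's data is to treat each 2-letter substring as a token, then stack the resulting matrix next to the numeric columns to obtain a single X. This is a sketch under assumptions rather than the method from the tutorial: joining the substrings with spaces, using `TfidfVectorizer` with its default settings, and combining columns with `scipy.sparse.hstack` are all choices made for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# '2-letter substring' column from the question.
bigrams = [
    ["ta", "ak", "ke", "ac", "ct", "ti", "io", "on"],
    ["su", "ur", "re"],
    ["th", "ha", "at", "wa", "as", "sn", "nt"],
    ["si", "im", "mp", "pl", "ly"],
    ["lo", "ot", "of"],
    ["sa", "ai", "id"],
]

# Assumption: join each list into one space-separated string so the default
# tokenizer (tokens of two or more word characters) sees one token per bigram.
bigram_docs = [" ".join(tokens) for tokens in bigrams]

vectorizer = TfidfVectorizer()
X_bigrams = vectorizer.fit_transform(bigram_docs)  # sparse, one column per bigram

# Numeric columns from the question: Length and 'With space?' coded as 0/1.
numeric = np.array([[10, 1], [4, 0], [10, 1], [6, 0], [6, 1], [4, 0]])

# Stack everything side by side into a single feature matrix X.
X = hstack([X_bigrams, csr_matrix(numeric)]).tocsr()

# Labels in the same row order, ready to pass to an estimator's fit(X, y).
y = ["A", "A", "B", "C", "D", "B"]

print(X.shape)
```

The one-hot encoded columns for 'First letter' could be appended to the same `hstack` call in the same way, and any scikit-learn estimator that accepts sparse input can then be trained on `X` and `y`.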