What feature engineering covers
- Feature extraction
- Feature preprocessing
- Feature dimensionality reduction
Feature extraction
Converting arbitrary data into numeric features.
API
sklearn.feature_extraction
Dictionary feature extraction
sklearn.feature_extraction.DictVectorizer()
code
from sklearn.feature_extraction import DictVectorizer

def feature_demo():
    '''
    Dictionary feature extraction
    '''
    a = [
        {'city': '北京', 'temperature': 100},
        {'city': '上海', 'temperature': 60},
        {'city': '深圳', 'temperature': 30}
    ]
    transfer = DictVectorizer(sparse=False)
    new_data = transfer.fit_transform(a)
    print(new_data)
    print("\n")
    print(transfer.feature_names_)
Output
[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
['city=上海', 'city=北京', 'city=深圳', 'temperature']
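When `sparse=False` is omitted, `DictVectorizer` defaults to `sparse=True` and `fit_transform` returns a `scipy.sparse` matrix (which stores only nonzero entries) instead of a dense ndarray. A minimal sketch of the difference:

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'city': '北京', 'temperature': 100},
    {'city': '上海', 'temperature': 60},
]

# Default sparse=True: fit_transform returns a scipy sparse matrix
transfer = DictVectorizer()
sparse_res = transfer.fit_transform(data)
print(type(sparse_res))      # a scipy.sparse matrix type
print(sparse_res)            # printed as (row, col)  value triplets
print(sparse_res.toarray())  # .toarray() converts back to a dense ndarray
```

The sparse form saves memory when the one-hot city columns are mostly zero; call `.toarray()` only when a dense array is actually needed.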
Text feature extraction
Method 1: word counts
API:
sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
from sklearn.feature_extraction.text import CountVectorizer
def text_demo1():
    '''
    Text feature extraction with CountVectorizer
    '''
    data = ["life is short, I like python",
            "life is too long, I dislike python"]
    # 1. Instantiate the transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform
    res = transfer.fit_transform(data)
    print(transfer.get_feature_names())
    print(res.toarray())

if __name__ == '__main__':
    text_demo1()
Output
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
[[0 1 1 1 0 1 1 0]
[1 1 1 0 1 1 0 1]]
The stop_words=[] parameter
Passing in a list of stop words:
transfer = CountVectorizer(stop_words=['is', 'too'])
Output
['dislike', 'life', 'like', 'long', 'python', 'short']
[[0 1 1 0 1 1]
 [1 1 0 1 1 0]]
Handling Chinese text
CountVectorizer splits on whitespace, so Chinese text must be segmented into words first (here with jieba):
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_words(text):
    '''
    Segment Chinese text into words
    '''
    a = jieba.cut(text)
    return " ".join(list(a))

def text_zh():
    '''
    Chinese text feature extraction with automatic word segmentation
    '''
    data = ['意志是一个强壮的盲人,倚靠在明眼的跛子肩上',
            '学到很多东西的诀窍,就是一下子不要学很多',
            '重复别人所说的话,只需要教育;而要挑战别人所说的话,则需要头脑']
    data_new = []
    for s in data:
        data_new.append(cut_words(s))
    transfer = CountVectorizer(stop_words=['因为', '所以', '一个'])
    res = transfer.fit_transform(data_new)
    print(transfer.get_feature_names())
    print(res.toarray())

if __name__ == '__main__':
    text_zh()
Method 2: TF-IDF
If a word occurs frequently in one document but rarely in the other documents, it is considered a good discriminator for that document.
The TF-IDF method ⭐️
tf: term frequency
idf: inverse document frequency,
$idf = \lg\frac{\text{total number of documents}}{\text{number of documents containing the term}}$
$tfidf = tf \times idf$
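The two formulas above can be computed by hand. Note this sketch uses the plain lg-based formula from the text; sklearn's TfidfVectorizer additionally smooths idf and L2-normalizes each row, so its numbers will differ:

```python
import math

# Two toy documents, already tokenized
docs = [
    ["life", "is", "short", "like", "python"],
    ["life", "is", "too", "long", "dislike", "python"],
]

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: lg(total docs / docs containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "like" appears only in the first document, so it gets a nonzero weight:
# tf = 1/5 = 0.2, idf = lg(2/1), tfidf ≈ 0.0602
print(tfidf("like", docs[0], docs))
# "python" appears in both documents, so idf = lg(2/2) = 0 and tfidf = 0
print(tfidf("python", docs[0], docs))
```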
Code example
from sklearn.feature_extraction.text import TfidfVectorizer
def text_tfidf():
    '''
    Text feature extraction with TF-IDF
    '''
    data = ["life is short, I like python",
            "life is too long, I dislike python"]
    # 1. Instantiate the transformer class
    transfer = TfidfVectorizer(stop_words=['is', 'too'])
    # 2. Call fit_transform
    res = transfer.fit_transform(data)
    print(transfer.get_feature_names())
    print(res.toarray())

if __name__ == '__main__':
    # text_demo1()
    text_tfidf()
Output
['dislike', 'life', 'like', 'long', 'python', 'short']
[[0.         0.40993715 0.57615236 0.         0.40993715 0.57615236]
 [0.57615236 0.40993715 0.         0.57615236 0.40993715 0.        ]]