Problem description
I'm learning how to use machine learning with Spark MLlib, with the goal of doing sentiment analysis of tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
That dataset contains 1 million tweets classified as Positive or Negative. The second column of the dataset contains the sentiment and the fourth column contains the tweet.
This is my current PySpark code:
import csv

from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Load the CSV, drop the header line, and parse each partition with csv.reader
data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)
r = rdd.mapPartitions(lambda x: csv.reader(x))

# Keep only (tweet, sentiment) and take a random sample of 10,000 rows
r2 = r.map(lambda x: (x[3], int(x[1])))
parts = r2.map(lambda x: Row(sentence=x[0], label=int(x[1])))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)

# Tokenize the tweets and remove stop words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)
remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)
train_data_raw = base_words.select("base_words", "label")

# Embed each tweet as the average of its word vectors
word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")
model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")

# Train the classifier and apply it back to the training data
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()
I'm executing this in the PySpark interactive shell using this command:
pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'
(FYI: I have an HDFS cluster of 10 VMs running YARN, Spark, etc.)
As a result of the last line of code, this is what happens:
>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows
If I do the same with a smaller dataset that I created manually, it works. I don't know what is happening; I've been working on this all day.
Any suggestions?
Thanks for your time!
Answer

TL;DR Ten iterations is way too low for any real-life application. On large and non-trivial datasets it can take a thousand or more iterations (as well as tuning of the remaining parameters) to converge.
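As a starting point, you could rerun the training with a much larger iteration budget and weaker regularization; the exact values below are illustrative assumptions to experiment with, not tuned settings:

# A sketch, not a tuned configuration: allow many more iterations than
# maxIter=10 and use much weaker regularization than regParam=0.3.
# Both values here are illustrative starting points, not tuned choices.
lr = LogisticRegression(maxIter=1000, regParam=0.01, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)
lrModel.transform(final_train_data).show()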
A binomial LogisticRegressionModel has a summary attribute, which gives you access to a LogisticRegressionSummary object. Among other useful metrics, it contains objectiveHistory, which can be used to debug the training process:
import matplotlib.pyplot as plt

lrm = LogisticRegression(..., family="binomial").fit(df)

# objectiveHistory holds the loss at each iteration; plotting it shows
# whether the optimizer has flattened out or is still descending.
plt.plot(lrm.summary.objectiveHistory)
plt.show()
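If the plotted objective is still falling at the last point, the optimizer ran out of iterations before converging. That is consistent with the output above: the coefficients have barely moved from their starting point, so every row ends up with nearly the same rawPrediction and probability. A quick numeric check, assuming the same lrm as above, is to compare the number of iterations actually run against maxIter:

# If this equals maxIter, training stopped on the iteration budget
# rather than on a convergence criterion.
print(lrm.summary.totalIterations)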