This article covers how to count the number of words in each sentence with Spark DataFrames.
Problem Description
I have a Spark Dataframe where each row has a review.
+--------------------+
| reviewText|
+--------------------+
|Spiritually and m...|
|This is one my mu...|
|This book provide...|
|I first read THE ...|
+--------------------+
I tried:
SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('reviewText')))
SplitSentences = SplitSentences.select(SplitSentences.split_sent)
Then I created the function:
def word_count(text):
    return len(text.split())

wordcount_udf = udf(lambda x: word_count(x))

df2 = SplitSentences.withColumn("word_count",
    wordcount_udf(col('split_sent')).cast(IntegerType()))
I want to count the words of each sentence in each review (row), but it doesn't work.
Recommended Answer
You can use the split built-in function to split the sentences and the size built-in function to count the length of the resulting array:
df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)
That way, you don't need an expensive udf function.
For example, suppose you have the following single-sentence DataFrame:
+-----------------------------+
|reviewText |
+-----------------------------+
|this is text testing spliting|
+-----------------------------+
After applying the size and split functions above, you should get:
+-----------------------------+----------+
|reviewText |word_count|
+-----------------------------+----------+
|this is text testing spliting|5 |
+-----------------------------+----------+
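For reference, here is a minimal, self-contained sketch of this approach; the SparkSession setup and the sample data are assumptions added for illustration, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed setup for illustration: a local SparkSession
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

# Sample single-sentence review, matching the example above
df = spark.createDataFrame([("this is text testing spliting",)], ["reviewText"])

# split turns the sentence into an array of words; size counts the elements
df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)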
If you have multiple sentences in a row, as below:
+----------------------------------------------------------------------------------+
|reviewText |
+----------------------------------------------------------------------------------+
|this is text testing spliting. this is second sentence. And this is the third one.|
+----------------------------------------------------------------------------------+
Then you will have to write a udf function, as below:
from pyspark.sql import functions as F

def countWordsInEachSentences(array):
    # Count the words in each sentence of the array
    return [len(x.split()) for x in array]

# Split the review into sentences on '. ', then count the words per sentence
countWordsSentences = F.udf(lambda x: countWordsInEachSentences(x.split('. ')))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)
which should give you:
+----------------------------------------------------------------------------------+----------+
|reviewText |word_count|
+----------------------------------------------------------------------------------+----------+
|this is text testing spliting. this is second sentence. And this is the third one.|[5, 4, 6] |
+----------------------------------------------------------------------------------+----------+
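One optional refinement, sketched here as an assumption rather than part of the original answer: without an explicit return type, the udf returns its result as a string, so you can declare ArrayType(IntegerType()) to keep word_count as a true integer array:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Declaring the return type keeps word_count as array<int> rather than its string form
countWordsSentences = F.udf(
    lambda text: [len(sentence.split()) for sentence in text.split('. ')],
    ArrayType(IntegerType()))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)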
I hope the answer is helpful.