This article covers how to count the number of words in each sentence with Spark DataFrames.
Problem Description
I have a Spark Dataframe where each row has a review.
+--------------------+
| reviewText|
+--------------------+
|Spiritually and m...|
|This is one my mu...|
|This book provide...|
|I first read THE ...|
+--------------------+
I tried:
SplitSentences = df.withColumn("split_sent", sentencesplit_udf(col('reviewText')))
SplitSentences = SplitSentences.select(SplitSentences.split_sent)
Then I created the function:
def word_count(text):
    return len(text.split())

wordcount_udf = udf(lambda x: word_count(x))

df2 = SplitSentences.withColumn("word_count",
    wordcount_udf(col('split_sent')).cast(IntegerType()))
I want to count the words of each sentence in each review (row), but it doesn't work.
Recommended Answer
You can use the split built-in function to split the sentences and the size built-in function to count the length of the resulting array:
df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)
That way, you don't need an expensive udf function.
For example, suppose you have the following single-sentence DataFrame:
+-----------------------------+
|reviewText |
+-----------------------------+
|this is text testing spliting|
+-----------------------------+
After applying the size and split functions above, you should get:
+-----------------------------+----------+
|reviewText |word_count|
+-----------------------------+----------+
|this is text testing spliting|5 |
+-----------------------------+----------+
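For reference, here is a minimal, self-contained sketch of this approach; the SparkSession setup and the sample data are assumptions added for illustration, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed setup for illustration: a local SparkSession
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

# Sample single-sentence review, matching the example above
df = spark.createDataFrame([("this is text testing spliting",)], ["reviewText"])

# split turns the sentence into an array of words; size counts the elements
df.withColumn("word_count", F.size(F.split(df['reviewText'], ' '))).show(truncate=False)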
If you have multiple sentences in a row, as below:
+----------------------------------------------------------------------------------+
|reviewText |
+----------------------------------------------------------------------------------+
|this is text testing spliting. this is second sentence. And this is the third one.|
+----------------------------------------------------------------------------------+
Then you will have to write a udf function, as below:
from pyspark.sql import functions as F

def countWordsInEachSentences(array):
    # Count the words in each sentence of the array
    return [len(x.split()) for x in array]

# Split the review into sentences on '. ', then count the words per sentence
countWordsSentences = F.udf(lambda x: countWordsInEachSentences(x.split('. ')))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)
which should give you:
+----------------------------------------------------------------------------------+----------+
|reviewText |word_count|
+----------------------------------------------------------------------------------+----------+
|this is text testing spliting. this is second sentence. And this is the third one.|[5, 4, 6] |
+----------------------------------------------------------------------------------+----------+
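One optional refinement, sketched here as an assumption rather than part of the original answer: without an explicit return type, the udf returns its result as a string, so you can declare ArrayType(IntegerType()) to keep word_count as a true integer array:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Declaring the return type keeps word_count as array<int> rather than its string form
countWordsSentences = F.udf(
    lambda text: [len(sentence.split()) for sentence in text.split('. ')],
    ArrayType(IntegerType()))

df.withColumn("word_count", countWordsSentences(df['reviewText'])).show(truncate=False)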
I hope the answer is helpful.