Sorting column values in PySpark


Problem Description

I have this DataFrame below:

Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-------------
1     | 19       | 37.1     | 32 | 62 | ["20031,10031"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
2     | 19       | 37.1     | 44 | 12 | ["40062,30062"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
3     | 19       | 37.1     | 22 | 64 | ["20031,10031"] | ["13,11/12"] | ["20031,10031"] | ["13,11/12"]
4     | 19       | 37.1     | 32 | 98 | ["20032,10032"] | ["13,11/12"] | ["40062,30062"] | ["13,11/12"]

I want to sort the values of the columns indice_from, indice_from, indice_to, and indice_to in ascending order, without touching the rest of the columns of my DataFrame. Note that the two columns indice_from and indice_to sometimes contain a number plus a letter, like ["14,14A"]. In a case like ["14,14A"], the result should always keep the same structure. For example:

If the number is 15, the second value should be 15 + letter, and 15 < 15 + letter; if the first value is 9, the second value should be 9 + letter, and 9 < 9 + letter.
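That ordering requirement matches what plain lexicographic string comparison already gives for values sharing the same numeric prefix, since "15" is a prefix of "15A". A quick sanity check in plain Python (a sketch using the question's sample values, not part of the original post):

```python
# A bare number sorts before the same number followed by a letter,
# because the shorter string is a prefix of the longer one.
assert sorted(["14A", "14"]) == ["14", "14A"]
assert sorted(["9A", "9"]) == ["9", "9A"]

# The same comparison also orders the other sample values as desired.
assert sorted(["13", "11/12"]) == ["11/12", "13"]
assert sorted(["20031", "10031"]) == ["10031", "20031"]
```

One caveat: this is string (lexicographic) ordering, not numeric ordering, so e.g. "100" would sort before "9"; the sample data happens to pair values of equal digit length, where the two orderings agree.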

The new DataFrame:

Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-------------
1     | 19       | 37.1     | 32 | 62 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
2     | 19       | 37.1     | 44 | 12 | ["30062,40062"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
3     | 19       | 37.1     | 22 | 64 | ["10031,20031"] | ["11/12,13"] | ["10031,20031"] | ["11/12,13"]
4     | 19       | 37.1     | 32 | 98 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["11/12,13"]

Can someone please help me sort the values of the columns indice_from, indice_from, indice_to, and indice_to to obtain a new DataFrame like the second one above? Thank you.

Recommended Answer

If I understand correctly, then

from pyspark.sql import functions as F

# Sort each array column in ascending order in place,
# leaving every other column of the DataFrame untouched.
columns_to_sort = ['indice_from', 'indice_from', 'indice_to', 'indice_to']

for c in columns_to_sort:
    df = df.withColumn(c, F.sort_array(c))

will do the trick. Let me know if it doesn't.
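One caution: in the sample data each cell looks like an array holding a single comma-separated string (e.g. ["20031,10031"]) rather than an array of separate elements. If that is the case, sort_array alone will not reorder the values inside the string; the string would need to be split, sorted, and rejoined. A sketch of that logic in plain Python (the string layout is an assumption read off the sample data):

```python
def sort_csv_string(value: str) -> str:
    """Split a comma-separated string, sort the parts lexicographically,
    and join them back, e.g. "20031,10031" -> "10031,20031"."""
    return ",".join(sorted(value.split(",")))

# In PySpark the equivalent transformation could be expressed per column as:
#   df = df.withColumn(c, F.array_join(F.array_sort(F.split(F.col(c), ",")), ","))
```

With this helper, "14A,14" becomes "14,14A" and "13,11/12" becomes "11/12,13", matching the desired output above.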

