Problem description
I'm writing a loop in PySpark, and I get this message:
"Column is not iterable"
This is the code:
(regexp_replace(data_join_result[varibale_choisie],
(random.choice(data_join_result.collect()[j][varibale_choisie])),
data_join_result.collect()[j][lettre_choisie] ))))
According to the error message, the problem occurs at this expression:
data_join_result.collect()[j][lettre_choisie]
My input:
VARIABLEA | VARIABLEB
BLUE | WHITE
PINK | DARK
My expected output:
VARIABLEA | VARIABLEB
BLTE | WHITE
PINK | DARM
Does anyone know how to fix it? Thanks!
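To make the expected output concrete: each corrupted value differs from its input by a single replaced character. Here is a minimal plain-Python sketch of that one operation, assuming a hypothetical helper `replace_at` (not part of the original code) and hand-picked positions instead of random ones:

```python
def replace_at(value, pos, letter):
    """Return value with the character at index pos replaced by letter."""
    return value[:pos] + letter + value[pos + 1:]

# Positions and letters are chosen by hand to match the example tables
print(replace_at("BLUE", 2, "T"))  # BLTE
print(replace_at("DARK", 3, "M"))  # DARM
```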
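>Finally, I found out how to create a **loop to corrupt a dataset**. I'm sharing it in case someone needs it one day!
First, you need to define the errors you want to create, the letters used for replacements, and the variables to corrupt. I also added an error type that inserts special characters: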
lettre = [ "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]
code_erreur= [ "replace","inserte","delete","espace","caract_spe", "NA","inverse"]
nombre_erreur=["1","1","1","2"]
varibale =["VARIABLEA","VARIABLEB"]
caract_spe =["_", "^", "¨", "", ".", "é", "-", "*","ù","ï","à","è","î","â"]
- I created the list "nombre_erreur" because I want 75% of my dataset to have 1 error and 25% to have 2 errors.
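The weighting works because `random.choice` picks every list element with equal probability, so a value that fills three of the four slots is drawn about 75% of the time. Below is a quick sketch to check this empirically. The seed and sample size are arbitrary assumptions, and `random.choices` with `weights` is shown only as an equivalent alternative, not something the original code uses:

```python
import random
from collections import Counter

nombre_erreur = ["1", "1", "1", "2"]
n_draws = 10_000  # arbitrary sample size

random.seed(0)  # arbitrary seed, only to make the run repeatable
draws = Counter(random.choice(nombre_erreur) for _ in range(n_draws))
share_one = draws["1"] / n_draws
print(f"share of draws with 1 error: {share_one:.2f}")

# Equivalent alternative: state the 3:1 weights explicitly
weighted = random.choices(["1", "2"], weights=[3, 1], k=n_draws)
share_one_weighted = weighted.count("1") / n_draws
print(f"with explicit weights: {share_one_weighted:.2f}")
```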
Next, create the definition:
import random

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The posted signature (code_erreur, varibale, nombre_erreur, lettre, caract_spe)
# did not match the names used in the body. The parameters below are the names
# the body actually reads; this is an assumption about the intended interface.
def def_code_erreur(col1, type_erreur, nb_erreur, lettre_choisie, caract_spe_choisi):
    # "NA" means no corruption; null values are returned unchanged
    if col1 is None or type_erreur == "NA":
        return col1
    for _ in range(int(nb_erreur)):
        longueur = len(col1)
        if longueur < 2:
            # range(1, longueur) would be empty and random.choice would fail
            break
        pos = random.choice(range(1, longueur))
        if type_erreur == "delete":
            # drop the character at pos
            col1 = col1[:pos] + col1[pos + 1:]
        elif type_erreur == "espace":
            # insert a space before pos
            col1 = col1[:pos] + " " + col1[pos:]
        elif type_erreur == "inserte":
            # insert the chosen letter before pos
            col1 = col1[:pos] + lettre_choisie + col1[pos:]
        elif type_erreur == "caract_spe":
            # insert the chosen special character before pos
            col1 = col1[:pos] + caract_spe_choisi + col1[pos:]
        elif type_erreur == "replace":
            # overwrite the character just before pos
            col1 = col1[:pos - 1] + lettre_choisie + col1[pos:]
        elif type_erreur == "inverse":
            # swap the characters at pos - 1 and pos
            col1 = col1[:pos - 1] + col1[pos:pos + 1] + col1[pos - 1:pos] + col1[pos + 1:]
    return col1

udf_def_code_erreur = udf(def_code_erreur, StringType())
Then you just have to call "udf_def_code_erreur"! If you want to corrupt the whole dataset, you can call it in a loop.
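Before running this on a cluster, the same logic can be tried in plain Python on a few rows. The sketch below assumes a condensed stand-in called `corrupt` (hypothetical, covering only three of the error types) and a list of dicts in place of the DataFrame. The loop that draws the parameters is one possible way to call the function repeatedly, not necessarily the one used originally:

```python
import random

lettre = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
code_erreur = ["replace", "inserte", "delete"]  # subset, for brevity
nombre_erreur = ["1", "1", "1", "2"]
varibale = ["VARIABLEA", "VARIABLEB"]

def corrupt(value, type_erreur, nb_erreur, lettre_choisie):
    """Condensed stand-in for def_code_erreur (hypothetical helper)."""
    for _ in range(int(nb_erreur)):
        if len(value) < 2:
            break  # no valid position left to corrupt
        pos = random.choice(range(1, len(value)))
        if type_erreur == "delete":
            value = value[:pos] + value[pos + 1:]
        elif type_erreur == "inserte":
            value = value[:pos] + lettre_choisie + value[pos:]
        elif type_erreur == "replace":
            value = value[:pos - 1] + lettre_choisie + value[pos:]
    return value

rows = [
    {"VARIABLEA": "BLUE", "VARIABLEB": "WHITE"},
    {"VARIABLEA": "PINK", "VARIABLEB": "DARK"},
]

random.seed(1)  # arbitrary seed, only to make the run repeatable
corrupted = []
for row in rows:
    new_row = dict(row)
    # Draw the column, error type, error count and letter for this row
    variable_choisie = random.choice(varibale)
    new_row[variable_choisie] = corrupt(
        row[variable_choisie],
        random.choice(code_erreur),
        random.choice(nombre_erreur),
        random.choice(lettre),
    )
    corrupted.append(new_row)

for before, after in zip(rows, corrupted):
    print(before, "->", after)
```

In Spark, the registered UDF would typically be applied with `withColumn`, passing fixed values through `lit(...)`, for example `data_join_result.withColumn("VARIABLEA", udf_def_code_erreur(col("VARIABLEA"), lit("replace"), lit("1"), lit("T"), lit("_")))`. This assumes the UDF takes the column value first, followed by the error type, error count, letter, and special character. Note that literal arguments apply the same error type to every row.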