Problem description
I have data in the following form:
37101000ssd48800^A1420asd938987^A2011-09-10^A18:47:50.000^A99.00^A1^A0^A
37101000sd48801^A44557asd03082^A2011-09-06^A13:24:58.000^A42.01^A1^A0^A
So first I took it literally and tried:
line = line.split("^A")
and also:
line = line.split("\\001")
The issue is this. The first approach works on my local machine if I run:
cat input.txt | python mapper.py
It runs fine locally (input.txt is the data above), but it fails on the Hadoop streaming cluster.
Someone told me that I should use "\\001" as the delimiter, but this does not work either, on my local machine or on the cluster.
For Hadoop folks:
If I debug it locally using:
cat input.txt | python mapper.py | sort | python reducer.py
this runs just fine when I use "^A" as the delimiter, but I get errors when running on the cluster, and the error code is not very helpful either.
Any suggestions on how I can debug this?
Thanks
Solution
If the original data uses control-A as the delimiter, and it is just being displayed as ^A by whatever you use to list the data, you have two choices:
- Pipe the output of whatever you use to list the data into a Python script that uses split('^A').
- Just use split('\001') to split on actual control-A characters.
The latter is almost always going to be what you really want. The reason this didn't work for you is that you wrote split('\\001'), escaping the backslash, so you were splitting on the literal string \001 rather than on control-A.
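A quick way to see the difference is to compare the two literals directly. The following is a minimal sketch; the sample record is made up for illustration, and '\x01' is simply the hex spelling of the same control-A character that the octal escape '\001' produces.

```python
# Control-A is a single character with code point 1.
ctrl_a = '\001'      # octal escape; '\x01' is the same character
escaped = '\\001'    # the backslash is escaped: four characters \ 0 0 1

# Hypothetical record that uses real control-A delimiters.
record = 'id1' + ctrl_a + 'id2' + ctrl_a + '2011-09-10'

fields_ok = record.split(ctrl_a)     # delimiter matches: three fields
fields_bad = record.split(escaped)   # delimiter never matches: one field

print(len(ctrl_a), len(escaped))
print(fields_ok)
print(len(fields_bad))
```

If the second split leaves the whole line in a single field, the delimiter string does not occur in the data, which is the symptom described in the question.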
If the original data actually has ^A (a caret followed by an A) as the delimiter, just use split('^A').
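To find out which of the two cases applies, it can help to look at the raw characters of a line rather than at how a terminal or pager renders them. Below is a minimal sketch, assuming Python 3; the helper names detect_delimiter and split_record are hypothetical and the sample lines are made up.

```python
def detect_delimiter(line):
    """Return the delimiter a line appears to use, or None if neither is found."""
    if '\x01' in line:
        return '\x01'   # real control-A (code point 1)
    if '^A' in line:
        return '^A'     # two printable characters: caret, then A
    return None


def split_record(line):
    """Split a line on whichever delimiter detect_delimiter finds."""
    line = line.rstrip('\n')
    delim = detect_delimiter(line)
    return line.split(delim) if delim else [line]


# Made-up sample lines, one per case.
raw = '37101000ssd48800\x011420asd938987\x0199.00\n'
literal = '37101000ssd48800^A1420asd938987^A99.00\n'

print(repr(raw))              # repr shows control-A as \x01
print(split_record(raw))
print(split_record(literal))
```

Printing repr(line) for a few input lines, both locally and on the cluster, is one way to check whether the two environments are actually receiving the same bytes.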