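I am following the Disco example for counting the words in a file:
Counting Words as a map/reduce job
That works without problems, but I would like to read one specific field from a text file that contains JSON strings.
The file contains lines like this: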
{"favorited": false, "in_reply_to_user_id": 306846931, "contributors": null, "truncated": false, "text": "@CataDuarte8 No! av\u00edseme cuando vaya ah salir para yo salir igual!", "created_at": "Wed Apr 04 20:25:37 +0000 2012", "retweeted": false, "in_reply_to_status_id": 187636960632901632, "coordinates": null, "id": 187637067415683073, "entities": {"user_mentions": [{"indices": [0, 12], "id_str": "306846931", "id": 306846931, "name": "Catalina Ria\u00f1o!\u2661", "screen_name": "CataDuarte8"}], "hashtags": [], "urls": []}, "in_reply_to_status_id_str": "187636960632901632", "id_str": "187637067415683073", "in_reply_to_screen_name": "CataDuarte8", "user": {"follow_request_sent": null, "profile_use_background_image": true, "id": 286402064, "description": "Cada quien RECOJE lo que SIEMBRA (:\r\n\u2551\u258c\u2502\u2551\u2502\u2551\u258c\u2502\u2588\u2551\u2502\u2551\u258c\u2502\u2551\u258c\u2551 ", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "profile_sidebar_fill_color": "525252", "is_translator": false, "geo_enabled": false, "profile_text_color": "ffffff", "followers_count": 620, "protected": false, "location": "", "default_profile_image": false, "id_str": "286402064", "utc_offset": -21600, "statuses_count": 16395, "profile_background_color": "000000", "friends_count": 537, "profile_link_color": "ff0000", "profile_image_url": "http://a0.twimg.com/profile_images/1858805061/ginri_normal.jpg", "notifications": null, "show_all_inline_media": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/419254765/Scan0004.jpg", "screen_name": "LadyRomeroo", "lang": "es", "profile_background_tile": true, "favourites_count": 136, "name": "Lady Romero \u2605", "url": "http://www.facebook.com/profile.php?id=1640385164", "created_at": "Fri Apr 22 23:04:41 +0000 2011", 
"contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "0a5b80", "default_profile": false, "following": null, "listed_count": 0}, "place": null, "retweet_count": 0, "geo": null, "in_reply_to_user_id_str": "306846931", "source": "web"}
I am only interested in the "text" key and its value. In plain Python, I can do this:
import simplejson

f = open("file.json", "r")
for line in f:
    r = simplejson.loads(line).get('text')
    print r
It returns the value of every text field, for example:
@_MuitoMais_ ´vcs são d msm amei o pode ou ão pode e a entrevist com a @claudialeitte =)
However, when I try to apply the same approach to the count_words.py example that ships with Disco, like this:
from disco.core import Job, result_iterator
import simplejson

def map(line, params):
    r = simplejson.loads(line).get('text')
    for word in r.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["/tmp/file.json"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print word, count
I get the following error:
# python test.py
Job@549:b4c76:9cbb1:
Status: [map] 0 waiting, 1 running, 0 done, 0 failed
2012/11/24 02:01:10 master New job initialized!
2012/11/24 02:01:10 master Starting job
2012/11/24 02:01:10 master Starting map phase
2012/11/24 02:01:10 master map:0 assigned to comp1
2012/11/24 02:01:11 master ERROR: Job failed: Worker at 'comp1' died: Traceback (most recent call last):
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main
    job.worker.start(task, job, **jobargs)
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start
    self.run(task, job, **jobargs)
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run
    getattr(self, task.mode)(task, params)
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 302, in map
    part = str(self['partition'](key, self['partitions'], params))
  File "/home/DISCO/data/comp1/46/Job@549:b4c76:9cbb1/usr/local/lib/python2.7/site-packages/disco/worker/classic/func.py", line 341, in default_partition
    return hash(str(key)) % nr_partitions
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
2012/11/24 02:01:11 master WARN: Job killed
Status: [map] 1 waiting, 0 running, 0 done, 1 failed
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    for word, count in result_iterator(job.wait(show=True)):
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait
    timeout, poll_interval * 1000)
  File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results
    raise JobError(Job(name=jobname, master=self), "Status %s" % status)
disco.error.JobError: Job Job@549:b4c76:9cbb1 failed: Status dead
This seems like it should be straightforward, but I am obviously missing something.
Can anyone help?
Best answer
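Your problem is in disco/worker/classic/func.py: default_partition calls str() on every key, and in Python 2 str() will not accept non-ASCII Unicode characters, because it encodes them with the ASCII codec: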
>>> str(u'\xb4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
>>>
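To confirm that the failure comes from the ASCII encoding step and not from Disco itself, here is a minimal sketch that performs the same encoding explicitly. The helper name try_encode is hypothetical and only serves the illustration; the sample word is taken from the output shown in the question.

```python
# -*- coding: utf-8 -*-
# Sketch: encoding a non-ASCII word with the ASCII codec fails,
# which is what str() does implicitly on a unicode object in Python 2.

def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text.

    Hypothetical helper, not part of Disco or the standard library.
    """
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

word = u'\xb4vcs'  # starts with U+00B4, the character named in the traceback

ascii_result = try_encode(word, 'ascii')   # cannot be represented in ASCII
utf8_result = try_encode(word, 'utf-8')    # UTF-8 can represent any code point
```

The ASCII attempt fails on the first character, while UTF-8 succeeds, so any fix has to either remove the non-ASCII characters or encode the key explicitly before Disco calls str() on it.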
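Since you are only counting words, you can use the unicodedata module to convert the Unicode data to an ASCII string. NFD normalization splits each accented character into a base letter plus combining marks, and encode('ascii', 'ignore') then drops everything that is not ASCII:
import json
import unicodedata

f = open('file.json')
for line in f:
    r = json.loads(line).get('text')
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
    print r
    print s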
Output:
@CataDuarte8 No! avíseme cuando vaya ah salir para yo salir igual!
@CataDuarte8 No! aviseme cuando vaya ah salir para yo salir igual!
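Keep in mind that this conversion is lossy. The sketch below, which uses strings from the sample data in the question, illustrates the two cases: characters with a canonical decomposition keep their base letter, while characters without one are dropped entirely. The function name to_ascii is a hypothetical name chosen for this example.

```python
# -*- coding: utf-8 -*-
import unicodedata

def to_ascii(text):
    """Decompose accented characters, then drop every non-ASCII code point.

    Hypothetical helper wrapping the normalize/encode pair shown above.
    """
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore')

# The n with tilde decomposes into "n" plus a combining tilde, so the "n"
# survives. The heart symbol has no decomposition and is removed.
name = to_ascii(u'Catalina Ria\u00f1o!\u2661')

# U+00B4 (acute accent) has no canonical decomposition, so it is removed.
word = to_ascii(u'\xb4vcs')
```

For word counting this means that words differing only by accents are counted together, and symbol-only tokens can disappear.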
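To apply this to your problem, rewrite the map() function as follows (with import unicodedata added next to import simplejson at the top of the script):
def map(line, params):
    r = simplejson.loads(line).get('text')
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
    for word in s.split():
        yield word, 1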
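If you want to check the word-splitting logic without starting a Disco job, the body of map() can be pulled into a plain generator and run locally. This is only a sketch under some assumptions: ascii_words is a hypothetical name, it uses the standard json module instead of simplejson, and the check for a missing "text" field is an addition in case some lines in the file do not carry that key.

```python
import json
import unicodedata

def ascii_words(line):
    """Yield the ASCII-only words from the "text" field of one JSON line.

    Hypothetical stand-alone version of the map() body, for local testing.
    """
    text = json.loads(line).get('text')
    if not text:
        # Assumption: lines without a "text" field are skipped, not errors.
        return
    decomposed = unicodedata.normalize('NFD', text)
    stripped = decomposed.encode('ascii', 'ignore').decode('ascii')
    for word in stripped.split():
        yield word

# Shortened sample line, based on the tweet shown in the question.
sample = '{"text": "No! av\\u00edseme cuando vaya"}'
words = list(ascii_words(sample))
no_text = list(ascii_words('{"id": 1}'))
```

Inside the Disco job, map() would then reduce to a loop over ascii_words(line) that yields (word, 1) for each word.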