Problem description
I'm trying to build a website parser using Scrapy. My current aim is to extract all the listing titles from the following page: https://www.avito.ru/leningradskaya_oblast_kirovsk/kvartiry/prodam/1-komnatnye (language: Russian).
However, using
response.xpath('here_comes_the_path_to_a_title').extract()
I get something like this:
[u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 56 \u043c\xb2, 4/5 \u044d\u0442.', u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 32 \u043c\xb2, 3/3 \u044d\u0442.', u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 48 \u043c\xb2, 11/16 \u044d\u0442.', u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 42 \u043c\xb2, 1/4 \u044d\u0442.', u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 37 \u043c\xb2, 1/9 \u044d\u0442.', u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 42 \u043c\xb2, 3/4 \u044d\u0442.', u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 45 \u043c\xb2, 3/3 \u044d\u0442.',]
which is obviously a list of all titles encoded in Unicode.
Now, here comes the question. I would like to have these items (the values of the above list) in their original form, as they are written in the original language on the web page. For example, I would like to have a dictionary:
{'title': 'the_first_value_of_the_above_list_in_original_language'}
And later store the list of such dictionaries in a JSON or CSV file.
Is it possible to decode these unicode strings and to get their original values?
*p.s. I also noticed that I get the original value using the print function in the Python shell:
>>> str = u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 56 \u043c\xb2, 4/5 \u044d\u0442.'
>>> print str
but I have no idea how to extract this value and write it to a file*
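Incorrect. The strings are not "encoded"; what you see is the representation (repr) of the characters contained in each string, which is how Python 2 displays unicode strings inside a list. The strings themselves do contain the characters you expect, as you've discovered using print in the REPL.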
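To make the distinction concrete, here is a minimal sketch, assuming Python 3, where the built-in ascii() produces roughly the escaped form that Python 2's repr() showed in the question. The names title, escaped, letter and item are just illustrative.

```python
# One title copied from the question's output.
title = u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 56 \u043c\xb2, 4/5 \u044d\u0442.'

# ascii() builds the escaped representation, similar to what Python 2's
# repr() shows when a list of unicode strings is displayed.
escaped = ascii(title)

# The string itself holds real characters: index 2 of the stripped title is
# the Cyrillic letter "ka" (U+043A), one character, not a six-character escape.
letter = title.strip()[2]

# A dictionary built from the string keeps those characters unchanged.
item = {'title': title.strip()}

print(escaped)                        # ASCII-only text containing \u043a etc.
print(len(letter), hex(ord(letter)))  # 1 0x43a
```

So there is nothing to decode: the escapes only appear when the string is displayed via its representation, and item['title'] already holds the Russian text.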
If you need to write those characters out to a file then you will need to choose an encoding for the file and use it on opening.
with io.open('output.txt', 'w', encoding='utf-8') as fp:
    fp.write(title)  # needs `import io`; title stands for one of the extracted strings
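For the JSON and CSV files mentioned in the question, the same idea applies: pick an encoding when opening the file. The following is only a sketch, assuming Python 3; the sample titles are copied from the question, while the temporary directory and the file names titles.json and titles.csv are made up for illustration.

```python
import csv
import io
import json
import os
import tempfile

# Sample data copied from the question; in a real spider these would come
# from response.xpath(...).extract().
raw_titles = [
    u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 56 \u043c\xb2, 4/5 \u044d\u0442.',
    u'\n 1-\u043a \u043a\u0432\u0430\u0440\u0442\u0438\u0440\u0430, 32 \u043c\xb2, 3/3 \u044d\u0442.',
]

# One dictionary per listing, with surrounding whitespace stripped.
items = [{'title': t.strip()} for t in raw_titles]

# Hypothetical output location, used here so the sketch leaves no stray files.
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, 'titles.json')
csv_path = os.path.join(out_dir, 'titles.csv')

# ensure_ascii=False keeps the Cyrillic characters instead of \uXXXX escapes;
# the file encoding decides how they are stored as bytes.
with io.open(json_path, 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(items, ensure_ascii=False, indent=2))

# newline='' is what the csv module expects for files it writes to.
with io.open(csv_path, 'w', encoding='utf-8', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames=['title'])
    writer.writeheader()
    writer.writerows(items)
```

Leaving ensure_ascii at its default would still produce valid JSON, just with \uXXXX escapes in the file; any JSON parser reads both forms back to the same strings.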