Problem description
I am trying to remove the Unicode "u'" markers from the strings in my list. The list is a list of actors from this site: http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm
I have a method that gets these strings from the website:
import requests
from bs4 import BeautifulSoup

def getActors(item_url):
    response = requests.get(item_url)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    tempActors = []
    try:
        tempActors.append(soup.find(text="Actors:").find_parent("tr").find_all(text=True)[1:])
    except AttributeError:
        tempActors.append("n/a")
    return tempActors
This method puts each movie's actors into a temporary list. I call it later, in a web-crawling method, with
listOfActors.append(getActors(href))
to append all of these temporary lists to one big list of every movie's actors.
Later, I write this list to a CSV file with
for item in listOfActors:
    wr.writerow((item))
The output currently looks like this:
[u'Jennifer Lawrence', u'Josh Hutcherson', u'Liam Hemsworth', u'Elizabeth Banks', u'Stanley Tucci', u'Woody Harrelson', u'Philip Seymour Hoffman', u'Jeffrey Wright', u'Jena Malone', u'Amanda Plummer', u'Sam Claflin', u'Donald Sutherland', u'Lenny Kravitz']
[u'Robert Downey, Jr.', u'Gwyneth Paltrow', u'Don Cheadle', u'Guy Pearce', u'Rebecca Hall', u'James Badge Dale', u'Jon Favreau', u'Ben Kingsley', u'Paul Bettany*', u' ', u'(Voice)', u'Mark Ruffalo*', u' ', u'(Cameo)']
I tried using the str() method, but I don't think it works: either I'm not putting it in the right place, or it isn't the right way to do this. The issue is that I'm not getting each actor individually; each movie's actors are clumped together in one list, so I don't know how to convert the entire list.
Recommended answer
A small example that reproduces the problem would make it much easier to correct your mistakes. Lacking that, here's an example that uses the UnicodeWriter class straight from the Python 2 csv module documentation. Just make sure your data is a list of lists of Unicode strings:
#!python2
#coding:utf8
import csv
import cStringIO
import codecs

data = [[u'Chinese',u'English'],
        [u'马克',u'Mark'],
        [u'你好',u'Hello']]

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

with open('out.csv','wb') as f:
    w = UnicodeWriter(f)
    w.writerows(data)
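
As a supplementary sketch, not part of the original answer: the u'...' markers in the question's output appear to come from the shape of the data rather than from the strings themselves. getActors appends a whole list to tempActors, so each CSV row holds a single cell containing a list, and the csv writer stores that cell as the list's printed form. Flattening each row so that it is a plain list of strings avoids this. The example below assumes Python 3, where the csv module handles Unicode text without a wrapper class; flatten_row, write_rows, and the sample data are hypothetical names introduced only for illustration.

```python
import csv
import io


def flatten_row(row):
    """Flatten one level of nesting so the row is a plain list of strings."""
    flat = []
    for cell in row:
        if isinstance(cell, (list, tuple)):
            flat.extend(cell)   # unpack a nested list of actor names
        else:
            flat.append(cell)   # keep plain strings such as "n/a"
    return flat


def write_rows(rows, stream):
    """Write each flattened row to an already opened text stream."""
    writer = csv.writer(stream)
    for row in rows:
        writer.writerow(flatten_row(row))


# Hypothetical data shaped like the question's listOfActors:
# every entry wraps another list, or holds the "n/a" placeholder.
list_of_actors = [
    [["Jennifer Lawrence", "Josh Hutcherson", "Liam Hemsworth"]],
    [["Robert Downey, Jr.", "马克"]],
    ["n/a"],
]

# An in-memory buffer stands in for a real file in this sketch.
buffer = io.StringIO(newline="")
write_rows(list_of_actors, buffer)
print(buffer.getvalue())
```

To write a real file in Python 3, the buffer could be replaced with open('out.csv', 'w', newline='', encoding='utf-8-sig'); the utf-8-sig encoding matches the default used by the UnicodeWriter above. In the question's own Python 2 code, the same flattening idea would amount to calling tempActors.extend(...) instead of tempActors.append(...) for the scraped list.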