Need to extract more information from TripAdvisor

My code:

item = TripadvisorItem()

item['url'] = response.url.encode('ascii', errors='ignore')

# Try the primary XPath first, then fall back to the address block.
# Index into the result only after the fallback, so an empty list is handled.
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()
if not item['state']:
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['state'] = item['state'][0].encode('ascii', errors='ignore')

item['city'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if not item['city']:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if not item['city']:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')

item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')

reviews = hxs.xpath('.//div[contains(@id, "review")]')


1. Every hotel on TripAdvisor has an ID number, such as 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS

How can I extract this ID from the TripAdvisor item?


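2. I need more information for each hotel: shortDescription, stars, postal code, country, and coordinates (longitude, latitude). Can I extract these?
3. I need to extract the traveler type for each review. How?

My review code: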

for review in reviews:
    it = Review()

    # Hotel-level fields are copied from the item built above.
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']

    # Prefer the @title attribute for the date; fall back to the span text.
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if not it['date']:
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if it['date']:
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed", "").strip()

    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if it['userName']:
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')

    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')

    # The title is either in a "quote" div or in a "noQuotes" span.
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if not it['reviewTitle']:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
    if it['reviewTitle']:
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')

    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if it['reviewContent']:
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()

    # The @alt text starts with the numeric rating; keep only that first token.
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if it['generalRating']:
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]



Is there a good guide on how to find these things? I get lost among all the spans and divs...

Thanks!

Best answer

Would it be acceptable to get it from the URL with a regular expression?

import re
hotel_id = re.search(r'-d([0-9]+)', url).group(1)
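As a slightly more defensive sketch of the same regular-expression idea: the helper below returns None instead of raising when the URL does not match. The function name `extract_ids` is hypothetical, and the reading of the `-g<digits>` part as a location ID is an assumption based only on the example URL in the question, not on any documented TripAdvisor URL format.

```python
import re

# Assumed URL shape: ...-g<location id>-d<hotel id>-...
# This is inferred from the example URL only; it is not a documented format.
_IDS = re.compile(r'-g(\d+)-d(\d+)')


def extract_ids(url):
    """Return (location_id, hotel_id) as strings, or None if no match."""
    match = _IDS.search(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


url = ("http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-"
       "Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS")
print(extract_ids(url))  # ('60763', '80075')
```

In the spider this could be called with `response.url`, and the second element stored on the item, provided the item class declares a field for it.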

Regarding "python - Crawling TripAdvisor reviews: extracting more hotel and user information", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31149708/
