My goal is to extract company names from a CSV file and then scrape the year each company was founded and the country/region it is located in. For example, for the following company I want to return "1989" and "Ireland":
http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=allen%20mcguire%20partners
I have been working on this for a while, using SO posts to guide me, but I can't seem to finish it. Here is the main file. It works fine, apart from the odd fact that my header doesn't seem to be recognized, so I have to use the first letter of the header to get the first (and only) column; that is fine for my purposes, though. My problem is that my web-scraping file (printed below the main function here) cannot find, and therefore cannot return, the values I want.
from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import business_week_test
input_csv = "sample.csv"
output_csv = "BUSINESS_WEEK.csv"
def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("COMPANY_NAME")
        reader = csv.DictReader(infile, fieldnames=input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("COMPANY_NAME", "LOCATION", "YEAR_FOUNDED")
            writer = csv.DictWriter(outfile, fieldnames=output_fields)
            writer.writerow(dict((h, h) for h in output_fields))
            next(reader)
            first_row = next(reader)
            for next_row in reader:
                search_term = first_row["C"]
                num_words_in_comp_name = first_row["C"].split()
                num_words_in_comp_name = len(num_words_in_comp_name)
                result = business_week_test.bwt(search_term, num_words_in_comp_name)
                first_row["LOCATION"] = result
                writer.writerow(first_row)
                first_row = next_row

if __name__ == "__main__":
    main()
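As an aside, the "odd fact" about the header has a concrete cause: `("COMPANY_NAME")` without a trailing comma is just a string, not a one-element tuple, so `DictReader` iterates over its characters and each letter becomes a fieldname, which is why `first_row["C"]` happens to work. A minimal sketch (in Python 3 syntax for illustration):

```python
import csv
import io

sample = "Allen McGuire Partners\n"

# ("COMPANY_NAME") is just the string "COMPANY_NAME" -- no tuple without a
# comma -- so DictReader iterates over its characters and each unique letter
# becomes a fieldname; the single CSV value lands under the first one, 'C'.
broken = next(csv.DictReader(io.StringIO(sample), fieldnames=("COMPANY_NAME")))
print(sorted(broken))   # ['A', 'C', 'E', 'M', 'N', 'O', 'P', 'Y', '_']
print(broken["C"])      # Allen McGuire Partners -- why first_row["C"] "works"

# The fix is a one-element tuple (note the trailing comma):
fixed = next(csv.DictReader(io.StringIO(sample), fieldnames=("COMPANY_NAME",)))
print(fixed["COMPANY_NAME"])   # Allen McGuire Partners
```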
Here is the web-scraping file:
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
def bwt(article, length):
    art2 = article.split()
    #print(art2)
    article1 = urllib.quote(article)
    #print(article1)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')]
    if (length == 1):
        link = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=" + art2[0]
    elif (length == 2):
        link = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=" + art2[0] + "%20" + art2[1]
    elif (length == 3):
        #print(art2[0], art2[1], art2[2])
        link = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=" + art2[0] + "%20" + art2[1] + "%20" + art2[2]
    #print(link)
    try:
        opener.open(link)
        #print("here")
    except urllib2.HTTPError, err:
        if err.code == 404 or err.code == 400:
            #print("here", link)
            return "NA"
        else:
            raise
    resource = opener.open(link)
    #print(resource)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    #print(soup)
    return soup.find('div', id="bodyContent").p
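Note that `article1 = urllib.quote(article)` is computed but never used, even though it already produces exactly the `%20`-encoded string the `if`/`elif` chain builds by hand (and that chain silently leaves `link` unbound for names of four or more words). A minimal sketch of the simpler approach, in Python 3 syntax, where `urllib.quote` lives at `urllib.parse.quote`:

```python
from urllib.parse import quote

BASE = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId="

def build_link(article):
    # quote() percent-encodes spaces (and other unsafe characters), so a
    # company name of any word count yields the same %20-joined URL without
    # a branch per word count.
    return BASE + quote(article)

print(build_link("allen mcguire partners"))
```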
Best Answer
Here is sample code that gets the location and the founding year for the company "A&P Group Limited":
import urllib2
from BeautifulSoup import BeautifulSoup
LINK = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=1716794"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Google Chrome')]
soup = BeautifulSoup(opener.open(LINK))
location = soup.find('div', {'itemprop': 'address'}).findAll('p')[-1].text
founded = soup.find('span', {'itemprop': "foundingDate"}).text
print location, founded
This prints:
United Kingdom 1971
Hope that helps.
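The key idea in the answer is that the snapshot page marks the interesting fields with microdata `itemprop` attributes (`address`, `foundingDate`), so no fragile positional selectors are needed. The same extraction can be sketched with only the standard library's `html.parser`, run here against a small hypothetical snippet modeled on that markup (the snippet and class name are illustrative assumptions, not the live page):

```python
from html.parser import HTMLParser

# Hypothetical snippet modeled on the snapshot page's microdata markup.
HTML = """
<div itemprop="address"><p>3 Mill Road</p><p>United Kingdom</p></div>
<span itemprop="foundingDate">1971</span>
"""

class ItempropParser(HTMLParser):
    """Collects the text of every element carrying an itemprop attribute."""
    def __init__(self):
        super().__init__()
        self.props = {}      # itemprop name -> list of text chunks
        self._current = None # itemprop currently being collected
        self._tag = None     # tag that opened it, so nested <p> don't end it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]
            self._tag = tag
            self.props.setdefault(self._current, [])

    def handle_data(self, data):
        if self._current and data.strip():
            self.props[self._current].append(data.strip())

    def handle_endtag(self, tag):
        if self._current and tag == self._tag:
            self._current = None

parser = ItempropParser()
parser.feed(HTML)
location = parser.props["address"][-1]    # last <p> holds the country
founded = parser.props["foundingDate"][0]
print(location, founded)
```

BeautifulSoup is still the more robust choice for real pages; this only shows that the `itemprop`-based lookup itself is plain attribute matching.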
Regarding python - Python web scraping with BeautifulSoup to find a company's founding year and location on Businessweek, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22303041/