Problem description
I have the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://api.stlouisfed.org/fred/...")
bsObj = BeautifulSoup(html.read(), "lxml")
print(bsObj)
It returns the following:
<?xml version="1.0" encoding="utf-8" ?><html><body><observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
<observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
<observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
<observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
<observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
<observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
<observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
<observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
<observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
<observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
<observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
<observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
</observations>
</body></html>
I want to extract only the "date" and the "value", so that I finally end up with something like this:
1947-04-01 -0.4
1947-07-01 -0.4
1947-10-01 6.4
1948-01-01 6
and so on...
So far I'm using replace to scrape the text and import csv for the csv file:
string = str(bsObj)
string = string.replace("realtime_start=","")
string = string.replace("realtime_end=","")
string = string.replace("observation","")
string = string.replace("date=","")
string = string.replace('"2016-06-22"',"")
string = string.replace("value=","")
string = string.replace("<","")
string = string.replace(">","")
string = string.replace("/","")
string = string.replace('"',"")
print(string)
import csv
with open('test.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = string
    a.writerows(data)
This one is almost a disaster, though. It pushes the text into the csv, but every symbol goes onto a new row.
I want to know if there is a more elegant way to extract what I need. For example:
for line in f:
    extract "date" and "value"
or similar. And what is the most appropriate way to insert it into a .csv file? I'll be rewriting the .csv file every time I call this script. The fields have to be separated by "," and the lines by "\n".
Recommended answer
Find all the observation tags and just extract the attributes you want:
x = """<?xml version="1.0" encoding="utf-8" ?><html><body><observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
<observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
<observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
<observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
<observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
<observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
<observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
<observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
<observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
<observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
<observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
<observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
</observations>
</body></html>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(x, "lxml")
# each <observation> tag exposes its attributes like a dict
for ob in soup.find_all("observation"):
    print(ob["date"])
    print(ob["value"])
Which will give you:
1947-04-01
-0.4
1947-07-01
-0.4
1947-10-01
6.4
1948-01-01
6
1948-04-01
6.7
1948-07-01
2.3
1948-10-01
0.4
1949-01-01
-5.4
1949-04-01
-1.3
1949-07-01
4.5
1949-10-01
-3.5
1950-01-01
16.9
1950-04-01
12.7
1950-07-01
16.3
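If you would rather have each date and value on one line, as in the question, a minimal sketch using the standard library's xml.etree.ElementTree could look like the following. Note that this swaps BeautifulSoup for ElementTree, and it assumes the raw response is well-formed XML whose root element is <observations> (the html/body wrapper in the printed output comes from parsing with the lxml HTML parser). SAMPLE and extract_pairs are made-up names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the shape of the response shown above.
SAMPLE = """<observations count="3">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
</observations>"""


def extract_pairs(xml_text):
    """Return a list of (date, value) string tuples, one per <observation>."""
    root = ET.fromstring(xml_text)
    # iter() walks the tree and yields every <observation> element
    return [(ob.get("date"), ob.get("value")) for ob in root.iter("observation")]


pairs = extract_pairs(SAMPLE)
for date, value in pairs:
    print(date, value)
```

The values stay as strings here; convert them with float() only if you need numbers rather than text for the csv.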
To write to a csv:
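from bs4 import BeautifulSoup
import csv
soup = BeautifulSoup(x, "lxml")
# newline="" lets the csv module control the line endings itself
with open("out.csv", "w", newline="") as f:
    # one (date, value) row per <observation> tag
    csv.writer(f).writerows(
        (ob["date"], ob["value"]) for ob in soup.find_all("observation")
    )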
Which gives you a csv file with:
1947-04-01,-0.4
1947-07-01,-0.4
1947-10-01,6.4
1948-01-01,6
1948-04-01,6.7
1948-07-01,2.3
1948-10-01,0.4
1949-01-01,-5.4
1949-04-01,-1.3
1949-07-01,4.5
1949-10-01,-3.5
1950-01-01,16.9
1950-04-01,12.7
1950-07-01,16.3
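As for why the attempt in the question put every symbol on its own row: writerows expects an iterable of rows, and iterating over a plain string yields one character at a time, so each character is written as a separate one-field row. The sketch below illustrates this with an in-memory buffer instead of a real file, and also shows lineterminator="\n", since the csv module's default line ending is "\r\n". The variable names and the two sample rows are only for illustration.

```python
import csv
import io

# Two illustrative rows, taken from the data shown above.
rows = [("1947-04-01", "-0.4"), ("1947-07-01", "-0.4")]

# Passing a plain string: writerows iterates it character by character,
# so every character ends up as its own one-field row.
broken = io.StringIO()
csv.writer(broken, lineterminator="\n").writerows("6.4")
broken_text = broken.getvalue()

# Passing a list of tuples: each tuple becomes one comma-separated line.
fixed = io.StringIO()
csv.writer(fixed, delimiter=",", lineterminator="\n").writerows(rows)
fixed_text = fixed.getvalue()

print(repr(broken_text))
print(repr(fixed_text))
```

With a real file, the same writer arguments apply; open the file with newline="" so that the line ending you pass is written unchanged.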