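I am trying to write a Python script that uses the requests module to make HTTP requests and fetch data from the Bureau of Justice Statistics. The page I am requesting data from has "multiple select" fields that let the user pick one or more options from a list.
The page I am trying to download data from is: http://www.ucrdatatool.gov/Search/Crime/Local/OneYearofData.cfm
This is the form I want to submit (it is the second step of the download process, reached after you submit the "state" select form at the link above):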
<form name="CFForm_1" id="CFForm_1" action="RunCrimeOneYearofData.cfm" method="post" onsubmit="return _CF_checkCFForm_1(this)">
<INPUT TYPE="Hidden" Name="StateId" Value="1">
<INPUT TYPE="Hidden" Name="BJSPopulationGroupId" Value="">
<table width="94%" border="0" height="151">
<tr>
<td width="27%" valign="top"><font size="2" class="text"><b>
<LABEL FOR="agencies">a. Choose one or more agencies:</LABEL>
</b></font><BR> <BR> <font size="2" class="text">
<select name="CrimeCrossId" size="4" MULTIPLE ID="agencies">
<option value="102" >Alabaster Police Dept</option>
<option value="104" >Albertville Police Dept</option>
<option value="105" >Alexander City Police Dept</option>
<option value="110" >Anniston Police Dept</option>
<option value="119" >Athens Police Dept</option>
<option value="120" >Atmore Police Dept</option>
<option value="122" >Auburn Police Dept</option>
<option value="127" >Baldwin County Sheriff Deptartment</option>
<option value="134" >Bessemer Police Dept</option>
<option value="136" >Birmingham Police Dept</option>
<option value="138" >Blount County Sheriff Department</option>
<option value="156" >Calera Police Dept</option>
<option value="157" >Calhoun County Sheriff Department</option>
<option value="174" >Chilton County Sheriff Department</option>
<option value="204" >Cullman County Sheriff Department</option>
<option value="205" >Cullman Police Dept</option>
<option value="210" >Daphne Police Dept</option>
<option value="213" >Decatur Police Dept</option>
<option value="214" >Dekalb County Sheriff Department</option>
<option value="218" >Dothan Police Dept</option>
<option value="228" >Elmore County Sheriff Department</option>
<option value="229" >Enterprise Police Dept</option>
<option value="232" >Etowah County Sheriff Department</option>
<option value="233" >Eufaula Police Dept</option>
<option value="237" >Fairfield Police Dept</option>
<option value="238" >Fairhope Police Dept</option>
<option value="247" >Florence Police Dept</option>
<option value="248" >Foley Police Dept</option>
<option value="251" >Fort Payne Police Dept</option>
<option value="259" >Gadsden Police Dept</option>
<option value="262" >Gardendale Police Dept</option>
<option value="281" >Gulf Shores Police Dept</option>
<option value="292" >Hartselle Police Dept</option>
<option value="296" >Helena Police Dept</option>
<option value="305" >Homewood Police Dept</option>
<option value="306" >Hoover Police Dept</option>
<option value="307" >Houston County Sheriff Department</option>
<option value="308" >Hueytown Police Dept</option>
<option value="310" >Huntsville Police Dept</option>
<option value="314" >Irondale Police Dept</option>
<option value="315" >Jackson County Sheriff Department</option>
<option value="318" >Jacksonville Police Dept</option>
<option value="320" >Jasper Police Dept</option>
<option value="321" >Jefferson County Sheriff Department</option>
<option value="334" >Lauderdale County Sheriff Department</option>
<option value="335" >Lawrence County Sheriff Department</option>
<option value="337" >Lee County Sheriff Department</option>
<option value="338" >Leeds Police Dept</option>
<option value="343" >Limestone County Sheriff Department</option>
<option value="358" >Madison County Sheriff Department</option>
<option value="359" >Madison Police Dept</option>
<option value="365" >Marshall County Sheriff Department</option>
<option value="371" >Millbrook Police Dept</option>
<option value="374" >Mobile County Sheriff Department</option>
<option value="375" >Mobile Police Dept</option>
<option value="381" >Montgomery Police Dept</option>
<option value="382" >Moody Police Dept</option>
<option value="383" >Morgan County Sheriff Department</option>
<option value="388" >Mountain Brook Police Dept</option>
<option value="391" >Muscle Shoals Police Dept</option>
<option value="400" >Northport Police Dept</option>
<option value="406" >Opelika Police Dept</option>
<option value="410" >Oxford Police Dept</option>
<option value="411" >Ozark Police Dept</option>
<option value="413" >Pelham Police Dept</option>
<option value="414" >Pell City Police Dept</option>
<option value="417" >Phenix Police Dept</option>
<option value="426" >Pleasant Grove Police Dept</option>
<option value="429" >Prattville Police Dept</option>
<option value="431" >Prichard Police Dept</option>
<option value="451" >Saraland Police Dept</option>
<option value="454" >Scottsboro Police Dept</option>
<option value="456" >Selma Police Dept</option>
<option value="458" >Shelby County Sheriff Department</option>
<option value="470" >St. Clair County Sheriff Department</option>
<option value="478" >Sylacauga Police Dept</option>
<option value="481" >Talladega County Sheriff Department</option>
<option value="482" >Talladega Police Dept</option>
<option value="497" >Troy Police Dept</option>
<option value="500" >Trussville Police Dept</option>
<option value="501" >Tuscaloosa County Sheriff Department</option>
<option value="502" >Tuscaloosa Police Dept</option>
<option value="517" >Vestavia Hills Police Dept</option>
<option value="522" >Walker County Sheriff Department</option>
</select>
</font> </td>
<td width="34%" valign="top"><font size="2" class="text"><b>
<LABEL FOR="groups">b. Choose one or more variable groups:</LABEL>*
</b></font><BR>
<BR> <font size="2" class="text">
<select name="DataType" size="4" Multiple ID="groups">
<option value="1" >Number
of violent crimes</option>
<option value="2" >Number
of property crimes</option>
<option value="3" >Violent
crime rates</option>
<option value="4" >Property
crime rates</option>
</select>
</font> </td>
<td width="31%" rowspan="2" valign="top" NOWRAP><font size="2" class="text"><b>
<LABEL FOR="year">c. Choose one year:</LABEL>
</b></font><BR> <BR> <font size="2" class="text">
<SELECT Name="YearStart" Size="1" ID="year">
<OPTION Value="1985" >
1985 </OPTION>
<OPTION Value="1986" >
1986 </OPTION>
<OPTION Value="1987" >
1987 </OPTION>
<OPTION Value="1988" >
1988 </OPTION>
<OPTION Value="1989" >
1989 </OPTION>
<OPTION Value="1990" >
1990 </OPTION>
<OPTION Value="1991" >
1991 </OPTION>
<OPTION Value="1992" >
1992 </OPTION>
<OPTION Value="1993" >
1993 </OPTION>
<OPTION Value="1994" >
1994 </OPTION>
<OPTION Value="1995" >
1995 </OPTION>
<OPTION Value="1996" >
1996 </OPTION>
<OPTION Value="1997" >
1997 </OPTION>
<OPTION Value="1998" >
1998 </OPTION>
<OPTION Value="1999" >
1999 </OPTION>
<OPTION Value="2000" >
2000 </OPTION>
<OPTION Value="2001" >
2001 </OPTION>
<OPTION Value="2002" >
2002 </OPTION>
<OPTION Value="2003" >
2003 </OPTION>
<OPTION Value="2004" >
2004 </OPTION>
<OPTION Value="2005" >
2005 </OPTION>
<OPTION Value="2006" >
2006 </OPTION>
<OPTION Value="2007" >
2007 </OPTION>
<OPTION Value="2008" >
2008 </OPTION>
<OPTION Value="2009" >
2009 </OPTION>
<OPTION Value="2010" >
2010 </OPTION>
<OPTION Value="2011" >
2011 </OPTION>
<OPTION Value="2012" >
2012 </OPTION>
</SELECT>
</font> </td>
</tr>
<tr>
<td colspan="2" valign="top" NOWRAP><BR>
<table border="1" cellspacing="0" cellpadding="4" bordercolor="#999999" bgcolor="#FFFFCC" align="left" width="450">
<tr>
<td align="center" nowrap><font size="2" class="text" color="#330099"><b>Hold
down the control key to select more than one option.</b></font></td>
</tr>
</table> </td>
</tr>
<tr>
<td valign="top" NOWRAP> <BR> <BR> <p>
<input name="NextPage" type="submit" value="Get Table">
<input name="PreviousPage" type="submit" value="Previous">
<input name="Cancel" type="reset" value="Reset Form">
</p></td>
<td colspan="2" valign="top" NOWRAP><table width="300" border="0" cellspacing="0" cellpadding="3">
<tr align="left">
<td width="4%" valign="top"><strong>* </strong></td>
<td width="48%" valign="top">Violent crimes:</td>
<td colspan="2" valign="top">Property crimes :</td>
</tr>
<tr>
<td align="center" valign="top"></td>
<td valign="top"> <font class=text size=2> •murder<br>
•forcible rape<br>
•robbery<br>
•aggravated assault </font></td>
<td width="4%"> </td>
<td valign="top"> •burglary<br>
•larceny-theft<br> •motor
vehicle theft</td>
</tr>
<tr align="left">
<td colspan="4" valign="top"><FONT class=text size=2>Tables with
many variables may be very wide.</FONT> </td>
</tr>
</table>
<br> <FONT class=text
size=2>See <B><A
href="/offenses.cfm">UCR Offense Definitions</A></B>
for additional information about these crimes.</FONT> </td>
</tr>
</table>
</form>
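I am trying to select every option in each of these multiple-select fields (i.e. all agencies, all crime types, etc.) and submit an HTTP POST request that includes all of them.
When I submit this form manually in Firefox and look at the Live HTTP Headers output, I can see that the POST request contains the following query string: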
StateId=1&BJSPopulationGroupId=&CrimeCrossId=102&CrimeCrossId=104&CrimeCrossId=105&CrimeCrossId=110&CrimeCrossId=119&CrimeCrossId=120&CrimeCrossId=122&CrimeCrossId=127&CrimeCrossId=134&CrimeCrossId=136&CrimeCrossId=138&CrimeCrossId=156&CrimeCrossId=157&CrimeCrossId=174&CrimeCrossId=204&CrimeCrossId=205&CrimeCrossId=210&CrimeCrossId=213&CrimeCrossId=214&CrimeCrossId=218&CrimeCrossId=228&CrimeCrossId=229&CrimeCrossId=232&CrimeCrossId=233&CrimeCrossId=237&CrimeCrossId=238&CrimeCrossId=247&CrimeCrossId=248&CrimeCrossId=251&CrimeCrossId=259&CrimeCrossId=262&CrimeCrossId=281&CrimeCrossId=292&CrimeCrossId=296&CrimeCrossId=305&CrimeCrossId=306&CrimeCrossId=307&CrimeCrossId=308&CrimeCrossId=310&CrimeCrossId=314&CrimeCrossId=315&CrimeCrossId=318&CrimeCrossId=320&CrimeCrossId=321&CrimeCrossId=334&CrimeCrossId=335&CrimeCrossId=337&CrimeCrossId=338&CrimeCrossId=343&CrimeCrossId=358&CrimeCrossId=359&CrimeCrossId=365&CrimeCrossId=371&CrimeCrossId=374&CrimeCrossId=375&CrimeCrossId=381&CrimeCrossId=382&CrimeCrossId=383&CrimeCrossId=388&CrimeCrossId=391&CrimeCrossId=400&CrimeCrossId=406&CrimeCrossId=410&CrimeCrossId=411&CrimeCrossId=413&CrimeCrossId=414&CrimeCrossId=417&CrimeCrossId=426&CrimeCrossId=429&CrimeCrossId=431&CrimeCrossId=451&CrimeCrossId=454&CrimeCrossId=456&CrimeCrossId=458&CrimeCrossId=470&CrimeCrossId=478&CrimeCrossId=481&CrimeCrossId=482&CrimeCrossId=497&CrimeCrossId=500&CrimeCrossId=501&CrimeCrossId=502&CrimeCrossId=517&CrimeCrossId=522&DataType=1&DataType=2&DataType=3&DataType=4&YearStart=2010&NextPage=Get+Table
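For reference, the captured request repeats a key once per selected option. Here is a minimal sketch of that encoding using only the standard library. The payload is a shortened, hypothetical one (three agency IDs instead of the full list). As far as I know, requests produces the same form-encoded body when the `data` dict contains list values, but this snippet uses `urllib.parse.urlencode` rather than requests itself:

```python
from urllib.parse import urlencode

# Hypothetical, shortened payload: only three agency IDs are listed here.
payload = {
    "StateId": "1",
    "BJSPopulationGroupId": "",
    "CrimeCrossId": ["102", "104", "105"],
    "DataType": ["1", "2", "3", "4"],
    "YearStart": "2010",
    "NextPage": "Get Table",
}

# doseq=True expands each list into one key=value pair per element,
# which is how a browser submits a <select multiple> field.
body = urlencode(payload, doseq=True)
print(body)
```

Each list element becomes its own `CrimeCrossId=...` or `DataType=...` pair, and the space in `Get Table` is encoded as `+`.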
Here is the Python code I have so far. Note the part where I build post_data2: it does not work (it just sends me back to the "step one" page):
import requests
from bs4 import BeautifulSoup as BS
base_url = 'http://www.ucrdatatool.gov/Search/Crime/Local/'
dl_page_url = base_url + 'OneYearofData.cfm'
post_url = base_url + 'OneYearofDataStepTwo.cfm'
r = requests.get(dl_page_url)
page = BS(r.content)
select_states = page.find('form', id = 'CFForm_1').find('select', id = 'state')
state_choices = select_states.findAll('option')
state = state_choices[2] #DEBUGGING
#for state in state_choices:
state_id = int(state.get('value'))
state_name = state.text
post_data = { 'StateId': state_id, 'BJSPopulationGroupId' : ''}
r2 = requests.post(post_url, post_data)
page2 = BS(r2.content)
step2_form = page2.find('form', id = 'CFForm_1')
select_agencies = step2_form.find('select', id = 'agencies')
select_crimes = step2_form.find('select', id = 'groups')
select_year = step2_form.find('select', id = 'year')
agency_choices = select_agencies.findAll('option')
crime_choices = select_crimes.findAll('option')
year_choices = select_year.findAll('option')
post_data2 = {'CrimeCrossId': list([a.get('value') for a in agency_choices]),
'DataType' : list([c.get('value') for c in crime_choices]),
'YearStart': '2010'}
post_url2 = base_url + 'RunCrimeOneYearofData.cfm'
r3 = requests.post(post_url2, post_data2)
state_results_page = BS(r3.content)
What is the correct way to submit a multiple-select field like this with the Python requests module? Thanks!
Best answer
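I figured out the problem: in the second step, the POST data also has to include the two hidden fields carried over from the first form. Passing a list as the value for a multiple-select field was already fine.
So instead of: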
post_data2 = {'CrimeCrossId': list([a.get('value') for a in agency_choices]),
'DataType' : list([c.get('value') for c in crime_choices]),
'YearStart': '2010'}
I just needed to include the StateId and BJSPopulationGroupId fields in the second request as well:
# Variable names follow the script in the question (state_id, agency_choices, crime_choices).
post_data2 = {'StateId': state_id, 'BJSPopulationGroupId': '',
'CrimeCrossId': [a.get('value') for a in agency_choices],
'DataType': [c.get('value') for c in crime_choices],
'YearStart': '2010'}
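To avoid missing hidden fields like this in the future, one option is to read them from the step-two form rather than hard-coding them. The following is only a sketch under my own assumptions: `HiddenInputCollector` is a hypothetical helper name, the `sample` HTML is a trimmed copy of the form from the question, and it uses the standard library's `html.parser` so that it runs without a network connection (the same idea works with BeautifulSoup's `find_all('input')`).

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so the page's
        # upper-case "INPUT TYPE=... Name=..." markup still matches here.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


# Trimmed copy of the step-two form from the question.
sample = """
<form name="CFForm_1" id="CFForm_1" action="RunCrimeOneYearofData.cfm" method="post">
<INPUT TYPE="Hidden" Name="StateId" Value="1">
<INPUT TYPE="Hidden" Name="BJSPopulationGroupId" Value="">
<input name="NextPage" type="submit" value="Get Table">
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample)
hidden_fields = collector.fields

# Merge the hidden fields with the user-selected values (shortened here).
post_data2 = dict(hidden_fields)
post_data2.update({
    "CrimeCrossId": ["102", "104"],
    "DataType": ["1", "2", "3", "4"],
    "YearStart": "2010",
})
print(post_data2)
```

The resulting dict could then be passed as the `data` argument of `requests.post`, the same way as the hand-built `post_data2` above.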