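I am trying to write a Python script that uses the requests module to make HTTP requests and fetch data from the Bureau of Justice Statistics. The page I am requesting data from has "multiple select" fields that let the user pick one or more options from a list.
The page I am trying to download data from is: http://www.ucrdatatool.gov/Search/Crime/Local/OneYearofData.cfm
This is the form I want to submit (in the second step of the download process, after you submit the "state" select form at the link above):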

<form name="CFForm_1" id="CFForm_1" action="RunCrimeOneYearofData.cfm" method="post" onsubmit="return _CF_checkCFForm_1(this)">
        <INPUT TYPE="Hidden" Name="StateId" Value="1">

        <INPUT TYPE="Hidden" Name="BJSPopulationGroupId" Value="">


    <table width="94%" border="0" height="151">
      <tr>
        <td width="27%" valign="top"><font size="2" class="text"><b>
          <LABEL FOR="agencies">a. Choose one or more agencies:</LABEL>
          </b></font><BR> <BR> <font size="2" class="text">
          <select name="CrimeCrossId" size="4" MULTIPLE ID="agencies">

              <option value="102" >Alabaster Police Dept</option>

              <option value="104" >Albertville Police Dept</option>

              <option value="105" >Alexander City Police Dept</option>

              <option value="110" >Anniston Police Dept</option>

              <option value="119" >Athens Police Dept</option>

              <option value="120" >Atmore Police Dept</option>

              <option value="122" >Auburn Police Dept</option>

              <option value="127" >Baldwin County Sheriff Deptartment</option>

              <option value="134" >Bessemer Police Dept</option>

              <option value="136" >Birmingham Police Dept</option>

              <option value="138" >Blount County Sheriff Department</option>

              <option value="156" >Calera Police Dept</option>

              <option value="157" >Calhoun County Sheriff Department</option>

              <option value="174" >Chilton County Sheriff Department</option>

              <option value="204" >Cullman County Sheriff Department</option>

              <option value="205" >Cullman Police Dept</option>

              <option value="210" >Daphne Police Dept</option>

              <option value="213" >Decatur Police Dept</option>

              <option value="214" >Dekalb County Sheriff Department</option>

              <option value="218" >Dothan Police Dept</option>

              <option value="228" >Elmore County Sheriff Department</option>

              <option value="229" >Enterprise Police Dept</option>

              <option value="232" >Etowah County Sheriff Department</option>

              <option value="233" >Eufaula Police Dept</option>

              <option value="237" >Fairfield Police Dept</option>

              <option value="238" >Fairhope Police Dept</option>

              <option value="247" >Florence Police Dept</option>

              <option value="248" >Foley Police Dept</option>

              <option value="251" >Fort Payne Police Dept</option>

              <option value="259" >Gadsden Police Dept</option>

              <option value="262" >Gardendale Police Dept</option>

              <option value="281" >Gulf Shores Police Dept</option>

              <option value="292" >Hartselle Police Dept</option>

              <option value="296" >Helena Police Dept</option>

              <option value="305" >Homewood Police Dept</option>

              <option value="306" >Hoover Police Dept</option>

              <option value="307" >Houston County Sheriff Department</option>

              <option value="308" >Hueytown Police Dept</option>

              <option value="310" >Huntsville Police Dept</option>

              <option value="314" >Irondale Police Dept</option>

              <option value="315" >Jackson County Sheriff Department</option>

              <option value="318" >Jacksonville Police Dept</option>

              <option value="320" >Jasper Police Dept</option>

              <option value="321" >Jefferson County Sheriff Department</option>

              <option value="334" >Lauderdale County Sheriff Department</option>

              <option value="335" >Lawrence County Sheriff Department</option>

              <option value="337" >Lee County Sheriff Department</option>

              <option value="338" >Leeds Police Dept</option>

              <option value="343" >Limestone County Sheriff Department</option>

              <option value="358" >Madison County Sheriff Department</option>

              <option value="359" >Madison Police Dept</option>

              <option value="365" >Marshall County Sheriff Department</option>

              <option value="371" >Millbrook Police Dept</option>

              <option value="374" >Mobile County Sheriff Department</option>

              <option value="375" >Mobile Police Dept</option>

              <option value="381" >Montgomery Police Dept</option>

              <option value="382" >Moody Police Dept</option>

              <option value="383" >Morgan County Sheriff Department</option>

              <option value="388" >Mountain Brook Police Dept</option>

              <option value="391" >Muscle Shoals Police Dept</option>

              <option value="400" >Northport Police Dept</option>

              <option value="406" >Opelika Police Dept</option>

              <option value="410" >Oxford Police Dept</option>

              <option value="411" >Ozark Police Dept</option>

              <option value="413" >Pelham Police Dept</option>

              <option value="414" >Pell City Police Dept</option>

              <option value="417" >Phenix Police Dept</option>

              <option value="426" >Pleasant Grove Police Dept</option>

              <option value="429" >Prattville Police Dept</option>

              <option value="431" >Prichard Police Dept</option>

              <option value="451" >Saraland Police Dept</option>

              <option value="454" >Scottsboro Police Dept</option>

              <option value="456" >Selma Police Dept</option>

              <option value="458" >Shelby County Sheriff Department</option>

              <option value="470" >St. Clair County Sheriff Department</option>

              <option value="478" >Sylacauga Police Dept</option>

              <option value="481" >Talladega County Sheriff Department</option>

              <option value="482" >Talladega Police Dept</option>

              <option value="497" >Troy Police Dept</option>

              <option value="500" >Trussville Police Dept</option>

              <option value="501" >Tuscaloosa County Sheriff Department</option>

              <option value="502" >Tuscaloosa Police Dept</option>

              <option value="517" >Vestavia Hills Police Dept</option>

              <option value="522" >Walker County Sheriff Department</option>

          </select>
          </font> </td>
        <td width="34%" valign="top"><font size="2" class="text"><b>
          <LABEL FOR="groups">b. Choose one or more variable groups:</LABEL>*
                    </b></font><BR>
          <BR> <font size="2" class="text">
          <select name="DataType" size="4" Multiple ID="groups">

              <option value="1" >Number
              of violent crimes</option>
              <option value="2" >Number
              of property crimes</option>
              <option value="3" >Violent
              crime rates</option>
              <option value="4" >Property
              crime rates</option>

          </select>
        </font> </td>
        <td width="31%" rowspan="2" valign="top" NOWRAP><font size="2" class="text"><b>
          <LABEL FOR="year">c. Choose one year:</LABEL>
          </b></font><BR> <BR> <font size="2" class="text">
          <SELECT Name="YearStart" Size="1" ID="year">

                  <OPTION Value="1985" >
                  1985 </OPTION>

                  <OPTION Value="1986" >
                  1986 </OPTION>

                  <OPTION Value="1987" >
                  1987 </OPTION>

                  <OPTION Value="1988" >
                  1988 </OPTION>

                  <OPTION Value="1989" >
                  1989 </OPTION>

                  <OPTION Value="1990" >
                  1990 </OPTION>

                  <OPTION Value="1991" >
                  1991 </OPTION>

                  <OPTION Value="1992" >
                  1992 </OPTION>

                  <OPTION Value="1993" >
                  1993 </OPTION>

                  <OPTION Value="1994" >
                  1994 </OPTION>

                  <OPTION Value="1995" >
                  1995 </OPTION>

                  <OPTION Value="1996" >
                  1996 </OPTION>

                  <OPTION Value="1997" >
                  1997 </OPTION>

                  <OPTION Value="1998" >
                  1998 </OPTION>

                  <OPTION Value="1999" >
                  1999 </OPTION>

                  <OPTION Value="2000" >
                  2000 </OPTION>

                  <OPTION Value="2001" >
                  2001 </OPTION>

                  <OPTION Value="2002" >
                  2002 </OPTION>

                  <OPTION Value="2003" >
                  2003 </OPTION>

                  <OPTION Value="2004" >
                  2004 </OPTION>

                  <OPTION Value="2005" >
                  2005 </OPTION>

                  <OPTION Value="2006" >
                  2006 </OPTION>

                  <OPTION Value="2007" >
                  2007 </OPTION>

                  <OPTION Value="2008" >
                  2008 </OPTION>

                  <OPTION Value="2009" >
                  2009 </OPTION>

                  <OPTION Value="2010" >
                  2010 </OPTION>

                  <OPTION Value="2011" >
                  2011 </OPTION>

                  <OPTION Value="2012" >
                  2012 </OPTION>

          </SELECT>
          </font> </td>
      </tr>
      <tr>
        <td colspan="2" valign="top" NOWRAP><BR>
          <table border="1" cellspacing="0" cellpadding="4" bordercolor="#999999" bgcolor="#FFFFCC" align="left" width="450">
            <tr>
              <td align="center" nowrap><font size="2" class="text" color="#330099"><b>Hold
                down the control key to select more than one option.</b></font></td>
            </tr>
          </table>        </td>
      </tr>
      <tr>
        <td valign="top" NOWRAP> <BR> <BR> <p>
            <input name="NextPage" type="submit" value="Get Table">
            <input name="PreviousPage" type="submit" value="Previous">
            <input name="Cancel" type="reset" value="Reset Form">
          </p></td>
        <td colspan="2" valign="top" NOWRAP><table width="300" border="0" cellspacing="0" cellpadding="3">
            <tr align="left">
              <td width="4%" valign="top"><strong>* </strong></td>
              <td width="48%" valign="top">Violent crimes:</td>
              <td colspan="2" valign="top">Property crimes :</td>
            </tr>
            <tr>
              <td align="center" valign="top"></td>
              <td valign="top"> <font class=text size=2> &#8226;murder<br>
                &#8226;forcible rape<br>
                &#8226;robbery<br>
                &#8226;aggravated assault </font></td>
              <td width="4%">&nbsp;</td>
              <td valign="top"> &#8226;burglary<br>
                &#8226;larceny-theft<br> &#8226;motor
                vehicle theft</td>
            </tr>
            <tr align="left">
              <td colspan="4" valign="top"><FONT class=text size=2>Tables with
                many variables may be very wide.</FONT> </td>
            </tr>
          </table>
          <br> <FONT class=text
  size=2>See <B><A
  href="/offenses.cfm">UCR Offense Definitions</A></B>
          for additional information about these crimes.</FONT> </td>
      </tr>
    </table>
    </form>

I am trying to select every option in these multi-select fields (i.e. all agencies, all crime types, etc.) and submit an HTTP POST request that contains all of them.
When I submit this form manually in Firefox and look at the output of Live HTTP Headers, I can see that the POST request contains the following query string:
StateId=1&BJSPopulationGroupId=&CrimeCrossId=102&CrimeCrossId=104&CrimeCrossId=105&CrimeCrossId=110&CrimeCrossId=119&CrimeCrossId=120&CrimeCrossId=122&CrimeCrossId=127&CrimeCrossId=134&CrimeCrossId=136&CrimeCrossId=138&CrimeCrossId=156&CrimeCrossId=157&CrimeCrossId=174&CrimeCrossId=204&CrimeCrossId=205&CrimeCrossId=210&CrimeCrossId=213&CrimeCrossId=214&CrimeCrossId=218&CrimeCrossId=228&CrimeCrossId=229&CrimeCrossId=232&CrimeCrossId=233&CrimeCrossId=237&CrimeCrossId=238&CrimeCrossId=247&CrimeCrossId=248&CrimeCrossId=251&CrimeCrossId=259&CrimeCrossId=262&CrimeCrossId=281&CrimeCrossId=292&CrimeCrossId=296&CrimeCrossId=305&CrimeCrossId=306&CrimeCrossId=307&CrimeCrossId=308&CrimeCrossId=310&CrimeCrossId=314&CrimeCrossId=315&CrimeCrossId=318&CrimeCrossId=320&CrimeCrossId=321&CrimeCrossId=334&CrimeCrossId=335&CrimeCrossId=337&CrimeCrossId=338&CrimeCrossId=343&CrimeCrossId=358&CrimeCrossId=359&CrimeCrossId=365&CrimeCrossId=371&CrimeCrossId=374&CrimeCrossId=375&CrimeCrossId=381&CrimeCrossId=382&CrimeCrossId=383&CrimeCrossId=388&CrimeCrossId=391&CrimeCrossId=400&CrimeCrossId=406&CrimeCrossId=410&CrimeCrossId=411&CrimeCrossId=413&CrimeCrossId=414&CrimeCrossId=417&CrimeCrossId=426&CrimeCrossId=429&CrimeCrossId=431&CrimeCrossId=451&CrimeCrossId=454&CrimeCrossId=456&CrimeCrossId=458&CrimeCrossId=470&CrimeCrossId=478&CrimeCrossId=481&CrimeCrossId=482&CrimeCrossId=497&CrimeCrossId=500&CrimeCrossId=501&CrimeCrossId=502&CrimeCrossId=517&CrimeCrossId=522&DataType=1&DataType=2&DataType=3&DataType=4&YearStart=2010&NextPage=Get+Table
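For reference, the repeated-key shape of that request body can be reproduced with the standard library alone. The following is a minimal sketch, not the site's or the browser's actual code: the three agency ids are an illustrative subset of the form's options, and the variable names `fields`, `body`, and `decoded` are made up for this example.

```python
from urllib.parse import parse_qs, urlencode

# Illustrative subset of the form values; the real form lists many more agencies.
fields = {
    "StateId": 1,
    "BJSPopulationGroupId": "",
    "CrimeCrossId": ["102", "104", "105"],
    "DataType": ["1", "2", "3", "4"],
    "YearStart": "2010",
    "NextPage": "Get Table",
}

# doseq=True expands each list into one key=value pair per element,
# which is how a multi-select field is serialized in a form submission.
body = urlencode(fields, doseq=True)
print(body)

# Decoding groups the repeated keys back into lists.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["CrimeCrossId"])
```

Each selected option becomes its own `CrimeCrossId=...` pair, so a multi-select field is just the same key repeated once per chosen value.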
Here is the Python code I have so far. Note the part where I try to build post_data2: it does not work (it just sends me back to the "step one" page):
import requests
from bs4 import BeautifulSoup as BS

base_url = 'http://www.ucrdatatool.gov/Search/Crime/Local/'
dl_page_url = base_url + 'OneYearofData.cfm'
post_url = base_url + 'OneYearofDataStepTwo.cfm'

r = requests.get(dl_page_url)
page = BS(r.content)

select_states = page.find('form', id = 'CFForm_1').find('select', id = 'state')
state_choices = select_states.findAll('option')

state = state_choices[2]   #DEBUGGING
#for state in state_choices:

state_id = int(state.get('value'))
state_name = state.text

post_data = { 'StateId': state_id, 'BJSPopulationGroupId' : ''}
r2 = requests.post(post_url, post_data)
page2 = BS(r2.content)

step2_form = page2.find('form', id = 'CFForm_1')
select_agencies =  step2_form.find('select', id = 'agencies')
select_crimes = step2_form.find('select', id = 'groups')
select_year =  step2_form.find('select', id = 'year')

agency_choices = select_agencies.findAll('option')
crime_choices = select_crimes.findAll('option')
year_choices = select_year.findAll('option')

post_data2 = {'CrimeCrossId': list([a.get('value') for a in agency_choices]),
              'DataType' :  list([c.get('value') for c in crime_choices]),
              'YearStart': '2010'}

post_url2 = base_url + 'RunCrimeOneYearofData.cfm'
r3 = requests.post(post_url2, post_data2)
state_results_page = BS(r3.content)

What is the correct way to submit a multi-select field like this with the Python requests module? Thanks!

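Best answer

I figured out the problem: in the second step, the POST data also has to include the two hidden fields carried over from the first form.
So instead of: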

post_data2 = {'CrimeCrossId': list([a.get('value') for a in agency_choices]),
              'DataType' :  list([c.get('value') for c in crime_choices]),
              'YearStart': '2010'}

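I just needed to include the StateId and BJSPopulationGroupId fields in the second request:
# Hidden fields carried over from the step-one form come first,
# followed by the multi-select fields passed as lists of option values.
post_data2 = {'StateId': state['id'], 'BJSPopulationGroupId': '',
              'CrimeCrossId': list([a.get('value') for a in agencies]),
              'DataType': list([c.get('value') for c in crimes]),
              'YearStart': year}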
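The effect of the fix can be checked without contacting the server. Below is a minimal sketch under stated assumptions: `build_step_two_payload` is a hypothetical helper (it appears neither in the question nor in the answer), the two agency ids and two data types are an illustrative subset, and the URL is the one built in the question's code. It relies on requests encoding a list value in `data` as repeated key=value pairs.

```python
import requests

# URL taken from the question's code (base_url + 'RunCrimeOneYearofData.cfm').
POST_URL = "http://www.ucrdatatool.gov/Search/Crime/Local/RunCrimeOneYearofData.cfm"


def build_step_two_payload(state_id, agency_ids, data_types, year):
    """Hypothetical helper: assemble the step-two form fields."""
    return {
        # Hidden fields carried over from the step-one form.
        "StateId": state_id,
        "BJSPopulationGroupId": "",
        # Multi-select fields: a list value is sent as repeated key=value pairs.
        "CrimeCrossId": list(agency_ids),
        "DataType": list(data_types),
        "YearStart": year,
    }


payload = build_step_two_payload(1, ["102", "104"], ["1", "2"], "2010")

# prepare() encodes the request locally, so nothing is sent over the network.
prepared = requests.Request("POST", POST_URL, data=payload).prepare()
print(prepared.body)
print(prepared.headers["Content-Type"])
```

The printed body can be compared against what the browser sent. The browser's request also carried `NextPage=Get+Table`; the answer does not include that field, so this sketch leaves it out as well.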
