Question
I am looking for a way to download files from different pages and store them in a particular folder on my local machine. I am using Python 2.7.
Please see the field below:
Edit

Here is the HTML content:
<input type="hidden" name="supplier.orgProfiles(1152444).location.locationPurposes().extendedAttributes(Upload_RFI_Form).value.filename" value="Screenshot.docx">
<a style="display:inline; position:relative;" href="
/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
Screenshot.docx
</a>
One approach that worked, which I just tried: taking the href from the HTML content and prefixing it with, say, https://xyz.test.com, I can construct a URL like the following:

https://xyz.test.com/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz
Placing that URL in the browser and hitting Enter lets me download the file (Screenshot.docx) mentioned above. But now, how can we find how many such aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz values are present on the page?
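Pasting the joined URL into the browser works; the same prefixing can also be done in code. A minimal sketch (the base URL is the placeholder from the question; urljoin takes care of the slash between the site root and the leading-slash href):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = "https://xyz.test.com"
href = "/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz"

# urljoin resolves the leading-slash href against the site root,
# avoiding an accidental double slash from plain concatenation
full_url = urljoin(base, href)
print(full_url)
```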
Code I have tried so far

The only remaining pain point is how to download that file. The URL is constructed with this script:
for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    href = a['href'].strip()
    href = "https://xyz.test.com/" + href
    print(href)
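To answer "how many such values are present": the hrefs can simply be collected and counted. A sketch using the re module on the raw page source (assuming every download link carries that exact inline style attribute; with BeautifulSoup, len(soup.find_all(...)) on the loop above gives the same count):

```python
import re

# Stand-in for the page source; in the real script this would be
# the response body fetched from the site
raw_data = '''
<a style="display:inline; position:relative;" href="/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">Screenshot.docx</a>
'''

# Capture the href of every anchor carrying that inline style
hrefs = re.findall(r'<a style="display:inline; position:relative;" href="(.+?)"', raw_data)
print(len(hrefs))  # number of downloadable-file links found
```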
Please help me out here!

Let me know if you need any more information from me; I am happy to share it.

Thanks in advance!
Accepted answer
As @JohnZwinck suggested, you can use urllib.urlretrieve and the re module to create a list of links on a given page and download each file. Below is an example.
#!/usr/bin/python
"""
This script scrapes a page and downloads files using its anchor links.
"""

# Imports
import os
import re
import sys
import urllib
import urllib2

# Config
base_url = "http://www.google.com/"
destination_directory = "downloads"


def _usage():
    """Print the usage information."""
    print "USAGE: %s <url>" % sys.argv[0]


def _create_url_list(url):
    """
    Create a list of download URLs from the anchor links
    found on the page at the passed URL.
    """
    raw_data = urllib2.urlopen(url).read()
    raw_list = re.findall('<a style="display:inline; position:relative;" href="(.+?)"', raw_data)
    url_list = [base_url + x for x in raw_list]
    return url_list


def _get_file_name(url):
    """Return the filename extracted from the passed URL."""
    parts = url.split('/')
    return parts[len(parts) - 1]


def _download_file(url, filename):
    """
    Given a URL and a filename, save the file locally under the
    destination_directory path.
    """
    if not os.path.exists(destination_directory):
        print 'Directory [%s] does not exist, creating directory...' % destination_directory
        os.makedirs(destination_directory)
    try:
        print 'Downloading File [%s]' % filename
        urllib.urlretrieve(url, os.path.join(destination_directory, filename))
    except IOError:
        print 'Error Downloading File [%s]' % filename


def _download_all(main_url):
    """Download every file linked from main_url into the destination directory."""
    url_list = _create_url_list(main_url)
    for url in url_list:
        _download_file(url, _get_file_name(url))


def main(argv):
    """The script's launcher method."""
    if len(argv) != 1:
        _usage()
        sys.exit(1)
    _download_all(argv[0])
    print 'Finished Downloading.'


if __name__ == '__main__':
    main(sys.argv[1:])
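Python 2.7 has since reached end of life; for reference, a sketch of the same approach in Python 3 might look like the following (urllib.request replaces urllib/urllib2; the base URL, directory name, and style-attribute regex are the same assumptions as in the script above):

```python
#!/usr/bin/env python3
"""Python 3 sketch of the same scrape-and-download approach."""
import os
import re
import sys
import urllib.request

base_url = "http://www.google.com/"       # change to the target site root
destination_directory = "downloads"


def create_url_list(url):
    """Build absolute download URLs from matching anchor hrefs on the page."""
    raw_data = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    raw_list = re.findall(
        r'<a style="display:inline; position:relative;" href="(.+?)"',
        raw_data, re.DOTALL)
    return [base_url + href.strip() for href in raw_list]


def get_file_name(url):
    """Take everything after the last '/' as the local file name."""
    return url.rsplit('/', 1)[-1]


def download_file(url, filename):
    """Save the file under destination_directory, creating it if needed."""
    os.makedirs(destination_directory, exist_ok=True)
    print('Downloading File [%s]' % filename)
    urllib.request.urlretrieve(url, os.path.join(destination_directory, filename))


def main(argv):
    if len(argv) != 1:
        print('USAGE: %s <url>' % sys.argv[0])
        sys.exit(1)
    for url in create_url_list(argv[0]):
        download_file(url, get_file_name(url))
    print('Finished Downloading.')

# To run as a script: main(sys.argv[1:])
```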
You can change the base_url and the destination_directory according to your needs, and save the script as download.py. Then use it from the terminal like this:
python download.py http://www.example.com/?page=1