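I created a web crawler in Python 2.7 and use MySQLdb to insert the data into a database.
I ran each function as a separate script against different web pages, but after I put them together as functions in a single file, the program shows this error:
(after entering the seed page and depth)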
Traceback (most recent call last):
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 207, in <module>
    mainFunc(depth,url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 194, in mainFunc
    lst=perPage(url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 186, in perPage
    filterContent(url,page)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 149, in filterContent
    cursor.execute(sql)
  File "C:\Python27\lib\site-packages\MySQLdb\cursors.py", line 202, in execute
    self.errorhandler(self, exc, value)
  File "C:\Python27\lib\site-packages\MySQLdb\connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 's and specials"/>\n
I can't seem to find any problem. Here is the code:
def metaContent(page,url):#EXTRACTS META TAG CONTENT
    lst=[]
    while page.find("<meta")!=-1:
        start_link=page.find("<meta")
        page=page[start_link:]
        start_link=page.find("content=")
        start_quote=page.find('"',start_link)
        end_quote=page.find('"',start_quote+1)
        metaTag=page[start_quote+1:end_quote]
        page=page[end_quote:]
        lst.append(metaTag)
    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO META(URL, \
        KEYWORD) \
        VALUES ('%s','%s')" % \
        (url,lst[i])
        cursor.execute(sql)
    db.commit()
def filterContent(page,url):#FILTERS THE CONTENT OF THE REMAINING PORTION
    phrase = ['to','a','an','the',"i'm",\
        'for','from','that','their',\
        'i','my','your','you','mine',\
        'we','okay','yes','no','as',\
        'if','but','why','can','now',\
        'are','is','also']
    #CALLS FUNC TO REMOVE HTML TAGS
    page = strip_tags(page)
    #CONVERT TO LOWERCASE
    page = page.lower()
    #REMOVES WHITESPACES
    page = page.split()
    page = " ".join(page)
    #REMOVES IDENTICAL WORDS AND COMMON WORDS
    page = set(page.split())
    page.difference_update(phrase)
    #CONVERTS FROM SET TO LIST
    lst = list(page)
    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO WORDS(URL, \
        KEYWORD) \
        VALUES ('%s','%s')" % \
        (url,lst[i])
        cursor.execute(sql)
    db.commit()
#<6>
def perPage(url):#CALLS ALL THE FUNCTIONS
    page=pageContent(url)
    #REMOVES CONTENT BETWEEN SCRIPT TAGS
    flg=0
    while page.find("<script",flg)!=-1:
        start=page.find("<script",flg)
        end=page.find("</script>",flg)
        end=end+9
        i,k=0,end-start
        page=list(page)
        while i<k:
            page.pop(start)
            i=i+1
        page=''.join(page)
        flg=start
    #REMOVES CONTENT BETWEEN STYLE TAGS
    flg=0
    while page.find("<script",flg)!=-1:
        start=page.find("<style",flg)
        end=page.find("</style>",flg)
        end=end+9
        i,k=0,end-start
        page=list(page)
        while i<k:
            page.pop(start)
            i=i+1
        page=''.join(page)
        flg=start
    metaContent(url,page)
    lst=linksExt(url,page)
    filterContent(url,page)
    return lst#CHECK WHETHER NEEDED OR NOT
#<7>
crawled=[]
def mainFunc(depth,url):#FOR THE DEPTH MANIPULATION
    if (depth):
        lst=perPage(url)
        crawled.append(url)
        i=0
        if (depth-1):
            while i<len(lst):
                if url[i] not in crawled:
                    mainFunc(depth-1,url[i])
                i+=1
#CALLING MAIN FUNCTION
mainFunc(depth,url)
Please point out any mistakes, especially in the depth-handling function (mainFunc()). Anything about improving the crawler would be helpful.
Best answer
It's definitely a SQL error: your quotes are not being escaped.
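To see why an unescaped quote breaks the statement, here is a minimal sketch that builds the SQL text the same way the question does and prints it. The `url` and `keyword` values are made up for illustration and are not taken from the actual crawl.

```python
# Hypothetical values chosen for illustration; they are not from the crawl.
url = "http://example.com/"
keyword = "editor's"

# Same string formatting as in the question: the value is pasted into the SQL text.
sql = "INSERT INTO WORDS(URL, KEYWORD) VALUES ('%s','%s')" % (url, keyword)
print(sql)

# The apostrophe inside the keyword ends the quoted SQL literal early,
# so the server finds stray text after it and reports syntax error 1064.
```

Any page text containing a single quote would produce the same kind of malformed statement.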
Instead of this
    sql = "INSERT INTO META(URL, \
    KEYWORD) \
    VALUES ('%s','%s')" % \
    (url,lst[i])
    cursor.execute(sql)
and this
    sql = "INSERT INTO WORDS(URL, \
    KEYWORD) \
    VALUES ('%s','%s')" % \
    (url,lst[i])
    cursor.execute(sql)
try this
    sql = "INSERT INTO WORDS(URL, \
    KEYWORD) \
    VALUES (%s, %s)"
    cursor.execute(sql, (url, lst[i]))
and this
    sql = "INSERT INTO META(URL, \
    KEYWORD) \
    VALUES (%s, %s)"
    cursor.execute(sql, (url, lst[i]))
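A runnable sketch of the same idea follows. It uses the standard-library `sqlite3` module as a stand-in so that it works without a MySQL server, which is an assumption made only for this example; `sqlite3` writes placeholders as `?`, while MySQLdb uses `%s` as shown above. The table layout and the values are hypothetical.

```python
import sqlite3

# Stand-in database (assumption): lets the sketch run without a MySQL server.
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE WORDS (URL TEXT, KEYWORD TEXT)")

url = "http://example.com/"   # hypothetical value
keyword = "editor's"          # contains the kind of quote that broke the formatted SQL

# The values travel separately from the SQL text, so the driver handles quoting.
# sqlite3 uses "?" placeholders; with MySQLdb the same call would use "%s".
cursor.execute("INSERT INTO WORDS (URL, KEYWORD) VALUES (?, ?)", (url, keyword))
db.commit()

rows = cursor.execute("SELECT URL, KEYWORD FROM WORDS").fetchall()
print(rows)
```

The keyword is stored intact, apostrophe included, because it is never spliced into the statement text.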
You are also using while without incrementing i, so the loop never advances. Instead you can use this
    for keyword in lst:
        sql = "INSERT INTO META(URL, \
        KEYWORD) \
        VALUES (%s, %s)"
        cursor.execute(sql, (url, keyword))
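On the depth handling the question asks about: mainFunc() indexes `url[i]`, which picks single characters out of the URL string, where the list of links `lst[i]` was presumably intended. Below is a minimal sketch of a depth-limited crawl under that assumption. The `LINKS` graph and `get_links()` are hypothetical stand-ins for perPage(), so the example runs without network access.

```python
# Hypothetical link graph standing in for real pages (assumption for this sketch).
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

def get_links(url):
    """Stand-in for perPage(url): return the links found on a page."""
    return LINKS.get(url, [])

crawled = []

def crawl(depth, url):
    """Visit url, then follow its links while depth allows."""
    if depth <= 0 or url in crawled:
        return
    crawled.append(url)
    for link in get_links(url):   # iterate over the links, not the characters of url
        crawl(depth - 1, link)

crawl(2, "a")
print(crawled)
```

Separately, perPage() passes `(url,page)` to metaContent() and filterContent(), which are declared as `(page,url)`, so the two values appear to arrive swapped. That would explain why HTML fragments show up in the SQL error message.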
Regarding "python - web crawler not working properly", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25242281/