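I created a web crawler in Python 2.7 and use MySQLdb to insert the data into a database.
Each function worked when I ran it as a separate script against different web pages, but once I put them together as functions in a single file, the program raises an error
(after entering the seed page and depth):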
Traceback (most recent call last):
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 207, in <module>
    mainFunc(depth,url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 194, in mainFunc
    lst=perPage(url)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 186, in perPage
    filterContent(url,page)
  File "C:\Users\Chetan\Desktop\webCrawler.py", line 149, in filterContent
    cursor.execute(sql)
  File "C:\Python27\lib\site-packages\MySQLdb\cursors.py", line 202, in execute
    self.errorhandler(self,exc,value)
  File "C:\Python27\lib\site-packages\MySQLdb\connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 's and specials." />\n
I can't seem to find the problem. Here is the code:
def metaContent(page,url):#EXTRACTS META TAG CONTENT
    lst=[]
    while page.find("<meta")!=-1:
            start_link=page.find("<meta")
            page=page[start_link:]
            start_link=page.find("content=")
            start_quote=page.find('"',start_link)
            end_quote=page.find('"',start_quote+1)
            metaTag=page[start_quote+1:end_quote]
            page=page[end_quote:]
            lst.append(metaTag)

    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO META(URL, \
               KEYWORD) \
               VALUES ('%s','%s')" % \
               (url,lst[i])
        cursor.execute(sql)
    db.commit()

def filterContent(page,url):#FILTERS THE CONTENT OF THE REMAINING PORTION
    phrase = ['to','a','an','the',"i'm",\
        'for','from','that','their',\
        'i','my','your','you','mine',\
        'we','okay','yes','no','as',\
        'if','but','why','can','now',\
        'are','is','also']

    #CALLS FUNC TO REMOVE HTML TAGS
    page = strip_tags(page)

    #CONVERT TO LOWERCASE
    page = page.lower()

    #REMOVES WHITESPACES
    page = page.split()
    page = " ".join(page)

    #REMOVES IDENTICAL WORDS AND COMMON WORDS
    page = set(page.split())
    page.difference_update(phrase)

    #CONVERTS FROM SET TO LIST
    lst = list(page)

    #ENTER DATA INTO DB
    i,j=0,0
    while i<len(lst):
        sql = "INSERT INTO WORDS(URL, \
               KEYWORD) \
               VALUES ('%s','%s')" % \
               (url,lst[i])
        cursor.execute(sql)
    db.commit()


#<6>
def perPage(url):#CALLS ALL THE FUNCTIONS
    page=pageContent(url)

    #REMOVES CONTENT BETWEEN SCRIPT TAGS
    flg=0
    while page.find("<script",flg)!=-1:
            start=page.find("<script",flg)
            end=page.find("</script>",flg)
            end=end+9
            i,k=0,end-start
            page=list(page)
            while i<k:
                    page.pop(start)
                    i=i+1
            page=''.join(page)
            flg=start
    #REMOVES CONTENT BETWEEN STYLE TAGS
    flg=0
    while page.find("<script",flg)!=-1:
            start=page.find("<style",flg)
            end=page.find("</style>",flg)
            end=end+9
            i,k=0,end-start
            page=list(page)
            while i<k:
                    page.pop(start)
                    i=i+1
            page=''.join(page)
            flg=start

    metaContent(url,page)
    lst=linksExt(url,page)
    filterContent(url,page)
    return lst#CHECK WEATHER NEEDED OR NOT


#<7>
crawled=[]
def mainFunc(depth,url):#FOR THE DEPTH MANIPULATION
    if (depth):
        lst=perPage(url)
        crawled.append(url)
        i=0
        if (depth-1):
            while i<len(lst):
                if url[i] not in crawled:
                    mainFunc(depth-1,url[i])
                i+=1

#CALLING MAIN FUNCTION
mainFunc(depth,url)

Please point out any mistakes, especially in the depth-handling function (mainFunc()). Any suggestions for improving the crawler would be helpful.

Best answer

This is definitely an SQL error: the quotes inside your values are not being escaped.
Instead of this:

sql = "INSERT INTO META(URL, \
           KEYWORD) \
           VALUES ('%s','%s')" % \
           (url,lst[i])
cursor.execute(sql)

and this:
sql = "INSERT INTO WORDS(URL, \
           KEYWORD) \
           VALUES ('%s','%s')" % \
           (url,lst[i])
cursor.execute(sql)
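
To see why the formatted string breaks, it can help to print the statement it produces. The following is only a sketch: the URL and keyword are made-up values, chosen so that the keyword contains an apostrophe, as scraped page text often does.

```python
# Hypothetical sample values, not taken from the crawler's real data.
url = "http://example.com/"
keyword = "it's"

# Same pattern as the original code: values are pasted straight into the SQL text.
template = "INSERT INTO WORDS(URL, KEYWORD) VALUES ('%s','%s')"
broken_sql = template % (url, keyword)
print(broken_sql)

# The apostrophe in the keyword closes the SQL string literal early,
# so the statement ends up with an odd number of quote characters.
quote_count = broken_sql.count("'")
print(quote_count)
```

MySQL stops parsing at the stray quote, which matches the "near 's and ..." fragment in the error message.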

try this:
sql = "INSERT INTO WORDS(URL, \
           KEYWORD) \
           VALUES (%s, %s)"
cursor.execute(sql, (url, lst[i]))

and this:
sql = "INSERT INTO META(URL, \
           KEYWORD) \
           VALUES (%s, %s)"
cursor.execute(sql, (url, lst[i]))

You are also using a while loop without ever incrementing i, so it never terminates. You can use this instead:
for keyword in lst:
    sql = "INSERT INTO META(URL, \
           KEYWORD) \
           VALUES (%s, %s)"
    cursor.execute(sql, (url, keyword))
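
As a variation on the loop above, the rows could also be sent in one call with the cursor's executemany method, which is part of the Python DB-API and is provided by MySQLdb cursors. This is a minimal sketch; the helper name insert_keywords is made up for illustration, and it assumes the WORDS table from the question.

```python
def insert_keywords(cursor, url, keywords):
    """Insert one (url, keyword) row per keyword using a parameterized query.

    cursor is any DB-API cursor that uses the %s parameter style,
    such as a MySQLdb cursor. Returns the number of rows submitted.
    """
    sql = "INSERT INTO WORDS(URL, KEYWORD) VALUES (%s, %s)"
    rows = [(url, keyword) for keyword in keywords]
    if rows:
        # The driver escapes every value, so apostrophes in keywords are safe.
        cursor.executemany(sql, rows)
    return len(rows)
```

It would be called as insert_keywords(cursor, url, lst), followed by db.commit() as in the original functions.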

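On the depth handling the question asks about: as posted, mainFunc loops up to len(lst) but indexes url[i], so it walks the characters of the URL string instead of the extracted links. It may also be worth checking the argument order, since perPage calls metaContent(url,page) and filterContent(url,page) while both are defined as (page,url). A minimal sketch of a depth-limited crawl follows; get_links is a hypothetical callable standing in for perPage(), not part of the original code.

```python
def crawl(url, depth, get_links, crawled=None):
    """Visit url, then follow its links until depth is used up.

    get_links is a hypothetical stand-in for perPage(): it takes a URL
    and returns the list of links found on that page.
    """
    if crawled is None:
        crawled = []
    if depth <= 0 or url in crawled:
        return crawled
    crawled.append(url)            # mark before recursing so cycles stop
    for link in get_links(url):    # iterate over the links, not the URL string
        crawl(link, depth - 1, get_links, crawled)
    return crawled


# Usage with a made-up link graph instead of real pages.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
print(crawl("a", 2, lambda u: graph.get(u, [])))  # ['a', 'b', 'c']
```

Passing the visited list as a parameter instead of using a module-level crawled list keeps separate crawls from interfering with each other.
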
On the topic "python - web crawler not working properly", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25242281/

10-10 10:56