Navigating a webpage with html5-parser and xmls in Common Lisp

I am trying to get the
first row under the column with the title "Name". For example, for https://en.wikipedia.org/wiki/List_of_the_heaviest_people I want to return the name "Jon Brower Minnoch". My code so far is as follows, but I think there must be a more general way of getting the name:

    (defun find-tag (tag doc)
      (when (listp doc)
        (when (string= (xmls:node-name doc) tag)
          (return-from find-tag doc))
        (loop for child in (xmls:node-children doc)
              for find = (find-tag tag child)
              when find do (return-from find-tag find)))
      nil)

    (defun parse-list-website (url)
      (second (second (second (third (find-tag "td" (html5-parser:parse-html5 (drakma:http-request url) :dom :xmls)))))))

And then to call the function:

    (parse-list-website "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")

I am not very good with xmls and don't know how to get a td under a certain column header.

Solution

The elements in the document returned by html5-parser:parse-html5 are in the form:

    ("name" (attribute-alist) &rest children)

You could access the parts with the standard list manipulation functions, but xmls also provides the functions node-name, node-attrs and node-children to access the three parts. It's a little clearer to use those. Edit: there are also the functions xmlrep-attrib-value, to get the value of an attribute, and xmlrep-tagmatch, to match the tag name.
The children are either plain strings or elements in the same format.

So, for example, an HTML document with a 2x2 table would look like this:

    (defparameter *doc*
      '("html" ()
        ("head" ()
         ("title" () "Some title"))
        ("body" ()
         ("table" (("class" "some-class"))
          ("tr" (("class" "odd"))
           ("td" () "Some string")
           ("td" () "Another string"))
          ("tr" (("class" "even"))
           ("td" () "Third string")
           ("td" () "Fourth string"))))))

In order to traverse the DOM tree, let's define a recursive depth-first search like this (note that if-let comes from the alexandria library; either import it, or change it to alexandria:if-let):

    (defun find-tag (predicate doc &optional path)
      ;; Return the node itself if the predicate accepts it.
      (when (funcall predicate doc path)
        (return-from find-tag doc))
      ;; Otherwise descend into the children, pushing this node onto the path.
      (when (listp doc)
        (let ((path (cons doc path)))
          (dolist (child (xmls:node-children doc))
            (if-let ((find (find-tag predicate child path)))
              (return-from find-tag find))))))

It's called with a predicate function and a document. The predicate function gets called with two arguments: the element being matched and a list of its ancestors, nearest ancestor first.
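If you need every match rather than just the first one, the same traversal can collect results instead of returning early. The following is only a minimal sketch: the names collect-tags, children-of and *sample* are made up for illustration, and it uses plain list functions on the ("name" (attributes) &rest children) layout described above, so it runs without loading any library.

```lisp
;; Hypothetical helper: the children of an element in the
;; ("name" (attributes) &rest children) list layout.
(defun children-of (el)
  (cddr el))

;; Hypothetical variant of FIND-TAG: same predicate convention
;; (element, list of ancestors), but it returns a list of all
;; matching nodes in document order instead of stopping at the first.
(defun collect-tags (predicate doc &optional path)
  (append (when (funcall predicate doc path)
            (list doc))
          (when (consp doc)
            (let ((path (cons doc path)))
              (loop for child in (children-of doc)
                    append (collect-tags predicate child path))))))

;; Small made-up table to try it on.
(defparameter *sample*
  '("table" ()
    ("tr" (("class" "odd"))  ("td" () "a") ("td" () "b"))
    ("tr" (("class" "even")) ("td" () "c") ("td" () "d"))))

;; Text of every <td>, in document order.
(mapcar #'third
        (collect-tags (lambda (el path)
                        (declare (ignore path))
                        (and (consp el) (string= (first el) "td")))
                      *sample*))
;; => ("a" "b" "c" "d")
```

Collecting everything is simpler to reason about than counting matches inside the predicate, at the cost of always walking the whole tree.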
In order to find the first <td>, you could do this:

    (find-tag (lambda (el path)
                (declare (ignore path))
                (and (listp el)
                     (xmls:xmlrep-tagmatch "td" el)))
              *doc*)
    ; => ("td" NIL "Some string")

Or to find the first <td> in the even row:

    (find-tag (lambda (el path)
                (and (listp el)
                     (xmls:xmlrep-tagmatch "td" el)
                     (string= (xmls:xmlrep-attrib-value "class" (first path))
                              "even")))
              *doc*)
    ; => ("td" NIL "Third string")

Getting the second <td> in the even row would require something like this:

    (let ((matches 0))
      (find-tag (lambda (el path)
                  (when (and (listp el)
                             (xmls:xmlrep-tagmatch "td" el)
                             (string= (xmls:xmlrep-attrib-value "class" (first path))
                                      "even"))
                    (incf matches))
                  (= matches 2))
                *doc*))

You could define a helper function to find the nth tag:

    (defun find-nth-tag (n tag doc)
      (let ((matches 0))
        (find-tag (lambda (el path)
                    (declare (ignore path))
                    (when (and (listp el)
                               (xmls:xmlrep-tagmatch tag el))
                      (incf matches))
                    (= matches n))
                  doc)))

    (find-nth-tag 2 "td" *doc*) ; => ("td" NIL "Another string")
    (find-nth-tag 4 "td" *doc*) ; => ("td" NIL "Fourth string")

You might want to have a simple helper to get the text of a node:

    (defun node-text (el)
      (if (listp el)
          (first (xmls:node-children el))
          el))

You could define similar helpers to do whatever you need to do in your application. Using these, the example you gave would look like this:

    (defparameter *doc*
      (html5-parser:parse-html5
       (drakma:http-request "https://en.wikipedia.org/wiki/List_of_the_heaviest_people")
       :dom :xmls))

    (node-text (find-nth-tag 1 "a" (find-nth-tag 1 "td" *doc*)))
    ; => "Jon Brower Minnoch"
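The question also asked about picking the cell under a particular column header. One possible approach is to find the position of the header among the <th> cells and then take the <td> at the same position in the first data row. The sketch below is an illustration only: child-elements, cell-text, column-value and *table* are made-up names, it works on a small hand-written table rather than the Wikipedia page, and it assumes a simple table with no colspan/rowspan, header text that matches exactly after trimming whitespace, and data rows made only of <td> cells. It uses plain list functions on the ("name" (attributes) &rest children) layout, so it runs without any library.

```lisp
;; Hypothetical helper: direct child elements of EL named TAG.
(defun child-elements (tag el)
  (remove-if-not (lambda (child)
                   (and (consp child) (string= (first child) tag)))
                 (cddr el)))

;; Hypothetical helper: first non-blank string found depth-first
;; inside NODE (attributes are skipped), or NIL if there is none.
(defun cell-text (node)
  (cond ((stringp node)
         (let ((trimmed (string-trim '(#\Space #\Tab #\Newline) node)))
           (when (plusp (length trimmed)) trimmed)))
        ((consp node)
         (some #'cell-text (cddr node)))))

;; Hypothetical helper: text of the cell under HEADER in the first
;; row that contains <td> cells. ROWS is a list of <tr> elements.
(defun column-value (header rows)
  (let* ((header-row (find-if (lambda (row) (child-elements "th" row)) rows))
         (index (position header (child-elements "th" header-row)
                          :key #'cell-text :test #'string=))
         (data-row (find-if (lambda (row) (child-elements "td" row)) rows)))
    (when (and index data-row)
      (cell-text (nth index (child-elements "td" data-row))))))

;; Made-up table with a header row and a link inside one cell.
(defparameter *table*
  '("table" ()
    ("tr" () ("th" () "Rank") ("th" () "Name") ("th" () "Country"))
    ("tr" () ("td" () "1")
             ("td" () ("a" (("href" "/wiki/x")) "First person"))
             ("td" () "A"))
    ("tr" () ("td" () "2") ("td" () "Second person") ("td" () "B"))))

(column-value "Name" (child-elements "tr" *table*))
;; => "First person"
```

On a real page the rows are usually nested under a <tbody> element and some tables put a <th> in data rows as well, so the row list and the index calculation would need adapting to the actual markup.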