Problem description
The following XPath selects the div element with class ajaxcourseindentfix, splits its text on "Prerequisite: ", and gives me all the content after the prerequisite label.
div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]
My div can contain not only Prerequisite but also the following splitting points:
Whenever the text contains Prerequisite, the XPath above works fine, but whenever any of the three above appears instead, it fails and gives me the whole text.
Is there a way to use multiple delimiters in XPath? If not, how do I solve this?
Sample pages:
Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96106&show
Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show
Both: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show
[Old thread] - How to get text which has no HTML tag
Recommended answer
Unless you specifically need XPath, this code solves your problem. I would also suggest reviewing the BeautifulSoup documentation on the methods I've used.
.next_element and .next_sibling can be very useful in these cases. With .next_elements we get a generator, which we either have to convert to a list or consume in a way that suits a generator.
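As a rough illustration of how these three attributes behave, here is a minimal sketch on an inline HTML string. The markup, the course names, and the variable names are made up for the example and are not taken from the catalog pages; it uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, loosely modelled on a course preview block.
html = (
    '<div><h3>CPSC 100</h3><strong>Prerequisite:</strong> '
    '<a href="#">CPSC 120</a> or equivalent.</div>'
)
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# .next_element is the next node in parse order: the string inside <h3>.
first = h3.next_element

# .next_sibling is the next node on the same level: the <strong> tag.
sibling = h3.next_sibling

# .next_elements is a generator over everything parsed after <h3>.
# Here we keep only the non-empty text nodes and turn them into plain strings.
strings = [
    str(el).strip()
    for el in h3.next_elements
    if isinstance(el, NavigableString) and el.strip()
]

print(first)
print(sibling.name)
print(strings)
```

Filtering the generator by node type, as above, tends to be less fragile than picking fixed positions out of list(...), because the positions shift whenever the page adds or drops a tag.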
from bs4 import BeautifulSoup
import requests
url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show'
makereq = requests.get(url).text
soup = BeautifulSoup(makereq, 'lxml')
# select the whole table cell (td); not strictly needed in this case
whole = soup.find('td', {'class': 'custompad_10'})
# list of all divs nested inside that cell
thedivs = whole.find_all('div')
# pick the third div (index 2) and save it in a variable
title_h3 = thedivs[2]
# using .h3 we can traverse to the child <h3> element
mytitle = title_h3.h3
# mytitle is still part of the tree, so .next_elements yields every element
# that follows it; list(...) converts that generator into a list
mylist = list(mytitle.next_elements)
# we can then select specific elements from the list by index
the_text = mylist[3]
prerequisite = mylist[6]
which_cpsc = mylist[8]
other_text = mylist[11]
# these prints are for testing purposes
print(the_text, ' is the text')
print(which_cpsc, other_text, ' is the cpsc and othertext ')
This solves both issues: we don't have to use CSS selectors or those awkward list manipulations. Everything follows the document tree and works well.
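If you would rather keep the original split-on-a-label approach, one option is to split on several labels at once with a regular expression. This is only a sketch under stated assumptions: the label list below is a guess, since the question does not show the actual splitting points, and DELIMITERS, PATTERN and text_after_label are hypothetical names introduced for the example.

```python
import re

# Assumed labels; replace them with the splitting points your pages really use.
DELIMITERS = [
    "Prerequisite or Corequisite",
    "Prerequisites",
    "Prerequisite",
    "Corequisites",
    "Corequisite",
]

# One alternation of all labels, each followed by a colon and optional spaces.
PATTERN = re.compile(
    r"(?:%s):\s*" % "|".join(re.escape(label) for label in DELIMITERS)
)


def text_after_label(text):
    """Return the text after the first label found, or the whole text if none."""
    return PATTERN.split(text, maxsplit=1)[-1]


# Hypothetical course descriptions, used only to exercise the function.
print(text_after_label("Intro to programming. Prerequisite: CPSC 120."))
print(text_after_label("Intro to programming. Corequisite: MATH 150A."))
```

The input would be the same joined string as in the question, that is " ".join(div.stripped_strings). When no label is present the function returns the full text, so callers that need to tell the two cases apart could check PATTERN.search(text) first.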