python - Parsing an answer that contains code on Quora

I want to parse this post from Quora, or more generally any post that contains code.
Example: http://qr.ae/Rkplrt

Using the Python library Selenium, I can get the HTML of the post:

 import html2text

 # `ans` is the Selenium WebElement for the answer, located earlier (not shown)
 h = html2text.HTML2Text()
 content = ans.find_element_by_class_name('inline_editor_value')
 html_string = content.get_attribute('innerHTML')
 text = h.handle(html_string)
 print(text)

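I would like all of the content as one short block of text. However, for tables that contain code, html2text inserts many \n characters and does not handle the line numbers.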

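So this is what I see:
https://imageshack.com/i/paEKbzT4p (this is the main div that contains the table with the code)
https://imageshack.com/i/hlIxFayop (the text extracted by html2text)
https://imageshack.com/i/hlHFBXvQp (by contrast, this is the final printed text, which has problems with the line numbers and the extra \n characters)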
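
To make the desired result concrete, here is a minimal sketch of the kind of cleanup being asked for. It assumes, without confirmation from the screenshots, that the stray line numbers show up as lines containing only digits; `tidy_extracted_text` is a hypothetical helper name, not part of html2text or Selenium.

```python
def tidy_extracted_text(text):
    """Drop blank lines and bare line-number lines from extracted text.

    Assumption: gutter line numbers appear as lines made up only of digits.
    A real code line that is just a number (e.g. "42") would be dropped too.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip empty lines and lines that look like a line-number gutter.
        if not stripped or stripped.isdigit():
            continue
        # Keep leading whitespace so code indentation survives.
        kept.append(line.rstrip())
    return "\n".join(kept)


sample = "intro\n\n\n1\n\n2\n\nfor i in range(3):\n\n    print(i)\n"
print(tidy_extracted_text(sample))
```

This only post-processes the string; it does not change how html2text converts the table itself.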

I have already tried the other settings described in this guide on GitHub (https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options), such as bypass_tables, but without success.

Can anyone tell me how to use html2text in this case?

Best answer

Actually, you don't need html2text at all.

Selenium can give you the text directly:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://qr.ae/Rkplrt")

print(driver.find_element_by_class_name('inline_editor_content').text)

It prints the content of the post:

The single line of code must be useful, not something meant to be confusing or obfuscating.

...

What examples have you created or encountered ?
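
Note that the `find_element_by_class_name` helper is no longer available in recent Selenium 4 releases, where `find_element(By.CLASS_NAME, ...)` is used instead. The following is a sketch of the same idea under that API; `element_text` and `fetch_post_text` are hypothetical helper names, and it is an assumption that the page still uses the `inline_editor_content` class.

```python
def element_text(driver, by, value):
    """Return the visible text of the first element matching the locator."""
    return driver.find_element(by, value).text


def fetch_post_text(url, class_name="inline_editor_content"):
    """Open `url` in Chrome and return the post text.

    Requires selenium and a local Chrome install; the class name is an
    assumption carried over from the answer above.
    """
    # Imported here so element_text can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return element_text(driver, By.CLASS_NAME, class_name)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `fetch_post_text("http://qr.ae/Rkplrt")` would be the equivalent of the snippet above, provided the page markup has not changed.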

Regarding "python - Parsing an answer that contains code on Quora", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33373597/