Common Extraction Libraries in Python

Test sample

<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story [class]</b></p>
<p id="title"><b>The Dormouse's story [id]</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

Beautiful Soup 4

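Beautiful Soup 3 is no longer being developed. New projects should use Beautiful Soup 4, and existing projects should be ported to BS4.

Documentation: https://beautifulsoup.cn/

Install: pip install beautifulsoup4

Parser pros and cons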

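Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` `BeautifulSoup(markup, "xml")`	Very fast; the only parser that supports XML	Requires a C library to be installed
html5lib	`BeautifulSoup(markup, "html5lib")`	Best tolerance of malformed documents; parses pages the way a browser does; produces HTML5-style documents	Very slow; must be installed separately (it is a pure-Python package, so no C extension is needed)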
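The differences in the table are easiest to see on broken markup. The following is a minimal sketch, assuming a deliberately malformed fragment `<a></p>` chosen for illustration; parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A deliberately malformed fragment: </p> closes a tag that was never opened.
markup = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Beautiful Soup raises this when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    shown = output if output is not None else "(not installed)"
    print(f"{parser:12} {shown}")
```

Each parser repairs the fragment in its own way, so the same input can yield different trees. Passing the parser name explicitly keeps results stable across machines.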

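Common usage

Expression	Description
`soup.prettify()`	Pretty-print the document
`soup.title.get_text()` `soup.title.string`	Get the content of the title tag
`soup.find(id="link2").get_text()`	Get the content of the node with id=link2
`soup.find(class_="story").get_text()`	Get the content of the first node with class=story (searching by CSS class name is very useful, but `class` is a reserved word in Python, so using `class` as a keyword argument is a syntax error; since Beautiful Soup 4.1.2 you can search for tags with a given CSS class through the `class_` argument)
`soup.find_all(href=re.compile("elsie"), class_='sister')`	Get the nodes with class="sister" whose href matches the regular expression elsie
`soup.select("head > title")[0].get_text()`	Get the content of the title node with a CSS selector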
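The expressions in the table can be tried directly against the test sample. A minimal sketch, assuming the sample is stored in a string named `html_doc` and parsed with `html.parser`; the result variable names are only illustrative.

```python
import re

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story [class]</b></p>
<p id="title"><b>The Dormouse's story [id]</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Text of the <title> tag, via get_text() and via a CSS selector.
title_text = soup.title.get_text()
css_title = soup.select("head > title")[0].get_text()

# Lookup by id, then by href regex combined with a CSS class.
link2_text = soup.find(id="link2").get_text()
elsie_links = soup.find_all(href=re.compile("elsie"), class_="sister")

print(title_text)
print(css_title)
print(link2_text)
print([a["id"] for a in elsie_links])
```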

from bs4 import BeautifulSoup

html_doc = ...  # the test sample above
soup = BeautifulSoup(html_doc, 'html.parser')

# Define a filter: keep tags that have a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id), '\n')
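A filter function is called once for every tag and keeps the tags for which it returns a truthy value. A self-contained sketch, assuming a trimmed fragment of the test sample, with a filter that keeps tags having a class attribute but no id:

```python
from bs4 import BeautifulSoup

# Trimmed fragment of the test sample (an assumption made to keep this short).
html_doc = """
<p class="title"><b>The Dormouse's story [class]</b></p>
<p id="title"><b>The Dormouse's story [id]</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Keep tags that define a class attribute and no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
for tag in matches:
    print(tag.name, tag["class"])
```

Only the first paragraph qualifies here: the second paragraph and the bare `<b>` tags have no class, and the link has both a class and an id.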

lxml

Website: https://lxml.de/

References

http://c.biancheng.net/python_spider/lxml.html

https://www.w3cschool.cn/lxml/_lxml-vmoe3fju.html

Install: pip install lxml

Import: from lxml import etree

Syntax

tree = etree.HTML(html_doc)

tree.xpath('<XPath expression>') returns a list of matches. On an element in that list, .text gives its text content and .tag gives its tag name.
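A short sketch of the syntax in practice, assuming an abridged copy of the test sample; the variable names are illustrative. Note that `xpath()` returns a list, so a single element has to be indexed out before reading `.text` or `.tag`.

```python
from lxml import etree

# Abridged copy of the test sample (an assumption made to keep this short).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

tree = etree.HTML(html_doc)

# Element results: index into the list, then read .tag and .text.
title = tree.xpath("//head/title")[0]
print(title.tag, title.text)

# text() and @attr return strings directly instead of elements.
names = tree.xpath('//a[@class="sister"]/text()')
hrefs = tree.xpath('//a[@class="sister"]/@href')
print(names)
print(hrefs)

# Attributes of an element can also be read with .get().
link2 = tree.xpath('//a[@id="link2"]')[0]
print(link2.text, link2.get("href"))
```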

Parsel

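Website: https://parsel.readthedocs.io/en/latest/

Syntax

.css('<CSS expression>') and .xpath('<XPath expression>')

.get() (alias extract_first()) returns the first match; .getall() (alias extract()) returns all matches

.re_first('<regular expression>') and .re('<regular expression>')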

from parsel import Selector

html_doc = ...  # the test sample above
selector = Selector(text=html_doc)

print(selector.css('head > title ::text').get())
print(selector.xpath('//head/title/text()').get())
print(selector.re_first(r'<title>(.*?)</title>'))
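The multi-result methods work the same way. A minimal sketch, assuming an abridged copy of the test sample, that shows `.getall()`, the `::attr()` pseudo-element, `.re()`, and the `default` argument of `.get()`; the variable names are illustrative.

```python
from parsel import Selector

# Abridged copy of the test sample (an assumption made to keep this short).
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

selector = Selector(text=html_doc)

# .getall() returns every match as a list of strings.
names = selector.css("a.sister::text").getall()
hrefs = selector.css("a.sister::attr(href)").getall()

# .get() returns only the first match, or the default when nothing matches.
first = selector.xpath('//a[@class="sister"]/text()').get()
missing = selector.css("h1::text").get(default="not found")

# .re() applies a regular expression to each match and returns the captured groups.
slugs = selector.css("a.sister::attr(href)").re(r"example\.com/(\w+)")

print(names, hrefs, first, missing, slugs, sep="\n")
```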

Scrapy Selectors

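Website: https://docs.scrapy.org/en/latest/topics/selectors.html

Scrapy's Selector is a thin subclass of Parsel's Selector, and Parsel is in turn built on top of lxml. For real-world projects, Scrapy Selectors are the recommended choice.

Scrapy Selectors are used in much the same way as Parsel. .get() accepts a default value that is returned when nothing is extracted.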

from scrapy.selector import Selector

html_doc = ...

scrapy_selector = Selector(text=html_doc)
print(scrapy_selector.css('head > title ::text').get('not found'))
print(scrapy_selector.xpath('//head/title/text()').get('not found'))
print(scrapy_selector.re_first(r'<title>(.*?)</title>'))

https://元气码农少女酱.我爱你/fa4eab074cf0/

Author

元气码农少女酱

Published on

May 2, 2023
