Beautiful Soup - Navigating by Tags

In this chapter, we discuss navigating the parse tree by tags.

Below is our HTML document -

>>> html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="link1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="link2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="link3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="link4">JavaScript</a> and
<a href="https://www.tutorialspoint.com/ruby/index.htm" class="prog" id="link5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
"""
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>>

Based on the above document, we will try to move from one part of the document to another.

Going down

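Tags are among the most important elements of any HTML document, and a tag may contain other tags or strings (the tag's children). Beautiful Soup provides several ways to navigate and iterate over a tag's children.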

Navigating using tag names

The simplest way to search the parse tree is to ask for a tag by its name. If you want the <head> tag, use soup.head -

>>> soup.head
<head><title>Tutorials Point</title></head>
>>> soup.title
<title>Tutorials Point</title>

To get a specific tag inside the <body> tag (for example, the first <b> tag), chain the tag names -

>>> soup.body.b
<b>The Biggest Online Tutorials Library, It's all Free</b>

Using a tag name as an attribute gives you only the first tag with that name -

>>> soup.a
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>

To get every tag with that name, use the find_all() method -

>>> soup.find_all("a")
[<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>, <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>, <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>, <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>, <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>]
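
The difference between attribute access and find_all() can be seen in a short script. This is a minimal sketch that uses a small stand-in document (the doc string and its id values are made up for illustration, not taken from html_doc):

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document, used only for this illustration.
doc = '<p><a id="x1">Java</a>, <a id="x2">C</a></p>'
soup = BeautifulSoup(doc, "html.parser")

first = soup.a                   # attribute access: first matching tag only
same = soup.find("a")            # find() returns that same first match
all_links = soup.find_all("a")   # find_all() returns every match in a list
missing = soup.table             # there is no <table> here, so this is None

print(first is same)                  # True
print([a["id"] for a in all_links])   # ['x1', 'x2']
print(missing)                        # None
```

Because a missing tag yields None rather than raising an error, it is worth checking the result before chaining further attribute access onto it.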

.contents and .children

A tag's children are available as a list through its .contents attribute -

>>> head_tag = soup.head
>>> head_tag
<head><title>Tutorials Point</title></head>
>>> Htag = soup.head
>>> Htag
<head><title>Tutorials Point</title></head>
>>>
>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Ttag = head_tag.contents[0]
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.contents
['Tutorials Point']

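The BeautifulSoup object itself has children. In this case, the <html> tag is a child of the BeautifulSoup object (the first child is the newline string that starts html_doc) -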

>>> len(soup.contents)
2
>>> soup.contents[1].name
'html'

A string does not have .contents, because it cannot contain anything -

>>> text = Ttag.contents[0]
>>> text.contents
Traceback (most recent call last):
  ...
AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting the children as a list, you can iterate over them with the .children generator -

>>> for child in Ttag.children:
...     print(child)
...
Tutorials Point
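
As a rough comparison of the two attributes, the sketch below uses a small made-up list document (not html_doc): .contents supports indexing and len(), while .children is consumed by iteration.

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration only.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

as_list = ul.contents                # a real list: indexable, has a length
second_text = as_list[1].string      # text of the second <li>

child_names = [child.name for child in ul.children]  # iterate lazily

print(len(as_list), second_text)     # 2 two
print(child_names)                   # ['li', 'li']
```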

.descendants

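The .descendants attribute lets you iterate recursively over all of a tag's descendants: its direct children, the children of its direct children, and so on -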

>>> for child in Htag.descendants:
...     print(child)
...
<title>Tutorials Point</title>
Tutorials Point

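The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag's child string. The BeautifulSoup object has only two direct children here (the leading newline string and the <html> tag), but it has many descendants -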

>>> len(list(soup.children))
2
>>> len(list(soup.descendants))
33
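
To make the child/descendant distinction concrete, here is a small sketch on a made-up fragment (not html_doc). It also shows one way to keep only the tags among the descendants, using an isinstance check against bs4's Tag class:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment for illustration only.
soup = BeautifulSoup("<div><p>Hi <b>there</b></p></div>", "html.parser")
div = soup.div

children = list(div.children)        # only the direct child: <p>
descendants = list(div.descendants)  # <p>, 'Hi ', <b>, 'there'

# Strings are descendants too; filter them out to keep tags only.
tag_names = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children), len(descendants))  # 1 4
print(tag_names)                        # ['p', 'b']
```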

.string

If a tag has only one child, and that child is a NavigableString, the child is available as .string -

>>> Ttag.string
'Tutorials Point'

If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string as its child -

>>> Htag.contents
[<title>Tutorials Point</title>]
>>>
>>> Htag.string
'Tutorials Point'

However, if a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None -

>>> print(soup.html.string)
None
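
When .string is None because a tag has several children, the get_text() method is one way to collect all of the text anyway. A minimal sketch on a made-up paragraph (not html_doc):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph for illustration only.
soup = BeautifulSoup("<p>Top languages: <a>Java</a> and <a>C</a></p>", "html.parser")
p = soup.p

single = soup.a.string    # <a> has exactly one string child
ambiguous = p.string      # <p> has several children, so this is None
full_text = p.get_text()  # concatenates every string under <p>

print(single)      # Java
print(ambiguous)   # None
print(full_text)   # Top languages: Java and C
```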

.strings and .stripped_strings

If a tag contains more than one thing, you can still look at just the strings. Use the .strings generator -

>>> for string in soup.strings:
...     print(repr(string))
...
'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are: \n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
';\n \nas per online survey.'
'\n'
'Programming Languages'
'\n'

To remove the extra whitespace, use the .stripped_strings generator -

>>> for string in soup.stripped_strings:
...     print(repr(string))
...
'Tutorials Point'
"The Biggest Online Tutorials Library, It's all Free"
'Top 5 most used Programming Languages are:'
'Java'
','
'C'
','
'Python'
','
'JavaScript'
'and'
'C'
';\n \nas per online survey.'
'Programming Languages'
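
A common use of .stripped_strings is to join the pieces into one clean line of text. The sketch below assumes a small made-up fragment (not html_doc) with stray whitespace around its strings:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace, for illustration only.
soup = BeautifulSoup("<p>\n Java\n<b> Python </b>\n</p>", "html.parser")

raw = list(soup.strings)             # whitespace kept as parsed
clean = list(soup.stripped_strings)  # stripped; whitespace-only strings skipped
joined = " ".join(clean)

print(raw)      # ['\n Java\n', ' Python ', '\n']
print(clean)    # ['Java', 'Python']
print(joined)   # Java Python
```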

Going up

In the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

.parent

To access an element's parent, use the .parent attribute -

>>> Ttag = soup.title
>>> Ttag
<title>Tutorials Point</title>
>>> Ttag.parent
<head><title>Tutorials Point</title></head>

In our html_doc, the title string itself has a parent: the <title> tag that contains it -

>>> Ttag.string.parent
<title>Tutorials Point</title>

The parent of a top-level tag such as <html> is the BeautifulSoup object itself -

>>> htmltag = soup.html
>>> type(htmltag.parent)
<class 'bs4.BeautifulSoup'>

The .parent of a BeautifulSoup object is defined as None -

>>> print(soup.parent)
None

.parents

To iterate over all of an element's ancestors, use the .parents attribute -

>>> link = soup.a
>>> link
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
>>>
>>> for parent in link.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
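
One practical use of .parents is to build a path from the document root down to an element. This is a rough sketch on a made-up document (not html_doc); the " > " separator is an arbitrary choice:

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration only.
soup = BeautifulSoup("<html><body><p><a>Java</a></p></body></html>", "html.parser")
link = soup.a

# .parents walks upward: <p>, <body>, <html>, then the BeautifulSoup object.
ancestor_names = [parent.name for parent in link.parents]

# Reverse the upward walk to read the path from the root down to the tag.
path = " > ".join(reversed([link.name] + ancestor_names))

print(ancestor_names)  # ['p', 'body', 'html', '[document]']
print(path)            # [document] > html > body > p > a
```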

Going sideways

Below is a simple document -

>>> sibling_soup = BeautifulSoup("<a><b>TutorialsPoint</b><c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c></a>", "lxml")
>>> print(sibling_soup.prettify())
<html>
 <body>
  <a>
   <b>
    TutorialsPoint
   </b>
   <c>
    <strong>
     The Biggest Online Tutorials Library, It's all Free
    </strong>
   </c>
  </a>
 </body>
</html>

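In the above document, the <b> and <c> tags are at the same level: they are both direct children of the same <a> tag. The <b> and <c> tags are therefore siblings.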

.next_sibling and .previous_sibling

Use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree -

>>> sibling_soup.b.next_sibling
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
>>>
>>> sibling_soup.c.previous_sibling
<b>TutorialsPoint</b>

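The <b> tag has a .next_sibling but no .previous_sibling, because nothing comes before the <b> tag on the same level of the tree. Likewise, the <c> tag has a .previous_sibling but no .next_sibling -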

>>> print(sibling_soup.b.previous_sibling)
None
>>> print(sibling_soup.c.next_sibling)
None

The two strings are not siblings, because they do not have the same parent -

>>> sibling_soup.b.string
'TutorialsPoint'
>>>
>>> print(sibling_soup.b.string.next_sibling)
None

.next_siblings and .previous_siblings

To iterate over a tag's siblings, use .next_siblings and .previous_siblings -

>>> for sibling in soup.a.next_siblings:
...     print(repr(sibling))
...
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
' and\n'
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
';\n \nas per online survey.'
>>> for sibling in soup.find(id="link3").previous_siblings:
...     print(repr(sibling))
...
',\n'
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>
',\n'
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>
'Top 5 most used Programming Languages are: \n'
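
As the output above shows, sibling iteration yields the strings between tags as well as the tags themselves. If only tag siblings are wanted, one option is to filter with an isinstance check; another is the find_next_sibling() method. A minimal sketch on a made-up paragraph (not html_doc):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical paragraph for illustration only.
soup = BeautifulSoup("<p><a>Java</a>, <a>C</a> and <a>Python</a>.</p>", "html.parser")
first = soup.a

immediate = first.next_sibling  # the string ', ', not the next <a> tag

# Keep only the siblings that are tags, skipping the strings in between.
tag_siblings = [s.string for s in first.next_siblings if isinstance(s, Tag)]

# find_next_sibling() skips directly to the next sibling matching the name.
next_link = first.find_next_sibling("a").string

print(repr(immediate))  # ', '
print(tag_siblings)     # ['C', 'Python']
print(next_link)        # C
```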

Going back and forth

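Now let us go back to the first few lines of the earlier "html_doc" example -

<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>

An HTML parser takes the string above and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the </title> tag", "close the </head> tag", "open a <body> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing this initial parse of the document.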

.next_element and .previous_element

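The .next_element attribute of a tag or string points to whatever was parsed immediately afterwards. Sometimes it looks the same as .next_sibling, but the two are not quite identical. Below is the final <a> tag in the "html_doc" example document -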

>>> last_a_tag = soup.find("a", id="link5")
>>> last_a_tag
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>
>>> last_a_tag.next_sibling
';\n \nas per online survey.'

However, the .next_element of that <a> tag (whatever was parsed immediately after the opening <a> tag) is not the rest of the sentence: it is the string "C" -

>>> last_a_tag.next_element
'C'

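This behavior arises because, in the original markup, the letter "C" appears before the semicolon. The parser encountered the <a> tag, then the letter "C", then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the letter "C" was encountered first.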

The .previous_element attribute is the exact opposite of .next_element. It points to whatever element was parsed immediately before this one -

>>> last_a_tag.previous_element
' and\n'
>>>
>>> last_a_tag.previous_element.next_element
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>

.next_elements and .previous_elements

Use these iterators to move forward or backward through the document, in parse order, starting from an element -

>>> for element in last_a_tag.next_elements:
...     print(repr(element))
...
'C'
';\n \nas per online survey.'
'\n'
<p class="prog">Programming Languages</p>
'Programming Languages'
'\n'
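
The contrast between parse order and tree level can be checked on a very small fragment. The sketch below uses a made-up one-line document (not html_doc) and compares .next_sibling with .next_element for the same tag:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
soup = BeautifulSoup("<p><a>C</a>; the end.</p>", "html.parser")
a_tag = soup.a

sibling = a_tag.next_sibling             # same level: text after </a>
element = a_tag.next_element             # parse order: the string inside <a>
after_that = element.next_element        # then the text after </a>
remaining = list(a_tag.next_elements)    # everything parsed after <a>

print(repr(sibling))     # '; the end.'
print(repr(element))     # 'C'
print(repr(after_that))  # '; the end.'
print(remaining)         # ['C', '; the end.']
```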