Beautiful Soup 函式庫

Beautiful Soup 函式庫 ( 模組 ) 是一個 Python 外部函式庫，可以分析網頁的 HTML 與 XML 文件，並將分析的結果轉換成「網頁標籤樹」( tag ) 的型態，讓資料讀取方式更接近網頁的操作語法，處理起來也更為便利，這篇教學會介紹 Beautiful Soup 函式庫的基本用法。

快速導覽：

安裝 Beautiful Soup 模組

import Beautiful Soup

開始使用 Beautiful Soup

認識基本網頁架構

網頁解析器

Beautiful Soup 的方法

Beautiful Soup 方法的參數

取得並輸出內容

抓取水庫的容量

本篇使用的 Python 版本為 3.7.12，所有範例可使用 Google Colab 實作，不用安裝任何軟體 ( 參考：使用 Google Colab )

安裝 Beautiful Soup 模組

如果是使用 Colab 或 Anaconda，預設已經安裝了 Beautiful Soup 函式庫，不用額外安裝，如果是本機環境，輸入下列指令，就能安裝 Beautiful Soup 函式庫 ( 依據每個人的作業環境不同，可使用 pip 或 pip3 或 pipenv )。

pip install beautifulsoup4

import Beautiful Soup

要使用 Beautiful Soup 必須先 import Beautiful Soup 模組。

from bs4 import BeautifulSoup

開始使用 Beautiful Soup

將 HTML 的原始碼 ( 純文字 ) 提供給 Beautiful Soup，就能轉換成可讀取的標籤樹 ( tag )，所以通常會搭配 requests 爬取網頁內容一併使用，下方的程式碼執行後，會使用 requests 抓取「台灣水庫即時水情」網頁的原始碼，接著使用 Beautiful Soup 轉換成標籤樹，最後印出 title 的標籤。

import requests
from bs4 import BeautifulSoup

url = 'https://water.taiwanstat.com/'
web = requests.get(url)                        # 取得網頁內容
soup = BeautifulSoup(web.text, "html.parser")  # 轉換成標籤樹
title = soup.title                             # 取得 title
print(title)                                   # 印出 title ( 台灣水庫即時水情 )

認識基本網頁架構

使用 Beautiful Soup 時，會讀取特定的網頁結構 ( 如同上面的範例會從網頁原始碼裡讀取 title 的標籤 )，因此必須要從網頁原始碼著手，稍微了解網頁的架構，如果要觀看原始碼，可以用瀏覽器 ( Chrome ) 開啟網頁，用滑鼠在網頁的任意位置按下右鍵，點選「檢視網頁原始碼」。

Python 教學 - Beautiful Soup 函式庫 - 認識基本網頁架構

點選後，會開啟網頁的原始碼，這也是使用 requests 會讀取到的基本資料。

Python 教學 - Beautiful Soup 函式庫 - 開啟網頁的原始碼

網頁是由「標籤」的語法所構成，標籤 ( tag ) 指的是由「<」和「>」包覆的代碼，通常沒有斜線的「<標籤>」作為開頭，有斜線「</標籤>」做為結尾，標籤代碼並不會顯示在網頁中，只有被標籤包覆的內容才會顯示在網頁裡，而標籤也會互相層疊包覆，形成所為的「巢狀結構」。

每個標籤和所包覆的內容，會組合成一個 DOM ( 文件模型 )，網頁的程式通常會針對 DOM 去做運算和處理，也可以針對不同的 DOM，給予不同的 id 或樣式屬性 ( attribute、class、style...等 )，只要知道 DOM 的標籤，或是取得特定的 id、class 或 attribute，就能進一步透過程式控制 DOM。

下圖是一個簡單網頁範例，左方的原始碼會產生右方的網頁內容，當中包含 h1、h2、div、ul、li...等標籤。

Python 教學 - Beautiful Soup 函式庫 - 左方的原始碼會產生右方的網頁內容

網頁解析器

當藉由 requests 取得網頁原始碼後，Beautiful Soup 還需要第二個「解析器」的參數，將原始碼的「純文字」，轉換成可供分析取用的「標籤樹」，Python 本身內建「html.parser」的解析器，也可以使用下方指令，另外安裝「html5lib」解析器 ( 依據每個人的作業環境不同，可使用 pip 或 pip3 或 pipenv )。

pip install html5lib

安裝後，只要更換第二個參數，就可以更換解析器，下方的程式碼使用 html5lib 解析器 ( 不需要 import，安裝後就可以使用 )，html5lib 的容錯率比 html.parser 高，但解析速度比較慢。

import requests
from bs4 import BeautifulSoup

url = 'https://water.taiwanstat.com/'
web = requests.get(url)
# soup = BeautifulSoup(web.text, "html.parser")  # 使用 html.parser 解析器
soup = BeautifulSoup(web.text, "html5lib")       # 使用 html5lib 解析器
title = soup.title
print(title)

Beautiful Soup 的方法

下方列出 Beautiful Soup 尋找網頁內容的方法，當中最常使用的是 find_all()、find() 和 select()。

方法	說明
select()	以 CSS 選擇器的方式尋找指定的 tag。
find_all()	以所在的 tag 位置，尋找內容裡所有指定的 tag。
find()	以所在的 tag 位置，尋找第一個找到的 tag。
find_parents()、find_parent()	以所在的 tag 位置，尋找父層所有指定的 tag 或第一個找到的 tag。
find_next_siblings()、find_next_sibling()	以所在的 tag 位置，尋找同一層後方所有指定的 tag 或第一個找到的 tag。
find_previous_siblings()、ind_previous_sibling()	以所在的 tag 位置，尋找同一層前方所有指定的 tag 或第一個找到的 tag。
find_all_next()、find_next()	以所在的 tag 位置，尋找後方內容裡所有指定的 tag 或第一個找到的 tag。
find_all_previous()、find_previous()	所在的 tag 位置，尋找前方內容裡所有指定的 tag 或第一個找到的 tag。

下方的程式碼，使用 Beautiful Soup 取得範例網頁中指定 tag 的內容。

範例網頁：https://www.iana.org/domains/

import requests
from bs4 import BeautifulSoup

url = 'https://www.iana.org/domains/'
web = requests.get(url)
soup = BeautifulSoup(web.text, "html.parser")

print(soup.select('#logo'))            # 搜尋 id 為 logo 的 tag 內容
print('\n----------\n')

print(soup.find_all('div',id="logo"))  # 搜尋所有 id 為 logo 的 div
print('\n----------\n')

divs = soup.find_all('div')            # 搜尋所有的 div
print(divs[1])                         # 取得搜尋到的第二個項目 ( 第一個為 divs[0] )
print('\n----------\n')

# 從搜尋到的項目裡，尋找父節點裡所有的 li
print(divs[1].find_parent().find_all('li'))
print('\n----------\n')

# 從搜尋到的項目裡，尋找父節點裡所有 li 的第三個項目，找到他後方同層的所有 li
print(divs[1].find_parent().find_all('li')[2].find_next_siblings())
print('\n----------\n')

# 從搜尋到的項目裡，尋找父節點裡所有 li 的第三個項目，找到他前方同層的所有 li
print(divs[1].find_parent().find_all('li')[2].find_previous_siblings())

Python 教學 - Beautiful Soup 函式庫 - Beautiful Soup 的方法

由於 find_all() 是使用頻率最高的方法，所以也可以簡化成下列的寫法：

import requests
from bs4 import BeautifulSoup

url = 'https://www.iana.org/domains/'
web = requests.get(url)
soup = BeautifulSoup(web.text, "html.parser")

print(soup.find_all('a'))    # 等同於下方的 soup('a')
print(soup('a'))             # 等同於上方的 find_all('a')

Beautiful Soup 方法的參數

使用 Beautiful Soup 方法時，可以加入一些參數，幫助更近一步的篩選搜尋結果，下方是一些常用的參數：

參數	說明
string	搜尋 tag 包含的文字。
limit	搜尋 tag 後只回傳多少個結果。
recursive	預設 True，會搜尋內容所有層，設定 False 只會搜尋下一層。
id	搜尋 tag 的 id。
class_	搜尋 tag class，因為 class 為 Python 保留字，所以後方要加上底線。
href	搜尋 tag href。
attrs	搜尋 tag attribute 屬性。

下方的程式碼，使用 Beautiful Soup 取得範例網頁中指定 tag 的內容，並加入參數做進一步的篩選。

範例網頁：https://www.iana.org/domains/

import requests
from bs4 import BeautifulSoup

url = 'https://www.iana.org/domains/'
web = requests.get(url)
soup = BeautifulSoup(web.text, "html.parser")
print(soup.find_all('a'))                     # 找出所有 a tag
print(soup.find_all('a', string='Domains'))   # 找出內容字串為 Domains 的 a tag
print(soup('a', limit=2))                     # 找出前兩個 a tag

取得並輸出內容

抓取到內容後，可以使用下列兩種常用的方法，將內容或屬性輸出為字串：

方法	說明
.get_text()	輸出 tag 的內容。
[屬性]	輸出 tag 裡某個屬性的內容。

下方的程式碼執行後，會先輸出第一個 a tag 的內容，接著輸出第一個 a tag 裡 href 屬性的內容。

import requests
from bs4 import BeautifulSoup

url = 'https://www.iana.org/domains/'
web = requests.get(url)
soup = BeautifulSoup(web.text, "html.parser")
print(soup.find('a').get_text())   # 輸出第一個 a tag 的內容
print(soup.find('a')['href'])      # 輸出第一個 a tag 的 href 屬性內容

抓取水庫的容量

如果是「靜態」頁面 ( 不需要跟伺服器溝通、頁面內容不是動態產生 )，透過 Beautiful Soup 都能很輕鬆的抓取到對應的內容，抓取到內容後，將內容輸出為純文字。

下方的程式碼執行後，會抓取水庫的名稱以及最大容量 ( 因即時水位是動態產生，所以單純用這個方法讀取 )。

範例網頁：https://water.taiwanstat.com/

import requests
from bs4 import BeautifulSoup

url = 'https://water.taiwanstat.com/'
web = requests.get(url)
soup = BeautifulSoup(web.text, "html.parser")
reservoir = soup.select('.reservoir')     # 取得所有 class 為 reservoir 的 tag
for i in reservoir:
  print(i.find('div', class_='name').get_text(), end=' ')  # 取得內容的 class 為 name 的 div 文字
  print(i.find('h5').get_text(), end=' ')   # 取得內容 h5 tag 的文字
  print()

Python 教學 - Beautiful Soup 函式庫

參考資料

更多內容可以參考 Beautiful Soup 的官方網站說明：

Beautiful Soup 4.4.0 文檔

意見回饋

如果有任何建議或問題，可傳送「意見表單」給我，謝謝～

Python 教學

Beautiful Soup 函式庫

安裝 Beautiful Soup 模組

import Beautiful Soup

開始使用 Beautiful Soup

認識基本網頁架構

網頁解析器

Beautiful Soup 的方法

Beautiful Soup 方法的參數

取得並輸出內容

抓取水庫的容量

參考資料

意見回饋

Python 教學

基本介紹

資料型別

語法觀念

函式操作

內建函式＆方法

標準函式庫＆模組

網路爬蟲

網頁服務與應用

LINE BOT 教學

OpenCV 教學

AI 影像辨識教學

NumPy 教學

matplotlib 圖表

Tkinter 介面設計

PyQt5 介面設計

PyQt6 介面設計

影音處理範例

實際應用範例

基礎範例

數學範例

ZeroJudge 解答