Scraping Google Organic Search Results with Python
Google Search is dynamic and heavily protected. Static scrapers (requests + BeautifulSoup) won't work here; they just get back near-empty HTML. You need a headless browser instead. Below is a quick guide using SeleniumBase in undetected Chrome (UC) mode.
Table of Contents
- Step 1. Install SeleniumBase
- Step 2. Import the libraries
- Step 3. Build the search URL
- Step 4. Launch a headless browser
- Step 5. Extract the organic results
- Step 6. Save the results
- Full code
- Notes
Step 1. Install SeleniumBase
Install the package:
pip install seleniumbase
This gives you an extended Selenium wrapper with built-in UC Mode (undetected Chrome).
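To confirm the install worked, you can check the installed package:

pip show seleniumbase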
Step 2. Import the libraries
Import them into your project:
from seleniumbase import Driver
from selenium.webdriver.common.by import By
import urllib.parse
import pandas as pd
Step 3. Build the search URL
Generate a Google search URL from a keyword:
def build_search_url(query):
    # quote_plus makes the query URL-safe ("web scraping" -> "web+scraping")
    return f"https://www.google.com/search?q={urllib.parse.quote_plus(query)}"
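A quick check of what the helper produces:

print(build_search_url("what is web scraping"))
# https://www.google.com/search?q=what+is+web+scraping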
Step 4. Launch a headless browser
Start Chrome in UC Mode so Google treats it as a real user:
driver = Driver(uc=True, headless=True)
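However you structure the script, make sure the browser is released at the end; the full script below uses try/finally for this:

try:
    ...  # all scraping steps go here
finally:
    driver.quit()  # always shut Chrome down, even on errors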
Step 5. Extract the organic results
Each organic result sits inside a div.MjjYud block. Grab the title, link, and snippet from each one:
def scrape_google(driver, query):
    driver.get(build_search_url(query))
    blocks = driver.find_elements(By.CSS_SELECTOR, "div.MjjYud")
    results = []
    for b in blocks:
        try:
            title = b.find_element(By.CSS_SELECTOR, "h3").text
            link = b.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
            snippet = b.find_element(By.CSS_SELECTOR, "div.VwiC3b").text
            results.append({"Title": title, "Link": link, "Snippet": snippet})
        except Exception:
            # Not every MjjYud block is an organic result (ads, video packs, etc.)
            continue
    return results
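To sanity-check the selectors before saving anything, you can print the first few parsed results (the exact output depends on what Google serves):

results = scrape_google(driver, "what is web scraping")
for r in results[:3]:
    print(r["Title"])
    print(r["Link"])
    print(r["Snippet"][:80])  # first 80 characters of the snippet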
Step 6. Save the results
Use pandas to write everything to a CSV file:
data = scrape_google(driver, "what is web scraping")
pd.DataFrame(data).to_csv("organic_results.csv", index=False)
print(f"Saved {len(data)} results")
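To confirm the CSV was written correctly, pandas can read it straight back:

df = pd.read_csv("organic_results.csv")
print(df.head())  # preview the first rows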
Full code
Check that the selectors still match (Google changes its class names regularly), then copy:
from seleniumbase import Driver
from selenium.webdriver.common.by import By
import urllib.parse
import pandas as pd
import time

def build_search_url(query):
    return f"https://www.google.com/search?q={urllib.parse.quote_plus(query)}"

def scrape_google(driver, query, max_pages=1):
    results = []
    for page in range(max_pages):
        # Google paginates with the "start" parameter, 10 results per page
        url = build_search_url(query) + (f"&start={page * 10}" if page > 0 else "")
        driver.get(url)
        time.sleep(5)  # wait for the page to load
        try:
            blocks = driver.find_elements(By.CSS_SELECTOR, "div.MjjYud")
        except Exception:
            continue  # skip the page if results can't be located
        for b in blocks:
            try:
                title = b.find_element(By.CSS_SELECTOR, "h3").text
                link = b.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
                snippet = b.find_element(By.CSS_SELECTOR, "div.VwiC3b").text
                results.append({"Title": title, "Link": link, "Snippet": snippet})
            except Exception:
                continue  # not an organic result block
    return results

driver = Driver(uc=True, headless=True)  # undetected Chrome (UC Mode)
try:
    data = scrape_google(driver, "what is web scraping", max_pages=2)
    pd.DataFrame(data).to_csv("organic_results.csv", index=False)
    print(f"Saved {len(data)} results")
finally:
    driver.quit()
Notes
- The full guide to scraping Google SERPs with Python
- Join our Discord
- If there's an example you'd like that I missed, leave a comment and I'll add it.