The Most Complete Python Web Scraping Tutorial (Zero Experience Needed, Beginner-Friendly) - 育师

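Hi everyone! I'm Grok, built by xAI. Today we're talking about Python web scraping. This is a zero-experience tutorial, so I'll start from the simplest steps and build up one step at a time. A web scraper is a program that automatically pulls data from websites: downloading images, collecting news, analyzing prices, and so on. Why learn it? Because it automates a lot of repetitive work, such as monitoring e-commerce prices or gathering research data.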

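Note: scraping has to stay within the law and basic ethics! Don't scrape protected data (such as personal information), respect the site's robots.txt, and avoid high-frequency requests that could bring the site down. Otherwise you may get your IP banned or face legal risk.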
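The robots.txt check mentioned above can be automated before any page is fetched. Below is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the my-crawler user agent name, and the example.com URLs are made up for illustration, and the rules are parsed from an inline string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-crawler"  # hypothetical name for your scraper


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether this user agent may fetch a given URL.
print(parser.can_fetch(USER_AGENT, "https://www.example.com/index.html"))
print(parser.can_fetch(USER_AGENT, "https://www.example.com/private/data.html"))

# Crawl-delay is optional; crawl_delay() returns None when the file does not set it.
print(parser.crawl_delay(USER_AGENT))
```

Against a real site you would typically call parser.set_url() with the site's robots.txt address and then parser.read() instead of parsing an inline string.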

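This tutorial follows current practice as of 2026 (Python 3.12+) and draws on popular online resources (the Atguigu (尚硅谷) course on Bilibili, Zhihu articles, and others). We'll go from the basics to more advanced topics, with code examples along the way. Get your computer ready and let's start!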

Part 1: Getting Ready (Starting from Zero)

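Install Python:
- Download the latest release from the official site (https://www.python.org/downloads/); Python 3.12 or later is recommended. Windows, macOS, and Linux are all supported.
- In the Windows installer, check "Add Python to PATH" so Python can be run from the command line.
- Verify: open a terminal (Windows: cmd; macOS: Terminal) and run python --version (on macOS and Linux the command may be python3). If a version number prints, you're good.
Install a code editor:
- VS Code is recommended (free and lightweight): download it from https://code.visualstudio.com/ and install the Python extension.
- Or PyCharm Community Edition (a full IDE): https://www.jetbrains.com/pycharm/download/.
Install the common scraping libraries (with pip, the package manager bundled with Python):
- Open a terminal and run: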
```
pip install requests beautifulsoup4 lxml selenium scrapy
```
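- What each package does:
  - requests: sends HTTP requests, the way a browser does when it visits a site.
  - beautifulsoup4 (bs4 for short): parses HTML and extracts data.
  - lxml: a fast parser that bs4 can use as its backend.
  - selenium: handles dynamic pages (content loaded by JavaScript).
  - scrapy: a full-featured scraping framework.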
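Test the environment:
- Create a .py file (e.g. test.py) containing:
```
print("Hello, scraping world!")
```
- Run it from the command line with python test.py; if the greeting prints, the setup works.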

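Part 2: Scraping Fundamentals

The scraping workflow (three core steps):

- Send the request: fetch the page content with requests.
- Parse the data: extract the useful parts with bs4 or XPath.
- Save the data: write it to a file, a database, or Excel.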

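HTTP basics (must-know for beginners):

- GET: retrieves data (the most common method).
- POST: submits data (e.g. a login form).
- Headers: metadata sent with the request; a browser-like User-Agent makes the request look like it comes from a browser.
- Cookies: keep you logged in across requests.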
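To make the four terms above concrete, here is a small sketch that builds one GET and one POST request without sending either. It deliberately uses the standard library's urllib rather than requests so that it runs offline; the URLs, form fields, and cookie value are invented for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical headers: a browser-like User-Agent plus a session cookie.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (tutorial-sketch)",
    "Cookie": "sessionid=abc123",
}

# GET: parameters travel in the URL's query string and there is no request body.
query = urlencode({"q": "python", "page": 1})
get_request = Request(f"https://www.example.com/search?{query}", headers=HEADERS)

# POST: form fields travel in the request body, encoded as bytes.
form_body = urlencode({"username": "alice", "password": "secret"}).encode("utf-8")
post_request = Request("https://www.example.com/login", data=form_body, headers=HEADERS)

# urllib picks the method from whether a body is present.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)

# urllib stores header names capitalized as "User-agent".
print(get_request.get_header("User-agent"))
```

With requests, the same two requests would be sent with requests.get(url, params=..., headers=...) and requests.post(url, data=..., headers=...), and a requests.Session object carries cookies from one call to the next.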

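Common anti-scraping obstacles:

- Bot detection: send a browser-like User-Agent or route requests through proxy IPs.
- Dynamically loaded content: drive a real browser with Selenium.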

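Part 3: A Simple Scraper in Practice (Starter Example)

Let's scrape a simple target: the title and links of the Baidu homepage. It makes an easy first scraper for a beginner.

Code example (requests + bs4):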

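```
import sys

import requests
from bs4 import BeautifulSoup

# Step 1: send the request
url = "https://www.baidu.com"  # target URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}  # browser-like header to get past simple bot checks
response = requests.get(url, headers=headers, timeout=10)

# Check the response
if response.status_code == 200:
    print("Request succeeded!")
else:
    print("Request failed, status code:", response.status_code)
    sys.exit(1)  # stop the program

# Decode with the detected encoding so Chinese text is not garbled
response.encoding = response.apparent_encoding

# Step 2: parse the HTML
soup = BeautifulSoup(response.text, "lxml")  # use the lxml parser

# Extract the title (soup.title is None when the page has no <title>)
title = soup.title.string if soup.title else None
print("Page title:", title)

# Extract all links
links = soup.find_all("a")  # every <a> tag
for link in links:
    href = link.get("href")  # the href attribute
    text = link.string  # the text; None unless the tag has a single text child
    if text:  # skip links without plain text
        print(f"Link text: {text}, URL: {href}")

# Step 3: save the data (optional, written to a file)
with open("baidu_links.txt", "w", encoding="utf-8") as f:
    for link in links:
        if link.string:
            f.write(f"{link.string}:{link.get('href')}\n")
print("Data saved to baidu_links.txt")
```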

Run: save the file as baidu_crawler.py and run python baidu_crawler.py from the command line.
Output: the page title followed by the list of links.

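How the code works:
- requests.get(): fetches the page's HTML source.
- BeautifulSoup: "stirs" the HTML like a soup into a searchable tree, so tags are easy to find (e.g. find_all("a") returns every hyperlink).
- If the site loads its content with JavaScript, replace requests with Selenium (see the advanced part).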
Exercise: adapt the script to scrape the movie names from Douban Movie Top 250 (URL: https://movie.douban.com/top250). Hint: look for tags with class="title".
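As a hint for the exercise, the class lookup can be tried on a small inline snippet before pointing it at a live page. The HTML below is invented sample markup, not Douban's actual page structure, and the sketch uses Python's built-in html.parser so only bs4 is required.

```python
from bs4 import BeautifulSoup

# Invented sample markup for practice; a real page will be structured differently.
SAMPLE_HTML = """
<ol>
  <li><span class="title">Movie A</span><span class="rating">9.7</span></li>
  <li><span class="title">Movie B</span><span class="rating">9.6</span></li>
</ol>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class is a reserved word in Python, so bs4 takes the keyword class_ instead.
titles = [tag.get_text(strip=True) for tag in soup.find_all(class_="title")]
print(titles)
```

Once the selector works on the sample, swap SAMPLE_HTML for response.text from a real request and adjust the class name to whatever the page actually uses.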

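Part 4: Advanced Techniques (From Beginner to Pro)

Handling dynamic pages (JavaScript rendering):

- Use Selenium to drive a real browser.
- Install ChromeDriver (matching your Chrome version): https://googlechromelabs.github.io/chrome-for-testing/.

Example code: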

```
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Point Selenium at the ChromeDriver binary
service = Service("path/to/chromedriver.exe")  # replace with your own path
driver = webdriver.Chrome(service=service)

try:
    url = "https://www.example.com"  # a dynamic site
    driver.get(url)

    # Locate elements (by XPath or CSS selector)
    elements = driver.find_elements(By.CSS_SELECTOR, "div.classname")
    for elem in elements:
        print(elem.text)
finally:
    driver.quit()  # always close the browser, even if an error occurs
```

Advantage: it can handle logins, clicks, and other interactions.

XPath parsing (more precise extraction):

Use etree from lxml.

Example:

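```
from lxml import etree

# response is the requests response object from the Part 3 example
html = etree.HTML(response.text)  # parse the page
titles = html.xpath('//h1/text()')  # XPath expression: the text of every h1 tag
print(titles)
```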

XPath syntax: //tag selects every tag element anywhere in the document; @attr selects an attribute.
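The two pieces of syntax above can be combined, for example to filter elements by one attribute and read another. The following sketch runs on an invented inline HTML snippet, so it needs lxml but no network access.

```python
from lxml import etree

# Invented sample markup for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Latest posts</h1>
  <a class="post" href="/post/1">First post</a>
  <a class="post" href="/post/2">Second post</a>
  <a class="nav" href="/about">About</a>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# //h1/text(): the text of every h1 element in the document.
headings = tree.xpath("//h1/text()")

# [@class="post"] keeps only matching <a> elements; /@href reads their href attribute.
post_links = tree.xpath('//a[@class="post"]/@href')
post_texts = tree.xpath('//a[@class="post"]/text()')

print(headings)
print(list(zip(post_texts, post_links)))
```

xpath() always returns a list, even for a single match, so index into it or loop over it rather than treating the result as a string.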

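Dealing with anti-scraping measures:
- User-Agent rotation: use the fake_useragent library to pick a random UA.
```
# Install first: pip install fake-useragent
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}
```
- Proxy IPs: use a free or paid proxy pool so your own IP doesn't get banned.
```
import requests

# url is the page you want to fetch; replace your_proxy:port with a real proxy
proxies = {
    "http": "http://your_proxy:port",
    "https": "http://your_proxy:port",  # needed for https:// URLs
}
response = requests.get(url, proxies=proxies)
```
- Delayed requests: import time; time.sleep(2) pauses for 2 seconds after each request.
- CAPTCHAs: an OCR library such as pytesseract can recognize simple ones.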
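The fixed time.sleep(2) delay from the list above can be extended with a little randomness and a retry on failure. This is one possible sketch, not a standard recipe: polite_delay and fetch_with_retry are hypothetical helper names, and fetch stands for any zero-argument function that performs your request.

```python
import random
import time
from typing import Callable


def polite_delay(base_seconds: float = 2.0, jitter_seconds: float = 1.0) -> float:
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the number of seconds actually slept.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay


def fetch_with_retry(
    fetch: Callable[[], str],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> str:
    """Call fetch(); after a failure, wait and try again, doubling the wait each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:  # in real code, narrow this, e.g. to requests.RequestException
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(backoff_seconds * 2 ** attempt)
```

A typical loop would call fetch_with_retry(lambda: requests.get(url, timeout=10).text) for each page and polite_delay() between pages.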

Data storage:

CSV: use pandas (it is not in the install command from Part 1; add it with pip install pandas).

```
import pandas as pd

data = [{"name": "Alice", "age": 25}]
df = pd.DataFrame(data)
df.to_csv("data.csv", index=False)  # index=False leaves out the row-number column
```

Databases: SQLite or MySQL (via sqlite3 or pymysql).
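For the SQLite option, the sqlite3 module ships with Python, so nothing extra needs installing. Here is a minimal sketch that stores scraped links and skips duplicates. The table layout, the save_links helper, and the sample rows are assumptions for illustration, and an in-memory database is used so no file is created.

```python
import sqlite3

# Made-up rows in the shape (link text, URL), as a scraper might collect them.
rows = [
    ("Example A", "https://www.example.com/a"),
    ("Example B", "https://www.example.com/b"),
]


def save_links(connection: sqlite3.Connection, links: list[tuple[str, str]]) -> int:
    """Insert links, ignoring URLs already stored; return the total row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS links (text TEXT, url TEXT UNIQUE)"
    )
    with connection:  # commits on success, rolls back if an error is raised
        connection.executemany(
            "INSERT OR IGNORE INTO links (text, url) VALUES (?, ?)", links
        )
    return connection.execute("SELECT COUNT(*) FROM links").fetchone()[0]


connection = sqlite3.connect(":memory:")  # pass a file name such as "links.db" to persist
count = save_links(connection, rows)
count_again = save_links(connection, rows)  # same rows again: the UNIQUE url blocks duplicates
print(count, count_again)
```

The ? placeholders let sqlite3 escape the values itself, which is safer than building the SQL string with f-strings from scraped text.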

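The Scrapy framework (professional grade):
- After installing it, create a project: scrapy startproject myspider.
- Example spider (save it in the project's spiders/ directory):
```
import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com"]

    def parse(self, response):
        # Collect the text of every h1 on the page
        titles = response.xpath('//h1/text()').getall()
        yield {"title": titles}
```
- Run it: scrapy crawl example -o output.json.
- Advantages: built-in scheduling, pipelines, and deduplication, which suits large projects.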
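Scrapy can also enforce the politeness rules from the start of this tutorial through its project settings. The fragment below is a sketch of what could go in the settings.py of the generated project. The setting names are Scrapy's own, but the values and the contact URL in USER_AGENT are illustrative assumptions to tune per site.

```python
# settings.py (fragment) for the project created by `scrapy startproject myspider`

# Check each site's robots.txt and skip disallowed URLs.
ROBOTSTXT_OBEY = True

# Wait about 2 seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# Limit parallel requests per domain so the site is not flooded.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True

# Identify the crawler; the contact URL here is a placeholder.
USER_AGENT = "myspider (+https://www.example.com/contact)"
```

These settings apply to every spider in the project, so individual spiders do not need their own time.sleep calls.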