Python

Introduction

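Python is an object-oriented programming language. Unlike C and C++, it is an interpreted language.

Package manager: Anaconda

  • Anaconda can build separate Python environments as needed, manage Python packages, and create independent Python kernels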
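With several environments on one machine, it helps to check which interpreter is actually running. The sketch below uses only the standard library. The helper name describe_environment is made up for illustration, and it assumes conda exports the CONDA_DEFAULT_ENV variable when an environment is activated (the value is None otherwise).

```python
import os
import sys


def describe_environment():
    """Return basic facts about the interpreter that is running this code."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
        # Assumption: conda sets this variable on activation; None outside conda
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Running this inside two different environments should print two different prefix values, which is a quick sanity check that a notebook kernel points at the environment you expect.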

Jupyter

  • Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook that supports running more than 40 programming languages.

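Regular expressions

Ordinary characters are supported and match themselves literally

Metacharacters

  • \d matches a single digit (0-9)

  • \w matches a digit, letter, or underscore (0-9, a-z, A-Z, _)

  • \W and \D negate the above: \W matches anything except a digit, letter, or underscore, and \D matches anything except a digit

  • [abc] matches a, b, or c

  • [^abc] negates the set: matches any character other than a, b, or c

  • . matches any character except a newline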
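The metacharacters above can be tried out with re.findall. This is a small sketch; the sample strings are made up for illustration.

```python
import re

sample = "order_7 costs $42!"

digits = re.findall(r"\d", sample)        # ['7', '4', '2']
words = re.findall(r"\w+", sample)        # ['order_7', 'costs', '42']
non_word = re.findall(r"\W", sample)      # [' ', ' ', '$', '!']
non_digit = re.findall(r"\D+", sample)    # ['order_', ' costs $', '!']

in_set = re.findall(r"[abc]", "a big cat")       # ['a', 'b', 'c', 'a']
not_in_set = re.findall(r"[^abc]", "a big cat")  # [' ', 'i', 'g', ' ', 't']

# "." skips the newline between "ab" and "cd"
any_char = re.findall(r".", "ab\ncd")     # ['a', 'b', 'c', 'd']

print(digits, words, non_word, non_digit, in_set, not_in_set, any_char)
```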

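Quantifiers control how many times the preceding element repeats

  • + the preceding element appears one or more times

  • * the preceding element appears zero or more times (greedy: matches as much as possible)

  • ? the preceding element appears zero or one time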
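To see how the three quantifiers differ, the sketch below applies each one to the letter "a" between "c" and "t". The test string is invented for the example.

```python
import re

text = "ct cat caat caaat"

one_or_more = re.findall(r"ca+t", text)   # ['cat', 'caat', 'caaat']
zero_or_more = re.findall(r"ca*t", text)  # ['ct', 'cat', 'caat', 'caaat']
zero_or_one = re.findall(r"ca?t", text)   # ['ct', 'cat']

print(one_or_more, zero_or_more, zero_or_one)
```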

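Lazy matching

  • a.*b matches the longest possible axxxxxb (greedy matching)

  • a.*?b matches the shortest possible axxxxxb (lazy matching)

  • Useful for scraping content such as <div>xxx</div>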
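The difference between greedy and lazy matching shows up as soon as a string contains more than one closing tag. A minimal sketch on a made-up HTML string:

```python
import re

html = "<div>first</div><div>second</div>"

# Greedy: ".*" runs to the LAST </div>, swallowing both elements in one match
greedy = re.findall(r"<div>.*</div>", html)
# Lazy: ".*?" stops at the FIRST </div> it can, giving one match per element
lazy = re.findall(r"<div>.*?</div>", html)
# With a capture group, findall returns only the text inside the tags
inner = re.findall(r"<div>(.*?)</div>", html)

print(greedy)  # ['<div>first</div><div>second</div>']
print(lazy)    # ['<div>first</div>', '<div>second</div>']
print(inner)   # ['first', 'second']
```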

Python's built-in re module

  • r"" writes the regular expression as a raw string, so backslashes need no extra escaping
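A short sketch of why the r prefix matters. In a normal string literal "\b" is the backspace character, while in a raw string it stays as a backslash followed by "b", which the regex engine reads as a word boundary. The sample sentence is made up.

```python
import re

# The raw form and the double-escaped form are the same two-character string
same_pattern = r"\d" == "\\d"

plain = "\bcat\b"   # backspace + "cat" + backspace
raw = r"\bcat\b"    # word boundary + "cat" + word boundary

text = "cat concat cat"
raw_hits = re.findall(raw, text)      # ['cat', 'cat']
plain_hits = re.findall(plain, text)  # []

print(same_pattern, len(plain), len(raw))  # True 5 7
print(raw_hits, plain_hits)
```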

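import re

text = "I have 100 , buy 2 cake."

# findall returns every match as a list of strings
print(re.findall(r"\d+", text))

# search returns the first match as a Match object (None if nothing matches)
result = re.search(r"\d+", text)
print(result.group())

# finditer yields one Match object per match
for match in re.finditer(r"\d+", text):
    print(match.group())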

Precompiling a regular expression object

import re

obj = re.compile(r"\d+")  # compile once, then reuse the pattern object
print(obj.findall("I have 100 , buy 2 cake."))
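A compiled pattern is most useful when the same expression is applied many times. The sketch below reuses one pattern object across several strings and through several of its methods. The extra sample lines are invented for illustration.

```python
import re

number = re.compile(r"\d+")  # compiled once

lines = ["I have 100 , buy 2 cake.", "no digits here", "room 404"]

found = [number.findall(line) for line in lines]  # one result list per line
masked = number.sub("#", lines[0])                # replace every number with "#"
first = number.search(lines[2])                   # Match object for "404"

print(found)   # [['100', '2'], [], ['404']]
print(masked)  # I have # , buy # cake.
print(first.group(), first.span())  # 404 (5, 8)
```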

Extracting parts of a match with named groups

import re

s = """
<div class="abc">
<div><a href="baidu.com">百度</div>
<div><a href="163.com">网易</div>
<div><a href="qq.com">QQ</div>
</div>
"""

# (?P<name>...) defines a named group: "url" captures the link target, "txt" the link text
obj = re.compile(r'<div><a href="(?P<url>.*?)">(?P<txt>.*?)</div>')

res = obj.finditer(s)
for item in res:
    url = item.group("url")
    txt = item.group("txt")
    print(txt, url)
    print(item.groupdict())  # all named groups as a dict
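Named groups work the same way on any text, not only HTML. As a further sketch, the pattern below pulls year, month, and day out of ISO-style dates. The sample sentence is made up, and the {4} and {2} counts are exact-repetition quantifiers that the notes above do not cover.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

text = "published 2023-04-25, updated 2023-11-29"

parts = [m.groupdict() for m in date.finditer(text)]
first = date.search(text)

print(parts)
# [{'year': '2023', 'month': '04', 'day': '25'}, {'year': '2023', 'month': '11', 'day': '29'}]
print(first.group(0))        # whole match: 2023-04-25
print(first.group("month"))  # by name: 04
print(first.group(2))        # the same group by position: 04
```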

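Web scraping

  • Core library: import requests sends HTTP requests and receives the responses

  • import bs4 parses and processes the returned data

  • A Python scraper script for a Douban page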

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with every request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.41"
}
# Step through the list 25 entries at a time: start=0, 25, ..., 225
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Collect every <span class="title"> element on the page
    alltitles = soup.find_all("span", attrs={"class": "title"})
    for title in alltitles:
        title_string = title.string
        # Print only the titles that do not contain a "/"
        if '/' not in title_string:
            print(title_string)
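The paging and filtering logic in the script above can be checked without touching the network. The sketch below isolates both steps. The helper names page_urls and primary_titles and the sample titles are hypothetical. The None check is an added guard, since a BeautifulSoup tag's .string is None when the tag has more than one child.

```python
BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Build one URL per result page from the same start offsets as the loop."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def primary_titles(titles):
    """Keep titles that are present and contain no '/' separator."""
    return [t for t in titles if t is not None and "/" not in t]


urls = page_urls()
print(len(urls))   # 10
print(urls[0])     # https://movie.douban.com/top250?start=0
print(urls[-1])    # https://movie.douban.com/top250?start=225

# Illustrative input only, not fetched from the site
sample_titles = ["Example Title", "\xa0/\xa0Alternate Title", None]
kept = primary_titles(sample_titles)
print(kept)        # ['Example Title']
```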

  • Regular-expression approach; the re.S flag handles line breaks

import re

import requests

url = "https://movie.douban.com/top250"

head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.41"
}

resp = requests.get(url, headers=head)
resp.encoding = 'utf-8'

# re.S lets "." match newlines too, so each ".*?" can span several lines of HTML
obj = re.compile(r'<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
                 r'.*?<br>(?P<year>.*?)&nbsp;.*?<span class="rating_num" '
                 r'property="v:average">(?P<score>.*?)</span>.*?'
                 r'<span>(?P<num>.*?)人评价</span>', re.S)

res = obj.finditer(resp.text)
for item in res:
    dic = item.groupdict()
    dic['year'] = dic['year'].strip()  # strip surrounding whitespace and newlines
    print(dic)
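The re.S flag is what lets the pattern cross line breaks. A minimal offline sketch of its effect follows. The HTML fragment is hypothetical, only shaped like one list item, and the pattern is a shortened version of the one in the script.

```python
import re

# Hypothetical fragment, not copied from the live page
fragment = """<div class="item">
  <span class="title">Example Movie</span>
  <span class="rating_num" property="v:average">9.1</span>
</div>"""

pattern_text = (r'<span class="title">(?P<name>.*?)</span>'
                r'.*?property="v:average">(?P<score>.*?)</span>')

# Without re.S, "." stops at the newline after the title, so nothing matches
without_flag = re.search(pattern_text, fragment)
# With re.S, "." also matches newlines and the pattern spans both lines
with_flag = re.search(pattern_text, fragment, re.S)

print(without_flag)           # None
print(with_flag.groupdict())  # {'name': 'Example Movie', 'score': '9.1'}
```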
Author

huayi

Published

2023-04-25

Updated

2023-11-29

License