Introduction
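Python is an object-oriented programming language. Unlike C and C++, which are compiled, it is an interpreted language.
Package management with Anaconda
Anaconda can create separate Python environments as needed, manage Python packages, and set up independent Python kernels
Jupyter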
Regular expressions
Ordinary characters are supported and match themselves literally
Metacharacters
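\d
Matches a single digit (0-9)
\w
Matches a digit, letter, or underscore (0-9, a-z, A-Z, _)
\W, \D
The negations of the above: \W matches any character other than a digit, letter, or underscore, and \D matches any non-digit
[abc]
Matches a, b, or c
[^abc]
Negated set: matches any character other than a, b, or c
.
Matches any character except a newline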
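A minimal sketch of the metacharacters listed above, using Python's re module. The sample strings are made up for illustration, and the expected behaviour is noted in the comments rather than taken from the original notes.

```python
import re

# Made-up sample text containing a newline
sample = "Room_42, floor 7!\nnext line"

digits = re.findall(r"\d", sample)             # each single digit
word_chars = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
non_word = re.findall(r"\W", "a_b-c d")        # characters that \w does not match
non_digits = re.findall(r"\D+", "12ab34")      # runs of non-digit characters
in_set = re.findall(r"[abc]", "cabbage")       # any of a, b, c
not_in_set = re.findall(r"[^abc]", "cabbage")  # anything except a, b, c
dot = re.findall(r".+", sample)                # . stops at the newline

for name, value in [("digits", digits), ("word_chars", word_chars),
                    ("non_word", non_word), ("non_digits", non_digits),
                    ("in_set", in_set), ("not_in_set", not_in_set), ("dot", dot)]:
    print(name, value)
```

Because `.` does not cross a newline by default, the last pattern yields one match per line of the sample.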
Quantifiers control how many times a metacharacter is repeated
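The notes do not list the individual quantifiers, so the following is a small sketch of the common ones (`*`, `+`, `?`, `{n}`, `{n,m}`) as they behave in Python's re module. The sample string is made up for illustration.

```python
import re

# Made-up text: the letter b repeated 0, 1, 2 and 3 times between a and c
text = "ac abc abbc abbbc"

star = re.findall(r"ab*c", text)        # * : zero or more b
plus = re.findall(r"ab+c", text)        # + : one or more b
optional = re.findall(r"ab?c", text)    # ? : zero or one b
exact = re.findall(r"ab{2}c", text)     # {2} : exactly two b
ranged = re.findall(r"ab{2,3}c", text)  # {2,3} : two or three b

print(star)
print(plus)
print(optional)
print(exact)
print(ranged)
```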
Lazy matching
a.*b
Greedy match: the longest possible a...b
a.*?b
Lazy match: the shortest possible a...b
Useful for scraping content such as <div>xxx</div>
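To make the difference concrete, here is a short sketch comparing the greedy and lazy forms on a made-up HTML fragment with two div elements. The fragment and variable names are illustrative only.

```python
import re

# Made-up fragment with two sibling <div> elements on one line
html = "<div>first</div><div>second</div>"

greedy = re.findall(r"<div>.*</div>", html)    # runs to the last </div>
lazy = re.findall(r"<div>.*?</div>", html)     # stops at the nearest </div>
inner = re.findall(r"<div>(.*?)</div>", html)  # capture only the inner text

print(greedy)
print(lazy)
print(inner)
```

The greedy pattern swallows both elements as a single match, while the lazy one returns each element separately, which is usually what is wanted when extracting repeated tags.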
Python's built-in re module

    import re

    text = "I have 100 , buy 2 cake."
    result = re.findall(r"\d+", text)   # list of every match, as strings
    print(result)
    result = re.search(r"\d+", text)    # first match only, as a Match object
    print(result.group())
    result = re.finditer(r"\d+", text)  # iterator of Match objects
    for item in result:
        print(item.group())
Precompiling a regular expression object

    import re

    obj = re.compile(r"\d+")  # compile once, then reuse the pattern object
    print(obj.findall("I have 100 , buy 2 cake."))
Extracting parts of a match with named groups

    import re

    s = """
    <div class="abc">
        <div><a href="baidu.com">百度</div>
        <div><a href="163.com">网易</div>
        <div><a href="qq.com">QQ</div>
    </div>
    """
    # (?P<name>...) gives a group a name that can be read back from the match
    obj = re.compile(r'<div><a href="(?P<url>.*?)">(?P<txt>.*?)</div>')

    res = obj.finditer(s)
    for item in res:
        url = item.group("url")
        txt = item.group("txt")
        print(txt, url)
        print(item.groupdict())  # all named groups as a dict
Web scraping

    import requests
    from bs4 import BeautifulSoup

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.41"
    }
    # Step through the list in pages of 25: start=0, 25, ..., 225
    for start_num in range(0, 250, 25):
        response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
        html = response.text
        soup = BeautifulSoup(html, "html.parser")
        alltitles = soup.find_all("span", attrs={"class": "title"})
        for title in alltitles:
            title_string = title.string
            # Skip title spans whose text contains "/"
            if "/" not in title_string:
                print(title_string)
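The paging and filtering logic of the loop above can be tried without any network access. The sketch below is an assumption-laden illustration: `page_urls` and `keep_primary_titles` are hypothetical helper names, and the sample title strings are made up rather than taken from the site.

```python
BASE_URL = "https://movie.douban.com/top250"  # same URL as in the scraper above


def page_urls(total=250, page_size=25):
    """Hypothetical helper: build one URL per result page."""
    return [f"{BASE_URL}?start={start}" for start in range(0, total, page_size)]


def keep_primary_titles(titles):
    """Hypothetical helper: drop strings containing '/', like the loop's filter."""
    return [t for t in titles if "/" not in t]


urls = page_urls()
print(len(urls), urls[0], urls[-1])

# Made-up strings standing in for the text of <span class="title"> elements
sample_titles = ["Title A", "\xa0/\xa0Alternate Title A", "Title B"]
print(keep_primary_titles(sample_titles))
```

Separating URL construction and filtering from the request itself makes each piece easy to check before pointing the scraper at the live site.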
    import re
    import requests

    url = "https://movie.douban.com/top250"
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.41"
    }
    resp = requests.get(url, headers=head)
    resp.encoding = 'utf-8'

    # re.S lets "." match newlines, so one pattern can span a whole multi-line item.
    # The year group is assumed to end at an &nbsp; entity in the page source.
    obj = re.compile(r'<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
                     r'.*?<br>(?P<year>.*?)&nbsp;.*?<span class="rating_num" '
                     r'property="v:average">(?P<score>.*?)</span>.*?'
                     r'<span>(?P<num>.*?)人评价</span>', re.S)

    res = obj.finditer(resp.text)
    for item in res:
        dic = item.groupdict()
        dic['year'] = dic['year'].strip()  # remove surrounding whitespace and newlines
        print(dic)
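The re.S flag in the pattern above matters because each item on a real page spans several lines. The following sketch shows the effect on a made-up, much simplified fragment. The markup and values are illustrative assumptions, not the site's actual HTML.

```python
import re

# Made-up fragment; it only imitates the multi-line layout of a list item
html = """<div class="item">
    <span class="title">Example Movie</span>
    <span class="rating_num">9.0</span>
</div>"""

pattern = (r'<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
           r'.*?<span class="rating_num">(?P<score>.*?)</span>')

without_flag = re.search(pattern, html)     # "." cannot cross the newlines
with_flag = re.search(pattern, html, re.S)  # re.S lets "." match newlines too

print(without_flag)
print(with_flag.groupdict())
```

Without the flag the search finds nothing, because `.*?` cannot step over the line breaks between the tags.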