[Scraping tomorrow's quotes with Python] Learning Python: scraping jokes from a web page (very practical, simple) - 拍拖百科

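Below is the complete code for scraping jokes from a web page with Python. It was run under Python 3.4, and the explanations are in the comments.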

# -*- coding:utf-8 -*-

import re
import urllib.error
import urllib.request

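def getHtml(url, page):
    # page is not used inside this function; it is kept so the call in the main loop stays the same
    user_agent = 'Mozilla (compatible; MSIE 5.5; Windows NT)'
    html = ''  # returned unchanged if the request fails
    try:
        req = urllib.request.Request(url)
        # send a browser-like User-Agent instead of the default Python one
        req.add_header('User-Agent', user_agent)
        response = urllib.request.urlopen(req)
        the_page = response.read()
        html = the_page.decode('utf-8')
    except urllib.error.URLError as e:
        # an HTTPError carries a status code; a plain URLError only has a reason
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html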

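def getCrossTalk(html):
    pattern = re.compile('<div.*?class="author.*?>.*?<a.*?</a>.*?<a.*?>.*?<h2>(.*?)</h2>.*?</a>.*?<div.*?class'+
                         '="content".*?<span>(.*?)</span>.*?</div>.*?<div class="stats.*?class="number">(.*?)</i>.*?class="number">(.*?)</i>',re.S)
    # 1) .*? is a fixed idiom: . and * together match any number of arbitrary characters, and the trailing ? switches to non-greedy mode, so each match is kept as short as possible. The idiom is used many times in this pattern.
    # 2) (.*?) is a capture group. This pattern contains four groups; when iterating over the results, item[0] is the text captured by the first (.*?), item[1] the text captured by the second, and so on.
    # 3) The re.S flag turns on dot-all mode, so the dot . also matches a newline.
    items = re.findall(pattern, html)
    for item in items:
        # skip posts whose content contains an image tag
        haveImg = re.search("img", item[1])
        if not haveImg:
            print('Author: ' + item[0] + ' ', 'Content: ' + item[1] + ' ', 'Likes: ' + item[2], 'Comments: ' + item[3] + ' ')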

for page in range(1, 35):
    # the URL prefix of the paginated list is not given here; fill it in before running
    url = '' + str(page)
    html = getHtml(url, page)
    getCrossTalk(html)
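
The three regex notes in the comments above (non-greedy .*?, capture groups, and re.S) can be tried out without any network access. The following is only a minimal sketch: SAMPLE_HTML is a made-up fragment for illustration, not markup taken from the real page, and the variable names are hypothetical.

```python
import re

# Hypothetical HTML fragment, invented for this illustration only.
SAMPLE_HTML = """<div class="author">
<h2>alice</h2>
</div>
<div class="content">
<span>first
joke</span>
</div>
<div class="author">
<h2>bob</h2>
</div>
<div class="content">
<span>second joke</span>
</div>"""

# Greedy: .* runs on to the LAST </h2>, so one match swallows both posts.
greedy = re.findall(r'<h2>(.*)</h2>', SAMPLE_HTML, re.S)

# Non-greedy: .*? stops at the FIRST </h2>, giving one match per post.
lazy = re.findall(r'<h2>(.*?)</h2>', SAMPLE_HTML, re.S)

# Two groups: findall returns one tuple per match, so item[0] is the
# first (.*?) and item[1] is the second.
pairs = re.findall(r'<h2>(.*?)</h2>.*?<span>(.*?)</span>', SAMPLE_HTML, re.S)

# Without re.S the dot cannot cross a newline, so the same pattern finds nothing.
no_dotall = re.findall(r'<h2>(.*?)</h2>.*?<span>(.*?)</span>', SAMPLE_HTML)

print(len(greedy))  # 1
print(lazy)         # ['alice', 'bob']
for item in pairs:
    print(item[0], '|', item[1])
print(no_dotall)    # []
```

Running the sketch shows why the scraper's pattern uses .*? everywhere and passes re.S: the page markup spans many lines, and a greedy match would merge several posts into one result.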