Python + Selenium: Scraping Product Information from Taobao
Keywords: Taobao | dynamic web pages | selenium

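A few days ago I used Python to scrape Douban reviews of the film 《长城》 (The Great Wall). I was quietly pleased to find that Douban's pages are static, because I did not know much about dynamic pages. I scraped them mainly with cookies plus request headers, and it worked well: I collected sixty to seventy thousand user reviews, and later I plan to study how to identify which users are paid posters (水军). Today I looked into scraping dynamic pages, specifically product information on Taobao, mainly using the selenium library.
The main steps are as follows.
(Note) Python libraries used: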

import re
import time
import random
from bs4 import BeautifulSoup
from selenium import webdriver

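(1) Drive a browser with selenium to simulate logging in (I use Firefox; other browsers work on the same principle).
Create the Firefox browser instance:
firefox_login = webdriver.Firefox()  # if Firefox is installed, this opens a blank browser window
Log in to the Taobao account with a username and password. (Note: this only works when the page is in password-login mode. Annoyingly, Taobao's login page now defaults to QR-code login, so you have to switch back to password login first, either by hand or by simulating the switch with selenium.) If any expert can tell me how to log in via the QR code, I would be very grateful.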

firefox_login.find_element_by_id('TPL_username_1').clear()
firefox_login.find_element_by_id('TPL_username_1').send_keys(u'your_username')  # placeholder
firefox_login.find_element_by_id('TPL_password_1').clear()
firefox_login.find_element_by_id('TPL_password_1').send_keys(u'your_password')  # placeholder

Click the login button to log in:

firefox_login.find_element_by_id('J_SubmitStatic').click()

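OK, the browser is now logged in to your Taobao account. The next step is to search for what you want.
(2) Search for the item (here I searched for the book 《代码之美》, Beautiful Code)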

firefox_login.find_element_by_id('q').send_keys(u'代码之美')
firefox_login.find_element_by_class_name('btn-search').click()

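The Firefox page now jumps to the search results for 《代码之美》, which contain Taobao's listings for the book (seller, title, description, price, shop location, number of buyers who have paid, and so on).
(3) Get the number of result pages (in preparation for the loop).
First, get the source of the page as currently rendered in the browser: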

html=firefox_login.page_source

This is where the BeautifulSoup library shows its strength:

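soup = BeautifulSoup(html, 'lxml')
comments = soup.find_all("div", class_="total")  # element that shows the total page count
pattern = re.compile(r'[0-9]+')  # match the whole number, not just a single digit
pageNum = pattern.findall(comments[0].text)  # extract the page count as a string
pageNum = int(pageNum[0])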

Note that the extracted pageNum must be converted to int, because it is used as the loop count.
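The digit extraction is worth checking in isolation. A pattern such as `[0-9]` matches one digit at a time, so taking the first match of a multi-digit page count would keep only its first digit, while `\d+` keeps the whole number. The sketch below uses an assumed sample string (the real text of the `total` element depends on Taobao's markup) and a hypothetical helper name, `extract_page_count`, that is not part of the original script.

```python
import re


def extract_page_count(text):
    """Return the first whole number found in text, or None if there is none."""
    match = re.search(r'\d+', text)  # \d+ grabs all consecutive digits
    return int(match.group()) if match else None


# Assumed sample text; the real element text may differ.
sample = '共 100 页，'
print(extract_page_count(sample))    # 100
print(re.findall(r'[0-9]', sample))  # ['1', '0', '0'], one digit per match
```

Returning None when no number is found makes a markup change easier to notice than an IndexError deep inside the scraper.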
(4) Parse the HTML

Infolist = []  # stores the scraped records
comments = soup.find_all("div", class_="ctx-box J_MouseEneterLeave J_IconMoreNew")
for i in comments:
    temp = []
    Item = i.find_all("div", class_="row row-2 title")  # book title and description
    temp.append(Item[0].text.strip())
    shop = i.find_all("div", class_="row row-3 g-clearfix")
    for j in shop:
        a = j.find_all("span")
        temp.append(a[-1].text)  # shop name
    address = i.find_all('div', class_='location')
    temp.append(address[0].text.strip())  # shop location
    priceandnum = i.find_all("div", class_="row row-1 g-clearfix")
    for m in priceandnum:
        Y = m.find_all('div', class_='price g_price g_price-highlight')
        temp.append(Y[0].text.strip())  # price
        Num = m.find_all('div', class_='deal-cnt')
        temp.append(Num[0].text.strip())  # number of buyers
    Infolist.append(temp)

The scraping code above is much easier to understand when read alongside the page source.
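Since the real page source is not reproduced here, the following is a minimal sketch of the same parsing idea run against a hand-written HTML fragment. The fragment is hypothetical and heavily simplified; it only reuses the class names from the code above, and the sample values are made up. The helper name `parse_items` is also an assumption. It uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that only mimics the class names used above.
SAMPLE_HTML = """
<div class="ctx-box J_MouseEneterLeave J_IconMoreNew">
  <div class="row row-1 g-clearfix">
    <div class="price g_price g_price-highlight">¥59.00</div>
    <div class="deal-cnt">120人付款</div>
  </div>
  <div class="row row-2 title"> 代码之美 </div>
  <div class="row row-3 g-clearfix">
    <div class="shop"><span>icon</span><span>某书店</span></div>
    <div class="location">北京</div>
  </div>
</div>
"""


def parse_items(html):
    """Return one [title, shop, location, price, deals] list per result box."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # Passing a single class name matches any element that carries that class.
    for box in soup.find_all('div', class_='ctx-box'):
        title = box.find('div', class_='title').text.strip()
        shop = box.find('div', class_='row-3').find_all('span')[-1].text.strip()
        location = box.find('div', class_='location').text.strip()
        price = box.find('div', class_='price').text.strip()
        deals = box.find('div', class_='deal-cnt').text.strip()
        rows.append([title, shop, location, price, deals])
    return rows


print(parse_items(SAMPLE_HTML))
```

Matching on a single class name instead of the full class string is a design choice in this sketch: it keeps working if the site reorders or adds classes on the same element.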

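(5) After one page has been scraped, click to refresh the data for the next round of scraping. (Taobao's result page uses AJAX, meaning part of the data is updated without reloading the whole page, so the page URL does not change.)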

firefox_login.find_element_by_xpath('//a[@trace="srp_bottom_pagedown"]').click()  # click "next page"; the data is refreshed via AJAX
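Because the results are refreshed via AJAX, reading page_source immediately after the click may still return the previous page's data. One way to handle this is to poll until the page has changed. The helper below, `wait_until`, is a hypothetical stand-alone sketch and not part of the original script; Selenium itself ships `WebDriverWait` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(interval)


# Hypothetical usage with the driver from this post (a crude "did it change" check):
# old_html = firefox_login.page_source
# firefox_login.find_element_by_xpath('//a[@trace="srp_bottom_pagedown"]').click()
# wait_until(lambda: firefox_login.page_source != old_html)
```

Comparing the whole page source is only a rough signal, since other parts of the page may change too. Waiting for a specific element, such as the first result box, would be more precise.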

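The parts above are essentially the steps needed to scrape item information from Taobao. I am only a beginner at web scraping, so mistakes and rough edges are inevitable, and corrections from experts are welcome. The complete program is below. It is fairly simple; later I will add multithreading and handling for other situations (for example, after many logins Taobao asks for a slider CAPTCHA when you log in again, which would be worth handling). I also wrote this down so that I do not forget it later. It may be trivial for experts, but when you are just starting out you begin with the basics. Fighting, keep going!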

The complete code is as follows:

from selenium import webdriver
from bs4 import BeautifulSoup
import random
import re
import time

# Note: the find_element_by_* helpers used below were removed in Selenium 4.3;
# on newer versions use find_element(By.ID, ...) and its siblings instead.

Infolist = []  # stores the scraped records


def init():
    firefox_login = webdriver.Firefox()  # create the browser instance
    # Taobao login page
    firefox_login.get('https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fwww.taobao.com%2F')
    firefox_login.maximize_window()  # maximize the window; optional
    return firefox_login


def login(firefox_login):
    # Enter the username and password. On the login page I requested, the
    # input boxes have the ids TPL_username_1 and TPL_password_1.
    firefox_login.find_element_by_id('TPL_username_1').clear()
    firefox_login.find_element_by_id('TPL_username_1').send_keys(u'your_username')  # placeholder
    firefox_login.find_element_by_id('TPL_password_1').clear()
    firefox_login.find_element_by_id('TPL_password_1').send_keys(u'your_password')  # placeholder
    firefox_login.find_element_by_id('J_SubmitStatic').click()
    time.sleep(random.randint(2, 5))  # wait for the login to finish
    firefox_login.find_element_by_id('q').send_keys(u'代码之美')
    firefox_login.find_element_by_class_name('btn-search').click()
    time.sleep(random.randint(2, 5))  # give the result page time to load
    return firefox_login


def ObtainHtml(firefox_login):
    data = firefox_login.page_source
    soup = BeautifulSoup(data, 'lxml')
    comments = soup.find_all("div", class_="ctx-box J_MouseEneterLeave J_IconMoreNew")
    for i in comments:
        temp = []
        Item = i.find_all("div", class_="row row-2 title")  # book title and description
        temp.append(Item[0].text.strip())
        shop = i.find_all("div", class_="row row-3 g-clearfix")
        for j in shop:
            a = j.find_all("span")
            temp.append(a[-1].text)  # shop name
        address = i.find_all('div', class_='location')
        temp.append(address[0].text.strip())  # shop location
        priceandnum = i.find_all("div", class_="row row-1 g-clearfix")
        for m in priceandnum:
            Y = m.find_all('div', class_='price g_price g_price-highlight')
            temp.append(Y[0].text.strip())  # price
            Num = m.find_all('div', class_='deal-cnt')
            temp.append(Num[0].text.strip())  # number of buyers
        Infolist.append(temp)


# Get the number of pages to loop over
def getPageNum(firefox_login):
    data = firefox_login.page_source
    soup = BeautifulSoup(data, 'lxml')
    comments = soup.find_all("div", class_="total")  # element that shows the total page count
    pattern = re.compile(r'[0-9]+')  # match the whole number, not just a single digit
    pageNum = pattern.findall(comments[0].text)  # extract the page count as a string
    pageNum = int(pageNum[0])
    return pageNum  # used as the loop count


# Click "next page" to update the data
def NextPage(firefox_login):
    firefox_login.find_element_by_xpath('//a[@trace="srp_bottom_pagedown"]').click()  # AJAX refresh
    time.sleep(random.randint(2, 5))  # give the new results time to load


if __name__ == '__main__':
    firefox_login = init()
    firefox_login = login(firefox_login)
    Num = getPageNum(firefox_login)
    for i in range(Num):
        ObtainHtml(firefox_login)
        if i < Num - 1:  # do not click "next" after the last page
            NextPage(firefox_login)
    print("Scraping finished")
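The paging loop is easy to get off by one: with N result pages you need N scrapes but only N-1 clicks on "next page", and a loop over range(N-1) that scrapes and then clicks stops before the last page is scraped. Below is a minimal browser-free sketch of that logic. The names `crawl_pages`, `scrape_page` and `go_to_next_page` are hypothetical stand-ins, and the fake pages replace the real browser.

```python
def crawl_pages(page_count, scrape_page, go_to_next_page):
    """Scrape page_count pages, clicking 'next' only between pages."""
    results = []
    for page in range(page_count):
        results.extend(scrape_page())
        if page < page_count - 1:  # no click after the last page
            go_to_next_page()
    return results


# Stand-in for the browser: three fake result pages.
fake_pages = [['a', 'b'], ['c'], ['d', 'e']]
state = {'index': 0, 'clicks': 0}


def scrape_page():
    return fake_pages[state['index']]


def go_to_next_page():
    state['index'] += 1
    state['clicks'] += 1


items = crawl_pages(len(fake_pages), scrape_page, go_to_next_page)
print(items)            # ['a', 'b', 'c', 'd', 'e']
print(state['clicks'])  # 2
```

Passing the scrape and click steps in as functions keeps the loop testable without starting a browser.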