Python | Suning.com crawler for scraping product information and images

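This project uses Scrapy to scrape the information for any product on the Suning website. The main fields are the product title, current price, original price and store name, and Scrapy's ImagesPipeline is used to download the product images.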
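As a rough sketch of what switching on the stock ImagesPipeline involves, the settings.py fragment below enables it. The storage directory and the field names are assumptions, not values taken from the project. Note that the built-in pipeline reads a list of URLs from the field named by IMAGES_URLS_FIELD, while the spider in this article stores a single image_url string, so the real project probably uses a customised pipeline or a different field mapping.

```python
# settings.py (sketch; the values below are assumptions, not the project's real settings)

# Enable Scrapy's built-in image pipeline (it needs the Pillow package installed).
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are written (hypothetical path).
IMAGES_STORE = "./images"

# Item field that holds a LIST of image URLs, and the field that receives the results.
IMAGES_URLS_FIELD = "image_urls"
IMAGES_RESULT_FIELD = "images"
```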


Part of the main spider code is shown below:

# -*- coding: utf-8 -*-
import scrapy
import time
import re
import json
import jsonpath
import urllib.parse

from Suning.items import SuningItem


class SuningSpider(scrapy.Spider):
    name = 'suning'
    # allowed_domains takes bare domain names, without a path or trailing slash
    allowed_domains = ['suning.com']

    # Ask for the search keyword and build the search URL from it
    keyword = input("Enter a product name: ")
    temp_data = urllib.parse.quote(keyword)
    temp_url = "https://search.suning.com/{}/"
    val_url = temp_url.format(temp_data)
    start_urls = [val_url]

    def __init__(self, name=None, **kwargs):
        super().__init__(name=name, **kwargs)
        self.page_num = 0

    def parse(self, response):
        # content = response.body.decode("utf-8")
        # with open("./file/苏宁.html", "w", encoding="utf-8") as file:
        #     file.write(content)
        li_elements = response.xpath("//div[@id='product-list']/ul[@class='general clearfix']/li")
        # print(len(li_elements))
        for li_element in li_elements:
            # Title: join the non-empty text fragments of the title link
            title_elements = li_element.xpath(
                ".//div[@class='res-info']/div[@class='title-selling-point']/a//text()").extract()
            title_list = []
            for temp_title in title_elements:
                temp_title = re.sub(r"\s", "", temp_title)
                if len(temp_title) > 0:
                    # As printed this replace is a no-op; it presumably swapped one comma
                    # variant for another so that titles stay safe in the CSV output
                    temp_title = temp_title.replace(",", ",")
                    title_list.append(temp_title)
            title = "-".join(title_list)

            store_name = li_element.xpath(
                ".//div[@class='res-info']/div[@class='store-stock']/a/@title").extract_first()
            # print(store_name)
            # print(title)

            # Image URLs are protocol-relative, so prepend the scheme
            temp_image_url = li_element.xpath(
                ".//div[@class='img-block']/a[@class='sellPoint']/img/@src").extract_first()
            image_url = "https:" + temp_image_url
            # print(image_url)

            # The product URL carries the two IDs needed by the price endpoint
            temp_product_url = li_element.xpath(
                ".//div[@class='img-block']/a[@class='sellPoint']/@href").extract_first()
            src_args = re.findall(r"com/(.*?)\.html", temp_product_url)[0]
            key0 = src_args.split("/")[0]
            key1 = src_args.split("/")[-1]
            price_src = "https://pas.suning.com/nspcsale_0_0000000" + key1 + "_0000000" + key1 + "_" + key0 + "_190_755_7550199_500353_1000051_9051_10346_Z001___R9006372_0.91_1___00031F072____0___750.0_2__500363_500519__.html?callback=pcData&_=1630468559926"
            # price_src = "https://pas.suning.com/nspcsale_0_0000000" + key1 + "_0000000" + key1 + "_" + key0 + "_250_029_0290199_20089_1000257_9254_12006_Z001___R1901001_0.5_0___000060864___.html?callback=pcData&_=1630466740130"
            # print(price_src)

            # Pass the fields collected so far to the price callback through meta
            item = {"title": title, "store_name": store_name, "image_url": image_url}
            yield scrapy.Request(price_src, callback=self.get_price, dont_filter=True, meta=item)

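The get_price callback is not part of the excerpt. Because the price URL ends in callback=pcData, the response is presumably JSONP, that is, JSON wrapped in a pcData(...) call. The sketch below shows one way such a wrapper could be removed before decoding. The helper name parse_jsonp and the sample payload with its field names are made up for illustration and do not reflect the real structure of Suning's response.

```python
import json
import re


def parse_jsonp(text, callback="pcData"):
    """Strip a JSONP wrapper such as pcData({...}) and return the decoded JSON."""
    pattern = r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$"
    match = re.match(pattern, text, flags=re.S)
    if match is None:
        raise ValueError("response is not wrapped in {}(...)".format(callback))
    return json.loads(match.group(1))


# Made-up payload for illustration only; the real response is structured differently.
sample = 'pcData({"price": "99.00", "original_price": "129.00"});'
data = parse_jsonp(sample)
print(data["price"], data["original_price"])
```

Inside a Scrapy callback the same helper could be applied to response.text, and the fields passed along by parse (title, store_name, image_url) can be read back from response.meta.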
After crawling, the results are saved as a CSV file.
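The article does not show how the CSV file is written. A minimal sketch using the standard csv module might look like the following; the column names and the sample row are assumptions chosen to match the fields listed at the top of the article.

```python
import csv
import io

# Assumed column names, based on the fields the article says are collected.
FIELDNAMES = ["title", "price", "original_price", "store_name"]


def write_rows(file_obj, rows):
    """Write a header line followed by one CSV line per product dict."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row; the comma in the title shows that the csv module quotes it.
rows = [
    {
        "title": "Example phone, 128GB",
        "price": "99.00",
        "original_price": "129.00",
        "store_name": "Example Store",
    }
]

buffer = io.StringIO()
write_rows(buffer, rows)
print(buffer.getvalue())
```

To write a real file, the buffer can be replaced with open(path, "w", newline="", encoding="utf-8-sig"); the utf-8-sig encoding helps spreadsheet programs display Chinese titles correctly.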

The product images are downloaded as well.

Files included in the project:

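begin.py: Run this file directly to start the crawler. Alternatively, run scrapy crawl suning in a terminal yourself.

proxy.py: Run this file to refresh the proxies in the IP pool.
url = "https://www.kuaidaili.com/free/inha/1/"
Change the number in the URL above to fetch proxy addresses from other pages; 1 means the first page. Copy the output over the contents of the list below. More entries can be added.
ip_list = ['http://129.226.182.125:80', 'http://106.45.104.214:3256']

requestheaderstool.py: If no data comes back, replace the cookie value in this file with the cookie copied from your own browser (preferably one taken after logging in to a Suning account).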
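How the project feeds ip_list into Scrapy is not shown. One common approach, sketched here under that caveat, is a small downloader middleware that sets request.meta["proxy"], which Scrapy's built-in HttpProxyMiddleware honours. The class name RandomProxyMiddleware and the FakeRequest stand-in are hypothetical and exist only so the example runs without Scrapy installed.

```python
import random

# Proxy pool as produced by proxy.py (these two entries are the ones from the article).
ip_list = [
    "http://129.226.182.125:80",
    "http://106.45.104.214:3256",
]


class RandomProxyMiddleware:
    """Downloader-middleware sketch that assigns a random proxy to each request."""

    def __init__(self, proxies=ip_list):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy routes the request through the proxy named in request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request normally


class FakeRequest:
    """Stand-in for scrapy.Request, used only to demonstrate the middleware."""

    def __init__(self):
        self.meta = {}


request = FakeRequest()
RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta["proxy"])
```

In a real project the class would be registered in the DOWNLOADER_MIDDLEWARES setting, for example under a path such as Suning.middlewares.RandomProxyMiddleware (the module path is an assumption).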

Download:
Run the program and enter the name of the product to crawl; it collects the title, price, store name and image of every matching product. The project is available from the CSDN download channel: https://download.csdn.net/download/weixin_45179605/24366299
