老詹啊老詹

My Scrapy spider never gets any data. How do I fix this?

Help: my Scrapy crawl keeps failing to scrape anything. I've been debugging for a long time and still can't find the problem.

Goal: scrape the basic information of the major scenic spots of one city on the Xinxin Travel site (cncn.com).

Here are my spider and item code.

spider:

```python
from scrapy import Request
from scrapy.spiders import Spider
from XXtourism.items import XxtourismItem


class TourismSpider(Spider):
    name = "tourism"

    # initial request
    def start_requests(self):
        url = "https://tianjin.cncn.com/jingdian/"
        yield Request(url, dont_filter=True)

    # parse callback for the listing page
    def parse(self, response, *args, **kwargs):
        spots_list = response.xpath('//div[@class="city_spots_list"]/ul/li')
        for i in spots_list:
            try:
                # spot name
                name = i.xpath('./a/div[@class="title"]/b/text()').extract_first()
                # spot introduction
                introduce = i.xpath('./div[@class="text_con"]/p/text()').extract_first()

                item = XxtourismItem()
                item["name"] = name
                item["introduce"] = introduce

                # build the detail-page request
                url = i.xpath("./a/@href").extract_first()
                yield Request(url, meta={"item": item}, callback=self.pif_parse, dont_filter=True)
            except:
                pass

    def pif_parse(self, response):
        try:
            address = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
            time = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
            ticket = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
            response.find_element_by_xpath("//div[@class='type']/dl[3]//dd/a/text()")
            type = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first()
            if type:
                type = type
            else:
                type = ' '

            item = response.meta["item"]
            item["address"] = address
            item["time"] = time
            item["ticket"] = ticket
            item["type"] = type

            yield item

            # url = response.xpath("//div[@class='spots_info']/div[@class='type']/div[@class='introduce']/dd/a/@href").extract_first()
            # yield Request(url, meta={"item": item}, callback=self.fin_parse)
        except:
            type = ' '

    # def fin_parse(self, response):
    #     try:
    #         traffic = response.xpath("//div[@class='type']/div[@class='top']/div[3]/text()").extract()
    #         item = response.meta["item"]
    #         item["traffic"] = traffic
    #         yield item
    #     except:
    #         pass
```

item:

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class XxtourismItem(scrapy.Item):
    # define the fields for your item here like:
    # spot name
    name = scrapy.Field()
    # spot address
    address = scrapy.Field()
    # spot introduction
    introduce = scrapy.Field()
    # spot type
    type = scrapy.Field()
    # opening hours
    time = scrapy.Field()
    # ticket overview
    ticket = scrapy.Field()
    # transportation overview
    traffic = scrapy.Field()
```
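One note on the spider as written: both callbacks wrap their whole body in a bare `except` that discards the error, so if anything inside the `try` raises, the item is dropped without a trace. In particular, `find_element_by_xpath()` is a Selenium WebDriver method; a Scrapy `Response` has no such attribute, so that line raises `AttributeError` on every detail page, and the bare `except` hides it. Below is a sketch of `pif_parse` with the error handling replaced by logging (same XPaths and field names as above; not tested against the live site):

```python
# Sketch of pif_parse with visible error handling (a debugging rewrite,
# same XPaths and field names; not verified against the live site).
def pif_parse(self, response):
    try:
        address = response.xpath("//div[@class='type']/dl[1]/dd/text()").extract_first()
        time = response.xpath("//div[@class='type']/dl[4]/dd/p/text()").extract_first()
        ticket = response.xpath("//div[@class='type']/dl[5]/dd/p/text()").extract_first()
        # The stray response.find_element_by_xpath(...) call is removed here:
        # it is a Selenium API, not Scrapy, and raises AttributeError.
        type = response.xpath("//div[@class='type']/dl[3]//dd/a/text()").extract_first() or ' '

        item = response.meta["item"]
        item["address"] = address
        item["time"] = time
        item["ticket"] = ticket
        item["type"] = type
        yield item
    except Exception:
        # Log the full traceback instead of silently dropping the item.
        self.logger.exception("pif_parse failed for %s", response.url)
```

If that `AttributeError` is indeed what is being swallowed, it alone would explain `scraped 0 items` even though every detail page is crawled with a 200.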
Here is the run log:

```
PS D:\Python\XXtourism\XXtourism> scrapy crawl tourism -o tourism.csv
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: XXtourism)
2023-12-20 18:16:56 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.0.12 24 Oct 2023), cryptography 41.0.7, Platform Windows-10-10.0.19045-SP0
2023-12-20 18:16:56 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'XXtourism',
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 3,
 'NEWSPIDER_MODULE': 'XXtourism.spiders',
 'SPIDER_MODULES': ['XXtourism.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
2023-12-20 18:16:56 [py.warnings] WARNING: D:\Ana\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy. See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)
2023-12-20 18:16:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-12-20 18:16:56 [scrapy.extensions.telnet] INFO: Telnet Password: c388126d14d4b80a
2023-12-20 18:16:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-20 18:16:57 [scrapy.middleware] INFO: Enabled item pipelines: []
2023-12-20 18:16:57 [scrapy.core.engine] INFO: Spider opened
2023-12-20 18:16:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:16:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-20 18:16:57 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:17:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:13 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:17:57 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:17:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:18:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:18:06 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:08 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:11 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:14 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:18 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:22 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:29 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:33 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:36 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:40 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:44 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:47 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:51 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:55 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:18:57 [scrapy.extensions.logstats] INFO: Crawled 16 pages (at 15 pages/min), scraped 0 items (at 0 items/min)
2023-12-20 18:19:00 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:19:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:19:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:19:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2023-12-20 18:19:14 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:19:17 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-20 18:19:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12810,
 'downloader/request_count': 39,
 'downloader/request_method_count/GET': 39,
 'downloader/response_bytes': 151805,
 'downloader/response_count': 39,
 'downloader/response_status_count/200': 20,
 'downloader/response_status_count/301': 19,
 'elapsed_time_seconds': 142.80337,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 12, 20, 10, 19, 20, 220207),
 'httpcompression/response_bytes': 458357,
 'httpcompression/response_count': 20,
 'log_count/DEBUG': 40,
 'log_count/INFO': 12,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 20,
 'scheduler/dequeued': 39,
 'scheduler/dequeued/memory': 39,
 'scheduler/enqueued': 39,
 'scheduler/enqueued/memory': 39,
 'start_time': datetime.datetime(2023, 12, 20, 10, 16, 57, 416837)}
2023-12-20 18:19:20 [scrapy.core.engine] INFO: Spider closed (finished)
```

I followed the teacher's lesson step by step and added a few extra fields of my own (opening each spot's detail page and scraping it there). But the spider never picks up any data at all, and I've tried plenty of fixes for the 301 redirects without solving anything. Please help!
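On the 301s: in the log, every redirected detail request is followed and then crawled with a 200 (19 redirects, 20 successful responses), so `RedirectMiddleware` is handling them and they are probably not what causes the empty output. If the goal is just to avoid the extra round trip, one guess (assuming the listing hrefs are `http://` links that the server upgrades to `https://`, which the log alone cannot confirm) is to normalize the URL inside `parse()` before yielding the request:

```python
# Sketch: normalize the detail-page URL inside parse() before requesting it.
# Assumes the hrefs are http:// links that get 301-upgraded to https:// by
# the server; this is a guess, not verified against the site.
url = i.xpath("./a/@href").extract_first()
if url:
    url = response.urljoin(url)                   # resolve a relative href
    url = url.replace("http://", "https://", 1)   # skip the http -> https hop
    yield Request(url, meta={"item": item}, callback=self.pif_parse, dont_filter=True)
```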
