Can't download a web page: urllib.error.HTTPError: HTTP Error 403: Forbidden? - 灵析社区

MastFancy

I want to extract the data from this page:

```python
from urllib.request import urlretrieve
import urllib.request
import random

url = "https://cn.investing.com/indices/hnx-30-components"
opener = urllib.request.build_opener()
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0',
]
# Install a global opener that sends a random browser User-Agent.
opener.addheaders = [('User-Agent', random.choice(ua_list))]
urllib.request.install_opener(opener)
urlretrieve(url, '/tmp/test.html')
```

The request fails even though the same page opens fine in a browser:

```
File "/usr/local/lib/python3.11/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
```

How can I fix this?
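As an aside, you can confirm locally that the User-Agent header really is being set. A minimal sketch using a per-request `urllib.request.Request` instead of a global opener (nothing is sent over the network here, and a browser User-Agent alone does not fix the 403 on this site):

```python
import random
import urllib.request

ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
]
url = "https://cn.investing.com/indices/hnx-30-components"

# Attach the header to a single Request object rather than installing
# a global opener; urllib stores header keys capitalized ('User-agent').
req = urllib.request.Request(url, headers={'User-Agent': random.choice(ua_list)})

# Inspect the header without sending the request.
print(req.get_header('User-agent'))
```

Either way the header reaches the server; the block shown in the traceback is happening at a different layer.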


Ask AI
You are being blocked by TLS-fingerprint anti-bot detection, so rotating the User-Agent alone will not help. You can fetch the page with the curl_cffi library, which impersonates a real browser's TLS fingerprint:

```python
import random
from curl_cffi import requests

ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0',
]
headers = {'User-Agent': random.choice(ua_list)}
url = "https://cn.investing.com/indices/hnx-30-components"

# impersonate="chrome110" makes curl_cffi present Chrome 110's TLS fingerprint.
resp = requests.get(url, headers=headers, impersonate="chrome110")
print(resp.status_code)
with open('temp.html', 'wb') as fw:
    fw.write(resp.content)
```

If the page data is loaded asynchronously, use a browser-automation library such as Selenium instead.