Monitored sites:
1. http://www.0818tuan.com/e/search/result/?searchid=7
2. https://www.txyangmao.cn/search.php?q=xx
3. http://www.zuanke8.com/search.php?mod=forum
4. https://www.dsq.com/search.php?mod=forum
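One possible way to keep these sites in a single config so the scheduler knows which fetch method to use for each (the SITES name and the requests/selenium tags are assumptions that simply mirror the split described below):

# hypothetical site config: (name, search URL, fetch method)
SITES = [
    ("0818tuan",  "http://www.0818tuan.com/e/search/result/?searchid=7", "requests"),
    ("txyangmao", "https://www.txyangmao.cn/search.php?q=xx",            "requests"),
    ("zuanke8",   "http://www.zuanke8.com/search.php?mod=forum",         "selenium"),
    ("dsq",       "https://www.dsq.com/search.php?mod=forum",            "selenium"),
]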
For 0818tuan and txyangmao, plain requests + BeautifulSoup is enough. The main function is as follows:
import requests
from bs4 import BeautifulSoup

def requesthtml(url, threathold=0):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.163.400 QQBrowser/9.3.7175.400'}
    html = requests.get(url, headers=headers, verify=False).text
    bsObj = BeautifulSoup(html, "html.parser")
    # collect the text nodes that match the business keyword filters
    # (reexpresslist and BUSSINESSFLITER are defined elsewhere in the project)
    thradTable = bsObj.find_all(text=reexpresslist(BUSSINESSFLITER))
    # report the URL only when enough matching threads are found
    if len(thradTable) >= threathold:
        return url
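reexpresslist and BUSSINESSFLITER are not shown above; a minimal sketch of what they might look like, assuming BUSSINESSFLITER is a list of deal-related keywords and reexpresslist compiles them into regexes that find_all(text=...) can match against (the keyword values here are hypothetical):

import re

BUSSINESSFLITER = ["京东", "红包"]  # hypothetical keyword list

def reexpresslist(keywords):
    # compile each keyword into a case-insensitive regex; BeautifulSoup's
    # find_all(text=...) accepts a list and matches a text node if any pattern hits
    return [re.compile(k, re.IGNORECASE) for k in keywords]

# example: report the search page only if at least 2 matching threads are found
hit = requesthtml("http://www.0818tuan.com/e/search/result/?searchid=7", threathold=2)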
For zuanke8 and dsq the search URL carries dynamically generated parameters, so Selenium is used to perform the search and capture the generated URL, which is then passed to requesthtml. The Selenium code is shown below, followed by a usage sketch.
# requires the Selenium 3 style API (find_element_by_xpath):
# from selenium import webdriver; import time
# chromedriverpath points to the local chromedriver binary
def search(self):
    driver = None
    try:
        driver = webdriver.Chrome(chromedriverpath)
        # driver.minimize_window()
        driver.maximize_window()
        driver.delete_all_cookies()
        '''log in to the site via requests and obtain the cookies'''
        driver.get(self.url)
        time.sleep(3)
        # type the keyword into the search box
        driver.find_element_by_xpath(self.searchxpath).send_keys(self.key)  # //*[@id="login_field"]
        time.sleep(6)  # if this runs too fast, the xpath-located click hits the wrong element
        driver.find_element_by_xpath(self.clickxpath).click()
        driver.refresh()
        afterclick_url = driver.current_url
        print("search url is %s " % afterclick_url)
        # hand the generated URL back to the static scraper above
        ret = requesthtml(afterclick_url, threathold=self.threadhold)
        driver.quit()
        if ret:
            return ret
    except Exception as e:
        print(e)
        if driver:
            driver.quit()
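search() is a method on a wrapper class that is not shown here; the sketch below assumes a hypothetical class named ZuankeSearch whose constructor simply stores the attributes the method uses (url, searchxpath, clickxpath, key, threadhold). The xpath values and keyword are placeholders, not the real page's selectors.

# hypothetical driver code for the Selenium-backed search
task = ZuankeSearch(
    url="http://www.zuanke8.com/search.php?mod=forum",
    searchxpath='//*[@id="scform_srchtxt"]',  # placeholder xpath of the search box
    clickxpath='//*[@id="scform_submit"]',    # placeholder xpath of the submit button
    key="红包",                                # placeholder search keyword
    threadhold=1,
)
hit = task.search()
if hit:
    print("matching threads found at %s" % hit)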