Monitored sites:

1. http://www.0818tuan.com/e/search/result/?searchid=7
2. https://www.txyangmao.cn/search.php?q=xx
3. http://www.zuanke8.com/search.php?mod=forum
4. https://www.dsq.com/search.php?mod=forum

For 0818tuan and txyangmao, requests + BeautifulSoup is enough. The main function is as follows:

import requests
from bs4 import BeautifulSoup

def requesthtml(url, threathold=0):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.163.400 QQBrowser/9.3.7175.400'}
    # fetch the search result page with a browser-like User-Agent
    html = requests.get(url, headers=headers, verify=False).text
    bsObj = BeautifulSoup(html, "html.parser")
    # collect text nodes that match the business keyword filter
    thradTable = bsObj.find_all(text=reexpresslist(BUSSINESSFLITER))
    if len(thradTable) >= threathold:
        return url  # enough matches: report this URL as a hit
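
Neither reexpresslist nor BUSSINESSFLITER is defined in the snippet above. Below is a minimal sketch of what they could look like, assuming BUSSINESSFLITER is a plain list of deal keywords and reexpresslist compiles them into regular expressions that BeautifulSoup's text filter accepts (the keywords here are placeholders):

import re

# Assumed shape: a list of business keywords to watch for (placeholders).
BUSSINESSFLITER = ['keyword1', 'keyword2']

def reexpresslist(keywords):
    # find_all(text=...) accepts a list of compiled patterns and matches a
    # text node if any pattern in the list matches it.
    return [re.compile(k) for k in keywords]

# Check the first static site; requesthtml returns the URL when enough threads match.
hit = requesthtml('http://www.0818tuan.com/e/search/result/?searchid=7', threathold=1)
if hit:
    print('matching threads found on %s' % hit)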

For zuanke8 and dsq, the search URL contains dynamically generated parameters, so Selenium is used to drive the search and capture the generated URL, which is then passed to requesthtml. The Selenium code is shown below.

from selenium import webdriver
import time

def search(self):
    driver = None
    try:
        # chromedriverpath points at the local chromedriver binary
        driver = webdriver.Chrome(chromedriverpath)
        #driver.minimize_window()
        driver.maximize_window()
        driver.delete_all_cookies()
        '''log in to the site via requests and obtain the cookies'''
        driver.get(self.url)
        time.sleep(3)
        # type the keyword into the search box
        driver.find_element_by_xpath(self.searchxpath).send_keys(self.key)
        time.sleep(6)  # without this pause the click located by XPath may hit the wrong element
        driver.find_element_by_xpath(self.clickxpath).click()
        driver.refresh()
        # the forum only generates the real search URL after the click
        afterclick_url = driver.current_url
        print("search url is %s " % afterclick_url)
        ret = requesthtml(afterclick_url, threathold=self.threadhold)
        driver.quit()
        if ret:
            return ret
    except Exception as e:
        print(e)
        if driver:
            driver.quit()  # guard: driver may not exist if Chrome failed to start
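
The class that search() belongs to is not shown in the post, so the snippet below is only a sketch of how it might be wired up for zuanke8. The attribute names mirror the ones the method uses; the class name, both XPath values, and the keyword are placeholders, not taken from the original:

class SiteSearcher:
    def __init__(self, url, searchxpath, clickxpath, key, threadhold=0):
        self.url = url                  # forum search page
        self.searchxpath = searchxpath  # XPath of the keyword input box
        self.clickxpath = clickxpath    # XPath of the search / submit button
        self.key = key                  # keyword typed into the box
        self.threadhold = threadhold    # minimum number of matching threads

SiteSearcher.search = search  # attach the search() method defined above

zuanke8 = SiteSearcher(
    url='http://www.zuanke8.com/search.php?mod=forum',
    searchxpath='//input[@name="srchtxt"]',  # placeholder, inspect the page for the real XPath
    clickxpath='//button[@type="submit"]',   # placeholder
    key='keyword1',
    threadhold=1,
)
print(zuanke8.search())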