How to dynamically scrape data generated by JavaScript with Python/JS

Writing the Ultimate Crawler with python/casperjs: scraping client-side apps
With the rise of the mobile internet, writing for the web today is nothing like it was when I first started writing crawlers three years ago. Thanks especially to the efforts of Node and the JavaScript/Ruby communities, work that used to be done "server side" has been steadily moving into the "browser"; in the most extreme cases, the era of writing web programs with no front-end/back-end split at all has already arrived...
Google has always been the most aggressive here. Look across Google's current product line: Google Plus for social, Google Analytics for site analytics, the Google AdWords that Google lives off. If you download the page source and parse it with ElementTree, you get nothing at all, because Google's data arrives as obfuscated payloads over Ajax calls and is only parsed and rendered onto the page by JavaScript.
Normally this would be tolerable, but recently, for business reasons, I have frequently needed to use Google's Keyword Tool to analyze the search volume of particular keywords.
[Figure: screenshot of a keyword search]
[Figure: Google's obfuscated Ajax response]
Automating this chore is not actually hard, because Google also provides an API for it (that is what the AdWords API project is for). The catch is that calling Google's API costs money, so if crawling can produce the same result, that extra expense can be avoided.
2. Selenium WebDriver
Parsing and executing complex JavaScript requires a full-stack browser JavaScript environment. Three years ago there was essentially only one thing to turn to: Selenium, a multi-language browser driver whose greatest strength is a uniform way to control many different browsers from the command line, which makes automated cross-browser compatibility testing of web products much easier.
2.1 Installing and using Selenium on a server with no graphical interface
Installing Selenium itself is trivial: pip install selenium. Getting Firefox/Chrome to work without a display, however, takes some extra effort.
The recommended approach is:
apt-get install xvfb
Xvfb :99 -ac -screen 0 1024x768x24 &
export DISPLAY=:99
Selenium's installation and configuration needs no more words here. One thing worth noting: Ubuntu users who want to use Chrome must download chromedriver separately, and symlink the installed chromium-browser to /usr/bin/google-chrome, or nothing will run.
2.2 Scraping keywords
First, a quick summary of how AdWords is used. You need a Google Account with AdWords enabled, which is not hard: visit the signup page and Google will help you create an account if you do not have one yet. After that comes logging in.
Inspecting the login page, we can see that the email goes into the input box with id Email, the password into the password box with id Passwd, and clicking Sign in submits the only form on the page.
First of all, do not forget to open a browser:
from selenium import webdriver
driver = webdriver.Firefox()
driver.find_element_by_id("Email").send_keys(email)
driver.find_element_by_id("Passwd").send_keys(passwd)
driver.find_element_by_id('signIn').submit()
After logging in, we find we have to visit a URL of the form /o/Targeting/Explorer to reach the keyword tool, so we construct that URL by hand:
import re

search = re.compile(r'(\?[^#]*)#').search(driver.current_url).group(1)
kwurl = '/o/Targeting/Explorer' + search + '&__o=cues&ideaRequestType=KEYWORD_IDEAS'
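To see what that regular expression actually extracts, here is a standalone sketch; the URL below is made up for illustration (the real post-login URL carries account and session parameters):

```python
import re

# Hypothetical post-login URL; the real one carries session-specific parameters
url = "https://adwords.google.com/o/Overview?authuser=0&__u=123#ka"

# Grab everything from the '?' up to (but not including) the '#'
search = re.compile(r'(\?[^#]*)#').search(url).group(1)
kwurl = '/o/Targeting/Explorer' + search + '&__o=cues&ideaRequestType=KEYWORD_IDEAS'
print(kwurl)
```

The captured query string keeps the session parameters valid when they are stitched onto the new path.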
Once on the tool's main page, things get tricky. The whole keyword tool is a client-side app: even after every file has loaded, the page is not rendered right away; it only displays completely after heavy JavaScript computation. Selenium WebDriver has no idea about any of this, so we have to tell it.
Here we wait until the Search button appears in the browser before treating the page as loaded. Selenium WebDriver offers two ways to do this; I took the lazy route and used the global implicit wait:
driver.implicitly_wait(30)
Selenium will then wait up to 30 seconds automatically whenever it cannot find a page element.
Next, once the input box and the Search button have appeared, we submit a search for the keyword iphone:
driver.find_element_by_class_name("sEAB").send_keys("iphone")
driver.find_element_by_css_selector("button.gwt-Button").click()
Then we keep waiting for the table with class sLNB to appear, and parse the result:
result = {}
texts = driver.find_elements_by_xpath('//table[@class="sLNB"]')[0].text.split()
for i in range(1, len(texts) // 4):
    result[texts[i * 4]] = (texts[i * 4 + 2], texts[i * 4 + 3])
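The stride-of-4 indexing can be checked offline. Here is a minimal sketch with made-up table text (the column names and numbers are invented; the real table's flattened text follows the same 4-tokens-per-row shape, which also means this scheme assumes single-word keywords):

```python
# Hypothetical flattened text of the sLNB table: a 4-column header row,
# then one (keyword, competition, global searches, local searches) row each
texts = ("Keyword Competition Global Local "
         "iphone High 24900000 5000000 "
         "ipad Medium 13600000 2740000").split()

result = {}
# skip i=0 (the header row); each row occupies 4 consecutive tokens
for i in range(1, len(texts) // 4):
    result[texts[i * 4]] = (texts[i * 4 + 2], texts[i * 4 + 3])

print(result)
```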
Here we used XPath to pick out page features; consider it part of the standard crawler toolkit.
For the complete example, see:
Replace email and passwd and it runs as-is.
3. Headless JavaScript solutions
With the evolution of Node, and of the JavaScript community along with it, we are much luckier today. On the older end there is phantomjs, a headless WebKit driver, meaning it can fully emulate Chrome/Safari operations with no GUI at all. More recently there are casperjs (a convenient wrapper over phantomjs) and zombie (whose advantage over phantomjs is that it integrates with Node).
Regrettably, zombiejs seems to have problems with JavaScript-heavy sites, so in the end I could only test with casperjs. Because a headless setup does not render a GUI, it runs about three times faster than the Selenium approach.
Also, since this is a pure JavaScript solution, we can write the browser-controlling code directly in, say, Chrome's console, with none of the semantic translation Selenium requires; it is very clean and direct. For instance, querySelector from the W3C Selectors API Level 1 quickly selects elements, submits forms, clicks buttons, and we can even run custom JavaScript to drive the page by whatever rules we like.
The weakness of casperjs, or really of phantomjs, is that beyond file I/O and browser operations it supports none of the usual *nix IPC tricks; sockets and the like are all unsupported. Version 1.4 added a webserver for talking to the outside world, but communicating through an HTTP server? I am somewhat reluctant, frankly.
Enough talk. The casperjs code looks like this. Logging in:
var casper = require('casper').create({verbose: true, logLevel: "debug"});
casper.start('');
casper.thenEvaluate(function login(email, passwd) {
    document.querySelector('#Email').setAttribute('value', email);
    document.querySelector('#Passwd').setAttribute('value', passwd);
    document.querySelector('form').submit();
}, {email: email, passwd: passwd});
casper.waitForSelector(".aw-cues-item", function() {
    kwurl = this.evaluate(function() {
        var search = document.location.search;
        return '/o/Targeting/Explorer' + search + '&__o=cues&ideaRequestType=KEYWORD_IDEAS';
    });
});
casper.run();
As with Selenium, because the page is all Ajax calls, we must explicitly "wait for a given element to appear", which is exactly waitForSelector. casperjs's documentation is concise and handsome; it is worth browsing often.
One thing worth mentioning: casperjs requires a call to casper.run(). The earlier start, then and friends only queue steps into this._steps; nothing really executes until run. This makes flow design in casperjs rather painful: for/each-style constructs often do not work well.
At that point you need the recursive style common in JavaScript programming; see the example at /n1k0/casperjs/blob/master/samples/dynamic.js. My complete casperjs code does the same.
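The queue-then-run model is easy to misread, so here is a toy Python miniature of it (class and method names are mine, invented for illustration): then() only records a callback, and nothing executes until run():

```python
class MiniCasper:
    """Toy model of casperjs's deferred-step queue; all names hypothetical."""
    def __init__(self):
        self._steps = []   # then()/start() only queue callbacks here
        self.log = []

    def then(self, fn):
        self._steps.append(fn)
        return self        # chainable, like casper.then(...)

    def run(self):
        # Only now do the queued steps execute. Popping from the front (rather
        # than iterating a snapshot) lets a running step queue further steps,
        # which is why the recursive style works in casperjs.
        while self._steps:
            self._steps.pop(0)(self)

casper = MiniCasper()
casper.then(lambda c: c.log.append("login")).then(lambda c: c.log.append("search"))
# log is still empty here; the steps run only inside run()
casper.run()
print(casper.log)
```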
The concrete logic mirrors the Selenium version, so I will not repeat it; for the complete example see:
This covered two ways of writing the "ultimate crawler", Selenium and casperjs. Honestly, though, I mostly wrote this because the blog had gone too long without an update :)
Source:
Fetching JS-generated content in Python with Selenium: how can it be made faster?
As shown below, wait_class fetches the HTML once an element with class=elem_name has appeared. During testing, the driver.get(url) step takes a long and unstable time, anywhere from 5 seconds to a minute. Watching the browser, the EC condition is already satisfied while driver.get(url) still has not returned. Is there any way to cap .get's time (for example, a timeout that stops loading)? Thanks.

import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
def wait_class(elem_name, url):
    # remote driver
    t0 = time.time()
    driver = webdriver.Remote(command_executor='http://127.0.0.1:4444/wd/hub',
                              desired_capabilities=DesiredCapabilities.CHROME)
    # navigate to the page at the given URL
    driver.get(url)
    t1 = time.time() - t0
    print("page loaded, {0} secs passed".format(str(t1)))
    # wait until the element is present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, elem_name)))
    t2 = time.time() - t0
    print("element found, {0} secs passed".format(str(t2)))
    # save the loaded html
    saved = driver.page_source
    t3 = time.time() - t0
    print("html saved, {0} secs passed".format(str(t3)))
    driver.quit()
    return saved
Besides disabling image loading, you can of course set timeouts. On top of those two timeout settings, you can also try the implicitly_wait method.
I find the time the get method takes to load a page unbearable; is there really no way to optimize it?
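One way to cap get() is a sketch along these lines (the helper name and the 15-second value are mine; it assumes a Selenium driver like the webdriver.Remote above). set_page_load_timeout makes get() raise once the limit passes, after which window.stop() abandons any outstanding loads and we work with whatever has arrived:

```python
def fetch_with_timeout(driver, url, timeout=15):
    # Cap how long get() may block; past this, Selenium raises an exception
    # (selenium.common.exceptions.TimeoutException in practice)
    driver.set_page_load_timeout(timeout)
    try:
        driver.get(url)
    except Exception:
        # give up on pending resources and use what the browser already has
        driver.execute_script("window.stop();")
    return driver.page_source
```

This pairs naturally with the WebDriverWait call in wait_class: a short page-load cap plus an explicit wait for the element you actually need.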