Scraping Google Translate

My manager gave me a task: translate a batch of English articles into Chinese, and suggested trying Google Translate. So I gave it a try.

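After repeated testing with the browser developer tools (F12), I found that Google Translate can be scraped by requesting https://translate.google.cn/translate_a/single with a few parameters.
Step 1: request https://translate.google.cn/ to obtain TKK.
Step 2: compute the tk value from TKK and the text, then request https://translate.google.cn/translate_a/single with tk among the parameters to get the translated content.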

The details are as follows:

GET request
https://translate.google.cn/translate_a/single?client=t&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=btn&ssel=3&tsel=0&kc=2&tk=750300.857549&q=test

POST request
https://translate.google.cn/translate_a/single?client=t&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&pc=1&otf=1&ssel=3&tsel=0&kc=1&tk=280278.139719

https://translate.google.cn/translate_a/single
?client=t
&sl=en
&tl=zh-CN
&hl=zh-CN
&dt=at
&dt=bd
&dt=ex
&dt=ld
&dt=md
&dt=qca
&dt=rw
&dt=rm
&dt=ss
&dt=t
&ie=UTF-8
&oe=UTF-8
&pc=1
&otf=1
&ssel=3
&tsel=0
&kc=1
&tk=280278.139719

sl=en: source language is en, meaning the text to translate is English
tl=zh-CN: target ("to") language is zh-CN, meaning the text is translated into Simplified Chinese

ie=UTF-8: input encoding, the submitted text is encoded as UTF-8
oe=UTF-8: output encoding, the translated text is returned as UTF-8

tk is computed in JavaScript from TKK and the input text
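
The query string described above can be assembled programmatically instead of by hand. The following is a minimal sketch, not the site's own code: the function name buildTranslateUrl and its option names are hypothetical, and the fixed values (source=btn, ssel=3, tsel=0, kc=2) are simply copied from the GET example above without knowing what they mean.

```javascript
// Hypothetical helper: rebuilds the GET URL shown above from its parts.
function buildTranslateUrl({ text, tk, sl = "en", tl = "zh-CN", hl = "zh-CN" }) {
  const params = new URLSearchParams();
  params.append("client", "t");
  params.append("sl", sl); // source language
  params.append("tl", tl); // target language
  params.append("hl", hl); // copied from the example URL
  // dt is repeated once per value, exactly as in the example URL.
  for (const dt of ["at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t"]) {
    params.append("dt", dt);
  }
  params.append("ie", "UTF-8"); // input encoding
  params.append("oe", "UTF-8"); // output encoding
  // The next four values are copied verbatim from the GET example.
  params.append("source", "btn");
  params.append("ssel", "3");
  params.append("tsel", "0");
  params.append("kc", "2");
  params.append("tk", tk);
  params.append("q", text); // URLSearchParams percent-encodes the text
  return "https://translate.google.cn/translate_a/single?" + params.toString();
}

console.log(buildTranslateUrl({ text: "test", tk: "750300.857549" }));
```

URLSearchParams handles the escaping of q, so text containing spaces, & or non-ASCII characters does not break the query string.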

(1) Obtaining TKK

Request https://translate.google.cn/ and search the response body for TKK. You will find a line like the one below; the TKK value is different on every request.

TKK=eval('((function(){var a\x3d3412621750;var b\x3d-891074424;return 419654+\x27.\x27+(a+b)})())');

\x3d is =
\x27 is '

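With the escapes decoded and the eval wrapper removed, the line above is equivalent to
TKK=(function(){var a=3412621750;var b=-891074424;return 419654+'.'+(a+b)})();
In this example TKK evaluates to "419654.2521547326".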
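
Rather than calling eval on text pulled from a web page, the numbers can be extracted with regular expressions and added directly. The sketch below assumes the page embeds TKK in exactly the form shown above; the function name extractTkk is hypothetical, and the patterns would need adjusting if the markup differs.

```javascript
// Hypothetical helper: returns the TKK string, or null if the pattern is not found.
function extractTkk(html) {
  // Grab the single-quoted argument passed to eval(...).
  const evalMatch = html.match(/TKK=eval\('([^']*)'\)/);
  if (!evalMatch) return null;
  // Decode \xNN escapes such as \x3d (=) and \x27 (').
  const decoded = evalMatch[1].replace(/\\x([0-9a-fA-F]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
  // Pull out a, b and the integer prefix that precedes the dot.
  const parts = decoded.match(/var a=(-?\d+);var b=(-?\d+);return (\d+)\+/);
  if (!parts) return null;
  const sum = Number(parts[1]) + Number(parts[2]);
  return parts[3] + "." + sum;
}

// String.raw keeps the backslashes literal, as they appear in the page source.
const sample = String.raw`TKK=eval('((function(){var a\x3d3412621750;var b\x3d-891074424;return 419654+\x27.\x27+(a+b)})())');`;
console.log(extractTkk(sample)); // 419654.2521547326
```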

(2) Computing the tk value

Compute tk from TKK and the text to translate:

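// Applies a sequence of 3-character operations (e.g. "+-a", "^+6") to a 32-bit value:
// char 0 selects add ("+") or XOR, char 1 selects unsigned right shift ("+") or left shift,
// char 2 is the shift amount as a hex digit.
var b = function (a, b) {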
    for (var d = 0; d < b.length - 2; d += 3) {
        var c = b.charAt(d + 2),
            c = "a" <= c ? c.charCodeAt(0) - 87 : Number(c),
            c = "+" == b.charAt(d + 1) ? a >>> c : a << c;
        a = "+" == b.charAt(d) ? a + c & 4294967295 : a ^ c
    }
    return a
}

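// Computes tk for the text a: UTF-8 encodes a into g, mixes each byte into the first
// half of TKK with b(), XORs in the second half of TKK, and reduces the result modulo 1,000,000.
var tk = function (a, TKK) {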
    for (var e = TKK.split("."), h = Number(e[0]) || 0, g = [], d = 0, f = 0; f < a.length; f++) {
        var c = a.charCodeAt(f);
        128 > c ? g[d++] = c : (2048 > c ? g[d++] = c >> 6 | 192 : (55296 == (c & 64512) && f + 1 < a.length && 56320 == (a.charCodeAt(f + 1) & 64512) ? (c = 65536 + ((c & 1023) << 10) + (a.charCodeAt(++f) & 1023), g[d++] = c >> 18 | 240, g[d++] = c >> 12 & 63 | 128) : g[d++] = c >> 12 | 224, g[d++] = c >> 6 & 63 | 128), g[d++] = c & 63 | 128)
    }
    a = h;
    for (d = 0; d < g.length; d++) a += g[d], a = b(a, "+-a^+6");
    a = b(a, "+-3^+b+-f");
    a ^= Number(e[1]) || 0;
    0 > a && (a = (a & 2147483647) + 2147483648);
    a %= 1E6;
    return a.toString() + "." + (a ^ h)
}
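
The minified code above is hard to follow, so here is a more readable sketch of the same computation. The names mix and computeTk are my own. It relies on the observation that the long ternary expression is a hand-written UTF-8 encoder, so TextEncoder is swapped in for it; the two should agree for well-formed strings, but a lone surrogate would be encoded differently.

```javascript
// Applies 3-character operations such as "+-a" or "^+6" to a 32-bit value.
function mix(value, ops) {
  for (let i = 0; i < ops.length - 2; i += 3) {
    const shiftChar = ops.charAt(i + 2);
    // "a".."f" map to 10..15, digits map to themselves.
    const shift = shiftChar >= "a" ? shiftChar.charCodeAt(0) - 87 : Number(shiftChar);
    // Middle character: "+" means unsigned right shift, anything else left shift.
    const shifted = ops.charAt(i + 1) === "+" ? value >>> shift : value << shift;
    // First character: "+" means 32-bit add, anything else XOR.
    value = ops.charAt(i) === "+" ? (value + shifted) & 4294967295 : value ^ shifted;
  }
  return value;
}

function computeTk(text, tkk) {
  const parts = tkk.split(".");
  const seed = Number(parts[0]) || 0;
  // TextEncoder replaces the manual UTF-8 encoding of the original code.
  const bytes = new TextEncoder().encode(text);
  let acc = seed;
  for (const byte of bytes) {
    acc += byte;
    acc = mix(acc, "+-a^+6");
  }
  acc = mix(acc, "+-3^+b+-f");
  acc ^= Number(parts[1]) || 0;
  // Reinterpret a negative 32-bit result as unsigned.
  if (acc < 0) acc = (acc & 2147483647) + 2147483648;
  acc %= 1e6;
  return acc.toString() + "." + (acc ^ seed);
}

console.log(computeTk("test", "419654.2521547326"));
```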

Once you have tk, you can scrape Google Translate.
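
With tk in the URL, fetching and reading a translation might look like the sketch below. The response format is not shown in this post, so the shape handled here is an assumption: a JSON array whose first element is a list of segments, each an array with the translated text at index 0. The names joinSegments and fetchTranslation are hypothetical, and fetchImpl is a parameter so that any fetch-compatible function can be passed in.

```javascript
// Assumed response shape: [[["translated", "original", ...], ...], ...].
function joinSegments(data) {
  const segments = Array.isArray(data) && Array.isArray(data[0]) ? data[0] : [];
  return segments
    // Skip entries whose first element is not text (assumed to be metadata rows).
    .map((segment) => (Array.isArray(segment) && typeof segment[0] === "string" ? segment[0] : ""))
    .join("");
}

// url is the full translate_a/single URL including tk and q.
async function fetchTranslation(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // For example, HTTP 413 when the text is too long.
    throw new Error("Request failed with HTTP " + response.status);
  }
  return joinSegments(await response.json());
}
```

Checking response.ok before parsing matters here because error responses are HTML pages, not JSON.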

(3) Errors you may encounter

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 413 (Request Entity Too Large)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>413.</b> <ins>That’s an error.</ins>
  <p>Your client issued a request that was too large.
  <ins>That’s all we know.</ins>
</html>

The request is too large: the text exceeds 5,000 characters.
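
One way to stay under that limit is to split long articles into chunks before translating them. The following is a minimal sketch under two assumptions: the limit is 5,000 characters measured as JavaScript string length, and breaking after ".", "!", "?" or a newline is an acceptable place to split. The function name splitIntoChunks is hypothetical.

```javascript
// Splits text into chunks of at most maxLength characters, preferring
// sentence boundaries. Joining the chunks reproduces the original text.
function splitIntoChunks(text, maxLength = 5000) {
  const chunks = [];
  let current = "";
  // The lookbehind splits after sentence-ending punctuation or a newline
  // without removing any characters.
  for (const piece of text.split(/(?<=[.!?\n])/)) {
    let rest = piece;
    // A single piece longer than the limit is cut at fixed positions.
    // This may split a word (or a surrogate pair) in the middle.
    while (rest.length > maxLength) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxLength));
      rest = rest.slice(maxLength);
    }
    if (current.length + rest.length > maxLength) {
      chunks.push(current);
      current = "";
    }
    current += rest;
  }
  if (current) chunks.push(current);
  return chunks;
}

const article = "Hello world. ".repeat(1000); // 13,000 characters
const pieces = splitIntoChunks(article);
console.log(pieces.map((piece) => piece.length));
```

Each chunk then needs its own tk value, since tk is computed from the text being sent.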

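Overall, Google was fairly forgiving: I scraped tens of thousands of entries and my IP was never blocked.
Put another way, if Google had blocked my IP, I might have resorted to proxies and evasion strategies and made hundreds of thousands of requests to translate those same tens of thousands of entries, which would have put more load on its servers.
I really admire that.
After all, why should programmers make life hard for other programmers?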
