Scraping Tianyancha

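Because Tianyancha's anti-scraping measures are fairly strict, this post only describes the approach, and some details are not published. For reference only.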

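I needed about ten thousand company records, so I considered scraping some from Tianyancha, Qichacha, or Qixinbao. Tianyancha looked like the best of the three, so I decided to scrape from it.
It turned out that Tianyancha was redesigned in 2017-06. Since the redesign the whole site uses https, and the anti-scraping measures have become stricter.
If money is not a concern, just use the API that Tianyancha provides: https://open.tianyancha.com/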

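Scraping turned out to be less simple than I had expected. I searched online and found that the code in blog posts written before 2017-06 no longer works, and there is very little material from after 2017-06. So I had to work it out myself.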

On 2018-03-01 the site was redesigned again. I think the colors look worse than before, but some features were added.

If you need a few tens of thousands of records for a graduation thesis, you can simply ask me for them and save yourself the time. Email: weikeqin.cn@gmail.com

[Screenshots: web page data 1 and 2; stored database records 1 and 2]

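Getting data from Tianyancha has two parts:
the first is collecting index information in bulk;
the second is collecting detailed information for each company.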

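Currently each city exposes 5 pages of search results with 20 companies per page, and there are roughly 362 cities. That works out to 362 × 5 × 20 = 36200 records, so even half of that would be enough.
Right, let's get started.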

https://bj.tianyancha.com/search/p1

https://bj.tianyancha.com/search/p5
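
The two URLs above follow a simple pattern: a city prefix as the subdomain and a page number from 1 to 5. A minimal sketch of enumerating them could look like this. The helper name `search_urls` is made up for illustration, only the `bj` prefix appears in this post, and the full list of city prefixes would have to be collected separately.

```python
PAGES_PER_CITY = 5        # pages visible per city, as stated above
COMPANIES_PER_PAGE = 20   # companies listed on each page
CITY_COUNT = 362          # approximate number of cities, as stated above


def search_urls(city_prefixes, pages_per_city=PAGES_PER_CITY):
    """Yield every search-result URL for the given city prefixes."""
    for prefix in city_prefixes:
        for page in range(1, pages_per_city + 1):
            yield f"https://{prefix}.tianyancha.com/search/p{page}"


# Only "bj" is taken from the post; other prefixes are not listed here.
urls = list(search_urls(["bj"]))
print(urls[0])
print(urls[-1])

# Upper bound on the number of records reachable this way.
print(CITY_COUNT * PAGES_PER_CITY * COMPANIES_PER_PAGE)
```

With all 362 prefixes this produces 1810 URLs, one per result page.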

The Cookie mainly contains the following parameters:

TYCID  
undefined  
ssuid  
auth_token  
tyc-user-info  
RTYCID  
aliyungf_tc  
csrfToken  
OA  
_csrf  
_csrf_bk
Hm_lvt_e92c8d65d92d534b0fc290df538b4758  
Hm_lpvt_e92c8d65d92d534b0fc290df538b4758

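After logging in you obtain several of these parameters, you simulate one of them, and a few more are obtained dynamically while scraping. With that the set is complete and scraping can begin.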

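However, after about 100 pages the anti-scraping protection kicks in. To keep going there are three options: log in as a different user, switch to a different IP, or solve the CAPTCHA.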

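Since I did not need much data, I first tried proxies and then tried registering several accounts.
Registering several accounts worked better. I scraped all 1810 pages, and after parsing I had 32158 company records.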
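
As a rough illustration of the arithmetic behind the multi-account approach: 1810 pages at about 100 pages per account needs 19 accounts. The sketch below splits the page jobs into per-account batches. It assumes the limit is exactly 100 pages per account, which the post only states approximately; the function name `plan_batches` and the account labels are hypothetical, and the page numbers here are global job indices rather than the per-city p1 to p5 numbers.

```python
import math


def plan_batches(total_pages, accounts, pages_per_account=100):
    """Assign consecutive page jobs to accounts, at most pages_per_account each."""
    needed = math.ceil(total_pages / pages_per_account)
    if len(accounts) < needed:
        raise ValueError(f"need {needed} accounts, got {len(accounts)}")
    plan = []
    for i in range(needed):
        first = i * pages_per_account + 1
        last = min(first + pages_per_account - 1, total_pages)
        plan.append((accounts[i], first, last))
    return plan


# Hypothetical account labels; real credentials are outside this sketch.
accounts = [f"account-{n}" for n in range(1, 20)]
plan = plan_batches(1810, accounts)
print(len(plan))
print(plan[-1])
```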

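To scrape the full data set without logging in, you have to study the encryption algorithm that is scattered through tens of thousands of lines of code.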

The Cookie contains two strings, token and utm. The sections below cover how each of them is derived.

Obtaining the token

https://www.tianyancha.com/tongji/3871135.json?_=15100445xxxxx
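The token is hidden in the response of https://www.tianyancha.com/tongji/ + company ID + .json
The response contains a string of numbers, which we can fetch directly with code: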

import requests

# Fetch the tongji JSON for one company ID
url = "http://www.tianyancha.com/tongji/216908186.json"
headers = {
    "Accept": "application/json, text/plain, */*"
}
response = requests.get(url, headers=headers)
print(response.text)



{"state":"ok","message":"","data":{"name":"216908186","uv":740581,"pv":138691,"v":"33,102,117,110,99,116,105,111,110,40,110,41,123,100,111,99,117,109,101,110,116,46,99,111,111,107,105,101,61,39,116,111,107,101,110,61,49,101,101,55,97,54,98,101,48,102,57,98,52,48,54,56,56,98,97,99,97,97,55,99,48,101,49,98,53,100,99,102,59,112,97,116,104,61,47,59,39,59,110,46,119,116,102,61,102,117,110,99,116,105,111,110,40,41,123,114,101,116,117,114,110,39,55,44,51,50,44,51,52,44,49,52,44,49,52,44,51,52,44,51,52,44,50,57,44,51,44,55,44,49,44,50,57,44,50,57,44,51,44,49,52,44,49,56,44,49,56,44,49,51,44,49,51,44,51,52,44,51,50,44,50,57,44,49,57,44,50,55,44,55,44,48,44,52,44,52,44,49,51,44,48,44,52,44,51,39,125,125,40,119,105,110,100,111,119,41,59"}}

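What is the string of numbers under v in the response above for?
Look closely and you will see that none of the numbers exceeds 130, which looks a lot like ASCII codes.
Decoding it with the following code yields a string like the one shown after it.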

def strfromcode(strcode):
    # Turn a comma-separated list of character codes into a string
    return "".join(chr(int(code)) for code in strcode.split(","))

Sample decoded output (taken from a different response than the JSON shown above, so the token and return values differ):
!function(n){document.cookie='token=8cdd0625160146c1909dda40448e7c69;path=/;';n.wtf=function(){return'28,7,4,28,3,1,31,7,7,32,28,34,29,19,29,14,18,30,28,31,4,29,34,7,30,13,4,0,1,31,4,18'}}(window);

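The decoded script clearly contains the token field we need, and it can be extracted with code. Besides the token there is also a list of numbers after return, which is closely tied to the utm value discussed next.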

That completes obtaining the token.
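
A minimal sketch of the decode and extract steps, run offline against the decoded sample shown earlier. The function names and regular expressions are illustrative choices, not the site's or the original author's code. To avoid depending on a live response, the sample script is first encoded to character codes and then decoded back.

```python
import re


def decode_char_codes(codes):
    """Turn "33,102,117,..." into the JavaScript snippet it encodes."""
    return "".join(chr(int(code)) for code in codes.split(","))


def extract_token_and_indices(script):
    """Return (token, index list) from the decoded snippet."""
    token_match = re.search(r"token=([0-9a-f]+);", script)
    indices_match = re.search(r"return'([0-9,]+)'", script)
    if token_match is None or indices_match is None:
        raise ValueError("unexpected script format")
    indices = [int(n) for n in indices_match.group(1).split(",")]
    return token_match.group(1), indices


# Decoded sample copied from this post.
sample_script = (
    "!function(n){document.cookie='token=8cdd0625160146c1909dda40448e7c69;"
    "path=/;';n.wtf=function(){return'28,7,4,28,3,1,31,7,7,32,28,34,29,19,"
    "29,14,18,30,28,31,4,29,34,7,30,13,4,0,1,31,4,18'}}(window);"
)

# Simulate the "v" field by encoding the sample, then decode it again.
encoded = ",".join(str(ord(ch)) for ch in sample_script)
decoded = decode_char_codes(encoded)
token, indices = extract_token_and_indices(decoded)
print(token)
print(indices)
```

The index list is kept as integers because the utm step uses its entries as character positions.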

Obtaining utm

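Obtaining utm requires this link: http://static.tianyancha.com/wap/resources/scripts/app-ce05b92dbf.js
The string this link returns contains many appendChlid fields, which are the key to deriving utm.
Split the text on them to pull out each field, then use regular expressions to remove every character that is not a digit, a lowercase letter, or "-". This gives a list of strings.
Next, take the first character of each element and concatenate the elements that share the same first character. The result is a new list indexed by that character (in the code below, the first character itself is dropped and the list has ten slots, 0 to 9).
Then take the company ID modulo 10 (note that the VBA code below actually uses the ASCII code of the ID's first character, Mod 10);
this remainder is the index into the list.
Finally, use the return list from the token step as character positions in the selected string. The characters at those positions form the utm string.
The code is shown below. Because code for Tianyancha is scarce online and the author is not sure whether publishing it would infringe anyone's rights, the key code has been removed.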

Sub Main()
    'Query company information by the company's ID on Tianyancha
    'Original author: wcymiss
    
    Dim strText As String
    Dim objHttp As Object
    Dim strURL As String
    Dim ID As String
    Dim sgArr() As String
    Dim strToken As String
    Dim strUtm As String
    Dim strV As String
    Dim strCode As String
    Dim Index As Integer
    
    ID = "812498657"
    Set objHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
    
    strURL = "http://www.tianyancha.com/tongji/" & ID & ".json"
    With objHttp
        .Open "GET", strURL, False
        .setRequestHeader "Accept", "application/json, text/plain, */*"
        .Send
        strText = .responsetext
    End With
    strCode = Split(Split(strText, ",""v"":""")(1), """")(0)
    strV = StringFromCode(strCode)
    strToken = Split(Split(strV, "'token=")(1), ";")(0)
    strCode = Split(Split(strV, "return'")(1), "'")(0)

    strURL = "http://static.tianyancha.com/wap/resources/scripts/app-ce05b92dbf.js"
    With objHttp
        .Open "GET", strURL, False
        .Send
        strText = .responsetext
    End With
    sgArr = GetSoGou(strText)
    Index = Asc(Left(ID, 1)) Mod 10
    strUtm = GetUtm(sgArr, Index, strCode)

'    Debug.Print strToken
'    Debug.Print strUtm

    strURL = "http://www.tianyancha.com/company/" & ID & ".json"
    With objHttp
        .Open "GET", strURL, False
        .setRequestHeader "Accept", "application/json, text/plain, */*"
        .setRequestHeader "Cookie", "token=" & strToken & ";_utm=" & strUtm
        .Send
        strText = .responsetext
    End With
    
    Set objHttp = Nothing
    Debug.Print strText
End Sub

Private Function GetSoGou(strText As String) As String()
    Dim arr() As String
    Dim i As Integer
    Dim objReg As Object
    Dim sgArr(0 To 9) As String
    Dim Index As Integer
    
    Set objReg = CreateObject("VBScript.Regexp")
    objReg.Global = True
    
    arr = Split(strText, "appendChlid(")
    For i = 1 To UBound(arr)
        arr(i) = Split(Split(arr(i), ">")(1), "<")(0)
    Next
    objReg.Pattern = "&[^;]*;"
    For i = 1 To UBound(arr)
        arr(i) = objReg.Replace(arr(i), "")
    Next
    objReg.Pattern = "[^0-9a-z-]"
    For i = 1 To UBound(arr)
        arr(i) = objReg.Replace(arr(i), "")
    Next
    Set objReg = Nothing
    
    For i = 1 To UBound(arr)
        If Len(arr(i)) > 1 Then
            Index = Left(arr(i), 1)
            sgArr(Index) = sgArr(Index) & Mid(arr(i), 2)
        End If
    Next
    GetSoGou = sgArr
End Function

Private Function GetUtm(sgArr() As String, Index As Integer, strCode As String) As String
    Dim i As Integer
    Dim arr() As String
    arr = Split(strCode, ",")
    For i = 0 To UBound(arr)
        GetUtm = GetUtm & Mid(sgArr(Index), arr(i) + 1, 1)
    Next
End Function

Private Function StringFromCode(strCode As String) As String
    Dim i As Integer
    Dim arr() As String
    arr = Split(strCode, ",")
    For i = 0 To UBound(arr)
        StringFromCode = StringFromCode & Chr(arr(i))
    Next
End Function
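
For readers who prefer Python, the following is a rough port of the GetSoGou and GetUtm logic above, exercised on a tiny made-up script rather than the real app-ce05b92dbf.js. It mirrors the VBA as written, including the index rule based on the ASCII code of the ID's first character. The function names, the sample input, and the two guards (skipping pieces without ">" and fields that do not start with a digit) are additions of this sketch.

```python
import re


def build_tables(js_text):
    """Group the cleaned appendChlid fields into ten strings by leading digit."""
    tables = [""] * 10
    for piece in js_text.split("appendChlid(")[1:]:
        parts = piece.split(">")
        if len(parts) < 2:
            continue  # guard added in this sketch
        field = parts[1].split("<")[0]
        field = re.sub(r"&[^;]*;", "", field)      # drop HTML entities
        field = re.sub(r"[^0-9a-z-]", "", field)   # keep digits, a-z and "-"
        if len(field) > 1 and field[0].isdigit():
            tables[int(field[0])] += field[1:]
    return tables


def derive_utm(tables, company_id, indices):
    """Pick one table by the ID's first character, then read the given positions."""
    table = tables[ord(company_id[0]) % 10]
    return "".join(table[i] for i in indices)


def cookie_header(token, utm):
    """Build the Cookie header value the same way the VBA code does."""
    return f"token={token};_utm={utm}"


# Made-up input for illustration only; the real script is far larger.
sample_js = 'a.appendChlid("<i>6abc&nbsp;def</i>");b.appendChlid("<i>6-XY12</i>");'
tables = build_tables(sample_js)
utm = derive_utm(tables, "812498657", [0, 2, 6, 8])
print(tables[6])
print(utm)
print(cookie_header("example-token", utm))
```

The ID "812498657" starts with "8", whose ASCII code is 56, so table 6 is selected in this example.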

1 Tianyancha scraper projects on GitHub
https://github.com/guapier/tianyancha (keywords: phantomjs, xpath)
https://github.com/felixglow/Tianyancha (keywords: scrapy)
https://github.com/haijunt/tianyancha_example (keywords: scrapy, splash)
https://github.com/kestiny/PythonCrawler (keywords: phantomjs)

2 Blog posts
https://ask.hellobi.com/blog/jasmine3happy/6200 (keywords: selenium, phantomjs)
http://blog.csdn.net/chlk118/article/details/52937671 (keywords: phantomjs)
https://sanwen.net/a/njbicqo.html (keywords: utm, token)
http://www.bubuko.com/infodetail-1917809.html (keywords: utm, token)
