Crawling 20 Million Bilibili User Profiles
[Python]
A single-machine, multi-threaded crawler: in roughly 30 hours it scraped the public profile data of 20 million Bilibili users and stored it in a database.
It also powers a web index of users' profile signatures, which may well be the most imaginative collection of one-liners in the Eastern Hemisphere.
Web demo: http://cdxy.me/CI/
Project: https://github.com/Xyntax/POC-T/blob/master/module/spider.py
The script itself is simple and has already been integrated as a module into my multi-threaded framework:
import json

import MySQLdb
import requests


def info():
    # POC-T module hook (module description); unused here.
    pass


def exp():
    # POC-T module hook (exploit mode); unused here.
    pass


def poc(mid_str):
    # Fetch one user's public profile from Bilibili's member-info API.
    url = 'http://space.bilibili.com/ajax/member/GetInfo?mid=' + mid_str
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'
    }
    jscontent = requests.get(url, headers=head, verify=False).content
    jsDict = json.loads(jscontent)
    # Keep only successful responses from users with a non-empty signature.
    if jsDict['status'] and jsDict['data']['sign']:
        jsData = jsDict['data']
        mid = jsData['mid']
        name = jsData['name']
        sign = jsData['sign']
        try:
            conn = MySQLdb.connect(host='localhost', user='root', passwd='',
                                   port=3306, charset='utf8')
            cur = conn.cursor()
            conn.select_db('bilibili')
            cur.execute('INSERT INTO bilibili_user_info VALUES (%s,%s,%s,%s)',
                        [mid, mid, name, sign])
            conn.commit()  # MySQLdb does not autocommit; without this the row is lost
            cur.close()
            conn.close()
            return True
        except MySQLdb.Error:
            pass
    return False
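The INSERT above writes four values per row (the mid twice, then name and sign), but the table definition is never shown. A minimal sketch of a schema that would accept those rows follows; the column names, types, and the CREATE DATABASE step are my assumptions, not taken from the original project.

import MySQLdb

# Hypothetical schema matching the 4-column INSERT in poc(); names and types are assumed.
conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, charset='utf8')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS bilibili DEFAULT CHARACTER SET utf8')
conn.select_db('bilibili')
cur.execute("""
    CREATE TABLE IF NOT EXISTS bilibili_user_info (
        id   BIGINT NOT NULL,   -- first placeholder: the mid again (assumed primary key)
        mid  BIGINT NOT NULL,   -- second placeholder: user id
        name VARCHAR(64),       -- third placeholder: display name
        sign VARCHAR(256),      -- fourth placeholder: profile signature
        PRIMARY KEY (id)
    ) DEFAULT CHARSET=utf8
""")
cur.close()
conn.close()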
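Inside POC-T the threading, target generation, and result collection are handled by the framework, so the module only needs to expose poc(). To exercise the module standalone, a minimal driver could look like the sketch below; the ThreadPoolExecutor loop, the worker count, and the mid range are my own stand-ins, not how POC-T actually schedules its tasks.

from concurrent.futures import ThreadPoolExecutor


def crawl(start_mid, end_mid, workers=50):
    # Stand-in driver: feed a range of user ids to poc() from a thread pool.
    # POC-T's real scheduler and task queue are not reproduced here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(poc, (str(mid) for mid in range(start_mid, end_mid))))
    return sum(1 for ok in results if ok)  # rows actually written to the database


if __name__ == '__main__':
    print(crawl(1, 1000))  # crawl mids 1..999 with 50 worker threads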