
Scraping Bilibili Danmaku and Basic Word Cloud Usage

2023-10-02 08:36


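1. Scraping Bilibili danmaku

  1. Inspecting the page shows that the danmaku are loaded through the file www.gsm-guard.net?=cid, so we look up the request URL in that file's request headers.

  2. Open that URL to see all the comments.

  3. Here is the code, so we can walk through it.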

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author tom
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}


# Scraping function
def yitianSpiderf(url):
    res = requests.get(url, headers=headers)
    tree = etree.HTML(res.content)
    # Each danmaku is the text of a <d> element
    comment_list = tree.xpath('//d/text()')
    with open('倚天评论.txt', 'a+', encoding='utf-8') as f:
        for comment in comment_list:
            f.write(comment + '\n')


# Main function. Every video works the same way: once you have its cid,
# you can fetch all of its danmaku.
def main():
    cid = ''
    url = 'https://www.gsm-guard.net/x/v1/dm/www.gsm-guard.net?oid={}'.format(cid)
    yitianSpiderf(url)


if __name__ == '__main__':
    main()
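
To see what the XPath `//d/text()` pulls out, here is a minimal offline sketch. It assumes the response is an XML document whose danmaku sit in `<d>` elements, which is what the XPath in the code above implies. The sample string and the helper name `extract_danmaku` are made up for illustration, and the sketch uses the standard library's `xml.etree.ElementTree` in place of lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a danmaku response (not real data).
SAMPLE_XML = "<i><d>first comment</d><d>second comment</d><other>skip me</other></i>"


def extract_danmaku(xml_text):
    """Return the text of every <d> element, in document order."""
    root = ET.fromstring(xml_text)
    # iter('d') walks the whole tree, much like the XPath //d
    return [d.text for d in root.iter('d') if d.text]


if __name__ == '__main__':
    for line in extract_danmaku(SAMPLE_XML):
        print(line)
```

Running the parsing step against a saved string like this is a quick way to check the selector before sending any requests.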

  4. Word cloud:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author tom
import re
import jieba
from collections import Counter

# Segment the text with jieba
with open('倚天评论.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
jbwords = jieba.cut(txt)

# Load the stop words; a set makes "in" an exact whole-word test
with open('中文停用词列表.txt', 'r', encoding='utf-8') as f1:
    stopwords = set(f1.read().split())

result = []
for word in jbwords:
    # Strip ASCII letters, digits and punctuation, then surrounding whitespace
    word = re.sub(r'[A-Za-z0-9\!?\%\[\]\,\.~]', '', word).strip()
    if word and word not in stopwords:
        result.append(word)

# Statistics
print('======', result, len(result))
print(Counter(result))

# Make the word cloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np

# Specify the font, open the image, and convert it to an array
myfon = r'C:\Windows\Fonts\simkai.ttf'
# img1 = Image.open('狗.jpg')
# graph1 = np.array(img1)
img2 = Image.open('1.png')
graph2 = np.array(img2)
text = '/'.join(result)

# WordCloud object
wc = WordCloud(font_path=myfon, background_color='white', max_font_size=50, max_words=500, mask=graph2)
wc.generate(text)
img_color = ImageColorGenerator(graph2)  # derive color values from the background image
plt.imshow(wc.recolor(color_func=img_color))
plt.axis('off')
plt.show()
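
The filtering and counting step in the middle of the script can be tried on its own, without jieba or the comment file. This is a minimal sketch: a hand-made token list and stop-word set stand in for the real inputs, and `clean_tokens` is a hypothetical helper name. Holding the stop words in a set makes `in` an exact whole-word test.

```python
import re
from collections import Counter

# Same character class as the script above: ASCII letters, digits and some punctuation
ASCII_NOISE = re.compile(r'[A-Za-z0-9\!?\%\[\]\,\.~]')


def clean_tokens(tokens, stopwords):
    """Strip ASCII noise from each token, then drop empties and stop words."""
    kept = []
    for token in tokens:
        token = ASCII_NOISE.sub('', token).strip()
        if token and token not in stopwords:
            kept.append(token)
    return kept


if __name__ == '__main__':
    # Made-up tokens, standing in for the output of jieba.cut
    tokens = ['张无忌', '的', '666', '好看', '张无忌', 'ok!', '\n']
    stopwords = {'的', '了'}
    result = clean_tokens(tokens, stopwords)
    print(Counter(result).most_common(2))
```

The resulting list is what would be joined into a single string and handed to the word cloud.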

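  5. Result:

2. Scraping Bilibili live-stream danmaku

  1. Inspecting the page shows that the live-stream danmaku are delivered in a file named msg.

  2. Sending a POST request to this URL with Postman does indeed return the data.

  3. Code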

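#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author tom
import requests
import time
from jsonpath import jsonpath


# Scraping function
def crawl(url, headers, data):
    res = requests.post(url=url, headers=headers, data=data)
    # res.json() turns the response into a Python dict, which is what jsonpath expects.
    payload = res.json()
    # jsonpath takes the dict as its first argument and the match rule as its second.
    # '$..text' means: search recursively from the root for every "text" key
    # (likewise for "nickname"). It returns False when nothing matches, hence "or []".
    comment_list = jsonpath(payload, '$..text') or []
    nicname_list = jsonpath(payload, '$..nickname') or []
    # To loop over two lists in parallel, pair them up with zip
    for (nicname, comment) in zip(nicname_list, comment_list):
        dic = {
            'nickname': nicname,
            'comment': comment
        }
        print(dic)


def main():
    url = 'https://www.gsm-guard.net/ajax/msg'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}
    data = {'roomid': ''}
    # Always sleep inside the while loop, even for just 0.1 s;
    # otherwise the machine's memory cannot keep up.
    while True:
        crawl(url, headers, data)
        time.sleep(2)


if __name__ == '__main__':
    main()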
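
To make the `$..text` rule less opaque, the sketch below mimics that kind of recursive lookup in plain Python. It illustrates the idea only and is not how the jsonpath library is implemented. The helper `find_all` and the sample payload, including its `data`/`room` nesting, are assumptions made up for the example.

```python
def find_all(obj, key):
    """Collect every value stored under `key`, at any depth of nested dicts and lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            # Keep descending: the key may appear again further down
            found.extend(find_all(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_all(item, key))
    return found


if __name__ == '__main__':
    # Hypothetical payload shape, for illustration only
    payload = {'data': {'room': [
        {'nickname': 'alice', 'text': 'hello'},
        {'nickname': 'bob', 'text': '2333'},
    ]}}
    for name, text in zip(find_all(payload, 'nickname'), find_all(payload, 'text')):
        print({'nickname': name, 'comment': text})
```

Because both lookups walk the structure in the same order, zipping the two result lists pairs each nickname with its own message, provided every entry carries both keys.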