Let's demonstrate crawling posts from PTT.
We crawl via PTT's web interface:
- Fetch articles from the PTT Movie board: https://www.ptt.cc/bbs/movie/index.html
- Save the results as a CSV file
Key points
- PTT does a basic check on the HTTP headers, specifically which browser your User-Agent claims to be, so you can no longer leave everything unset as before; with no header set, the server responds with a 403 Forbidden error.
- A 403 means the other side refused to answer you, so what you should do is figure out why you were refused.
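The header fix can be illustrated without touching the network: a urllib `Request` carries no User-Agent by default, and adding one is a single call ("Mozilla/5.0" here is just a minimal browser-like value, not the only one that works):

```python
from urllib.request import Request

req = Request("https://www.ptt.cc/bbs/movie/index.html")
# No User-Agent yet; sending this as-is is what earns the 403 from PTT
print(req.has_header("User-agent"))  # → False

# Add a browser-like User-Agent before opening the URL
req.add_header("User-Agent", "Mozilla/5.0")
print(req.get_header("User-agent"))  # → Mozilla/5.0
```

Note that urllib normalizes header names internally (only the first letter capitalized), which is why the lookups use "User-agent".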
Contents: install BeautifulSoup and pandas
urllib is built into Python
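Assuming a standard Python setup, the two third-party packages can be installed with pip (note that the package name for BeautifulSoup 4 is `beautifulsoup4`):

```shell
pip install beautifulsoup4 pandas
```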
1. def open_ptt_url(url)
This function handles the headers, filling in a User-Agent,
then opens the page, parses it with BeautifulSoup,
and returns the parsed HTML.
2. Take the page https://www.ptt.cc/bbs/movie/index.html,
call def open_ptt_url(url) to process it, then crawl each post,
and finally use pandas to convert the results to CSV.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

def open_ptt_url(url):
    # Set a browser-like User-Agent so PTT does not reject us with 403 Forbidden
    r = Request(url)
    r.add_header("User-Agent", "Mozilla/5.0")
    response = urlopen(r)
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    html = BeautifulSoup(response, "html.parser")
    return html
u = "https://www.ptt.cc/bbs/movie/index.html"
### u = "https://www.ptt.cc/bbs/Palmar_Drama/index.html"
html = open_ptt_url(u)
# Each post on the index page lives in a div with class "r-ent"
posts = html.find_all("div", {"class": "r-ent"})
rows = []
for single_post in posts:
    a_area = single_post.find("div", {"class": "title"}).find("a")
    if a_area:
        post_url = "https://www.ptt.cc" + a_area["href"]
        print(a_area.string, post_url)
        if "公告" not in a_area.string:  # skip announcement posts
            post_html = open_ptt_url(post_url)
            content = post_html.find("div", {"id": "main-content"})
            # Strip metadata lines, footer spans and push (comment) divs,
            # leaving only the article body
            removes = content.find_all("div", {"class": "article-metaline"})
            for single_remove in removes:
                single_remove.extract()
            removes = content.find_all("div", {"class": "article-metaline-right"})
            for single_remove in removes:
                single_remove.extract()
            removes = content.find_all("span", {"class": "f2"})
            for single_remove in removes:
                single_remove.extract()
            removes = content.find_all("div", {"class": "push"})
            for single_remove in removes:
                single_remove.extract()
            revise = content.text.replace("\r", "").replace("\n", "")
            rows.append({"標題": a_area.string, "網址": post_url, "內容": revise})

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the DataFrame once at the end
df = pd.DataFrame(rows, columns=["標題", "網址", "內容"])
df.to_csv("result.csv", encoding="utf-8", index=False)
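The cleanup-and-export logic above can be exercised offline on a tiny hand-written HTML snippet (the snippet, row values and file name below are made up for illustration; only the class names match the real PTT markup):

```python
import pandas as pd
from bs4 import BeautifulSoup

# A made-up fragment mimicking PTT's article markup
snippet = """
<div id="main-content">
  <div class="article-metaline">author line</div>
  Movie review body text.
  <span class="f2">footer line</span>
  <div class="push">a comment</div>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")
content = soup.find("div", {"id": "main-content"})

# extract() removes a tag (and its text) from the parse tree in place
for cls in ["article-metaline", "push"]:
    for tag in content.find_all("div", {"class": cls}):
        tag.extract()
for tag in content.find_all("span", {"class": "f2"}):
    tag.extract()

revise = content.text.replace("\r", "").replace("\n", "").strip()
print(revise)  # → Movie review body text.

# Rows collected this way become a DataFrame, then a CSV
df = pd.DataFrame([{"標題": "demo", "網址": "https://example.invalid", "內容": revise}])
df.to_csv("demo.csv", encoding="utf-8", index=False)
```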