Let's demonstrate crawling posts from PTT.
We crawl via PTT's web interface:
- Fetch articles from the PTT Movie board: https://www.ptt.cc/bbs/movie/index.html
- Save the results as a CSV file
Key points
- PTT does a basic check on the HTTP headers, specifically which browser your User-Agent claims to be, so you can no longer leave everything unset as before; with no header set, the server responds with a 403 Forbidden error.
- A 403 means the other side refused to answer you, so what you should do is figure out why you were refused.
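The header fix can be illustrated without touching the network: a urllib `Request` carries no User-Agent by default, and adding one is a single call ("Mozilla/5.0" here is just a minimal browser-like value, not the only one that works):

```python
from urllib.request import Request

req = Request("https://www.ptt.cc/bbs/movie/index.html")
# No User-Agent yet; sending this as-is is what earns the 403 from PTT
print(req.has_header("User-agent"))  # → False

# Add a browser-like User-Agent before opening the URL
req.add_header("User-Agent", "Mozilla/5.0")
print(req.get_header("User-agent"))  # → Mozilla/5.0
```

Note that urllib normalizes header names internally (only the first letter capitalized), which is why the lookups use "User-agent".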
Contents: install BeautifulSoup and pandas
urllib is built into Python
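Assuming a standard Python setup, the two third-party packages can be installed with pip (note that the package name for BeautifulSoup 4 is `beautifulsoup4`):

```shell
pip install beautifulsoup4 pandas
```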
1. def open_ptt_url(url)
This function handles the headers, filling in a User-Agent,
then opens the page, parses it with BeautifulSoup,
and returns the parsed HTML.
2. Take the page https://www.ptt.cc/bbs/movie/index.html,
call def open_ptt_url(url) to process it, then crawl each post,
and finally use pandas to convert the results to CSV.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

def open_ptt_url(url):
    # Set a browser-like User-Agent so PTT does not reject us with 403 Forbidden
    r = Request(url)
    r.add_header("User-Agent", "Mozilla/5.0")
    response = urlopen(r)
    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    html = BeautifulSoup(response, "html.parser")
    return html
u = "https://www.ptt.cc/bbs/movie/index.html"
### u = "https://www.ptt.cc/bbs/Palmar_Drama/index.html"
html = open_ptt_url(u)
# Each post on the index page lives in a div with class "r-ent"
posts = html.find_all("div", {"class": "r-ent"})
rows = []
for single_post in posts:
    a_area = single_post.find("div", {"class": "title"}).find("a")
    if a_area:
        post_url = "https://www.ptt.cc" + a_area["href"]
        print(a_area.string, post_url)
        if "公告" not in a_area.string:  # skip announcement posts
            post_html = open_ptt_url(post_url)
            content = post_html.find("div", {"id": "main-content"})
            # Strip metadata lines, footer spans and push (comment) divs,
            # leaving only the article body
            removes = content.find_all("div", {"class": "article-metaline"})
            for single_remove in removes:
                single_remove.extract()
            removes = content.find_all("div", {"class": "article-metaline-right"})
            for single_remove in removes:
                single_remove.extract()
            removes = content.find_all("span", {"class": "f2"})
            for single_remove in removes:
                single_remove.extract()
            removes = content.find_all("div", {"class": "push"})
            for single_remove in removes:
                single_remove.extract()
            revise = content.text.replace("\r", "").replace("\n", "")
            rows.append({"標題": a_area.string, "網址": post_url, "內容": revise})

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the DataFrame once at the end
df = pd.DataFrame(rows, columns=["標題", "網址", "內容"])
df.to_csv("result.csv", encoding="utf-8", index=False)
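The cleanup-and-export logic above can be exercised offline on a tiny hand-written HTML snippet (the snippet, row values and file name below are made up for illustration; only the class names match the real PTT markup):

```python
import pandas as pd
from bs4 import BeautifulSoup

# A made-up fragment mimicking PTT's article markup
snippet = """
<div id="main-content">
  <div class="article-metaline">author line</div>
  Movie review body text.
  <span class="f2">footer line</span>
  <div class="push">a comment</div>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")
content = soup.find("div", {"id": "main-content"})

# extract() removes a tag (and its text) from the parse tree in place
for cls in ["article-metaline", "push"]:
    for tag in content.find_all("div", {"class": cls}):
        tag.extract()
for tag in content.find_all("span", {"class": "f2"}):
    tag.extract()

revise = content.text.replace("\r", "").replace("\n", "").strip()
print(revise)  # → Movie review body text.

# Rows collected this way become a DataFrame, then a CSV
df = pd.DataFrame([{"標題": "demo", "網址": "https://example.invalid", "內容": revise}])
df.to_csv("demo.csv", encoding="utf-8", index=False)
```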