Hi, this is Sho.
I graduated from the University of Michigan's Ross School of Business this June and am now officially an MBA holder. I'm staying on at the university until December to do machine learning research, but the time to head back to Japan is getting close.
I'll be back in Tokyo at the start of next year, so I've been mulling over which area to live in.
Choosing a place to live is hard, though, because there are so many factors to weigh. I'd like to pick the best bargain I can, but which ward is best? How big should the place be? Is a 2LDK or a 3K the better choice? This is not a job for a human brain. Whatever a computer can do, let's automate.
So I gave it a try.
Using machine learning to hunt for bargain rentals in Tokyo's 23 wards
There are plenty of property listing sites, but this time I went with Suumo. On copyright, the terms of use say the following:
"Users shall not, without the Company's prior consent, use any content provided through this site beyond the scope of the user's personal, private use as defined by the Copyright Act."
OK, this is private use, so it should be fine.
First, I'll scrape all the rental listings in Tokyo's 23 wards and collect them into a data frame. Let's start with Adachi-ku for now.
    # Import the required libraries
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    from pandas import Series, DataFrame
    import time

    # URL (rental listings for Adachi-ku, Tokyo: first page of search results)
    url = 'http://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13121&cb=0.0&ct=9999999&et=9999999&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2=&srch_navi=1'

    # Fetch the page
    result = requests.get(url)
    c = result.content

    # Build a BeautifulSoup object from the HTML
    # (naming the parser explicitly avoids the "no parser specified" warning)
    soup = BeautifulSoup(c, 'html.parser')

    # Cut out the part that holds the property list
    summary = soup.find("div", {'id': 'js-bukkenList'})
With this, I've pulled out the part of the first results page for Adachi-ku that holds the listing data. At 30 listings per page, Adachi-ku has 225 pages (as of October 8). That number changes all the time, so the page count needs to be picked up automatically.
    # Get the number of pages
    body = soup.find("body")
    pages = body.find_all("div", {'class': 'pagination pagination_set-nav'})
    pages_text = str(pages)

    # The last page link sits right before the end of the pagination list
    pages_split = pages_text.split('</a></li>\n</ol>')
    pages_split0 = pages_split[0]

    # Keep the last three characters, then drop any '>' left over from the tag
    pages_split1 = pages_split0[-3:]
    pages_split2 = pages_split1.replace('>', '')
    pages_split3 = int(pages_split2)
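The slicing above keeps exactly three characters, so it appears to rely on the last page number having two or three digits. As a rough sketch of one alternative (not what this post uses), the page count could be read with a regular expression that collects every numeric link text and takes the largest. The function name `last_page_number` and the sample markup are made up for illustration; the real pagination HTML may differ.

```python
import re


def last_page_number(pagination_html):
    """Return the largest page number that appears as link text, or 1 if none."""
    numbers = [int(n) for n in re.findall(r'>\s*(\d+)\s*</a>', pagination_html)]
    return max(numbers, default=1)


# Hand-written sample markup (an assumption, not copied from the live site)
sample = (
    '<div class="pagination pagination_set-nav"><ol>'
    '<li><a href="?pn=2">2</a></li>'
    '<li><a href="?pn=3">3</a></li>'
    '<li><a href="?pn=225">225</a></li>\n</ol></div>'
)

page_count = last_page_number(sample)
```

With the variables from the code above, this would be called as `last_page_number(str(pages))`.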
OK, got it. There's probably a nicer way to do this, but it works, so let's move on. Next, I'll use the page count to build the list of URLs.
    # List that will hold the URLs
    urls = []

    # Store the first page
    urls.append(url)

    # Store page 2 through the last page
    for i in range(pages_split3 - 1):
        pg = str(i + 2)
        url_page = url + '&pn=' + pg
        urls.append(url_page)
The first page is handled separately from page 2 onward because the URL structure is slightly different. The first page's URL ends in "~&srch_navi=1", whereas pages 2 through 225 end in "~&srch_navi=1&pn=2" through "~&srch_navi=1&pn=225", so take care.
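The same URL rule can be written as a small function, shown here as a minimal sketch. The name `build_page_urls` and the example.com URL are placeholders for illustration; only the "&pn=" suffix rule comes from the post.

```python
def build_page_urls(first_page_url, page_count):
    """Return the first-page URL followed by '&pn=2' ... '&pn=<page_count>' variants."""
    page_urls = [first_page_url]
    page_urls += [f"{first_page_url}&pn={page}" for page in range(2, page_count + 1)]
    return page_urls


# Placeholder base URL (an assumption for this example)
base = "http://example.com/search?x=1&srch_navi=1"
example_urls = build_page_urls(base, 225)
```

Here `example_urls[0]` is the unchanged first-page URL and `example_urls[-1]` ends in "&srch_navi=1&pn=225".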
This time I'll collect the following information for each listing, so I'll set up the lists in advance.
    name = []        # building name
    address = []     # address
    locations0 = []  # location 1 (nearest station / ~ min walk)
    locations1 = []  # location 2 (nearest station / ~ min walk)
    locations2 = []  # location 3 (nearest station / ~ min walk)
    age = []         # building age
    height = []      # building height
    floor = []       # floor
    rent = []        # rent
    admin = []       # management fee
    others = []      # deposit / key money / guarantee / deposit deduction, amortization
    floor_plan = []  # floor plan
    area = []        # floor area
Now for the crawl itself. The work is the same on every page, so I'll loop over the list of URLs and keep tossing the results into the lists prepared above.
    # Repeat the following for every results page
    for url in urls:
        # Cut out the property list
        result = requests.get(url)
        c = result.content
        soup = BeautifulSoup(c, 'html.parser')
        summary = soup.find("div", {'id': 'js-bukkenList'})

        # Each cassetteitem is one building: name, address, location
        # (nearest station / ~ min walk), building age and building height
        cassetteitems = summary.find_all("div", {'class': 'cassetteitem'})

        # Repeat the following for every building
        for item in cassetteitems:
            # One tbody per listed room, so this gives the room count
            tbodies = item.find_all('tbody')

            # Building name
            subtitle = item.find_all("div", {'class': 'cassetteitem_content-title'})
            subtitle = str(subtitle)
            subtitle_rep = subtitle.replace('[<div class="cassetteitem_content-title">', '')
            subtitle_rep2 = subtitle_rep.replace('</div>]', '')

            # Address
            subaddress = item.find_all("li", {'class': 'cassetteitem_detail-col1'})
            subaddress = str(subaddress)
            subaddress_rep = subaddress.replace('[<li class="cassetteitem_detail-col1">', '')
            subaddress_rep2 = subaddress_rep.replace('</li>]', '')

            # Store the name and address once per room (to match the room data)
            for y in range(len(tbodies)):
                name.append(subtitle_rep2)
                address.append(subaddress_rep2)

            # Location
            sublocations = item.find_all("li", {'class': 'cassetteitem_detail-col2'})

            # Keep the first three locations (ignore the fourth onward)
            for x in sublocations:
                cols = x.find_all('div')
                for j in range(len(cols)):
                    text = cols[j].find(string=True)
                    for y in range(len(tbodies)):
                        if j == 0:
                            locations0.append(text)
                        elif j == 1:
                            locations1.append(text)
                        elif j == 2:
                            locations2.append(text)

            # Building age and building height
            col3 = item.find_all("li", {'class': 'cassetteitem_detail-col3'})
            for x in col3:
                cols = x.find_all('div')
                for j in range(len(cols)):
                    text = cols[j].find(string=True)
                    for y in range(len(tbodies)):
                        if j == 0:
                            age.append(text)
                        else:
                            height.append(text)

        # The tables hold floor, rent, management fee,
        # deposit / key money / guarantee / deposit deduction, amortization,
        # floor plan and floor area
        tables = summary.find_all('table')

        # For each building (table), get the listed rooms (rows)
        rows = []
        for i in range(len(tables)):
            rows.append(tables[i].find_all('tr'))

        # For each room, collect the text of every table cell into data
        data = []
        for row in rows:
            for tr in row:
                cols = tr.find_all('td')
                for td in cols:
                    text = td.find(string=True)
                    data.append(text)

        # Walk through data; a cell containing '階' ("floor") starts one room,
        # and the next five cells are its remaining fields, in order
        index = 0
        for cell in data:
            if '階' in cell:
                floor.append(data[index])
                rent.append(data[index+1])
                admin.append(data[index+2])
                others.append(data[index+3])
                floor_plan.append(data[index+4])
                area.append(data[index+5])
            index += 1

        # Pause for 10 seconds (scraping etiquette)
        time.sleep(10)
It feels a bit verbose, but the information we need is now stored in the lists. A single building can have several rooms listed, so the code repeats the building-level values once per room to keep the counts from drifting apart.
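To make the count alignment concrete, here is a minimal sketch of the same idea with plain dictionaries: the building-level fields are copied into one record per room. The data, the field names, and the helper `flatten` are all invented for illustration and do not come from the scraped site.

```python
# Hypothetical stand-in for one parsed results page:
# each building has its own fields plus a list of listed rooms.
buildings = [
    {"name": "Building A", "address": "Adachi-ku 1-1", "rooms": [
        {"floor": "2階", "rent": "6.5万円"},
        {"floor": "3階", "rent": "6.8万円"},
    ]},
    {"name": "Building B", "address": "Adachi-ku 2-2", "rooms": [
        {"floor": "1階", "rent": "5.9万円"},
    ]},
]


def flatten(buildings):
    """Return one record per room, copying the building-level fields into each."""
    records = []
    for building in buildings:
        for room in building["rooms"]:
            record = {"name": building["name"], "address": building["address"]}
            record.update(room)
            records.append(record)
    return records


records = flatten(buildings)
```

Because every record carries both the building and the room fields, a list like `records` can be passed straight to `pd.DataFrame(records)` without tracking the lengths of separate lists.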
All that's left is to stick these together into a data frame.
    # Convert each list to a Series
    name = Series(name)
    address = Series(address)
    locations0 = Series(locations0)
    locations1 = Series(locations1)
    locations2 = Series(locations2)
    age = Series(age)
    height = Series(height)
    floor = Series(floor)
    rent = Series(rent)
    admin = Series(admin)
    others = Series(others)
    floor_plan = Series(floor_plan)
    area = Series(area)

    # Combine the Series into a DataFrame
    suumo_df = pd.concat([name, address, locations0, locations1, locations2, age, height, floor, rent, admin, others, floor_plan, area], axis=1)

    # Column names (kept in Japanese: building name, address, location 1-3,
    # building age, building height, floor, rent, management fee,
    # deposit / key money / guarantee / deposit deduction and amortization,
    # floor plan, floor area)
    suumo_df.columns = ['マンション名','住所','立地1','立地2','立地3','築年数','建物高さ','階','賃料','管理費', '敷/礼/保証/敷引,償却','間取り','専有面積']

    # Save as a tab-separated file
    suumo_df.to_csv('suumo_adachi.csv', sep='\t', encoding='utf-16')
Done!!
This run only covered Adachi-ku, but swapping out the initial URL makes it work for the other wards too.
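As a rough sketch of that swap, and only under an assumption the post does not confirm, the ward might be selected by the `sc` query parameter (`sc=13121` in the Adachi-ku URL). The helper `url_for_ward` and the code "13101" are hypothetical; it is safer to copy the URL the site actually produces for the ward you want.

```python
# Value of `sc` in the Adachi-ku URL used in this post
ADACHI_CODE = "13121"

adachi_url = 'http://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13121&cb=0.0&ct=9999999&et=9999999&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2=&srch_navi=1'


def url_for_ward(first_page_url, ward_code):
    """Swap the assumed ward parameter `sc` in the Adachi-ku URL for another code."""
    marker = f"sc={ADACHI_CODE}"
    if marker not in first_page_url:
        raise ValueError("expected the Adachi-ku search URL")
    return first_page_url.replace(marker, f"sc={ward_code}", 1)


# "13101" is an assumed example code, not taken from the post
other_ward_url = url_for_ward(adachi_url, "13101")
```

The resulting URL would then replace `url` at the top of the script, and the output file name would need changing as well.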
In the next post, I'll preprocess this data to make it easier to analyze.