This project was created to illustrate scraping data from Amazon with R and rvest.
Web scraping allows the extraction of data elements from the HTML/CSS of a website.
ALWAYS ensure that you have permission from the site before scraping. This is done by checking the site's `robots.txt` file, which can be done simply with the `paths_allowed()` function from the `robotstxt` package.
For example, to determine if you can scrape a site, you can run the following:
```r
library(robotstxt)

paths_allowed(
  paths = c("https://www.imdb.com/")
)
```
If the result is `TRUE`, you are permitted to scrape the site.
There are a few ways to select elements from a webpage:
- Inspecting the page via developer tools in any major browser.
- Selector Gadget (https://selectorgadget.com/), which allows point-and-click selection of elements.
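Once a selector has been identified, it can be passed to rvest to pull the matching elements out of the page. Below is a minimal sketch using the IMDb URL from the earlier example; the CSS selector `.titleColumn a` is a placeholder assumption, so substitute whatever selector you actually found with developer tools or Selector Gadget.

```r
library(rvest)

# Download and parse the page once.
page <- read_html("https://www.imdb.com/chart/top/")

# Extract the text of every element matching the CSS selector.
# ".titleColumn a" is a placeholder -- replace it with the selector
# you identified via developer tools or Selector Gadget.
titles <- page |>
  html_elements(".titleColumn a") |>
  html_text2()

head(titles)
```

`html_elements()` returns every node matching the selector, and `html_text2()` extracts their visible text with sensible whitespace handling.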
The .html files in this repository do not render as webpages directly within GitHub. GitHub & BitBucket HTML Preview (https://htmlpreview.github.io/) works well for viewing these .html files as rendered webpages.
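For example, prepending `https://htmlpreview.github.io/?` to a file's GitHub URL (e.g., `https://htmlpreview.github.io/?https://github.com/<user>/<repo>/blob/main/<file>.html`, where the path is a placeholder) renders it in the browser.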
The presentation can be accessed here.