Add this to your Gemfile and run bundle install in the Terminal:
gem "http"
You can follow along in the rails console.
Choose the URL of a page that contains the data you want. For this example, let’s pick chapters.onrender.com.
url = "https://chapters.onrender.com/chapters/753"
Make a request to that URL and save the result in a variable.
webpage = HTTP.get(url)
The result of our request is a Response object.
There is a LOT of data that the server can return when we request a page, such as a status code and headers. For web scraping purposes, we only care about the HTML content of the page. To get that, we use the body method and convert the result to a String with to_s.
p webpage.body.to_s
This should return something like:
=> "<!DOCTYPE html>\n<html>\n <head>\n <title>README.md | firstdraft Chapters</title>\n <meta name=\"csrf-param\" content=\"authenticity_token\" />..."
Now we have a big String that has all the HTML elements on the page. But as we know from reading from an API, searching through a String is a pain! It’d be a lot nicer if we could convert it to an Array or a Hash or some other Ruby structure so we could search through it more easily.
Similar to parsing the JSON of an API, we next need to parse the page’s HTML.
parsed_page = Nokogiri::HTML(webpage.body.to_s)
p parsed_page
Nokogiri is a gem that comes bundled with Rails apps, and it can parse HTML into a structured Ruby object. Now we have the page data in a structured format! It’s not an Array or a Hash, but it is a Ruby object with methods we can use to search through the HTML, which is all that matters.
Next, how do we select the specific parts of the page to get the data from?
We need to use CSS selectors to pick which elements we want to grab from the page.
Let’s say I want to select all the links that are inside a list on the Chapters page we requested.
# This selects all <a> elements that are inside an <li> element
links = parsed_page.css("li a")
If you need to, you can further filter page content by chaining multiple calls to the .css method.
# This selects all <a> elements that are inside a <div> element
div_links = parsed_page.css("div").css("a")
For more details about how to figure out the selector you want to use, see the CSS Resources section at the bottom.
The .css method will return a list of HTML elements that match whatever selector we gave it as an argument.
Since we selected a list of elements, we can now loop through all of them and grab only the visible text of each element with the text method.
links.each do |link|
  p link.text
end
If we run the whole code, we should see the text of the links we wanted:
"The One Reference"
"Nouns, verbs, and grammar"
"A few program notes"
"String"
"Getting strings from users"
...
Happy hacking!