Scraping the web with Selenium and Python
Sometimes your favorite (or not so favorite) websites don’t give you all the information you want in a simple single-page format.
For example, I recently published a book on Amazon using KDP, and wanted to see how it was performing on the Amazon Best-Seller lists. But I got tired of having to visit all the different sites to see how things were going.
Selenium is a package that calls up a web browser and allows you to automate all sorts of web activities, as well as extracting information from the pages browsed.
In this tutorial, I am going to show you how you can use Selenium with Python and the Chrome web browser, in order to visit a series of websites in turn and extract information that you’re interested in. The information is conveniently packaged in a single web page, which is then saved to your hard drive and opened in the browser that Selenium is using.
Installation and running
As is always the case with Python, the hardest part is getting your environment set up correctly. I have put the code for the project in this Github repository: https://github.com/kf106/amazon-bestseller
This script requires Chrome, and a matching piece of software called ChromeDriver that can be downloaded from https://chromedriver.chromium.org/downloads . You can find the version of Chrome that you’re using by clicking on ⋮ > Help > About Google Chrome. Once you’ve downloaded ChromeDriver, put it in ~/.local/bin.
Then run sudo ./install.sh to install a Python virtual environment with the correct packages.
Next, activate the environment using source venv/bin/activate
Finally, you can run the program with:
python amazon-bestseller.py <ASIN>
where <ASIN> is the Amazon Standard Identification Number.
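For reference, here is a sketch of the imports the script needs for everything covered in this article; the actual file in the repository may order or name things slightly differently:
import os                   # used to locate ChromeDriver under $HOME
import sys                  # sys.argv[1] carries the ASIN from the command line
from pathlib import Path    # used later to build a file:// URI for the results page
from selenium import webdriver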
How it works
If you look at the code, you can see that it is not very long. The first import gets the webdriver module, which allows us to programmatically drive the web browser:
from selenium import webdriver
We then create an instance of a webdriver that we can use to do webby stuff:
home = os.getenv("HOME")
driver = webdriver.Chrome(home + '/.local/bin/chromedriver')
The object driver has all sorts of functionality for manipulating the web browser that appears, in whatever way you want. We’ll be looking at a few simple possibilities.
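For example, a driver can load a page, report its title, search its elements, and shut the browser down again. This is just an illustrative sketch and not part of the script (which keeps the browser open so it can display the results at the end):
driver.get('https://www.example.com')         # load a page in the visible browser window
print(driver.title)                           # prints the page title, e.g. "Example Domain"
links = driver.find_elements_by_xpath('//a')  # a list of every anchor element on the page
driver.quit()                                 # close the browser when you are finished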
Products of interest
In this example, I want to visit all the Amazon sites for my product of interest. These pages have the following format:
https://www.amazon.<top-level domain>/dp/<ASIN>
But the best-seller section, halfway down the page, is in a different language on each site. So we make a Python dictionary with the top-level domain as the key, and the string for the best-seller section as the value for each of those keys:
sites = {
"com": "Best Sellers Rank:",
"co.uk": "Best Sellers Rank:",
"de": "Amazon Bestseller-Rang:",
"fr": "Classement des meilleures ventes ",
"es": "Clasificación en los más vendidos de Amazon:",
"com.mx": "Clasificación en los más vendidos de Amazon:",
"ca": "Best Sellers Rank:",
"co.jp": "Amazon Bestseller:",
"com.br": "Ranking dos mais vendidos:",
"com.au": "Best Sellers Rank:",
"in": "Best Sellers Rank:"
}
We are going to store the results in an HTML page, which we can keep in the program as a string. Here we initialize the string with the most basic web page heading structure:
html = "<html><head></head><body><h1>" + sys.argv[1] + "</h1>"
Visiting pages
Now we need to iterate across each key/value entry, which is really simple:
for locale in sites:
The variable locale will contain the values “com”, “co.uk”, “de”, “fr” and so on through to “in”, as it loops through each key of our sites dictionary in turn.
For each locale we need our web browser to retrieve the relevant site page. The Selenium web driver has the .get(<website URL>) function for that:
driver.get('https://www.amazon.' + locale + '/dp/' + sys.argv[1])
So if the locale is com, and the first argument is an ASIN of B07W5HS8XZ, then our automated browser will retrieve the web page https://www.amazon.com/dp/B07W5HS8XZ and you’ll see that in the browser window.
Constructing a results page
We’ll use the title of the first site we visit to get a heading for our page. The title of the current page that’s loaded is in driver.title, and we’ll add it to our HTML string with the following line:
html = html + "<p><b>" + driver.title + "</b></p>"
Now we have to find the relevant HTML in the page and extract it, in order to add it to our results HTML page. When I first wrote this script, the best-seller section had a handy HTML element ID that I could use, but Amazon has since removed it. So we need to find the correct one by searching for the text content of the page.
Seek and you will find
With Selenium you can find the first element containing your required text content using find_element_by_xpath, or you can get a list of all the elements that match using find_elements_by_xpath. An XPath expression describes a path through the element hierarchy of a web page, much like a path through a file system.
To search for elements containing a particular string, use
rank_element = driver.find_elements_by_xpath("//*[contains(text(), '<string to search for>')]")
If there is a match, we’ll probably want the first and only list entry, which would be rank_element[0]. If there isn’t a match, find_elements_by_xpath returns an empty list, so trying to index into it throws an exception.
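If you would rather not rely on that exception, you can check the length of the list before indexing into it. This is an alternative to what the script does, sketched here for illustration (note, too, that Selenium 4.3 and later removed the find_element_by_xpath family in favour of driver.find_elements(By.XPATH, ...), so this article assumes the older API):
matches = driver.find_elements_by_xpath("//*[contains(text(), 'Best Sellers Rank:')]")
if matches:
    rank_child = matches[0]   # the first element whose text contains the search string
else:
    print("no matching element on this page")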
Because we’re looping through each page in turn, we need to feed in the search string that’s retrieved from the relevant key/value entry of our sites dictionary:
search_string = sites.get(locale)
rank_child = driver.find_elements_by_xpath("//*[contains(text(), '" + search_string + "')]")
An inspector calls
If you right-click on a web page and click on Inspect, you get the Developer Tools pane popping up, which you can use to examine the current web page. Click in the HTML pane for the Elements tab, and hit Ctrl+F. Then search for the string that you’re looking for. In this case, it’s “Best Sellers Rank”, and inspecting the surrounding HTML shows that the matching text sits in an element nested inside the section that holds the whole best-seller ranking.
So what we actually want is the content two elements up. Remember how I said an XPath is like a file system path? To go up two directories on your hard drive you can use ../.. (each ../ takes you up one level), and the XPath equivalent is ./../.. (the ./ is the current element, the first ../ takes you up one level, and the second ../ takes you up another level).
We can retrieve the “grandparent” element by calling the find_element_by_xpath method on the first entry of our retrieved rank_child list, like this:
rank_element = rank_child[0].find_element_by_xpath('./../..')
Now we have an object that contains the HTML element (and all its children). We can use the Amazon web page formatting for our results if only we can extract the contents of our object. That’s easily done, and the content is known as innerHTML:
rank = rank_element.get_attribute('innerHTML')
And so we add that to our html results variable:
html = html + "<h2>" + locale + "</h2>"
html = html + rank + "\n"
Product not found
If the page that we get back for a particular website doesn’t contain the information we are looking for, our script will throw an error when looking for that information. So wrapping the above code in a try/except block allows us to add a “no rank found” message for that particular site in the except part:
except Exception as err:
    rank = "<p>No rank found</p>"
    success = False
    failure = failure + "<li>" + locale + "</li>"
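To see how the pieces fit together, the body of the loop ends up looking roughly like this. It’s a sketch reconstructed from the snippets above rather than a copy of the repository code, and it assumes that success and failure have been initialised (to True and an empty string) before the loop:
for locale in sites:
    # fetch the product page for this locale
    driver.get('https://www.amazon.' + locale + '/dp/' + sys.argv[1])
    search_string = sites.get(locale)
    try:
        # find the element containing the localised "best sellers" text,
        # go up two levels, and grab the whole ranking section as HTML
        rank_child = driver.find_elements_by_xpath(
            "//*[contains(text(), '" + search_string + "')]")
        rank_element = rank_child[0].find_element_by_xpath('./../..')
        rank = rank_element.get_attribute('innerHTML')
    except Exception as err:
        rank = "<p>No rank found</p>"
        success = False
        failure = failure + "<li>" + locale + "</li>"
    html = html + "<h2>" + locale + "</h2>"
    html = html + rank + "\n"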
If we were being thorough, then there would be tests for all sorts of different exceptions that could arise — for example, if the site is not found, or if the product is not found, or if the product is found, but doesn’t have a best-seller ranking (for example, because it’s not a book).
But this is just a quick and dirty script, not a product, so I can’t be bothered.
Learn to read and write
After the loop has completed, our html variable contains a string that represents a web page for our results. The simplest way to display this in a human-readable fashion is to show it in a web browser. So why not use the Selenium/ChromeDriver browser that is open at this very moment?
We might also want a permanent record of the results. So we write the string to a file:
file = open("result.html", "w")
file.write(html)
file.flush()
file.close()
Remember to flush the file data to the drive, or the later code that retrieves it might get an empty file back instead.
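If you prefer, the same thing can be written as a with block, which closes (and so flushes) the file automatically as soon as the block ends. This is an alternative to the four lines above rather than what the script does:
with open("result.html", "w") as file:
    file.write(html)
# the file is closed, and its contents flushed to disk, when the block exits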
And then we open it in the web browser:
html_file = Path.cwd() / "result.html"
driver.get(html_file.as_uri())
That Path.cwd bit is needed to turn the file name into a full file:// resource locator that the browser can open, which is why at the very beginning of the script we have to import the Path module:
from pathlib import Path
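To make that concrete, here is what the pieces evaluate to for a hypothetical working directory (the directory name is made up purely for illustration):
# assuming the script is run from /home/user/amazon-bestseller
Path.cwd() / "result.html"              # PosixPath('/home/user/amazon-bestseller/result.html')
(Path.cwd() / "result.html").as_uri()   # 'file:///home/user/amazon-bestseller/result.html'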
Conclusion
And there you have it — all the Amazon best-seller rankings for a given book on one simple page.
There are plenty of opportunities for improvement: the links don’t work because they are relative links, and so during the results scraping we should change them to absolute links to the relevant sites. And as mentioned, the error checking is sloppy. But it works, and that’s what matters.
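As a hint of what the first improvement might look like, the relative links could be patched up at the point where the innerHTML for each locale is extracted, for example with a simple string replacement. This is only a sketch, and it assumes that the relative links in the extracted HTML all start with href="/:
# inside the loop, after extracting the innerHTML for this locale
rank = rank.replace('href="/', 'href="https://www.amazon.' + locale + '/')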
And it’s probably not as complicated as you thought it would be.
Good luck scraping the web!
About the Author
Keir Finlow-Bates is a blockchain researcher, inventor, and author.
You can buy a copy of his book, “Move Over Brokers Here Comes The Blockchain” (the ASIN of which was used in the examples above) at http://mybook.to/moveover.
He does not spend his time obsessively monitoring how well his book is doing on Amazon. Honestly.