Your Bot Reaps What Judges Sow
Automated data harvesting from the UKSC website
This month marks ten years since the Supreme Court of the United Kingdom (UKSC) became independent of the House of Lords, the upper house of Parliament. To data enthusiasts, this can mean only one thing: a decade’s worth of data on the latest law of the land. Beginners, however, may find it tricky to write code from scratch that extracts information directly from a webpage. This post shows how to avoid that hassle by using robotic process automation (RPA).
Robotic Process Automation
Robotic process automation is an automation technology built around a bot, or your own AI workforce. Unlike conventional automation tools, RPA works through the graphical user interface (GUI), which allows your computer to “watch” your actions and repeat them. You merely show, not tell, what is to be done, so you do not have to write in a complicated programming language to make yourself understood. The application will then repeat the task as often as you need. This is simple yet powerful.
There are four kinds of RPA, listed below in ascending order of sophistication and specialisation.
- Web scraping software: collects data and saves it into a structured, processable format
- Templatised software: provides a kit with which specialist programmers can build a bot that delivers beyond data collection and synthesis
- Enterprise-level software: scalable, reusable, and can automate large business operations
- Sector-specific software: customised to a particular and complex procedure, e.g. accounting
In this guide, we are going to try basic web scraping. The RPA platform I will use is Automation Anywhere, but there are other providers such as Blue Prism and UiPath. You might want to consider the scale and complexity of the automation you aim to achieve when choosing a bot provider, but community editions should suffice in most cases and are available free of charge.
For our purposes, the harvesting process is straightforward. It is threefold: (i) setting up, (ii) showing your bot what to do, and (iii) letting it do the job for you.
Step 1: Setting up
Launch the RPA application and change the recording option in the top left-hand corner to “Web Recorder.” This is found in the drop-down menu.
Open the website that you want to harvest data from. In my case, that is the UKSC website, specifically the page on which the decided cases are published. Note that the data should be available in a tabular format.
Copy the URL and go back to the application. Hit the Record button and paste the address into the box that pops up.
Press Start. This will enable you to demonstrate to your bot what to do. Upon clicking Start, the URL will load in a new browser window. The website should look the same as when you first visited it to retrieve the URL, except that there will be a control bar hovering over it.
Step 2: Showing your bot what to do
There are two modes of extraction on the control bar: “Extract Data” and “Extract Table.” Since I want a wholesale extraction of the data, that is, every single row and column, I will proceed with “Extract Table.” If you would rather drop certain columns, you may click on “Extract Data” and show the bot a pattern of what to include and what to leave out.
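For readers curious about what dropping columns amounts to, here is a minimal Python sketch of the same idea applied to rows that have already been collected. It is an illustration only, not what Automation Anywhere does internally; the function name `select_columns`, the column names, and the sample rows are all hypothetical.

```python
def select_columns(header, records, wanted):
    """Keep only the columns named in `wanted`, in the order given.

    Raises ValueError if a wanted column is not present in the header.
    """
    indices = [header.index(name) for name in wanted]
    kept_header = [header[i] for i in indices]
    kept_records = [[row[i] for i in indices] for row in records]
    return kept_header, kept_records


# Hypothetical sample rows, not real UKSC data.
header = ["Case ID", "Case name", "Judgment date"]
records = [
    ["UKSC 0000/0001", "Example v Example", "01 Jan 2019"],
    ["UKSC 0000/0002", "Sample v Sample", "02 Jan 2019"],
]

kept_header, kept_records = select_columns(
    header, records, wanted=["Case name", "Judgment date"]
)
print(kept_header)
for row in kept_records:
    print(row)
```

The original rows are left untouched, so you can try several selections before settling on one.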
Once the bot identifies the tabular data that you wish to extract, it encloses the content in a green boundary. Click on the table to confirm.
You are then shown a preview of up to 50 rows of what the harvested data will look like. Everything looks fine, so I will proceed.
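If you ever want to see what “identifying a table” involves under the hood, the sketch below does a rough equivalent by hand with Python’s standard `html.parser` module. It is a simplified illustration, not the bot’s actual method: the class name `TableExtractor` and the sample markup are my own assumptions, and the markup is not copied from the UKSC website.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every cell in every <tr> of an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample markup, NOT taken from the UKSC website.
SAMPLE_HTML = """
<table>
  <tr><th>Case ID</th><th>Case name</th><th>Judgment date</th></tr>
  <tr><td>UKSC 0000/0001</td><td>Example v Example</td><td>01 Jan 2019</td></tr>
  <tr><td>UKSC 0000/0002</td><td>Sample v Sample</td><td>02 Jan 2019</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows

# Preview the first rows, much like the preview window in the recorder.
print(header)
for record in records[:50]:
    print(record)
```

This sketch treats every `<tr>` on the page as part of one table, which is fine for a page with a single table but would need refining for anything more elaborate.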
Be sure to check the box if the table spans multiple pages. You will also need to show the bot how to move on to the next page by capturing the link to that page.
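The logic of following a “next page” link can be sketched in a few lines of Python. To keep the example runnable without touching any real website, the pages here live in a dictionary; the addresses, the rows, and the helper names `fetch_page` and `collect_all_rows` are all hypothetical stand-ins rather than anything the UKSC site or the RPA tool provides.

```python
# Hypothetical in-memory "website": each page holds some rows and the
# address of the next page (None on the last page). Not real UKSC data.
FAKE_SITE = {
    "/decided-cases?page=1": {
        "rows": [["UKSC 0000/0001", "Example v Example"]],
        "next": "/decided-cases?page=2",
    },
    "/decided-cases?page=2": {
        "rows": [["UKSC 0000/0002", "Sample v Sample"]],
        "next": "/decided-cases?page=3",
    },
    "/decided-cases?page=3": {
        "rows": [["UKSC 0000/0003", "Specimen v Specimen"]],
        "next": None,
    },
}


def fetch_page(address):
    """Stand-in for downloading and parsing one page (assumed helper)."""
    return FAKE_SITE[address]


def collect_all_rows(first_address, fetch=fetch_page):
    """Follow the 'next' link from page to page, gathering every row."""
    rows = []
    visited = set()  # guards against a site whose links loop back
    address = first_address
    while address is not None and address not in visited:
        visited.add(address)
        page = fetch(address)
        rows.extend(page["rows"])
        address = page["next"]
    return rows


all_rows = collect_all_rows("/decided-cases?page=1")
print(len(all_rows), "rows collected")
```

The loop stops either when a page has no next link or when it meets a page it has already seen, which is the same reason the bot needs to be shown exactly which link leads onward.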
Click Next, give this set of data a name so it can be saved as a CSV file, and select an appropriate encoding if necessary. Hit the “Finish” button and then “Stop Recording.” This will take you back to the application, where you can save the entire process as a single task for your bot to repeat.
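To show why the encoding choice matters, here is a small Python sketch that writes rows to a CSV file and reads them back. It is only an illustration under my own assumptions: the file name, the helper `save_rows_as_csv`, and the sample rows are hypothetical, and `utf-8-sig` is simply one encoding that spreadsheet programs tend to open cleanly when names contain accented characters.

```python
import csv
import tempfile
from pathlib import Path


def save_rows_as_csv(path, header, records, encoding="utf-8-sig"):
    """Write a header and records to `path` as CSV in the given encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding=encoding) as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(records)


# Hypothetical sample rows, not real UKSC data.
header = ["Case ID", "Case name"]
records = [
    ["UKSC 0000/0001", "Example v Société Exemple"],
    ["UKSC 0000/0002", "Sample v Sample, Ltd"],
]

# Write into a temporary folder so the example leaves no clutter behind.
output_path = Path(tempfile.mkdtemp()) / "uksc_decided_cases.csv"
save_rows_as_csv(output_path, header, records)

# Read the file back with the same encoding to confirm nothing was mangled.
with open(output_path, newline="", encoding="utf-8-sig") as handle:
    rows_read_back = list(csv.reader(handle))
print(rows_read_back)
```

Reading the file back with a different encoding from the one used to write it is the usual cause of garbled characters, so it is worth noting which one you picked.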
Step 3: Letting the bot do the job for you
We’re almost there. Run the saved task.
There will be a run-time window and, finally, a fresh set of data: sown by the UK Supreme Court Justices over the past 10 years and reaped by my bot in less than 10 seconds.
When you’re a beginner, coding to gather data online can feel daunting, if not impossible. But with RPA, you do not have to give up on the data that you want to work with.