Data extraction tools for beginners and professionals
There are all kinds of tools for extracting unstructured data from files that cannot be reused such as a PDF or websites run by governments and organizations. Some are free, others are fee based and in some cases languages like Python are used to do this.
To work with data, it is essential to have data. Sometimes that information is structured and on other occasions it is unstructured. Nowadays there are many tools or processes through which a developer can extract data from complex formats such as PDF or one or more websites, which is known as web scraping. The goal is to have the data to be able to view and understand.
Web scraping could be defined as the technique by which a team of developers is able to scrape or free data from websites run by governments, state institutions and organizations to access private or public data that may be published or distributed in open format. The problem is that the format of most interesting data is not reusable and it is opaque such as a PDF for example.
In order to access and distribute this information, there are a lot of tools or processes through the use of programming languages. This is a guide to using the main data extraction methods.
Web scraping tools
● ImportHTML formula
As part of Google applications, the grand search engine has developed its own Excel called Google Spreadsheet. This tool provides almost all the features offered by Microsoft Excel, but also has some added functionality through content that is indexed to the internet by the search engine: RSS feed reader, website updates and data extraction.
All this is possible through the use of formulas like ImportFeed, ImportHTML and ImportXML. The second of these allows any user to extract data from tables or lists in an orderly fashion from any website. One of the elements of the formula type varies depending on whether it is a table or a list. Here are two practical examples:
=ImportHTML("url página web", "table", 2)
=ImportHTML("url página web", "list", 2)
By including any of these formulas in the first cell of Google Spreadsheet, it possible to extract the second table or list of the URL that the user adds within double quotes. It is very simple.
Table Capture is an extension for the Chrome browser, which provides a user with data on a website with little difficulty. It extracts the information contained in an HTML table of a website to any data processing format such as Google Spreadsheet, Excel or CSV. This is similar to the ImportHTML formula.
ScraperWiki is the perfect tool for extracting data arranged in tables in a PDF. All you need to do is load the file and export it. If the PDF has multiple pages and numerous tables, ScraperWiki provides a preview of all the pages and the various tables and the ability to download the data in an orderly way and separately.
With ScraperWiki you can also clean the data before it is exported to a Microsoft Excel file. This is useful as it makes things much easier when clean data is added to a visualization tool.
Tabula is a desktop application for Windows, Mac OSX and Linux computers that provides developers and researchers with a simple method to extract data from a PDF to a CSV or Microsoft Excel file for editing and viewing. Tabula is a tool that is widely used in data journalism.
The steps to use Tabula are as follows:
- Load a PDF with the data table you want to export.
- Select the table with all the information.
- Select the option 'Preview and data extraction'. Tabula scrapes the data in the table and provides the user with a preview of the information extracted for it to be checked.
- Click 'Export'.
- The data are exported to a Microsoft Excel file, or a LibreOffice file if you do not have Microsoft Office.
Tabula is an open source project available on GitHub.
Import.io is a free online tool, but there is also a fee-based version for companies. This aids structured extraction of data and downloading in CSV format or generating an API with the information. API data are updated as data is modified in the source environment.
Import.io has a desktop application that can be downloaded by any user to their Windows, Mac OSX or Linux computer. In this application, Import.io offers several methods for extracting very different data: information contained in a URL, information in HTML or XML, images, numerical values, maps, i.e. everything.
Extracting data with Python
In BBVAOpen4U we have seen what Python is and how it works when developing digital projects or using libraries for data visualization, but this is the first time one of its most interesting and professional features has been mentioned: extracting unstructured data. There are also many libraries in this language for data access.
BeautifulSoup is a Python library used to easily extract specific data from a web page in HTML without much programming. It is technically called parsing HTML. One of the advantages of this library in Python is that all of the output documents of the data extraction are created in UTF-8, which is quite interesting because the typical problem of encoding is completely solved.
Another powerful feature of BeautifulSoup is that it uses Python analyzers such as lxml or html5lib, which makes it possible to crawl tree-structured websites. These enable you to go through each 'room' of a website, open it, extract your information and print it.
from bs4 import BeautifulSoup
redditFile = urllib2.urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
soup = BeautifulSoup(redditHtml)
redditAll = soup.find_all("a")
for links in soup.find_all('a'):
Mechanize is a virtual browser that manages to track a website with Python programming language. It is based on the urllib module.
Scrapy is an open code development framework for data extraction with Python. This framework allows developers to program spiders used to track and extract specific information from one or several websites at once. The mechanism used is called selectors; however, you can also use libraries in Python such as BeautifulSoup or lxml.
Sign up to the BBVAOPEN4U newsletter and receive tips, tools and the most innovative events directly in your inbox.