Scraping and Data Mining for Journalists

August 18, 2016

New online course in Portuguese on data journalism teaches how to automate collection of web data


The most challenging part of data journalism is often the process of collecting information, especially when the journalist needs to make structured spreadsheets from PDF reports, websites or social networks. The new online course in Portuguese offered by the Knight Center and Escola de Dados will show you how to work around this problem and automate the collection of web data.

The course “Data Scraping and Mining for Journalists” will be taught from Sept. 5 to Oct. 2 by Marco Túlio Pires, the global network coordinator for Escola de Dados, on JournalismCourses.org, the distance learning program of the Knight Center for Journalism in the Americas at the University of Texas at Austin.

This is a BOC (big online course), a type of online course launched last year by the Knight Center to create training opportunities that are more advanced and more specialized than MOOCs (massive open online courses). Unlike MOOCs, which usually have thousands of students and are free, BOCs have a limited number of students and charge a fee. This BOC has a registration fee of US $95, which must be paid by credit card. Registration is limited and can be made at this link.

Raspagem e Mineração de Dados para Jornalistas

“Scraping is important because it gives the journalist the ability to organize their own databases from information that wasn’t gathered in a structured way,” Pires said. “You can automate data collection on the web, from text documents and even photos, which means that tasks that would take a task force weeks or months to complete can be performed in an instant with scraping and programming.”

Participants should have some prior experience with building web pages, with common formats such as CSV and XLS, and with basic data journalism concepts, but they don’t need to have taken the introductory course from the Knight Center, “Basic Techniques of Data Journalism.” All students will have free access to last year’s MOOC materials to learn or review the fundamentals of data journalism.

The BOC will be divided into weekly modules that include multimedia materials, quizzes and discussion forums. Each week will look at a different approach: scraping social networks, scraping PDF files, scraping web pages, and scraping with computer programming in the Python language. Participants will learn the principles of data scraping in the context of journalism through examples and practical exercises.

The majority of the course activities can be completed on days and times chosen by the students. However, there are suggested deadlines for each week. The fee for the course includes an electronic certificate of completion that will be available for students who successfully complete the course. However, this certificate does not count as academic credit.

The data scraping BOC is an advanced version of the five-week massive open online course (MOOC) on data journalism offered by the Knight Center in 2015. Data scraping was addressed in the second module. Now, participants will have the opportunity to hone that particular skill. “Basic Techniques of Data Journalism” reached more than 5,000 people from 92 countries. This is the first time the Knight Center will offer an intermediate-level course on this topic.

Marco Túlio Pires is a journalist and programmer. He was an associate fellow of the Knight-Wallace Fellowship and studied Data Visualization and Statistics at the School of Information at the University of Michigan. Pires also studied project coordination and social businesses in the School of Business at Georgetown University in the Global Competitiveness Leadership program. He worked as a producer and TV news coordinator at TV Globo, as a science reporter for VEJA and as a technical advisor to the office of the Social Development Secretariat of the São Paulo government, responsible for Innovation, Technology and Transparency. Currently, he is the global coordinator of Escola de Dados and one of the founders of the data journalism agency Journalism ++.

“We are delighted with this partnership with Escola de Dados Brazil to bring Brazilian journalists such a practical and useful course that solves specific problems faced on a daily basis by reporters who are specializing in data journalism,” said Professor Rosental Alves, founder and director of the Knight Center at the University of Texas at Austin. “We are surrounded by large amounts of data, and journalists and the media have struggled to get the information and present it in the best and most effective way possible.”

“Data journalism is not a luxury, nor a genre for the most sophisticated and wealthiest newsrooms in the world. It includes a series of techniques and working methods for journalistic investigation that are essential in any newsroom today,” Alves said.

Escola de Dados is a network of non-profit organizations operating in 13 countries and has existed in Brazil since 2013. Its primary focus is to help activists and journalists understand the world of data so it can have the greatest impact on their professional activities. Escola de Dados also hosts a Global Fellowship, identifying and training local leaders to raise the level of information literacy in different parts of the world.

The Knight Center for Journalism in the Americas was created in 2002 by Professor Rosental Calmon Alves, holder of the Knight Chair in Journalism and UNESCO Chair in Communication at the School of Journalism at the University of Texas at Austin. The distance learning program provided by the Knight Center is possible thanks to the support of the John S. and James L. Knight Foundation, the Moody College of Communication at the University of Texas at Austin and other donors, as well as income from registration fees and issued certificates. Since 2012, MOOCs and other online journalism courses from the Knight Center have reached more than 70,000 people from 169 countries.