Paul Bradshaw Data Journalism Massive -- scattered notes
October 19, 2011
Scraping
-
Tools:
- OutWit Hub
- Needlebase
- Scraperwiki
- Google Spreadsheets
- Formulae
-
Walkthru using Google Docs (=import)
- Open a spreadsheet
- In A1, type the URL of a page with a table.
- In cell A2, type: =ImportHTML(A1, "table", 1)
-
Function importHTML($source, $element, $index)
- Source = where you're getting data from. Can be a spreadsheet cell.
- Element = which type of element in the HTML document you want to parse ("table" or "list"). Likewise.
- Index = which one of those elements on the page? Ditto.
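A couple of hedged examples (the URL is a made-up placeholder, not from the class; swap in any page with a table or list):
- =IMPORTHTML("http://example.org/stats.html", "table", 2) pulls the second table on the page
- =IMPORTHTML("http://example.org/stats.html", "list", 1) pulls the first bulleted/numbered list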
-
Use Google News RSS; Google Alerts
-
Set up a regular supply of data:
- RSS for regulators, campaigns, gov, EU, ONS, data.gov.uk
- RSS feeds for WDTK, OpenlyLocal, OpenCorporates, OpenCharities, disclosure logs
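Feeds can also be pulled straight into a sheet with =importFeed($url, $query, $headers, $numItems); the feed URL below is a placeholder, substitute whichever regulator/WDTK feed you actually want:
- =IMPORTFEED("http://example.gov.uk/consultations/feed.rss", "items", TRUE, 20) returns the 20 newest items with a header row
- =IMPORTFEED("http://example.gov.uk/consultations/feed.rss", "items title") returns just the headlines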
-
Advanced search stuff (Google search operators, handy for finding data files):
- "filetype:" and "site:" do what you expect: restrict results to a given file type or a given site.
- "~" is for synonyms.
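Example query (made-up topic, just to show the syntax): spending cuts site:gov.uk filetype:xls finds Excel files on government sites; adding ~poverty would also match pages Google treats as using synonyms.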
lunchbreak
-
Using importXML($url, $xpath)
-
Useful xpaths:
- "//div[starts-with(@class, 'jobWrap')]"
- "//p[starts-with(@style, 'font-size: 10pt')]"
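Put together with the function above (assumes the target URL is in A1, as in the earlier walkthrough, and that the page really uses a jobWrap class):
- =IMPORTXML(A1, "//div[starts-with(@class, 'jobWrap')]")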
- =transpose($range) changes from rows to columns.
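For instance, =TRANSPOSE(A1:E1) turns that row into a column, and wrapping it around an import, e.g. =TRANSPOSE(IMPORTHTML(A1, "table", 1)), should flip a scraped table the same way.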
For next class:
- Play around with some scraping tools and write a blog post about it.
- Start shaping your project.
