Web Scraping
Web scraping is the automated extraction of data from websites using scheduled scripts.
On a nightly basis, data are extracted from a number of open data sources, which fall into three categories:
- APIs: databases that can be queried directly by sending a web address (URL), such as:
  - KISTERS Services (KiWIS)
  - AQUARIUS Time-Series Software, Aquatic Informatics Inc.
  - FlowWorks
  - WaterTrax
- File repositories: typically an FTP server hosting a number of general-use files, such as comma-separated values (.csv) files
- HTML tables: human-readable tables posted online are converted into dataframes, the form required for insertion into our database. This is the least reliable source type and requires the most effort.
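As a sketch of the last case, an HTML table can be parsed into a dataframe with pandas; the table markup and column names below are illustrative, not taken from an actual partner page:

```python
import pandas as pd
from io import StringIO

# Hypothetical snippet of a scraped page; column names are illustrative.
html = """
<table>
  <tr><th>Date</th><th>Stage (m)</th></tr>
  <tr><td>2024-01-01</td><td>1.21</td></tr>
  <tr><td>2024-01-02</td><td>1.18</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> found in the input
df = pd.read_html(StringIO(html))[0]
```

In practice the page would first be fetched over HTTP (e.g., with the requests library) before being parsed; layout changes on the source page are what make this approach fragile.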
Notes:
Streamflow discharge and stage are re-scaled to daily mean timesteps when inserted into our database.
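A minimal sketch of this re-scaling, assuming the raw record arrives at a sub-daily (here 15-minute) timestep; the series name and values are made up for illustration:

```python
import pandas as pd

# Hypothetical 15-minute discharge record; in practice these values
# would come from one of the scraped sources listed below.
idx = pd.date_range("2024-01-01", periods=8, freq="15min")
q = pd.Series([1.0, 1.2, 1.1, 1.3, 0.9, 1.0, 1.1, 1.0],
              index=idx, name="discharge_m3s")

# Re-scale to a daily mean timestep before database insertion
daily = q.resample("D").mean()
```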
Sources:
A number of our partners maintain internal databases. ORMGP continues to integrate these sources into our workflow without duplicating data. This is (hopefully) accomplished by having partners establish an Application Programming Interface (API) on their end. Currently, we have:
APIs
- Region of Peel
- York Region
- Durham Region
- TRCA
- LSRCA
- CVC
- CLOCA
- MNRF
File repositories
- MSC Datamart
- WSC HYDAT
- ECCC CaPA-HRDPA
- NOAA SNODAS
HTML tables
- MSC historical (hourly and daily)