View on GitHub

Oak Ridges Moraine Groundwater Program - Database Manual

ORMGP Database Manual

Appendix A - Basic Outline and Use of Structured Query Language (SQL)

This is an introduction to the syntax and applied use of SQL with specific reference to the ORMGP database. SQL allows users to access (and manipulate) data from relational databases (or more formally known as a ‘Relational Database Management System’ - RDBMS) in a fairly straightforward manner using a language-natural syntax without (with the exception of advanced queries) having to know computer programming. Almost all cases can be considered to use ‘standard’ SQL (also referred to as ‘ANSI’ SQL) that should be ‘mostly’ applicable across a number of relational database software (e.g. Microsoft SQL Server, Microsoft Access, Oracle, MySQL, SQLITE, etc …). Note that there are slight differences in the supported syntax between these packages. In addition, syntax described below may also be applicable across multiple SQL statement types (instead of just within the example given).

Note that all SQL syntax will be shown in capitals. Comments and placeholders will be shown in lowercase. Single quotes (i.e. ‘example’) are used where text is being examined - note that in some software, double quotes (i.e. “example”) would be used instead.

A.1 Select Statement

Select Statement - Overview

SELECT is the most common statement used within a relational database, allowing the user to ‘select’ specific information from an existing table (or tables). The general syntax is of the form

SELECT <comma-delimited list of fields> FROM <a single table name>

Examining the contents of the D_LOCATION table would require

SELECT * FROM D_LOCATION

which would return all the fields and all the rows within the D_LOCATION table. Note the use of the ‘*’ wildcard, this allows the user to allow selection of all existing fields without having to list the fields individually, as

SELECT 
LOC_ID,
LOC_NAME,
LOC_NAME_ALT1,
LOC_TYPE_CODE,
OWN_ID,
...
FROM
D_LOCATION

where the ‘…’ indicates the remainder of the fields (this is not valid SQL syntax, just a placeholder for descriptive purposes). Note that the fields are comma-delimited and are listed across multiple lines - line spacing (and additional blank space on lines themselves) does not affect the query and can be used as a formatting tool (enabling ease of viewing).

You likely would not want to examine all rows from a particular table at once. If you were assembling a query (i.e. putting a complicated query together based upon succeeding steps) you could choose to examine only the first few records returned using the TOP keyword.

SELECT TOP 1000 LOC_ID,LOC_NAME FROM D_LOCATION

This extracts the location identifier and associated location name from the D_LOCATION table but limits the number of rows returned to the first 1000. The order the rows are returned should not be relied upon - the means for specifying a particular order will be outlined in a subsequent section.

In many cases, the user is looking for particular information. For example, if you wanted to see the names associated with a particular location in the database, given a LOC_ID of ‘-828830263’, then the query

SELECT * FROM D_LOCATION WHERE LOC_ID = -828830263

would return all fields but only a single row from the D_LOCATION table - the row matching the provided location identifier. Other operators, instead of using ‘=’ (equal to) as the test operator, can also be used including

So a user could determine all those boreholes (from the D_BOREHOLE table) with a BH_BOTTOM_DEPTH of 20m by

SELECT * FROM D_BOREHOLE WHERE BH_BOTTOM_DEPTH > 20

The name fields (e.g. LOC_NAME, LOC_NAME_ALT1, etc …) can also be used to extract information about a particular location (stored in D_LOCATION). If only a name was known

SELECT * FROM D_LOCATION WHERE LOC_NAME='Port Perry OW 05/3'

would pull the same information as using the LOC_ID (above) of ‘-828830263’. However, searching through text fields is different from numeric in that rarely do the text fields match the exact text the user enters. Instead of an exact match, the user - when looking for location names containing certain text - can use the LIKE keyword

SELECT * FROM D_LOCATION WHERE LOC_NAME LIKE '%UGAIS%'

This extracts all rows (and fields) of information where the location name contains, anywhere within its name, the ‘UGAIS’ key. Note the use of the ‘%’ signs (in Microsoft Access and other packages, a ‘*’ can be used instead) as part of the quoted text - these are text wildcards specifying that any number of characters can be in front of the ‘UGAIS’ text and that any number of characters can take place after it. This includes the case of no characters before and/or after.

Multiple checks can be made (after WHERE) against the table. This is accomplished through additional boolean (or logical) operators.

SELECT 
*
FROM D_LOCATION
WHERE
LOC_NAME LIKE '%UGAIS%' AND LOC_ORIGINAL_NAME LIKE 'TR%'

In this case, we’re specifying two checks - one against the LOC_NAME field, another against the LOC_ORIGINAL_NAME field. For the latter, notice that the ‘%’ character only appears after the ‘TR’, indicating this text must be placed at the beginning of the name. The boolean operator used is AND - both checks must be true, i.e. ‘UGAIS’ must appear in LOC_NAME and ‘TR’ must be at the beginning of the LOC_ORIGINAL_NAME. Another boolean operation is OR, as in

SELECT
* 
FROM D_LOCATION
WHERE 
LOC_NAME LIKE '%UGAIS%' OR LOC_NAME LIKE '%OGS'

where the rows returned from D_LOCATION depend on whether the ‘UGAIS’ or the ‘OGS’ text appear as part of LOC_NAME. To return to numeric comparisons, if we wanted to find all boreholes from D_BOREHOLE with BH_BOTTOM_DEPTH of greater than 20m but less than or equal to 40m the following could be used

SELECT
*
FROM D_BOREHOLE
WHERE
BH_BOTTOM_DEPTH > 20 AND BH_BOTTOM_DEPTH <= 40

More complicated queries can be built by combining multiple checks on values within a table, for example the following

SELECT 
*
FROM 
D_LOCATION
WHERE
(LOC_COORD_EASTING >= 620000 AND 
 LOC_COORD_EASTING <=640000)
AND
(LOC_COORD_NORTHING >= 4831000 AND 
 LOC_COORD_NORTHING <= 4841000)
AND 
(LOC_NAME LIKE '%UGAIS%' OR LOC_NAME LIKE '%OGS%')

where we are searching for all locations within a designated spatial area (using coordinates; i.e. LOC_COORD_EASTING and LOC_COORD_NORTHING) as well as extracting only those locations which include ‘UGAIS’ or ‘OGS’ in their names.

Select Statement - IN condition

Multiple OR statements can be used to find all locations given a particular characteristic, like any location that could have geological information

SELECT LOC_ID,LOC_NAME,LOC_NAME_ALT1,LOC_TYPE_CODE
FROM D_LOCATION
WHERE
LOC_TYPE_CODE=1
OR LOC_TYPE_CODE=7
OR LOC_TYPE_CODE=11
OR LOC_TYPE_CODE=17
OR LOC_TYPE_CODE=18
OR LOC_TYPE_CODE=19

where the LOC_TYPE_CODES (refer to R_LOC_TYPE_CODE) are

However, this statement can be simplified by using the IN statement which allows the possible values the user is checking against to be listed, as in

SELECT LOC_ID,LOC_NAME,LOC_NAME_ALT1,LOC_TYPE_CODE
FROM D_LOCATION
WHERE
LOC_TYPE_CODE IN (1,7,11,17,18,19)

which is shorter and much easier to understand.

The IN condition also allows us to combine SELECT queries (instead of using, for example, JOINS - as described below). The first SELECT query (actually occurring at the end of the ‘full’ query) would come up with a list of single values

SELECT LOC_ID FROM D_BOREHOLE

Here we’re getting a list of all LOC_ID (i.e. locations) that are found in the D_BOREHOLE table. This is approximately equivalent to using the LOC_TYPE_CODE in the previous example. Now that we have a list of LOC_ID’s, we can pull their names and type codes from D_LOCATION except, this time, we’re using the IN keyword to include the first SELECT statement

SELECT LOC_ID,LOC_NAME,LOC_NAME_ALT1,LOC_TYPE_CODE
FROM D_LOCATION
WHERE 
LOC_ID IN
(
    SELECT LOC_ID FROM D_BOREHOLE
)

This should return equivalent results to that of using the LOC_TYPE_CODE earlier. Again note that, for this type of use, the first SELECT statement (i.e. the one in brackets) should only return a list of a single value (otherwise an SQL error will be returned).

Select Statement - BETWEEN condition

The BETWEEN condition can be used for comparisons for a range of values. Thus, to simplify a previous example

SELECT 
*
FROM 
D_LOCATION
WHERE
LOC_COORD_EASTING BETWEEN 620000 AND 640000
AND
LOC_COORD_NORTHING BETWEEN 4831000 AND 4841000
AND 
(LOC_NAME LIKE '%UGAIS%' OR LOC_NAME LIKE '%OGS%')

with similar results.

Select Statement - Joins

In a relational database, only certain information on a particular object/entity would be contained in a single table (e.g. D_LOCATION containing basic locational information including name, coordinates, location type, etc …). Additional information would then be stored in additional tables (dependent upon the particular location). For example, borehole locations would have information stored in D_BOREHOLE, including depths and elevations of the top and bottom of the hole, the start and ending dates of drilling, etc … A borehole location might also have a screen - that information would be stored in other tables including D_INTERVAL (the interval name and type), D_INTERVAL_MONITOR (the top and bottom of the screen), etc … The information in the tables would be linked (i.e. related) by keys (i.e. primary keys where each row in the table can be distinguished between based upon the key, each being unique). Examples of primary keys in the database include LOC_ID (found in D_LOCATION, D_BOREHOLE, etc …) and INT_ID (found in D_INTERVAL, D_INTERVAL_MONITOR, etc …). Means of combining information from multiple tables, based upon these primary keys, are called Joins. We’ll examine two types - inner joins and outer joins

Select Statement - Inner Joins

All joins use the primary key to relate tables. Inner joins require that this key exists in both tables being referenced. For example, if we wanted the location information (found in D_LOCATION) and borehole elevation and depth for any borehole (in D_BOREHOLE) with depths greater than 20m, our query would look like

SELECT
dloc.LOC_ID,
LOC_NAME,
LOC_NAME_ALT1,
LOC_NAME_ORIGINAL,
BH_GND_ELEV,
BH_BOTTOM_DEPTH
FROM
D_BOREHOLE AS dbore
INNER JOIN
D_LOCATION AS dloc
ON
dbore.LOC_ID = dloc.LOC_ID

where we’re ‘joining’ our two tables (using the INNER JOIN keywords) based upon the presence of LOC_ID (which is our primary key) in each table (note that in the table we’re relating, it’s referred to as the foreign key). We’ve also introduced another concept here, the AS keyword. As some of the fields we’re using occur in both tables (namely, in this case, LOC_ID) we should specify which one we’re actually referencing/using in the query - the AS keyword allows us to use an alias to reference the table (and also fields, but this is not shown here) using a different name. Here, D_BOREHOLE is aliased as ‘dbore’ and D_LOCATION is aliased as ‘dloc’. This allows us to be clear regarding the source of the fields as well as shortening the names required in the query itself (using ‘Dot Notation’, i.e. having a ‘period’ between the name of the table and the name of the field).

Multiple tables can be joined at the same time. If we wanted to include interval data as part of our, above, query then we could use

SELECT
dloc.LOC_ID,
dloc.LOC_NAME,
dloc.LOC_NAME_ALT1,
dloc.LOC_ORIGINAL_NAME,
dbore.BH_GND_ELEV,
dbore.BH_BOTTOM_DEPTH,
dint.INT_ID,
dint.INT_NAME
FROM
D_BOREHOLE AS dbore
INNER JOIN
D_LOCATION AS dloc ON dbore.LOC_ID = dloc.LOC_ID
INNER JOIN
D_INTERVAL AS dint ON dloc.LOC_ID = dint.LOC_ID

Note that we’ve now specified exactly (using the <table>. syntax) which field is coming from which table.

Select Statement - Outer Joins

Outer joins are used when we’re unsure whether the key will appear in one of the tables. Outer joins are different in that they’re specified using a LEFT or RIGHT keyword - this stresses which of the two tables (i.e. the left or right table in the query) is the one we’re using as the base (i.e. which has the key). Using our previous example, if we wanted to include the bedrock elevation for each borehole, knowing that some (most?) boreholes will not actually have bedrock values we would

SELECT
dloc.LOC_ID,
dloc.LOC_NAME,
dloc.LOC_NAME_ALT1,
dloc.LOC_ORIGINAL_NAME,
dbore.BH_GND_ELEV,
dbore.BH_BOTTOM_DEPTH,
dint.INT_ID,
dint.INT_NAME,
vbed.BEDROCK_ELEV
FROM
D_BOREHOLE AS dbore
INNER JOIN
D_LOCATION AS dloc ON dbore.LOC_ID = dloc.LOC_ID
INNER JOIN
D_INTERVAL AS dint ON dloc.LOC_ID = dint.LOC_ID
LEFT OUTER JOIN
V_GEN_BOREHOLES_BEDROCK as vbed ON vbed.LOC_ID = dbore.LOC_ID

Select Statement - Calculations (Arithmetic Operations) and Mathematical Functions

In many cases, a user would wish to combine values in some sort of calculation and return the result of that calculation as part of the query. In the previous query, for example, instead of returning just the borehole bottom depth (i.e. BH_BOTTOM_DEPTH) we may wish to calculate the elevation at the bottom of the borehole as well as the depth at which the bedrock surface is encountered. In which case we would change the above to

SELECT
dloc.LOC_ID,
dloc.LOC_NAME,
dloc.LOC_NAME_ALT1,
dloc.LOC_ORIGINAL_NAME,
dbore.BH_GND_ELEV,
dbore.BH_BOTTOM_DEPTH,
(dbore.BH_GND_ELEV - dbore.BH_BOTTOM_DEPTH) as BH_BOT_ELEV,
(dbore.BH_GND_ELEV - vbed.BH_BEDROCK_ELEV) as BED_DEPTH_M,
dint.INT_ID,
dint.INT_NAME
FROM
D_BOREHOLE AS dbore
INNER JOIN
D_LOCATION AS dloc ON dbore.LOC_ID = dloc.LOC_ID
INNER JOIN
D_INTERVAL AS dint ON dloc.LOC_ID = dint.LOC_ID
LEFT OUTER JOIN
V_GEN_BOREHOLES_BEDROCK as vbed ON vbed.LOC_ID = dbore.LOC_ID

Notice the use of the AS keyword here to assign the name BED_DEPTH_M and BH_BOT_ELEV to the result of the calculations; if not specified, an automatically generated column name would be assigned instead.

Standard arithmetic operations include

The results of ‘Modulo’ (for those unfamiliar with the operation) would be the (integer) remainder after a division operation. (For example: ‘3 % 2’ would return a value of 1; ‘2 % 2’ would return a value of 0.)

In addition there are a number of non-standard (i.e. not necessarily available in all relational database software) mathematical functions that can be useful.

Other functions, including trigonometric, are also available in the various relational database software (but are not described here).

Select Statement - CASE condition

The CASE conditional statement can be used when multiple values are present and the user wants to modify the result based upon the particular value. For example, if we were performing a check on the calculation of depth (in metres) for boreholes based upon their OUOM values there is an issue - the units of the OUOM values can vary; we’ll need to account for this, as follows

SELECT
dbore.LOC_ID
,dbore.BH_BOTTOM_DEPTH
,CASE
WHEN dbore.BH_BOTTOM_UNIT_OUOM LIKE 'mbgs' THEN dbore.BH_BOTTOM_OUOM
WHEN dbore.BH_BOTTOM_UNIT_OUOM  LIKE 'masl' THEN 
(dbore.BH_GND_ELEV - dbore.BH_BOTTOM_OUOM)
WHEN dbore.BH_BOTTOM_UNIT_OUOM LIKE 'fbgs' THEN
(dbore.BH_BOTTOM_OUOM * 0.3048)
ELSE -9999
END AS CHK_BH_BOTTOM_DEPTH
FROM 
D_BOREHOLE as dbore

This returns a single value, calculated in certain instances, in ‘mbgs’ (i.e. metres below ground surface) to compare against BH_BOTTOM_DEPTH; the returned value is assigned the column name CHK_BOTTOM_DEPTH. Note that the default, if none of the conditions are met, is a particular value (i.e. ‘-9999’) that can be used to check for conditions the user has not specified.

Select Statement - ORDER BY

Information returned from a query can, at times, be more useful when it is ordered in some way. This can be the case in particular when working with time-stamped data (i.e. data with an associated date-time). If we, for example, wanted to examine all manual water levels from the borehole ‘Port Perry OW 05/3’ (LOC_ID -828830263) we would first need to determine the interval associated with the location (remember, water level - temporal - information is linked against a particular screen/interval)

SELECT INT_ID,INT_NAME,INT_NAME_ALT1
FROM D_INTERVAL WHERE D_LOCATION=-828830263

This would return the interval identifier (a numeric) and the names associated with the particular location we’re examining. However, we can skip this step by including the determination of the INT_ID as part of the water level query

SELECT
*
FROM
D_INTERVAL_TEMPORAL_2 AS dit2
INNER JOIN
D_INTERVAL AS dint
ON
dit2.INT_ID=dint.INT_ID
WHERE
dint.LOC_ID=-828830263

where we’ve only specified the location identifier. Alternately, the name of the location could be used to extract the particular borehole (and compared against, for example, LOC_NAME). This query, though, extracts all information linked against this location (and interval). We need to limit what is returned to only that in which we are interested

SELECT
 dit2.INT_ID
,dit2.RD_DATE 
,dit2.RD_VALUE 
,ruc.UNIT_DESCRIPTION
FROM
D_INTERVAL_TEMPORAL_2 AS dit2
INNER JOIN
D_INTERVAL AS dint
ON
dit2.INT_ID=dint.INT_ID
INNER JOIN
R_UNIT_CODE as ruc
on
dit2.UNIT_CODE=ruc.UNIT_CODE
WHERE
dint.LOC_ID=-828830263
and dit2.RD_NAME_CODE=628

We have not specified that we’re only interested in water level manual readings (RD_NAME_CODE 628; refer to R_RD_NAME_CODE). Note also that the UNIT_DESCRIPTION from R_UNIT_CODE has been included as a check that all values returned (RD_VALUE) are in consistent units (in this case, all values should be in ‘masl’). The information returned would be in some random order making it difficult to look for trends - we’ll now (finally) order the information by date (RD_DATE)

SELECT
 dit2.INT_ID
,dit2.RD_DATE 
,dit2.RD_VALUE 
,ruc.UNIT_DESCRIPTION
FROM
D_INTERVAL_TEMPORAL_2 AS dit2
INNER JOIN
D_INTERVAL AS dint
ON
dit2.INT_ID=dint.INT_ID
INNER JOIN
R_UNIT_CODE as ruc
on
dit2.UNIT_CODE=ruc.UNIT_CODE
WHERE
dint.LOC_ID=-828830263
and dit2.RD_NAME_CODE=628
ORDER BY
dit2.RD_DATE

By default, the returned information would be in ascending order (i.e. from the earliest date to the latest). If we wished to reverse the order, we would append DESC after the final ‘dit2.RD_DATE’ reference (i.e. at the end of the query).

Select Statement - Aggregate Functions and GROUP BY

If we needed to examine the average, the minimum/maximum or the total number of recorded values of the previous water level manuals we would make use of an aggregate function. This requires, in addition, a GROUP BY statement as part of the query. The GROUP BY allows a list of values to be created based upon a particular field. These functions include

These are used as following

SELECT
 dit2.INT_ID 
,AVG(dit2.RD_VALUE) as AVG_WL
FROM
D_INTERVAL_TEMPORAL_2 AS dit2
INNER JOIN
D_INTERVAL AS dint
ON
dit2.INT_ID=dint.INT_ID
INNER JOIN
R_UNIT_CODE as ruc
on
dit2.UNIT_CODE=ruc.UNIT_CODE
WHERE
dint.LOC_ID=-828830263
and dit2.RD_NAME_CODE=628
GROUP BY
dit2.INT_ID

Here we’re grouping the returned information by the INT_ID - this means that any values related to a particular INT_ID will become part of a (virtual) list. From this list, we then calculate the average value using the AVG keyword. The result would be one value/row returned for each INT_ID. We can perform multiple aggregate functions at once

SELECT
 dit2.INT_ID 
,AVG(dit2.RD_VALUE) AS AVG_WL
,MIN(dit2.RD_VALUE) AS MIN_WL
,MAX(dit2.RD_VALUE) AS MAX_WL
,COUNT(dit2.RD_VALUE) AS NUM_WL
FROM
D_INTERVAL_TEMPORAL_2 AS dit2
INNER JOIN
D_INTERVAL AS dint
ON
dit2.INT_ID=dint.INT_ID
INNER JOIN
R_UNIT_CODE AS ruc
ON
dit2.UNIT_CODE=ruc.UNIT_CODE
WHERE
dint.LOC_ID=-828830263
AND dit2.RD_NAME_CODE=628
GROUP BY
dit2.INT_ID

This query returns the average, minimum, maximum and number of manual water levels associated with this particular location (and interval).

A.2 UNION and UNION ALL

Allows combinations of SELECT statements to be ‘appended’ together resulting in a single output set. The number of fields, the order and the field types must match between the statements. Generally, the names of the fields should match as well. Note that UNION will remove duplicate records between (and possibly within) the statements while UNION ALL will retain all records. Thus, a statement like the following

SELECT
dit1a.SAM_ID
,dit1a.SAM_SAMPLE_NAME
,dit1a.SAM_SAMPLE_DATE
,dit2.RD_NAME_CODE
,dit2.RD_VALUE 
,dit2.UNIT_CODE
,NULL AS RD_MDL
,NULL AS RD_VALUE_QUALIFIER
,NULL AS RD_UNCERTAINTY
FROM
D_INTERVAL_TEMPORAL_2 AS dit2
INNER JOIN
D_INTERVAL_TEMPORAL_1A AS dit1a
ON
dit2.INT_ID=dit1a.INT_ID AND dit2.RD_DATE=dit1a.SAM_SAMPLE_DATE
WHERE
dit2.RD_TYPE_CODE=77

UNION ALL

SELECT
dit1a.SAM_ID
,dit1a.SAM_SAMPLE_NAME
,dit1a.SAM_SAMPLE_DATE
,dit1b.RD_NAME_CODE
,dit1b.RD_VALUE 
,dit1b.UNIT_CODE
,dit1b.RD_MDL
,dit1b.RD_VALUE_QUALIFIER
,dit1b.RD_UNCERTAINTY
FROM
D_INTERVAL_TEMPORAL_1B AS dit1b
INNER JOIN
D_INTERVAL_TEMPORAL_1A AS dit1a
ON
dit1b.SAM_ID=dit1a.SAM_ID

would combine the parameter results from both D_INTERVAL_TEMPORAL_1A/1B and D_INTERVAL_TEMPORAL_2 that were taken at the same date-time for the same interval. Note that, as D_INTERVAL_TEMPORAL_2 does not have uncertainty and mean-detection limit fields, NULL values are used as placeholders. Also, the RD_TYPE_CODE of ‘77’ indicates that the parameter was taken in the field even though it is otherwise generally assumed to be a ‘laboratory’ parameter.