Appendix B: Database Managers
Database managers are sophisticated computer programs used to manipulate, query, and organize tabular electronic data. The simplest database structure is a flat-file that stores data in rows and columns. When used to analyze data, Excel functions as a flat-file database, allowing for simple operations such as sorting and filtering, as well as more sophisticated summary calculations.
Relational database managers do everything flat-file systems do, but they also allow data to be stored in more than one table. These programs employ sophisticated querying using a special language called SQL. It allows users to analyze their data any way they like, as well as alter and delete the data, and manage the database system itself.
Desktop programs such as Microsoft Access® and Corel Paradox® are best used to work with smaller datasets. Server applications such as Microsoft SQL Server®, Oracle®, MySQL®, and PostgreSQL® work with databases of nearly any size. The latter two are open-source programs that are free to download and use. Any time you enter or retrieve information on a website, it is likely that one of these sophisticated server databases is doing the hard work in the background.
We focus here on Access, a commonly-used desktop application, and one that, so long as the database has no more than a few hundred thousand records of average length, serves well. We are using the 2013 version for the illustrations; it is nearly identical to the 2007 and 2010 editions that preceded it. Users who want a more complete guide to Access may wish to purchase Fred Vallance-Jones and David McKie’s Computer-Assisted Reporting: A Comprehensive Primer, available from Oxford University Press.
Microsoft Access® is a relational database management system (RDMS). It is specially designed to analyze and extract information contained in multi-column lists called "tables."
A relational database allows information to be stored in more than one table, saving both disk storage and memory space. The related information is linked by common ID codes.
In Access, the term database refers to a kind of holding area that includes tables, queries, forms, and other Access objects. The following illustration shows an Access database containing one table.
A table is where actual data are stored. It is made up of columns called fields, and rows called records.
A record contains all of the information about one instance of whatever is being recorded or tracked. For example, a record may contain all of the information about one government grant, one restaurant inspection, one police arrest, or one adverse drug reaction. A record is made up of fields.
A field contains one piece of the information recorded in a table. Sometimes, fields will be referred to as data elements. A field may contain a first or last name, a date, an address, a narrative description, an amount of currency, or just about any kind of information. Together, all of the fields in a record contain all of the information in that record.
A query is a question you ask of the database to extract information or summarize information from the database. Queries may be written in structured query language (SQL), but we will utilize a more user-friendly function in Access, which writes the SQL for us. This interface is called the query grid.
Record length is the total length in bytes (1 byte = 1 character) of a record in a table. The record length multiplied by the number of records gives a rough idea of the size of the table (e.g., if the record length is 100, and there are 10,000 records, the data takes up roughly 1,000,000 bytes, which approximates to 1 megabyte).
A record layout is a list of all of the fields in a table or tables. The list will generally include the field name, its length, and its data type.
Different data types store different kinds of data. Some of the key data types are text, number, currency, date/time, memo (for very long text entries), and yes/no. You need to use the right data type. Text fields can store any characters, but you cannot do math with them (you can count them, however). Number fields can store numbers only, and allow for mathematical operations. Date/time fields are self-explanatory, as are currency fields (a specialized number type). Yes/no fields are used to define whether a certain variable is true or false (e.g., to record whether an event mentioned in another field, such as "test_completed," occurred or not).
The following is an illustration of a table in Microsoft Access. It is a list of credit card transactions in Toronto in June 2014. The database file used in this tutorial is available for download from this site, and the original data is found at http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=f7d66dc329171310VgnVCM1000003dd60f89RCRD&vgnextchannel=1a66e03bb8d1e310VgnVCM10000071d60f89RCRD. If the URL changes, you should be able to find the data by navigating to the City of Toronto’s open data portal.
This table is not unlike the kind that a reporter might use either for a quick story produced close to the deadline, or for a more thorough investigation. It records the expenditures of public money by political figures, always a hot-button issue with readers and electronic-media consumers.
The next image shows what Microsoft calls the "design view" of the same table, displaying the different data types. You can add, remove, and edit field entries in design view. You toggle between these views using the leftmost button on the "home" ribbon.
The simplest way to manipulate the information is to sort it. To do so, click on the ﬁeld name so the entire ﬁeld is selected, and then click on either the A to Z button on the "Home" ribbon, to sort in ascending order, or the Z to A button, to sort in descending order. In the following illustration, the table has been resorted on the merchant name:
Another simple way to manipulate data is to use the "filter by selection" function. Select and highlight a word in one of the fields, as in this illustration:
When you click on the "selection" button (it's on the "home" ribbon and has an icon that looks like a funnel with a lightning bolt beside it), you will be offered four choices, one of which is "Equals [whatever you highlighted in the table]."
The table will be filtered to show only those records in the table that include the selected word in the selected field:
To remove the filter, click on "Toggle filter." Simple sorting and filtering, however, are relatively primitive functions. If you want to see the real power of a database manager, start working with queries. The following illustration shows the query grid with different parts illustrated.
The simplest query duplicates the filter, showing only the records that meet certain criteria. To start designing a query, choose "Query design" on the create ribbon, then in the "Show table" dialogue box, select and add the table you want to use in the query. When you click "close," the empty query grid, populated with the table you chose, will appear.
To pick what fields you want to include, double-click on the selected fields in the small box in the upper left of the query grid. This will drop the fields into the grid.
To view only transactions for merchant THE HOME DEPOT #7080, we would type "THE HOME DEPOT #7080" (include the double quotation marks) into the criteria line below the "MerchantName" field in the grid. If we would like to display the amount of each transaction in descending order, we can pick "descending" from the list that appears when we click in the "sort" line under the "amount" field.
This illustration shows the query ready to run.
To run the query, click on the red exclamation point on the "design" ribbon. This is what the result looks like:
You can also use part of a field entry in the criteria line by using the "like" operator. Instead of typing in THE HOME DEPOT #7080, for example, you can type in *HOME DEPOT*. This would show all Home Depot stores rather than just store 7080. The asterisk is a wild-card character that stands for any number of characters. You can use the asterisk at the beginning or end of a string.
The "show" box below each field controls whether that field is displayed in the final result.
You can use standard mathematical operators in fields with number, currency, or date field types. You can limit results to amounts greater than 100, for example, by writing 100 in the criteria line under the "amount" field. This would eliminate transactions of $100 or less. (To include $100 transactions, you would write >=100.)
You can write queries that limit results to a certain date by entering a date in the criteria line under the date field. For records between two dates, type Between date1 and date2 in the criteria line. You can also use the >, >=, <, and <= operators with date fields.
If you want to limit results based on a yes/no field, type "yes" or "no" in the criteria line.
Database queries take advantage of Boolean logic, a system invented by British mathematician George Boole a century and a half ago. We won't go into the history or its more complex applications here, but the key thing to know is that Boolean logic uses the operators AND, OR, and NOT. You can use these Boolean operators to fine-tune your queries:
- AND allows you to specify that more than one condition be true;
- OR allows you to specify that one or another (or one of several) conditions be true;
- NOT simply reverses the result, so that something must not be true to satisfy the criteria.
If we want to see all credit card transactions at Home Depot that were for an amount more than $100, we would start a new query and bring our chosen fields down into the query grid. Next, we would write Like 0"*Home Depot*" in the criteria line under the MerchantName field, and > 100 under TransAmt, as in the following illustration:
When Access reads across the query grid, it assumes AND is the operator when the criteria line is filled in straight across in this way. This is the result:
The OR operator can be used in two different ways. You can write "OR" between different terms under one field to get the query to return records that meet one of the terms. In our table of credit card transactions, we could write a query that would return transactions either at Home Depot or at Loblaws by writing Like "*Home Depot*" or Like "*Loblaws*" in the criteria line under the MerchantName field, as in this illustration:
You can also use the OR operator between more than one field, by putting the entries in the criteria lines on different levels, as in this example, which would find transactions either by TRANSPORTATION SERVICES or at Canadian Tire.
You'll note in the above illustration that one of the criteria is written in mixed case and the other in upper case. Access is not case sensitive when writing criteria in queries.
Access is also adept at doing simple math, such as counting and adding. Click on the "Totals" button (which is a Greek sigma) on the "Design" ribbon. This will make the "totals" line appear, which is where you enter math instructions.
If we wanted to know how many records in the table relate to each city division, we would start a new query and select the "Division" field. The totals line automatically displays "Group By" under the field. If we changed the "Group By" line under the ID field to "count," the program will give us our answer. We can sort the result from largest to smallest count, by selecting "descending order" in the sort line under the ID field. The query grid would look like this:
Running the query produces this result:
Remember that we are counting the rows in the table, each of which represents one transaction. The ID field is used to do the count because we know that the field contains no "null" or empty values. If the field included null values, the result will be inaccurate because null fields are not counted by Access. This quirk of the program can be a real pain.
When you group results in Access, you are creating subtotals based on the entries in the field or fields. In this case, subtotals are created for each person. If we were to leave out the "name" fields, and still run the count on the ID field only, the query would return the total number of records in the table.
You can also add things up in Access. To add up the total amount spent by each division, we would start a new query and select the "Division" and "TransAmt" fields, click on the "Totals" button, then change the totals line under the "TransAmt" field to "sum." To sort the result, we might choose descending order. The query would look like this:
This is the result:
We could also work out the average amount for each division, the maximum, or the minimum amount, by choosing these entries instead of "sum" in the "group by" line.
To limit the calculation to certain records, we would use the "where" operator in the totals line. We could limit the calculation to "craft supplies" expenses exclusively by adding the "Purpose" field to the query (by double-clicking on the field name in the small box in the upper left), and entering "craft supplies" in the criteria line under the "Purpose" field. (We would need to change the totals line under the new field to "where.") The query now includes only those records involving craft supply expenses.
This is what the query looks like when the results are limited to craft supplies expenses:
And this is what the result looks like:
These are the basic types of queries used most often by journalists. They work equally well on much larger tables. Again, for full explanations of more advanced query types and how they relate to journalistic purposes, consult Vallance-Jones and McKie's Computer-Assisted Reporting.
Other Important Database SkillsImporting data
When you receive data from the government, perhaps by downloading from an open data site or after through a freedom of information request, you will frequently be given a file in Excel format or delimited text format. You can import these directly into Access.
You begin the import operation on the "External Data" ribbon. Click on the appropriate icon to import Excel, text or other types of files.
If the field names are in the first line of the file, make sure you click the "first row contains column headings" checkbox on the first screen of the import wizard (the interface that Access gives you when you are importing data). On the third last screen, you are given the option of having Access add a primary key, which is an automatically generated numbered list, added to the resulting table, assigning a unique number to each record, counting up from 1.
When importing "delimited text," you need to specify what the delimiters are. These will usually be either commas or tabs; they act like column dividers in a file that does not have real columns. Access can also import what are known as "fixed-width text files," in which each field begins a specified number of characters in from the left. In practice, however, you will not run into such files very often.
You will notice you can also "link" to a file. When you link, you don't create a copy of the data in your Access database, but make a connection to the original file. This can be an advantage when the outside file will often be updated. Any updates will be reflected in the mirror-image file in your Access database. When you import a file, you create a snapshot of the data as it was at the moment you imported it.
You will recall we mentioned that queries can be written directly in structured query language, or SQL, and that the query grid writes the SQL for you. If you want to see what the underlying SQL looks like—because i) you know SQL and would like to modify it, ii) you are curious, or iii) you enjoy visual torture—you can view it by clicking on the view button at the top left of the "Design" or home ribbons (when working with a query), and selecting "SQL view."
Be aware that Access® often adds unnecessary parentheses to its SQL, so it isn't as complicated as Access makes it look. Computer-Assisted Reporting: A Comprehensive Primer includes a guide to using SQL in queries.