Table of Contents
Databases
Short introduction to databases
- Database: A collection of information which is organized for flexible searching and utilization
- Common Database Types:
- Flat (or table) models
- Basically a two-dimensional array of data elements, where all members of a given column are assumed to be similar values, and all members of a row are assumed to be related to one another. (similar to a spreadsheet)
- Relational models
- A relational database can contain more than one table.
- A relation is a connection between tables.
- A relation connects two tables using a common field or several fields.
- Database Glossary:
- Table - A set of data, with fields (columns) and records (rows).
In the states.dbf table, fields exist for the attributes representing state name, FIPS (Federal Information Processing Standard) zone, and state name abbreviation. A single record exists for each state. In the cities.dbf table, fields exist for the attributes representing city name, state, etc., and there is a single record for each city. If the values for a particular field in one table match the values in a particular field in the other table, a relationship can be made between the tables. Because both tables contain the field "STATE_NAME", it should be possible to relate or join the tables based on the common value of this field. In this way, it is possible to make selections on one table that also select related records in the other table, or to combine the tables into a single virtual table. We could select the single record representing the state of Washington in the states.dbf table and have an automatic selection of all records from the cities.dbf table representing cities within Washington state.
- Query - One or more related questions that you can ask about tables.
- In a query you can:
- See a joined or compound table based on a relation
- Only look at certain fields in a table
- Filter out records that you are not interested in
- Calculate quantities as new fields, based on other fields. Similar to calculations you would perform in a spreadsheet, but a calculation applied to one record (row) is applied to all records in the field.
Database Uses
- Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed.
- Databases are in widespread use, especially in enterprises.
- Two typical uses are for 1) Online Transaction Processing (OLTP), which usually keeps up with operational data, like payroll, inventory, or in the banking industry, account balances; and 2) Data warehousing, which brings together large amounts of data for querying or reporting purposes.
- Much better than spreadsheets for analysis of large quantities of scientific and business data (thousands of records)
- In spreadsheets you mainly use formulas to analyze data; in databases you will mainly use queries.
Database Software
- Personal relational databases such as Microsoft Access
- SQL relational databases such as Oracle, IBM DB2, MySQL, MS SQL Server
- SQL (Structured Query Language, pronounced 'sequel') is a standardized language used to query data stored in relational databases.
Note on database assignments
- Unlike the spreadsheet assignments, the Microsoft Access assignments will mostly require you to submit Access reports in the form of PDF files.
- You are required to have your name and assignment number on the PDFs.
- The titles and headings used in the Access-generated reports are required to be clear, descriptive and spelled checked.
- As in the spreadsheet assignments, rules regarding a uniform number of decimal places and units of measure do apply.
Database Assignments 1 & 2
Note: Make sure that your name and the assignment number appears in any printouts that you are required to submit.
Due date: Wed, 10/22
Database Lab 1 - Create and Sort ( handout)
- Save all work (PDF files) to the Z:\cs102\db\lab1 folder.
Database Lab 2 - Filters and Queries ( handout)
- Save all work (PDF files) to the Z:\cs102\db\lab2 folder.
Database Assignments 3 - Calculated Fields
Due: Monday, 10/27
- Save all work to the Z:\cs102\db\lab3 folder.
Download the following comma-delimited file which contains demographic data on all counties in the United States for 1990-1999:
http://rockhopper.monmouth.edu/~jchung/cs102/counties.txt
Import the text file into Excel and save as a DBase IV file (dbf).
Import the dbf file into Access.
Using Access queries, answer the following questions in a Word file called cs102dblab3.doc.
1) What is the loneliest county in the continental U.S. (in terms of population density = 1999 population divided by area (given in square miles))?
2) What is the most fertile county (kids, age 17 and under, as a percentage of the total 1999 population?)
3) What is the least fertile?
4) What percentage of single-parent households (1_M_HOUSEH and 1_F_HOUSEH) in the U.S. are headed by women (1_F_HOUSEH)?
5) Ranking the states, New Jersey ranks in what place in percentage of single-parent households headed by men?
6) Ranking the 100 largest counties (in terms of 1999 population), New York County (Manhattan) ranks in what place in percentage of single-parent households headed by women?
7) Where is divorce most prevalent, that is, what are the five counties in the U.S. with the highest proportion of divorcees [ divorced / (divorced + married + separated + widowed) ]?
8) Where is divorce least prevalent, that is, which five counties have the lowest proportion of divorcees?
9) Which county and which state have the greatest number of mobile homes?
Database Assignments 4 - Poverty in School Districts
Due: Wednesday, 10/29
- Save all work to the Z:\cs102\db\lab4 folder.
Download the following Excel data file:
(Poverty in school districts, estimated 1995, from U.S. Census Bureau)
http://rockhopper.monmouth.edu/~jchung/cs102/poverty.xls
Open poverty.xls in Access. Generate and save the following reports as PDF files. In each report, include the states that the school districts are in, if applicable.
a) The 25 most populous school districts in terms of total number of people. (create dblab4a.pdf)
b) The 25 most populous school districts in terms of total number of kids. (create dblab4b.pdf)
c) All the states (including Washington D.C.) ranked by proportion of poor kids to all kids. (poor kids / kids) (dblab4c.pdf)
d) Top 25 school districts ranked by proportion of poor kids to all kids. (poor kids / kids, where kids > 0) (dblab4d.pdf)
e) Top 25 school districts in New Jersey ranked by proportion of poor kids to all kids. (dblab4e.pdf)
Make sure that your name is on the reports that you generate.
Database Assignment 5 - Toxic Release Inventory
Due: Wednesday, 11/5
Save all work to the Z:\cs102\db\lab5 folder.
1. Download this Excel file containing [USEPA Toxic Release Inventory] data for 2003 in New Jersey:
http://rockhopper.monmouth.edu/~jchung/cs102/tri2003_reduced.xls
2. Open the file in Access to generate a link to the data file.
3. Generate and save the following reports as PDF files (reports should include the amount in pounds of toxic chemicals released and the city and county names where applicable):
(Toxic chemical releases are in pounds.)
(Generate the reports as dblab5a.pdf - dblab5f.pdf)
a) The top 10 individual emitters of the chemical Benzene in the State of NJ in 2003.
b) The top 5 NJ counties with the greatest releases of Chromium & Chromium compounds in 2003.
c) The top 5 NJ cities with the greatest releases of the chemical Ethylene Oxide in 2003.
d) The top emitters of toxic chemicals in your home zip code. (If your home zip code is not in the database, use zip code 08818.)
e) The 10 most emitted toxic chemicals in Monmouth County. (All chemicals in the database are considered toxic.)
f) Total emissions of toxic chemicals from NJ's 11 northern counties: Bergen, Essex, Hudson, Hunterdon, Middlesex, Monmouth, Morris, Somerset, Sussex, Union, Warren. What percentage of NJ's total toxic chemical releases come from just the 11 northern counties?
Make sure that your name is on the reports that you generate.
Database Assignment 6 - Linking Tables I
Due: Wednesday, 11/16
Save ALL work to the Z:\cs102\db\lab6 folder.
- Download the older 1995 NJ Toxic Release Inventory database and open in Access,
http://rockhopper.monmouth.edu/~jchung/cs102/fa05/access/form95.dbf
(Look for the TOTREL field in the 1995 table.)
- Import the 2003 NJ TRI data table into the same database as a new table,
http://rockhopper.monmouth.edu/~jchung/cs102/tri2003_reduced.xls
- Generate and print the following reports as PDF files:
Hints: For numbers 1-4, you'll need to create intermediate queries for 1995 and 2003. For number 2, you may be thrown off by the fact that there were no emissions reported for Cape May County in 1995.
1. The total toxic releases for the entire state of NJ in 2003 and 1995 and the change in lbs. between 1995 and 2003.
2. The 21 NJ counties and their change in toxic releases between 1995 and 2003 as a percent increase or decrease ( (2003-1995)/1995 ).
3. Top 20 toxic chemicals in NJ in terms of percent reduction from 1995 to 2003 ( (2003totrel-1995totrel)/1995totrel is negative).
4. Top 20 toxic chemicals in NJ in terms of percent increase from 1995 to 2003 ( (2003totrel-1995totrel)/1995totrel is positive).
Download and save the tab-delimited 2006 NJ toxic releases data at http://rockhopper.monmouth.edu/~jchung/cs102/nj2006tri.raw.txt . Open the file in Excel. Follow instructions to reduce the data to only the fields we are interested in. Generate this final report:
5. The 21 NJ counties and their change in toxic releases between 1995 and 2006 as a percent increase or decrease ( (2006-1995)/1995 ).
Database Assignment 7 (optional) - Mapping
In this optional assignment, we're going to try mapping the individual toxic emitting facilities from the 2006 NJ TRI data. We will be using Google Maps/Google Earth for this exercise. This will require you to have a Google account, which you should have already created.
Steps:
1) Make sure you're logged into your Google account. We will use the Google Docs spreadsheet for this exercise.
2) Go to Google's Spreadsheet Mapper page . We will largely just follow the steps given there. From there, open the “starter spreadsheet,” which will open a Google Docs spreadsheet in your web browser.
3) Use Access to filter the NJ TRI 2006 data to show only the facility names, latitude, longitude and street address, in that order. This data will be pasted into the Google Docs spreadsheet that you have opened.
4) After our initial attempt at mapping the NJ TRI facilities, we will attempt to apply different color Google Maps placemarks to the facilities, depending on the level of toxic emissions from those facilities. We'll categorize the facilities into 4 groups (templates) based on the level of total releases:
total releases group (template) placemark color -------------- ---------------- --------------- 1 - 9,999 1 green 10,000 - 99,999 2 yellow 100,000 - 999,999 3 orange > 1,000,000 4 red
You'll want to change your copy of the Google Spreadsheet Mapper and rename it to “NJ TRI 2006 Test”. In “NJ TRI 2006 Test,” follow my instructions for replacing the default placemarks for templates 1-4.