
Statistics 220 Final Exam Model Answer & Marking Guide

1. [10 marks] A plain text format is good because it can be read and written by virtually any software, so the data set in this format can be easily worked with and easily shared amongst several people, regardless of their level of expertise and regardless of the software and hardware that they have to use. The main downside to a plain text format is that there is no information, in the file itself, to describe where each value is within each row of the file. This may not appear to be a huge problem in this case because the file has a fixed-width format, but it is actually impossible for a computer to determine where the columns of data are without a human specifying the width of each column. Another problem is the fact that missing values are stored in this file as full-stops, but there is no indication in the file itself that this is what those full-stops mean. An XML format would be superior because it would make it clear, within the file itself, where every separate piece of information resides and missing values would just not be recorded in the file at all. Plain text files also tend to be larger in terms of the amount of computer memory used to store the data set (each character uses one byte). An XML format would be even worse in this regard, but using a binary format may be more efficient, at the cost of requiring more specialised software to access the data.
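The point about column widths and missing-value codes can be illustrated with a minimal R sketch. The three-column extract, the widths (12, 6, 6), and the column names below are assumptions made for this illustration only; they are not the layout of the exam's data file.

```r
# Hypothetical fixed-width extract; widths and column names are assumed.
lines <- c(sprintf("%-12s%-6s%-6s", "PSR 1257 a", "1991", "."),
           sprintf("%-12s%-6s%-6s", "51 Pegasi b", "1995", "0.47"))
tmp <- tempfile(fileext=".txt")
writeLines(lines, tmp)

# Nothing in the file says where a column starts or ends, so a human
# must supply the widths; na.strings records that "." means missing.
fwf <- read.fwf(tmp, widths=c(12, 6, 6),
                col.names=c("planet", "year", "mass"),
                na.strings=".", strip.white=TRUE)
unlink(tmp)
```

Both the widths and the meaning of the full-stop come from the person writing the code, not from the file, which is the weakness described in the answer.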

2. [10 marks]

3. [20 marks]

mass CDATA #IMPLIED> ]>

4. (i) [2 marks] In a delimited file, the fields within a single row are separated from each other by a special character, e.g., a comma. In a fixed-width file, each field occupies the same number of characters on every row. The Exo-planet data set in Figure 1 gives an example of a fixed-width format (with column widths 21, 5, 8, 8, 14, 5, 20, 15). A comma-delimited version of the same data would look like this:

planet,year,radius,period,type,mass,star,constellation
PSR 1257 a,1991,0.19,25.262,Pulsar,,PSR 1257,
PSR 1257 b,1991,0.36,66.5,Pulsar,,PSR 1257,
PSR 1257 c,1994,0.46,98.2,Pulsar,,PSR 1257,
PSR 1257 d,1994,40,62050,Pulsar,,PSR 1257,
,1995,0.05,4.23,Hot Jupiter,0.47,51 Pegasi,
,1996,0.05,4.62,Hot Jupiter,0.71,Upsilon Andromedae,Andromedae
,,,,,,,Cancer

(ii) [2 marks] The DRY Principle stands for “Don’t Repeat Yourself.” This means that we would like to have only one copy of each piece of information. An example from HTML is the use of a single style element in the head of a document to control the appearance of HTML elements rather than having several identical style attributes in each element that we want to control. The ideas of normalisation in database design and nesting elements in XML design are examples of the DRY Principle applied to data storage.

(iii) [2 marks] Computer code has two audiences: the computer and human readers. Comments and the layout of code are important for human readers because they make it much easier for a human to understand the purpose of a piece of computer code and to find their way around within the code.

(iv) [2 marks] A primary key is a column, or a combination of columns, in a database table that provides a unique value for every row in the table. The primary key makes it possible to distinguish between different rows in a database table. A foreign key is a column (or columns) in a database table that refers to the primary key of another table in the database. Every value in the foreign key must correspond to one value in the primary key that it refers to (or be NULL). Primary keys and foreign keys are used to match the rows between tables when performing a database join to combine the values from two tables.
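The key rules above can be checked mechanically. The following R sketch uses two small hypothetical tables; the id values and the column names `id`, `name`, and `star` are illustrative assumptions, not the exam's schema.

```r
# Hypothetical tables; id values and column names are illustrative only.
star_tbl <- data.frame(id=c(1, 2, 3),
                       name=c("PSR 1257", "51 Pegasi", "Upsilon Andromedae"))
planet_tbl <- data.frame(planet=c("PSR 1257 a", "PSR 1257 b", "51 Pegasi b"),
                         star=c(1, 1, 2))

# Primary key: a unique value for every row of star_tbl.
pkUnique <- !any(duplicated(star_tbl$id))

# Foreign key: every non-missing value must appear in the primary key.
fkValid <- all(is.na(planet_tbl$star) | planet_tbl$star %in% star_tbl$id)

# The join matches each foreign-key value to its primary-key row.
joined <- merge(planet_tbl, star_tbl, by.x="star", by.y="id")
```

Several planets may share one foreign-key value, which is how the join attaches the same star row to more than one planet.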

(v) [2 marks] A literal character is a character that has its normal meaning, such as the letter ‘a’. A metacharacter is a character that has a special meaning within a regular expression. For example, the dollar sign, ‘$’, means the end of a line in a regular expression.

5. [5 marks]

SELECT planet, radius, period
FROM planet_tbl
WHERE type = 'Hot Jupiter'
ORDER BY year DESC;

6. [5 marks]

SELECT planet
FROM planet_tbl
WHERE planet LIKE '% b'
ORDER BY planet;
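For comparison only, the same selection can be sketched in R with a regular expression, where the dollar sign metacharacter from question 4(v) anchors the match to the end of each value. The vector `planet` below is an assumption for this sketch, built from planet names that appear elsewhere in this guide; it is not part of the expected SQL answer.

```r
# Planet names quoted elsewhere in this guide, collected into an assumed vector.
planet <- c("PSR 1257 a", "PSR 1257 b", "PSR 1257 c", "PSR 1257 d",
            "51 Pegasi b", "Upsilon Andromedae b")

# " b$" matches a space, then a literal 'b', then the end of the string,
# which plays the role of the SQL pattern '% b'.
endsInB <- sort(planet[grepl(" b$", planet)])
```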

7. [5 marks]

SELECT planet, st.star, ct.constellation
FROM planet_tbl pt
    INNER JOIN star_tbl st
        ON pt.star = st.id
    LEFT JOIN constellation_tbl ct
        ON st.constellation = ct.id;

8. [5 marks]

+------------+--------------------+
| numPlanets | star               |
+------------+--------------------+
|          1 | 51 Pegasi          |
|          1 | Upsilon Andromedae |
|          4 | PSR 1257           |
+------------+--------------------+

9. [4 marks]

planets[planets$radius > 1, ]

10. [3 marks]

> c(min=min(planets$radius), max=max(planets$radius))
  min   max
 0.05 40.00

11. [3 marks]

> planets[planets$type == "Pulsar", "radius"]
[1]  0.19  0.36  0.46 40.00

12. [4 marks]

aggregate(planets["radius"], list(type=planets$type), mean)
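The answers to questions 9 to 12 can be tried against a small reconstruction of the data frame. The `planets` object below is an assumption: it is rebuilt from values quoted elsewhere in this guide and includes only the columns these answers need, so the real data frame may contain more.

```r
# Assumed reconstruction of the planets data frame (subset of columns).
planets <- data.frame(
    planet=c("PSR 1257 a", "PSR 1257 b", "PSR 1257 c", "PSR 1257 d",
             "51 Pegasi b", "Upsilon Andromedae b"),
    year=c(1991, 1991, 1994, 1994, 1995, 1996),
    radius=c(0.19, 0.36, 0.46, 40, 0.05, 0.05),
    type=c("Pulsar", "Pulsar", "Pulsar", "Pulsar",
           "Hot Jupiter", "Hot Jupiter"))

# Question 9: rows with radius greater than 1
big <- planets[planets$radius > 1, ]
# Question 10: smallest and largest radius as a named vector
radiusRange <- c(min=min(planets$radius), max=max(planets$radius))
# Question 11: radius values for the pulsar planets only
pulsarRadii <- planets[planets$type == "Pulsar", "radius"]
# Question 12: mean radius within each planet type
meanRadius <- aggregate(planets["radius"], list(type=planets$type), mean)
```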

13. [3 marks]

       Hot Jupiter Pulsar
  1991           0      2
  1994           0      2
  1995           1      0
  1996           1      0

14. [3 marks]

> merge(planets, )
                star               planet year radius        type constellation
1          51 Pegasi          51 Pegasi b 1995   0.05 Hot Jupiter       Pegasus
2           PSR 1257           PSR 1257 a 1991   0.19      Pulsar
3           PSR 1257           PSR 1257 b 1991   0.36      Pulsar
4           PSR 1257           PSR 1257 c 1994   0.46      Pulsar
5           PSR 1257           PSR 1257 d 1994  40.00      Pulsar
6 Upsilon Andromedae Upsilon Andromedae b 1996   0.05 Hot Jupiter    Andromedae

15. [4 marks]

gsub(" ", "-", stars$star)

16. [3 marks]

[1] 1 2 3 4

17. [3 marks]

> pieces <- strsplit(as.character(planets$planet[1]), " ")[[1]]
> paste(pieces[1], pieces[2], toupper(pieces[3]))
[1] "PSR 1257 A"