Recall from Chapter 1 that categorical (qualitative) data take values that are non-numeric and are usually classified into categories. In this section, we learn graphical and tabular methods for handling categorical data. Let us begin with an example.
Table 1 shows the 20 most downloaded free apps for the iOS platform, as reported by Apple.com, along with the app type, for June 2014. We will analyze the variable app type, which is a qualitative, not quantitative, variable.
Rank | App | App type | Rank | App | App type |
---|---|---|---|---|---|
1 | Two Dots | Games | 11 | | Social networking |
2 | The Line | Games | 12 | NBC Sports Live | Sports |
3 | Traffic Racer | Games | 13 | | Social networking |
4 | Rival Knights | Games | 14 | FIFA Official App | Sports |
5 | Piano Tiles | Games | 15 | Pandora | Music |
6 | Snap Chat | Photo and video | 16 | Spotify | Music |
7 | | Photo and video | 17 | | Social networking |
8 | The Test | Games | 18 | Emoji Keyboard 2 | Social networking |
9 | Republique | Games | 19 | | Social networking |
10 | YouTube | Photo and video | 20 | SoundCloud | Music |
From this data set, it is not immediately clear which app type is the most popular choice among the 20 apps in the sample. That is why we need ways to summarize the values in a data set. One popular method used to summarize the values in a data set is the frequency distribution (or frequency table).
2.1 Graphs and Tables for Categorical Data
The frequency, or count, of a category is the number of observations in that category. A frequency distribution for a qualitative variable is a listing of all the values (that is, categories) that the variable can take, together with the frequency of each value.
EXAMPLE 1 Frequency distributions
Note: Check that the sum of the frequencies equals the sample size, n.
Create a frequency distribution for the variable app type from Table 1.
Solution
For each app type, we compute the frequency; that is, we count (or tally) how many apps were of that particular app type. Table 2 shows the frequency distribution for the variable app type. For example, five of the apps were social networking apps. The frequency distribution summarizes the data set so that quick observations can be made, such as “The most popular app type in the Apple.com top 20 list of the most downloaded free apps is the Games app type.”
Table 2 Frequency distribution of app type
App type | Tally | Frequency |
---|---|---|
Games | \|\|\|\|\| \|\| | 7 |
Social networking | \|\|\|\|\| | 5 |
Music | \|\|\| | 3 |
Photo and video | \|\|\| | 3 |
Sports | \|\| | 2 |
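For readers who would like to tally the categories with software rather than by hand, the following is a minimal Python sketch of the same counting process. It is only an illustration, not part of the example itself: the list `app_types` is our own transcription of the App type column of Table 1 in rank order, and the variable names are hypothetical.

```python
from collections import Counter

# App type for each of the 20 apps in Table 1, in rank order (1 to 20).
# This list is transcribed by hand from the table (an assumption of the sketch).
app_types = [
    "Games", "Games", "Games", "Games", "Games",
    "Photo and video", "Photo and video", "Games", "Games", "Photo and video",
    "Social networking", "Sports", "Social networking", "Sports", "Music",
    "Music", "Social networking", "Social networking", "Social networking", "Music",
]

# Counter tallies how many times each category appears in the list.
frequency = Counter(app_types)

# List the categories from most frequent to least frequent.
for app_type, count in frequency.most_common():
    print(f"{app_type}: {count}")
```

The counts produced this way should agree with the Frequency column of Table 2.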
The New York City Police Department tracks the number and type of traffic violations. Table 3 contains a random sample of 12 traffic violations and the borough in which they occurred (Manhattan or Brooklyn).
1. Build a frequency distribution of Borough.
2. Construct a frequency distribution of Violation type.
Table 3 Violation type and borough of 12 traffic violations
Violation type | Borough | Violation type | Borough |
---|---|---|---|
Cell phone | Brooklyn | Disobey sign | Manhattan |
Safety belt | Manhattan | Speeding | Brooklyn |
Cell phone | Brooklyn | Safety belt | Manhattan |
Cell phone | Manhattan | Disobey sign | Manhattan |
Speeding | Brooklyn | Disobey sign | Brooklyn |
Safety belt | Manhattan | Cell phone | Manhattan |
(The solutions are shown in Appendix A.)
As the data set gets larger, the need for summarization gets more and more acute. (Imagine if the Apple.com listing consisted of 1000 apps instead of 20.) Take a moment to add up the frequencies in Table 2. What do they add up to? This number is the sample size: n = 20. Now, is this just a coincidence, or does this happen every time?
Actually, this happens every time: the sum of the frequencies equals the sample size, n. One way to check if you made a mistake in forming your frequency distribution table is to add up the frequencies and see if the sum equals the sample size.
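This check is easy to automate. The short Python sketch below, offered only as an illustration, adds up the frequencies from Table 2 and compares the total with the sample size. The dictionary `frequencies` and the other names are hypothetical; the values are copied from Table 2.

```python
# Frequencies copied from Table 2 (app type -> number of apps).
frequencies = {
    "Games": 7,
    "Social networking": 5,
    "Music": 3,
    "Photo and video": 3,
    "Sports": 2,
}

n = 20  # sample size: the number of apps in Table 1

# The frequencies should account for every observation exactly once.
total = sum(frequencies.values())
frequencies_match_n = (total == n)

print(f"Sum of frequencies = {total}, sample size n = {n}")
print("Check passed" if frequencies_match_n else "Check failed: recount the tallies")
```

If the two numbers differ, at least one observation was skipped or counted twice when the table was built.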
Relative Frequency Distributions
Next, suppose you didn’t know the size of the sample in the survey. Suppose you were told only that seven apps were games. The logical question is “Is that a lot?” If our sample size were only 10 apps, then 7 of those apps being games is certainly a lot. However, if our sample size were 1000 apps, then only 7 of those apps being games is not a lot. So, the number’s significance depends on what you compare the seven apps to; that is, “relative to what?” or “compared to what?” In statistics, we compare the frequency of a category with the total sample size to get the relative frequency.
The relative frequency of a particular category of a qualitative variable is its frequency divided by the sample size. A relative frequency distribution for a qualitative variable is a listing of all values that the variable can take, together with the relative frequencies for each value.
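To make the definition concrete, here is a minimal Python sketch that divides each frequency from Table 2 by the sample size, n = 20. It is an illustration under our own naming choices (`frequencies`, `relative_frequency`), not a prescribed method.

```python
# Frequencies copied from Table 2 (app type -> number of apps).
frequencies = {
    "Games": 7,
    "Social networking": 5,
    "Music": 3,
    "Photo and video": 3,
    "Sports": 2,
}

# The sample size is the sum of the frequencies.
n = sum(frequencies.values())

# Relative frequency = frequency divided by sample size.
relative_frequency = {
    app_type: count / n for app_type, count in frequencies.items()
}

for app_type, rel in relative_frequency.items():
    print(f"{app_type}: {rel:.2f}")
```

For instance, the Games category has relative frequency 7/20 = 0.35, so 35% of the apps in the sample are games.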