Chapter 1. Working With Data 22.7

Working with Data: HOW DO WE KNOW? Fig. 22.7

Fig. 22.7 describes a study of shrimp species after a vicariance event, the formation of the Isthmus of Panama. Answer the questions after the figure to practice interpreting data and understanding experimental design. Some of these questions refer to concepts that are explained in the following two brief data analysis primers from a set of four available on LaunchPad:

Experimental Design
Data and Data Presentation

You can find these primers by clicking on the button labeled “Resources” in the menu at the upper right on your main LaunchPad page. Within the following questions, click on “Primer Section” to read the relevant section from these primers. Click on “Key Terms” to see pop-up definitions of boldfaced terms.

Question 1 of 8

Question

Although we think of the formation of the Isthmus of Panama as a geologically instantaneous event (at one moment there was a water connection between the two sides, and at the next there was not), it was likely not instantaneous from the perspective of marine species living close to the shore. As the land bridge slowly formed, the marine channels became shallower and shallower. The growing isthmus therefore isolated populations of a species that specialized in deep water earlier than populations of a species that specialized in shallow water. Here are some data on the depth tolerances for six species of another genus of shrimp, Beteus, on either side of the isthmus.

Species	Current distribution	Depth tolerance (m below surface)
A	Caribbean	0.5-3
B	Pacific	0.5-3
C	Caribbean	8-11
D	Pacific	8-11
E	Caribbean	5-7
F	Pacific	5-7

A study similar to the one illustrated in Fig. 22.7 is carried out on these six species of Beteus. We sequence 10,000 bp of DNA from multiple loci in each species and compare the number of base pair differences between each species pair (A compared to B, C compared to D, E compared to F). Rank the species pairs from the greatest number of base pair differences to the fewest.

[number of differences between A and B] > [number of differences between C and D] > [number of differences between E and F]

[C and D] > [A and B] > [E and F]

[C and D] > [E and F] > [A and B]

[A and B] > [E and F] > [C and D]

[E and F] > [A and B] > [C and D]
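
The comparison described in this question, counting base pair differences between two species, can be sketched in a few lines of Python. This is only an illustration: the function name `count_differences` and the two 12-bp fragments are invented here, and the sketch assumes the sequences are already aligned and of equal length.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Hypothetical 12-bp fragments, invented for illustration only.
species_x = "ATGCCGTAAGCT"
species_y = "ATGACGTTAGCT"

differences = count_differences(species_x, species_y)
percent = 100 * differences / len(species_x)
print(differences, round(percent, 1))
```

Dividing the count by the sequence length turns it into a percent difference, which makes comparisons between fragments of different lengths possible.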

Data and Data Presentation

Processing Data

Initially, we have raw data: our series of observations or measurements. Before we move to the next level of data analysis and presentation, we often need to process the raw data in some way. Sometimes, for example, this may entail transforming a long string of numbers into a data table. To do this, we may need to categorize the data. In our forest example, imagine that over a 24-hour period in our forest patch we count 108 sightings of mammals. The first step is to categorize the sightings according to species and put the data in table form. In this case, we generate a frequency table in which we specify the number of sightings of each of six mammal species, A–F:

Species	A	B	C	D	E	F
Number of sightings	43	47	3	5	7	3

This table illustrates the pitfalls of data collection and how careful we have to be when we design our data collection protocol. How valid are these data? We have seen B’s many times, but maybe each sighting is of the same individual. It is possible that all 47 B sightings were the same individual, whereas perhaps the three F sightings were three different individuals. This suggests that the design of our sampling scheme was flawed. We should redo the census, this time using traps that allow each individual to be marked. Imagine that the revised method results in the following numbers:

Species	A	B	C	D	E	F
Number trapped	17	29	5	2	5	3
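
The difference between counting sightings and counting marked individuals can be sketched as follows. The trap log, the tag IDs, and the variable names are all hypothetical, invented for illustration; they are not the data behind the tables above.

```python
from collections import Counter

# Hypothetical trap log: (species, tag ID) pairs, invented for illustration.
# A recaptured animal appears more than once with the same tag.
trap_log = [
    ("B", "tag-01"), ("B", "tag-01"), ("B", "tag-02"),
    ("A", "tag-03"), ("F", "tag-04"), ("B", "tag-01"),
    ("A", "tag-05"), ("F", "tag-06"),
]

# Raw sightings: every record counts, so recaptures inflate the totals.
sightings = Counter(species for species, _ in trap_log)

# Individuals: each distinct (species, tag) pair counts only once.
individuals = Counter(species for species, _ in set(trap_log))

for species in sorted(sightings):
    print(species, sightings[species], individuals[species])
```

In this made-up log, species B is recorded four times but represents only two animals, which is the kind of inflation the marking scheme is meant to remove.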

Question 3 of 8

Question

In a study of species at a different location, we identify five endemic species of mammals living on an island a short distance from the mainland. We also identify each species’ closest relative on the mainland. Two alternative hypotheses exist to explain this group of species:

i. The ancestors of the current island species dispersed from the mainland to the island and subsequently diverged on the island (dispersal).

ii. The island was originally part of the mainland, but a rise in sea level caused it to become cut off, with the result that the mammals on the island diverged from the ones on the mainland (vicariance).

How would you distinguish between these two hypotheses after sequencing the same segment of DNA in representatives of all 10 species?

If the amount of genetic divergence between species in all five pairs is similar, then vicariance is the likely cause.

If the amount of genetic divergence between species is correlated with the lifespan of both species for each of the five species pairs, then dispersal is the likely cause.

If the amount of genetic divergence between species differs for each of the five species pairs, then vicariance is the likely cause.

If the amount of genetic divergence between species in all five pairs is similar, then dispersal is the likely cause.

If the amount of genetic divergence differs for at least three of the five species pairs, then dispersal is the likely cause.

hypothesis

A tentative explanation for one or more observations that makes predictions that can be tested by experiments or additional observations.

Experimental Design

Types of Hypotheses

A hypothesis, as we saw in Chapter 1, is a tentative answer to a question, an expectation of what the results might be. This might at first seem counterintuitive. Science, after all, is supposed to be unbiased, so why should you expect any particular result at all? The answer is that a hypothesis helps to organize the experimental setup and the interpretation of the data.

Let’s consider a simple example. We design a new medicine and hypothesize that it can be used to treat headaches. This hypothesis is not just a hunch—it is based on previous observations or experiments. For example, we might observe that the chemical structure of the medicine is similar to other drugs that we already know are used to treat headaches. If we went into the experiment with no expectation at all, it would be unclear what to measure.

A hypothesis is considered tentative because we don’t know what the answer is. The answer has to wait until we conduct the experiment and look at the data. When a hypothesis predicts a specific effect, as in the case of the new medicine, it is typical to also state a null hypothesis, which predicts no effect. Hypotheses are never proven, but it is possible, based on statistical analysis, to reject a hypothesis. When a null hypothesis is rejected, the original hypothesis gains support.

Sometimes, we formulate several alternative hypotheses to answer a single question. This may be the case when researchers consider different explanations of their data. Let’s say for example that we discover a protein that represses the expression of a gene. Our question might be: How does the protein repress the expression of the gene? In this case, we might come up with several models—the protein might block transcription, it might block translation, or it might interfere with the function of the protein product of the gene. Each of these models is an alternative hypothesis, one or more of which might be correct.

	Event
	1	2	3	4
Approximate date of isthmus formation (millions of years ago)	3.5	19	6.3	11
% difference in 1000 bp of DNA between closely related species from either side of the isthmus	5	21	7	10
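
One way to read the table above, offered here only as a sketch, is to divide each percent difference by the corresponding date to get a rough rate of divergence per million years. The values are copied from the table; the variable names are assumptions made for illustration.

```python
# Values copied from the table above.
dates_mya = [3.5, 19, 6.3, 11]   # approximate date of isthmus formation
percent_diff = [5, 21, 7, 10]    # % difference in 1000 bp of DNA

# Percent difference accumulated per million years of separation.
rates = [diff / date for diff, date in zip(percent_diff, dates_mya)]

for event, rate in enumerate(rates, start=1):
    print(f"Event {event}: {rate:.2f}% per million years")
```

If the rates come out broadly similar across events, that is consistent with divergence accumulating at a roughly steady pace after each isthmus formed.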

Question 5 of 8

Question

Eight shallow-water marine species (p, q, r, s, t, u, v, and w) are present on either side of an isthmus thought to have been formed about 5 million years ago. Below is the phylogeny for the eight species.

We sequence 10,000 base pairs of DNA for each species and analyze the extent of the difference between the members of each closely related species pair (p and q, r and s, t and u, v and w). The results are in the bar graph below. What can we conclude from these data?

that species p, q, r, s, t, u, v, w all live at similar depths.

that species p, q, t, u, v, w all live at similar depths.

that species r, s, t, u, v, w all live at similar depths.

that species t and u live at a shallower depth than the others.

that species t and u live at a deeper depth than the others.

bar graph

A method of presenting discrete data whereby the height of each category’s bar is proportional to the category’s abundance in the sample.

Data and Data Presentation

Graphing Data

Now we can be confident that our numbers are reliable. The next challenge is to present the data. Typically we do this with a graph. Different kinds of data lend themselves to different kinds of graphs. Our mammal species data are discrete: we have clear categories, A, B, C, D, E, and F. For discrete data, either a pie chart or a bar graph would be appropriate. A pie chart divides a circle into “cake slices,” each representing the proportion of the total contributed by a particular category. In our trapping study, we have a total of 61 animals, so the slice representing species A will make an angle at the center of the pie of 17/61 × 360° ≈ 100°. A bar graph represents the frequency of each species as a column whose height is proportional to frequency.
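
The pie-slice arithmetic for species A extends to every species in the trapping table. The following is a minimal sketch using the counts from that table; the variable names are chosen here for illustration.

```python
# Numbers trapped per species, copied from the trapping table.
trapped = {"A": 17, "B": 29, "C": 5, "D": 2, "E": 5, "F": 3}

total = sum(trapped.values())  # 61 animals in all

# Each slice's angle is its share of the total, scaled to 360 degrees.
angles = {species: count / total * 360 for species, count in trapped.items()}

for species, angle in angles.items():
    print(f"{species}: {angle:.1f} degrees")
```

The angles sum to 360°, which is a quick check that no category has been dropped or counted twice.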

Fig. 1

What about continuous data? Imagine that the data we collected are the body lengths of the mammals we trapped. In this case, we might choose a histogram, which looks similar to a bar graph, except that here we have to impose our own categories on a continuum of data. Because the species are discrete categories, the columns in the bar graph may have gaps between them. In the histogram, by contrast, there are no gaps between the columns because the end of one range (1–20 cm) is continuous with the beginning of the next (20–40 cm).
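
Imposing categories on continuous data amounts to assigning each measurement to a bin. Here is a minimal sketch, assuming a bin width of 20 cm and half-open bins, so that an animal measuring exactly 20 cm falls into the 20–40 cm bin and no measurement is counted twice. The body lengths are hypothetical, invented for illustration.

```python
from collections import Counter

# Hypothetical body lengths in cm, invented for illustration.
lengths_cm = [12.5, 18.0, 20.0, 23.4, 37.9, 41.2, 44.0, 58.6]
bin_width = 20  # assumed bin width, matching the 20 cm ranges in the text

# Floor-divide each length by the bin width to find its bin index.
bin_counts = Counter(int(length // bin_width) for length in lengths_cm)

for index in sorted(bin_counts):
    low, high = index * bin_width, (index + 1) * bin_width
    print(f"{low}-{high} cm: {bin_counts[index]}")
```

Choosing a different bin width changes the shape of the histogram, which is why the categories are something we impose rather than something the data supply.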

Fig. 2

Often we are plotting two variables against each other. If, for example, we record the time of day that each mammal is trapped, we can plot the total number of mammals trapped over the course of the 24-hour period.

	Midnight-2am	2am-4am	4am-6am	6am-8am	8am-10am	10am-noon	noon-2pm	2pm-4pm	4pm-6pm	6pm-8pm	8pm-10pm	10pm-midnight
Number trapped	8	3	2	0	0	0	0	0	1	22	17	8
Cumulative number	8	11	13	13	13	13	13	13	14	36	53	61
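
The cumulative row is a running total of the counts in the row above it. As a minimal sketch, using the numbers from the table and variable names chosen here for illustration:

```python
from itertools import accumulate

# Number trapped in each two-hour interval, copied from the table.
trapped_per_interval = [8, 3, 2, 0, 0, 0, 0, 0, 1, 22, 17, 8]

# Running total: each entry adds the current interval to all earlier ones.
cumulative = list(accumulate(trapped_per_interval))

print(cumulative)
```

The final running total should equal the number of animals trapped overall, which gives a simple consistency check on the table.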

Often one variable is independent—time, for example, will elapse regardless of the mammal count. We plot this on the x-axis, the horizontal axis of the graph. The dependent variable—the values that vary as a function of the independent variable (in this case, time of day)—is plotted on the y-axis, the vertical axis of the graph. If there is reason to believe that consecutive measurements are related to each other, points can be connected to each other by a line. Plotting our data on a graph using the values of the independent and dependent variables as coordinates gives us a line graph. This is a good way to identify trends and patterns in data. Here we can see that the mammals in our forest plot tend to be inactive (and therefore unlikely to be trapped) during daylight hours.

Fig. 3

In science, data are typically presented as a scatterplot, in which points are specified by their (x,y) coordinates. Points are not joined to each other by lines unless there are specified connections among them. Here, plotted in a way similar to the line graph (with the independent variable on the x-axis), is a scatterplot showing the time taken to drive from home to campus for a large number of students. The independent variable is the distance traveled; the dependent variable is travel time because the distances are fixed but travel times vary. Overall, there is a positive correlation between travel time and distance (the farther you live from campus, the longer, on average, it will take you to get there), but there is plenty of variation as well. Look at the eight points representing the eight students who live five miles from campus. The variation we see in travel time (from 6 minutes to 30 minutes) is a reflection of differences in driving speed, traffic conditions, and route.
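
A positive correlation of this kind can be put into a number with the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below uses hypothetical distance and travel-time pairs invented for illustration; they are not the data plotted in the figure, and the function name `pearson_r` is chosen here.

```python
from math import sqrt

# Hypothetical (distance in miles, travel time in minutes) pairs.
commutes = [(1, 5), (2, 9), (5, 6), (5, 30), (5, 18), (8, 25), (10, 28), (12, 40)]


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / sqrt(var_x * var_y)


r = pearson_r(commutes)
print(f"r = {r:.2f}")
```

A value well above 0 but below 1 reflects the pattern described in the text: travel time tends to rise with distance, yet students at the same distance can still differ widely.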

Fig. 4

What if there are more than two variables? Three-dimensional plots can be informative (but can also cause the reader headaches). A popular modern solution to this problem is a so-called temperature plot, in which the third dimension is represented in two dimensions through color: red (hot) for a strong effect in the third dimension and blue (cool) for a weak effect.

Graphs are the mainstay of scientific presentation, but you will see many other ways of presenting data in your textbook. For example, studies showing how different genes interact with each other in the course of development are often illustrated using network diagrams that give the reader a direct sense of the “connectedness” of a particular gene (or node). Evolutionary trees reveal the branching pattern of evolution with species that are closely related having a more recent common ancestor than those that are more distantly related.

Methods of presenting data in science are not limited, even in textbooks, by standard approaches. The popular press has developed many graphics-intense ways of presenting data. Think of an electoral map after an election. You can view information on a number of levels: whether the state is red or blue, the name of the election winner, the size of his or her majority, and so on. Scientists are learning that they too can package information in ways that are simultaneously informative and attractive.

Species	a	b	c	d
a		245	237	130
b			125	233
c				241
d