Fig. 22.7 describes a study of shrimp species after a vicariance event, the formation of the Isthmus of Panama. Answer the questions after the figure to practice interpreting data and understanding experimental design. Some of these questions refer to concepts that are explained in the following two brief data analysis primers from a set of four available on LaunchPad:
You can find these primers by clicking on the button labeled “Resources” in the menu at the upper right on your main LaunchPad page. Within the following questions, click on “Primer Section” to read the relevant section from these primers. Click on “Key Terms” to see pop-up definitions of boldfaced terms.
Although we think of the formation of the Isthmus of Panama as a geologically instantaneous event (at one moment there was a water connection between the two sides, and at the next there was not), it was likely not instantaneous from the perspective of marine species living close to the shore. As the land bridge slowly formed, the marine channels became shallower and shallower. The growing isthmus therefore isolated populations of a species that specialized in deep water earlier than populations of a species that specialized in shallow water. Here are some data on the depth tolerances for six species of another genus of shrimp, Beteus, on either side of the isthmus.
Species | Current distribution | Depth tolerance (m below surface) |
---|---|---|
A | Caribbean | 0.5-3 |
B | Pacific | 0.5-3 |
C | Caribbean | 8-11 |
D | Pacific | 8-11 |
E | Caribbean | 5-7 |
F | Pacific | 5-7 |
Data and Data Presentation
Processing Data
Initially, we have raw data: our series of observations or measurements. Before we move to the next level of data analysis and presentation, we often need to process the raw data in some way. Sometimes, for example, this may entail transforming a long string of numbers into a data table. To do this, we may need to categorize the data. Returning to our forest example, imagine that over a 24-hour period in our forest patch we count 108 sightings of mammals. The first step is to categorize the sightings according to species and put the data in table form. In this case, we generate a frequency table in which we specify the number of sightings of each of six mammal species, A–F:
Species | A | B | C | D | E | F |
---|---|---|---|---|---|---|
Number of sightings | 43 | 47 | 3 | 5 | 7 | 3 |
This table illustrates the pitfalls of data collection and how we have to be very careful when we design our data collection protocol. How valid are these data? We have seen B’s many times, but maybe each sighting is of the same individual. It is possible that all 47 B sightings were the same individual, whereas perhaps the three F sightings were three different individuals. This suggests that the design of our sampling scheme was flawed. We should re-do the census, only this time using traps that can mark each individual. Imagine that the revised method results in the following numbers:
Species | A | B | C | D | E | F |
---|---|---|---|---|---|---|
Number trapped | 17 | 29 | 5 | 2 | 5 | 3 |
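The categorizing step described above can be sketched in a few lines of Python. This is only an illustration, not part of the primer: the list `trapped` is a hypothetical raw record (one entry per trapped individual), built here so that its tallies match the revised trapping table.

```python
from collections import Counter

# Hypothetical raw record: one species label per trapped individual.
# The counts are chosen to match the revised trapping table.
trapped = ["A"] * 17 + ["B"] * 29 + ["C"] * 5 + ["D"] * 2 + ["E"] * 5 + ["F"] * 3

# Categorize the raw observations by species to get a frequency table.
frequency = Counter(trapped)
total = sum(frequency.values())

for species in sorted(frequency):
    print(species, frequency[species])
print("Total", total)
```

Tallying from individually marked animals, rather than from sightings, is what keeps one animal from being counted many times.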
Which phylogeny is the most likely representation of the relationships among the six Beteus species?
In a study of species at a different location, we identify five endemic species of mammals living on an island a short distance from the mainland. We also identify each species’ closest relative on the mainland. Two alternative hypotheses exist to explain this group of species:
i. The ancestors of the current island species dispersed from the mainland to the island and subsequently diverged on the island (dispersal).
ii. The island was originally part of the mainland, but a rise in sea level caused it to become cut off, with the result that the mammals on the island diverged from the ones on the mainland (vicariance).
hypothesis | A tentative explanation for one or more observations that makes predictions that can be tested by experiments or additional observations. |
Experimental Design
Types of Hypotheses
A hypothesis, as we saw in Chapter 1, is a tentative answer to the question, an expectation of what the results might be. This might at first seem counterintuitive. Science, after all, is supposed to be unbiased, so why should you expect any particular result at all? The answer is that a hypothesis helps to organize the experimental setup and the interpretation of the data.
Let’s consider a simple example. We design a new medicine and hypothesize that it can be used to treat headaches. This hypothesis is not just a hunch—it is based on previous observations or experiments. For example, we might observe that the chemical structure of the medicine is similar to other drugs that we already know are used to treat headaches. If we went into the experiment with no expectation at all, it would be unclear what to measure.
A hypothesis is considered tentative because we don’t know what the answer is. The answer has to wait until we conduct the experiment and look at the data. When a hypothesis predicts a specific effect, as in the case of the new medicine, it is typical to also state a null hypothesis, which predicts no effect. Hypotheses are never proven, but it is possible, based on statistical analysis, to reject a hypothesis. When a null hypothesis is rejected, the original hypothesis gains support.
Sometimes, we formulate several alternative hypotheses to answer a single question. This may be the case when researchers consider different explanations of their data. Let’s say for example that we discover a protein that represses the expression of a gene. Our question might be: How does the protein repress the expression of the gene? In this case, we might come up with several models—the protein might block transcription, it might block translation, or it might interfere with the function of the protein product of the gene. Each of these models is an alternative hypothesis, one or more of which might be correct.
In a study of multiple geological events in which an isthmus has formed separating marine populations on either side, we take the following data:
Event | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Approximate date of isthmus formation (millions of years ago) | 3.5 | 19 | 6.3 | 11 |
% difference in 1000 bp of DNA between closely related species from either side of the isthmus | 5 | 21 | 7 | 10 |
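A quick way to compare the four events is to divide each % difference by the age of the isthmus, which gives a rough rate of divergence per million years. The sketch below does this with the values from the table above. The variable names are ours, and treating divergence as a simple ratio is an assumption made for illustration.

```python
# Values from the table above, for events 1-4.
age_my = [3.5, 19, 6.3, 11]   # isthmus age, millions of years
pct_diff = [5, 21, 7, 10]     # % DNA difference between sister species

# Rough divergence rate: % difference per million years of separation.
rates = [d / t for d, t in zip(pct_diff, age_my)]

for event, rate in enumerate(rates, start=1):
    print(f"Event {event}: {rate:.2f}% per million years")
```

If the rates come out similar across events, that is consistent with DNA differences accumulating roughly in step with time since separation.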
Eight shallow-water marine species (p, q, r, s, t, u, v, and w) are present on either side of an isthmus thought to have been formed about 5 million years ago. Below is the phylogeny for the eight species.
We sequence 10,000 base pairs of DNA for each species and analyze the extent of the difference between the members of each closely related species pair (p and q, r and s, t and u, v and w). The results are in the bar graph below. What can we conclude from this data?
bar graph | A method of presenting discrete data whereby the height of each category’s bar is proportional to the category’s abundance in the sample. |
Data and Data Presentation
Graphing Data
Now we can be confident that our numbers are reliable. The next challenge is to present the data. Typically we do this with a graph. Different kinds of data lend themselves to different kinds of graphs. Our mammal species data is discrete: we have clear categories A, B, C, D, E, and F. For discrete data, either a pie chart or a bar graph would be appropriate. A pie chart divides a circle into “cake slices,” each representing the proportion of the total contributed by a particular category. In our trapping study, we have a total of 61 animals, so the slice representing species A will make an angle at the center of the pie of 17/61 × 360 ≈ 100°. A bar graph represents the frequency of each species as a column whose height is proportional to frequency.
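The pie-chart arithmetic for species A extends to every species in the trapping study. A minimal sketch, using the counts from the revised trapping table (the dictionary name `counts` is our own):

```python
# Number trapped per species, from the revised trapping table.
counts = {"A": 17, "B": 29, "C": 5, "D": 2, "E": 5, "F": 3}
total = sum(counts.values())

# Each slice's angle is its share of the total, scaled to 360 degrees.
angles = {species: n / total * 360 for species, n in counts.items()}

for species, angle in angles.items():
    print(f"{species}: {angle:.1f} degrees")
```

Because every slice is a share of the same total, the angles always add up to 360°.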
What about continuous data? Imagine that the data we collected is the body lengths of the mammals we trapped. In this case, we might choose a histogram, which looks similar to a bar chart; only here we have to impose our own categories on a continuum of data. Because they were discrete categories—different species—the columns in the bar graph may have gaps between them. In the histogram, by contrast, there are no gaps between the columns because the end of one range (1–20cm) is continuous with the beginning of the next (20–40cm).
Often we are plotting two variables against each other. If, for example, we record the time of day that each mammal is trapped, we can plot the total number of mammals trapped over the course of the 24-hour period.
Time of day | Midnight-2am | 2am-4am | 4am-6am | 6am-8am | 8am-10am | 10am-noon | noon-2pm | 2pm-4pm | 4pm-6pm | 6pm-8pm | 8pm-10pm | 10pm-midnight |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Number trapped | 8 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 22 | 17 | 8 |
Cumulative number | 8 | 11 | 13 | 13 | 13 | 13 | 13 | 13 | 14 | 36 | 53 | 61 |
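The cumulative row is a running total of the row above it. A short sketch of that calculation, assuming the twelve two-hour counts are held in a list (the names are ours):

```python
from itertools import accumulate

# Number trapped in each two-hour interval, starting at midnight.
number_trapped = [8, 3, 2, 0, 0, 0, 0, 0, 1, 22, 17, 8]

# Running total: each entry is the sum of all counts up to that interval.
cumulative = list(accumulate(number_trapped))
print(cumulative)
```

The last value of the running total should equal the total number of animals trapped, which is a useful check on the table.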
Often one variable is independent—time, for example, will elapse regardless of the mammal count. We plot this on the x-axis, the horizontal axis of the graph. The dependent variable—the values that vary as a function of the independent variable (in this case, time of day)—is plotted on the y-axis, the vertical axis of the graph. If there is reason to believe that consecutive measurements are related to each other, points can be connected to each other by a line. Plotting our data on a graph using the values of the independent and dependent variables as coordinates gives us a line graph. This is a good way to identify trends and patterns in data. Here we can see that the mammals in our forest plot tend to be inactive (and therefore unlikely to be trapped) during daylight hours.
In science, data are typically presented as a scatterplot, in which points are specified by their (x,y) coordinates. Points are not joined to each other by lines unless there are specified connections among them. Here, plotted in a way similar to the line graph (with the independent variable on the x-axis) is a scatterplot showing the time taken to drive from home to campus for a large number of students. The independent variable is the distance traveled; the dependent variable is travel time because the distances are fixed but travel times vary. Overall, there is a positive correlation between travel time and distance (the further you live from campus, the longer, on average, it will take you to get there), but there is plenty of variation as well. Look at the eight points representing the eight students who live five miles from campus. The variation we see in travel time (from 6 minutes to 30 minutes) is a reflection of differences in driving speed, traffic conditions, and route.
What if there are more than two variables? Three-dimensional plots can be informative (but can also cause the reader headaches). A popular modern solution to this problem is a so-called temperature plot, in which the third dimension is represented in two dimensions through color: red (hot) for a strong effect in the third dimension and blue (cool) for a weak effect.
Graphs are the mainstay of scientific presentation, but you will see many other ways of presenting data in your textbook. For example, studies showing how different genes interact with each other in the course of development are often illustrated using network diagrams that give the reader a direct sense of the “connectedness” of a particular gene (or node). Evolutionary trees reveal the branching pattern of evolution with species that are closely related having a more recent common ancestor than those that are more distantly related.
Methods of presenting data in science are not limited, even in textbooks, by standard approaches. The popular press has developed many graphics-intense ways of presenting data. Think of an electoral map after an election. You can view information on a number of levels: whether the state is red or blue, the name of the election winner, the size of his or her majority, and so on. Scientists are learning that they too can package information in ways that are simultaneously informative and attractive.
We know from the fossil record that three species of fish (a, b, and c) lived in an ancient lake in what is now New Jersey. Their phylogeny was as follows:
Four million years ago, the lake dried out, creating two new lakes, Lake1 and Lake2. There are now a total of six species in the two lakes: d, e, and f in Lake1; and g, h, and i in Lake2. What is the most likely phylogeny for the six modern fish species?
Two lakes, Lake3 and Lake4, were derived from a single parent lake several million years ago. These two lakes contain just four species of fish in total. Species a and b are in Lake3, and species c and d are in Lake4. We sequence 1000 base pairs of DNA from a representative individual of each of these four species. Here is a data matrix showing the number of differences between each pair of species’ DNA sequences.
Species | a | b | c | d |
---|---|---|---|---|
a | - | 245 | 237 | 130 |
b | | - | 125 | 233 |
c | | | - | 241 |
d | | | | - |
What is the most likely phylogeny for these four species of fish?
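When reading a difference matrix like this one, it can help to list the species pairs from fewest to most differences, since more closely related species share a more recent common ancestor. The sketch below only ranks the pairs. The dictionary layout and names are our own choices, and it is not a full tree-building method.

```python
# Pairwise differences (out of 1000 bp) from the data matrix.
differences = {
    ("a", "b"): 245, ("a", "c"): 237, ("a", "d"): 130,
    ("b", "c"): 125, ("b", "d"): 233, ("c", "d"): 241,
}

# Rank the pairs from most similar to least similar.
ranked = sorted(differences.items(), key=lambda item: item[1])

for (s1, s2), n in ranked:
    print(f"{s1}-{s2}: {n} differences ({n / 1000:.1%})")
```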