Histograms: A Useful Data Analysis Visualization
·
· 2019
· Open Access
·
· DOI: https://doi.org/10.1002/pmrj.12145
· OA: W2914699463
Visualizing data with histograms is an excellent first step in any analysis of quantitative data, but many researchers fail to take advantage of this exploratory data analysis tool. This article gives an overview of histograms and uses examples to illustrate important data features that they can help reveal.

Histograms were one of the earliest types of data visualizations, with references to their use dating back to the 19th century.1 The goal of these graphs is to visualize the shape (distribution) of data for a single quantitative variable such as systolic blood pressure, age, or birthweight. (Notice that histograms are not bar charts. Bar charts are properly used only for displaying counts of categorical variables. Histograms and boxplots display quantitative data.)2

Derived from the Latin root words for "drawn fences," histograms typically consist of a number of adjacent, equal-width vertical columns, drawn so that there is no space between the columns. The columns correspond to "bins" that together span the range of the data. The data are divided among these bins, with the height of each bar corresponding to the number of data points falling into each bin. The taller the bar, the more data points fall into the range of that bin.

Figure 1 shows the length of hospital stays for a random sample of 1000 patients enrolled in Phase I of SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment).3 (The dataset is available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets.) The hospital stays in this study ranged from 3 days to 241 days. The histogram shows 25 bins extending from 0 to 250 days, so that each bin covers a period of 10 days. The data in Figure 1 can be seen to exhibit a strong right or positive skew. This happens when the majority of the data have low values, but a few values are extremely high relative to the rest (ie, the data have a long right tail).
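The binning described for Figure 1 (25 equal-width bins spanning 0 to 250 days) can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts` and the handful of stay lengths are made up for the example and are not the SUPPORT data, and the choice to make bins half-open on the right is an assumption, since the article does not say which bin a boundary value belongs to.

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not (low <= v <= high):
            raise ValueError(f"value {v} lies outside the range [{low}, {high}]")
        # Assumption: bins are half-open [left, right); the last bin also includes `high`.
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return width, counts


# Hypothetical lengths of stay in days (illustrative only, not the SUPPORT sample).
stays = [3, 5, 8, 9, 12, 15, 19, 24, 37, 62, 141, 241]

width, counts = histogram_counts(stays, low=0, high=250, n_bins=25)

# Print a crude text histogram: one row per non-empty bin.
for i, count in enumerate(counts):
    if count:
        left = i * width
        print(f"{left:5.0f} to {left + width:5.0f} days: {'#' * count}")
```

Even with a dozen values, the rows pile up in the first two bins and thin out toward the right, which is the pattern of right skew described above.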
Here, about 75% of the 1000 study patients had short hospital stays with a duration of less than 20 days (almost 475 patients stayed up to 10 days, and nearly 300 stayed between 10 and 20 days). After 20 days, the probability of a patient staying even longer drops dramatically over time, which can be seen in the long, tapering right tail. In fact, only 5 patients stayed longer than 140 days.

Additionally, because hospital stays must be nonnegative, there is a floor effect (or basement effect) at zero. This happens when data values tend to "pile up" near the lowest possible limit of the variable. The combination of skewness and floor effect suggests to researchers that they should avoid most traditional significance tests for these data (such as t-tests, analyses of variance, and tests for correlation coefficients). This is because these tests often rely on the underlying population being relatively symmetric. Using these tests when the data are skewed can lead to incorrect inferences, such as rejecting the null hypothesis when it is actually true (type I error). Researchers should also report the median and IQR (interquartile range) of the sample instead of (or at least in addition to) the mean and SD. This is because the mean and SD can be highly sensitive to skewness and outliers, whereas the median and IQR are more robust measures of center and spread.

Histograms are also useful for comparing the distributions of two subpopulations on the same axes. For example, in 1964 cancer researcher Frits de Waard and colleagues published a paper that investigated the bimodality in age-specific breast cancer incidence that had been observed since before World War II.4 Figure 2 shows a histogram with data simulated to recreate their published histograms.
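Two distributions can be compared on the same axes only if both groups are counted against the same bin edges. The sketch below shows one way to do that with 5-year age bands. The ages are invented for illustration and are not de Waard's data or the simulated data behind Figure 2; the helper names, the 30 to 90 age range, and the group labels are likewise assumptions.

```python
def counts_by_age_band(ages, start=30, stop=90, band=5):
    """Count ages in shared bands [start, start+band), ..., so groups are comparable."""
    counts = [0] * ((stop - start) // band)
    for age in ages:
        if start <= age < stop:
            counts[(age - start) // band] += 1
    return counts


def band_label(index, start=30, band=5):
    """Label a band by its inclusive whole-year limits, e.g. '45-49'."""
    low = start + index * band
    return f"{low}-{low + band - 1}"


# Hypothetical ages at diagnosis (illustrative only).
with_indications = [52, 58, 63, 66, 67, 68, 69, 71, 74, 78, 83]
without_indications = [38, 44, 45, 46, 47, 48, 49, 53, 57, 62, 66]

counts_with = counts_by_age_band(with_indications)
counts_without = counts_by_age_band(without_indications)

# The overlap in each band is the smaller of the two counts (the doubly shaded area).
overlap = [min(a, b) for a, b in zip(counts_with, counts_without)]

peak_with = band_label(counts_with.index(max(counts_with)))
peak_without = band_label(counts_without.index(max(counts_without)))
print("peak band, with indications:   ", peak_with)
print("peak band, without indications:", peak_without)
print("patients in overlapping area:  ", sum(overlap))
```

Because both lists are counted with identical bands, the per-band minimum is a meaningful measure of overlap; with different bin edges for each group it would not be.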
The researchers stratified their sample of 240 patients with mammary carcinoma into two groups: those who exhibited obesity, hypertension, and/or decreased glucose tolerance and those who exhibited none of these symptoms. The former group had its peak around 65-69 years of age, whereas the latter peaked at around 45-49 years. The dark shaded area shows the overlap in the distributions of the two groups. After the age of 75, breast cancer was seen only in women with one of the indications. These histograms, combined with other data, led the researchers to propose that breast cancers develop along two different pathways, ones that researchers now realize correspond to estrogen-receptor-positive cancers and estrogen-receptor-negative cancers.

The most important decision a researcher faces when developing a histogram is the width of the bins. (Alternatively, the number of bins can be chosen: the greater the number of bins, the smaller the bin width.) Bin widths control the "resolution" of the histogram. If the bins are too wide, the histogram becomes very "soft focus," without a clear shape and with many interesting data features obscured. On the other hand, bins that are too narrow produce an overly choppy histogram; this tends to accentuate random artifacts in the data sample and makes it difficult to discern the true distribution of the underlying population data.

See Figure 3, for example, which shows the effect of different bin widths on the interpretation of the data. The variable plotted here is average systolic blood pressure (SBP) of each patient early in the SUPPORT study. Figure 3A shows an undersmoothed histogram of the data, with 90 bins each covering 2 mm Hg. We can see what appears to be bimodality (ie, 2 modes, or clusters of data) in the sample, with one peak around 70 mm Hg and another around 110 mm Hg. The high resolution makes it difficult to determine the overall shape and height of each peak, however.
In Figure 3B the same data are plotted in an overly smoothed histogram with only 7 bins, each one covering 30 mm Hg. In this version we have lost interesting and important data features, such as the bimodality.

Notes: ceil(x) refers to the ceiling function, which returns the smallest integer that is greater than or equal to x. SD refers to the standard deviation. IQR refers to the interquartile range.

Figure 3C shows a histogram using 20 bins, each covering 9 mm Hg (this choice gave the most interpretable bin width in whole numbers). Here the bimodality is clear, with one peak at around 70 mm Hg and another at 110 mm Hg. This shape should be a clear sign to researchers that further investigation is warranted to determine whether there are distinct subgroups in this sample.

Histograms are useful exploratory data visualizations for spotting outliers, skew, bimodality, and other shape features in the distribution as well as for comparing subgroups in the data. The presence of strong skewness or outliers should lead researchers to investigate the use of median and IQR as summary statistics and nonparametric hypothesis tests instead of traditional parametric tests. Bin width/bin number is a tuning parameter that should be experimented with to find the right balance to allow interesting features to emerge from the data. Even if no histograms are included in a final publication, they are a quick and indispensable tool to help researchers catch potential problems in the data and reveal interesting features.
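The advice to prefer the median and IQR when data are skewed or contain outliers can be checked numerically. The following sketch uses Python's standard `statistics` module on a small, invented set of hospital stays (not the SUPPORT data) and compares the summaries before and after one extreme value is added. The helper name `summarize` is hypothetical, and the quartiles use the module's default "exclusive" method, which is only one of several conventions.

```python
import statistics


def summarize(values):
    """Return mean, SD, median, and IQR for a sample."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "median": median,
        "iqr": q3 - q1,
    }


# Hypothetical lengths of stay in days (illustrative only).
stays = [3, 4, 5, 6, 7, 8, 9, 11, 14, 18, 25]
stays_with_outlier = stays + [241]  # add a single very long stay

before = summarize(stays)
after = summarize(stays_with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    change = after[name] - before[name]
    print(f"{name:>6}: {before[name]:7.2f} -> {after[name]:7.2f} (change {change:+.2f})")
```

In this toy sample the single long stay moves the mean and SD far more than it moves the median and IQR, which is the sense in which the latter are called robust.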