Research - (2015) Volume 6, Issue 2
Heat maps have been used as a means to visualize high-density information in settings as diverse as astronomy, business analysis, and meteorology. Discovery biology research teams have also used heat maps to visualize gene clusters in genomics investigations or to study amino acid distribution in protein sequence analysis. Commercially available software packages, like Spotfire® or SAS JMP®, afford scientific investigators the ability to construct heat maps and visualize information from studies, yet do not offer any form of summary statistic that would be useful in high-throughput investigations comparing the results of a large number of data visualizations simultaneously or viewing changes in the display longitudinally (over time).
Previously, Juneau suggested the usage of Plotnick’s characterization of lacunarity (1996) for two-dimensional heat map data displays in two colors or shades. For c (c>2) discrete shades (in a monochromatic map) or hues (in a full color display), the author will suggest a modification to Plotnick’s approach using the underlying gliding box approach developed by Allain and Cloitre, but with an alteration in the means of counting features.
Keywords: Heat maps; Gliding box; Lacunarity; Quantification
Background
Heat maps have been employed as a data visualization tool in the social sciences since the late nineteenth century [1-3], and more recently in arenas as diverse as astronomy [4-6], business analysis [7-9], meteorology [10-12] and quantum mechanics [13]. Changes in the ambient conditions of complicated systems like galaxies, the stock market, or large meteorological phenomena can be readily displayed via differences in color or grayscale to assist investigators in hypothesis generating or data interpretation activities. Heat maps afford investigators the ability to study high-density data sets in a single visualization, while maintaining measurement relationships and data integrity [14-16].
The popularity of heat maps has grown substantially with developments in the field of bioinformatics [17]. Numerous examples of published methods employing heat maps exist [18-31]; however, the author has not, to date, seen an attempt to represent the information present in these very widely used data visualizations with a single summary statistic. Commercially available software packages, like Spotfire® or SAS JMP®, afford scientific investigators the ability to construct heat maps and visualize information from studies, yet do not offer any form of summary statistic that would be useful in high-throughput investigations comparing the results of a large number of data visualizations simultaneously or viewing changes in the display longitudinally (over time). Previously, Weinstein (1997) had suggested using a “difference heat map” to compare visualizations pre- and post-treatment. This approach seems tenable if only two time points are under consideration, but would result in several pre-post difference heat maps if one were interested in comparing change with baseline over time or all pair-wise comparisons of time responses.
One approach to numerically summarizing the content of a heat map might be to use a statistic like the percentage of the entire map colored or shaded by a specific hue or degree of brightness. Figure 1.1.1 illustrates three examples of tri-colored heat maps.
One could summarize each heat map by the percentage of tiles of a given color relative to the whole. The use of such a statistic would not differentiate the apparent geometry of the heat map depicted in Figure 1.1.1a from that of the other two. The geometries of Figures 1.1.1 b and 1.1.1 c might suggest an underlying block structure relationship between rows and columns, which possibly could be related to an underlying multivariate mechanism with a block diagonal covariance matrix [32]. Figure 1.1.1 c has a subtle difference in appearance to Figure 1.1.1 b. A black diagonal band bisects the block structure. Thus, a percentage approach does not account for the overall pattern presented in a heat map and would therefore not serve as a useful numerical summary of the relationships suggested.
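The shortcoming of the percentage summary can be sketched in a few lines of Python. The two 4x4 tri-colored maps below are hypothetical (they are not the maps of Figure 1.1.1), and the function name `color_percentages` is an illustrative choice; the point is only that a scattered map and a block-structured map can share exactly the same percentages.

```python
from collections import Counter

def color_percentages(heat_map):
    """Return {color: percentage of tiles} for a heat map given as rows of color labels."""
    counts = Counter(tile for row in heat_map for tile in row)
    total = sum(counts.values())
    return {color: 100.0 * n / total for color, n in counts.items()}

# Hypothetical maps: R = red, G = green, K = black.
scattered = [
    "RGKR",
    "GKRG",
    "KRGK",
    "RGRG",
]
blocked = [
    "RRGG",
    "RRGG",
    "RRGG",
    "KKKK",
]

# Both maps have 6 red, 6 green and 4 black tiles, so the summaries agree
# even though the geometries are clearly different.
print(color_percentages(scattered))
print(color_percentages(blocked))
```

Because the statistic depends only on tile counts, any rearrangement of the tiles leaves it unchanged.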
A second approach could be to characterize the space-filling geometry of the tiles in the heat map via a Hausdorff-like dimension [1]. Consider a non-empty subset of ℜⁿ, say S, which may be covered with sets of diameter at most μ (μ>0), such that the diameter of the covering sets is normalized to S (i.e., the size of S is considered unity).
The Hausdorff dimension is calculated in mathematics as a limiting procedure related to the logarithm of the number of sets of diameter at most, say μ, that cover S as the logarithm of the diameter of these sets approaches zero [33]. In practice, it might be the case that the size of the features that form a pattern might be of interest, as well as the patterns themselves. Thus, usage of the Hausdorff dimension in a strict mathematical sense might not be practical because of the investigator’s desire to consider features at least as large as μ (μ>0).
Consider the bi-colored heat map displayed in Figure 1.1.2 with a covering such that μ=0.1. The heat map in Figure 1.1.2 could be covered by a set as illustrated in Figure 1.1.3.
From the covering it is easy to calculate a Hausdorff dimension for μ=0.1. The major shortcoming of this approach is that, as was the case with the percentage summary, the geometry of two heat maps may be markedly different; however, the Hausdorff dimension could be identical. An example of this phenomenon is illustrated in Figure 1.1.4.
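A minimal sketch of a fixed-scale calculation of this kind follows. It assumes a box-counting form of the dimension, log N(μ) / log(1/μ), where N(μ) is the number of covering boxes of normalized diameter μ that contain at least one shaded tile. That estimator, the function name `box_count_dimension`, the two toy maps and the choice μ=0.5 are illustrative assumptions, not the calculation behind Figures 1.1.2 to 1.1.4.

```python
import math

def box_count_dimension(heat_map, box):
    """Box-counting estimate at a single scale for a square 0/1 heat map.

    heat_map: list of equal-length rows of 0/1 values (1 = shaded tile).
    box: side of the covering box, in tiles.
    """
    n = len(heat_map)
    occupied = 0
    for top in range(0, n, box):
        for left in range(0, n, box):
            window = [heat_map[r][c]
                      for r in range(top, min(top + box, n))
                      for c in range(left, min(left + box, n))]
            if any(window):
                occupied += 1
    mu = box / n  # box diameter normalized so that the map has size one
    return math.log(occupied) / math.log(1.0 / mu)

# Two hypothetical maps with different geometry.
diagonal = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
banded = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# With box=2 (mu=0.5) both maps occupy two covering boxes,
# so the fixed-scale dimension is the same for both.
print(box_count_dimension(diagonal, 2), box_count_dimension(banded, 2))
```

The estimate depends only on how many covering boxes are occupied, not on where they sit, which is why the two maps cannot be told apart.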
This limitation of such dimension measurements was first recognized by Mandelbrot [34] in the context of numerically summarizing fractals. Mandelbrot advocated the usage of a quantity called the lacunarity to numerically summarize fractals. The root of the word lacunarity is the Latin lacuna, meaning “space” or “hole”. Thus, as the Hausdorff dimension characterizes the “space-filling” properties of a set, the lacunarity measures the presence of gaps or holes in the set.
Juneau [1] suggested the usage of lacunarity to characterize heat maps in two colors or shades, based upon a method suggested by Plotnick [35]. Plotnick’s method was based upon a more general case originally developed by Allain and Cloitre [2]. This approach will provide an investigator with a measure of the “gapiness” for one shade or color relative to a second. The balance of this paper will be based upon a suggested method for a setting with c colors or shades, for c>2. Section 1.2 will summarize the gliding box approach of Allain and Cloitre and illustrate its behavior for a heat map in 2 colors or shades. Section 2 will introduce a proposal for a form of modified lacunarity for the setting of c colors or shades (c>2) and highlight the scaling feature of the approach. Section 3 will provide examples of three applications of the technique: cluster analysis, the summarization of longitudinal data in meteorology, and genomics.
The gliding box approach of Allain and Cloitre and the calculation of a heat map’s lacunarity for heat maps in two colors or shades
For some heat map, H, let P represent a board [36] that partitions H:
(1) Define p to be a polygon with s sides and diameter(p)=ρ, where diameter(·)=maximum of the lengths of the s sides of p;
(2) $H = \bigcup_i p_i$, with $p_i \cap p_j = \emptyset$ for $i \neq j$;
(3) diameter($p_i$) = diameter($p_j$) for all i, j.
Without loss of generality, p can be defined to be a rectangle. A subset $p_i \subset P$ can be called a feature of the heat map. Figure 1.2.1 provides an illustration of H, P, and the partitioning of H by P.
Define a box, B, to consist of a set of contiguous features, p, whose union is similar to the polygon, p. Figure 1.2.2 illustrates the relationship between B and P.
The procedure for the gliding box approach is as follows. The box, B, traverses the length of H, beginning at the upper left corner of H. Define an indicator function, χ(p), as follows:
$$\chi(p) = \begin{cases} 1 & \text{if } p \text{ is gray (for } p \in B\text{)} \\ 0 & \text{otherwise} \end{cases} \qquad (1.2.1)$$
Call the sum of the values of the indicator variable χ(p) over all features for the mth movement of B across H the score, $\xi_m$. The box is moved k features to the right for a box B of diameter k. Figure 1.2.3 illustrates the movement of B across H. Thus, for each movement, m, of B across H, a single score is calculated by summing up all of the values of χ(p):
$$\xi_m = \sum_{p \in B} \chi(p) \qquad (1.2.2)$$
The total set of scores, $\xi_m$ (m=1 to T), for the T movements of B across H may now be tallied to form a discrete probability mass function for all possible values from 0 to $k^2$ for a box of diameter, say, k. Call the probability mass function for all values from 0 to $k^2$, Ψ.
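The lacunarity of H (Mandelbrot [34]), Γ(k,T), may then be defined as:
$$\Gamma(k, T) = \frac{M_2}{M_1^2} \qquad (1.2.3)$$
where
$$M_1 = \sum_{\xi=0}^{k^2} \xi \, \Psi(\xi) \qquad (1.2.4)$$
and
$$M_2 = \sum_{\xi=0}^{k^2} \xi^2 \, \Psi(\xi) \qquad (1.2.5)$$
(i.e., $M_1$ and $M_2$ represent the first and second un-centered moments for the scores, ξ).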
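The gliding box procedure for a heat map in two shades can be sketched as follows. This is a minimal illustration under stated assumptions: the heat map is stored as rows of 0/1 values with 1 standing for gray, the box is moved k features at a time as described in the text, and the names `box_scores` and `lacunarity` and the 4x4 map `H` are hypothetical.

```python
from collections import Counter

def box_scores(heat_map, k, shade=1):
    """Score for each box position: the number of features of the given shade.

    The k-by-k box starts at the upper left corner and is moved k features
    at a time, so successive boxes do not overlap.
    """
    n_rows, n_cols = len(heat_map), len(heat_map[0])
    scores = []
    for top in range(0, n_rows - k + 1, k):
        for left in range(0, n_cols - k + 1, k):
            scores.append(sum(heat_map[r][c] == shade
                              for r in range(top, top + k)
                              for c in range(left, left + k)))
    return scores

def lacunarity(scores):
    """Second un-centered moment divided by the squared first moment.

    Assumes at least one score is non-zero (otherwise the first moment is 0).
    """
    T = len(scores)
    psi = {xi: n / T for xi, n in Counter(scores).items()}  # empirical mass function
    m1 = sum(xi * p for xi, p in psi.items())
    m2 = sum(xi ** 2 * p for xi, p in psi.items())
    return m2 / m1 ** 2

# Hypothetical 4x4 map in two shades (1 = gray).
H = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
scores = box_scores(H, k=2)   # [4, 0, 1, 2]
print(scores, lacunarity(scores))
```

For this toy map the moments are 7/4 and 21/4, so the ratio is 12/7, roughly 1.71.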
For two given sets, the one with the larger value of Γ will be a set of gray features more diffusely distributed throughout; i.e., the occurrence of black regions will be more frequent and their size relatively larger. Figure 1.2.4 provides an illustration of three sets H1, H2 and H3 and their corresponding values of Γ.
How was Γ determined for the left-hand panel of Figure 1.2.4? If one employs a gliding box of size k=2, the first row of H1 would have scores of 4, 1, 3, 1 and 3. In the second row, the scores would be 2, 3, 2, 3 and 2. Thus, if one were to proceed moving the gliding box over the entire heat map for the remaining three rows:
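$M_1 = \frac{1}{25}(0 \cdot 2 + 1 \cdot 7 + 2 \cdot 7 + 3 \cdot 8 + 4 \cdot 1) = 49/25$,
$M_2 = \frac{1}{25}(0^2 \cdot 2 + 1^2 \cdot 7 + 2^2 \cdot 7 + 3^2 \cdot 8 + 4^2 \cdot 1) = 123/25$,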
$\Gamma(2) = M_2 / M_1^2 = 3075/2401 \approx 1.28$.
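The arithmetic of this worked example can be checked directly from the tally of scores. The sketch below uses exact fractions; the helper name `moments_from_tally` is illustrative, and the tally is read off the terms of the two moment calculations (2 boxes with score 0, 7 with score 1, 7 with score 2, 8 with score 3 and 1 with score 4).

```python
from fractions import Fraction

def moments_from_tally(tally):
    """tally maps a score to the number of box positions with that score."""
    T = sum(tally.values())
    m1 = Fraction(sum(s * n for s, n in tally.items()), T)
    m2 = Fraction(sum(s * s * n for s, n in tally.items()), T)
    return m1, m2, m2 / m1 ** 2

tally_h1 = {0: 2, 1: 7, 2: 7, 3: 8, 4: 1}   # 25 box positions in total
m1, m2, gamma = moments_from_tally(tally_h1)
print(m1, m2, gamma, float(gamma))
```

With these counts the moments are 49/25 and 123/25, and their ratio M2/M1² evaluates to 3075/2401, approximately 1.28.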
Development of a modified lacunarity using the Allain and Cloitre approach
Consider the heat map, Hc, depicted in Figure 2.1.1. As opposed to the heat map, H, depicted in Figure 1.2.1, this display is in more than two colors or shades. In Figure 1.2.3, recall that the procedure counted the number of features of a desired shade (gray). In essence, the procedure described in Section 1.2 describes the distribution of gray features relative to the distribution of the complementary features (those that are not gray). The goal is to develop a form of measurement that simultaneously summarizes the clustering of colors or shades for feature subsets of Hc relative to the other colors or shades.
One possible approach to developing a modified Allain and Cloitre procedure for multi-shaded or multi-colored images or heat maps would be to count the number of neighboring discordant pairs within B as it traverses over Hc. In an intuitive sense, studying the distributional properties of the discordant pairs of features within the gliding box B provides a summary of the density of sets of features contained within Hc. Just as the lacunarity, Γ, defined in equation 1.2.3 for a two-shaded or bi-colored heat map increases as the number of the subsets with large clusters of features with the desired shade or color decreases, a lacunarity, Γmac, based upon modifying the gliding box algorithm of Allain and Cloitre that counts discordant pairs will increase as the number of subsets with large clusters of any color or shade decreases for a given heat map Hc.
Consider the gliding box, Bc, as defined in Figure 2.1.2. Label each feature $p_{rs}$ to represent the feature in the rth row and sth column (i.e., in a form of matrix element notation). Define a feature, $p_{r's'}$, to be adjacent to $p_{rs}$ if $p_{rs}$ and $p_{r's'}$ have sides with a common vertex. Now, define an indicator function, δ, as follows:
$$\delta(p_{rs}, p_{r's'}) = \begin{cases} 1 & \text{if the color or shade of } p_{rs} \text{ does not agree with that of } p_{r's'} \text{, for adjacent } p_{rs}, p_{r's'} \in B_c \\ 0 & \text{otherwise} \end{cases} \qquad (2.1.1)$$
Call the sum of the values of the indicator variable over all features in Bc the discordance score, Δ. The box is moved k features to the right for a box Bc of diameter k. Thus, for each movement, m, of Bc across Hc, a single score is calculated by summing up all of the values of δ:
$$\Delta_m = \sum_{p_{rs} \in B_c} \; \sum_{p_{r's'} \text{ adjacent to } p_{rs}} \delta(p_{rs}, p_{r's'}) \qquad (2.1.2)$$
Then a modified lacunarity, Γmac, based upon the Allain and Cloitre approach, may be defined in a similar fashion as previously, in equation 1.2.3:
$$\Gamma_{mac}(k, T) = \frac{M_2}{M_1^2} \qquad (2.1.3)$$
for all m movements of Bc across the heat map. The same definition of the first and second moments expressed by the notation in equations 1.2.4 and 1.2.5 would hold for equation 2.1.3, with the discordance scores Δ in place of the scores ξ. The modified lacunarities of four heat map data displays with varying amounts of colored features are shown in Figure 2.1.3.
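The discordance counting can be sketched in Python as follows. This is one reading of the definition, not a definitive implementation: features are treated as adjacent when they share a side or a corner, and each discordant pair is counted once from each of its two features. That reading is an assumption, chosen because for a 2x2 box it yields exactly the scores 0, 6, 8, 10 and 12 that appear in the Section 3.1 calculations. The names `discordance_score` and `modified_lacunarity` and the map `Hc` are hypothetical.

```python
def discordance_score(box):
    """Number of ordered pairs of adjacent features in the box whose colors differ.

    Two features are adjacent when they share a side or a corner.
    """
    k_rows, k_cols = len(box), len(box[0])
    score = 0
    for r in range(k_rows):
        for s in range(k_cols):
            for dr in (-1, 0, 1):
                for ds in (-1, 0, 1):
                    rr, ss = r + dr, s + ds
                    inside = 0 <= rr < k_rows and 0 <= ss < k_cols
                    if (dr, ds) != (0, 0) and inside and box[r][s] != box[rr][ss]:
                        score += 1
    return score

def modified_lacunarity(heat_map, k):
    """M2 / M1**2 of the discordance scores, moving the box k features at a time.

    Assumes at least one box contains a discordant pair.
    """
    n_rows, n_cols = len(heat_map), len(heat_map[0])
    scores = []
    for top in range(0, n_rows - k + 1, k):
        for left in range(0, n_cols - k + 1, k):
            box = [row[left:left + k] for row in heat_map[top:top + k]]
            scores.append(discordance_score(box))
    T = len(scores)
    m1 = sum(scores) / T
    m2 = sum(d * d for d in scores) / T
    return m2 / m1 ** 2

# Hypothetical 4x4 map in three colors labeled A, B and C.
Hc = ["AAAB",
      "AACB",
      "BBCC",
      "BACC"]
print(modified_lacunarity(Hc, k=2))   # box scores are 0, 10, 6, 0
```

For this toy map the first and second moments are 4 and 34, so the modified lacunarity is 34/16 = 2.125. Two of the four boxes are of a single color, which is what drives the value above 1.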
How was the value determined? Using an approach similar to the one used for the calculations of the left-hand panel in Figure 1.2.4, although, in this instance, with k=3:
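$M_1 = \frac{1}{225}(0 \cdot 0 + 1 \cdot 0 + 2 \cdot 38 + 3 \cdot 76 + 4 \cdot 41 + 5 \cdot 48 + 6 \cdot 11 + 7 \cdot 6 + 8 \cdot 5) = 3.80$,
$M_2 = \frac{1}{225}(0^2 \cdot 0 + 1^2 \cdot 0 + 2^2 \cdot 38 + 3^2 \cdot 76 + 4^2 \cdot 41 + 5^2 \cdot 48 + 6^2 \cdot 11 + 7^2 \cdot 6 + 8^2 \cdot 5) = 16.45$,
$\Gamma_{mac}(3) = M_2 / M_1^2 = 1.14$.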
The choice of box size and its influence on the calculation of Γmac
Consider the situation illustrated in Figure 2.2.1. For a given choice of the gliding box’s diameter, the coverage of the box over the heat map can result in a different number of features that may be contained within the gliding box as it completes its first row of coverage. Figure 2.2.1 illustrates three choices of box size. When k=2, note that for each row the movement of the box will result in the same number of partition pieces covered; however, this is not the case for k=3 and k=4. The algorithm suggested in Section 2.1 may still be employed in the circumstances illustrated in Figure 2.2.1 for k=3 and k=4 with minimal effect on the calculation of the modified lacunarity if the number of movements of the box is large relative to the size of the heat map. When a box spans a region outside of the heat map, the author recommends using the convention that the discordances be measured only on the portion of the box that is covering the heat map. This suggested convention is illustrated in Figure 2.2.2.
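The edge convention can be sketched as a box generator that truncates any box overhanging the right or bottom edge to the part that covers the heat map. The 10x10 map size is an assumption chosen so that k=2 tiles the map evenly while k=3 and k=4 do not; the name `clipped_boxes` is illustrative.

```python
def clipped_boxes(heat_map, k):
    """Yield k-by-k boxes moved k features at a time.

    Boxes that extend past the right or bottom edge are truncated to the
    portion that covers the heat map (Python slices clip automatically).
    """
    n_rows, n_cols = len(heat_map), len(heat_map[0])
    for top in range(0, n_rows, k):
        for left in range(0, n_cols, k):
            yield [row[left:left + k] for row in heat_map[top:top + k]]

toy = [[0] * 10 for _ in range(10)]   # hypothetical 10x10 heat map
for k in (2, 3, 4):
    shapes = [(len(box), len(box[0])) for box in clipped_boxes(toy, k)]
    print(k, len(shapes), sorted(set(shapes)))
```

Each truncated box can then be scored in the same way as a full box, so that discordances are counted only among features that lie on the heat map.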
Borys [37] derived the relationship between a reference lacunarity, say $\Gamma_0$, with a gliding box with a diameter $\mu_0$ (normalized to the heat map such that the heat map’s size is considered unity) and the lacunarity determined for a box with a diameter μ (normalized to the same heat map), say, Γ:
$$\Gamma = \Gamma_0 \left( \frac{\mu_0}{\mu} \right)^{D-2} \qquad (2.2.1)$$
where D is the generalized fractal dimension reported in Ott [38]. For a covering of a set, it is possible to determine a fixed D, and a lacunarity calculated on a normalized diameter of μ can be compared to that of one based on a normalized diameter of μ0. Thus, despite the fact that the user of a lacunarity-based technique is free to choose his or her value of scale, the lacunarity values for different choices of scale can be related via equation 2.2.1. If two users cannot agree on a common diameter for the box, one can easily transform the corresponding lacunarity values from one scale to another.
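As a small illustration of how two users might convert between scales, equation 2.2.1 can be written as a one-line function. The numerical values below (a reference lacunarity of 1.20, diameters 0.1 and 0.2, and D = 1.5) are hypothetical, as is the function name `rescale_lacunarity`.

```python
def rescale_lacunarity(gamma0, mu0, mu, D):
    """Equation 2.2.1: Gamma = Gamma0 * (mu0 / mu) ** (D - 2).

    gamma0: lacunarity measured with a box of normalized diameter mu0.
    mu: normalized diameter of the box to convert to.
    D: generalized fractal dimension, assumed known and fixed for the covering.
    """
    return gamma0 * (mu0 / mu) ** (D - 2)

# Hypothetical values: convert from a box of diameter 0.1 to one of diameter 0.2.
gamma_at_02 = rescale_lacunarity(1.20, mu0=0.1, mu=0.2, D=1.5)
print(gamma_at_02)

# Converting back to the original diameter recovers the reference value.
print(rescale_lacunarity(gamma_at_02, mu0=0.2, mu=0.1, D=1.5))
```

The conversion is symmetric in the sense that applying it in the reverse direction returns the starting value, and it leaves the lacunarity unchanged when D = 2.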
An example of the calculation of the modified lacunarity in the arena of cluster analysis with simulated data
As mentioned in Section 1.1, heat maps are used to summarize results frequently in the field of bioinformatics, primarily after a cluster analysis is performed. Packages like SAS JMP® (version 9) allow users to examine the results of cluster analyses via a color map within the multivariate analysis module in the analysis platform of the product. Two data sets were simulated with SAS JMP®. The first data set consisted of 32 8-tuples (items with 8 attributes) of simulated uniform (0,1) variates. Perfect agreement between the components of 25% of the cases was artificially induced in the simulation to create large block structures in the SAS JMP® color map (i.e., to artificially create a very dense cluster of a small set of the items). A second data set consisted of 32 items. The first attribute or component of each item was simulated from a standard Gaussian distribution. For the next 5 of the components, half of the items were assigned linear combinations of the first attribute; the remaining half was assigned random Gaussian noise. The sixth and seventh components consisted of values simulated from a Gaussian distribution. The eighth component was simulated from a uniform (0,1) distribution. The two data sets were independently analyzed using the default options (hierarchical and Ward’s method) in the multivariate analysis module within the analysis platform of SAS JMP®. The results of the cluster analyses for the two data sets are shown in Figure 3.1.1.
Figure 3.1.1: The results of two cluster analyses. Figure 3.1.1 (a) consists of 32 items (8-tuples) of simulated uniform (0,1) variates with a high degree of clustering induced by the constraint of perfect agreement between variates in 25% of the cases. Figure 3.1.1 (b) consists of 32 8-tuples simulated based upon functions of Gaussian variates and linear combinations of a subset of the components of each item.
Γ(2) values were calculated for the two heat maps illustrated in Figures 3.1.1 (a) and (b) as follows:
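$\Gamma_a(2)$: $M_1 = \frac{1}{64}(0 \cdot 18 + 6 \cdot 6 + 8 \cdot 12 + 10 \cdot 20 + 12 \cdot 8) = 6.6875$
$M_2 = \frac{1}{64}(0^2 \cdot 18 + 6^2 \cdot 6 + 8^2 \cdot 12 + 10^2 \cdot 20 + 12^2 \cdot 8) = 64.6250$
$\Gamma_a(2) = M_2 / M_1^2 = 1.45$.
$\Gamma_b(2)$: $M_1 = \frac{1}{64}(0 \cdot 8 + 6 \cdot 9 + 8 \cdot 9 + 10 \cdot 22 + 12 \cdot 16) = 8.40625$
$M_2 = \frac{1}{64}(0^2 \cdot 8 + 6^2 \cdot 9 + 8^2 \cdot 9 + 10^2 \cdot 22 + 12^2 \cdot 16) = 84.4375$
$\Gamma_b(2) = M_2 / M_1^2 = 1.19$.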
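The two calculations can be reproduced from the tallies of discordance scores. In the sketch below the counts are taken from the terms of the sums for Figures 3.1.1 (a) and (b); the helper name `moments` is illustrative.

```python
def moments(tally):
    """Return (T, M1, M2, M2 / M1**2) for a tally of score -> number of boxes."""
    T = sum(tally.values())
    m1 = sum(s * n for s, n in tally.items()) / T
    m2 = sum(s * s * n for s, n in tally.items()) / T
    return T, m1, m2, m2 / m1 ** 2

tally_a = {0: 18, 6: 6, 8: 12, 10: 20, 12: 8}    # Figure 3.1.1 (a)
tally_b = {0: 8, 6: 9, 8: 9, 10: 22, 12: 16}     # Figure 3.1.1 (b)

for label, tally in (("a", tally_a), ("b", tally_b)):
    T, m1, m2, gamma = moments(tally)
    print(label, T, m1, m2, round(gamma, 2))
```

Evaluated this way, the tally for (a) gives moments of 6.6875 and 64.625 and a ratio of about 1.45, while the tally for (b) gives moments of 8.40625 and 84.4375 and a ratio of about 1.19. Map (a) has 18 boxes with no discordance against 8 for map (b), which is the source of its larger value.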
The color map in Figure 3.1.1 (a) contains more regions with large blocks of a single color than the color map in Figure 3.1.1 (b). Figure 3.1.1 (a) contains large blocks of orange, green and blue, while Figure 3.1.1 (b) contains only a single large block of orange. If the two color maps were considered as landscapes, metaphorically, it would be as if Figure 3.1.1 (a) has several flat regions while Figure 3.1.1 (b) is more varied with only one planar region. The modified lacunarities describe this phenomenon: the modified lacunarity calculated for Figure 3.1.1 (a) is larger than the modified lacunarity of Figure 3.1.1 (b), suggesting that the color map of the first has more “holes” or regions of a single color (consistency) than that of the second.
An example of the calculation of the modified lacunarity with longitudinal temperature data from three cities
The modified lacunarity can be applied to real data: the average daily temperatures measured in the first fifteen days of January from the years 1995-2009 for Duluth, Minnesota (USA), Mexico City, Mexico and Lima, Peru. These data can be found at http://www.engr.udayton.edu/weather, the web site for the University of Dayton’s Temperature Data Archive. Data were arranged in three 15x15 arrays. The columns of the arrays represent a single year (beginning with 1995 on the left); the rows represent days 1-15 of January for that year (beginning with 1 January in the first row).
The values were entered in MS Excel 2003. An MS Excel macro, available as freeware from http://bitesizebio.com/2009/02/03/howto-create-a-heatmap-in-excel, was used to generate the heat map summaries illustrated in Figure 3.2.1.
A gross inspection of the data represented in the three heat maps allows the observer to glean some information. As expected, the average daily temperature of Lima, Peru is more consistent than that of the other two cities (because of its latitude and topography), as is evident from the large blocks of color present in the corresponding heat map. Moreover, the average daily temperature of Duluth, Minnesota is more varied, as is evident from the relatively smaller blocks of color present in its corresponding heat map. These evaluations are highly subjective. If one needed to organize a large series of heat maps, of which these three are representative, an index of temperature consistency might be of value.
If one were interested in a single index quantifying the consistency of the temperatures for the three cities in the first 15 days of January for 15 years, he or she could use the modified lacunarity. Suppose that a meteorologist were interested in the agreement of temperatures between three days over three years as an estimate of consistency. He or she could use the modified lacunarity as a possible descriptive statistic. The resulting values are ΓDuluth(1,3)=1.13, ΓMexico City(1,3)=1.19 and ΓLima(1,3)=1.21. Once again, the modified lacunarity summarizes the topology of the heat maps: larger values reflect the presence of more regions with a consistent temperature.
An example of the calculation of the modified lacunarity for a study of longitudinal system-based analysis of transcriptional responses to Type I Interferons
A third illustration of the modified calculation was made based on Figure 1.D in [39], a study of the transcriptional response to Type I Interferons. If one were to use a 2x2 gliding box, it would be possible to use the figure to estimate Γ(2) for the portions of the heat map corresponding to APP x IFN-β1a (upper left-hand 10x6 portion), APP x IFN-α2b (right-hand side, adjacent 10x6 portion), JS x APP x IFN-β1a (12x6 portion below APP x IFN-β1a) and APP x IFN-α2b (12x6 portion right-hand side, adjacent portion). The values would be 1.69, 1.13, 1.10 and 0.91. These values are consistent with the concept of the modified lacunarity: to find large homogeneous blocks within the heat map with little diversity in color. The portion with the most color change is APP x IFN-α2b, suggesting a greater change in activity (varying from very dark blue to very bright yellow, or from -1 to 1, respectively).
The concept of lacunarity introduced by Mandelbrot is used as a summary statistic in many applications [40-45]. To date, the lacunarity statistic has been applied only in settings with images of two colors or shades and has not been applied to summarize the content of heat map data displays. The proposed modified lacunarity statistic affords investigators the option of summarizing or indexing large numbers of heat maps based upon the presence of large monochromatic blocks of features. The modified lacunarity proposed in this work is easy to compute and its interpretation intuitive in applied settings.
The limitation of this proposed statistic is its small variation relative to the larger variation perceived by the interpreter of a heat map. This feature is evident in Figure 3.2.1. The cause of this small variation in the value of the modified lacunarity is most likely the smaller correlations in yearly temperatures for a given day relative to the correlations between days within a given year. Where larger correlations exist between rows and columns in a heat map, larger blocks of uniform color can exist (see Figure 3.1.1). In the presence of larger blocks of color, values of the modified lacunarity can potentially be more discriminating without the need for several decimal places (see Figure 2.1.3).
All of the lacunarity calculations performed in this work were calculated with MS Excel 2003. With the current state of the art in computer software, it seems plausible that one could output the color codes used to produce an image, transfer them to a simple spreadsheet and automate the gliding box process for several box sizes. Thus, due to the relatively simple implementation, ease of calculation and its intuitiveness, the modified lacunarity has the potential to be a new tool that can be used in many scientific arenas to aid in exploratory data analysis and subsequent hypothesis generation.
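A minimal sketch of such an automation is given below, assuming the color codes have already been exported as a grid of hexadecimal strings. The function names, the toy grid and the choice of box sizes are hypothetical; adjacency is taken to mean sharing a side or a corner, each discordant pair is counted from both of its features, and boxes that overhang the edge are truncated to the portion covering the map.

```python
from itertools import combinations

def discordance(box):
    """Discordance score of one box: twice the number of adjacent unlike pairs."""
    cells = [(r, s) for r, row in enumerate(box) for s in range(len(row))]
    pairs = sum(
        1
        for (r1, s1), (r2, s2) in combinations(cells, 2)
        if max(abs(r1 - r2), abs(s1 - s2)) == 1 and box[r1][s1] != box[r2][s2]
    )
    return 2 * pairs  # each unordered pair is counted from both features

def lacunarity_profile(color_grid, box_sizes):
    """Modified lacunarity for several box sizes; None when no discordance exists."""
    n_rows, n_cols = len(color_grid), len(color_grid[0])
    profile = {}
    for k in box_sizes:
        scores = [
            discordance([row[left:left + k] for row in color_grid[top:top + k]])
            for top in range(0, n_rows, k)
            for left in range(0, n_cols, k)
        ]
        m1 = sum(scores) / len(scores)
        m2 = sum(d * d for d in scores) / len(scores)
        profile[k] = m2 / m1 ** 2 if m1 > 0 else None
    return profile

# Hypothetical export of color codes from a 4x4 heat map.
grid = [
    ["#ff0000", "#ff0000", "#ff0000", "#00ff00"],
    ["#ff0000", "#ff0000", "#0000ff", "#00ff00"],
    ["#00ff00", "#00ff00", "#0000ff", "#0000ff"],
    ["#00ff00", "#ff0000", "#0000ff", "#0000ff"],
]
print(lacunarity_profile(grid, box_sizes=(1, 2, 4)))
```

A box of a single feature contains no pairs, so k=1 carries no information, and a single box covering the whole map always gives a value of 1; the informative sizes lie between these extremes.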
The author would like to thank Dr. Dianne Camp for her thorough review and subsequent helpful commentary that improved the overall quality of this work. The author would also like to recognize the excellent suggestion of one of the reviewers to add an example involving longitudinal analysis. This recommendation further aided the intuitive understanding of the principle of the modified lacunarity in practice.