Identifying emerging research
The effort to understand innovation through an examination of co-citations among scientific and technical papers began at the Institute for Scientific Information (now Thomson Scientific) in the 1970s (Small 1976). As Sullivan et al. (1977) put it, "A series of claims for the technique of co-citation analysis…. The first and most important claim is that co-citation clusters 'reflect the … cognitive structures of research specialties.'" Early research found that citation structure might be used to gain insight into the social structure of science and technology, that is, how knowledge changes and develops over time (Crane 1972; Garfield and Stevens 1965; Small 2003). The idea that schools of thought could be revealed through citation analysis was crystallized conceptually by Crane (1972), building on older ideas about the social structure of science pioneered by Derek Price, Thomas Kuhn, and Robert Merton (Kuhn 1970; Merton 1972; Price 1961, 1963).
Much of the earlier research in this field focused on delineating the structure of science using algorithms that find similar papers and organize them into clusters (Small 1977). Later studies have mapped out specific fields such as scientometrics itself (Chen et al. 2002), management and information science (Culnan 1986, 1987), organizational behavior (Culnan et al. 1990), chemical engineering (Milman and Gavrilova 1993), economics (Oromaner 1981), and space communication (Hassan 2003). More recently, many researchers have focused on the visualization of these fields, developing tools such as crossmapping and DIVA (Morris and Moore 2000), HistCite (Garfield 1988; Garfield et al. 2003), and Pathfinder (White 2003), as well as methods for graphing large-scale maps of science (Small 1997). For a review of the seminal literature see Osareh (1996).
Morris has developed a method to help expert panels evaluate small research topics by organizing papers visually over time and studying the evolution of a topic, such as anthrax research. The focus is on temporal changes and timeline visualizations: documents are first clustered by bibliographic coupling (the sharing of references), and each cluster is then visualized as a horizontal timeline track along which its documents are plotted.
Small has developed a comprehensive method for identifying and tracking research fronts (Small 2003) based on the co-citation of highly cited papers. We build on Small’s methodology and suggest additional steps to weed out certain artifactual clusters.
Methodology for delineating research area
Unlike most methods for analyzing research areas, co-citation clustering makes no a priori assumptions about what research areas exist. Rather, it selects whatever papers are highly cited using a global criterion and clusters these papers based on their pattern of co-citation. One limitation of this method is that it will not identify a specialty if none of its papers has become highly cited. Thus, the co-citation method will not detect an area immediately upon its emergence, but rather at some later stage in its development. In addition, the method does not identify all papers that might be considered relevant to the area. Rather, it is designed simply to detect that a research area exists and to provide a sample of its highly cited and citing papers. The distinguishing feature of the method is that it performs a quick screening of the scientific landscape rather than a definitive delineation of some specific area.
For the purpose of this study, we have defined highly cited papers as the top 1% of papers in each of 22 broad disciplines (for field definitions, see http://www.in-cites.com/field-def.html). Since our goal is an automatic and easily updated surveillance of the scientific literature, the 1% threshold should be viewed as a parameter that can be adjusted up or down depending on the desired resolution of the analysis. The same 1% thresholds by discipline are used in Thomson Scientific's Essential Science Indicators (ESI) web product.
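The per-discipline percentile cut can be sketched as follows. This is a minimal illustration, not the ESI implementation: the dictionary keys (`id`, `discipline`, `citations`) and the input format are assumptions for the example.

```python
from collections import defaultdict

def highly_cited(papers, top_fraction=0.01):
    """Select the top fraction of papers by citation count within each
    discipline (the 1% cut-off described in the text).

    `papers` is a list of dicts with illustrative keys 'id', 'discipline',
    and 'citations'.
    """
    by_field = defaultdict(list)
    for p in papers:
        by_field[p["discipline"]].append(p)
    selected = []
    for field, group in by_field.items():
        group.sort(key=lambda p: p["citations"], reverse=True)
        # Keep at least one paper per field so small fields are not empty.
        k = max(1, int(len(group) * top_fraction))
        selected.extend(group[:k])
    return selected
```

Applying the cut per discipline, rather than globally, keeps fields with low average citation rates represented in the sample.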
For this product roughly 30,000 papers are clustered on a bimonthly basis and grouped into co-citation clusters through a single-link process. All co-citations for the selected highly cited papers in the 22 fields are computed prior to clustering. A co-citation link is defined as a pair of highly cited papers co-cited two or more times. The integer co-citation frequency is normalized by dividing it by the square root of the product of the citation counts of the two papers, the so-called cosine similarity. A force-directed placement mapping method can be used to display the strongest co-citation links within a given cluster, as we show in Figs. 1–4.
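The link construction and normalization just described can be sketched as follows. The data structures (`citing_refs` mapping each citing paper to its reference set, and the two citation counts) are illustrative inputs, not the actual ESI records.

```python
import math
from collections import Counter
from itertools import combinations

def cocitation_links(citing_refs, highly_cited, min_cocites=2):
    """Count co-citations among highly cited papers: a pair is linked when
    two or more citing papers reference both, the threshold used in the text."""
    counts = Counter()
    hc = set(highly_cited)
    for refs in citing_refs.values():
        # Every pair of highly cited papers in one reference list is co-cited once.
        for pair in combinations(sorted(refs & hc), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_cocites}

def cosine_similarity(cocites, cites_a, cites_b):
    """Normalize the integer co-citation frequency by the square root of the
    product of the two papers' citation counts (their geometric mean)."""
    return cocites / math.sqrt(cites_a * cites_b)
```

The cosine normalization keeps very highly cited papers from dominating the link structure simply because they appear in more reference lists.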
Single-link clustering is a simple and rapid way to extract strong patterns of links among very large sets of tens of thousands or even hundreds of thousands of papers, and is suitable for the large-scale and periodic analyses required by ESI. Because only a single co-citation link at a specified threshold is needed to join a paper to an existing cluster, the method has a tendency to create very large clusters by chaining unless the co-citation threshold is carefully controlled. Studies of individual clusters have shown that by varying this linkage threshold, upward for larger clusters or downward for smaller clusters, it is possible to identify the level at which chaining begins and below which the size of the cluster increases exponentially rather than linearly. It is possible to optimize the threshold for each single-link cluster by picking the lowest possible link threshold prior to this onset of sudden expansion. This can be achieved in an approximate way by defining a low starting level, setting a maximum cluster size, and incrementing the threshold until a cluster within the desired size limit is formed, in effect optimizing each cluster. The process is analogous to pruning a tree where none of the pruned branches are larger than a preset size (Small 1985).

Figs. 1–4 (clockwise from upper left) Discipline size over time; frequency count by discipline presented first
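A rough sketch of this threshold-incrementing scheme is given below. It is a simplification under assumed data structures (a dict mapping paper pairs to cosine similarities), not the production ESI code; the parameter values match those described in the text.

```python
def single_link(pairs, threshold):
    """Single-link clusters: connected components of the graph whose edges
    are the pairs with similarity >= threshold (union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for (a, b), sim in pairs.items():
        if sim >= threshold:
            union(a, b)
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

def prune_clusters(pairs, start=0.3, step=0.1, max_size=50):
    """Raise the threshold on any over-size cluster until it splits into
    pieces no larger than max_size: the 'tree pruning' analogy in the text."""
    final = []
    work = [(set().union(*[{a, b} for a, b in pairs]), start)]
    while work:
        members, t = work.pop()
        sub = {p: s for p, s in pairs.items()
               if p[0] in members and p[1] in members}
        for c in single_link(sub, t):
            if len(c) > max_size and t < 1.0:
                work.append((c, t + step))  # re-cluster at a stricter level
            else:
                final.append(c)
    return final
```

For example, a chain of six papers joined by one weak middle link forms a single over-size cluster at the starting level; one increment of the threshold severs the weak link and yields two compact clusters.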
The starting threshold used in this experiment was 0.3 of the cosine-normalized co-citation, but individual clusters may form at higher thresholds. The clustering parameters, along with the initial citation thresholds, are adjustable and control the level of resolution desired by the analyst. The cosine similarity of 0.3, a maximum cluster size of 50, and a cosine increment of 0.1 are used in this analysis and are the same as those used in ESI. We have purposefully selected very high co-citation thresholds because speed of analysis rather than high resolution was sought.
To track clusters over time we examine successive time slices of data to determine the patterns of continuing highly cited papers from one dataset to the next. Such patterns of continuity are referred to as cluster strings. A new or emerging area is defined as a cluster of highly cited papers in one time period whose papers did not appear in any clusters in the immediately preceding time period. Continuing areas can be distinguished by whether they merge, split, or continue from a single previous group.

Table 1 Statistics on clusters in two time periods
Two sets of co-citation clusters were used, representing two overlapping six-year time periods: 1998–2003 and 1999–2004. A six-year time frame provides ample time for papers to reach their peak citation year, which usually occurs in year 3 or 4 after publication, with citations diminishing thereafter. Thus a given cluster can include both older and newer papers, but excludes classic or very highly cited method papers published before the window. Both cited and citing papers are restricted to these time spans. In our case studies we also include some of our results from clustering earlier time periods. Table 1 gives statistics on the datasets used: the number of clusters, highly cited papers, average citations per paper, and average publication year of papers.
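The period-to-period comparison can be sketched as set overlap between cluster memberships. This is a simplification of the cluster-string bookkeeping: clusters are represented as sets of paper identifiers, and the labels are illustrative (a full treatment would also trace splits from the previous period's side).

```python
def classify_clusters(current, previous):
    """Label each current-period cluster by its overlap with the previous
    period: 'new' if none of its papers appeared in any earlier cluster,
    'continuing' if exactly one predecessor overlaps, 'merge' if several do."""
    prior_papers = set().union(*previous) if previous else set()
    labels = []
    for cluster in current:
        parents = [i for i, prev in enumerate(previous) if cluster & prev]
        if not cluster & prior_papers:
            labels.append("new")       # emerging area by the text's definition
        elif len(parents) == 1:
            labels.append("continuing")
        else:
            labels.append("merge")
    return labels
```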
The field or discipline assignment of a cluster is determined by the journals in which its highly cited papers are published, using the journal classification mentioned above. The discipline weight for a cluster is defined as the number of its papers in a particular discipline. We expect larger clusters to have larger numbers of disciplinary assignments. In Figs. 1–4 below we use the 1999–2004 data to show how the papers comprising the clusters are distributed by discipline in our dataset (as a percent of the total). The stability of most disciplines over time is notable, with the exception of an apparent slight upward trend in the biological sciences such as molecular biology, biology, and microbiology (Table 2).
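The discipline weighting can be sketched as a simple count over the journal-to-field mapping. The journal names and field labels below are illustrative; the real assignment uses the 22-field journal classification cited earlier.

```python
from collections import Counter

def discipline_profile(cluster_papers, journal_field):
    """Discipline weights of a cluster, as fractions of its papers per field.

    `cluster_papers` is a list of dicts with an illustrative 'journal' key;
    `journal_field` maps each journal to its discipline.
    """
    counts = Counter(journal_field[p["journal"]] for p in cluster_papers)
    total = sum(counts.values())
    # Fractions rather than raw counts, matching the percent-of-total display.
    return {field: n / total for field, n in counts.items()}
```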