Crime Hotspot Analysis using CrimeStat

Grant Bhay

Introduction: The concept of Geographical Information Systems and profiling is not new to the area of community policing. Mechanical methods of point pattern analysis and data separation have been benefiting society ever since Dr. Snow's landmark discovery of the tainted well (Waters 1995). The analysis of spatial and temporal data has been critical to the success of many criminal investigations. Buzzwords such as "geographic, criminal and psychological profiling" have rejuvenated an interest in analytical geography, and advances in computer technology have given geographers a medium in which to express the capabilities of their science (Waters, 1999, Rossmo, K., http tp: // web).

"Criminal Geographic Targeting is an example of the practical application of academic research," concludes Rossmo, "and is finally putting geographers on the map" (Rossmo, 1997). The purpose of this paper is to explore CrimeStat, the spatial statistics program for the analysis of crime incident locations, which was developed by Ned Levine, Ph.D., and Associates under a grant from the National Institute of Justice. In particular, the "hotspot analysis" module, which includes K-means clustering, nearest neighbor hierarchical spatial clustering and Local Moran statistics, will be examined for functionality, ease of navigation and interpretability of results. The K-Means analysis will be employed to determine whether the distribution of data points displays a clustered pattern or one of complete spatial randomness (CSR).

One would expect the "hotspot" aspect of the module to highlight areas of high concentration that may not be apparent on maps that simply plot crime locations. It is on this area of "hotspot" identification and analysis that this paper will focus.

Literature Review: A hot spot has been defined as a condition indicating some form of clustering in a spatial distribution. However, not all clusters are hot spots, because the environments that help generate crime, the places where people live, also tend to be clustered. So any definition of hot spots has to be qualified. Sherman (1995) defined hot spots "as small places in which the occurrence of crime is so frequent that it is highly predictable, at least over a 1-year period". According to Sherman, crime is approximately six times more concentrated among places than it is among individuals.

Therefore the locational aspect is extremely important. However, there seems to be confusion surrounding the hot spot issue, especially when it comes to defining the difference between spaces and places. Block and Block (1995) pointed out that a place could be a point, such as an apartment building, or an area, such as a census tract. However, buildings are generally considered places, and census tracts spaces. Concentrations of criminal activity may be easily identified on a relatively simple point map of crime locations. This becomes problematic, however, when multiple crimes that occur at a single address are displayed by a single point on a pin map (Sadler 1998). So there seems to be some academic debate as to an explicit definition of a hot spot, except in programs with procedures that self-define hot spots, such as CrimeStat.
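The pin-map problem noted above can be made visible by counting incidents per address before plotting, so that a single point can carry a weight. The following is only a minimal sketch: the records, the addresses and the function name incidents_per_address are hypothetical and are not part of the data set used in this paper.

```python
from collections import Counter

# Hypothetical incident records: (street address, offence type).
incidents = [
    ("100 Example St.", "assault"),
    ("100 Example St.", "robbery"),
    ("200 Sample Ave.", "robbery"),
    ("100 Example St.", "assault"),
]

def incidents_per_address(records):
    """Count how many incidents share each address."""
    return Counter(address for address, _ in records)

counts = incidents_per_address(incidents)
for address, n in counts.most_common():
    # An address with n > 1 would appear as a single pin on a plain point map.
    print(f"{address}: {n} incident(s)")
```

The counts could then be used as point weights or symbol sizes, so that repeat locations are not mistaken for single events.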

Hot spots are specific to their local conditions. In Baltimore County, Maryland, for example, hot spots are identified according to three criteria: frequency, geography, and time. At least two crimes of the same type must be present. The area is small, and the time frame is short: a 1- to 2-week period. Hot spots are generally monitored by crime analysts until they become inactive (Canter, 1997). Although the definition may be elusive, common sense and objective analysis should clearly define hot spots.

Data Acquisition and Manipulation: The analysis is performed on crime data originally retrieved from the Tetrad Computer Applications Inc. Crime Analysis internet site (http: // . tetrad. com / new /crime. html#Profile) and published in the Vancouver Sun, September 16, 1995. The data was in a graphical map format that displayed the locations and modus operandi for unsolved murders in Vancouver spanning twenty-four years, from 1970 to 1994 (figure 1).

Figure 1. Unsolved Murders 1970-1994

The defined study area is immediately south of Stanley Park in Vancouver, and the original data consists of 135 crime locations.

The point data displayed on the map has associated with it the particulars of the victim, the modus operandi and the actual street address of the crime (table 1).

Table 1. Victim and Crime Particulars

VICTIM    ADDRESS        DATE      MO         SEX  AGE
Patricia  1160 Haro St.  19811125  Strangled  F    33

The task of geo-referencing the street network map and the crime locations appears, at first glance, deceptively simple. That is, until you try to locate a geo-referenced city map of Vancouver, or a portion of one, in a digital format without having pre-approved financing from a major lending institution. Therefore, it was necessary to take advantage of a corporate demo subject area database and graphical user interface, currently referred to as VIP MapGuide, compliments of Canadian Pacific Railway (figure 2). This intranet site (http: //mgp c / ctn vip / home. html) is in development to be used to locate and link customer shipments to the rail network and its attributes.

This application allowed me to locate and assign the lat/long co-ordinates associated with each data point and record them in an ASCII text file to be imported into CrimeStat and Idrisi 32 for visual analysis. This was a laborious and time-consuming task, done for each data point. Where very tight clusters of data points were encountered, it was difficult to distinguish individual points. As a result, the data set was reduced to 103 locations, marginally affecting high concentration areas.

Figure 2. Common Track Network VIP MapGuide

Methods: The nearest neighbor hierarchical spatial clustering routine groups points together on the basis of spatial proximity.

The user defines a significance level associated with a threshold distance, a minimum number of points required for each cluster, and an output size for displaying the clusters as ellipses. Clustering is hierarchical in that the first-order clusters are treated as separate points to be clustered into second-order clusters, the second-order clusters are treated as separate points to be clustered into third-order clusters, and so on. Higher-order clusters will be identified only if the distance between their centers is smaller than the new threshold distance. The results can be saved to a text file, output as a '.dbf' file, or output as ellipses to ArcView '.shp', MapInfo '.mif' or Atlas GIS '.bna' files. The cluster output size can be adjusted to set the number of standard deviations spanned by the ellipse, from one standard deviation, the default value, to five standard deviations. The number of clusters can be restricted by defining the minimum number of points required for each cluster.
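The hierarchical idea described above can be illustrated with a simplified sketch. It is not CrimeStat's actual routine: here the threshold distance is passed in directly rather than derived from a significance level, points are grouped whenever they are linked by a chain of distances below the threshold, and the function names, coordinates and parameter values are all hypothetical.

```python
import math

def group_by_threshold(points, threshold, min_points):
    """Group points linked by chains of pairwise distances below threshold.

    Returns a list of clusters (lists of point indices), keeping only
    clusters with at least min_points members.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root of the group containing i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [m for m in groups.values() if len(m) >= min_points]

def centre(points, members):
    """Mean centre of the points whose indices are in members."""
    xs = [points[i][0] for i in members]
    ys = [points[i][1] for i in members]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Hypothetical coordinates: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
first_order = group_by_threshold(points, threshold=2.0, min_points=3)

# First-order cluster centres are treated as points for the next level.
centres = [centre(points, m) for m in first_order]
second_order = group_by_threshold(centres, threshold=20.0, min_points=2)
print(first_order, second_order)
```

Raising min_points in this sketch removes small groups, which mirrors the way a larger minimum reduces the number of clusters reported.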

The default minimum number of points is 10. If too few points are required, there will be many very small clusters. By increasing the number of required points, the number of clusters will be reduced.

The K-means clustering routine is a procedure for partitioning all the points into K groups, where K is a number assigned by the user. The default K is 5. The routine finds K seed locations such that the distances between points within clusters are small (minimum within) but the distances between seed locations are large (maximum between). If K is small, the clusters will typically cover larger areas. Conversely, if K is large, the clusters will typically cover smaller areas. The results can also be saved to a text file, output as a '.dbf' file, or output as ellipses to ArcView '.shp', MapInfo '.mif' or Atlas GIS '.bna' files.
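The partitioning step can be sketched with the standard K-means iteration (Lloyd's algorithm). This is an illustration under stated assumptions rather than CrimeStat's own implementation: the seed locations are supplied by the caller instead of being chosen by the program, and the function name k_means and the sample coordinates are hypothetical.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Partition points into len(seeds) groups with Lloyd's algorithm.

    Returns (labels, centres): labels[i] is the cluster index of points[i].
    """
    centres = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres will not move
        labels = new_labels
        # Move each centre to the mean of the points assigned to it.
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centre
                centres[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centres

# Hypothetical coordinates and seeds with K = 2.
points = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]
labels, centres = k_means(points, seeds=[(0, 0), (10, 10)])
print(labels, centres)
```

Because every point must belong to one of the K groups, a small K forces each group to cover a larger area, as described above.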

Method of Analysis Nnh: The first output to be examined is that of the nearest neighbor hierarchical spatial clustering (Nnh). In doing so, it is necessary to restate our research objectives by adhering to the following six steps.

Step 1. Hypothesis Statement: The first step is to establish a null hypothesis statement regarding complete spatial randomness (CSR). In this case we can state:

Ho: there is no statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a random pattern.

Ha: there is a statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a clustered pattern.

Step 2. Choice of Test: The statistical test in this case appears to be the t-test (T-value), although traditional nearest neighbor analysis (NNA) employs a z-statistic (standard normal deviate).

Step 3. Sample Size and Significance Level: The sample size for all tests in this analysis is 103 data points, and the significance level is 0.05. The threshold distance is adjusted by the significance level.

Distances smaller than the threshold are candidates for clustering. The larger the alpha level chosen, the larger the areas and ellipses the clusters will cover. The smaller the alpha level, the smaller the areas and ellipses the clusters will cover. However, the higher the alpha level chosen, the greater the likelihood that clusters could be chance groupings.

Step 4. Sampling Distribution: Since we are employing a t-test to determine statistical significance, we can assume we are dealing with a t distribution.

Step 5. Region of Rejection: Based on our sample size of 103, an alpha value of 0.05, n-1 df and a two-tailed test, we know from the t-tables that the region of rejection lies beyond +/- 1.96.

Step 6. Decision Rule: Based on our hypothesis statement and the region of rejection, we must then reject or fail to reject the null hypothesis. If the absolute value of our calculated t-value is greater than our critical t-value, we must reject the null hypothesis Ho and accept Ha.
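The decision rule in Step 6 can be written out as a short check. This is a sketch that uses the standard normal distribution from Python's standard library, which reproduces the 1.96 critical value quoted above; the exact t critical value for 102 degrees of freedom is slightly larger (roughly 1.98), which would not change the outcome for the T-value of 1.671 reported in this analysis. The function names are illustrative.

```python
from statistics import NormalDist

def two_tailed_critical_value(alpha):
    """Two-tailed critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def reject_null(statistic, alpha=0.05):
    """Return True when the statistic falls in the region of rejection."""
    return abs(statistic) > two_tailed_critical_value(alpha)

critical = two_tailed_critical_value(0.05)
print(round(critical, 2))   # 1.96
print(reject_null(1.671))   # False: Ho cannot be rejected
print(reject_null(2.5))     # True: a hypothetical value inside the rejection region
```

Taking the absolute value of the statistic is what makes the test two-tailed: values far below zero are rejected as readily as values far above it.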

Since our calculated T-value of 1.671 is less than our critical T-value of 1.96, we cannot reject Ho and therefore state: there is no statistically significant difference between the observed and expected values, therefore the distribution of points constitutes a random pattern.

Results Nnh: The results from the nearest neighbor hierarchical clustering analysis run in the CrimeStat software are displayed below in table 2.

Table 2. Nnh Clustering Results

In an effort to make the interpretation more intuitive, the original data point ASCII file was converted to a raster file in Idrisi 32 (figure 3). By importing the ".shp" file created by CrimeStat and converting this to a vector file, we are able to display the calculated Nnh ellipse overlaid on the original rasterized data points (figure 4). This ellipse is bounded by a single standard deviation, which was defined by the creation parameters.

Figure 3. Raster Image of Unsolved Murders 1970-1994

Figure 4. Raster Image of Unsolved Murders 1970-1994 Nnh Clustering Ellipse

Method of Analysis K-Means: The final output to be examined is that of the K-Means clustering. In doing so, it is necessary to restate our research objectives by once again adhering to the following six steps.

Step 1.

Step 2. Choice of Test: The statistical test in this case is unique to the K-Means method of analysis.

Although the CrimeStat program does not provide the mathematical foundations of this or any of the other applied methods, one would assume it is based on the calculation of the K-Function statistic, lK.

Step 3.

Step 4. Sampling Distribution: Once again, we lack the information necessary to make a definitive statement on the sampling distribution, and assume it is similar in nature to that employed by Chen and Getis in the K-function Analysis module of the PPA program they have developed. They assumed stationarity, which is necessary if inferences are to be made from a single observed pattern.

This testing procedure and its assumptions are referred to as a homogeneous Poisson process.

Step 5. Region of Rejection: The rejection region is bounded by the minimum and maximum expected values. If the observed values fall outside of these bounds then we must reject Ho and accept Ha. If the observed value falls within these bounds then we cannot reject Ho, based on a 95% level of confidence.

Step 6. In this case, the value itself, when compared to the region of rejection (either above or below), can provide further information regarding the distribution of our point data.

K-Means Results: The results for the K-Means clustering analysis are presented below in table 3.

Table 3. K-Means Clustering Results

Once again, the ellipse files were created with a bound of one standard deviation. These ".shp" files were imported into Idrisi 32 and overlaid onto the original data points (figure 5) for ease of interpretation.

Figure 5. Raster Image of Unsolved Murders 1970-1994 K-Means Ellipses

Analysis of Results: Although the raster images created by this procedure are visually appealing, the results are still ambiguous. In this case the default number of 5 clusters was accepted.

Five ellipses were created, but only clusters 1 and 2 contain a substantial population of data points, 49 and 35 respectively. The threshold that determines whether an ellipse is significantly clustered or CSR is not readily apparent. It appears, however, that clusters 3, 4, and 5, with point counts of only 3, 8 and 8 respectively, show no statistically significant difference between the observed and expected values. Therefore we cannot reject Ho for these clusters and state: the distribution of points constitutes a random pattern. The same cannot be said for clusters 1 and 2. Their point counts suggest that there is a statistically significant difference between the observed and expected values, therefore we must reject Ho and state: the distribution of points constitutes a clustered pattern for these clusters.

These two areas, then, would be classified as "hot spots", areas of concentrated crime. This may be deceiving because, as discussed previously, multiple crimes at one location can lead to misleading results. Such a location could range from an apartment complex to a vacant industrial site. Analysis of this type must be "ground truthed", and inferences should not be made without a priori intimate knowledge of the study area.

CrimeStat Analysis: CrimeStat, the spatial statistics program for the analysis of crime incident locations, has eight functional tabs, four of which are dedicated to data structures and parameters: primary file, secondary file, reference file and measurement parameters. The initial data input screen can be troublesome, as the "compute" button did little more than freeze the program. Three of the remaining four tabs, Distance Analysis, Hot Spot Analysis and Interpolation, provide a powerful analytical arsenal. The Interpolation module, which calculates probabilities and densities and can handle several layers of different distributions, would be very useful in time-series and change detection analyses. Unfortunately the Spatial Autocorrelation Indices module, which houses Moran's I and Geary's C, was completely inaccessible, with dialogue boxes "greyed out" entirely. In general, the CrimeStat program is user friendly, but it requires an in-depth knowledge of the subtleties of several similar operations, and the lack of mathematical documentation makes this task much more difficult.

Conclusions: The complicated and important process of data acquisition, rectification and preparation for implementation into a GIS software system is often underestimated. A common source of digital data, such as that provided by projects like TIGER, is long overdue, and could provide "customers" and users of this data warehouse with a consistent and reliable single-source database. This is not to say a single-source database cannot be erroneous, but the error is kept constant if all are accessing the same information. The purpose of this paper was to explore CrimeStat, the spatial statistics program for the analysis of crime incident locations. Both methods examined in the "hotspot analysis" module, K-means clustering and nearest neighbor hierarchical spatial clustering, processed the input data effectively and efficiently. The nearest neighbor hierarchical spatial clustering routine found the overall distribution of data points to be completely spatially random.

The K-Means analysis was employed to refine the Nnh process by categorizing the data points into five ellipses, based on the standard deviation of the data along both the major and minor axes. Two of these ellipses, ellipses 1 and 2, were found to be statistically significant. The "hotspot" aspect of the program did indeed highlight areas of higher concentration and is a useful identification and monitoring tool. The CrimeStat program, and programs like it, provide users with the ability to create intelligent maps that allow us to extrapolate information into predictive models. This "modular approach" of expanding and developing specific software supplements empowers the user by adding new applications and dynamic methodologies to the static and expensive cartographic / GIS software nuclei (Anselin, L., 1998). These analyses cannot stand alone; they require an intelligent user to make intelligent choices in order to provide meaningful results.
