
SAS Global Forum 2010 SAS Presents

Paper 332-2010

PROC GEOCODE: Now with Street-Level Geocoding

Darrell Massengill and Ed Odom, SAS Institute Inc., Cary, NC

ABSTRACT

How do you convert your address data into map locations? This is done through the process of geocoding. SAS/GRAPH® 9.2 software included PROC GEOCODE to simplify this process for you. By popular demand, the third maintenance release of SAS® 9.2 will add street-level geocoding. Street-level geocoding converts street addresses to locations. This paper will review the current capability of this procedure and cover the new street-level geocoding capability.

INTRODUCTION

Every business or organization has a lot of data that includes an address. The address data is useful only if it is transformed into location information that can be viewed on a map, used in distance calculations, or processed in other similar ways. To make this data useful, you must convert the address to a location having a latitude and longitude. This conversion process is called geocoding. If the address is an IP address, the process is usually called geolocating.

This paper will discuss both geocoding and geolocating using PROC GEOCODE. First, we will introduce the concepts needed to understand geocoding, and then we will discuss PROC GEOCODE’s current and new functionality. Finally, examples will show how to use PROC GEOCODE.

GEOCODING BASICS

The geocoding process depends on having lookup data with the necessary information to make the conversion from address to location. This data is the key to geocoding. Factors such as age and granularity of the lookup data determine the geocoding results. Addresses routinely change with new construction, new roads being added, and postal codes being split and changed. The older your lookup data, the more likely it is that some address matches might be incorrect or missed completely. Granularity is another important consideration.
Does the location need to be the actual house location or is it okay at a ZIP Code level or even a city level? If you are viewing the addresses on a state or U.S. map, then a ZIP Code or city centroid is accurate enough.

To understand geocoding, it is important to first understand the lookup data. It is particularly important to understand the differences between ZIP Code data, ZIP+4 data, and street address data. IP address data is completely different from the other types of addresses, but it is important to understand this data, too.

ZIP CODE DATA

A ZIP Code is a delivery route or a collection of roads traveled to deliver mail in an area. Sometimes, ZIP Codes are assigned to a single building or to a post office. ZIP Code boundaries are not created by the U.S. Postal Service (USPS). ZIP Codes are not polygonal areas in the way that a county, state, or country is. Creating polygons by simply wrapping the delivery route would leave gaps between the polygons because there are large, undeveloped areas. The area covered by a ZIP Code varies with how densely populated it is. A rural ZIP Code will be larger than one in a large city.

Generally, ZIP Code address data specifies a centroid location for the ZIP Code area. The centroid is the geographic center of the area, but because the geographic area isn’t real, that location can vary between data providers. Appendix 1 contains many frequently asked questions about ZIP Codes.

The standard ZIP Code in the United States is five digits. Figure 1 illustrates a ZIP Code area and its centroid location. All addresses in this ZIP Code would be assigned the same centroid X and Y values if you are geocoding with only ZIP Code data.

Figure 1. ZIP Code Area

A ZIP Code is further divided into ZIP+4 areas. Four additional digits are appended to the ZIP Code to specify these additional subdivisions. A ZIP+4 will likely represent a single street or a part of a street.
In a high-density city area, it might represent one side of a street on a single block, or even one floor of a large building. Figure 2 illustrates the relationship between a ZIP Code and a ZIP+4. The centroid is the midpoint of those addresses in the ZIP+4 area. All addresses in this ZIP+4 area would get the same centroid if geocoded with ZIP+4 data. However, there would be multiple locations within the overall ZIP Code area, one for each ZIP+4.

Figure 2. ZIP+4 Area

ZIP Code data will get you to the general vicinity of the address, but probably not to the actual street. ZIP+4 data will probably get you to the correct street in the address, but not to the actual house location. To geocode to the specific house location, you need street-level data.

STREET ADDRESS DATA

Street-level address data contains information about the ZIP Code, state, city, and street. This information enables you to process the entire address and get a more precise street location. This data does not contain the location for every house number or address. Generally, this data only approximates the position of a particular address on a street by assuming that house numbers are an equal distance apart, which might not be true.

Figure 3 illustrates street-level addressing. Anyone who has used a GPS or looked up directions on the Web knows that the location found might not be exactly right. The same is true when using any geocoding mechanism.

Figure 3. Street-Level Location

IP GEOLOCATING DATA

IP data is a form of range data and, unlike street addresses, was not designed to be geographic. IP addresses are very different from ZIP Code and street addresses. Generally, these are collected from visitors to Web sites and indicate the connection the visitor used. IP address lookup data contains information that matches ranges of IP addresses to particular geographic locations.
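To make the idea of range data concrete, the following is a minimal Python sketch of a range lookup. It is not how PROC GEOCODE is implemented. The table contents, the place labels, and the names RANGES and geolocate are all hypothetical, and the address blocks are ones reserved for documentation.

```python
import bisect
import ipaddress

# Hypothetical range table: (first address, last address, place label).
# The address blocks are reserved for documentation; the labels are made up.
RANGES = [
    ("192.0.2.0", "192.0.2.255", "City A"),
    ("198.51.100.0", "198.51.100.255", "City B"),
    ("203.0.113.0", "203.0.113.255", "City C"),
]

# Convert each range to integers and sort by the start of the range.
_TABLE = sorted(
    (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), place)
    for first, last, place in RANGES
)
_STARTS = [row[0] for row in _TABLE]


def geolocate(ip):
    """Return the place label whose range contains ip, or None if no range does."""
    value = int(ipaddress.IPv4Address(ip))
    # Find the last range that starts at or before the address.
    i = bisect.bisect_right(_STARTS, value) - 1
    if i >= 0 and value <= _TABLE[i][1]:
        return _TABLE[i][2]
    return None
```

An address that falls between two ranges, or before the first one, returns None, which corresponds to an unmatched observation.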
The location found will not be at the street or even ZIP Code level, but might indicate the city, state, or country where the IP address is registered.

CHOOSING YOUR DATA

The type of geocoding you want to do will determine the type of lookup data that is needed. What things are important to you? How precise does the location need to be? Do you need the street-level address, or will the ZIP Code or city center be sufficient? How current does the data need to be? How much are you willing to pay for the data?

The more up-to-date, accurate, and fine-grained the data, the more it costs to purchase and maintain it. Also, higher-resolution data requires more disk storage space and takes longer to run the geocoding process against it. There are free sources for some types of data, but these might not be updated as frequently as the data you purchase.

It is important to remember that both purchased and free lookup data might be flawed and give incorrect results. There are no guarantees with any geocoding lookup data, so the results should be used with caution.

PROC GEOCODE

The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values). These geographic coordinates can then be used on a map, to calculate distances, or to perform spatial analysis. Appendix 2 contains more information about what can be done with the geocoded coordinates. In addition, the procedure enables you to add attribute values to your data if they are in the lookup data. Examples would be adding census blocks or area codes to an address.

The GEOCODE procedure requires at least two SAS data sets:

• The input data set that you want to geocode. This data set will contain variables related to the address such as street address, city, state, and ZIP Code. Note that it is important for this data to be clean and in the proper form in order to have a high match rate.
• A lookup data set (or multiple data sets) containing the data to transform your address data into geographic locations. By default, SASHELP.ZIPCODE is used for ZIP Code or city geocoding. Range data (for example, IP addresses) uses two data sets. Street-level geocoding uses six data sets.

The simplest example of how these data sets are used is with ZIP Code geocoding. Figure 4 shows that all of the variables from the input data are carried forward into the output data set. The ZIP variable is looked up in the lookup data set and those X and Y values (longitude and latitude) are added to the output data set. These are the geographic coordinates of that ZIP Code centroid. In addition, if the lookup data set contains other attributes such as county names or census blocks, you can specify that these additional values be moved to the output data set.

Input data set (address data)

Customer ID  Address        City        State  ZIP
456          1234 Smith St  Selbyville  DE     19975
457          201 S 2nd St   Oxford      PA     19363

Lookup data set

ZIP    X        Y
19975  38.4673  75.1976
19363

Geocoded output data set (from PROC GEOCODE)

Customer ID  Address        City        State  ZIP    _MATCHED_  X        Y
456          1234 Smith St  Selbyville  DE     19975  ZIP        38.4673  75.1976

Figure 4. ZIP Code Lookup

In reality, geocoding is more complicated than this. By default, the procedure will give you the next larger area if the ZIP Code isn’t found. If a standard five-digit ZIP Code isn’t found, then it will attempt to find the city center location.
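The lookup and fallback just described can be sketched in a few lines of Python. This is an illustration of the idea only, not PROC GEOCODE itself. The X and Y values for 19975 are copied from Figure 4 as printed. The city-center table, its placeholder coordinates, the function name geocode, and the _MATCHED_ values "City" and "None" are assumptions made for this sketch.

```python
# X and Y for 19975 are copied from Figure 4 as printed.
ZIP_LOOKUP = {"19975": (38.4673, 75.1976)}

# Hypothetical city-center table keyed by (city, state).
# The coordinates are placeholders, not real locations.
CITY_LOOKUP = {("OXFORD", "PA"): (1.0, 2.0)}


def geocode(record):
    """Return a copy of record with _MATCHED_, X, and Y added."""
    out = dict(record)  # every input variable is carried forward
    hit = ZIP_LOOKUP.get(record.get("zip", ""))
    matched = "ZIP"
    if hit is None:
        # ZIP Code not found: fall back to the next larger area, the city.
        key = (record.get("city", "").upper(), record.get("state", "").upper())
        hit = CITY_LOOKUP.get(key)
        matched = "City"  # assumed label for a city-level match
    if hit is None:
        out.update({"_MATCHED_": "None", "X": None, "Y": None})
    else:
        out.update({"_MATCHED_": matched, "X": hit[0], "Y": hit[1]})
    return out


customers = [
    {"id": 456, "address": "1234 Smith St", "city": "Selbyville", "state": "DE", "zip": "19975"},
    {"id": 457, "address": "201 S 2nd St", "city": "Oxford", "state": "PA", "zip": "19363"},
]
geocoded = [geocode(c) for c in customers]
```

Here customer 456 matches at the ZIP Code level, while customer 457, whose ZIP Code is absent from the sample lookup table, falls back to the city entry.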