Friday, April 7, 2017

Geocoding

Goals and Objectives:

The goals of this assignment are to learn how to geocode both street addresses and PLSS (Public Land Survey System) locations and to compare the results to the work done by others.  In the context of this sand mining project, 19 mine locations were provided, and the goal was to geocode them and compare them to the same mines geocoded by classmates.  This provides an opportunity to analyze potential errors.  Before any geocoding could be done, the data had to be normalized.

Methods:

The first step in the process of geocoding was to ensure that the data was normalized.  This meant manipulating the given datasets in Microsoft Excel so that each record would be in the same format.  Essentially, this split each address into separate parts that the geocoding tools could use to find the desired locations.
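The normalization for this lab was done by hand in Excel, but the same idea can be sketched in code.  The example below is only an illustration: the "street, city, state ZIP" layout, the field names, and the sample address are all assumptions, not the actual format of the mine table.  Records that do not fit the pattern (for example, PLSS-only descriptions) come back as None so they can be set aside for the manual process.

```python
import re

# Assumed input layout: "street, city, ST 12345". This is a hypothetical
# format chosen for illustration, not the layout of the real mine table.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),"
    r"\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*$"
)


def normalize_address(raw):
    """Split one raw address string into separate fields.

    Returns a dict of fields, or None when the record does not look like
    a street address (for example, a PLSS-only description).
    """
    match = ADDRESS_PATTERN.match(raw)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


if __name__ == "__main__":
    # Made-up sample records, one address and one PLSS-style description.
    for record in ["123 Main St, Eau Claire, WI 54701", "NE 1/4 Sec 12 T27N R9W"]:
        print(record, "->", normalize_address(record))
```

Splitting the fields this way mirrors what the Excel step accomplished: each part of the address ends up in its own column so the geocoder can match on it.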

The geocoding portion itself required two different processes.  One was used when the actual addresses were provided.  The other was used when only PLSS information was available and the locations had to be found with a much more manual process.  When addresses were provided, the "Geocode Addresses" function could be used.  This would take the information from the normalized table and generate a list of candidate locations for the address.  I would then zoom to these locations and select the one that appeared most accurate, using the imagery base map as well as Google Maps satellite view as a reference.

When the physical address information was not available, the PLSS coordinates were used to locate the sand mines.  This was a much more manual process and required the use of the imagery base map, as well as layers displaying PLSS sections and townships, in order to get a general idea of where these mines were.  When the mine was located, the "Geocode Addresses" function was used to mark the location for that point.  In many cases, the mines that had addresses attached also included PLSS information.  When this was the case, the PLSS information was used to verify the accuracy of the address information.  When all 19 mines were geocoded, a shapefile was created to share with the rest of the class.

The next portion of the assignment was to compare my geocoded mines with the same mines that other people in the class also geocoded.  In order to do this, all of the shapefiles were brought into ArcMap and then the merge tool was used to combine them all into one.  Before doing anything else, since this step required measuring distances, I made sure that all of my data was in the same projection.  Using the merged layer, I was able to query out the 19 mines that I also geocoded and then selected a sample of 5 mines that at least 2 other people had geocoded, so that I could compare my work to theirs.  I then used the "Point Distance" tool to measure the distance between the point I thought was the location of the mine and the points 2 of my classmates thought were the correct locations.  A table was then generated with these results.  The same process was then repeated to compare the same 5 mines with the true locations provided by data from the DNR.
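The distance comparison can be sketched outside of ArcMap as well.  The example below is a minimal illustration, not the "Point Distance" tool itself: it assumes every point is already in the same projected coordinate system with units of feet, and the coordinates shown are invented for the example rather than taken from the lab data.

```python
import math
import statistics


def point_distance(p, q):
    """Straight-line distance between two (x, y) points.

    Assumes both points share one projected coordinate system, so the
    result is in that system's linear unit (feet in this sketch).
    """
    return math.hypot(q[0] - p[0], q[1] - p[1])


def summarize(distances):
    """Return the mean and sample standard deviation of the distances."""
    return statistics.mean(distances), statistics.stdev(distances)


if __name__ == "__main__":
    # Hypothetical projected coordinates in feet, not real mine locations.
    my_point = (1000.0, 2000.0)
    classmate_points = [(1030.0, 2040.0), (1000.0, 2100.0), (1600.0, 2800.0)]

    distances = [point_distance(my_point, other) for other in classmate_points]
    mean_distance, std_distance = summarize(distances)
    print(distances, mean_distance, std_distance)
```

A single point placed at a different mine altogether would show up here as one very large distance, which inflates both the mean and the standard deviation.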

Results:

Table 1:  The original location data that has not been normalized.  It is a mixture of actual addresses and PLSS locations all in one field.
Table 2: Locational information after it has been normalized.  This involved splitting the given information up in ways that the geocoding function would be able to interpret.
Figure 1:  This is a map of the 19 sand mine locations that I geocoded.
Table 3:  This table shows the distance in feet between the location where I placed each mine and the locations where 2 other people placed the same mine.  The 2 other people's locations received the "a/b" label in the table in order to differentiate the different people's locations for the same intended mine.  The very large distances in some cases and the high standard deviation show that there were a few instances where the location I thought was correct and the locations my classmates chose were at different mines altogether.  In other instances, the points would be at the same mine, just at a different entrance, thus causing a discrepancy.
Table 4: This table shows the distance between my point and the truth point provided by the DNR.  In many of these cases I noticed that my points were at the entrance to the mine, whereas the DNR points were inside the mines themselves.  Overall, the mean and standard deviation are relatively small, showing that the discrepancies were not large.
Figure 2: This map shows an example of a difference between my data and that of the DNR.  The points are clearly marking the same mine; however, the actual placement of the points is different.  This was a common error that I noticed throughout my dataset.

Discussion:

While the geocoding process is generally reliable, it cannot be completely free of errors.  This can be seen first of all in the differences between each person's points, as noted in the tables and figures above.

Throughout this process there are both inherent and operational errors present.  Inherent errors occur due to the nature of how geographic data is represented.  One example is the distortion introduced when projecting the round earth onto a flat surface.  In this case, that could have an impact when trying to measure the distances between points.  Another source of inherent error is the imagery base map, which could be outdated, making it harder to match a point to the imagery.  I noticed significant differences in the imagery when viewing it at different extents.  When the data was originally collected, there could also have been an inherent error depending on the equipment used for the collection.

Operational errors are introduced by the people collecting and handling the data.  The differences in where points were placed on the imagery could be due to people interpreting the base map image differently.  They could also be due to people working at different scales when placing points.  There could also have been an operational error made when collecting the data in the first place.  It is nearly impossible to avoid these errors completely.

It is difficult to ultimately know which points are correct and which ones are not.  The best way to ensure accuracy is to actually go to the locations and verify the point.  However, this is not always possible.  In this case, the most feasible way to ensure data accuracy would be to compare as many different people's geocoded points as possible.  Even while doing that, however, it is impossible to ensure the data is completely correct without physically going to the locations and collecting the raw data.

Conclusions:

Overall, geocoding is a good way to create data points from given addresses.  While it may not produce a perfect data point every time, it certainly saves time compared to going to each location in order to get a point.  This is not without limitations, however.  There will always be some risk of error when doing this process.  When examining the data for errors, while it would be nice to be able to check each point, that is not always possible.  In most cases a sample of the data will be examined, as was the case in this lab.  Future studies may want to check more locations than just the sample, as well as compare the locations to a larger sample of other people's work.

