Using Natural Language Processing for Precise Geotagging

Oct 15, 2018

Voyager’s geotagging capabilities allow administrators to associate non-spatial data, such as Word documents or photos, with geographic locations. Once geotagged, that data can be linked to spatial data and is available in location-based searches. Geotagging provides a great tool for organizations that use spatial data. Here at Voyager Search, we’ve made geotagging even more accurate by incorporating some Artificial Intelligence (AI) into the process.

To do this, we employ Natural Language Processing (NLP), a subset of AI. NLP is the science of teaching computers to understand human interaction and the nuances of language, and it is the tool that we’ll use for more precise geotagging.

When geotagging text, matching on place names alone can produce false positives: geographic points get assigned to a document even though they don’t truly apply.

As an example, let’s index the Wikipedia page about the actress Olympia Dukakis.

We want to geotag this content, but we don’t want the city of Olympia, Washington to be used to geotag the records.
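To make the false positive concrete, here is a minimal Python sketch of text-only geotagging. It is not Voyager’s implementation: the tiny gazetteer, its approximate coordinates, and the function name naive_geotag are all assumptions made for illustration.

```python
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude).
# Coordinates are approximate and only for illustration.
GAZETTEER = {
    "Olympia": (47.04, -122.90),  # Olympia, Washington
    "Lowell": (42.63, -71.32),    # Lowell, Massachusetts
}


def naive_geotag(text):
    """Tag every gazetteer name that appears as a whole word in the text."""
    hits = {}
    for name, coords in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            hits[name] = coords
    return hits


text = "Olympia Dukakis was born in Lowell, Massachusetts."
tags = naive_geotag(text)
print(sorted(tags))  # ['Lowell', 'Olympia']
```

The matcher has no way to know that “Olympia” is part of a person’s name here, so the city in Washington is tagged alongside the genuine location.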

This is where NLP comes into play. It provides a powerful mechanism for making geotagging more precise by using more information to find results, including content, context and syntax. This gives a more refined picture than using text content alone.

When we run the NLP Extension on this Wikipedia content, it brings back a few different categories, including location-specific ones, but Olympia is identified only as part of a person’s name, not as a location. The results are shown below.

To better see NLP in action, let’s create two pipelines. One pipeline will do geotagging on the text content of the document. The other will first use NLP to complete entity extraction, which is the process of pulling out key terms that can group related items together, and then run geotagging only on the location-specific results of that extraction.
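The difference between the two pipelines can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Voyager’s pipeline configuration: the gazetteer is a toy, extract_entities is a hard-coded stand-in for a real NLP model, and the function names are hypothetical. Only the “nlp_place” field name comes from our setup.

```python
import re

# Hypothetical mini-gazetteer; coordinates are approximate.
GAZETTEER = {
    "Olympia": (47.04, -122.90),  # Olympia, Washington
    "Lowell": (42.63, -71.32),    # Lowell, Massachusetts
}


def geotag(text):
    """Return gazetteer entries whose names appear as whole words in text."""
    return {
        name: coords
        for name, coords in GAZETTEER.items()
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    }


def extract_entities(text):
    """Stand-in for an NLP model: returns hard-coded (entity, type) pairs.

    A real extractor would derive these from content, context and syntax.
    """
    candidates = [("Olympia Dukakis", "person"), ("Lowell", "place")]
    return [(entity, kind) for entity, kind in candidates if entity in text]


def pipeline_text_only(doc):
    """Pipeline 1: geotag the raw text content."""
    return geotag(doc["text"])


def pipeline_nlp_first(doc):
    """Pipeline 2: extract entities, then geotag only the place entities."""
    doc["nlp_place"] = [
        entity for entity, kind in extract_entities(doc["text"]) if kind == "place"
    ]
    tags = {}
    for place in doc["nlp_place"]:
        tags.update(geotag(place))
    return tags


text = "Olympia Dukakis was born in Lowell, Massachusetts."
first = pipeline_text_only({"text": text})
second = pipeline_nlp_first({"text": text})
print(sorted(first))   # ['Lowell', 'Olympia']
print(sorted(second))  # ['Lowell']
```

Because the second pipeline only ever sees strings that were classified as places, the person’s name never reaches the geotagger.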

The first pipeline:

The second pipeline, which performs entity extraction prior to geotagging, and uses the “nlp_place” field for the geotagged content:

Now, we create two locations, both pointing to a copy of the .pdf created from the Wikipedia page. The first location will use the first pipeline and the second location will use the second pipeline.

After indexing both locations, we have two documents to compare. The documents are exactly the same, but the output is not.

For the document that has had standard geotagging applied without NLP extraction to identify the locations, the geotagged output looks like this:

When we compare that to the item that had entity extraction run to identify the locations mentioned in the article, and used those locations for geotagging, we get a much more precise geotagging result:

If we take a closer look at the text, we can see mentions of three film festivals:

  • 2018 Toronto Shorts International Film Festival – Winner "Best Drama"
  • 2018 Yorkton Film Festival – Nominee "Golden Sheaf Award"
  • 2018 Montreal Greek Film Festival – Official Selection

When we use geotagging without NLP, Toronto, Yorkton, and Montreal are all geotagged and represented on the map. However, when we use NLP first, these are not considered locations, so they are not geotagged. Instead, these three entries are identified as events and are now available as filters that we can use for faceting results.
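As a rough sketch of how typed entities can feed facets instead of the map, the Python below groups extraction results by type. The entity list is hand-written to mirror the three festival entries, and both the “nlp_event” field name (modeled on “nlp_place”) and the build_facets helper are hypothetical.

```python
from collections import defaultdict

# Hand-written stand-in for entity-extraction output on the festival lines.
entities = [
    ("Toronto Shorts International Film Festival", "event"),
    ("Yorkton Film Festival", "event"),
    ("Montreal Greek Film Festival", "event"),
]


def build_facets(entities):
    """Group entity strings into one field per entity type."""
    facets = defaultdict(list)
    for text, kind in entities:
        facets["nlp_" + kind].append(text)
    return dict(facets)


facets = build_facets(entities)

# Only place entities would be handed to the geotagger; here there are none.
places_to_geotag = facets.get("nlp_place", [])

print(len(facets["nlp_event"]))  # 3
print(places_to_geotag)          # []
```

Toronto, Yorkton, and Montreal survive only inside the event names, so nothing is sent for geotagging while all three festivals remain searchable as facet values.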
