Using OCR To Get More Out of Your Search Experience

By Samuel Smith, Voyager Search Solution Architect

Sep 21, 2017

In the last few years, Optical Character Recognition, or OCR, has become increasingly important in the enterprise search industry. This is especially true in the petroleum and mining industries, which hold vast amounts of historical data captured as scanned images. Humans can read these “pictures of text,” but the images don't contain the individual words and characters that would make them searchable. To accommodate our customers’ need for OCR, Voyager software can ingest text from scanned legacy documents, screenshots, PowerPoint presentations, and photographed documents. OCR allows Voyager to read and index documents that are traditionally not machine-readable, so users can add them to their organization’s catalog and make them searchable alongside Word documents, geospatial data, and more.

Written in Python, Voyager software’s OCR capability is implemented as a step in the indexing pipeline and can be downloaded, installed, and configured with any of our licensing options. Voyager’s application of OCR works like most other OCR implementations: an “image” of text is scanned for characters, and these characters are then converted into machine-readable text. With Voyager software, the text is then added to a field in the index, where it is stored to support searches for the document.
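To make the idea of "OCR as a pipeline step" concrete, here is a minimal sketch in Python. It is not Voyager's actual API: the function name `ocr_step`, the `path` and `text` field names, and the list of image extensions are all assumptions made for illustration. The OCR engine itself is passed in as a plain function so the sketch runs without any OCR library installed.

```python
import os
from typing import Any, Callable, Dict

# Assumed set of image extensions; a real deployment would configure this.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def ocr_step(entry: Dict[str, Any],
             ocr: Callable[[str], str],
             text_field: str = "text") -> Dict[str, Any]:
    """Hypothetical pipeline step: add recognized text to an index entry.

    `entry` is a dict describing one document, `ocr` is any function that
    takes a file path and returns the text it recognizes in that image.
    """
    path = entry.get("path", "")
    extension = os.path.splitext(path)[1].lower()
    if extension not in IMAGE_EXTENSIONS:
        return entry  # not an image, so leave the entry untouched

    text = ocr(path).strip()
    if not text:
        return entry  # nothing recognized, so add no field

    updated = dict(entry)       # copy so the caller's entry is not mutated
    updated[text_field] = text  # this field is what search would query
    return updated


def fake_ocr(path: str) -> str:
    """Stand-in for a real OCR engine, used only for this demonstration."""
    return "  Well log 42, Permian Basin \n"


doc = ocr_step({"path": "scans/well_log.TIF"}, fake_ocr)
print(doc["text"])
```

In practice the `ocr` argument would wrap a real engine. With the open-source Tesseract engine and its `pytesseract` wrapper, for example, it could be something like `lambda p: pytesseract.image_to_string(Image.open(p))`; whether Voyager uses that engine is not stated here.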

Once Voyager has this text, additional pipeline steps can be used to enhance the search experience. For example, an administrator can use Natural Language Processing (NLP) to find place names within the text and then geotag those locations to support spatial search. Those locations can then be added to a geographic hierarchy of countries, states, and cities so that users can easily browse content.
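The geotagging and hierarchy steps can be sketched in a few lines of Python. This sketch deliberately swaps full NLP for a simple gazetteer lookup (matching known place names as whole words), and it is not Voyager's implementation: the `GAZETTEER` table, its approximate coordinates, and the function names `find_places` and `build_hierarchy` are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical gazetteer: place name -> country, state/province, and
# approximate coordinates. A real system would use a much larger source.
GAZETTEER = {
    "Houston": {"country": "United States", "state": "Texas",
                "lat": 29.76, "lon": -95.37},
    "Midland": {"country": "United States", "state": "Texas",
                "lat": 32.00, "lon": -102.08},
    "Calgary": {"country": "Canada", "state": "Alberta",
                "lat": 51.05, "lon": -114.07},
}


def find_places(text: str) -> List[str]:
    """Return gazetteer names that appear as whole words in the text."""
    found = []
    for name in GAZETTEER:
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append(name)
    return found


def build_hierarchy(places: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group place names into a country -> state -> [cities] tree."""
    tree: Dict[str, Dict[str, List[str]]] = {}
    for name in places:
        place = GAZETTEER[name]
        cities = tree.setdefault(place["country"], {}).setdefault(place["state"], [])
        cities.append(name)
    return tree


ocr_text = "Core samples shipped from Midland to Houston, then on to Calgary."
places = find_places(ocr_text)
hierarchy = build_hierarchy(places)
print(hierarchy)
```

Each matched name also carries coordinates in the gazetteer, which is what a spatial search would use. The nested dictionary is the kind of structure a browse-by-location interface could be built on.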

Voyager’s OCR implementation supports the following formats and languages:

  • Arabic, Bengali, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese (additional languages can be added as needed)

Because OCR runs as part of indexing, Voyager’s users get searchable text from scanned content without extra manual steps. With immediate access to all of their content, organizations with massive amounts of historical data no longer have to sacrifice productivity searching for information locked in images. The time saved means quicker outcomes and more streamlined processes.

Learn more about Voyager’s OCR configuration here or contact Voyager Search’s Professional Services team for more complex or custom OCR implementations.
