What Is Optical Character Recognition (OCR)?

Optical character recognition (OCR) offers organizations the opportunity to get a much better digital handle on the information they store.

This article will give you a clear overview of the opportunity that OCR represents, how it works, and what its most common (and useful) applications are.

What is Optical Character Recognition (OCR)? 

Optical character recognition is the electronic conversion of handwritten content, printed text, or image-only digital documents into a machine-readable, searchable digital format. For example, OCR allows handwritten legal notes, which would ordinarily be time-consuming to review, to be converted into PDFs that can quickly be searched for relevant content.

In short, OCR takes a physical document or static digital image that isn’t searchable and transforms it into a digital document that is completely searchable.

Evolution of Optical Character Recognition (OCR) 

OCR technology first appeared over 100 years ago, when Dr. Edmund Fournier d'Albe invented the Optophone, a reading device that translated letters into sounds for the visually impaired.

 

During the First World War, physicist Emanuel Goldberg invented a machine that could read characters and convert them into telegraph code. Goldberg later repurposed existing technologies to create an early document-retrieval system, a patent for which was later acquired by IBM.

In the 1970s, Ray Kurzweil commercialized “omni-font” OCR, making it possible to process text printed in different fonts. Two decades later, in the 1990s, OCR was popularized by the digitization of historic newspapers. More recently, in the early 2000s, OCR became a largely cloud-based service that, for the first time, made the technology easy to access on desktop and mobile devices.

OCR technology has improved markedly in recent years, and the solutions available today convert documents with a high level of accuracy. The video below by Techquickie provides an excellent overview of the definition of OCR and how the technology has improved over the years.

How Does OCR Work? 

Although the concept of OCR is straightforward, in practice the technology can be challenging to implement due to a number of factors. For example, different fonts and methods of letter formation can make the job of identifying characters more difficult. 

The process of OCR can be divided into image pre-processing, character recognition, and the post-processing of the output. Let’s break down the steps of OCR to better understand how this technology works. 

Step 1: The Document is Scanned

The first step towards success is to make sure the document is correctly aligned when scanned. Having the document’s text lines in horizontal and vertical alignment will greatly improve the efficiency of the process. Of course, if you’re dealing with a digital image like a JPEG, PNG, or PDF, this step is not required, as you already have a “scanned” document to work with.
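To see why alignment matters, consider how software might locate text lines on a page. The sketch below is only an illustration, not the method of any particular OCR product. It assumes the page has already been reduced to ink (`#`) and background (`.`) cells, and the helper names `row_profile` and `count_text_lines` are hypothetical.

```python
def row_profile(image):
    """Count the ink cells ('#') in every row of the page."""
    return [row.count("#") for row in image]


def count_text_lines(image):
    """Count runs of consecutive rows that contain any ink."""
    lines, inside = 0, False
    for ink in row_profile(image):
        if ink and not inside:
            lines += 1
        inside = ink > 0
    return lines


# Two horizontal text lines, cleanly separated by blank rows.
aligned = [
    "........",
    "########",
    "........",
    "........",
    "########",
    "........",
    "........",
    "........",
]

# The same two lines on a page that was scanned at an angle.
skewed = [
    "........",
    "##......",
    "..##....",
    "....##..",
    "##....##",
    "..##....",
    "....##..",
    "......##",
]

aligned_lines = count_text_lines(aligned)
skewed_lines = count_text_lines(skewed)
```

On the aligned page the blank rows separate the two lines, so the count is 2. On the skewed page the lines overlap vertically and are counted as one, which is the kind of error that careful alignment avoids.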

Step 2: Software Refines the Image

Next, the software sets about improving the elements of the document that need to be captured. Edges of letters are smoothed, and any artifacts, imperfections, or dust particles are isolated and removed from the image so that only clear, plain text remains.
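One simple way to remove isolated specks is a median filter, which replaces each pixel with the median of its neighbourhood. The sketch below assumes a greyscale page stored as a list of rows with values from 0 (black) to 255 (white). Real OCR software may well use different or more sophisticated clean-up, so treat the function name `median_filter` and the 3x3 window as illustrative choices.

```python
from statistics import median


def median_filter(image):
    """Return a copy of `image` where each pixel is the median of its 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Collect the pixel and its neighbours, clipping at the page border.
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            result[y][x] = int(median(neighbours))
    return result


# A white page (255) with a single dark dust speck (0) in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0

cleaned = median_filter(page)
```

After filtering, the speck at `cleaned[2][2]` is white again because its eight neighbours outvote it. The same filter would also erode very thin strokes, which is one reason production systems tune this step carefully.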

Step 3: Binarization

Now it is time to align text and convert colors or shades of grey to black and white only. The binarization step not only makes it easier to recognize fonts but also helps to accurately differentiate text (or any image element) from the background. 
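A common way to choose the cut-off between "black" and "white" is Otsu's method, which picks the grey level that best separates the dark pixels from the light ones. The article does not say which method a given OCR product uses, so the sketch below is just one plausible approach. It assumes greyscale values from 0 to 255, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Map each pixel to 1 (ink) or 0 (background) using Otsu's threshold."""
    threshold = otsu_threshold([value for row in image for value in row])
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Light paper (200-220) with a small dark blob of ink (30-40) in the centre.
image = [
    [220, 220, 220, 220],
    [220, 30, 40, 220],
    [200, 35, 30, 200],
    [220, 220, 200, 220],
]

threshold = otsu_threshold([value for row in image for value in row])
binary = binarize(image)
```

Here the threshold falls between the darkest paper value and the lightest ink value, so the four ink pixels become 1 and everything else becomes 0.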

Step 4: Identify the Characters

The next step is to figure out which characters are on the page. The more basic forms of OCR compare the pixels of each scanned letter to an existing font database and identify the closest match. More sophisticated forms of OCR break down each character into constituent elements, such as curves and corners, to match physical features as well as actual letters. 
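The basic, pixel-comparison approach can be sketched in a few lines. The example below assumes each character has already been isolated and scaled to a 3x5 grid of ink (`1`) and background (`0`) cells. The tiny `FONT` table and the function names are made up for illustration, whereas a real font database would be far larger.

```python
# A toy "font database": each letter is a 3x5 grid of ink (1) and background (0).
FONT = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def pixel_difference(glyph, template):
    """Count the cells where the scanned glyph and the template disagree."""
    return sum(
        a != b
        for glyph_row, template_row in zip(glyph, template)
        for a, b in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the letter whose template differs from the glyph in the fewest cells."""
    return min(FONT, key=lambda letter: pixel_difference(glyph, FONT[letter]))


# A scanned "T" with one stray pixel left over from the page.
noisy_t = ["111", "010", "011", "010", "010"]

best_match = recognize(noisy_t)
```

The noisy glyph differs from the `T` template in one cell, from `I` in three, and from `L` in eleven, so `T` is chosen. Feature-based recognition replaces the raw cell comparison with measurements such as strokes, curves, and corners, which makes it less sensitive to font changes.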

Step 5: Ensure Accuracy 

OCR software can further reduce errors by making use of internal dictionaries to cross-reference and ensure higher accuracy. 
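A rough sketch of dictionary-based correction is shown below, using Python's standard `difflib` module to find the closest known word. The word list, the `correct_word` helper, and the 0.8 similarity cut-off are assumptions chosen for the example, not settings taken from any OCR product.

```python
from difflib import get_close_matches

# A tiny stand-in for the much larger dictionary real software would ship with.
DICTIONARY = ["affidavit", "contract", "judgment", "statement", "witness"]


def correct_word(word, cutoff=0.8):
    """Replace `word` with its closest dictionary entry, if one is similar enough."""
    lowered = word.lower()
    if lowered in DICTIONARY:
        return word  # already a known word
    matches = get_close_matches(lowered, DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# "c0ntract" is a typical OCR slip: the letter "o" read as the digit "0".
raw_words = ["c0ntract", "witness", "xyzzy"]
corrected = [correct_word(word) for word in raw_words]
```

The misread `c0ntract` is close enough to `contract` to be corrected, while `xyzzy` has no near match and is left untouched. A cut-off that is too low risks "correcting" names and rare terms into the wrong word, so it is worth tuning against real documents.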

Step 6: Produce an Editable Digital Text File 

The final result is produced: a fully searchable, digital text file that can be manipulated, examined, and edited in any way the owner wishes.

Common Uses of OCR 

OCR has practical, commercial uses in many types of businesses, from data entry and automatic recognition to the conversion of handwritten data. Here are just a few examples from a range of industries.

Banking

Banks are among the main users of OCR, which helps them improve transaction security and risk management. With OCR, banks can accurately extract data from:

  • Checks, capturing the account information and the handwritten amount and signature
  • Mortgage applications, loan documents, and payslips
  • ATMs, to improve security and accuracy in self-service processes

Insurance

Insurance companies use OCR to deliver better customer service and drive performance. Documents can be digitized and claim processing can be automated through OCR and other supporting technologies.

Healthcare

With OCR it is possible to scan, search, and store patients’ medical histories containing reports, X-rays, previous illnesses, treatments, tests, hospital records, and insurance payments. Any hospital record can be swiftly digitized and accessed via OCR, which streamlines workflows and reduces manual admin. 

Legal

The legal industry deals with a lot of paperwork and greatly benefits from OCR technology as a result. Legal firms can digitize a wide range of documents such as handwritten notes, affidavits, judgments, filings, statements, and wills via OCR.

Tourism and Hospitality

OCR can enable guests to self-check-in by scanning their own passport on a hotel website or app.

Retail

OCR technology can prove very helpful to the retail industry, as it allows the capturing of data from packing lists, invoices, purchase orders, and more. It also improves customer experience. Thanks to mobile OCR, customers don’t need to worry about losing vouchers—they can scan serial codes via phones to redeem them.

OCR in eDiscovery and Online Investigations 

OCR also has a crucial role to play in eDiscovery and online investigations. For the purposes of eDiscovery, OCR software can identify and convert text characters from discoverable materials, such as physical contracts, typed letters, JPEGs of photographed documents, and image-only PDFs. Once done, this content can be searched in the same way you would search a Word document—simply type your query into the search bar and you’ll see all references in the document.   

This greatly speeds up the discovery process and reduces costs, since time-consuming human review is no longer needed—digitized information can be instantly searched for keywords, names, dates, etc. 
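Once documents have been converted to text, keyword search becomes a simple programmatic task. The sketch below is a minimal illustration that assumes the OCR output is already available as plain strings. The document names, their contents, and the `search_documents` helper are invented for the example.

```python
import re


def search_documents(documents, query):
    """Return (document name, line number, line) for every line that mentions `query`."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((name, line_number, line.strip()))
    return hits


# Hypothetical OCR output for two discoverable documents.
documents = {
    "contract.txt": "Signed by Jane Doe\nDated 12 March 2020",
    "letter.txt": "Dear Mr. Smith,\nJane doe will attend.",
}

hits = search_documents(documents, "jane doe")
```

The search ignores letter case, so both spellings of the name are found, each with the document and line where it appears. Review platforms typically build indexes rather than scanning every line, but the principle is the same: text that has been through OCR can be queried, while an image cannot.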

Similarly, OCR can cut down on the time it takes to conduct an online investigation, while simultaneously allowing investigators to massively expand the scope of their investigations. 

As an example, Pagefreezer’s WebPreserver automated forensic preservation tool allows you to capture an entire website or social media account (like a Facebook timeline or Instagram account) and export it as fully searchable OCR PDFs. So, instead of collecting what you think you need from a social media account and possibly missing out on valuable evidence, you can simply capture everything, and then use the OCR PDF’s search functionality to identify relevant content.

Want to learn more? Download our white paper, Effective Collection and Forensic Preservation of Online Evidence.

Peter Callaghan
Peter Callaghan is the Chief Revenue Officer at Pagefreezer. He has a very successful record in the tech industry, bringing significant market share increases and exponential revenue growth to the companies he has served. Peter has a passion for building high-performance sales and marketing teams, developing value-based go-to-market strategies, and creating effective brand strategies.
