Optical character recognition

Optical character recognition, usually abbreviated to OCR, is the use of computer systems to translate images of typewritten text (usually captured by a scanner) into machine-editable text; that is, to translate pictures of characters into a standard encoding scheme that represents them (usually ASCII in the case of English text). OCR began as a field of research in artificial intelligence and machine vision. Academic research in the field continues, but the focus has shifted to the implementation of proven techniques.

Early systems required "training" (essentially, the provision of known samples of each character) to read a specific font. "Intelligent" systems that can recognize most fonts with a high degree of accuracy are now common. Some systems can even identify columns and non-textual images correctly and produce output in which the text and scanned images are laid out as they were on the original page.
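
The "training" described above can be illustrated with a minimal sketch of template matching, in which each known sample is stored as a small bitmap and an unknown glyph is assigned to the stored character it differs from least. This is only one simple way to use known samples, not a description of any particular historical system; the 3x5 bitmaps, the three-letter "font" and the function names are all invented for illustration.

```python
# Hypothetical "training data": one known 3x5 bitmap per character of a
# single font. "1" is an inked pixel, "0" is background.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def pixel_distance(glyph_a, glyph_b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(glyph_a, glyph_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return the character whose stored sample is closest to the glyph."""
    return min(templates, key=lambda ch: pixel_distance(glyph, templates[ch]))


# A scanned "T" with one stray inked pixel in the fourth row.
noisy_t = ("111", "010", "010", "011", "010")
print(recognize(noisy_t))  # prints "T"
```

Because the comparison is made pixel by pixel against samples of one font, a sketch like this would misread the same letters in a different typeface, which is why such systems had to be trained for each font.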

The United States Postal Service has been using OCR machines to pre-sort mail since 1965. Mail sorting plays only a small role in OCR research, because the OCR systems need only read the ZIP code (postal code) on each envelope. After the ZIP code has been read, a barcode carrying the same information is printed on the envelope. Envelopes marked with the machine-readable barcode can then be processed further, since machine-readable codes can be decoded more quickly than human-readable letters and numbers.
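
The step from a recognized ZIP code to a printed barcode can be sketched as follows. The article does not say which barcode scheme is used; this sketch assumes a POSTNET-style encoding, in which every digit becomes five bars (two tall, three short), a check digit brings the digit sum to a multiple of ten, and a tall bar frames each end. The function names are illustrative, and "1" and "0" stand for tall and short bars.

```python
# Assumed POSTNET digit patterns: "1" is a tall bar, "0" is a short bar.
POSTNET_DIGITS = {
    "0": "11000", "1": "00011", "2": "00101", "3": "00110", "4": "01001",
    "5": "01010", "6": "01100", "7": "10001", "8": "10010", "9": "10100",
}
BARS_TO_DIGIT = {bars: digit for digit, bars in POSTNET_DIGITS.items()}


def check_digit(zip_code):
    """Digit that makes the sum of all digits a multiple of ten."""
    return (10 - sum(int(d) for d in zip_code) % 10) % 10


def encode_postnet(zip_code):
    """Turn a string of digits into a framed string of tall/short bars."""
    if not (zip_code.isascii() and zip_code.isdigit()):
        raise ValueError("ZIP code must contain only the digits 0-9")
    digits = zip_code + str(check_digit(zip_code))
    return "1" + "".join(POSTNET_DIGITS[d] for d in digits) + "1"


def decode_postnet(bars):
    """Recover the ZIP code from a bar string, verifying the check digit."""
    if len(bars) < 12 or bars[0] != "1" or bars[-1] != "1":
        raise ValueError("missing frame bars")
    body = bars[1:-1]
    if len(body) % 5 != 0:
        raise ValueError("bar count is not a multiple of five")
    groups = [body[i:i + 5] for i in range(0, len(body), 5)]
    if any(group not in BARS_TO_DIGIT for group in groups):
        raise ValueError("unrecognized bar pattern")
    digits = "".join(BARS_TO_DIGIT[group] for group in groups)
    if sum(int(d) for d in digits) % 10 != 0:
        raise ValueError("check digit mismatch")
    return digits[:-1]


barcode = encode_postnet("12345")
print(barcode)
print(decode_postnet(barcode))  # prints "12345"
```

Decoding here is a table lookup over fixed groups of five bars plus one checksum, which suggests why a barcode can be read more quickly and reliably than printed or handwritten digits.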

Whilst the accurate recognition of European typewritten text is now considered largely a solved problem, the recognition of handwriting in general, and of printed text in some other scripts, particularly those with a very large number of characters, is still the subject of research.