Removal of clutter from historical scanned documents

An innovation project, led by Lumex A/S and partly financed by the Norwegian Research Council, aims to improve optical character recognition (OCR) in noisy historical documents. The goal is to develop solutions that detect, locate and characterize clutter, and then apply adapted OCR to the regions that contain it. The clutter may stem from tears, cracks and aging of the paper, or from stamps and annotations that were deliberately added. Ink smears and blobs from the printing process are also frequent.

NR contributes to the project by developing novel methods for removing various types of clutter from such images. The images below show results where clutter has been automatically located and marked.

   

Detected clutter marked in red.
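
As a rough illustration of what locating and marking clutter can involve, the sketch below labels connected groups of ink pixels in a tiny binary image and marks every group whose area exceeds a character-sized limit. This is not the method developed in the project: the size-based rule, the function names `find_components` and `mark_clutter`, the `max_char_area` parameter and the toy `page` grid are all assumptions made for the example.

```python
from collections import deque


def find_components(grid):
    """Return the 4-connected groups of ink pixels (value 1) in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel connected to (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def mark_clutter(grid, max_char_area):
    """Mark components larger than max_char_area (an assumed heuristic).

    Returns a copy of the grid with clutter pixels set to 2 (standing in
    for "red") and a list of bounding boxes (top, left, bottom, right).
    """
    marked = [row[:] for row in grid]
    boxes = []
    for pixels in find_components(grid):
        if len(pixels) > max_char_area:
            ys = [y for y, _ in pixels]
            xs = [x for _, x in pixels]
            boxes.append((min(ys), min(xs), max(ys), max(xs)))
            for y, x in pixels:
                marked[y][x] = 2
    return marked, boxes


# Hypothetical toy page: three small "character" strokes and one larger blob.
page = [
    [1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 1],
]
marked, boxes = mark_clutter(page, max_char_area=4)
```

Real scans need far more than an area threshold, since stamps, cracks and annotations often overlap or resemble text, but the example shows the basic locate-then-mark structure: each flagged region comes with a bounding box that a later, adapted OCR step could treat differently.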

Department

Partners

  • Lumex A/S
  • The National Archives of Norway