Form processing: Automatic recognition of ID numbers in scanned forms

by Ronen Ben Zino and Benny Shamil
Supervised by Johanan Erez

Introduction: What is form processing?

Form processing is a process whereby information entered into data fields is converted into electronic form:

  • Entered data are "captured" from their respective fields.
  • Forms themselves are digitized and saved as images.

In most cases forms processing is considered complete when the data from all the forms have been captured, verified and saved into a database. It is also essential that the integrity of the captured data is preserved.


Forms can be processed manually or using form processing software. The advantages of form processing using computer software are very clear.

The aim of the project

In this project we were asked to develop an automatic solution for retrieving, on request, the scanned exam form of a particular student from a whole database of such forms. The form had to be identified by recognizing the student's personal ID number on its title page. On each form, the student marks this ID number by checking the corresponding digits in a table (see the examples below).
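To illustrate how an ID number can be read from such a table, here is a minimal sketch. It is written in Python for illustration only and is not the project's code. It assumes a table with one column per ID position and one row per digit (0 to 9), and it assumes that each cell has already been reduced to a fill ratio, the fraction of dark pixels in the cell. The function names and both thresholds are hypothetical.

```python
def read_id(fill, min_fill=0.3, min_gap=0.15):
    """Decode an ID from a table of per-cell fill ratios.

    fill[d][p] is the fraction of dark pixels (0.0 to 1.0) in the cell for
    digit d (row) at ID position p (column). The layout and both thresholds
    are assumptions made for this sketch.
    Returns the ID as a string, with '?' for columns that cannot be read.
    """
    n_digits = len(fill)
    n_positions = len(fill[0])
    result = []
    for p in range(n_positions):
        column = [fill[d][p] for d in range(n_digits)]
        # Rank the digits of this column from darkest to lightest cell.
        ranked = sorted(range(n_digits), key=lambda d: column[d], reverse=True)
        best, second = ranked[0], ranked[1]
        # Accept the darkest cell only if it is dark enough and clearly
        # darker than the runner-up; otherwise flag the column.
        clear = (column[best] >= min_fill
                 and column[best] - column[second] >= min_gap)
        result.append(str(best) if clear else "?")
    return "".join(result)


def make_grid(id_string, mark=0.8, background=0.05):
    """Build a synthetic fill-ratio table for a given ID (test helper)."""
    return [[mark if int(ch) == d else background for ch in id_string]
            for d in range(10)]


grid = make_grid("031415926")
grid[7][0] = 0.75  # a second, almost equally dark mark in the first column
print(read_id(grid))  # prints ?31415926
```

Flagging a column instead of guessing lets a form with an unclear or corrected mark be sent for manual review rather than being filed under a wrong ID.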


Problems that we had to overcome

Forms filled in by humans can present many obstacles, such as a skewed image, unclear marks, deleted marks and so on.

Once these obstacles are overcome, the scanned form can be segmented and labeled, and the student's ID number can be extracted from the table of marks.
Thus, our project was divided into two main parts:

1.        Fixing and cropping the relevant check box area

2.        Segmenting the check box area and extracting the ID number

The solution

General scheme

·         The form is scanned into a known format (jpg, bmp).

·         The image is read and cropped roughly.

·         An algorithm is applied to find the image skew from the origin.

·         The skew is fixed using a correlation algorithm in order to detect the exact location of the check boxes.


·         The check box area is segmented and labeled.

·         The marked squares are detected and the ID number is extracted; each mark has a value in the range of 1-90.
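Two of the steps above, locating the check boxes by correlation and turning mark values into an ID, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the project's MATLAB code. It assumes binary images stored as lists of rows of 0/1 pixels, and it assumes that the 90 cells are numbered column by column, so that values 1 to 10 stand for digits 0 to 9 of the first ID position. The report does not state the numbering order, and all function names are hypothetical.

```python
def locate(image, template):
    """Return (row, col) of the best match of template inside image.

    Pixels are mapped from 0/1 to -1/+1 and a plain cross-correlation is
    computed at every offset, so the score is the number of agreeing
    pixels minus the number of disagreeing ones.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_score, best_pos = None, None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                (2 * image[r + i][c + j] - 1) * (2 * template[i][j] - 1)
                for i in range(th)
                for j in range(tw)
            )
            if best_score is None or score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


def label_to_cell(label, digits_per_position=10):
    """Map a mark value in 1..90 to (ID position, digit).

    Assumes column-by-column numbering; this is an assumption of the sketch.
    """
    if not 1 <= label <= 90:
        raise ValueError("mark value must be in the range 1-90")
    return divmod(label - 1, digits_per_position)


def id_from_labels(labels, n_positions=9):
    """Assemble the ID string from the values of the detected marks."""
    digits = ["?"] * n_positions  # '?' marks a position with no detected mark
    for label in labels:
        position, digit = label_to_cell(label)
        digits[position] = str(digit)
    return "".join(digits)


# Usage: find a small corner pattern, then decode nine mark values.
template = [[1, 0],
            [1, 1]]
image = [[0] * 6 for _ in range(5)]
image[2][3], image[3][3], image[3][4] = 1, 1, 1
print(locate(image, template))  # prints (2, 3)
print(id_from_labels([1, 14, 22, 35, 42, 56, 70, 73, 87]))  # prints 031415926
```

The exhaustive search is adequate for a roughly cropped region; on full-size scans a library routine for normalized cross-correlation would be the more practical choice.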


The project was developed with MATLAB version 7.1, on a PC platform.

The forms were scanned using the scanner that is used for scanning the exam forms at the computer administration center of the EE faculty.


We developed and implemented a program which extracts the ID number from scanned exam forms, as used in the EE faculty.

The program was tested on 100 exam forms (both B&W and color forms, including many problematic forms with the obstacles described above). On these forms, an accuracy of 100% was achieved. We believe that our program is ready to be used for regular work with scanned forms of the type that we worked with.

As a follow-up project, a user-friendly graphical interface shall be built that integrates the process of scanning, automatic identification of the student's ID, and encoding of the ID in the filename of the scanned exam.



We are grateful to our project supervisor Johanan Erez for his help and guidance throughout the work, and to Shula Fine from the EE computer administration center, who gave technical support with the scanning of the forms. We are also grateful to the Ollendorff Minerva Center Fund for supporting this project.