Experience and Realizations
Working on the project was engaging, both physically and mentally.
Before starting anything, we had to look for possible resources and references to get the gist of what we were about to grapple with. Google quickly became our best friend on this project. We tried countless keyword combinations with ‘document’ and ‘classification’ in them, and Google never failed to surprise us with its results.
For two weeks we read research papers daily, from the simple to the complicated. We also consulted our university’s online database, hoping to find information on what algorithms we could use. Of all the research papers we found, we have only used four so far. We thought the hard part would end there, but then came the installations.
Because our laptops run different operating systems, some APIs only work on Mac OS X, some only on Windows, and the ones that work on both are tedious to install. We also tried using VirtualBox just to make things work, but that failed too. So we proceeded with the research we had found and installed all the tools it required, hoping everything would at least run smoothly. In the process, we soon realized we had to switch gears and find new algorithms to work with, usually because we could not apply an algorithm or were simply stuck. Of course, each new algorithm needed new tools. This cycle continued until what we were doing actually made sense for classifying documents. We presume the cycle has not ended yet, which is perhaps why Google will remain our long-time companion. The good thing is that we were not planning to start from scratch again, yet.

Whenever we try to apply a new technique or algorithm, we make sure to test it. Thanks to our co-interns and friends, who donated their receipts to us, we were able to test the different algorithms every time. After a few days, we also got reimbursement and expense forms from HR, which became part of our test data as well. With these documents, we could review how useful and flexible the algorithms were without needing to buy anything.
We were also able to get our hands on some legal documents from Sir Jeric. Unlike receipts, OP documents have a uniform layout, so instead of testing the features, we opted to test precision and accuracy to accommodate other document types as well. Luckily, it worked just as well.
For this project to work, we needed the drive to keep looking for the solutions that best fit its real-world use. From this principle, we realized we also need to adjust to its future users: the system must be fast, flexible, and feasible. We are still working on how to make those three happen, but it is surely possible, as what we have started will be a great foundation.
Learning Python, TensorFlow & OpenCV
Strategizing on which APIs and materials to use took a long time since, as mentioned before, installation is tedious. Nonetheless, it gave us knowledge that could be of use in the future: learning to code in Python, training an image classifier in TensorFlow, using the OpenCV library, and working with different OCR APIs and programs useful for metadata extraction.
Image processing was a big deal in extracting layouts from various documents. We had to learn algorithms for segmenting documents in order to get their relative layouts. The technical papers offered various algorithms, but we did not use all of them; some would have required more study, and therefore more time. The algorithm we used for layout extraction is the Run Length Smearing Algorithm (RLSA), a simple algorithm which, as the name suggests, smears lines of text in order to merge adjacent lines into text blocks instead of text lines. We then used OpenCV’s contour detection to get the bounding boxes of the different text blocks, so that we could divide them and classify them with TensorFlow more easily. Although the flow works, the output varies depending on the thresholds used in RLSA, since different documents have different line spacings and layouts. RLSA can smear together distinct blocks or lines of text if the threshold is not set with respect to the spacing.
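To make the smearing idea concrete, here is a minimal, pure-Python sketch of horizontal RLSA on a binary image (1 = ink, 0 = background). The function name and threshold value are our own illustration, not the project's actual implementation, which operated on OpenCV images.

```python
def rlsa_horizontal(image, threshold):
    """Fill background runs shorter than `threshold` that sit between two ink pixels."""
    smeared = [row[:] for row in image]  # work on a copy
    for row in smeared:
        last_ink = -1  # column of the most recent ink pixel in this row
        for x, pixel in enumerate(row):
            if pixel == 1:
                # Close the gap only if it is bounded by ink and short enough.
                if last_ink != -1 and 0 < x - last_ink - 1 <= threshold:
                    for g in range(last_ink + 1, x):
                        row[g] = 1
                last_ink = x
    return smeared

# Two words separated by a 2-pixel gap merge with threshold=3,
# while the 6-pixel gap to the next block survives.
line = [[1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]]
print(rlsa_horizontal(line, 3))
```

This is exactly why the threshold matters: set it too high and the 6-pixel gap would be filled too, fusing two separate blocks into one.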
Coding in Python was a bit rough for us at first, since we were used to coding in Java at our university. We had to learn Python’s basic syntax and what it can do through the mature libraries it offers. OpenCV takes Python to a whole new level, providing many methods we can use for image processing. Drawing rectangles, circles, and text on an image with OpenCV is convenient, since its functions accept parameters that are easy to manipulate, which is useful when drawing objects automatically.
In our project, we used rectangles to block off the different blocks of text, each in a different color, so that we could tell the blocks apart. From those rectangles, we extract the corresponding regions of the original photo to be used for classifying the various parts of the document (title, body, footer).
Python was also recommended by our professor at De La Salle University: according to him, Python has OpenCV support and a number of libraries that are easy to install on Mac OS X, and the language is easy to understand. According to our research, it is also easy to install on Linux, since the Linux terminal is similar to what Mac OS X offers.
Using Different OCR APIs for Document Classification
Getting TensorFlow to classify images took us a while. Between reading the TensorFlow documentation, watching YouTube tutorials, and reading StackOverflow issues about the Inception model, we finally managed to train our own image classifier. However, we did not want to simply classify whole images of documents with an image classifier; instead, we wanted to focus on the features present in a document. For example, a receipt’s layout has various relative parts, such as the header, the item list, and the footer, which we can use to decide whether a document is a receipt or not. With OpenCV and TensorFlow together, we can achieve the desired result, but we must have a clean extraction of the layout to train a good model for classifying the different parts of a document.
We also ventured into Tesseract, a free OCR engine backed by Google. Although Tesseract performs well on clear screenshots of documents, its performance varies on scanned documents. Tesseract also provides a function that returns the coordinates of lines of text, blocks of text, and so on. We tried it and, as mentioned, its performance on scanned documents still varies, while it is excellent on screenshots. However, it could not detect images, so we had to resort to another API or algorithm to extract the layout.
Despite having to learn a lot of different APIs, we were still able to use everything we learned, since the pieces work hand in hand: OpenCV performs the image processing for document segmentation, then forwards the blocks to TensorFlow for feature classification. Finally, once the features are identified, an OCR API could be applied, but we did not bother with Tesseract since it can be inaccurate at times. A better OCR engine such as ABBYY could be used, but the SDK it provides is not free, so we did not have the chance to try it.
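The segment-classify-extract flow described above can be sketched as three stages glued together. All three stage functions here are stand-ins we made up for illustration (a positional rule instead of a trained model, plain strings instead of image regions); in the real pipeline, OpenCV, TensorFlow, and an OCR engine fill these roles.

```python
def segment(page):
    # Stand-in for OpenCV segmentation: treat each "line" as one block.
    return list(enumerate(page))

def classify(block):
    # Stand-in for the TensorFlow model: a crude positional rule.
    index, _ = block
    return "header" if index == 0 else "body"

def ocr(block):
    # Stand-in for an OCR engine: our toy "image" is already text.
    _, text = block
    return text

def process(page):
    """Group the OCR output of each block under its predicted label."""
    results = {}
    for block in segment(page):
        results.setdefault(classify(block), []).append(ocr(block))
    return results

page = ["ACME STORE", "1x coffee 120.00", "1x donut 45.00"]
print(process(page))
```

The design point is that each stage only depends on the previous stage's output, so any one of them (say, swapping Tesseract for ABBYY in `ocr`) can be replaced without touching the rest.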
Darlene Psalm Marpa
John Patrick Tobias
Related Article: Internship Experience: Chatbot App Project