Ancient historical manuscripts are a rich source of history and civilization. They consist of several different patterns or layers of information.
Unfortunately, these documents are often affected by age- and storage-related degradation, which impairs their readability and information content. Most commonly, the damage appears as humidity spots, mold, or ink that has seeped through from the reverse side, obscuring the main text.
In a new study published in PLoS ONE, scientists proposed a document restoration method that removes unwanted interfering degradation patterns from ancient color manuscripts. The method works on single-sided RGB images of the manuscripts, thus avoiding the need for recto-verso alignment, and analyzes their content by individually detecting and locating the constituent patterns.
Virtual restoration is then obtained by inpainting the undesired interfering content with the background texture.
Scientists noted, “It is straightforward to see that such an approach can also facilitate the execution of other tasks besides virtual restoration, for instance, document binarization or text extraction, and geometrical and logical page layout analysis.”
“Unlike the binary restoration, the main focus is to restore the aesthetic look of the manuscript, which is important in processing ancient documents. We thus combine three different color space information to create a feature space that can capture all the necessary information to discriminate the classes (foreground text, background medium, and several other different information/degradation patterns) by highlighting the differences, even slight, in their spectral responses.”
In particular, scientists associated each pixel with its representations in the RGB, CIELUV, and CIELAB color spaces, along with its spatial location in the image. The spatial smoothness constraint enforced by pixel spatial information is particularly suited to describe the homogeneity of color usually observed in typical manuscript patterns.
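As a rough illustration of such a per-pixel feature space, the sketch below stacks RGB, CIELUV, CIELAB, and normalized pixel coordinates into one 11-dimensional vector per pixel. It is not the study's code: the sRGB/D65 conversion constants, the scaling of the coordinates to [0, 1], and the function names (`rgb_to_xyz`, `xyz_to_lab`, `xyz_to_luv`, `pixel_features`) are all assumptions made for this example.

```python
import numpy as np

# Assumed for this sketch: sRGB primaries with a D65 reference white.
M = np.array([[0.4124564, 0.3575761, 0.1804375],
              [0.2126729, 0.7151522, 0.0721750],
              [0.0193339, 0.1191920, 0.9503041]])
WHITE = np.array([0.95047, 1.0, 1.08883])

def rgb_to_xyz(rgb):
    """Convert sRGB values in [0, 1] to CIE XYZ."""
    rgb = np.asarray(rgb, dtype=float)
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    return linear @ M.T

def _f(t):
    # Piecewise cube-root function shared by CIELAB and CIELUV lightness.
    d = 6.0 / 29.0
    return np.where(t > d ** 3, np.cbrt(t), t / (3 * d ** 2) + 4.0 / 29.0)

def xyz_to_lab(xyz):
    fx, fy, fz = np.moveaxis(_f(xyz / WHITE), -1, 0)
    return np.stack([116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)], axis=-1)

def xyz_to_luv(xyz):
    x, y, z = np.moveaxis(xyz, -1, 0)
    denom = x + 15 * y + 3 * z
    safe = np.where(denom > 0, denom, 1.0)          # avoid 0/0 on pure black
    u_p = np.where(denom > 0, 4 * x / safe, 0.0)
    v_p = np.where(denom > 0, 9 * y / safe, 0.0)
    xn, yn, zn = WHITE
    dn = xn + 15 * yn + 3 * zn
    lightness = 116 * _f(y / yn) - 16
    u = 13 * lightness * (u_p - 4 * xn / dn)
    v = 13 * lightness * (v_p - 9 * yn / dn)
    return np.stack([lightness, u, v], axis=-1)

def pixel_features(image):
    """Return an (H*W, 11) array: RGB, CIELUV, CIELAB, then x and y in [0, 1]."""
    image = np.asarray(image, dtype=float)
    h, w, _ = image.shape
    xyz = rgb_to_xyz(image)
    rows, cols = np.mgrid[0:h, 0:w]
    xy = np.stack([cols / max(w - 1, 1), rows / max(h - 1, 1)], axis=-1)
    feats = np.concatenate([image, xyz_to_luv(xyz), xyz_to_lab(xyz), xy], axis=-1)
    return feats.reshape(h * w, -1)
```

Calling `pixel_features(img)` on an H×W×3 float image yields one row per pixel, ready for a clustering step. How strongly the two spatial columns are weighted against the color columns is a design choice that this sketch leaves at a plain [0, 1] scaling.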
Gaussian mixture model (GMM) based clustering is used for pixel-based classification. To improve and speed up GMM clustering, the scientists first performed principal component analysis (PCA) of the initial data space to decorrelate it and reduce its dimension with little loss of information.
The team found that PCA is particularly beneficial for the clustering step: it eliminates correlations between the data and improves the quality of segmentation. The PCA components captured the essential information and organized it in a more coherent way.
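The PCA-then-cluster idea can be sketched in a few lines of NumPy. This is a simplified stand-in, not the study's implementation: the mixture here uses diagonal covariances and a deterministic quantile-based initialization, both of which are assumptions made to keep the example short and reproducible, and the names `pca_reduce` and `gmm_fit_predict` are hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project centered features onto the top principal axes (decorrelates them)."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def gmm_fit_predict(data, n_clusters, n_iter=50):
    """Fit a diagonal-covariance GMM with EM and return a hard label per row."""
    n, _ = data.shape
    # Assumed initialization: means taken at evenly spaced quantiles of column 0.
    order = np.argsort(data[:, 0])
    picks = np.linspace(0, n - 1, n_clusters + 2).astype(int)[1:-1]
    means = data[order[picks]].copy()
    variances = np.tile(data.var(axis=0) + 1e-6, (n_clusters, 1))
    weights = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        # E-step: log(weight * Gaussian density) for each sample and component.
        diff = data[:, None, :] - means[None, :, :]
        log_p = -0.5 * np.sum(diff ** 2 / variances
                              + np.log(2 * np.pi * variances), axis=2)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances from the responsibilities.
        mass = resp.sum(axis=0) + 1e-12
        weights = mass / n
        means = (resp.T @ data) / mass[:, None]
        diff = data[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / mass[:, None] + 1e-6
    return resp.argmax(axis=1)
```

A typical call would be `labels = gmm_fit_predict(pca_reduce(features, 3), n_clusters=4)`, where the number of retained components and the number of classes are illustrative values rather than figures from the study. The E-step builds an n × k × d array, so a full-resolution scan would need to be subsampled or processed in chunks.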
After segmentation, a virtually restored image of the manuscript with all its informative content is generated by selectively replacing the detected degradation pixels with appropriate fill-in pixels that reproduce the textured background.
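The article does not spell out how the fill-in pixels are chosen, so the following is only a hypothetical, minimal version of the replacement step: each pixel labeled as degradation is overwritten with a randomly chosen background pixel from its neighborhood. The function name `fill_with_background`, the `radius` parameter, and the random-sampling strategy are assumptions for illustration.

```python
import numpy as np

def fill_with_background(image, labels, degraded, background, radius=5, seed=0):
    """Replace pixels labeled `degraded` with nearby pixels labeled `background`."""
    rng = np.random.default_rng(seed)
    restored = image.copy()                      # leave the input untouched
    h, w = labels.shape
    all_bg = np.argwhere(labels == background)
    if len(all_bg) == 0:
        raise ValueError("no background pixels to sample from")
    for r, c in np.argwhere(labels == degraded):
        # Look for background pixels in a square window around (r, c).
        r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
        local = np.argwhere(labels[r0:r1, c0:c1] == background) + (r0, c0)
        # Fall back to any background pixel if the window contains none.
        candidates = local if len(local) else all_bg
        sr, sc = candidates[rng.integers(len(candidates))]
        restored[r, c] = image[sr, sc]
    return restored
```

Sampling from the original image rather than from already-filled pixels keeps the fill independent of processing order. Text pixels carry a different label and are never touched, which matches the goal of preserving the informative content.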
Scientists noted, “The results show that the proposed method can be satisfactorily used to remove the interference commonly found in ancient manuscripts and to extract typical salient features.”
To evaluate the new method, the scientists ran experiments on ancient color document images. They compared its results with those of a recently published method for removing bleed-through, one of the most damaging degradations in ancient manuscripts.
For the comparison, the scientists used images from a well-known database of ancient documents, which contains 25 pairs of recto-verso images of ancient manuscripts affected by different levels of bleed-through, along with manually created ground truth binary images of the foreground text.
Scientists noted, “It is worth mentioning that, while this database mainly focuses on bleed-through effects, our method can be used to remove also other document degradations, such as stains, folding marks, etc.”