Supporting Information

Faigenbaum-Golovin et al. 10.1073/pnas.1522200113 (www.pnas.org/cgi/content/short/1522200113)

Introduction

The main goal of the current research was to estimate the minimal number of authors involved in the scripting of the Arad corpus. To deal with this issue, we had to differentiate between authors of different inscriptions. Although relevant algorithms have been proposed in the past (e.g., ref. 34 for incised lapidary texts), our experience shows that most of the solutions are tailor-made for specific corpora. The poor state of preservation of the Arad First Temple period ostraca, and the high variance of their cursive texts of mundane nature, presented difficulties that none of the available methods could overcome (see Fig. 2). Therefore, novel image processing and machine learning tools had to be developed.

The input for our system is the digital images of the inscriptions. The algorithm involves two preparatory stages, leading to a third step that estimates the probability that two given inscriptions were written by the same author. All of the stages are fully automatic, with the exception of the first, semiautomatic, preparatory step.

The basic steps of the algorithm are as follows:

i) Restoring characters via approximation of their composing strokes, represented as a spline-based structure, and estimated by an optimization procedure (for further details see Description of the Algorithm, Character Restoration).

ii) Feature extraction and distance calculation: creation of feature vectors describing the characters' various aspects (e.g., angles between strokes and character profiles); calculating the distance (similarity) between characters (see Description of the Algorithm, Feature Extraction and Distance Calculation).

iii) Testing the hypothesis that two given inscriptions were written by the same author. Upon obtaining a suitable P value (the significance level of the test, denoted as P), we reject the hypothesis of a single author and accept the competing proposition of two different authors; otherwise, we remain undecided (see Description of the Algorithm, Hypothesis Testing).

The next section will present an in-depth description of each of the stages. This will be followed by an experimental section that describes the application of our algorithm to both modern and ancient texts. We verify the validity of our approach by applying the algorithm to modern texts (with a number of contemporary texts written by individuals known to us).

Description of the Algorithm

Character Restoration. The state of preservation of most ostraca is poor at best. After more than two and a half millennia buried in the ground, the inscriptions are often blurry, partially erased, cracked, and stained. However, to analyze the script, clear black and white ("binary") images are required. Theoretically, such depictions of the inscriptions do exist, in the form of manually created facsimiles (drawings of the ostraca), created by epigraphic experts. However, these have been shown to be influenced by the prior knowledge and assumptions of the epigrapher (32). A potential solution for this problem could have been provided by automatic binarization procedures from the domain of image processing. Unfortunately, in our experiments, various binarization methods produced unsatisfactory results (12).

We finally substituted these initial attempts with a semiautomatic approach of individual character restoration. Restoring a character is equivalent to reconstructing its strokes, which are the character's building blocks, and then combining them. Accordingly, henceforth we will discuss the problem of stroke restoration rather than complete character reconstruction. Stroke restoration aims at imitating the reed pen's movement using several manually sampled key points. An optimization of the pen's trajectory is performed for all intermediate sampled points, taking into account information from the noisy character image. A short mathematical description of the procedure follows; for more details and analysis see ref. 14.

A stroke could be referred to as a 2D piecewise smooth curve $(x(t), y(t))$, depending on the parameter $t \in [a, b]$. However, such a representation ignores the stroke's thickness, which is related to the stance of the writing pen toward the document (in our case, a potshard) and to the characteristics of the pen itself. In the case of Iron Age Hebrew, it is well accepted that the scribes used reed pens, which have a flat, rather than pointed, top. This fact makes the writing thickness even more essential to the process of stroke restoration. Therefore, we denote the stroke as a set-valued function:

$$S(t) = \left\{ (p, q) \mid (p - x(t))^2 + (q - y(t))^2 \le r(t)^2 \right\}, \quad t \in [a, b],$$

where $x(t)$ and $y(t)$ represent the coordinates of the center of the pen at $t$, and $r(t)$ stands for the radius of the pen at $t$ (Fig. S1). The corresponding stroke curve is thus

$$\gamma(t) = (x(t), y(t), r(t)), \quad t \in [a, b],$$

whereas the skeleton of the stroke will accordingly be the curve

$$\beta(t) = (x(t), y(t)), \quad t \in [a, b].$$

We note that our model of a written stroke is an approximation, because in reality the top of the reed pen was not necessarily a perfect circle.

Borrowing the idea of minimizing an energy functional (35, 36), we produce an analytic reconstruction of a stroke with respect to a given image $I(p, q)$, $(p, q) \in [1, N] \times [1, M]$. This reconstructed stroke $S^*(t)$ is defined as corresponding to the stroke curve $\gamma^*(t)$, minimizing the following functional:

$$F[\gamma(t)] = c_1 \int_a^b \frac{G_I(t)}{r(t)^2}\, dt + c_2 \int_a^b \frac{1}{\sqrt{r(t)}}\, dt + c_3 \sum_{j=0}^{J-1} \int_{t_j + \varepsilon}^{t_{j+1} - \varepsilon} \left| K(\dot{x}, \dot{y}, \ddot{x}, \ddot{y}) \right| dt,$$

$$\gamma^*(t) = \underset{\gamma(t)}{\operatorname{argmin}}\; F[\gamma(t)],$$

where $G_I(t) = \sum_{(p, q) \in S(t)} I(p, q)$ is the sum of the gray level values of the image $I$ inside the disk $S(t)$; $\gamma(t_j) = (x(t_j), y(t_j), r(t_j))$, $j = 0, \ldots, J$, are manually sampled points on the stroke curve $\gamma(t)$, with respect to the natural parameter $t$; $\dot{x}, \ddot{x}$ and $\dot{y}, \ddot{y}$ denote the first and second derivatives of $x$ and $y$; $K(\dot{x}, \dot{y}, \ddot{x}, \ddot{y}) = (\dot{x}\ddot{y} - \dot{y}\ddot{x}) / (\dot{x}^2 + \dot{y}^2)^{3/2}$ stands for the curvature of the skeleton of the stroke $\beta(t)$; $0 < c_1, c_2, c_3, \varepsilon \in \mathbb{R}$ are parameters, set to $c_1 = 2$, $c_2 = 2{,}000$, $c_3 = 50$, $\varepsilon = 0.01$ in our experiments.

The reconstruction is subject to initial and boundary conditions at (a) the beginning and end of strokes; (b) intersections of strokes; (c) significant extremal points of the curvature; and (d) points with no traces of ink. These conditions are supplied by manual sampling.

The energy minimization problem described above is solved by performing gradient descent iterations on a cubic-spline representation of the stroke (for more details see ref. 14). The end product of the reconstruction is a binary image of the character, incorporating all its strokes.

Fig. S2 presents a restoration of an entire character, stroke by stroke. It can be seen that although the original character image contains several erosions (Fig. S2A), the reconstructed strokes (Fig. S2C) look both smooth and complete, and their union results in a clear letter, adhering to the character image (Fig. S2D).

Feature Extraction and Distance Calculation. Commonly, automatic comparison of characters relies upon features extracted from the characters' binary images. In this study, we adapted several well-established features from the domains of computer vision and document analysis. These features refer to aspects such as the character's overall shape, the angles between strokes, the character's center of gravity, as well as its horizontal and vertical projections. Some of these features correspond to characteristics commonly used in traditional paleography (21).

The feature extraction process includes a preliminary step of the characters' standardization. The steps involve rotating the characters according to their line inclination, resizing them according to a predefined scale, and fitting the results into a padded (at least 10% on each side) square of size $a_L \times a_L$ (with $L = 1, \ldots, 22$ the index of the alphabet letter under consideration). On average, the resized characters were 300 × 300 pixels.

Subsequently, the proximity of two characters can be measured using each of the extracted features, representing various aspects of the characters. For each feature, a different distance function is defined (to be combined at a later stage; discussed below). Table S1 provides a list of the features and distances we use, along with a description of their implementation details.

[...] their respective SDs ($\sigma_{Zernike}$, $\sigma_{DCT}$, etc.) are calculated in a similar fashion.

Therefore, each character $k$ is represented by the following vector (of size $7 \cdot J_L$), concatenating the respective normalized row vectors of the distance matrices:

$$\vec{u}^k = \left( \frac{\vec{u}^k_{SIFT}}{\sigma_{SIFT}} \,\Big\|\, \frac{\vec{u}^k_{Zernike}}{\sigma_{Zernike}} \,\Big\|\, \frac{\vec{u}^k_{DCT}}{\sigma_{DCT}} \,\Big\|\, \frac{\vec{u}^k_{Kd\text{-}tree}}{\sigma_{Kd\text{-}tree}} \,\Big\|\, \frac{\vec{u}^k_{Proj}}{\sigma_{Proj}} \,\Big\|\, \frac{\vec{u}^k_{L1}}{\sigma_{L1}} \,\Big\|\, \frac{\vec{u}^k_{CMI}}{\sigma_{CMI}} \right) \in \mathbb{R}^{7 \cdot J_L}.$$

In this fashion, each character is described by the degree of its kinship to all of the characters, using all of the various features. Finally, the distance between characters $i$ and $j$ is calculated according to the Euclidean distance between their generalized feature vectors:

$$\operatorname{chardist}(i, j) = \left\| \vec{u}^i - \vec{u}^j \right\|_2.$$

The main purpose of this distance is to serve as a basis for clustering at the next stage of the analysis.

Hypothesis Testing. At this stage we address the main question raised above: What is the probability that two given texts were written by the same author? Commonly, similar questions are addressed by posing an alternative null hypothesis $H_0$ and attempting to reject it. In our case, for each pair of ostraca, $H_0$ is that both texts were written by the same author. This is performed by conducting an experiment (detailed below) and calculating [...]
