Supplementary data

Table S1. (part 1) Detailed list of 62 minerals used in research. C refers to Class (Table 1). Bold font in names refers to the short names used in figures.

| Mineral name | C | Empirical formula | Mineral name | C | Empirical formula |
|---|---|---|---|---|---|
| algodonite | 2 | Cu6As | | 5 | Cu3(CO3)2(OH)2 |
| antimonpearceite | 2 | Ag12Cu4Sb1.5As0.5S11 | | 5 | Cu2(CO3)(OH)2 |
| bornite | 2 | Cu5Fe2+S4 | | 7 | Cu4(SO4)(OH)6 |
| bournonite | 2 | PbCuSbS3 | copiapite | 7 | Fe2+Fe3+4(SO4)6(OH)2•20(H2O) |
| | 2 | CuFe2+S2 | | 7 | CaCu4(OH)6(SO4)2•3(H2O) |
| colusite | 2 | Cu12.5V5+As1.5Sb0.9Sn0.6Ge0.3S16 | | 7 | Pb2Cu(AsO4)(CrO4)(OH) |
| covellite | 2 | CuS | | 7 | Cu4(SO4)(OH)6•2(H2O) |
| digenite | 2 | Cu9S5 | nakauriite | 7 | Mn2+4Ni3Cu(SO4)4(CO3)(OH)6•48(H2O) |
| | 2 | Cu3AsS4 | natrochalcite | 7 | NaCu2(SO4)2(OH)•(H2O) |
| freibergite | 2 | Ag7.2Cu3.6Fe2+1.2Sb3AsS13 | osarizawaite | 7 | PbCuAl2(SO4)2(OH)6 |
| germanite | 2 | Cu26Fe2+4Ge4S32 | | 7 | Cu4(SO4)(OH)6•(H2O) |
| gladite | 2 | PbCuBi5S9 | | 7 | Pb2Cu(CrO4)(PO4)(OH) |
| idaite | 2 | Cu5Fe2+S6 | arthurite | 8 | CuFe3+22(AsO4)0.72(PO4)0.22(SO4)0.1O1.5(OH)0.5•4(H2O) |
| jaskolskiite | 2 | Pb2.2Cu0.2Sb1.2Bi0.6S5 | chenevixite | 8 | Cu2Fe3+2(AsO4)2(OH)4•(H2O) |
| krupkaite | 2 | PbCuBi3S6 | | 8 | Cu3(AsO4)(OH)3 |
| seligmannite | 2 | PbCuAsS3 | | 8 | CaCu(AsO4)(OH) |
| stannite | 2 | Cu2Fe2+SnS4 | cornetite (co0) | 8 | Cu3(PO4)(OH)3 |
| stromeyerite | 2 | CuAgS | | 8 | Cu5(AsO4)2(OH)4•(H2O) |
| tetrahedrite | 2 | Cu9Fe2+3Sb4S13 | | 8 | PbZn(VO4)(OH) |
| umangite | 2 | Cu3Se2 | | 8 | PbCu(AsO4)(OH) |
| boleite | 3 | KAg9Cu24Pb26Cl62(OH)48 | | 8 | Cu2(PO4)(OH) |
| kinoite | 3 | Ca2Cu2Si3O10•2(H2O) | | 8 | PbCu(VO4)(OH) |
| | 4 | Cu2O | | 8 | Cu2(AsO4)(OH) |
| delafossite | 4 | CuFe3+O2 | parnauite | 8 | Cu9(AsO4)2(SO4)(OH)10•7(H2O) |
| | 4 | CuO | | 8 | Cu5(PO4)2(OH) |
| delafossite | 4 | CuFe3+O2 | turquoise | 8 | CuAl6(PO4)4(OH)8•4(H2O) |

Table S1. (part 2) Detailed list of 62 minerals used in research. C refers to Class (Table 1). Bold font in names refers to the short names used in figures.

| Mineral name | C | Empirical formula | Mineral name | C | Empirical formula |
|---|---|---|---|---|---|
| richelsdorfite | 8 | Ca2Cu5Sb(AsO4)4Cl(OH)6•6(H2O) | creaseyite | 9 | Pb2Cu2Fe3+1.75Al0.25Si5O17•6(H2O) |
| | 8 | Pb2Cu(PO4)(SO4)(OH) | | 9 | CuSiO2(OH)2 |
| | 8 | CaCu5(AsO4)2(CO3)(OH)4•6(H2O) | halloysite | 9 | Al2Si2O5(OH)4 |
| | 9 | K2.25Na1.75Cu20Al3Si29O76(OH)16•8(H2O) | | 9 | Cu8Si8O22(OH)4•(H2O) |
| allophane | 9 | (Al2O3)(SiO2)1.3•2.5(H2O) | vesuvianite | 9 | Ca10Mg2Al4(Si2O7)2(SiO4)5(OH)4 |
| | 9 | Cu1.75Al0.25H1.75(Si2O5)(OH)4•0.25(H2O) | | | |

Figure S1. The number of measurement points (MPs) taken per mineral, together with their partition into possible reference MPs and validation‐only MPs.

Figure S2. The number of different (exclusive) rock samples on which MPs were taken – listed per mineral.

Figure S3. Heatmap of the elemental composition of all 62 minerals according to each mineral’s empirical formula [24]. The minerals are sorted from left to right by the shortest N‐dimensional distance between each pair of minerals, where the first mineral placed is the one closest to the origin of the PCA space.

Figure S4. NTF model: a nonnegative 3‐way tensor Y is represented by a tensor‐matrix product of the latent component matrices 𝑈⁽¹⁾, 𝑈⁽²⁾, 𝑈⁽³⁾ with an identity tensor I, or equivalently by a sum of J rank‐1 tensors, each obtained as an outer product of the latent components.
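The equivalence stated in the caption of Figure S4 can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the matrix names `U1`, `U2`, `U3` and the toy values are all assumptions chosen for illustration.

```python
from itertools import product


def ntf_as_rank1_sum(U1, U2, U3):
    """Y[a][b][c] = sum over j of U1[a][j] * U2[b][j] * U3[c][j]."""
    J = len(U1[0])
    return [[[sum(U1[a][j] * U2[b][j] * U3[c][j] for j in range(J))
              for c in range(len(U3))]
             for b in range(len(U2))]
            for a in range(len(U1))]


def ntf_as_core_product(U1, U2, U3):
    """Same tensor via an identity core: core[p][q][r] is 1 only when p == q == r."""
    J = len(U1[0])
    core = [[[1.0 if p == q == r else 0.0 for r in range(J)]
             for q in range(J)]
            for p in range(J)]
    return [[[sum(core[p][q][r] * U1[a][p] * U2[b][q] * U3[c][r]
                  for p, q, r in product(range(J), repeat=3))
              for c in range(len(U3))]
             for b in range(len(U2))]
            for a in range(len(U1))]


# Toy nonnegative factors with J = 2 latent components (illustrative values only).
U1 = [[1.0, 2.0], [0.0, 1.0]]
U2 = [[1.0, 0.0], [2.0, 1.0]]
U3 = [[1.0, 1.0]]

# Both formulations yield the same 2 x 2 x 1 tensor.
assert ntf_as_rank1_sum(U1, U2, U3) == ntf_as_core_product(U1, U2, U3)
```

Because the identity core is zero everywhere except on its superdiagonal, every term with unequal indices vanishes, which is why the triple sum collapses to the sum of J rank‐1 terms.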

[Figure S5 panels: (a) J = 2, (b) J = 3, (c) J = 4. Bar charts of contribution (vertical axis, 0 to 1) against latent spectra index (horizontal axis); legend: Mineral, U. Bar values: (a) 0.85, 0.15; (b) 0.9, 0.06, 0.04; (c) 0.8, 0.16, 0.03, 0.01.]

Figure S5. Example of deciding the number of latent spectra Jm for the m‐th object. L is set to 0.95, R to 2.0 and the maximum latent division 𝐽 to 10:

a) First iteration: Jm equals 2. Assuming that one spectrum always needs to be left as U, the sum of the coefficients α of the first K latent spectra (K can only equal 1 in this case) covers only 0.85, so the L‐condition is not met and Jm is incremented.

b) Second iteration: Jm equals 3. Assuming that one spectrum always needs to be left as U, the minimum sum of the coefficients α of the first K latent spectra that meets L is 0.96 for K = 2, so the L‐condition is met. Spectra 1 and 2 are assigned as mineral while spectrum 3 is assigned as U. Then the second condition, the R‐condition, is checked. The ratio between the last mineral spectrum (blue) and the first U spectrum (orange) is 1.5, which does not meet the R value, hence Jm is incremented. It is worth noting that the small contributions of the second spectrum (6 percent) and the third spectrum (4 percent) compared with the first spectrum (90 percent) make it unclear whether both the second and third spectra should rather be U. In that case, however, the L‐condition would not be met, so the given L/R pair does not work for Jm = 3.

c) Third iteration: Jm equals 4. Assuming again that one spectrum always needs to be left as U, the minimum sum of the coefficients α of the first K latent spectra that meets the minimum contribution L is 0.96 for K = 2, so the L‐condition is met. Spectra 1 and 2 are assigned as mineral while spectra 3 and 4 are assigned as U. Then the R‐condition is checked. The ratio between the last mineral spectrum (blue) and the first U spectrum (orange) is 5.3, which meets the minimum R‐condition of 2.0. The loop is broken and the solution for the m‐th object with Jm = 4 is given. In this case, two spectra are assigned as mineral (and are later labeled as the m‐th object in regression) and two spectra with minor contributions are labeled as U.
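The three iterations above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' code: `split_mineral_vs_u`, `choose_jm` and the callable `contributions_for` are hypothetical names, and the decomposition that would produce the coefficients α for a given J is replaced by a lookup of the values read from Figure S5.

```python
def split_mineral_vs_u(alpha, L, R):
    """Return K, the number of leading spectra labelled as mineral, or None.

    alpha: contributions sorted in descending order.
    At least one spectrum is always left as U, so K is at most len(alpha) - 1.
    """
    cumulative = 0.0
    for K in range(1, len(alpha)):
        cumulative += alpha[K - 1]
        if cumulative >= L:
            # L-condition met with the smallest K; now check the R-condition:
            # last mineral spectrum versus first U spectrum.
            if alpha[K - 1] >= R * alpha[K]:
                return K
            return None
    return None  # L-condition never met


def choose_jm(contributions_for, L=0.95, R=2.0, j_max=10):
    """Increment J from 2 until both conditions hold; return (J, K) or None."""
    for J in range(2, j_max + 1):
        K = split_mineral_vs_u(contributions_for(J), L, R)
        if K is not None:
            return J, K
    return None


# Contributions per J as read from Figure S5 (stand-in for a real decomposition).
figure_s5 = {
    2: [0.85, 0.15],
    3: [0.90, 0.06, 0.04],
    4: [0.80, 0.16, 0.03, 0.01],
}
print(choose_jm(figure_s5.get, L=0.95, R=2.0, j_max=4))  # (4, 2)
```

The R‐condition is written as a product rather than a division so that a zero contribution for the first U spectrum does not raise an error. Returning `None` when no J up to the maximum satisfies both conditions is a choice made for this sketch; the source does not say what happens in that case.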

Figure S6. Partitioning of the dataset into training and validation data, using some of the possible reference MPs as validation data to balance the dataset across mineral classes (Section 5.3).

Figure S7. Classification precision and recall output for the parametrisation of the NTF method with R and L variables (validation dataset): (a) P – precision scores; (b) RUin – recall scores with the U class included as an error; (c) RUex – recall scores with the U class excluded from the measure; (d) P – precision scores in accordance with the best SVM result from Table 3; (e) RUin – recall scores with the U class included as an error in accordance with the best SVM result from Table 3; (f) RUex – recall scores with the U class excluded from the measure in accordance with the best SVM result from Table 3.
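One possible reading of the three scores in Figure S7 is sketched below. It is an interpretation, not the definition used in the paper: the macro‐averaging over classes, the function name `macro_scores` and the label `"U"` for the unknown class are assumptions. Under this reading, a sample predicted as U counts as a miss for RUin and is dropped from the denominator for RUex.

```python
from collections import Counter

UNKNOWN = "U"  # label of the unknown class; the name is an assumption


def macro_scores(y_true, y_pred, unknown=UNKNOWN):
    """Return (precision, recall_u_in, recall_u_ex), macro-averaged over true classes."""
    classes = sorted(set(y_true))
    actual = Counter(y_true)   # samples per true class
    predicted = Counter()      # how often each label was predicted
    tp = Counter()             # correct predictions per class
    sent_to_u = Counter()      # true-class samples that were predicted as U
    for t, p in zip(y_true, y_pred):
        predicted[p] += 1
        if p == t:
            tp[t] += 1
        elif p == unknown:
            sent_to_u[t] += 1

    def mean(values):
        values = list(values)
        return sum(values) / len(values) if values else 0.0

    # Classes with an empty denominator are skipped rather than scored as zero.
    precision = mean(tp[c] / predicted[c] for c in classes if predicted[c])
    recall_u_in = mean(tp[c] / actual[c] for c in classes)
    recall_u_ex = mean(tp[c] / (actual[c] - sent_to_u[c])
                       for c in classes if actual[c] > sent_to_u[c])
    return precision, recall_u_in, recall_u_ex


y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "U", "b", "b", "U"]
p, r_in, r_ex = macro_scores(y_true, y_pred)
```

In this toy example, class a has two correct predictions out of four samples and class b one out of two, so RUin is 0.5, while excluding the two samples sent to U raises RUex to about 0.83.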

Figure S8. Confusion matrix of the SVM classifier for the validation dataset (Section 5.3).

Figure S9. Confusion matrix of the KNN classifier for the validation dataset (Section 5.3).

Figure S10. Confusion matrix of the LDA classifier for the validation dataset (Section 5.3).

Figure S11. Confusion matrix of the NTF‐based classifier for the validation dataset (Section 5.3).

Table S12. CPU time and disk space usage of the compared methods. The desktop hardware was an Intel i7‐4790K (4.00 GHz) with 32 GB RAM.

| Method | CPU learning time (s) | CPU validation time for 188 MPs (s) | Disk space usage (MB) |
|---|---|---|---|
| SVM | 1274.50 | 1134.26 | 1378.59 |
| LDA | 199.52 | 27.23 | 1440.37 |
| KNN | 4.26 | 462.27 | 964.46 |
| NTF | 1389.22 | 574.65 | 25.19 |