
Magnificent beasts of the Milky Way: Hunting down stars with unusual infrared properties using supervised machine learning

Julia Ahlvind¹
Supervisor: Erik Zackrisson¹
Subject reader: Eric Stempels¹
Examiner: Andreas Korn¹
Degree project E in Physics – Astronomy, 30 ECTS

¹Department of Physics and Astronomy, Uppsala University
June 22, 2021

Contents

1 Background
  1.1 Introduction
2 Theory: Machine Learning
  2.1 Supervised machine learning
  2.2 Classification
  2.3 Various models
    2.3.1 k-nearest neighbour (kNN)
    2.3.2 Decision tree
    2.3.3 Support Vector Machine (SVM)
    2.3.4 Discriminant analysis
    2.3.5 Ensemble
  2.4 Hyperparameter tuning
  2.5 Evaluation
    2.5.1 Confusion matrix
    2.5.2 Precision and classification accuracy
3 Theory: Astronomy
  3.1 Dyson spheres
  3.2 Dust-enshrouded stars
  3.3 Gray Dust
  3.4 M-dwarf
  3.5 post-AGB stars
4 Data and program
  4.1 Gaia
  4.2 AllWISE
  4.3 2MASS
  4.4 MATLAB
    4.4.1 Decision trees
    4.4.2 Discriminant analysis
    4.4.3 Support Vector Machine
    4.4.4 k-nearest neighbour
    4.4.5 Ensemble
5 The general method
  5.1 Forming datasets and building DS models
    5.1.1 Training set: stars
    5.1.2 Training set: Dyson spheres
    5.1.3 Training set: YSOs
  5.2 Finding and identifying the DS candidates
    5.2.1 Manual analysis of DS candidates
  5.3 Testing set
  5.4 Best fitted model
6 Process
  6.1 Limiting DS magnitudes
  6.2 Introducing a third class
  6.3 Coordinate dependence
  6.4 Malmquist bias
  6.5 cc flag
  6.6 Feature selection
  6.7 Proportions of the training sets
7 Result
  7.1 Frequent sources of reference
    7.1.1 Marton et al. (2016)
    7.1.2 Marton et al. (2019)
    7.1.3 Stassun et al. (2018)
    7.1.4 Stassun et al. (2019)
  7.2 A selection of intriguing targets
    7.2.1 J18212449-2536350 (Nr. 1)
    7.2.2 J04243606+1310150 (Nr. 8)
    7.2.3 J18242978-2946492 (Nr. 10)
    7.2.4 J18170389+6433549 (Nr. 22)
    7.2.5 J14492607-6515421 (Nr. 26)
    7.2.6 J06110354-4711294 (Nr. 30)
    7.2.7 J05261975-0623574 (Nr. 37)
    7.2.8 J21173917+6855097 (Nr. 46)
  7.3 Summary of the results
8 Discussion
  8.1 Evaluation of the approach
    8.1.1 The influence of various algorithms
    8.1.2 Training sets
  8.2 Challenges
  8.3 Follow-up observations
  8.4 Grid search
  8.5 Future prospects and improvements
9 Conclusion

A Appendix
  A.1 Candidates
  A.2 Uncertainty derivations
  A.3 Results of various models
  A.4 Algorithm
    A.4.1 linear SVM
    A.4.2 quadratic SVM

Abstract

The significant increase of astronomical data necessitates new strategies and developments for analysing large amounts of information, which is no longer efficient if done by hand. Supervised machine learning is an example of one such modern strategy. In this work, we apply the classification technique to Gaia+2MASS+WISE data to explore the usage of supervised machine learning on large astronomical archives. The idea is to create an algorithm that recognises entries with unusual infrared properties which could be interesting for follow-up observations. The programming is executed in MATLAB and the training of the algorithms in the classification learner application of MATLAB. The catalogues Gaia, 2MASS and WISE contain ~10^9, 5×10^8 and 7×10^8 entries, respectively (The European Space Agency 2019, Skrutskie et al. 2006, R. M. Cutri IPAC/Caltech). The algorithms search through a sample from these archives consisting of 765266 entries, corresponding to objects within a <500 pc range. The project resulted in a list of 57 entries with unusual infrared properties, out of which 8 targets showed none of the four common features that provide a natural physical explanation for the unconventional energy distribution. After more comprehensive studies of the aforementioned targets, we deem it necessary to pursue further studies and observations of 2 out of the 8 targets (Nr. 1 and Nr. 8 in table 3) to establish their true nature. The results demonstrate the applicability of machine learning in astronomy, as well as suggesting a sample of intriguing targets for further studies.

Sammanfattning

In astronomy, large amounts of data are collected continuously, and the growth accelerates every year. This means that manual analyses of the data become less and less viable, and new strategies and methods are instead required where large amounts of data can be analysed quickly. One example of such a strategy is supervised machine learning. In this work, we make use of a supervised machine learning technique called classification. We apply the classification technique to data from the three large astronomical catalogues Gaia, 2MASS and WISE to investigate the use of this technique on precisely such large astronomical archives. The idea is to create an algorithm that identifies objects with unconventional infrared properties which could be interesting for further observations and analyses. These unusual objects are expected to have a lower emission in the optical wavelength range and a higher emission in the infrared than what is usually observed for a star. The programming is carried out in MATLAB and the training process of the algorithms in MATLAB's classification learner application. The algorithms search through a dataset consisting of 765266 objects from the Gaia+2MASS+WISE catalogues. In total, these catalogues contain ~10^9, 5×10^8 and 7×10^8 objects each (The European Space Agency 2019, Skrutskie et al. 2006, R. M. Cutri IPAC/Caltech). The limited dataset that the algorithms search through corresponds to objects within a radius of <500 pc. Many of the objects that the algorithms identified as "unusual" turn out to be nebulous objects. The natural explanation for their infrared excess is the enshrouding dust, which gives rise to thermal radiation in the infrared.
To eliminate this type of object and focus the search on more unconventional objects, the programs were modified. One of the main changes was to introduce a third class consisting of dust-enshrouded stars, which we call the "YSO" class. A further change that brought improved results was to introduce the coordinates in the training as well as in the final classification, and thereby in the identification of interesting candidates. These adjustments resulted in a reduced fraction of nebulous objects in the class of "unusual" objects identified by the algorithms. The project resulted in a list of 57 objects with unusual infrared properties. Eight of these objects showed none of the four commonly occurring features that can give a natural explanation for their abundance of infrared radiation. These features are: nebulous surroundings or detected dust, variability, Hα emission, or maser radiation. After further investigation of the 8 aforementioned objects, we consider that 2 of them need further observations and analysis in order to establish their true nature (Nr. 1 and Nr. 8 in table 3). The infrared radiation is thus not easily explained for these 2 objects. The results of interesting objects, together with the other results from the machine learning, show that the classification technique within machine learning is useful on large astronomical datasets.

List of Acronyms

Machine learning
ML – Machine learning.
model – Another word for the machine learning system or algorithm that is trained on the training data.
training data – A dataset that consists of both input and output data, which is used in the training process of the models.
predictions – The output of a model after it has been trained on training data.
input data – The data that is fed to the model, i.e. the first part of the training data.
labeled data – Another word for the output data, i.e. the second part of the training data.
test data – A dataset that is used to test the performance of the model and that is different from the training set.
training a model – A process where the algorithms are provided training data, from which they learn to recognise patterns.
hyperparameter – A parameter whose value is used to control the learning process, thus adjusting the model in more detail.
classifier – A method for solving classification problems.
kNN – k-Nearest Neighbour.
SVM – Support Vector Machine.
classes/labels – The output of a classifier.
parametric – For parametric models, the training data is used to initially establish parameters. However, once the model has been trained, the training data can be discarded, since it is not explicitly used when making predictions.
loss function – A function that measures how well the model's predicted output fits the input data.
TP – True positive.
FN – False negative.
FP – False positive.
TN – True negative.

Astronomy
SED – Spectral Energy Distribution.
YSO – Young Stellar Object.
PMS – Pre-main-sequence.
MS – Main sequence.
IR – Infrared.
FIR – Far infrared.
NIR – Near infrared.
UV – Ultraviolet.
DS – Dyson sphere.

1 Background

1.1 Introduction

Progress in astronomy has been driven by a number of important revolutions, such as the telescope, photography, spectroscopy, electronics and computers, which have resulted in a major burst of data. Astronomers are now facing new challenges of handling large quantities of astronomical data, which accumulate every year. Large astronomical catalogues such as Gaia, with its library of ~1.6 billion stars, hold key information on galaxy formation processes, the distribution of stars and much more. It is crucial for the progression of astronomical research not only to access this data but to extract relevant and important information. It is unrealistic to assume that all data will be manually analysed. Therefore, we are entering an era where we put more confidence in computers and machine learning (ML) processes to do the work at hand.

The astronomical data avalanche is not solely the cause of the increasing usage of ML. The technique is becoming more and more popular and important in numerous fields. Moreover, the growing interest is not isolated to astronomical research, but extends to other scientific research, financial computations, medical assistance and applications in our mobile phones. There are many subdivisions of machine learning that can be used in various fields. For the purposes of this work, we will adopt supervised machine learning (SML). This ML technique exploits the fact that the input data is classified, so that the predictions of the training can be evaluated. This is in contrast to unsupervised machine learning, which trains models based on unclassified training data.
Simply put, the difference between unsupervised and supervised ML is, thus, that the training process is further supervised for the latter technique.

Supervised ML can be further divided into subgroups, namely regression and classification techniques. The technicalities of the latter are discussed in section 2.2, and it is the technique used within this work. The classifier method will be adapted in order to categorise stars with and without unusual infrared properties. Several types of stars are expected to have these kinds of properties, some of them being stars surrounded by debris disks and dust-enshrouded stars, such as young stellar objects (YSOs) or protostars. Further intriguing objects with such infrared properties are the hypothetical Dyson spheres (DS). A Dyson sphere is an artificial mega-structure which encloses a star with the purpose of harvesting the stellar light. By narrowing down the search to the latter group of alluring objects, we hope to discover intriguing targets that are not directly explained by previously observed astronomical phenomena.

With this work, we aim to exploit the use of machine learning and thereby test its applicability on large astronomical data. The procedure is to create a sample of models of typical targets that contain these unusual infrared properties, as well as a sample of common stars. The unconventional objects, such as the ones that fit the Dyson sphere models, are limited to near-solar magnitudes within this study. Thereafter, we apply ML to train algorithm(s) to recognise and categorise stars with uncommon infrared properties from the Gaia+WISE+2MASS catalogues. The Gaia archive is a grand ESA (European Space Agency) catalogue comprising brightness measurements and positions of more than 1.6 billion stars with remarkable accuracy. The brightness is measured in the three passbands GBP (BP), G and GRP (RP), which cover wavelengths from the optical (~300 nm) to the near-infrared (NIR, ~1.1 µm). The 2MASS (Two Micron All Sky Survey) catalogue contains, besides astrometric data, ground-based photometric measurements in the NIR regime (~1.1-2.4 µm), with around 470 million sources. Complementary to the optical-NIR measurements, WISE (Wide-field Infrared Survey Explorer) photometry data is recovered in four mid-infrared (mid-IR) bandpasses (~2.5-28 µm). The AllWISE source catalogue contains ~7×10^8 entries (R. M. Cutri IPAC/Caltech). Note that this catalogue is not a point source catalogue, meaning that it contains some resolved sources such as galaxies and filaments in galactic nebulosity. Using these three archives enables a broader wavelength range, necessary for our study. The project is expected to result in a list of intriguing objects that are suitable for follow-up observations. Questions that will be considered are: are there any objects with properties consistent with those expected for Dyson spheres in our night sky? If so, is it possible to determine their nature based on existing data outside the Gaia+WISE+2MASS catalogues?

In this report we include relevant background theory of machine learning and astronomy in sections 2 & 3, respectively. In section 4, we review the databases used in the ML part of the work, the programs and applications exploited (section 4.4), and the employed programming methods and scripts in section 5. Further on, in section 6, we discuss the process and the steps of implementation taken throughout the work to improve the programs. The main results are thereafter appraised in section 7, together with a selection of the most promising candidates. Thereafter we discuss the overall progress, the results, the application of ML in astronomy, and future prospects and improvements in section 8. Finally, we conclude the results of this work and their application in section 9.

2 Theory: Machine Learning

Machine learning is the scientific area involving the study of computer algorithms that learn and improve from past experience. A ML algorithm or model is constructed on sample data, referred to as

training data, with the goal to predict and assess an outcome without being trained for that specific task. There are two main subgroups of ML which are used in various situations: unsupervised learning and supervised learning. In the former technique, the algorithm is trained to identify intrinsic patterns within the input training data without any knowledge of labelled responses (output data). A commonly used technique of unsupervised ML is clustering. It is used to find hidden patterns and groupings in data which have not been categorised or labelled. Instead of relying on labelled data, the algorithm looks for commonalities in the training data and categorises entries according to the presence or absence of these properties. The latter subgroup, supervised ML, is the one used within this work and is presented in the following section.

2.1 Supervised machine learning

Supervised machine learning techniques utilise the whole training data, which contains some input data, some output data and the relationship between the two groups. By using a model that has previously been adapted to the training data, one can predict the outcome, i.e. the output data, from a new set of data that is different from the training data (Lindholm et al. 2020). The process of adapting the model to the training data is referred to as training the model. Machine learning is particularly useful for cases where the relationship between the input data x and the output data y is not explicit. For instance, the relationship may be too complicated or even unknown from the training data. Thus, the problem can not be solved with a traditional computer program that takes input x and returns the output y based on some common set of rules. Supervised machine learning, instead, approaches the problem by learning the relationship between x and y from the training data.

As previously mentioned, which ML approach to use depends on the data and on what you want to achieve. SML is suitable for regression techniques and for categorising or classifying data, since the algorithm needs training in order to make predictions on unseen data. Regression techniques predict continuous responses, e.g. changes in the flux of a star. In contrast, classification techniques predict discrete responses, e.g. what type of star it is. The latter approach will be used in this work and is more thoroughly described in the following section.

2.2 Classification

The classification method categorises the output into classes and can thus take M number of finite values. In the simplest case, M = 2, and the problem is referred to as a binary classification; if M > 2 we instead call it a multi-class classification. For the binary case, the labelled responses are denoted -1 and 1, instead of 1 and 2. Therefore, the classes of the outcome can be referred to as positive and negative classes.

2.3 Various models

There are many different models or algorithms that can be used in machine learning. Similar to the choice between supervised and unsupervised machine learning, different models are suitable for different purposes. Furthermore, the classification model, which depends on the chosen machine learning algorithm, can further be tuned by changing the values of the hyperparameters (see section 2.4). Some of the models commonly used for classification, as described in the former section, are discussed below. Note, however, that these methods are also applicable to other machine learning techniques besides classification. The algorithms reviewed in the following sections are also the ones adopted in this work. By trial and error we adapt the various algorithms to identify which algorithm(s) are most suited for this classification problem.

2.3.1 k-nearest neighbour (kNN)

The k-nearest neighbour method for supervised machine learning is, as many other SML models, based on the predictions of its neighbours. If an input data point x_i is close to a training data point x_t, then the prediction of the input data y_i(x_i) should be close to the prediction of the training set y_t(x_t). For example, if k = 1, then the object is assigned the class of its single nearest neighbour. For higher values of k, the data point is classified by assigning the label that is most prevalent among its k nearest neighbours. The kNN algorithm works in various metric spaces. A common metric space is the Euclidean metric, where the distance between two data points (x_j, x_i) in n dimensions follows eq. 1 (Kataria & Singh 2013). Other examples of metrics that can be used are the Chebyshev or Minkowski metrics.

$$D(x_j, x_i) = \lVert x_j - x_i \rVert = \sqrt{\sum_{k=1}^{n} (x_{i,k} - x_{j,k})^2} \qquad (1)$$

As with all ML models, the parametric choices vary between problems. There is no optimal value of k which is profitable for the majority of cases, and few guidelines exist. However, one is that for a binary (two-class) classification, it is beneficial to set k equal to an odd number, since this avoids tied classifications (Hall et al. 2008). But in truth, one has to test the options on a training set by trial and error to see what is suitable for the particular classification. In general, a large value of k reduces the effect of noise on the classification, but at the same time makes the boundaries between classes less distinct, thus risking overfitting (the model has adapted too much to the

training data and will not be able to generalise well to new data).
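To make this concrete, a minimal MATLAB sketch of training and applying a kNN classifier is given below. It assumes the Statistics and Machine Learning Toolbox; the data, the labels and the choice k = 5 are invented for illustration and are not the settings used in this work, where the models were configured through the Classification Learner app.

    % Hypothetical training data: rows are objects, columns are features
    % (e.g. colours); labels holds the class of each object.
    X = [randn(100,2); randn(100,2) + 3];            % two separated groups
    labels = [repmat({'star'},100,1); repmat({'DS'},100,1)];

    % Train a kNN model using the Euclidean metric of eq. 1 and k = 5.
    mdl = fitcknn(X, labels, 'NumNeighbors', 5, 'Distance', 'euclidean');

    % Classify a new, unseen data point by majority vote of its neighbours.
    predicted = predict(mdl, [2.5 2.5]);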

2.3.2 Decision tree

Another method commonly used in machine learning is decision trees, more specifically classification trees if the problem regards classification. The leaves in a classification tree represent the class labels, and the branches the combined features which are used in order to select the class. More specifically, the input variable (previously referred to as x_i) is known as the root node, the last nodes of the tree are known as leaf nodes or terminal nodes, and the intermediate nodes are called internal nodes.

The nodes of a decision tree are chosen by looking for the optimum split of the features. One function that measures the quality of a split is the Gini index. This function selects the optimal separation for when the data is split into groups where one class dominates, and it minimises the impurity of the two children nodes (the nodes following a split of a parent node). The Gini impurity thus measures the frequency at which any element of the dataset would be miscategorised when labelled, and it follows the formula seen in eq. 2. Here π̂_lm is the proportion of the training observations in the lth region that belong to the mth class, according to eq. 3. In eq. 3, n_l is the number of training data points in node l and y_i is the ith label of the training data point (x_i, y_i). Therefore, π̂_lm is essentially the probability of an element y_i belonging to the class m (Lindholm et al. 2020).

$$Q_l = \sum_{m=1}^{M} \hat{\pi}_{lm}\,(1 - \hat{\pi}_{lm}) = 1 - \sum_{m=1}^{M} (\hat{\pi}_{lm})^2 \qquad (2)$$

$$\hat{\pi}_{lm} = \frac{1}{n_l} \sum_{i:\, x_i \in R_l} \mathbb{1}\{y_i = m\} \qquad (3)$$

Trees can be sensitive to small changes in the training data, which can result in large changes in the tree, and hence in the final outcome of the classification. Therefore, the number of splits in a classification tree must be treated carefully in order to prevent overfitting, i.e. a tree that does not generalise well beyond the training data. In the use of decision trees, one can further improve the precision of the prediction by utilising so-called pruning. Pruning is essentially the inverse process of splitting an internal node: a sub-node or internal node is removed after training on the data.
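As an illustration (not the thesis' actual script), a classification tree with a capped number of splits can be trained in MATLAB as below; 'gdi' is the toolbox name for Gini's diversity index of eq. 2, and the data and the cap of 20 splits are placeholders.

    % Hypothetical feature matrix X (n objects x p features) and labels y.
    X = rand(200, 3);
    y = [repmat({'star'},150,1); repmat({'unusual'},50,1)];

    % Classification tree with the Gini impurity as split criterion and at
    % most 20 splits, mirroring the 'medium' preset of the CL app.
    tree = fitctree(X, y, 'SplitCriterion', 'gdi', 'MaxNumSplits', 20);

    % Pruning removes sub-nodes after training; level 0 is the full tree.
    prunedTree = prune(tree, 'Level', 1);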

2.3.3 Support Vector Machine (SVM)

Support vector machine is yet a further ML technique well suited for classification, but it can also be used in regression problems. In its most simple form, SVM does not support multiclass classification; however, the problem can be split into multiple binary classification problems and thus be solved for. The primary objective of this technique is to project nonlinearly separable n-dimensional data samples onto a higher-dimensional space, with the use of various kernel functions, such that they can be separated by an n-dimensional hyperplane. This is referred to as a linear classifier, but higher orders such as quadratic, cubic and Gaussian also exist. The choice of the hyperplane in the dimensional space is important for the precision and accuracy of the model.

[Figure 1: An illustration of a two-dimensional support vector machine with corresponding notations.]

One common choice is the maximum-margin hyperplane, where the distance between the two closest data points of each group (the support vectors) is maximised and no misclassification occurs. The hyperplane, or the threshold which separates these two classes, will thus reside in between the two closest points, where the distance between the threshold and a data point is called the margin (Hastie et al. 2009). In two dimensions, one can visualise the data as two groups on the xy-plane, where one can draw two parallel lines representing the boundary of each group, orthogonal to the shortest distance between the data points of the different groups (see figure 1). For this case, the hyperplane is also represented as a straight line, positioned in between the two boundary lines. A general hyperplane can be written as a set of points x satisfying eq. 4, where θ is the normal vector to the hyperplane and y the classification label following y ∈ {-1, +1}. The boundary lines (hyperplanes for higher dimensions) thus represent the positions where anything on or above eq. 5a is of the class with label 1, and anything on or below eq. 5c of class -1. The distance between these two boundary hyperplanes in the θ-direction is 2/||θ|| (see figure 1), and to maximise the distance, ||θ|| needs to be minimised (Lindholm et al. 2020).

$$\theta^T x - b = y \qquad (4)$$

$$\theta^T x - b = 1 \qquad (5a)$$
$$\theta^T x - b = 0 \qquad (5b)$$
$$\theta^T x - b = -1 \qquad (5c)$$

Furthermore, if we also require that all classifications are correct, in addition to having the maximum distance as stated above, we enforce eq. 7.

$$y_i\,(\theta^T x_i - b) \geq 1 \quad \forall i \qquad (7)$$

When combining the two criteria, we get the so-called optimisation problem, where ||θ|| is minimised subject to the constraints y_i(θ^T x_i - b) ≥ 1 ∀i. Thus, we end up with the decision function sign(θ^T x - b). The choice of the maximum margins is often used by default by the program itself, since it is the most stable solution under perturbations of the input of further data points. However, for non-linearly separable data, the algorithm often resorts to soft margins. This lets the SVM algorithm deal with errors in the data by allowing a few outliers to fall on the wrong side of the hyperplane, thus "misclassifying" them without affecting the final result. In other words, the outlier(s) instead reside on the same side of the hyperplane as members of the opposite class. Clearly, we can not allow too many misclassifications, and we thus aim to minimise the misclassification rate. Hence, when introducing the soft margins, it is necessary to control and check the process. Soft margins are used together with a technique called cross validation. This technique is used to control the balance between misclassifications and overfitting. One can set the number of allowed misclassifications within the soft margins to get the best classification that is not overfitting. This happens automatically in some programs (like MATLAB's classification learner (CL)) or can be set manually.

In order to control the position of the boundary, SVM utilises a so-called cost function. The idea is to regulate the freedom of misclassified points in the action of maximising the margin by using the hinge loss function. A point belonging to the positive class that resides on or above the higher boundary or support vector (eq. 5a) has zero loss. A point of the positive class on the threshold (eq. 5c) has loss one, and further points in the positive class on the "wrong" side of the threshold have linearly increasing loss. The cost function thus accounts for the distance between the support vector and the misclassified point. The further the separation, the more costly it gets to correctly classify the point, and less so to shift the threshold, maximise the margin and miscategorise the point. The total cost function can be stated as seen in eq. 8, where the first term governs the maximisation of the margin and the second term accounts for the loss function (Smola & Schölkopf 1998). The parameter C is a regularisation parameter that adjusts how well the constraints are followed. In other words, if C is large, the constraints are hard to ignore and the margin is narrow. If C is small, the constraints are easily ignored and the margin is broad.

$$\min_{\theta}\ \lVert\theta\rVert^2 + C \sum_i \max\!\left(0,\ 1 - y_i(\theta^T x_i - b)\right) \qquad (8)$$

Furthermore, for non-linearly separable data, even introducing soft margins might not suffice. The kernel function provides a solution to this problem by projecting the data from a low-dimensional space to a higher-dimensional space (Noble 2006). The kernel function derives the relationships between every pair of data points as if they were in a higher dimension, meaning that the function does not actually do the transformation. This is referred to as the kernel trick. Common kernel functions are polynomials of degree two or three, as well as radial basis function kernels such as the Gaussian.
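A minimal MATLAB sketch of a soft-margin SVM follows; the data are invented. 'BoxConstraint' is the toolbox name for the regularisation parameter C of eq. 8, and the kernel trick is selected via 'KernelFunction'. Note that fitcsvm handles binary problems only, matching the text above (fitcecoc would split a multiclass problem into several binary ones).

    % Hypothetical two-class data with labels -1 and +1.
    X = [randn(100,2); randn(100,2) + 2];
    y = [-ones(100,1); ones(100,1)];

    % Quadratic-kernel soft-margin SVM with regularisation parameter C = 1.
    svmModel = fitcsvm(X, y, 'KernelFunction', 'polynomial', ...
        'PolynomialOrder', 2, 'BoxConstraint', 1);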

2.3.4 Discriminant analysis

Discriminant analysis uses the training set to determine the position of the boundaries that separate the response classes. The locations of the boundaries are determined by treating each individual class as a sample from a multidimensional Gaussian distribution. The boundaries are thereafter drawn at the points where the probability of classifying an element into either of the classes is equal. The boundary is thus a function that depends on the parameters of the fitted distributions. If the distributions of all classes are assumed to be equal, the boundary equations are simplified greatly and become linear. Otherwise, the boundaries are quadratic. The quadratic discriminant analysis (QDA) is slightly more demanding in terms of memory and calculations, but it is still seen as an efficient classification technique. The QDA can be expressed as seen in eq. 9 (Lindholm et al. 2020).

$$p(y = m\,|\,x) = \frac{\hat{\pi}_m\, \mathcal{N}(x\,|\,\hat{\mu}_m, \hat{\Sigma}_m)}{\sum_{j=1}^{M} \hat{\pi}_j\, \mathcal{N}(x\,|\,\hat{\mu}_j, \hat{\Sigma}_j)} \qquad (9)$$

Here π̂_{m/j} = n_{m/j}/n is the number of training points in class m (or j) over the total number of data points, and N denotes the Gaussian (normal) distribution. Eq. 10 gives the mean vector of each class, computed from all training data points within that class.

$$\hat{\mu}_m = \frac{1}{n_m} \sum_{i:\, y_i = m} x_i \qquad (10)$$

Finally, eq. 11 describes the covariance matrix of each class m, in other words, a matrix that describes the covariance between each pair of elements of the given vectors x_i and μ̂_m.

$$\hat{\Sigma}_m = \frac{1}{n_m} \sum_{i:\, y_i = m} (x_i - \hat{\mu}_m)(x_i - \hat{\mu}_m)^T \qquad (11)$$
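In MATLAB, the linear/quadratic distinction above maps directly onto the 'DiscrimType' option of fitcdiscr, as in the hedged sketch below (invented data, toolbox assumed):

    % Hypothetical training data.
    X = [randn(100,2); randn(100,2) + 2];
    y = [repmat({'A'},100,1); repmat({'B'},100,1)];

    % 'linear' assumes one shared covariance matrix for all classes;
    % 'quadratic' fits a separate Sigma_m per class, as in eqs. 9-11.
    qda = fitcdiscr(X, y, 'DiscrimType', 'quadratic');

    % The fitted class means of eq. 10 are stored in the model.
    classMeans = qda.Mu;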

2.3.5 Ensemble

Ensemble methods are meta-algorithms that use several copies of a fundamental model. This set of multiple copies is referred to as an ensemble of base models, where the base models can e.g. be one of the aforementioned models in the sections above. The fundamental concept is to train each such base model in slightly different ways. Each base model makes its own prediction, and thereafter an average or majority vote is derived to obtain the final prediction (Lindholm et al. 2020). There are two main types of models when discussing ensemble classification: bagging and boosting. In bagging, or "bootstrap aggregation", multiple slightly different versions of the training set are created. These sets are random, overlapping subsets of the training data. This results in an ensemble of similar base models which are not identical to the core base model; thus, when training the models, one reduces the variance and so the risk of overfitting. One could summarise bagging as an ensemble of multiple models of the same type, where each one is trained on a different, randomly generated subset of the training data. This method is not a new model technique, but a collection of the formerly mentioned models that uses a new approach to the data itself. After the ensemble has been trained, the results are aggregated by an average or weighted average of the predicted class probabilities, which results in a final prediction from the bagging.

The second ensemble classification technique mentioned is boosting. In contrast to bagging, the base models in boosting are trained sequentially, where each model aims to correct the mistakes that the former models have made. Furthermore, an effect of using boosting is bias reduction of the base model, instead of the reduced variance in bagging. This allows boosting to turn an ensemble of weak base models into one stronger model, without the heavy calculations that normally would be required (Lindholm et al. 2020). Both boosting and bagging are ensemble methods that combine predictions from multiple models of classification (or regression) type. Thus, boosting also uses previously mentioned models through a new approach. The biggest difference is, as mentioned, that boosting is sequential. The idea is that each model tries to correct for the mistakes made by the previous one by modifying the training dataset after each iteration, in order to highlight the data points for which the formerly trained models performed dissatisfactorily. The final prediction is then a weighted average or weighted majority vote of all models of the ensemble.
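Both strategies are also available outside the CL app; the sketch below (invented data, toolbox assumed) trains one bagged and one boosted tree ensemble, purely to show how the two approaches are selected:

    % Hypothetical training data.
    X = rand(300, 4);
    y = [repmat({'star'},200,1); repmat({'unusual'},100,1)];

    % Bagging: each of the 50 trees is trained on a bootstrap resample.
    bagged = fitcensemble(X, y, 'Method', 'Bag', 'NumLearningCycles', 50);

    % Boosting: trees are trained sequentially, each one reweighting the
    % points the previous trees misclassified.
    boosted = fitcensemble(X, y, 'Method', 'AdaBoostM1', 'NumLearningCycles', 50);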
2.4 Hyperparameter tuning

Sometimes the performance of the classification can be optimised by setting the hyperparameters manually. Hyperparameters vary between models but are generally seen as "settings" for the algorithm. A typical hyperparameter for SVM algorithms is the polynomial degree of the kernel function; for kNN it is the number of neighbours k. For decision trees there are two important hyperparameters to consider: the number of estimators (decision trees) and the maximum allowed depth of each tree, i.e. the maximum number of splits or branches. As with many things in machine learning, there is no universally optimal choice of hyperparameters. Therefore, empirical trials of the hyperparameters are the preferred way to optimise the algorithm; this is also known as hyperparameter tuning. Another approach is the so-called grid search, which builds and evaluates a model for each combination of parameters specified in a grid. A third option is a random search, which is based on a statistical distribution of each parameter, from which values are randomly sampled (Bergstra & Bengio 2012). This work is, as previously mentioned, executed in MATLAB and MATLAB's classification learner. Here the evaluation of the hyperparameter tuning is automatically done by the program and commonly not tweaked by us. However, hyperparameter tuning is indeed accessible in MATLAB and is tested within this work. The available and relevant hyperparameters in MATLAB's CL are further discussed in section 4.4. Furthermore, we evaluate our experimentation with hyperparameter tuning by studying a confusion matrix, which is addressed in section 2.5.1 below.
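Outside the CL app, the toolbox exposes the same idea through the 'OptimizeHyperparameters' option of the fitting functions. The sketch below (invented data) tunes k and the distance metric of a kNN model with a grid search; swapping 'gridsearch' for 'randomsearch' or 'bayesopt' changes the search strategy:

    % Hypothetical data.
    X = rand(200, 3);
    y = [repmat({'A'},150,1); repmat({'B'},50,1)];

    % Let MATLAB tune the number of neighbours and the metric itself.
    opts = struct('Optimizer', 'gridsearch', 'ShowPlots', false);
    tuned = fitcknn(X, y, ...
        'OptimizeHyperparameters', {'NumNeighbors','Distance'}, ...
        'HyperparameterOptimizationOptions', opts);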

2.5 Evaluation

2.5.1 Confusion matrix

The most convenient way of evaluating the performance of a classifier is to study a so-called confusion matrix. For a binary classification problem, the confusion matrix is formed by separating the validation data into four groups, depending on the true output or label y and the predicted output y(x). Figure 2 shows a general confusion matrix, where the green boxes, true positive (TP) and true negative (TN), show the correctly classified elements, meaning that the model predicts the true class. The red boxes, false positive (FP) and false negative (FN), instead show the number of incorrectly predicted elements. Out of the latter two outcomes, the false positive is often the most critical one to consider and should, therefore, be minimised. This is true because, in most cases of classification, we are interested in discerning one specific class as accurately as possible. To illustrate this, we will use an example related to this work, namely the classification of common stars (A) and stars with unusual infrared properties (B). Since the goal is to identify targets belonging to class B, TP are targets predicted as, and truly are, unusual stars. TN are targets predicted as, and are, common stars. FN are targets predicted as common stars but belonging to class B. Finally, the most critical outcome, FP, are targets predicted as group B but in reality belonging to group A (common stars falsely classified as unusual stars). If the number of FP is high, it means that the model will assort many common stars as unusual ones, thus making the result unreliable. This means that the group of interest will be contaminated by "common stars", thereby making the manual process of checking the list of interesting targets time-consuming. If instead FN is high while FP is low, it means that many uncommon targets are lost in the classification and get assorted into the common group. It is true that these targets will go unnoticed; however, the targets later classified as uncommon stars by the model will have a smaller uncertainty of being misclassified.

[Figure 2: A confusion matrix with true and predicted classes. The true positive (TP) and true negative (TN) (green squares) represent the correctly classified objects, while the false positive (FP) and the false negative (FN) (red squares) are the falsely classified objects.]

2.5.2 Precision and classification accuracy

Further parameters that can be used for assessing the overall performance of the classification are the precision and the classification accuracy. The precision of a model describes the ratio of true positives over all positives, as stated in eq. 12. A high precision value (close to 1) is good, and a low one (close to 0) signals that there is a problem in the classification which yields many false positives.

$$\mathrm{precision} = \frac{TP}{TP + FP} \qquad (12)$$

The accuracy of a classification model is similar to the precision, but it includes the total correct predictions, while the precision evaluates half of the true outcome. The accuracy thus evaluates the fraction of correct predictions over the total number of predictions, as seen in eq. 13. The ratio takes a value between 0 and 1, but the accuracy is commonly presented as a percentage, 0-100%. The accuracy metric is conceivably better suited for a uniformly distributed sample, since it is a biased representation of the minority class. This implies that the majority class has a bigger impact on the accuracy than the minority class has. Therefore, the classification accuracy could be misleading if the work involves large differences in the quantities of each class (Schütze et al. 2008).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (13)$$

Naturally, a precision close to 1 and an accuracy near 100% are preferred; however, as discussed earlier (section 2.5.1), it is more important to reduce the number of FN. A model with high precision and a high ratio of FN is a worse model than one with lower precision and a low ratio of FN. Nevertheless, both parameters give a quick indication of how well the model classifies the entries.
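In MATLAB, the confusion matrix and the two metrics of eqs. 12-13 can be computed from predicted labels as sketched below; the validation labels are invented, with "B" (unusual) taken as the positive class:

    % True and predicted labels for a hypothetical validation set.
    yTrue = categorical([repmat("A",90,1); repmat("B",10,1)]);
    yPred = yTrue;  yPred([3 95]) = ["B"; "A"];   % inject one FP and one FN

    C = confusionmat(yTrue, yPred);    % rows: true class, columns: predicted
    TN = C(1,1);  FP = C(1,2);  FN = C(2,1);  TP = C(2,2);

    precision = TP / (TP + FP);                       % eq. 12
    accuracy  = (TP + TN) / (TP + TN + FP + FN);      % eq. 13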

3 Theory: Astronomy

The purpose of this work is, as previously mentioned, to utilise machine learning to select intriguing stellar targets with unusual infrared properties. A typical spectral energy distribution (SED) of a star, like a black-body spectrum, is characterised by a relatively smooth peak at some specific wavelength (e.g. see the SED of the Sun in figure 3). The placement of the peak is determined by the temperature of the star and indicates in what wavelength range the majority of the stellar radiation is emitted. The black-body curve of solar-like stars (spectral type FGK) commonly peaks in the optical wavelength range and decreases as we move to longer wavelengths in the infrared (IR). The specific position of the peak in the optical signifies the colour of the star. It should be noted that black-body radiation is often the first approximation for stellar emission, even though stars are not perfect black bodies. The unconventional objects considered in this study are expected to deviate from this classical view. The targets of interest do indeed show a peak in the optical, but with a lower brightness than a common star. Furthermore, these targets are also expected to show an excess in the infrared regime. The underlying idea is that there is some component blocking a significant fraction of the stellar radiation, thus making the star dimmer. However, the stellar radiation can not be retained within, but is re-emitted at other wavelengths. This re-emitted radiation is the thermal radiation emitted in the IR. In the following sections, we will discuss known astronomical objects that fit the description of these uncommon targets, as well as a hypothetical object which lays the ground for our investigation of the use of ML on large astronomical datasets.

[Figure 3: The solar spectrum. Credit: Robert A. Rohde, licensed under CC BY-SA 3.0.]

3.1 Dyson spheres

Perhaps the most speculative candidate for a target with a SED as described in the above section is a so-called Dyson sphere. A Dyson sphere is an artificial circumstellar structure, first introduced by Dyson (1960), which harvests starlight from the enclosed star. Such a mega-structure covers parts of or the whole star, thus blocking the star's light. This causes the energy output in the optical part of the SED to drop. However, the blocked starlight is re-radiated in the thermal infrared as the mega-structure gets heated up and has turned into an ember the size of a star. This energy output is seen as a second peak or slope in the IR range of the SED. Stars encased by Dyson spheres are therefore expected to show abnormalities in their SED when compared to "normal" stars. A Dyson sphere is typically not visualised as a solid shell enclosing the star, but rather as a swarm of satellites, where each satellite absorbs a small fraction of the stellar radiation (Suffern 1977, Wright, Mullan, Sigurdsson & Povich 2014, Zackrisson et al. 2018). The covering fraction f_cov depends on the distribution and scales of the enshrouding satellites. If one assumes that this astro-engineered structure acts like a gray absorber (all wavelengths are equally affected), one expects only a general dimming of the observed flux and no further changes to the spectral shape. An ideal Dyson sphere with high efficiency can absorb all stellar light and has minimal energy loss. In reality, that is not likely the case for such a construction. Some of the absorbed stellar light will be turned into waste heat and other forms of energy. The waste heat is the centrepiece of this work, since we are identifying potential candidates based on their IR excess from thermal radiation. The AGENT formalism is one proposal of how the energy budget would look for a Dyson sphere. This formalism was first introduced by Wright, Griffith, Sigurdsson, Povich & Mullan (2014) and follows:

$$\alpha + \epsilon = \gamma + \nu \qquad (14)$$

where α is the collected starlight, ε is the non-starlight energy supply, γ represents the waste heat radiation and ν is all other forms of energy emission, e.g. neutrino radiation. For simplification of the formalism, one assumes negligible non-thermal losses (ν ≈ 0) and that the energy from starlight is much higher than that from other sources (ε ≈ 0), thus generalising eq. 14 to α = γ. The simplified black-body model of the Dyson sphere can be expressed using α and γ of the host star (see eq. 6 of Wright, Griffith, Sigurdsson, Povich & Mullan (2014)). We generalise it further by expressing the luminosity and magnitude of the DS as:

$$L_{\mathrm{DS}} = L_{\mathrm{Star}} \times f_{\mathrm{cov}} \qquad (15)$$

$$\mathrm{mag}_{\mathrm{DS}} = \mathrm{mag}_{\mathrm{Star}} - 2.5 \log_{10}(1 - f_{\mathrm{cov}}) \qquad (16)$$
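As a quick numerical illustration of eqs. 15-16 (not taken from the thesis; the stellar values are invented), the MATLAB lines below show how the covering fraction dims the optical magnitude:

    % Dimming of a hypothetical star behind Dyson spheres of increasing
    % covering fraction, following eqs. 15-16.
    fcov    = [0.1 0.5 0.9]';    % covering fractions
    magStar = 10;                % hypothetical stellar magnitude
    LStar   = 1;                 % stellar luminosity in solar units

    LDS   = LStar .* fcov;                     % eq. 15: waste-heat luminosity
    magDS = magStar - 2.5*log10(1 - fcov);     % eq. 16: observed magnitude

    % e.g. fcov = 0.9 dims the star by -2.5*log10(0.1) = 2.5 magnitudes.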

3.2 Dust-enshrouded stars

One type of known astronomical object that fits part of the SED profile of our DS models is dust-enshrouded stars, or certain stars within nebulae. Young stellar objects (YSOs) are good candidates for nebulous objects. YSO denotes a star in its early stage of evolution, and they can be divided into two subgroups: protostars and pre-main-sequence (PMS) stars. The terminology of a protostar refers to a point in time of the cloud-collapsing phase of stellar formation. When the density at the centre of the collapsing cloud has reached around 10^-10 kg m^-3, the region becomes optically thick, which makes the process more adiabatic (no heat transferred between system and surroundings). The pressure increases in the central regions and eventually reaches near hydrostatic equilibrium (the gravitational force is balanced by the pressure gradient; Carroll & Ostlie (2017)). The protostar resides deep within the parent molecular cloud, enshrouded in a cocoon of dust. Since the star has yet to begin nuclear fusion, the generated energy does not come from the core. Instead, most of the energy comes from the gravitational contraction, which heats the interior. The heated dust thereafter re-radiates the photons at longer wavelengths, which accounts for the IR source of radiation associated with these stars. This IR excess might, therefore, resemble that of a Dyson sphere. However, the protostar is not expected to peak much in the optical due to the dust which fully enshrouds it; thus, all radiation is reprocessed.

As the name suggests, a PMS star is a star in the evolutionary stage just before the main sequence, where most stars reside today in their evolution and where they spend most of their lifetime. The main sequence (MS) refers to a specific part of the evolutionary track of a star in the so-called Hertzsprung-Russell diagram (HR diagram), which plots the temperature, colour or spectral type of stars against their luminosity or absolute magnitude, to mention a few versions (see figure 4). After a protostar has blown away its envelope or birth cradle, it is optically visible. The young star has acquired nearly all of its mass at this stage but has not yet started nuclear fusion. Thereafter, the star contracts, which results in an internal temperature increment and finally initiates fusion processes. When the star is in the phase of contraction, it is in the pre-main-sequence stage (Larson 2003). Two common types of PMS stars are T Tauri stars and Herbig Ae/Be stars. T Tauri stars are named after the first star of their class to be discovered (in the constellation of Taurus) and represent the transition between stars that are still shrouded in dust and stars on the MS. Dust that surrounds the young star (of both T Tauri and Herbig Ae/Be type) is the source of IR radiation and causes the IR excess in the SED which matches the DS models. However, T Tauri stars are expected to feature irregular luminosity variations due to variable accretion and shocks, with timescales on the order of days (Carroll & Ostlie 2017). Furthermore, these types of young stars exhibit strong hydrogen emission lines (the Balmer series). Thus, an Hα spectral-line indication is a good signature for these types of stars. Herbig Ae/Be stars are likely high-mass counterparts to T Tauri stars. As the name suggests, these stars are of spectral type A or B and, in likeness to T Tauri stars, they also show strong emission lines. The major difference between T Tauri and Herbig Ae/Be stars is their masses, where the former typically have masses in the range 0.5 to 2 M⊙ (solar masses) and the latter group 2-10 M⊙.

[Figure 4: The HR diagram, with absolute magnitude and luminosity on the vertical axis and spectral class and effective temperature on the horizontal axis. The diagram also shows the different evolutionary stages of stars and some well-known stars. Image credit: R. Hollow, CSIRO.]

Furthermore, YSOs are associated with early stellar evolution phenomena such as protoplanetary disks (also known as proplyds), astronomical jets (beams of ionised matter that are emitted along the axis of rotation) and masers. Maser emission is commonly detected via emission from OH molecules (the hydroxyl radical) or water molecules, H2O. A maser is the molecular analogue of a laser and a source of stimulated spectral-line emission. Stimulated emission is a process where a photon of specific energy interacts with an excited atomic electron, which causes the electron to drop to a lower energy level. The emitted photon and the incident photon will be in phase with each other, thus amplifying the radiation (Carroll & Ostlie 2017).

3.3 Gray Dust

A potential source that could give rise to false high-f_cov Dyson sphere candidates is seemingly grey dust. Obscured starlight with no significant reddening of the spectrum is likely caused by such material in the line of sight. This is what we refer to as grey dust. Micrometre-sized dust grains would be essentially grey at optical to near-infrared wavelengths and would thus reduce the optical peak that is seen for most stars, similarly to what is expected of a Dyson sphere (Zackrisson et al. 2018). Studies suggest that such dust grains have formed in circumstellar material around supernovae (Bak Nielsen et al. 2018), but they have also been detected in the interstellar medium (Wang et al. 2015). Observations have shown that µm-sized graphite grains, together with nano- and submicron-sized silicate and graphite grains, fit the observed interstellar extinction of the Galactic diffuse interstellar medium in the range from the far-UV to the mid-IR, along with the NIR to millimetre thermal emission (Wang et al. 2015). The µm-sized grains account for the flat or grey extinction in the UV, optical and NIR. Since they absorb little (near no) radiation in the optical, they do not emit much radiation in the IR. The smaller-sized grains could, however, potentially mimic the IR signature of a DS. The grey dust likely affects the optical part of the SED to a greater extent than the IR. Interstellar grey dust will block the dispersed stellar radiation from a distant star and emit IR radiation, while grey dust that surrounds a star prevents a larger fraction of the stellar radiation from escaping, hence also giving a higher IR emission. Therefore, the IR property created by circumstellar dust is the one that most resembles that of a Dyson sphere.

3.4 M-dwarf

M-dwarfs or red dwarfs are small and cool main-sequence stars (thus appearing red in colour) and constitute the largest proportion of stars in the Milky Way, more than 70% (Henry et al. 2006). These low-mass stars develop slowly and can therefore maintain a near-constant low luminosity for trillions of years. This makes them both the oldest existing types of stars observed, but also very hard to detect at large distances. Evidently, not all M-dwarfs have debris disks. In fact, this group of stars is rare, and most observed M-dwarfs with debris disks are younger M-dwarfs (Binks & Jeffries 2017). Most studies of circumstellar debris disks are concentrated on early-type or solar-type stars, and less so on M-dwarfs. However, some studies (Avenhaus et al. 2012, Lee et al. 2020) have shown that warm circumstellar dust disks are not only observed around M-dwarfs, but that they also create an excess in the IR. The underlying sources of the dust disks are believed to be planetesimal belts where the dust is created through collisions. This also takes place in the solar system; however, the collisions are more frequent in other observed systems, which results in a higher IR excess than for solar-system debris disks (Avenhaus et al. 2012). The combination of the low luminosity, the temperature and an encompassing debris disk of an M-dwarf could create a SED that is consistent with a DS model. However, a debris disk is not expected to generate high coverage, meaning that a DS model fitted onto a target with a debris disk should show a low covering fraction. Typically, a debris disk reduces the optical emission on the order of L_IR ~ 0.01 L⋆ (Hales et al. 2014).

[Figure 5: Evolutionary track of a solar-mass star. Credit: Lithopsian, licensed under CC BY-SA 4.0.]

3.5 post-AGB stars

Post-asymptotic giant branch (post-AGB) stars are evolved stars with initial masses of 1-8 M⊙ (Engels (2005); values vary significantly between authors, ~0.5-10 M⊙). These objects are expected to have a high brightness, where a solar-mass star can reach >1000 times the present solar luminosity (see figure 5). An AGB star is both very luminous and cool, thus having strong IR radiation, and it occupies the upper right region of the HR diagram seen in figure 5. Characteristic of these stars is the energy production in the double-shell structure (helium and hydrogen) that surrounds the degenerate carbon-oxygen core.
The AGB phase can be divided into two parts: the early-AGB, a quiescent burning phase, and the thermal-pulse phase, where large energy releases are created by flashes of He-shell burning. In these phases, the outer layers of the envelopes are extended to cooler regions by the pulsation, which facilitates dust formation. The newly formed dust is pushed out by radiation pressure, which drags the gas along and leads to high mass-loss rates. In turn, the high mass-loss rates lead to the formation of circumstellar dust shells with high optical depths. The dust absorbs the stellar light and re-emits it in the IR, thus producing similar SEDs to the ones expected for DSs. However, the obscured AGB stars, and young stars as previously discussed, are often associated with OH masers. The post-AGB phase starts when the star begins to contract and heat up. The increasing temperature and constant luminosity make the star traverse horizontally towards higher temperatures in the HR diagram (see figure 5). When the temperatures reach ~25000 K, the radiation is energetic enough to ionise the remaining circumstellar envelope, which appears as a planetary nebula (Engels 2005). It is possible that post-AGB or non-pulsating AGB stars have similar SEDs to the expected DSs, since both objects are assumedly cooler with stronger IR emission.

4 Data and program

The data used within this work originates from three different catalogues. These are Gaia (Data Release 2, DR2; Gaia Collaboration (2016, 2018)), 2MASS (Skrutskie et al. 2006) and WISE/AllWISE (All Wide-field Infrared Survey Explorer; Wright et al. (2010)). The three catalogues cover different wavelength ranges which, when used together, cover optical to mid-infrared wavelengths. Since the targets of interest display somewhat lower energy output in the optical and an excess in the IR, this range of wavelengths is suitable for our project.

In this section, we will first go through the three data catalogues from which we recover our astronomical data. Thereafter follows a more in-depth review of machine learning via MATLAB, its features and utilities.

4.1 Gaia

The ESA space observatory Gaia, with its accompanying database, is one of today's greatest resources of stellar data. The satellite is constantly scanning the sky, thus creating a three-dimensional map of no less than ~1.6 billion stars. This corresponds to around one per cent of the total number of stars within the Milky Way. Furthermore, the data will reach remarkable accuracy, where targets brighter than magnitude 15 will have a position accuracy of 24 microarcseconds (µas) and the distance to the nearest stars will be determined to within 0.001% (The European Space Agency 2019). Such precision will be achieved at the end of the mission, when the satellite has measured each target about 70 times. The satellite operates with three band-pass filters, BP, G and RP, with respective central wavelengths 0.532, 0.673 and 0.797 µm (Gaia Collaboration 2016, 2018), which can be seen in figure 6.

[Figure 6: The coloured lines show the passbands for GBP (BP) (blue), G (green) and GRP (RP) (red) that define the Gaia DR2 photometric system. The thin, gray lines show the nominal, pre-launch passbands used for DR1, published in Jordi et al. (2010). Credits: ESA/Gaia/DPAC, P. Montegriffo, F. De Angeli, C. Cacciari. The figure was acquired from Gaia's homepage, where it was published 16/03/2018.]

New data releases are frequently published in the Gaia archive as the satellite repeatedly scans each target. The latest release, EDR3, was made public in December of 2020. This work is based on the catalogue DR2. Even though the (early) EDR3 has better accuracy, it is missing some properties (such as luminosity), and partly other properties for some targets (e.g. G magnitudes). It is worth noting that the errors of the magnitudes are not given in the archive; these are manually derived from the respective fluxes (see eqs. 30 & 31). The error estimation is a simplification of the real uncertainty but is not expected to have an impact on the machine learning process. The estimated errors for Gaia photometry in the Vega system are based on the derivation found on the homepage of ESA Gaia: GDR2 External calibration¹ (Paolo Montegriffo 2020). The derivation can be seen in Appendix A.2.
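For orientation, a standard first-order propagation of a flux uncertainty into a magnitude uncertainty can be sketched in MATLAB as below. The numbers are invented, and whether this matches the thesis' eqs. 30 & 31 exactly is deferred to Appendix A.2:

    % m = ZP - 2.5*log10(F)  =>  sigma_m ~ (2.5/ln 10) * sigma_F / F.
    flux    = [1.2e4; 3.4e3];     % hypothetical Gaia mean fluxes (e-/s)
    fluxErr = [15; 12];           % corresponding flux uncertainties

    magErr = (2.5 / log(10)) .* (fluxErr ./ flux);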

4.2 AllWISE

WISE is a Medium Class Explorer mission by NASA that conducted a digital imaging survey of the full sky in the mid-IR bandpasses W1, W2, W3 and W4, with respective central wavelengths 3.4, 4.6, 12.0 and 22.0 µm (Wright et al. 2010). The catalogue contains photometry and astrometry for over 500 million objects. In likeness to the Gaia satellite, the WISE space telescope accumulates data for each target multiple times. The independent exposures for each point on the ecliptic plane were typically 12 or more, while observational points at the ecliptic poles reached several hundred. In November 2013, the AllWISE Data Release was generated. This source catalogue is a combination of data from the WISE Full Cryogenic, 3-Band Cryo and NEOWISE Post-Cryo survey phases. This has enhanced the sensitivity in the W1 and W2 bands, thus making the AllWISE catalogue superior to its predecessor, the WISE All-Sky Release Catalog. Furthermore, the photometric accuracy of AllWISE, in all four bands, has improved due to corrections of the source flux bias and more robust background estimations. The only exception, where WISE All-Sky is better than AllWISE, is the photometry for sources brighter than the saturation limits in the first two bands, W1 < 8 & W2 < 7. Both these catalogues are commonly shown for our Dyson sphere candidates when searched for in VizieR², where we pay extra attention to the variability flag for each band. From this catalogue we also utilise a parameter called cc-flag, which is discussed more in-depth in section 6.5.

4.3 2MASS

Between 1997 and 2001, the Two Micron All Sky Survey produced photometric and astrometric measurements over the entire celestial sphere. A pair of identical telescopes at Mount Hopkins, Arizona, and Cerro Tololo, Chile, made NIR observations which resulted in a Point Source Catalogue containing above 470 million sources. The 2MASS All-Sky Data Release includes the aforementioned Point Source Catalogue, 4.1 million compressed FITS images of the entire sky and an additional Extended Source Catalogue of 1.6 million objects. The NIR photometric bands used in 2MASS and utilised within this work were J, H and Ks, with corresponding central wavelengths 1.247, 1.645 and 2.162 µm (Skrutskie et al. 2006). These passbands largely correspond to the common bands J, H and K, first introduced by Johnson (1962), with the adjustment that the 2MASS Ks filter (s for "short") excludes wavelengths beyond 2.31 µm to minimise thermal background and airglow, since the telescopes are situated on the ground (Skrutskie et al. 2006). For this work, we will use the magnitude measurements in these three aforementioned filters (J, H and Ks) of this catalogue.

¹ https://gea.esac.esa.int/archive/documentation/GDR2/Data_processing/chap_cu5pho/sec_cu5pho_calibr/ssec_cu5pho_calibr_extern.html
² https://vizier.u-strasbg.fr/viz-bin/VizieR

For this work, we will use the magnitude measurements in these three aforementioned filters (J, H and Ks) of this catalogue.

4.4 MATLAB

MATLAB supports various machine learning techniques and applications. Both supervised and unsupervised learning, and their respective sub-groups classification & regression and clustering, are available. MATLAB provides numerous applications that are suitable for different problems to simplify the machine learning tasks at hand. We are using supervised ML on a classification problem, therefore utilising MATLAB's Classification Learner app. This app automatically trains selected models and helps choosing the optimal one, as well as providing manual hyperparameter tuning. The model types available include decision trees (section 2.3.2), discriminant analysis (section 2.3.4), support vector machine (section 2.3.3), k-nearest neighbours (section 2.3.1) and ensemble (section 2.3.5) classifiers. MATLAB's CL also provides numerous ways to explore the data, evaluate the learning and visualise the results (such as the confusion matrix described in section 2.5.1). The app also prevents overfitting by default by applying cross validation, as previously mentioned.

When loading the training sets into the CL, the app allows for numerous feature selections. Firstly, we can select which parameters to include in the training; secondly, which ones are set to be predictors (not to be confused with the predictions) and which ones are responses. For our purpose, the predictors are typically colours and their respective errors (such as G − W3, G − J etc.) and the responses are naturally the object types (the different classes, such as stars or Dyson spheres). The implications of varying features are further discussed in section 6.6. Furthermore, the scatter plot provides a visual representation of the labelled data by showing dots for the correctly assigned data points and crosses for the incorrectly classified ones. The axes of the scatter plot can be changed, thus providing the opportunity to analyse the correlations between the predictors and recognise patterns in the labelled data. The latter is useful since we can find commonly misclassified points and identify problematic regions where the data of various classes overlap. It is also a tool for us to see how well the algorithm performed, even though the favoured method for evaluation of the model is via the confusion matrix (see section 2.5.1).

The application also provides the opportunity to use a Principal Component Analysis (PCA) in the machine learning. PCA is a non-parametric statistical technique that is used for dimensionality reduction, where it infers the most informative dimensions of the input data (Lindholm et al. 2020). The main objective of the PCA is, therefore, to find the dimensions along which the data varies the most. The presumption of dimensionality reduction of a dataset $\{x_i\}$, where each $x$ is an $n$-dimensional vector of numerical variables, is that there is some amount of redundancy in the $n$-dimensional representation of $x$. Furthermore, the $n$-dimensional representation of $x$ conceivably lies close to or on some manifold which allows a lower-dimensional representation. The goal is therefore to create another dataset $\{t_i\}$, where $t$ is a $q$-dimensional vector and $q < n$ (Lindholm et al. 2020). The purpose of PCA is to get a more compact representation of the dataset and hopefully filter out the noise which muddles the classification. Secondly, PCA is also beneficial in lowering the risk of overfitting for high-dimensional data: a dataset with a large number of features reduces the ability to generalise patterns in the data. This can be aggravated by non-normalised features and the lack of a well-defined kernel that can translate the high-dimensional problem into another space. However, PCA has some known limitations. Of relevance to this work, the classification accuracy might be affected: the variance-based PCA does not consider the differentiating attributes of the classes, and it might thus discard low-variance components which hold information that distinguishes one class from another. We noted that the classification accuracy was reduced when activating the PCA throughout the whole work and various rounds of implementations. We have therefore decided to not further use the PCA in the ML of this work.

The models mentioned in section 2.3 are also the base-models found in MATLAB's CL. These five base models (decision trees, discriminant analysis, SVM, kNN and ensemble) have "sub-models" that work in slightly different ways, such as in different metric spaces or with different boundary equations. Furthermore, most models come with the option of hyperparameter tuning, where we can restrain the models to our liking. The ones we referred to as sub-models are commonly the base-model with specific choices of hyperparameters. Therefore, the manual changes that are available are many times equivalent to the default model settings in the CL. In the following sections, we will review the alternatives given in the CL app.
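For illustration of the PCA option discussed above, a minimal sketch of what the variance-based reduction amounts to at the command line, using MATLAB's pca on stand-in data (the matrix X and the 95% threshold are assumptions, not values used in this work):

    % Variance-based PCA on a feature matrix X (rows = objects,
    % columns = colour features), keeping the q components that
    % together explain 95% of the variance. X is random stand-in data.
    X = randn(500, 9);                      % 500 objects, 9 features
    [coeff, score, ~, ~, explained] = pca(X);
    q = find(cumsum(explained) >= 95, 1);   % smallest such q
    T = score(:, 1:q);                      % q-dimensional dataset {t_i}
    fprintf('reduced from %d to %d dimensions\n', size(X, 2), q)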

4.4.1 Decision trees

Firstly, we have the decision trees, which consist of a group of three decision tree models: simple, medium and complex. Their differences are their respective maximum numbers of splits: 4, 20 and 100. However, the maximum number of splits is a hyperparameter that can be set manually if desired. Likewise, we can change the split criterion and the surrogate decision splits. The former includes the options Gini's diversity index, Twoing rule and Maximum deviance reduction. These options are, in general terms, functions that estimate the node purity, i.e. the frequency of miscategorisation. Gini's diversity index was covered in section 2.3.2, and Maximum deviance reduction works in a similar fashion by measuring the node impurity, whereas the Twoing rule increases the node purity. The surrogate decision splits are used when the data is missing values: if the split depends on one variable and a data point is missing this particular variable, then the surrogate split predicts the actual split.

4.4.2 Discriminant analysis

Secondly, we have the discriminant analysis models. They come in two options, linear and quadratic, both of which were discussed in section 2.3.4. The only further manual adjustment is the covariance structure, set to either full or diagonal. Covariance structures are patterns in the covariance matrices (eq. 11) which describe the covariance pattern between each pair.

4.4.3 Support Vector Machine

A larger group of models is the SVM. There are six types of SVM models to choose between in MATLAB's CL: linear, quadratic & cubic, and fine, medium & coarse Gaussian SVM. As previously mentioned in section 2.3.3, these options represent various hyperplanes in n-dimensional space and are specifically important for non-linearly separable data. These are also referred to as the kernel function (also covered in section 2.3.3), which is included in the advanced options. The second option here is to set the box constraint level. This parameter regulates the cost imposed on margin violation, which helps to prevent overfitting. With an increasing box constraint level, fewer support vectors are used; however, the calculations get more demanding and time-costly (MathWorks 2021).

The difference between the three Gaussian SVMs is the kernel scale. For the non-Gaussian SVMs, the kernel scale is by default "automatic", where the software selects a suitable scale using a so-called heuristic procedure. However, it can also be manually adjusted. For fine Gaussian it is automatically set to 1.3, for medium 5.1 and for coarse 20. The kernel scale is, as the name implies, the scale of the kernel. If the kernel scale is small for a Gaussian distribution, the kernel falls off rapidly, and if it is larger, the kernel falls off more slowly. In other words, the kernel of the coarse Gaussian SVM is "broader" than that of the fine Gaussian SVM.

4.4.4 k-nearest neighbour

Another type of model which has many versions in MATLAB's CL is the kNN model. For this particular model, the default options are fine, medium, coarse, cosine, cubic and weighted kNN. The three alterable options for the kNN model are the number of neighbours, the distance metric and the distance weight. The only difference between the first three "sub-models" is the number of neighbours: the fine kNN has 1 neighbour, medium has 10 and coarse has 100. All three of these have by default a Euclidean distance metric and an equal distance weight. The three latter kNN models, cosine, cubic and weighted, all have 10 neighbours by default but with varying distance metrics. Cosine has, as the name suggests, a cosine distance metric (distances expressed as one minus the cosine of the angle between the considered data points) and an equal distance weight. The cubic kNN model has a "cubic" or Minkowski distance metric and an equal distance weight. Only the weighted kNN model has a squared inverse distance weight, but a Euclidean distance metric. If the distance weight is "equal" it means no weighting, and squared inverse is evidently $1/D(x_j, x_i)^2$ (see eq. 1 for reference). Besides the aforementioned parameter options, which are standard for the built-in model versions in the CL, the distance metric has several choices: Euclidean, Minkowski, Cosine, City block, Chebychev, Mahalanobis, Correlation, Spearman, Hamming and Jaccard distance. Moreover, the distance weight has the options Equal, Inverse and Squared inverse.

4.4.5 Ensemble

Finally, we have the ensemble models. This group of models consists of boosted trees, bagged trees, subspace discriminant, subspace kNN and RUSBoosted trees. The two firstly mentioned (boosted and bagged trees) are seemingly the most common types of ensemble models and were reviewed in section 2.3.5. These two models work with the learner type decision tree and the ensemble methods AdaBoost and Bag, respectively. The two subspace ensemble models (subspace discriminant and subspace kNN) operate in a similar way as bagged trees. This means that an ensemble of discriminant/kNN models is trained on distinct random subsets of the training data in parallel. Here the learner type is discriminant or kNN, and the ensemble method is subspace. For the last ensemble model, RUSBoosted trees, the learner type is again decision tree and the ensemble method is RUSBoost.

The function of the ensemble method AdaBoost (Adaptive Boosting) is that the following weak learners are tweaked in favour of the previous classifiers' misclassifications, i.e. a sequential process as described in section 2.3.5. The ensemble method Bag works, as also described in section 2.3.5, by creating random subsets of the training data that the ensemble of decision trees can train on. The subspace method works similarly to the bag method, where random subspace ensembles are made up of the same model fit but on random groups of the input data. The final ensemble method is RUSBoost (Random Under Sampling Boost). This algorithm is good for datasets where one class is distinctly overrepresented (Mounce et al. 2017). The common hyperparameter for all versions of the ensemble models which can be altered is the number of learners, which is by default set to 30. For the tree-ensemble models (boosted, bagged and RUSBoosted) we can also set the maximum number of splits, as for the decision trees mentioned in section 4.4.1. Finally, for the subspace-based ensemble models, the subspace dimension can be changed. It can prove beneficial to adjust this parameter depending on the dimensionality of the data; however, high-dimensional data might result in several subspaces containing only noisy features. Furthermore, when the subspace dimension is large, it is more probable that there are overlaps of features in different subspaces (Li & Zhao 2009).
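At the command line, the CL sub-models reviewed above correspond to fit functions with explicit name-value hyperparameters. A sketch with stand-in data; X, Y and the parameter values are illustrative, mirroring the defaults quoted in the text:

    % How the CL "sub-models" map onto MATLAB fit functions.
    % X = predictor matrix, Y = class labels (stand-in data).
    X = randn(300, 5);
    Y = categorical(randi(2, 300, 1));
    mdlTree = fitctree(X, Y, 'MaxNumSplits', 20, ...        % "medium" tree
                       'SplitCriterion', 'gdi');            % Gini's index
    mdlDisc = fitcdiscr(X, Y, 'DiscrimType', 'quadratic');
    mdlSVM  = fitcsvm(X, Y, 'KernelFunction', 'gaussian', ...
                      'KernelScale', 5.1);                  % "medium" Gaussian
    mdlKNN  = fitcknn(X, Y, 'NumNeighbors', 10, ...         % "cosine" kNN
                      'Distance', 'cosine', 'DistanceWeight', 'equal');
    mdlEns  = fitcensemble(X, Y, 'Method', 'Bag', ...       % bagged trees
                           'NumLearningCycles', 30, ...
                           'Learners', templateTree('MaxNumSplits', 100));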

5 The general method

The approach of this work can be explained as individual programs that together build the whole system. It is a sequential process that is repeated multiple times to test out and improve the models, but also to get a diverse sample of candidates based on various samples of random stars and DS models. The overall process starts by generating a training set which contains samples of stars, DSs and, in some iterations, a third group of dust-enshrouded stars referred to as the YSO group. Thereafter the training set is loaded into MATLAB's CL, where we try out different models and hyperparameter settings by training the models on the imported dataset. The classification models with the highest performance are kept and exported. The trained models are then applied on a large set of stars from the Gaia+2MASS+WISE catalogue to classify the entries accordingly, thus generating a list of DS candidates. The output list consists of the candidates that all models have assorted into the same class (DS). A further program exists in order to evaluate the match between the DS models (from the training set) and the output candidates. The performance and accuracy of the models can be checked by applying the trained models onto a test set consisting of a smaller sample similar to the training set, but with new sets of stars, DSs and possibly YSOs. With this, we can evaluate how well the models categorise the objects, since we know both their predicted and true classes. The final step in the process is to manually inspect the output candidates to see whether the targets can be identified as any other astronomical objects, or whether they lack natural explanations of their behaviour, thus making them good DS candidates.

In the following sections, we will review each individual program in more depth. This includes the basic programs that are used for all iterations. Later we will present a step-by-step analysis of the specific implementations that were done with the aim of improving the algorithms. That discussion will include the execution, the outcome and the changes that followed, as well as an evaluation of each implementation.

5.1 Forming datasets and building DS models

5.1.1 Training set stars

The first class of the training set is the group that one expects the majority of the objects to be assorted in, namely the common stars. This training set is constructed from a large sample of all types of stars within a radius of 100 pc. The number of training stars can be set for each run (see table 1 for the final number used in the training of the algorithm found in appendix A.4) and sampled into a list consisting of the Gaia source id and the position (ra and dec). The list also contains the magnitudes and corresponding errors in each passband filter (BP, G, RP and J, H, Ks and W1, W2, W3 & W4) from Gaia (Gaia Collaboration 2016, 2018), 2MASS (Skrutskie et al. 2006) and WISE (Wright et al. 2010), respectively. Alternatively, the magnitudes are exchanged for colours. These colours are based on the aforementioned magnitudes and are calculated as the G magnitude subtracted by each magnitude individually from the remaining filters. That is, for the Gaia filters, BP − G, RP − G; for the 2MASS filters, G − J, G − H, G − Ks; and from WISE, G − W1, G − W2, G − W3 and G − W4. These colours or magnitudes cover the range of optical, near-IR and mid-IR, thus covering the essential regions of the SED of a DS: in other words, the optical dimming and the IR-excess.

Finally, the interstellar reddening (E(BP−RP)) is listed from Gaia data, as well as the bolometric stellar luminosity, derived from the G magnitude in the Vega system ($G_{Vega} = G - (25.7934 - 25.6884)$) under the assumption that this is a main sequence star, via a fitting formula (Matías Suazo, private communication). All magnitudes are converted from Vega to AB magnitudes by adding the zero-point difference between the two systems to the default Vega magnitudes, see eq. 17 (Paolo Montegriffo 2020, Anthony Gonzalez 2012, Caltech 2017).

$G_{AB} = G + (25.7934 - 25.6884)$   (17a)
$BP_{AB} = BP + (25.1161 - 24.7619)$   (17b)
$RP_{AB} = RP + (25.3806 - 25.3514)$   (17c)
$J_{AB} = J + 0.901$   (17d)
$H_{AB} = H + 1.365$   (17e)
$Ks_{AB} = Ks + 1.839$   (17f)
$W1_{AB} = W1 + 2.699$   (17g)
$W2_{AB} = W2 + 3.339$   (17h)
$W3_{AB} = W3 + 5.174$   (17i)
$W4_{AB} = W4 + 6.620$   (17j)
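A minimal sketch of this conversion, with the offsets of eq. 17 collected in a lookup map (the magnitude value is made up):

    % Vega-to-AB conversion of eq. 17: each band gets a constant
    % zero-point offset added to its Vega magnitude.
    dm = containers.Map( ...
        {'G','BP','RP','J','H','Ks','W1','W2','W3','W4'}, ...
        [25.7934-25.6884, 25.1161-24.7619, 25.3806-25.3514, ...
         0.901, 1.365, 1.839, 2.699, 3.339, 5.174, 6.620]);
    G_vega = 5.43;               % made-up Vega magnitude
    G_AB   = G_vega + dm('G');   % eq. 17a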

5.1.2 Training set Dyson spheres

The training set of Dyson spheres is created on the basis of a number of choices (which can be changed between the runs) of real stars from a sample of observed stars within 100 pc in the Milky Way.

Figure 7: The SEDs of Dyson spheres (coloured lines) and the star that the models are based on (black). Left: models with temperatures between 200 and 600 K and fixed covering fraction f_cov = 0.5. Right: models of varying covering fraction with fixed T_eff = 300 K.

Furthermore, two key parameters are also selected.

These are the covering fraction f_cov, commonly arranged to the interval 0.5 to 0.95, and the temperature, within the range 300 < T_eff < 600 K. The range of covering fractions prevents identifications of low covering fraction Dyson spheres; however, it is chosen to eliminate natural sources of IR-excess such as debris disks, which typically show a low degree of IR-excess. The temperatures are within the range for which the second peak lies within the near- to mid-IR regime. This is an adequate assumption given the thermodynamic efficiency of a DS around a solar-like star. Wright, Griffith, Sigurdsson, Povich & Mullan (2014) state that a waste heat of ~300 K is an acceptable estimate and that a DS with ν = ε = 0, thus α = γ (see section 3.1 for details), requires that the DS surface must increase as $A \propto (1/T_{waste})^4$. Therefore, a DS at 1 astronomical unit (AU) with 95% efficiency around a solar-like star (5800 K) leads to $T_{waste} = 290$ K, while an increased efficiency of 99.5% with $T_{waste} = 29$ K requires a surface area increased by a factor of $10^4$, i.e. a sphere at 100 AU. The temperature range of 300 to 600 K is, therefore, a reasonable assumption. The two parameters characterise the Dyson sphere model and its final magnitude, which is a combination of the enshrouded star's flux and that of the structure itself according to

$mag_{tot} = -2.5\log_{10}\left(10^{mag_{dim\star}/(-2.5)} + 10^{mag_{DS}/(-2.5)}\right)$   (18)

The magnitudes of the dimmed stars (the ones enshrouded by the DSs), $mag_{dim\star}$, are based on the magnitudes of the input stars and scaled with the covering fraction as stated in eq. 19.

$mag_{dim\star} = mag_{\star} - 2.5\log_{10}(1 - f_{cov})$   (19)

The magnitudes of the DSs are derived by constructing black body spectra which depend on the chosen temperature, T_eff. Firstly, Planck's law (see eq. 20) is used to simulate the emissive power per unit area of the Dyson sphere, which depends on the temperature T_eff and a wavelength λ_Band from the range 0.532–22.0 µm. This array of wavelengths is set such that each point corresponds to the centre of each passband filter from 2MASS (J, H and Ks; Skrutskie et al. 2006), Gaia (BP, G and RP; Gaia Collaboration 2016, 2018) and, finally, WISE (W1, W2, W3 and W4; Wright et al. 2010). Initially, the wavelengths used in the following derivations are set to the range 100 Å < λ < 1000 µm. Besides wavelength and temperature, Planck's law includes the constants h (Planck's constant), c (the speed of light) and k (the Boltzmann constant).

$f_\lambda(\lambda, T_{eff}) = \frac{2\pi hc^2}{(10^{-10}\lambda)^5}\,\frac{1}{e^{hc/(10^{-10}\lambda k T_{eff})} - 1}$   (20)

The total emissive power $f(\lambda, T_{eff})$ is derived via eq. 21, where the radius R is derived from Stefan-Boltzmann's law, see eq. 22, including σ, the Stefan-Boltzmann constant, and the DS luminosity $L_{DS}$ according to eq. 15. The final flux of the DS is calculated via eq. 23 and the AB magnitude via eq. 24. Lastly, the magnitude is interpolated at the points set by λ_Band.

$f(\lambda, T_{eff}) = f_\lambda(\lambda, T_{eff}) \cdot 10^{7} \cdot 4\pi R^2$   (21)

$L_{eff} = 4\pi R^2 \sigma T_{eff}^4$   (22)

$f_{DS} = \frac{f(\lambda, T_{eff}) \cdot 10^{10} (\lambda \cdot 10^{-10})^2}{c}$   (23)

$mag_{DS} = -2.5\log_{10}(f_{DS}) - 48.6$   (24)

A sample of Dyson sphere models is seen in figure 7. The left graph shows how the temperature alters the shape of the SED and includes five models with fixed f_cov = 0.5 and temperatures ranging from 200 to 600 K. The right graph shows the effect of a varying covering fraction and includes five models with fixed T_eff = 300 K and f_cov varied from 0.5 to 0.9. The marked points in each graph show the interpolation at the query points λ_Band. The figure illustrates typical shapes of DS models as well as highlighting the effects of varying T_eff and f_cov on the shape of the SED.
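As a rough illustration, the chain of eqs. 18–24 can be sketched in MATLAB as below. The physical constants are standard SI values; L_DS (eq. 15) and the stellar magnitude are made-up stand-ins, and the 10^7 and 10^10 unit factors are copied from the equations as printed, so the output should be read as illustrating the SED shape rather than calibrated magnitudes:

    % Black-body DS magnitudes following eqs. 18-24 (lambda in Angstrom).
    h = 6.626e-34; c = 2.998e8; k = 1.381e-23; sigma = 5.670e-8;
    Teff = 300; fcov = 0.5;
    L_DS = fcov * 3.828e26;                       % stand-in for eq. 15
    R = sqrt(L_DS / (4*pi*sigma*Teff^4));         % eq. 22 solved for R
    lam  = logspace(2, 7, 2000);                  % 100 A ... 1000 um
    flam = 2*pi*h*c^2 ./ (1e-10*lam).^5 ./ ...
           (exp(h*c ./ (1e-10*lam*k*Teff)) - 1);  % eq. 20
    % (the shortest wavelengths underflow to zero flux; harmless here)
    f    = flam * 1e7 * 4*pi*R^2;                 % eq. 21
    fDS  = f * 1e10 .* (lam*1e-10).^2 / c;        % eq. 23
    magDS = -2.5*log10(fDS) - 48.6;               % eq. 24
    lamBand = [0.532 0.673 0.797 1.247 1.645 2.162 ...
               3.4 4.6 12.0 22.0] * 1e4;          % filter centres in A
    magBand = interp1(lam, magDS, lamBand);       % sample at the filters
    magStar = 5.0;                                % made-up input star
    magDim  = magStar - 2.5*log10(1 - fcov);      % eq. 19
    magTot  = -2.5*log10(10.^(-magDim/2.5) + 10.^(-magBand/2.5)); % eq. 18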

The DS coordinates are simulated based on the same star(s) on which the DS model is built. The ra and dec are taken to be the coordinates of the input star plus a random factor between −0.2 and 0.2 degrees (12 arcmin), to avoid eventual correlations between the base-star and the output of DS candidates. The number of DS models used in the final training set, on which the ML algorithm in appendix A.4 is based, can be seen in table 1.

Table 1: The number of entries in each group of the final training dataset and the constraints on the DS training set.

    class            stars    DSs          YSOs
    Nr. of entries   2000     480          100
    magnitude        –        4 < G < 6m   –
    f_cov            –        0.5-0.95     –
    T_eff            –        300-600 K    –

5.1.3 Training set YSOs

The group of nebulous stars that accounts for our third classification group in a later implementation stage is based on two lists of objects. The first part of the set, which makes up the greater share of the training set, consists of stars within prominent nebulae which were found in previous iterations and were classified as DSs by the algorithm. The Gaia source IDs of these objects are saved in a txt-file which is read by the program that forms the YSO training set. The list is read and crosschecked with the large dataset, where relevant information (such as the magnitude in all used filters) is extracted for each target. After new runs of the program, when further dust-enshrouded stars have been classified as DS by the algorithm, these targets are appended to the YSO training sample to enhance the number of training indices. The second part of the YSO training set comes from a former survey made by Marton et al. (2019), where YSO candidates in the Gaia DR2+AllWISE catalogue were identified using machine learning (see section 7.1.2 for further information). We used a sample of 8 confirmed YSOs from that study and incorporated them in the YSO training set in a similar fashion as for the former group that we identified ourselves. The final training set of the YSO group contained 100 entries, as seen in table 1.

5.2 Finding and identifying the DS candidates

The algorithms are trained to recognise "normal" stars and unusual targets with high IR-excess. Typically, numerous models and versions of the models (or hyperparameter tunings) are trained. We evaluate each model by its accuracy, but foremost by examining its confusion matrix. In a previous discussion of the confusion matrix we stated that the most important parameter to consider is the false positives; therefore, the best model is the one with a low number of FP. However, we do not neglect the remaining entries of the confusion matrix. The best models, with low FN & FP and high TP & TN, are exported to the workspace. Here we apply the algorithms onto a large dataset with considerably more targets. These are selected objects within < 500 pc, and they are "good" entries, i.e. with non-zero profile-fitting magnitude errors. From that large file of 765266 entries, the program extracts the same parameters that are used in the training of the models, such as magnitudes. The models are applied onto the dataset, where the output is the predicted type, i.e. class. The predictions are stored in a matrix, where the program later checks which targets have been equally classified by all applied models. These common targets are thus the DS candidates, whose Gaia source IDs are stored in an output file. The output file is thereafter used to manually investigate the DS candidates. Each DS candidate is plotted in a magnitude colour diagram. Such an example can be seen in figure 9, where the red dots are the commonly classified DS candidates and the black dots represent the training set of "normal" stars for reference.
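A minimal sketch of this consensus step (the names models, X, sourceID and the class label "DS" are stand-ins for the real workspace contents):

    % Keep an entry as a DS candidate only if every exported model
    % assigns it to the DS class.
    models = {mdlTree, mdlSVM, mdlKNN};       % assumed trained classifiers
    pred = strings(size(X, 1), numel(models));
    for m = 1:numel(models)
        pred(:, m) = string(predict(models{m}, X));
    end
    isDS = all(pred == "DS", 2);              % unanimously classified as DS
    candidateIDs = sourceID(isDS);            % Gaia source_id output list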
5.2.1 Manual analysis of DS candidates

The manually executed part of our work is to further analyse the common DS candidates, to see whether it is possible to determine their nature based on data outside the three catalogues and the parameters already used in the study. For this we exploit the VizieR (Ochsenbein 2000) and SIMBAD (Wenger et al. 2000, http://simbad.u-strasbg.fr/simbad/) databases, along with the interactive sky atlas Aladin Lite (Bonnarel et al. 2000, Boch & Fernique 2014, https://aladin.u-strasbg.fr/aladin.gml) and the infrared science archive IRSA (Skrutskie et al. 2006, https://irsa.ipac.caltech.edu/frontpage/).

Aladin Lite is a visual tool that allows us to view the targets (the DS candidates) in multiple wavelength ranges from various surveys. In particular, we view our targets in the optical (DSS2 survey) and the infrared (AllWISE). Key features which we look for are how bright the star appears in the optical, potentially visible nebulous features and whether the target is bright and/or red in the IR. It is also possible to see whether the targets are contaminated by artefacts, such as ghosts from a bright nearby source. This is typically the first step in our analysis, to eliminate evidently false DS candidates.

If no distinct indications of nebulae are seen by viewing the target in Aladin Lite, a more detailed analysis can be done in IRSA. IRSA offers data analysis and visualisation tools for near- to far-infrared observations: 2MASS (Skrutskie et al. 2006), WISE (Wright et al. 2010), IRAS (Neugebauer et al. 1984), AKARI (Murakami et al. 2007) and Planck (Planck Collaboration 2011b,a, Planck HFI Core Team 2011, Planck Collaboration 2020a,b). The primary use, for our purpose, is to detect any subtle indications of the presence of nebular dust emission in the AllWISE filters. IRSA allows for individual images in each filter (W1, W2, W3 & W4). Emission from dust that is not apparent in Aladin Lite commonly becomes visible in W3 and W4. Targets that show strong noise in these filters are unwanted and not optimal DS candidates. Therefore, we exclude such sources from our list of DS candidates in table 3.

The natural next step in the analysis is to examine the potential DS candidate using catalogues beyond the ones used in the ML process. VizieR offers the most complete library of published astronomical catalogues. Information from various catalogues helps to determine the true nature of the target if possible, thus discerning natural astronomical objects from DS candidates. As previously stated, young stellar objects often match the characteristics of DS candidates. Therefore, common properties that we pursue information about are variability, indications of Hα or maser emission, contamination, and whether any catalogue has classified the target and on what premises. Findings of these parameters are reflected in table 3.

Variability is an important factor to address when studying DS candidates. Assuming that the Dyson sphere is not a solid shell, but a swarm of solar panels that orbits the star, it could result in a flickering of the stellar light. This is also true for stars with orbiting planets that move in front of the star along our line of sight, so-called transiting planets. However, the difference is that a planet, or a few planets, periodically passes in front of the star, creating a periodic variability. It is easier to establish the variability period for a star with fewer orbiting bodies, but with today's programs, or even by a trained eye, these patterns of transits are recognised. Instead, assuming that the DS consists of larger segments orbiting the star, the periodic variability could mimic what is observed for a planet. Some suggest that the light variability caused by a swarm of satellites that orbits the star randomly, meaning that they do not necessarily move in unison or even in the same direction, will not appear regularly in time (Wright et al. 2016). However, there are natural phenomena that show irregular variability patterns similar to those of DSs. For example, dust rings that are commonly seen around young stars have similar effects on the variability as DSs. A known example where both dust and a DS structure with low f_cov were considered as the origin of an irregular variability is the star known as Tabby's star (Boyajian et al. 2016, Meng et al. 2017, Wright et al. 2016). Some of the latest theories of the true nature of this star are that the dimming is caused by fragments of an orphaned exomoon, one of the moons in the system that formerly was an exomoon of another planet (Martinez et al. 2019). The recently verified binary companion, a red dwarf, has also been proposed as a potential source of instability which could affect the aperiodicity (Pearce et al. 2021). In this work we make use of high covering fraction DS models, which have variability patterns less similar to stars with debris disks. Also, assuming that the DS consists of separate satellites and not a few large panels, the variability pattern should be minimal. Therefore, we regard a target with no signs of variability as a better DS candidate.

Former studies and references are, for some targets, available in SIMBAD. This database provides basic data, cross-identifications and bibliography for astronomical objects. We are particularly interested to see whether the potential DS candidate has been reviewed, thoroughly analysed and classified in former studies. Several bibliographies of catalogues or surveys do not analyse the targets in depth; they are simply listed within the source. Occasionally, there exist papers with more in-depth analyses of the targets, which entailed a classification of the object. Therefore, one has to be observant of the type of publication and the information it holds.

5.3 Testing set

To further evaluate the performance of the ML algorithms, we can apply the trained models on a constructed testing set that is different from the training set. This program loads a new input file of stars and DS models (and YSOs), similar to their respective training sets, however with new entries. These input files contain the same parameters as the training sets, e.g. magnitudes and corresponding errors. These parameters are combined into a testing set where the true class of all entries is known.

The trained algorithms are thereafter applied to the testing set, where the entries are categorised. If the predicted and true classes are equivalent, the entry is correctly classified. The numbers of correctly and incorrectly classified objects are calculated and presented as an output. Trivially, the desired result is 100% correctly classified objects, so that the algorithm is accurate in its predictions.
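A minimal sketch of that check, assuming categorical labels and stand-in variable names (trainedModel, Xtest, Ytest):

    % Compare predicted and true classes on the testing set.
    Ypred = predict(trainedModel, Xtest);
    acc   = mean(Ypred == Ytest);            % fraction correctly classified
    fprintf('correctly classified: %.1f %%\n', 100*acc)
    disp(confusionmat(Ytest, Ypred))         % TP/FN/FP/TN breakdown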

5.4 Best fitted model

This segment of the full program evaluates the similarity between the mutually found DS candidates and the DS models. Given that the program selects the best fitted DS model, it also gives the most optimal choice of DS parameters T_eff and f_cov for that particular entry. Thus, two properties of the potential DS candidate are derived. From the program that creates the training set of DS models, we construct a sample of models to cross-reference the candidates with. The model parameters we have adapted are 0.1 < f_cov < 0.95, with step size 0.05, and 200 < T_eff < 700 K, with 100 K intervals. We chose a wider interval for both f_cov and T_eff than what is used for the DS models in order to extend the analysis of the properties of the targets. For example, a lower covering fraction (~0.1) is more likely to suggest a natural cause for the IR-excess, such as a debris disk. The stars on which the DS models are based are commonly selected from a sample where 4 < G < 6 mag, to limit the models to stars around the solar magnitude. This program was also used for the evaluation of how similar the entries of the YSO training set are to DS models.

Figure 8: The SED of the DS candidate Nr.9 is plotted in black, the corresponding best fitted DS model in darker blue and the scaled best-fitted model in dark red. The light blue curves represent each tested DS model and the beige curve is the star on which the models were based. The best-fitted model, with T_eff = 400 K and f_cov = 0.15, has RMSD = 0.55052.

Firstly, the Gaia source IDs are cross-referenced to the entries in a large data file of Gaia+2MASS+WISE data that was previously used in the classification by the trained algorithms. The corresponding magnitude information is retrieved from the data and saved into an array. Thereafter, we loop over all DS models and calculate the root-mean-square deviation (RMSD), which estimates the deviation between each model and the DS candidate. The RMSD is taken between the magnitudes recorded in each filter of Gaia+2MASS+WISE, or equivalently each data point on the SED. The RMSD is calculated according to eq. 25, where $mag_{candidate,i}$ and $mag_{model,i}$ are the magnitudes of the DS candidate and the DS model in each filter.

$RMSD = \sqrt{\frac{1}{10}\sum_{i} \left(mag_{candidate,i} - mag_{model,i}\right)^2}$   (25)

The T_eff and f_cov of the best-fitted model, i.e. the one with the smallest RMSD, are saved together with the derived RMSD. In the final step, we allow the model to be scaled. This step was introduced since we found that several DS candidates that were identified by the algorithms differed in scale, but not in shape, from the DS models of the training set. The algorithms thus found targets with SEDs comparable to DSs but with an average magnitude that was slightly higher or lower than the model. Therefore, we adjusted the program by first introducing the mean deviation of the magnitudes of the best fitted DS model and the DS candidate as

$S = \frac{1}{10}\sum_{i} \left(mag_{candidate,i} - mag_{model,i}\right)$   (26)

The best-fitted model is thereafter scaled with S as $mag_{scaled,i} = mag_{model,i} + S$. Finally, the RMSD is once again calculated, now between the scaled model and the DS candidate. However, a large scaling factor could mean that the absolute magnitude is inconsistent with main sequence stars; if the model is shifted to a much higher brightness, it is likely that the target is an evolved star. The resulting SEDs are plotted as seen in figure 8, where the black line corresponds to the DS candidate, the red line is the best fitted and scaled model, the light blue curves are all applied un-scaled models and the beige curve is the star on which the DS models are based. Each figure also displays the RMSD, T_eff and f_cov of the best-fitted model.
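A compact sketch of eqs. 25–26 over the ten filters (magCand, 1×10, and magModels, N×10, are stand-ins for the candidate and model-grid magnitudes):

    % Pick the DS model with the smallest RMSD, then shift it by the
    % mean offset S and recompute the RMSD for the scaled model.
    rmsd = sqrt(mean((magModels - magCand).^2, 2));   % eq. 25, per model
    [bestRMSD, i] = min(rmsd);
    S = mean(magCand - magModels(i, :));              % eq. 26
    magScaled  = magModels(i, :) + S;
    rmsdScaled = sqrt(mean((magScaled - magCand).^2));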

6 Process

Supervised machine learning is an iterative process in which modifications of the programs and the approach are carried out as the work proceeds. Adjusting the programs to increase their performance is a central step in ML. As previously stated, we revised our programs and approach as the work proceeded, to improve the performance of the algorithms in the classification of the unusual IR targets. The motivations and the major changes, results and evaluations are reviewed within this section.

Figure 9: All six colour-magnitude diagrams show the DS candidates (red) identified by the algorithm (linear and quadratic SVM), plotted with the training set of 2000 stars as reference. The leftmost figures correspond to the training process where the stars that the DS models were based on were limited to G: 4-6m. The middle graphs represent the less limited case G > 4m and the rightmost, no magnitude restraints. The two rows represent the colour-magnitude diagrams with G magnitudes on the vertical axis and G − J & G − W3 on the horizontal axis, respectively.

6.1 limiting DS magnitudes

One of the first adjustments applied to the process was to limit the magnitudes of the DS candidates. Originally the DSs were based on a random sample of stars, as previously mentioned. However, using stars from the whole MS, with a broad range of brightness, can cause problems when they are used to model a DS. If the star is much brighter than the Sun, other stars, such as evolved stars, which have common properties with DSs, might get classified as DSs by the algorithm. Furthermore, it will be harder for the algorithm to identify outliers when no magnitude restrictions are imposed. We choose the span 4 < G < 6m to replicate DS models based on stars of similar magnitude as the Sun.

In the program that creates the training set of Dyson sphere models, we establish a constraint on the magnitudes upon which the DS models will be based. From the smaller file of stars (from within 100 pc) we extract the ones that match the magnitude constraints. Thereafter, the DS models are generated as previously described from these stars, with varying temperature and covering fraction. The first approach was to limit the magnitudes to G: 4-6m. Most of the runs were based on this restraint, and all candidates in table 3 are found by that. Nevertheless, we also explored the possibility of DS models being based on all stars with G > 4m, as seen in the middle graph of figure 9.

As expected, the number of DS candidates identified by the algorithms is small when the magnitudes of the base star are restrained. This is seen in figure 9, which shows three colour-magnitude diagrams of the DS candidates (red) and the training set of stars (black). The leftmost graph represents the case where the base star's G magnitude is restrained within 4-6m, the centre is for G > 4m and the rightmost is without magnitude restrictions. Clearly, an algorithm which is trained on a limited magnitude range is more successful in identifying outliers than one without. By comparing the two cases of limiting magnitudes, we see that the less restrained case identifies more DS candidates that seemingly are outliers and diverge from the track of stars seen in the training set. However, the majority of the DS candidates with G > 4m that differ from the ones found in 4-6m appear to be dust-enshrouded stars when further investigated. Therefore, we have kept the primarily imposed limits of 4-6m in further runs.
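The magnitude cut itself is a one-line selection; a sketch assuming the 100 pc sample is held in a table with an apparent G-magnitude column (the names are stand-ins):

    % Keep only stars with 4 < G < 6 mag as base stars for the DS models.
    baseStars = stars100pc(stars100pc.G > 4 & stars100pc.G < 6, :);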

Figure 10: The figures show colour-magnitude diagrams of a training set of 2000 stars (red), 528 DSs (blue) and 100 "YSOs"/dust-enshrouded stars (yellow). The left graph displays the G − W3 colours and the right, the G − J colours.

6.2 Introducing a third class

With the former implementation, the full program seems to recognise objects with significant IR excess. However, many of the targets which the algorithm identified as DS candidates are located in large nebulae and often acknowledged as YSOs. The dust around YSOs, and the protostar itself, mimics the red excess that is expected for DSs. Therefore, the second adjustment to the programs was to introduce a third class consisting of dust-enshrouded stars, i.e. stars within or behind large nebulae. The classification therefore goes from a binary classification case to a multi-class classification. In doing so, we hope to reduce the number of targets within and behind nebulae that otherwise would be categorised as DS candidates.

In section 5.1.3, we described the general approach of how the training set of the third group was formed. This set of targets, from Marton et al. (2019) and our identified dust-enshrouded stars, does not change for each run like the training sets of the stars and DSs. The colour-magnitude diagrams in figure 10 show how the YSO training set is distributed, with G − W3 on the left and G − J to the right. Evidently, the training YSOs are located far from the MS when plotted in G − W3 (see the left plot of figure 10), but coincide with it for most of the entries when plotted in G − J (see the right plot of figure 10). Unfortunately, the YSOs are situated at approximately the same magnitudes and colours as the DS training set even when plotted in G − W3. However, the training YSOs are in general more spread out, while the training DSs are restrained and centred at slightly lower magnitudes. The two colour-magnitude diagrams thus demonstrate that YSOs (and DS models) resemble common stars when plotted in the NIR (G − J) while clearly appearing different in the IR (G − W3), as expected.

The implementation was, to some degree, effective. The algorithm clearly recognises that there is a third group of stars that resembles the DS candidates and is much more abundant. Thus far, most stars from the big "searching-set" are categorised as common stars, as expected. Thereafter, the biggest group is the YSOs which, by far, surpasses the size of the DS group. However, it appears challenging for the algorithm to distinguish the real YSOs or dust-enshrouded stars from good DS candidates. In many cases, objects that have previously been identified as DS candidates, but confirmed manually to not be YSOs, end up in the YSO candidate output. In other words, we have a number of false negatives. Nevertheless, most entries that do get sorted as DS candidates are not situated in or behind an evident nebula, i.e. a lower number of false positives. Occasionally the output DS candidates do not show any sign of nebulosity when viewed via Aladin Lite; however, when studied in more detail (images in separate filters) through IRSA, we commonly encounter "noise" in the two last filters of WISE, W3 and W4. This is likely an indication of the presence of a dusty nebula, where the interstellar dust causes the infrared excess in the SEDs and colours of the stars. Such stars could, therefore, be difficult for the algorithm to differentiate from DS candidates, and it may not recognise them as members of the YSO class.

In figure 11 we show two examples of entries from the YSO training set and the best fitted DS model. Evidently, these two classes of targets are difficult for the algorithm to distinguish due to their matching SEDs. The outcome of this implementation is, therefore, not a fully desired result, though a step in the right direction. This issue will be further analysed in section 8.2.

Figure 11: The figures show two examples of the DS model fit to the training set of the third group of "YSOs". The black line and dots are the stars from the YSO training set and the red dots and curves represent the DS models. The respective DS model temperature, covering fraction and RMSD are also noted in each plot.

6.3 Coordinate dependence

Despite the previous modification to remove dust-enshrouded stars from the DS candidates, some remained in the output. Therefore, we incorporated the coordinates into the training process with the hope of eliminating stars within and behind large known nebulae from the DS candidates. The larger known nebulae are located near the galactic plane, which in the international celestial reference system (ICRS) appears as a u-shaped band in an equatorial projection. However, the implementation was not expected to help remove objects within small nebulae or limited dusty regions.

For the large dataset where the algorithm searches for DS candidates, as well as for the training sets of YSOs and stars, the coordinates were simply included while loading or formatting the file. The only alteration that required new derivations was for the DS training set. As previously mentioned, the coordinates of the DSs within the training set were derived from the stars on which the DS models were based. The coordinates were calculated according to eq. 27, where R is a uniformly distributed random number between ±0.2 degrees.

$ra_{DS} = ra_{\star} \pm R, \quad dec_{DS} = dec_{\star} \pm R, \quad R \in [-0.2°, 0.2°]$   (27)

Firstly, we must stress that few of the DS candidates that were proposed by the algorithms were positioned within grand nebulae before this implementation. However, we did see a general improvement after including ra & dec for all three groups in the training process: we encountered fewer such nebulous objects within the output of DS candidates. Whether or not all these objects were classified into the YSO group is difficult to say, since the YSO output class had tens of thousands of entries and is therefore not reasonable to manually cross-check, and the source IDs are not directly connected to nebulae. Nevertheless, dust-enshrouded stars were indeed encountered within the YSO output group when random checks were done, and much fewer such objects within the DS candidate output. Thus we can assume that the implementation is prosperous to some degree, although noise in the WISE filters W3 & W4 was still seen for many of the DS candidates.
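A minimal sketch of the jitter in eq. 27 (raStar and decStar are stand-in vectors of base-star coordinates in degrees):

    % DS training coordinates: base-star ra/dec plus uniform jitter
    % in [-0.2, 0.2] degrees (eq. 27).
    raDS  = raStar  + (rand(size(raStar))  - 0.5) * 0.4;
    decDS = decStar + (rand(size(decStar)) - 0.5) * 0.4;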

Figure 12: The colour-magnitude diagrams show entries classified as DS candidates (red) by the linear and quadratic SVM model. Each diagram also includes the training sets of common stars (black) and YSOs (yellow) for reference. The leftmost diagram represents the results of a training that included all available features. The middle diagram shows the results of a training without G and Gerr. The rightmost graph corresponds to the results achieved from a training that excluded the errors of G and the colours.

6.4 Malmquist bias

A further consideration we addressed was the so-called Malmquist bias. The Malmquist bias refers to an effect in observational astronomy which leads to a biased sample of preferential detections of intrinsically bright objects. For a brightness-limited survey, stars below a certain magnitude can not be included in the sample, and since stars appear dimmer when they are further away, faint stars at a greater distance can not be included. Therefore, more stars within a broader magnitude range are available at a shorter distance, while the sample contains more intrinsically bright stars at large distances. This creates a bias for our DS sample, since we require our DS models to possess an IR excess and the stars at large distances are likely naturally brighter in the IR. Thus, by limiting the set of stars within which the algorithms search for DS candidates, we reduce the chance of finding "false" Dyson spheres, i.e. high infrared objects. The only change to the program was to download a new dataset from Gaia that included stars in the Milky Way up to a smaller distance. Formerly we searched through a stellar sample up to 1 kpc (with 1680144 entries); with the new restraint, this limit was set to 500 pc, with 765266 entries.

It is hard to evaluate the performance of this implementation, since we have not studied all stars in the output in detail. We expect that stars at large distances have dominating emission in the IR. This could lead to more entries in the sample of DS candidates originating from larger distances, since their IR output automatically is greater than their optical. We hope that this implementation minimises the number of false candidates and gives a less biased output.

6.5 cc flag

A further attempt was made to distinguish Dyson sphere candidates from stars obscured by dust, like nebulous objects or targets with the aforementioned detected noise in the WISE filters W3 & W4. This was done by introducing a second parameter that the algorithm took into consideration during training. The AllWISE catalogue provides a parameter called the contamination and confusion flag (cc-flag), which indicates whether the source might be contaminated or affected due to the proximity of an image artifact (Supplement 10 May 2018). The cc-flag (ccf) has one character per band. The various characters are diffraction spike (D,d), persistence (P,p), halo (H,h), optical ghost (O,o) or none (0), where lowercase letters are instances in which the source is believed to be real but might be contaminated, whereas uppercase letters signify likely contamination of the source. It was noted that many targets that fall in the category of dust-obscured stars (both clear nebulous stars and "dust-noise") indicated contamination, thus uppercase letters in the cc-flag. Therefore, we explored the utility of the cc-flag in the separation of the "YSO" and DS classes. Since the false DS candidates showed, surprisingly, no preferred cc-flags but rather all various characters, we introduced the cc-flag in the data as a binary index. For each filter we assigned "0" for lowercase or no character, and "1" for an uppercase character. This was introduced in all datasets, such as the large searching set (objects up to 500 pc) and the training sets of stars and YSOs, which all had their default input transformed into the binary system. Finally, the DS models in the training sets were, trivially, set to "0" for all filters.

The enforcement of the cc-flag gave minimal to no difference in the outcome of the DS candidates. The targets selected by the algorithm as DS candidates still showed signs of contamination, both through visual inspections and by reading off their cc-flag. However, it does indeed reduce the number of otherwise contaminated sources, such as diffraction spikes or ghosts. Thus, the implementation helps to single out valid candidates with unusual IR properties that can not be explained by other natural phenomena.
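A minimal sketch of this binary encoding (the four-character string is a made-up example of an AllWISE cc_flags value):

    % Binary cc-flag encoding: one character per band (W1-W4);
    % uppercase -> 1 (likely contamination), lowercase or '0' -> 0.
    ccf = 'DPh0';                                 % made-up example
    ccBinary = double(isstrprop(ccf, 'upper'));   % gives [1 1 0 0]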

6.6 Feature selection

As noted earlier, the classification learner app in MATLAB enables feature selection when loading the training data. For the majority of the runs, and what is favoured, we adopt all features that are included in the training sets. These are the 10 different magnitudes or colours with corresponding errors, the coordinates, the cc-flag and the object type. In the following, we will review two feature selections and their implications for the algorithms.

The first feature selection is to include or exclude the G magnitude. Every time the training sets are loaded into the CL app, we can choose to import magnitudes or colours. Most of the time we use colours, since the algorithms seem to profit from that configuration by being more effective and accurate in their predictions. However, even when the colours are imported, we also include the G magnitude and its error (Gerr) on their own. This implies that the models are taking into account both the colour of the object and its brightness in the G band. In theory, the DS candidates that are found by the algorithms should therefore be constrained in magnitude as well as in colour. In practice, this is not always the case. Some algorithms seem to classify entries as DS even when they are located at much higher brightness than the DS models suggest. Therefore, we examine the effects of including or excluding the G magnitude and all magnitude errors in the training process. In figure 12 we see three examples of DS candidates (red) identified by an algorithm, plotted against the training set of stars (black) and the YSO training set (yellow). The leftmost graph shows an example where all features were included in the training of the algorithm, and the middle one represents an algorithm trained without G and Gerr. We see that in the second case, fewer entries have been classified as DS candidates. It appears that the algorithm is more selective, but the spread along the y-axis, i.e. in G magnitude, is the same for both. After further study of the entries in the second case which were categorised as DS candidates and not identified in the first case with G, it seems that these targets are exclusively dust-enshrouded stars. This was true for several choices of algorithms. The entries that were identified when taking G into account, and not found when excluding it, were not exclusively dust-enshrouded stars; in some of those cases, the targets classified as DS were non-nebulous objects as well. Therefore, we judge that it is beneficial to include G in the feature selection when training the models.

Secondly, we explored the effects of including the errors in the training. In contrast to how excluding the G magnitude affected the results, excluding the errors leads to more entries classified as DS candidates. The latter results are seen in the rightmost graph of figure 12. The DS candidates seem to be clustered at "red" colours (G − W3 > 0) and low G magnitudes, as expected of DSs. However, when further looking into the additional entries that are not seen in the leftmost graph of figure 12, they are once again almost exclusively objects within prominent nebulae. Thus we consider the errors as important features to include in the training, to reduce the number of dust-enshrouded stars.

Figure 13: The colour-magnitude diagram shows entries classified as DS candidates (red) by the linear SVM model that was trained on equally sized training sets, each with 100 entries. For reference, the figure also includes larger training sets of stars (black) and YSOs (yellow) than the ones used within the training.

6.7 Proportions of the training sets

Another aspect of the training of the algorithms that we have much control over is the ratio, or proportions, of the training sets. This means that for each constructed training set, we specify the number of entries in each set. For most runs, we adopted a ratio that likely mirrors the true ratio of objects, or at least the ratio seen in observations, namely that the training set of stars was greater by ~1 order of magnitude compared to the other two. Typically, the training sets contained 1000-2000 "normal" stars, 50-100 YSOs and 50-400 DSs, and the finally adopted sets that are used in the algorithms found in appendix A.4 have the ratio according to table 1. We anticipated that such a distribution would help the algorithm make more accurate classifications. This seems to be the general case. The colour-magnitude diagram of DS candidates (red) plotted with the training sets of stars (black) and YSOs (yellow), seen in figure 13, shows an example of a linear SVM model which is trained on training sets with equally many entries. In other words, the training sets of common stars, YSOs and DSs all consist of 100 entries. Clearly the algorithm fails to single out targets with unusual properties, by also classifying entries with high brightness and common colours as DS candidates, see figure 13. Therefore, we conclude that the ratio of entries in the training sets should not be equal, but rather reflect what is expected from performed observations.
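A minimal sketch of assembling the final proportions of table 1 (the three tables are stand-ins for the generated samples, each assumed to carry a class column):

    % Concatenate 2000 stars, 480 DS models and 100 YSOs into one
    % labelled training set (roughly a 20:5:1 class ratio).
    train = [starsTbl(1:2000, :); dsTbl(1:480, :); ysoTbl(1:100, :)];
    summary(categorical(train.class))    % verify the class proportions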

23 2MASS designation as well as the entry number in group of DS, their assessment of the class YSO to the table3 are given in the title and their Gaia ID target, does not conflict with our classification of the can be found in the table3 in appendix A.1. It is target as a DS. The additional Planck data, which gives worth stressing that targets recognised as YSOs by the dust opacity (τ) along the line of sight, supposedly different catalogues outside Gaia+2MASS+WISE are helps restrain the areas where YSOs are likely located not always confirmed. Most of these stars are referred by incorporating the interstellar reddening. Marton to as YSOs in regards to their SED or colour, and as et al.(2016) found that 99% of the known YSOs are stated several times, DS are believed to share these located in regions with dust opacity values greater than properties. Such studies which classify targets as e.g. 1.3ˆ10´5. In this work, we are attempting to avoid YSOs based on colour are thus no different from our these regions to minimise the number of misclassified study and does not contribute to resolving the nature true YSOs and dust-enshrouded stars. Unless τ is of the target. Consequently, an object labelled YSO exceedingly high, a high classification probability of by an external source, may in fact fit the DS candidate the target as a YSO does, therefore, not invalidate our profile. classification. For each examined target in the following section, we show their respective SED and the best fitted DS model that was generated via the best fitted model 7.1.2 Marton et al.(2019) program described in section 5.4. In each figure, we have included the SED of the target in black, each A more recent and related study to the above applied DS model in light blue, the best fitted and mentioned, is the Identification of Young Stellar Object scaled model in dark red and the SED of the star on candidates in the Gaia DR2 x AllWISE catalogue with which the models were based on. The various models machine learning methods by Marton et al.(2019). are based on the effective temperature and covering Analogously to the former work (Marton et al. 2016) are YSO candidates identified using ML on Gaia fraction which can take the values: 200 ă T ă 600 K eff and AllWISE data. The difference is that in the and 0.1 ă fcov ă 0.95. The uncertainties of each magnitude measurement are approximately the size former work, the training samples were based on the of the data points (circles) but not included in the SIMBAD database, in the more recent study they also figures. Once again we stress that the best fitted included „ 80 catalogues from literature to identify model can vary between runs since the models are various objects for the training. The entries were based on a randomly selected star (see section 5.1.2). sorted into four classes: Main sequence stars (MS), Furthermore, a target with the best fitted covering extragalactic objects (EG), evolved stars (E) and fraction near 0.1 is not a credible DS candidate, but YSOs (Y). Two sets of probabilities for each of the we include a greater range of these models than what four classes are given for all entries. These sets of the training set is based on for the purpose of a greater probabilities are derived using different data. In the analysis. first set, the probabilities are calculated using all WISE bands, these are labelled with the prefix ”L”, e.g. the probability of the target being a YSO is denoted LY. 
7.1 Frequent sources of reference

Before going into detail about each promising target, we want to introduce some reappearing sources of information that will be cited for most of the entries discussed below.

7.1.1 Marton et al. (2016)

Firstly, we have the work by Marton et al. (2016), An all-sky support vector machine selection of WISE YSO candidates. In this work, they apply a machine learning technique to the flux measurements of WISE, 2MASS and Gaia, together with the extinction map from Planck (Planck Collaboration 2014). This means that their study has a very similar approach to the categorisation of targets as this work. The main difference is, trivially, that our study focuses on a third group, the DS candidates, and not on the YSOs. DS models have very similar characteristics to YSOs, as stated many times before: both objects show a dimming in the optical and an excess in the IR. Since Marton et al. (2016) use the same sources of brightness in various wavelength ranges as this work, and do not consider a third group of DSs, their assignment of the class YSO to a target does not conflict with our classification of the target as a DS. The additional Planck data, which give the dust opacity (τ) along the line of sight, supposedly help restrain the areas where YSOs are likely located by incorporating the interstellar reddening. Marton et al. (2016) found that 99% of the known YSOs are located in regions with dust opacity values greater than 1.3×10^-5. In this work, we attempt to avoid these regions to minimise the number of misclassified true YSOs and dust-enshrouded stars. Unless τ is exceedingly high, a high classification probability of a target as a YSO does, therefore, not invalidate our classification.

7.1.2 Marton et al. (2019)

A more recent study related to the above-mentioned one is the Identification of Young Stellar Object candidates in the Gaia DR2 x AllWISE catalogue with machine learning methods by Marton et al. (2019). Analogously to the former work (Marton et al. 2016), YSO candidates are identified using ML on Gaia and AllWISE data. The difference is that in the former work the training samples were based on the SIMBAD database, while in the more recent study they also included ~80 catalogues from the literature to identify various objects for the training. The entries were sorted into four classes: main sequence stars (MS), extragalactic objects (EG), evolved stars (E) and YSOs (Y). Two sets of probabilities for each of the four classes are given for all entries. These sets of probabilities are derived using different data. In the first set, the probabilities are calculated using all WISE bands; these are labelled with the prefix "L", e.g. the probability of the target being a YSO is denoted LY. The other set of probabilities uses only the shorter wavelength WISE bands, and these are denoted with the prefix "S". As for the earlier study, the categorisation, or the probabilities in this case, is found in a way comparable to our study. Therefore, a high YSO probability does not necessarily conflict with our proposed DS candidate.

7.1.3 Stassun et al. (2018)

Another reoccurring source is the TESS Input Catalogue (TIC) and Candidate Target List (Stassun et al. 2018). A parameter in this catalogue that consistently showed the same value for all the investigated DS candidates was the luminosity class: each of the DS candidates discussed below was assigned the luminosity class dwarf. The classification is based on a relation referred to as the dwarf relation in Stassun et al. (2018), which describes the relation between the colours V − K and J − H, derived from Gaia and 2MASS. The only two available flags for this parameter are DWARF and GIANT, and a DWARF flag could also mean that the star is a subdwarf. Since the colours V − K and J − H are analogous to the effective temperature, the luminosity class can thereafter be determined from the star's position in the HR diagram. According to Stassun et al. (2018), 4% of the dwarfs were misidentified when the classification technique was crosschecked. Note that an object is assigned a class solely based on its colour and, as previously stated, DS candidates appear similar in colour and magnitude to giant stars. The luminosity class does, therefore, not rule out the possibility of the object being a DS candidate. Furthermore, the mass and radius are also given in the catalogue. These parameters were derived using numerous techniques. The radii were estimated in the following order of preference: (1) radii given by the specially curated Cool Dwarf or Hot Subdwarf lists, (2) radii from the Gaia parallax and bolometric corrections, (3) radii from the spectroscopic relations of Torres et al. (2010), and finally (4) radii from a unified relation based on measured radii of eclipsing binaries and simulations using Galactic structure models (Stassun et al. 2018). The masses are derived in ways similar to steps (1) and (4) for the radii, namely from the specially curated cool dwarf or hot subdwarf lists, or based on measured masses of eclipsing binaries and simulations of Galactic structure models (Stassun et al. 2018).

Figure 14: The colour-magnitude diagrams show the top 8 candidates (red) discussed in section 7.2 with their respective numbers in table 3. The black dots represent common stars from a sample of objects within a distance of 100 pc. Each figure shows the Gaia G magnitude on the vertical axis and the colours BP − RP, G − J and G − W3 from left to right.

7.1.4 Stassun et al. (2019)

The revised TESS input catalogue and target list (Stassun et al. 2019) came out a year after the formerly mentioned Stassun et al. (2018) and included data from the second Gaia release (DR2). Equivalently to its predecessor, Stassun et al. (2019) also includes a luminosity class which displays either DWARF or GIANT, where subdwarfs are likewise included in the former class. The luminosity class in the revised catalogue is established from a relationship based on Gaia data. The colour-magnitude diagram in figure 10 of Stassun et al. (2019) shows the two distinguished groups, where the green line (M_G = 5.15(G_BP − G_RP) + 4.12) separates the two. The radii are commonly derived using the Stefan-Boltzmann relation:

log10(R/R⊙) = (1/5) [ 4.74 − 5 + 5 log10(D) − G − 10 log10(Teff/5772) − BC_G ]        (28)

The distance D in eq. 28 is the inverse Gaia parallax, D = 1/p, G = G_obs − A_G is the Gaia magnitude corrected for extinction, and BC_G is the bolometric correction in the Gaia passband as a function of Teff. For further details on the adopted BC_G values, see section 2.3.5 in Stassun et al. (2019). The effective temperatures used in eq. 28 are derived from spectroscopy if the error is < 300 K, and otherwise from dereddened colours (the relation seen in figure 8 of Stassun et al. (2019)).
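As a quick sanity check on eq. 28, the snippet below evaluates it in MATLAB for an illustrative Sun-like test case; all input values are assumptions for the example, not catalogue data.

% Evaluating eq. 28 for an assumed Sun-like star: with Teff = 5772 K,
% an extinction-corrected G of 4.83m at D = 10 pc and BC_G ~ 0, the
% relation should return roughly one solar radius.
p    = 100;          % parallax in mas (assumed)
D    = 1000/p;       % distance in pc, D = 1/p with p in arcsec
G    = 4.83;         % extinction-corrected G magnitude (assumed)
Teff = 5772;         % effective temperature in K
BCG  = 0;            % bolometric correction in the G band (assumed)
logR = (4.74 - 5 + 5*log10(D) - G - 10*log10(Teff/5772) - BCG)/5;
R    = 10^logR;      % ~0.96 solar radii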
7.2 A selection of intriguing targets

The top 8 targets, which show no evident indication of dust in the vicinity, variability, Hα or maser emission, are further investigated in the sections below. The 8 targets are plotted in each colour-magnitude diagram of figure 14 as red dots with their corresponding numbers in table 3. A sample of "normal" stars is also plotted for reference (black dots) in each figure. The various panels show different colours (BP − RP, G − J and G − W3) to highlight the properties of each target. The DS models we have used in the training process are based on solar-type magnitudes; thus we would expect to see the candidates around that range (4 < G < 6m) and fainter. We also expect the candidates to have a significant IR-excess, i.e. to lie rightwards of the MS in the third graph of figure 14. In the optical and NIR colours (left and middle graphs of figure 14) we expect the candidates to lie close to or on the MS.

Evidently, the candidates follow the colour restraints anticipated for DSs, although the brightness of the majority is higher (i.e. G < 5) than what the models on which the algorithms were trained would propose.
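Diagrams such as those in figure 14 are straightforward to reproduce once the photometry has been cross-matched; the MATLAB sketch below shows one panel, with refG, refW3, candG and candW3 as hypothetical arrays of Gaia G and AllWISE W3 magnitudes.

% Plotting a G versus G-W3 colour-magnitude diagram of reference
% stars (black) and candidates (red), as in the rightmost panel of
% figure 14. The variable names are illustrative.
scatter(refG - refW3, refG, 4, 'k', 'filled'); hold on
scatter(candG - candW3, candG, 40, 'r', 'filled')
set(gca, 'YDir', 'reverse')    % brighter objects plotted upwards
xlabel('G - W3'); ylabel('G [mag]')
legend('reference stars', 'DS candidates')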

7.2.1 J18212449-2536350 (Nr.1)

One of the first DS candidates identified by the algorithm was the star Nr.1 in table 3. Visual inspections in Aladin Lite and IRSA, and the overall view in VizieR, showed no evident signs associated with nebulous objects, those signs being apparent nebulae, Hα emission, maser emission or variability. This very faint star (m_G ≈ 17m) is located in a stellar-rich region in the southern hemisphere, approximately 340 pc away from us. The TESS input catalogue (Stassun et al. 2019) has sorted the star into the luminosity class dwarf. As stated in the previous section, this classification is based solely on the colour of the star, and such low-luminosity stars share similarities with DS candidates when it comes to colour and magnitude. This classification does, therefore, not rule out the possibility of the star being a DS candidate. The stellar radius and mass have not been derived by Stassun et al. (2019) for the target, thus we cannot establish whether the object is a dwarf or a subgiant according to the catalogue. Unfortunately, no source provides information on the mass and radius for this entry.

Figure 15: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.1 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.95, Teff = 600 K and RMSD = 1.3553.

This target is also found in the list of SVM-selected YSO candidates by Marton et al. (2016). Planck's dust opacity value for the target, τ = 0.00004, exceeds the limiting value found for YSO-rich regions. A considerable dust opacity could be the source of a high IR excess, even though none was seen in our visual analysis. Recall that, as in this work, Marton et al. (2016) use machine learning techniques based on colours to classify entries with unusual properties. Therefore, we stress again that the conclusion of Marton et al. (2016) does not conflict with our results. The follow-up work by Marton et al. (2019) presented low probabilities for the target being an evolved or main sequence star (LE=0.17160, LMS=0.00120). Higher probabilities for an extragalactic source and a YSO were found (LEG=0.39040, LY=0.43680). However, these probabilities are based on WISE and Gaia data (magnitudes), and clearly the entry is not an extragalactic source according to the distance measured by Gaia, as previously stated. Note also that the YSO probability is rather low, < 50%. The probabilities using the shorter wavelength WISE bands instead show a higher probability of the star being an evolved star (SE=0.5098) and smaller probabilities for the three remaining classes (SY=0.29680, SEG=0.19220 and SMS=0.00120). Again, the leading probability is not convincingly high, and no other sources can confirm this statement due to the missing information on the stellar radius, mass and luminosity.

The target is also identified within the Database of circumstellar OH masers; however, the flag for detected maser emission indicates non-detections. Moreover, related articles found via SIMBAD suggest that the target is a post-Asymptotic Giant Branch (post-AGB) star. Some of these sources review maser emission from post-AGB stars, including this object; however, they state non-detections of the SiO and H2O emission that is characteristic for masers (Gómez et al. 2015, Yoon et al. 2014). Both dust-enclosed post-AGB stars and DS candidates have double peaks in their SEDs, where post-AGB stars have one peak in the optical to NIR, which originates from the star, and a second far-IR (FIR) peak due to the emission from the cold dust shell. The star is categorised as an obscured post-AGB in a two-part work by Ramos-Larios et al. (2009) & Ramos-Larios et al. (2012). Ramos-Larios et al. (2009) used IRAS and 2MASS measurements to determine the star's true nature. The SED of this star, among other suggested post-AGB stars in Ramos-Larios et al. (2009), shows an increment in the near to mid-IR. The classification of this star as a post-AGB star is based on lower emission in the optical and more emission in the IR. Therefore, the analysis in Ramos-Larios et al. (2009) is again something that does not contradict the suggestion of this target being a DS candidate, since the categorisation of the target is done on the same premise as in our study.

The SED of this DS candidate diverges from the conventional view of a Dyson sphere, since it does not have two distinguished peaks, one near the optical and a second in the IR. The black curve in figure 15 shows the SED of this DS candidate, which appears to have a continuous growth from the optical to the IR in a single common peak. The shape diverges slightly in the optical from the various DS models (light blue curves) and the best fitted model (dark red), also seen in figure 15. The best fitted DS model has RMSD = 1.3553, a temperature of 600 K and a high covering fraction of 0.95. The somewhat higher RMSD is mostly due to the shorter wavelength region, where the target's SED reaches much lower magnitudes than what is predicted for a DS. The model also deviates at longer wavelengths, where the target shows a greater excess.

Following the above discussion, we view this star as an intriguing target that fulfils many requirements of a DS candidate. The target has a rather low apparent G magnitude (G=17.692m), which could imply difficulties in further observations; however, observations are plausible with the right telescope. We note that the SED does deviate slightly from a classical DS model due to the low optical radiation. Properties such as mass and radius are not provided by any published source, which makes it more difficult to establish the true nature of the object. Even though the DS model fit is unsatisfactory, we view this entry as one of the most intriguing results due to the scarce published information and the exceedingly high IR-excess.

7.2.2 J04243606+1310150 (Nr.8)

The DS candidate Nr.8 in table 3 is a brighter star (G=11.180m) (Gaia Collaboration 2016, 2018) located in the northern hemisphere (ra=66.150° and dec=13.171°). The target has few nearby neighbours on the celestial sphere and is one of the brightest in the region, both in the optical and mid-IR. Furthermore, the star appears red in the AllWISE filters in Aladin Lite, suggesting a significant IR excess, which is indeed evident in figure 16 (black curve).

Figure 16: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.8 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.20, Teff = 400 K and RMSD = 0.30702.

The target is also included in the work of Marton et al. (2019), where the ML algorithms categorised the target as a YSO with probabilities LY=0.68920 and SY=0.65720. The probabilities are somewhat higher than for the former target. The dust opacity from the earlier work by Marton et al. (2016), τ = 0.0000293, is higher than the limiting value found for typical YSO regions. Trivially, the target is included in the YSO candidate list by Marton et al. (2016), where it is also stated that the SIMBAD type is a T Tauri star. The object type in SIMBAD originates from literature statements and can show multiple indices.

The TESS Input Catalogue indicates that the luminosity class of the target is DWARF (Stassun et al. 2019). The radius is estimated to R=1.706 R⊙ via eq. 28, thus suggesting that the target is identified as a subgiant in Stassun et al. (2019). The stellar mass is estimated to a near-solar value, M = 1.066 M⊙. Furthermore, the temperature is estimated to Teff = 5452.67 K by Gaia (Apsis-Priam) and the radius to 1.80 R⊙. The stellar luminosity is estimated to L = 2.59 L⊙ by Gaia. The near-solar temperature and a luminosity more than twice that of the Sun suggest that the star is located somewhere between the MS and the subgiant branch (see figure 4 for reference). A main sequence star with a higher stellar mass than the Sun must have a higher temperature than the Sun by definition. However, even if this target indicates a slightly higher mass than the Sun and a lower Teff, we will not exclude the possibility that the star is a MS star. That is because the masses are derived from the Gaia temperatures, which are known to have rather large uncertainties: ~300 K from Gaia Collaboration (2018) and 122 K from Stassun et al. (2019). Moreover, the masses given in Stassun et al. (2019) should be regarded with caution according to the authors.

A further suggested source in VizieR is the catalogue of proper motions of open clusters from UCAC4 (Dias et al. 2014). The target is considered a member of the Melotte 25 open cluster, also known as the Hyades. The membership probability is based on the stellar proper motion and estimated to 97%. However, this cluster resides at ~50 pc, which is considerably closer to us than what other catalogues suggest for the star; most catalogues present a distance of ≈300 pc. Since the majority of the estimated distances agree on much higher values than 50 pc, we regard this source as less trustworthy.

The target is also mentioned in previous work by Gregorio-Hetem et al. (1992), which searched for T Tauri stars based on the IRAS point source catalogue. This is likely the reference behind SIMBAD's object type classification. They proposed that the target is a shell star, but also mentioned it as a possible pre-main-sequence star. A shell star has extremely broad and narrow absorption lines in its spectrum, and typically emission lines from e.g. the Balmer series. They are common variables, where the irregular variability is caused by changes of the "shell". The object is not noted as a variable by any source, nor is any significant absorption line recorded. However, a part of the star's spectrum is published within the work by Gregorio-Hetem et al. (1992) and indicates the presence of the Hα absorption line (see fig. 9 in Gregorio-Hetem et al. (1992)), although the star is listed as one of the few stars from the sample that showed no Hα emission.

In figure 16 we see that the SED of the target (black) shows the characteristics of a DS, with both an optical peak and an IR excess. The best fitted model (dark red) has RMSD = 0.30702 and the properties fcov = 0.20 and Teff = 400 K. The IR excess of the DS candidate is not excessive, hence the low covering fraction of the best fitted model. However, the overall fit between the SEDs of the DS model and the target is rather successful. Even if multiple sources indicate that the entry is a young star, we know that the colour properties of such objects are similar to DS candidates. Moreover, the deviating distance estimations and the conflicting results of Dias et al. (2014) are interesting. Therefore, with a rather promising SED, with a good shape but rather low fcov, we encourage further observations and analysis of this target.

7.2.3 J18242978-2946492 (Nr.10)

Yet another star that shows high brightness in all three spectral regions considered in this work (DSS2, 2MASS and AllWISE) is target Nr.10. The star has a Gaia G magnitude of G = 10.743m and appears bright red in AllWISE when viewed through Aladin Lite, again implying a high IR-excess. The star lies in the northern hemisphere (ra=44.047° & dec=44.048°) and has two nearby stars within a 1 arcmin field of view. The closest of those stars lies on the upper right side of the target's field of view and, opposite, on the upper left lies a faint galaxy. Both of these objects, star and galaxy, lie within 20 arcsec of the target of interest. The galaxy is faint in the optical and NIR and hidden by the target's red rim in the mid-IR. The same holds for the closest star, which is dimmed and obscured by the red rim of the target when viewed in the IR, but is bright in the optical and NIR.

Figure 17: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.10 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.25, Teff = 300 K and RMSD = 0.48136.

The AllWISE catalogue does not indicate any signs of variability (var = 1111), nor is the target viewed as variable in Fujii et al. (2002) and Garcia-Lario et al. (1997). Both aforementioned references are based on NIR photometry from IRAS. This target, with IRAS name IRAS02528+4350, was first noted as a post-AGB due to its IRAS colours in the earlier work of Fujii et al. (2002). Within their work (Fujii et al. 2002), post-AGB candidates were selected based on their IRAS colours, and the star was further designated as a post-AGB based on additional photometric observations in various filters (BVRIJHK). Thus, the classification of the star is based on colours and fluxes in numerous bands, as in our work. However, it is stated in Fujii et al. (2002) that the object was excluded from further analysis since its evolutionary status was not clear. The article also mentions previous analyses that have considered the star an ultraluminous infrared galaxy (Gawiser & Smoot 1997, Nakanishi et al. 1997). Others have proposed a (proto)planetary nebula (Pottasch et al. 1988). According to Fujii et al. (2002), more recent data and analyses are more reliable, thus arguing that the target is a galaxy with a redshift corresponding to 33678 km s^-1 (Nakanishi et al. 1997). A more recent publication (Vickers et al. 2015) lists the object as a likely post-AGB with dust temperature T_D = 116-470 K, established through multiple black-body fits. Vickers et al. (2015) do not explicitly establish the true nature of the source, but carry out a detailed analysis of a known sample of post-AGBs from the Toruń catalogue of Szczerba et al. (2007). In particular, Vickers et al. (2015) estimate the distance to post-AGB candidates by the fitting of black bodies and SED modelling. The estimated distance to this likely post-AGB is 4.34 ± 0.98 kpc, a distance which deviates significantly from Gaia's estimate, p = 2.540 mas ≈ 394 pc, and from the TESS Input Catalogue (Stassun et al. 2018), D ≈ 383 pc.

From the same catalogue (Stassun et al. 2018), the radius of the star is estimated to R = 2.316 R⊙ and the mass to M = 1.141 M⊙. However, the later TESS Input Catalogue (Stassun et al. 2019) shows a somewhat larger distance, D = 403.272 pc, with R = 1.561 R⊙ and M = 1.850 M⊙. In the Fundamental parameters and infrared excesses of Tycho–Gaia stars by McDonald et al. (2017), they deduced a similar distance as Gaia, and likewise as Stassun et al. (2018), namely D = 382.500 pc. They also estimated the temperature to Teff = 3878 K, the radius to R = 8.615 R⊙ and the luminosity to L = 15.082 L⊙. The effective temperature and luminosity estimates derived by McDonald et al. (2017) are based on distances from Gaia DR1. Moreover, the stellar class of the target is predicted by first cross-referencing catalogues of multiwavelength photometry to form SEDs. The SEDs are thereafter compared against stellar atmosphere models to derive the effective temperatures. Together with astrometric solutions from Gaia DR1, the luminosity is acquired for each star. Based on these two parameters, the stars can thereafter be placed on the HR diagram, hence categorised. The temperature is much lower and the luminosity much higher than what Gaia has predicted (Teff = 7585 K and L = 5.75 L⊙). The paper by McDonald et al. (2017) does not establish a stellar type for the star, but the great luminosity and low effective temperature suggest that the star lies in the right part of the giants' group in the HR diagram. It is unclear why the stellar parameters vary to such a great extent between Gaia and McDonald et al. (2017).

Finally, SIMBAD's spectral classification declares that the star is of spectral type A0e, i.e. a rather blue star (see figure 4) with emission lines. Firstly, no source catalogue with temperature information acknowledges this claim, since the temperatures are all far too low. Secondly, SIMBAD's classification originates from Garcia-Lario et al. (1997), but the reference does not provide further information than the spectral type. One must assume that their spectral analysis does indicate emission of some sort, but the explicit emission line(s) is not stated. It is therefore hard to arrive at a conclusion for this star.

The best fitted DS model adopts the temperature Teff = 300 K and a rather low covering fraction, fcov = 0.25. The low fcov is the result of a small flux difference between the optical peak and the mid-IR peak (AllWISE W3, the second-rightmost data point of the black curve in figure 17). The best fitted DS model (red curve in figure 17), with RMSD = 0.48136, seems to fit the optical part well. The IR excess is somewhat lower than what the model suggests and, similarly to entry Nr.1, a lower temperature model (e.g. 100 K) would likely fit the target's SED better, since the IR peak does not decrease towards longer wavelengths. It is worth noting that the star is rather bright in the optical as well and does not fall within the 4 < G < 6m region on which we formed our DS models. Based on the luminosity and temperature values from Gaia, the star resides on the MS. If one instead follows the values derived by McDonald et al. (2017), the star belongs to the giant group. The somewhat higher G magnitude makes this target a less probable DS candidate. Nevertheless, the high brightness of the star makes follow-up observations less complicated and could thus be useful in order to establish its real character.

7.2.4 J18170389+6433549 (Nr.22)

Target Nr.22 is another bright star (G=11.453m) in the northern hemisphere (ra=274.267° & dec=64.565°), located ≈272 pc away from us (Gaia Collaboration 2016, 2018). The object does not sit in a nebulous region and has limited to no detected noise in the individual AllWISE bands when viewed in IRSA (Neugebauer et al. 1984). A galaxy resides within <1 arcmin of the target; however, the AllWISE confusion flag, ccf=0000, suggests no interference.

Figure 18: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.22 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.10, Teff = 600 K and RMSD = 0.22862.

The target appears green when viewed in the AllWISE filters in Aladin Lite, which suggests that the star does not show a high level of IR excess. This is indeed the case, as seen in figure 18, where the black curve represents the DS candidate, which has a rather low IR peak compared to the optical one. The anticipated SED for a DS has a greater output in the IR, as seen from the dark red curve in figure 18. This curve represents the best fitted model with the smallest deviation (RMSD = 0.22862) and has Teff = 600 K and fcov = 0.10; these parameter values represent the high respective low ends of the model range. A low covering fraction indicates a higher optical output, hence a lower IR emission, and a high temperature suggests a sharper optical peak and a smaller difference between the energy output in the optical and the IR.

The variability flag in AllWISE does not suggest strong variability of the source: var = 2650, which states that the star is most likely not variable in the AllWISE filters W1, W3 & W4, and likely variable, yet susceptible to false positives, in W2 (Wright et al. 2010). Values below 5 are labelled most likely not variable, 6-7 are likely variables but susceptible to false-positive variability, and >7 have the highest probability of being true variables. Furthermore, the TASS Mark IV patches photometric catalogue (Welch & Stetson 1993) displays the Welch-Stetson variability index, a normalised measure of the degree of correlation between the V-band and I-band variations: WS = -0.07, thus suggesting low and random variations, or variations smaller than the uncertainty of the measurement.
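Since these one-digit-per-band AllWISE flags recur for several targets, a small helper along the following lines can translate them. The function name is our own, and the thresholds simply encode the scheme quoted above (with the unassigned border value 5 grouped with the non-variables).

% Interpreting an AllWISE variability flag such as '2650': one digit
% per band (W1-W4); low digits mean most likely not variable, 6-7
% likely variable but prone to false positives, and above 7 a
% probable true variable. Illustrative helper, not part of AllWISE.
function labels = parseVarFlag(varFlag)
    d      = varFlag - '0';            % e.g. '2650' -> [2 6 5 0]
    bands  = {'W1','W2','W3','W4'};
    labels = cell(1, 4);
    for k = 1:4
        if d(k) > 7
            labels{k} = [bands{k} ': likely a true variable'];
        elseif d(k) >= 6
            labels{k} = [bands{k} ': possibly variable (false positives)'];
        else
            labels{k} = [bands{k} ': most likely not variable'];
        end
    end
end

Applied to var = 2650, this labels W2 as possibly variable and the remaining bands as most likely not variable, in line with the discussion above.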

The star has the luminosity class DWARF in the TESS Input Catalog, and the radius and mass are near solar values: R = 1.061 R⊙ & M = 1.168 M⊙. The effective temperature is Teff = 6055.0 K and the stellar luminosity is L = 1.43 L⊙ according to Gaia DR2 (Gaia Collaboration 2016, 2018). With the Gaia and TESS stellar properties, the star is found on the MS and is not an evolved star. The target is also mentioned by Marton et al. (2016) as a YSO candidate from their SVM selection. The paper does not provide further information on the specific target; however, as previously discussed, this classification does not violate our classification as a DS candidate.

We encountered one paper that analysed the nature of this target. Moór et al. (2021) reported a new sample of warm extreme debris disks (EDDs) from the AllWISE catalogue. EDDs are rare systems with an unusually large amount of warm dust that is easily detected in the IR due to the thermal emission from the dust. Most EDDs show strong solid-state emission features and variability on monthly or yearly timescales at 3-5 µm (Su et al. 2019, Meng et al. 2014, 2015). Our target (TYC 4209 in Moór et al. (2021)) is suggested to have an EDD with disk temperature Tdisk = 530 K (from disk modelling) and mid-IR variations. Furthermore, the TESS satellite identified periodic variations that are likely due to rotational modulations by star-spots (Moór et al. 2021). The IR variations are in line with the aforementioned var flag from AllWISE, which indicated a likelihood of variation in the second passband, W2. The flux density in the W1 band is 5.5σ higher than the model fit, which likely indicates the presence of an additional, hotter dust component in the disk. Furthermore, the disk flux density appears to have significant variations on long-term scales. By the year 2014, the flux level had dropped considerably, followed by a dramatic brightening on a timescale of one year. The brightening resulted in a 56% and 64% increase at 3.4 µm and 4.6 µm respectively. Thereafter, the disk flux level remained nearly constant for one year and increased to just above the former peak in 2017. The corresponding light curve is presented in figure 7 of Moór et al. (2021) and shows symmetry over the aforementioned period. It is stated in the paper that this object is their most complex case of variability and that it is inconsistent with any variable dust temperature model. Their favoured explanation of the behaviour of the source is the formation of new dust via collisions. The disk properties were estimated via the fitting of a simple black-body model to the observed IR excess, comparable to parts of our work. The analysis and modelling in Moór et al. (2021) are thus not fully inconsistent with our work and proposed solution. However, according to our analysis, the covering fraction of this target is very low, fcov = 0.1. Such a low covering fraction suggests that the target is not a DS and that the IR radiation is more likely caused by a debris disk or protoplanetary disk.

7.2.5 J14492607-6515421 (Nr.26)

This star resides in the southern hemisphere at a parallax of ≈2.28 mas (~439 pc) and with an apparent G magnitude of ≈16.03m. As shown in figure 19, the shape of the DS candidate's SED (black) matches the best fitted DS model (dark red, with fcov=0.65) rather well in the WISE filters W2 and W3. The candidate seems to have a generally lower brightness except in the NIR, where the target has a higher emission than what the model predicts is common for DS candidates. We cannot argue that this target has a common DS shape of its SED, even though it proposes a high covering fraction, fcov = 0.8.

Figure 19: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.26 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.80, Teff = 400 K and RMSD = 0.72405.

No apparent nebula is seen while viewing the star in Aladin Lite, nor when studied in the individual AllWISE filters via IRSA. Furthermore, none of the sources found in the VizieR search showed any indication of an Hα emission line detection or maser radiation. This suggests that the star is not embedded in any dust disk, nor associated with YSOs. AllWISE is the only source that indicates variability of the target, in the first two IR filters W1 and W2 (var = 9911, one character per band, where an index higher than 7 denotes the highest probability of being a true variable).

It is worth noting that a second, brighter red star lies close to this DS candidate. This star (Gaia source id: 5849041636617856512) was not identified by the algorithm, but shows similar colours in the WISE filters and lies close on the sky to our DS candidate. The coordinates and parallax for our candidate are ra=222.36±0.03°, dec=-65.26±0.04° & p = 2.28±0.05 mas, and for the neighbouring star ra=222.32±0.02°, dec=-65.26±0.02° & p = 2.30±0.04 mas. Thus, their differences and uncertainties are of the same order of magnitude and their values overlap, which clearly suggests that the stars are adjacent. Furthermore, our candidate is not believed to be affected by the nearby star, having ccf=00dh, but by the definition of the cc-flag the possibility is not ruled out. The second star (not found by the algorithm) shows no sign of Hα emission or maser radiation. The only source of variability information is again WISE, where the WISE All-Sky Data Release gives var=3119 and the AllWISE Data Release var=2213. The AllWISE catalogue supposedly supersedes the WISE All-Sky Release catalogue for most uses; thus the only strong indication of variability, from the W4 filter in WISE All-Sky, is unlikely to be real, since AllWISE does not point towards any variability. Finally, the two stars are both mentioned in a study of the nearby isolated dark globule DC 314.8-5.1: our DS candidate is pointed out as a YSO, but this is not further motivated. The neighbouring star (HD 130079) is recorded as the illumination source of a small reflection nebula nearby (Whittet 2007). It is discussed in that paper that the star has no reported emission lines, which suggests that it is not a pre-MS star and that the IR excess should be related to locally heated dust.

The TESS catalogue (Stassun et al. 2018) photometrically estimates the stellar radius and mass to near solar values, R = 1.15188 R⊙ and M = 1.085 M⊙. The effective temperature is Teff = 4855.0 K according to Gaia. Unfortunately, no luminosity measurements or estimates are available; however, a star with a stellar mass greater than the Sun's must have a higher stellar temperature to lie on the MS. Therefore, the star is likely not a MS star, but plausibly a subgiant. However, the stellar mass is close to the Sun's, and since the specific method of radius and mass estimation and the errors are not given by Stassun et al. (2018), it is hard to draw a conclusion regarding this target.
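The adjacency argument above reduces to checking whether the two sources' parallaxes agree within their combined uncertainties; a minimal MATLAB check, with the catalogue values quoted in the text hard-coded, could look as follows.

% Testing whether the candidate and its neighbour have consistent
% parallaxes: the difference should be small compared to the
% quadrature sum of the uncertainties. Values as quoted above.
p1 = 2.28; e1 = 0.05;      % candidate parallax and error [mas]
p2 = 2.30; e2 = 0.04;      % neighbour parallax and error [mas]
nSigma = abs(p1 - p2)/sqrt(e1^2 + e2^2);   % ~0.3, i.e. fully consistent
d1 = 1000/p1;              % ~439 pc
d2 = 1000/p2;              % ~435 pc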

7.2.6 J06110354-4711294 (Nr.30)

The DS candidate Nr.30 is a star in the southern hemisphere (ra=92.765° & dec=-47.191°), ~186 pc away from us. The target is moderately bright in the optical compared to stars in a 10 arcmin field of view (G=11.551m) (Gaia Collaboration 2016, 2018) and the third brightest when viewed in the IR. From a comprehensive study of the target in the individual AllWISE filters via IRSA, we detect a small level of noise around the target; however, there is a clear contrast between the star and the background. Some stars around the target of interest are only evident in the first two AllWISE filters, W1 and W2. Furthermore, a diffraction spike from the brightest star in the aforementioned field of view is visible near the target in the same bands, W1 and W2, although the diffraction spike does not appear to reach the target. This observation reflects the remark of AllWISE's cc-flag, which indicates that the source is believed to be real but might be contaminated by the spike in W1 and W2 (ccf=dd00).

Figure 20: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.30 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.10, Teff = 400 K and RMSD = 0.3949.

No source claims evidence of variability. The variability flag of AllWISE is var = 0011, i.e. most likely not variable in any band. Secondly, the Variability properties of TIC sources with KELT (Oelkers et al. 2018) show Var = 0, based on light curves from KELT (Kilodegree Extremely Little Telescope). The TESS Input Catalogue (Stassun et al. 2019) assigned the luminosity class DWARF to the target and estimated its radius to R = 0.895 R⊙ and its mass to M = 0.960 M⊙. The appointed class and parameters were determined in likeness to the formerly analysed targets, whereas the older catalogue (Stassun et al. 2018) listed R = 0.935 R⊙ and M = 0.947 M⊙ based on photometry. The effective temperature is Teff = 5443 K and L = 0.63 L⊙ according to Gaia (Gaia Collaboration 2018), and a spectroscopically derived Teff = 5372 K is reported by the Fundamental parameters of Tycho-2 & TGAS stars (Stevens et al. 2017), based on spectra from LAMOST, RAVE and/or APOGEE.

The most well-adapted DS model for target Nr.30 has a low DS temperature and covering fraction, Teff = 400 K and fcov = 0.10. The low parameter values are markedly due to the low IR excess, almost level with the optical peak, and the broad first peak that spans from the optical to the mid-IR (see figure 20). The fit between the model and the target is rather good, with RMSD = 0.3949.

One single reference (Ruiz-Dern et al. 2018) regarding this target was found through the SIMBAD search. The article considers the empirical photometric calibration of the Gaia Red Clump. In essence, they provide photometric calibrations of the Red Clump (RC) stars in the solar neighbourhood, based on photometry and spectroscopy from multiple source catalogues. The Red Clump stars are low-mass, core-He-burning stars that are cooler than the instability strip, a narrow, vertical region in the HR diagram which contains numerous types of variable stars. All RC stars have about the same absolute luminosity, and thus end up in a "clump" in the HR diagram (see figure 5). This is what makes them standard candles, where the apparent brightness is directly related to their distance. Note, however, that the luminosity reported by Gaia is rather low, L = 0.63 L⊙, while red clump stars have luminosities on the order of ~100 times that of the Sun. This strongly conflicts with the hypothesis that this star is indeed a red clump star. The red giants are selected using a 3D extinction map, and their colours are used in the parameter estimations. Similar characteristics can be seen for a DS candidate, which mimics the observed properties of this target. Although the IR excess is low for an ideal DS candidate, further investigation would be worthwhile.

7.2.7 J05261975-0623574 (Nr.37)

Yet another interesting target is target Nr.37 in table 3, a bright star (G = 10.390m) close to the celestial equator in the southern hemisphere (ra=81.582° & dec=-6.399°). The target appears bright in both the optical (DSS2) and mid-IR (AllWISE) and less bright in the NIR (2MASS) when viewed in Aladin Lite. Visual inspection in separate filters for each wavelength range, via IRSA, shows no signs of background noise in the IR; hence the IR excess is not affected by e.g. nearby dust. Furthermore, a nearby star (~2 arcmin separation) does not appear to influence measurements of the star in AllWISE or DSS (optical). However, a slight trace of a diffraction spike is evident in all three 2MASS filters. The cc-flag of AllWISE indicates probable contamination from the nearby star's halo in the first two filters, W1 and W2, and none in W3 and W4 (ccf = hh00).

Figure 21: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.37 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.15, Teff = 200 K and RMSD = 0.20324.

The distinct red colour in Aladin-AllWISE indicates a high IR excess. An IR excess is indeed evident, but not exceedingly high, and is represented by the black curve in figure 21. Target Nr.37 deviates slightly from most DS models in the sense of a low emission ratio between the mid-IR and the optical. As seen in figure 21, the best fitted DS model (dark red curve), with RMSD = 0.20324 and the properties Teff = 200 K and fcov = 0.15, fits rather well in both the optical and the IR. The scaling of the model is, however, rather significant, ~2m. A large scaling of the model indicates that the candidate does not follow the ideal DS models that were used in the training of the algorithm; the large scaling factor thus makes the entry a less good candidate. Gaia measurements suggest an effective temperature of 6601 K, and spectral observations from LAMOST state Teff = 6395 K, as well as a stellar radius of 1.97 R⊙ and a mass of 1.45 M⊙ (Sichevskij 2017). The stellar luminosity is somewhat higher than what most former candidates have shown, L = 7.73 L⊙, according to Gaia DR2 (originally derived from photometric measurements of the apparent G magnitude). With the higher stellar temperature in combination with the higher luminosity, the star appears as a subgiant in the HR diagram (see figure 4). Furthermore, the TESS Input Catalogue (Stassun et al. 2019) shows luminosity class DWARF, and the star is likely a subgiant based on the radius and mass estimations: R = 2.235 R⊙ and M = 1.380 M⊙.

All available catalogues with variability remarks in VizieR state no variability of the source, e.g. AllWISE var = 0000. These catalogues mainly operate in the IR (Wright et al. 2010, Fujii et al. 2002, Oelkers et al. 2018). Most external sources seem to suggest that the star is a post-AGB star. Fujii et al. (2002) is one of the papers that lists the target as a post-AGB, with spectral type F2II, stellar temperature Tstar = 7380 K and surrounding dust temperature T = 183 K. Since both post-AGB and DS models have two peaks, one near the optical and one in the mid-IR, they are alike in their respective SEDs. The best fitted DS model (dark red) in figure 21 shows an optical-to-NIR peak and a second mid-IR peak. Intuitively, a DS can mimic the effects of the dust shell of a post-AGB star and create the same IR excess. We stress again that the classification of the stars in Fujii et al. (2002) is based on colours and fluxes in numerous bands, in likeness to our study. However, the luminosity and temperature estimates from Gaia contradict the claim that this target is a post-AGB, since the luminosity is far too low, unless there were issues with the distance determination. Even the lower-mass stars that evolve into AGB stars reach luminosities higher than 7 times the solar luminosity. Alternatively, the low luminosity is caused by a thicker dust shell that surrounds the star, although this would mean that the object is expected to show a high level of IR-excess, which it does not according to figure 21.

The SVM search for YSO candidates (Marton et al. 2016) lists this entry as a possible target, although the dust opacity, τ = 0.0000078, is lower than the value found for YSO-rich regions. The effective temperature and luminosity suggest that the star does not lie on the MS, and the low covering fraction of the best fitted model does not suggest that the target is a strong DS candidate. We would therefore not argue that this is the best entry of our sample; however, as for previous targets, the references diverge on the true nature of this object. Thus, further observations are needed to establish it.
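For reference, the magnitude scaling of a model translates into a flux ratio through the usual Pogson relation, so the ~2m scaling quoted above means the unscaled model misses the observed flux level by roughly a factor of six:

% Converting a magnitude scaling into a flux ratio: a shift of
% ~2 magnitudes corresponds to a factor 10^(0.4*2) ~ 6.3 in flux.
dm        = 2;               % magnitude scaling of the model
fluxRatio = 10^(0.4*dm);     % ~6.3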

7.2.8 J21173917+6855097 (Nr.46)

Figure 22: The scaled best fitted DS model (deep red) on the SED of the DS candidate Nr.46 (black). The beige curve shows the initial star on which the DS model was based, and the light blue curves show all tried models (non-scaled) with varying fcov and Teff. The best fitted model has fcov = 0.20, Teff = 500 K and RMSD = 0.50058.

Entry Nr.46 is a northern star (ra=319.413° & dec=68.919°) that appears bright in all three considered wavelength regions (e.g. G=9.878m). The target glows red in the AllWISE filters, alluding to a strong IR excess. This is confirmed by its SED, represented by the black curve in figure 22. However, the target seems to show a second small bump in the near- to mid-IR (at ~3 µm), which is not common for a DS profile. The best fitted model (Teff = 600 K and fcov = 0.40) is rather unsuccessful, even with RMSD = 0.50058, with little overlap with the target's SED in all wavelength ranges. Seeing that the SED of the target diverges from a classical DS, we assume that the algorithm identified the target as a DS due to the IR excess and not due to the shape of the curve. Similarly to the previously discussed target, entry Nr.46 shows a large scaling factor, ~2.5m. Again, we stress that a large scaling is undesirable for a good DS candidate.

The AllWISE catalogue suggests that the star is not likely variable in all filters except W3 (var = 5574). Moreover, the TASS Mark IV patches photometric catalog (Droege et al. 2006) notes plausible variability of the star in the V and IC passbands (optical & NIR). They report a Welch-Stetson variability index WS = 0.13, where a low value (< 2) indicates random variations which are independent in each passband, or that the star varies less than the uncertainty of the measurement (Welch & Stetson 1993).

Marton et al. (2019) report a high probability that the target is a YSO using both datasets: LY=0.88460 and SY=0.85040. Once again we stress that this search is based on the colours of the WISE and Gaia filters as well as the extinction map from Planck. For this target, the dust opacity from Planck is small (τ = 0.00004) but higher than the limiting value for YSO-rich regions (Marton et al. 2016). However, no other clear signs of a YSO, such as maser radiation or Hα emission, are reported. The target was proposed as a pre-main sequence star by the earlier study of Kun (1998), based on a characteristic flux density distribution. Another study that considered the target a YSO is Yung et al. (2014). They analysed both 2MASS and AKARI colours as well as OH/H2O maser emission. The target of interest was not observed for OH lines and showed no detections of H2O emission. The paper does not provide any further discussion of the target; however, the reported properties do not support the claim of YSO or maser identification. Other sources also report the star as a young star based on the shape of the SED (Kun et al. 2009). However, as we know, the SED of a YSO can be very similar to that of a DS candidate.

Its position in the HR diagram, based on the temperature Teff = 7848.805 K and luminosity L = 9.68 L⊙ from Gaia DR2, is somewhat higher up on the MS than the Sun, or even near the giant branch. However, the high temperature typically indicates a bluer star, while this target is clearly redder. Thus, the star is likely somewhat obscured, which results in the IR excess. The list of Teff and metallicities for Tycho-2 stars (Ammons et al. 2006) reports a stellar temperature that deviates from Gaia's estimate, Teff = 5997 K, supporting the latter statement. Furthermore, the same source estimated the distance to the star as D = 39 pc, which is significantly less than what Gaia shows, p = 2.967 mas ≈ 337 pc. The fundamental stellar properties are derived using spline functions of broadband photometry and proper motions previously established in Hipparcos/Tycho-2 and 2MASS (Ammons et al. 2006). In the same survey, the target star is classified as a dwarf star according to the derived Teff and absolute V magnitude (MV = 9.892m). The distances in this survey are based on proper motions and colours, by fitting to Hipparcos parallaxes and thereafter converting to distance. The model estimates of the distances are plotted against known distances in figure 3 of Ammons et al. (2006), where, seemingly, the model has trouble estimating the true distances at large distances. It is plausible that this uncertainty affects the result, thus deviating from the distance derived from Gaia. However, the estimated temperature is clearly different from the one found by Gaia, even when considering both of their large uncertainties: Gaia with an uncertainty > 200 K and Ammons et al. (2006) with a maximum model error in temperature of ~400 K. Both temperatures are derived from colours, Gaia via Apsis-Priam (GBP − G & G − GRP) and Ammons et al. (2006) via a polynomial fit of five colours (BT − VT, BT − J, VT − H, VT − K, and J − K). Moreover, the TESS input catalogue (Stassun et al. 2018) contradicts the claim that the star is a dwarf, with its estimates of the size of the star: TESS reports a radius and mass twice those of the Sun, R = 1.728 R⊙, M = 2.016 M⊙.

A third paper that reports stellar properties is the Fundamental parameters and infrared excesses of Tycho–Gaia stars by McDonald et al. (2017). They estimated the temperature to Teff = 4190 K, the luminosity to L = 28.330 L⊙ and the radius to R = 10.115 R⊙, all derived from a distance estimate of D = 245.747 pc. This distance estimation is clearly closer to the one given by Gaia DR2; however, both the luminosity and temperature differ significantly. The SIMBAD object type states that the star is a YSO. However, according to the classification scheme based on the position of the stars in the HR diagram (see figure 15 in McDonald et al. (2017)), the star could belong to any of the groups "young stars", "variables", "binary stars" and "evolved stars". Compared to the HR diagram model seen in figure 4, the star clearly belongs to the giant groups.

A further noteworthy remark is that the star is mentioned in the XMM (X-ray Multi-Mirror Mission)-OM Serendipitous Source Survey Catalogue (Page et al. 2012), meaning that it is detected in the ultraviolet (UV) and X-rays. The UV magnitude measured in the UVW2 filter (λeff = 2120 Å, wavelength range 1800-2550 Å) is 12.6471m. The X-ray emission of the targets in the survey of Page et al. (2012) is not explicitly reported; however, the catalogue of UV sources is based on X-ray sources that are left for the reader to cross-reference. In the XMM-Newton source catalogue (http://xmm-catalog.irap.omp.eu/; Zolotukhin et al. 2017) this target has obs id=0673540901. In all 9 energy bands (basic and broad) used in the 4XMM-DR9 processing, this target shows fluxes of the order of ~10^-14-10^-15 erg/cm^2/s, i.e. ~10^4-10^5 sfu (solar flux units) (Zolotukhin et al. 2017). X-ray emission typically emanates from extremely hot gas. Some types of stars with strong X-ray emission are PMS and Herbig Ae/Be stars (Preibisch et al. 2005, Güdel 2004), where the X-rays are considered to be formed in the stellar coronae via magnetic reconnection flares: their large convective zones drive strong dynamos, which result in strong surface magnetic fields that lead to high X-ray emission. This could indicate that the target is indeed a young star, or that it is surrounded by hot gas.

Most surveys that have classified this object have based the classification on properties similar to those in our study, such as colours. In regard to the four key parameters in this study (presence of nebulae, variability, Hα detection or maser radiation), this target is viewed as a good DS candidate. On the other hand, the SED does not resemble a characteristic DS candidate, since the optical-IR difference is small and a small bump is seen in the near/mid-IR. Furthermore, information from the above-mentioned sources indicates that the star is not a MS star. Therefore, we conclude that this target is a less likely DS candidate. Although the many conflicting estimates of various stellar parameters make the true nature of the source hard to establish, further observations are encouraged.

7.3 Summary of the results

The deeper investigation in the sections above concerned the top 8 candidates that, in a briefer overview, looked promising. It seems that with the currently published data in VizieR we cannot establish the true nature of 2 of the 8 targets. There are proposed classifications for all 8 targets, but for the entries Nr.1 & Nr.8 we cannot rule out the possibility of an unnatural source being responsible for the IR-excess. For the remaining 6 targets we found indications suggesting that they are likely evolved stars, often proposed to be post-AGB stars. We note that the luminosities for all targets are lower than what is expected for post-AGBs; however, the dense dust shells which commonly surround these types of stars could dim the starlight. We stress that the true nature of the 6 less favourable targets is not confirmed; the various references differ in their assertions of the stellar type.

8 Discussion

In the following section, we consider the technique of ML to find DS candidates. We discuss the favoured models, combinations and adjustments for the learning process and their respective results. Thereafter, we elaborate on some identified problems and complications that arose within the process. For the most promising targets discussed in section 7, we propose follow-up observations and analyses that would help identify and explain their true nature. We also present an alternative method of searching for DS candidates via a grid search of the set of catalogues Gaia+2MASS+WISE. Finally, we present some future prospects and suggest implementations and strategies that could improve the applied ML technique in future research such as this.

8.1 Evaluation of the approach

8.1.1 The influence of various algorithms

One of the most popular machine learning techniques used in astronomical research is the support vector machine (Marton et al. 2016, 2019, Małek et al. 2013, Tasca et al. 2008, Fadely et al. 2012, Krakowski et al. 2016, Hartley et al. 2017, Baron 2019). This is not surprising, since the technique is well suited for classification problems and can handle nonlinearly separable n-dimensional data. Of all the models tested in this work, we also found the SVM technique to be one of the leading ones. In particular, the combination of linear and quadratic SVM gave candidates with magnitudes and colours that are expected for DS candidates. These algorithms also show high accuracy and a low number of false positives in their confusion matrices. Furthermore, this combination was more successful in distinguishing DS candidates from dust-enshrouded stars. From the training sets, we see that the "YSO" group, i.e. the dust-enshrouded stars, has more similar colours to the "normal" stars than the DSs, but similar G magnitudes (see figure 10). This means that the dust-enshrouded stars have lower W3 magnitudes than the DSs. It goes without saying, though, that the magnitudes of the DS models depend on the magnitudes of the initial star on which the DS models are based. However, the base stars for the DS models were mostly restrained within 4 < G < 6m, which resulted in G magnitudes of the DSs slightly lower than that region. On the other hand, plots of the YSO training set together with the DS candidates identified by the algorithm show common colours for the two groups, while the YSOs are somewhat brighter (see figure 12). Furthermore, the dust-enshrouded stars that the algorithm classified as DS candidates are found in the upper part of the candidate group, at higher brightness. The combination of linear and quadratic SVM thus reduces the number of dust-enshrouded stars while keeping the ones of similar colours, i.e. greater G − W3.

In figure 23 in appendix A.3 we present the results of five different algorithms, with three versions of one in particular. These models are (figure 23a) bagged trees, (figure 23b) kNN, (figure 23c) linear discriminant, (figure 23d) linear SVM, (figure 23e) quadratic SVM, (figure 23f) the combination of linear and quadratic SVM and, finally, (figure 23g) trees. Each figure shows the colour-magnitude diagram of the DS candidates (red) identified by the respective algorithm and the training set of stars for reference (black). All models are trained on the same training set, seen in table 1. The DS models are based on stars with G magnitudes within the interval 4-6m. Each figure shows an example of a run; however, they are representative of the typical outcomes.

Firstly, the bagged trees method (see sections 2.3.5 & 4.4.5) was here unsuccessful in finding any DS candidates, despite the fact that this approach is commonly seen as a good option for classifications of this type. It is possible that the overlapping subsets which are created are too diverse, such that when the final ensemble is applied to the larger dataset, it fails to find common candidates.

The kNN method (sections 2.3.1 & 4.4.4) identified a sample of candidates; however, most of them have a rather bright G magnitude, which is not expected for DSs and more common for YSOs or evolved stars. This model therefore gives an impure sample of DS candidates, which is not the desired result. Since the two training sets of YSOs and DSs are close in the euclidean metric of the colour-magnitude diagram, it is plausible that the nearest neighbours of an input DS are YSOs. This would cause confusion in the training, thus leading to miscategorisations of the entries in the final dataset. Indeed, when we check the confusion matrix we see that there are some false positives (true class stars and predicted class DSs). Furthermore, the colour-magnitude (e.g. G vs G − W3) scatter plot illustrates that the entries are more frequently misclassified where the overlap between the classes is greater. This specific colour-magnitude diagram is important, since the signifying feature we analyse is the IR-excess.

The linear discriminant method (sections 2.3.4 & 4.4.2) recognises many DS candidates, although more than half of them lie in the same regions as the common stars and at very bright magnitudes. Furthermore, we identified some targets from the second group as dust-enshrouded stars, similarly to the discussion in the previous section. The QDA is not applicable if we include the cc-flags, because the predictors are then constant for a response class (DS). For the linear discriminant, the cc-flag does not make a difference for the predictions when looking at the confusion matrix.

Another model which classified many entries as DS candidates is the decision tree (sections 2.3.2 & 4.4.1), seen in the bottom figure (figure 23g). Here we see many entries with bright magnitudes and a broad range of colours that reaches all the way to the "normal" stars. As for the previously mentioned model, the majority of the DS candidates here are in fact dust-enshrouded stars; hence the algorithm provides a contaminated result. The confusion matrix of this classifier, for this particular training set, shows a low number of FPs. The leading source responsible for the slightly lower accuracy is the miscategorised YSOs that have been identified as stars.

Finally, we return to the favoured methods seen in figures 23d, 23e & 23f, namely the SVM methods (sections 2.3.3 & 4.4.3). On their own, the linear and quadratic SVM models seem to identify a few faint-magnitude objects with uncommon colours, but also some with much higher brightness. The quadratic SVM model appears to categorise objects with colours closer to the normal stars as DSs, while the linear one discerns objects with higher IR emission. The bright objects are not in accordance with typical DSs and are therefore unwanted in the results. The combination of the two SVM techniques, which keeps the mutual DS candidates, eliminates the brighter sources and agrees on the faint (high-magnitude) ones. We acknowledge that two out of four candidates in figure 23f coincide with the normal stars. However, even though the graphs of figure 23 represent the general results, some small variations occur. Moreover, in this particular run we see fewer candidates for each method than what is commonly seen; this also contributes to the somewhat dissatisfying result of the combined SVM approach. Another example can be seen in the leftmost graph of figure 12, where more candidates have been recognised. Both models have high accuracy and few FPs and FNs in their confusion matrices, suggesting an accurate classification. The soft-margin method allows for a good trade-off between the number of misclassifications and the largest distance between the support vectors. Among all algorithms shown in figure 23, the combined SVM is the one with the most favoured results, both in the training, where the two models have the highest accuracy and low FP, and in the final results of DS candidates. The algorithms of the linear and quadratic SVM are the ones presented in appendix A.4.

Combinations of numerous algorithms were also explored. This means that we let two or more algorithms, trained on the same training set, classify the entries in a large dataset. Some of the targets will, hopefully, be classified identically by all models, thus resulting in a sample of common DS candidates. In some cases this resulted in a small sample of common DS candidates, though for most runs the separate algorithms differed in their results and had no mutual DS candidates.
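In MATLAB, this kind of combination reduces to intersecting the candidate lists of the individual classifiers. The sketch below assumes two trained models (e.g. exported from the Classification Learner app) and a feature table X for the search sample, with class labels stored as character vectors; all names are illustrative.

% Keeping only the entries that both the linear and the quadratic SVM
% classify as DS candidates. mdlLin, mdlQuad and X are assumed inputs.
labelLin   = predict(mdlLin, X);
labelQuad  = predict(mdlQuad, X);
isCommonDS = strcmp(labelLin, 'DS') & strcmp(labelQuad, 'DS');
candidates = X(isCommonDS, :);     % the mutual DS candidates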

Finally, we return to the favoured methods, seen in figures 23d, 23e & 23f, namely the SVM methods (sections 2.3.3 & 4.4.3). On their own, the linear and quadratic SVM models seem to identify a few faint objects with uncommon colours, but also some with much higher brightness. The quadratic SVM model appears to categorise objects with colours closer to the normal stars as DSs, while the linear one discerns objects with higher IR emission. The bright objects are not in accordance with typical DSs and are therefore unwanted in the results. The combination of the two SVM techniques, which gives the mutual DS candidates, eliminates the brighter sources and keeps the ones on which both models agree. We acknowledge that two out of four candidates in figure 23f coincide with the normal stars. However, even though the graphs of figure 23 represent the general results, some small variations occur. Moreover, in this particular run we see fewer candidates through each method than what is commonly seen, which also contributes to the somewhat dissatisfying result of the combined SVM approach. Another example can be seen in the leftmost graph of figure 12, where more candidates have been recognised. Both models have high accuracy and few FPs and FNs in their confusion matrices, suggesting an accurate classification. The sophisticated method of soft margins allows for a good trade-off between the number of misclassifications and the largest distance between the support vectors. Among all algorithms shown in figure 23, the combined SVM is the one with the most favoured results, both in the training, where they are the models with the highest accuracy and low FP, and in the final results of DS candidates. The algorithms of the linear and quadratic SVM are the ones presented in appendix A.4.

Combinations of numerous algorithms were also explored. This means that we let two or more algorithms, which have been trained on the same training set, classify the entries in a large dataset. Some of the targets will, hopefully, be classified equally by all models, thus resulting in a sample of common DS candidates. For some cases this resulted in a small sample of common DS candidates, though for most runs the separate algorithms differed in their results and had no mutual DS candidates.
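In MATLAB, such a combination amounts to intersecting the per-model predictions. A minimal sketch, assuming the two trained SVM classifiers from the earlier sketch and a new catalogue table T2 with the same predictor columns (the variable names are illustrative):

% Keep only entries that both SVM models label as Dyson-sphere candidates.
yLin  = predict(mdlLSVM, T2);
yQuad = predict(mdlQSVM, T2);
isCommonDS = (string(yLin) == "DS") & (string(yQuad) == "DS");
commonCandidates = T2(isCommonDS, :);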
As noted earlier, the classification learner in MATLAB allows for hyperparameter tuning. We found that for the favoured model (linear and quadratic SVM), additional hyperparameter tuning did not affect the results. For other models, such as bagged trees, we found that it may prove constructive to alter the hyperparameters. For example, a few more DS candidates were found when increasing the number of learners and the number of splits for the bagged tree model. Note that many splits may result in a model that overfits and cannot generalise well enough to the real dataset. For most of our runs and the final adaptation, we therefore use the standard settings of the hyperparameters.
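As an illustration of the bagged-trees tuning mentioned above, the number of learners and the maximum number of splits can be set explicitly. This is a sketch with arbitrary example values, not the settings used for the final runs (which kept the defaults):

% Bagged trees with more learners and deeper trees than the defaults.
% Larger 'MaxNumSplits' values risk overfitting, as discussed above.
tDeep  = templateTree('MaxNumSplits', 200);
mdlBag = fitcensemble(T, 'type', 'Method', 'Bag', ...
                      'NumLearningCycles', 100, 'Learners', tDeep);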

8.1.2 Training sets

The composition of the training sets has a significant impact on the training and thus on the final results. We have seen the importance of including certain parameters in the training, as well as of the constraints on the training data. In section 6.7 we briefly discussed the effect of having training sets of the various classes with equal or different numbers of entries, with the conclusion that an uneven ratio is preferred. Here we will elaborate more on the ratio between the YSO and DS training sets, as well as on the configuration of the DS training set. We start by assuming a training set with the two groups "normal" stars and YSOs, having 2000 respectively 100 entries, and a third, varying training set of DSs. If the number of entries in the DS training set is less than in the YSO group, the algorithm (linear & quadratic SVM) finds few objects that match the colours of the DS training set and many entries with higher brightness that coincide with the training set of stars. In other words, the algorithm fails to identify the key features of DSs, thus identifying false positives. It is likely that the low number of DS models also affects the outcome, and not only the ratio between DSs and YSOs. A larger sample of DSs and YSOs, where the DS group constitutes the larger part of the set, might still result in a more refined result.

Contrary to the former statement, a DS training set with more entries is not in all cases beneficial. We analysed the impact of the distribution of the DS training set by comparing the results of two sets with the same parameter intervals, 200 K < T_eff < 600 K and 0.5 < f_cov < 0.95, but with varying grid size. Each set of DS models was based on one star, and the sets contained 100 and 414 entries respectively. The former set resulted in more identified DS candidates than the latter. We stress that the set of 100 DS models is the same as the one used for the analysis of the ratio between the DS and YSO training sets (see figure 13); thus the result contains entries with properties unlike a classical DS. However, excluding those deviating objects, a larger sample of entries with DS-like properties is identified using the smaller DS training set.

The final variable of the training set we will discuss is the number of stars on which the DS models are based. Even with varying covering fraction and effective temperature of the DSs, there will be a difference between models that are based on different stars. The variance is not great for those models where the G magnitudes of the base stars are restrained within 4–6m, but there is even so a difference. This means that if the DS models are based on a number of stars instead of one, they will cover a larger region in magnitude and colour. This would increase the chances of an entry being classified as a DS by the trained algorithm. Seemingly fewer false positives are also found for this setup of DS training sets. To summarise, we conclude that the training set should consist of a larger fraction of "normal" stars, followed by DSs and lastly YSOs. The DS models do not need to form a fine grid, but they should be based on several stars rather than a single one. The algorithms found in appendix A.4 are based on a training sample that follows these principles and the numbers of entries seen in table 1. It should also be noted that extinction was not included in the training sets of DS models. The algorithms search for DS candidates within a relatively near vicinity in the Milky Way and have identified targets with relatively good fits without incorporating extinction. Furthermore, by including extinction within the models, we expect to see more YSOs and dust-obscured stars within the DS candidate output. This would lead to contaminated results containing false DS candidates.

8.2 Challenges

Throughout the comprehensive discussions of the propitious DS candidates, there was one category of stars that repeatedly was proposed as the explanation for the unusual IR properties of the targets. That group of stars is the young stellar objects. We have argued many times within the report that DS candidates and YSOs appear very similar, e.g. in colour and in the shape of their SEDs. Two examples of this are seen in figure 11, where two YSO SEDs (black) are plotted together with the best fitted model (red). These entries belong to the training set of YSOs, which is used in all training processes that include the third group. Both YSOs have SEDs decidedly similar to the modelled Dyson spheres; they have even smaller RMSD than most of the DS candidates. Clearly, this closeness makes it difficult for the ML algorithms to separate YSOs from DSs, since the training is primarily based on colours. The major parameters which help distinguish YSOs from DSs in the training are the coordinates ra and dec. As previously stated, we aimed to separate these two classes by including the coordinates of the entries; consequently, many YSOs which are located in prominent nebulae are more easily identified. This implementation was indeed successful to some extent. Many of the entries that have SEDs similar to DSs but are located in prominent nebulae were indeed categorised into the third group of YSOs by the algorithms. However, YSOs are undoubtedly found in more places than in the greatest nebulae. Small nebulous structures or dust and gas clouds are scattered all around and are not restrained to the line in the sky that represents the direction towards the centre of the Milky Way. The algorithms can therefore not always recognise a pattern in their location, and thus do not make the connection between nebulae and the YSO group.

Other targets which have been identified as DS candidates are evolved stars, a typical example being the post-AGB stars. The evolved stars have lower effective temperatures, and thus redder colours, until they reach the final stages of stellar evolution, see figure 5. Evolved stars, like AGB stars, also show IR excess due to the dust shell which surrounds the star. Therefore, these stars show colours comparable to the DS candidates; however, their luminosity is considerably higher. The luminosities of the entries in our ML search are not explicitly used in the training or in the search for candidates, although their absolute G magnitudes, which are related, were. The excessive luminosity measurements were identified in the manual investigation of the proposed candidates. Further parameters used in the manual examination that helped to recognise these targets were mass and radius estimates, typically from the TESS catalogue (Stassun et al. 2018, 2019). If luminosities are available, one can uncover the possible nature of the star via its estimated mass and effective temperature. If the mass is greater than the Sun's, its T_eff must also be higher than the Sun's in order for it to be a MS star. Otherwise, it is likely that the star is an evolved star. Secondly, one attempt at reducing the number of evolved stars within the DS candidate output could be to include evolved stars within the training set of the stars. However, if dust-enshrouded stars such as post-AGBs were included in the training set of the stars, their IR-excess properties might intertwine with the classification of DS-like objects. Alternatively, the training set of YSOs could be extended to more luminous objects, thus including evolved stars such as post-AGBs. This implementation might enhance the algorithms' performance in classifying targets with unusual IR properties.

8.3 Follow-up observations

The true nature of some of the DS candidates reviewed in section 7 is not established even after a deeper investigation and therefore needs additional research and observations. The most common types of objects that are classified as DS candidates by the numerous tested algorithms are YSOs. An important source of information when it comes to distinguishing YSOs from DS candidates is emission lines. As previously mentioned, young stars often show Hα or maser emission. Further investigation of the stellar spectrum thus helps considerably in identifying a more conventional explanation of the unusual target. For more than half of the candidates listed in table 3 we have noted Hα emission; for almost all of the remaining ones, we found no references that indicated such emission. It is possible that these sources have no evidence of Hα emission and that it is therefore not brought up by any reference. However, further spectral analysis is encouraged, given that the current knowledge of the sources presented in this work is based on published data. Additional investigation of unpublished data, such as for Hα emission lines, is encouraged. The same argument is applicable to the maser emission, where fewer targets had references with maser emission measurements.

Many fields in astronomy are thought to benefit greatly from the upcoming space telescope James Webb (JWST). The joint NASA-ESA-CSA telescope is the successor of the Hubble space telescope, with improved infrared resolution and sensitivity. The telescope will observe in the red part of the optical and from the NIR to the mid-IR. These features enable the telescope to observe high-redshift objects and cold objects such as debris disks and cooler dust. Additionally, it will be equipped with both NIR & mid-IR cameras as well as spectrographs (Gardner et al. 2006). The high mid-IR sensitivity of JWST will make it better suited than any predecessor to ascertain the orbiting material around these unusual stars.

Table 2: The table shows a small sample of targets with unusual IR properties, identified via the grid search. The coordinates ra and dec are given in degrees. AEN is the astrometric excess noise, AENσ the astrometric excess noise sig and p the parallax in mas. All parameters (excluding RMSD) listed in the table come from the Gaia archive, DR2.

Gaia ID ra dec RMSD AEN AENσ p comment

6086886067748733824  201.071  -46.095  0.487   0.485    3.194   1.816  QSO
6816373098892897280  326.930  -23.391  0.475   4.478   13.956   3.970  QSO
1163161909231963776  229.214    6.685  0.451   0.573    3.129   6.251  QSO/galaxy
736104209954695296   160.269   31.396  0.486   4.369   52.071   2.578  QSO, Seyfert type-II (z=0.138)
1553988166345499136  197.827   46.584  0.490   6.909  176.688   4.126  QSO (z=0.271)
283619780299286656    88.739   62.560  0.483   3.003   14.505   2.304  QSO (X-ray measurement)
4733820427174044544   49.094  -54.825  0.467   8.653  214.942   3.079  QSO
5788603818854455680  180.430  -78.596  0.414   0.559    8.118   9.530  QSO?
4978823744197946880    6.427  -47.306  0.4777  7.333  286.942   3.424  QSO
3002973036657872000   97.810   -9.653  0.427   2.032   15.852   2.551  Nebula
3020931634953652736   92.031   -5.307  0.441   0.112    0.0529  3.458  Nebula
3102113040903161984  103.585   -4.470  0.411   1.128    2.718   1.759  Nebula
3216423702860032896   85.230   -2.797  0.380   0.573    5.780   2.892  Nebula
3377006993843055744   94.389   22.339  0.423   1.565   12.202   3.279  Nebula
527799975432792192     7.089   66.345  0.383   0.321    0.518   2.444  Nebula

For example, it has been speculated that observations by JWST of Tabby's star would help to clarify the true nature of the unusual variability patterns of the star (Wright & Sigurdsson 2016). The spectral signatures could give clues as to whether the material is solid, such as fragments of cosmic collisions, while noisy spectral signatures could indicate dust clouds. It is likely that both photometric and spectrometric measurements from JWST may give further knowledge that would greatly help in establishing the configuration of the candidates.

The James Webb space telescope is coveted by many different branches of astronomy, and observations of these types of targets are perhaps less probable. Another suggestion is therefore to use the airborne SOFIA observatory. SOFIA, or the Stratospheric Observatory for Infrared Astronomy, is a collective work by NASA and the German Aerospace Centre (DLR). The telescope is equipped with both infrared cameras and spectrometers covering near-, mid- and far-IR wavelengths (Reinacher et al. 2018). The first light instruments of SOFIA cover a broad range of topics, such as dust temperatures and composition and protostellar environments (Krabbe & Casey 2002). The varied data in the broad infrared range of this telescope are ideal for further observations of these types of targets.

8.4 Grid search

In addition to the primary purpose of this work, namely to evaluate the application of machine learning to large astronomical catalogues, we study the results of a grid search of the same catalogue. A selected sample of entries with IR excess (G − W3 > 7.569 and G − W4 > 9.015) is retrieved from the joint catalogues Gaia+2MASS+WISE. Thereafter, each entry that fulfils the colour criteria seen in eq. 29 is compared to a fine grid of DS models. Various ranges were tested, but the selection presented in table 2 was found with the properties T_eff ∈ [400:10:500] K and f_cov ∈ [0.7:0.01:0.95]. The colour criteria in eq. 29 are chosen such that each entry has a significant IR excess and a G magnitude along the MS. Each entry that has a smaller RMSD than the chosen criterion for DS selection (RMSD < 0.5 is a good fit) is saved to a file and plotted against the best fitted DS model. An important difference between the grid search and the ML approach is that the program does not weigh the magnitudes differently. In ML we have constructed the programs such that the algorithm focuses on finding candidates with an IR excess and a relatively low G magnitude. The grid search does not direct more focus towards the IR excess, but takes the whole shape of the SED into equal consideration. Therefore, it is expected that the two methods identify different candidates, to some extent. Furthermore, the grid search will not distinguish DS candidates from nebulous objects or dust-enshrouded stars, which show resembling SEDs.

4 < G < 14    (29a)
G > 5.0847 (BP − G) + 0.8474    (29b)
G > 3.33 (G − J) + 3.0    (29c)
G > 2.72 (G − H) + 3.744    (29d)
G > 2.45 (G − Ks) + 4.85    (29e)
G − W3 > 2.5    (29f)
G − W4 > 2.5    (29g)

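A compact sketch of how such a grid search could be implemented in MATLAB is given below. The variable names (G, BP, J, H, Ks, W3, W4 for the catalogue magnitudes, and dsGrid for the DS model grid) are illustrative assumptions, not the actual program:

% Colour cuts of eq. 29 (vectorised over all catalogue entries).
keep = G > 4 & G < 14 ...
     & G > 5.0847*(BP - G) + 0.8474 ...
     & G > 3.33*(G - J) + 3.0 ...
     & G > 2.72*(G - H) + 3.744 ...
     & G > 2.45*(G - Ks) + 4.85 ...
     & (G - W3) > 2.5 & (G - W4) > 2.5;

% DS model grid over effective temperature and covering fraction
% (these arrays would feed the DS model generator that produces dsGrid;
% that step is not shown here).
[TeffGrid, fcovGrid] = ndgrid(400:10:500, 0.70:0.01:0.95);

% RMSD between each surviving SED and every model SED; dsGrid is assumed
% to hold one model SED per row, in the same bands as the data columns.
sed = [G(keep) J(keep) H(keep) Ks(keep) W3(keep) W4(keep)];
bestRMSD = inf(size(sed,1), 1);
for k = 1:size(dsGrid,1)
    rmsd = sqrt(mean((sed - dsGrid(k,:)).^2, 2));
    bestRMSD = min(bestRMSD, rmsd);
end
isCandidate = bestRMSD < 0.5;   % RMSD < 0.5 taken as a good fit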
As foreseen, the great majority of the candidates found via the grid search are dust-enshrouded stars in the vicinity of nebulae. Considering the previous elaboration in section 8.2 and figure 11, this is not surprising. Commonly, objects found via the grid search showed better RMSD values than via ML. However, many dust-enshrouded stars show very good fits to the DS models. A rather interesting result from the grid search was that some candidates with a relatively good fit turned out to be quasars (QSOs) and galaxies with parallaxes reported in Gaia. Some examples are seen in table 2. Quasars are active galactic nuclei, a luminous mechanism driven by accretion onto a supermassive black hole in the centre of a galaxy. QSOs are very distant objects that are assumed to have undetectable parallaxes. Consequently, Gaia, which mainly makes astrometric measurements of stars within the Milky Way, is not expected to recover parallaxes for these objects. Furthermore, the Gaia catalogue provides information on how well the observed target fits the astrometric model of a star. This information is given by the astrometric excess noise, which takes the value 0 for a good fit. This variable measures the disparity between the observations and the best-fitting astrometric model. Two examples of when the excess noise is non-zero are when the source is perturbed by a binary companion (unmodelled astrophysical behaviour) or when the measurement exhibits unmodelled instrumental noise (Hwang et al. 2020). Secondly, Gaia gives an estimation of how significant the excess noise is with the astrometric excess noise sig. If astrometric excess noise sig ≤ 2, the source is still viewed as astrometrically well behaved even though the astrometric excess noise is large. The majority of the QSOs found as DS candidates in the grid search have a non-zero astrometric excess noise, typically above ∼3. Astrometric noise is, for these targets, the straightforward explanation for why they have parallaxes in the Gaia catalogue. A few identified QSOs were found to have low or zero astrometric excess noise. The parallax measurements for the targets without high astrometric excess noise are less likely explained by astrometric noise. In a recent publication, Hwang et al. (2020) present an analysis of QSOs with significant Gaia parallax or proper motion, regardless of their astrometric excess noise values. They present a selected QSO sample (25 objects) with positive parallaxes, with parallax_over_error > 5 (parallax divided by its error) and a Gaia-SDSS separation < 1 arcsec to reduce contamination from foreground stars. In this sample, 15 entries showed a single Gaia match and 10 entries showed two matches within 1 arcsec. The true nature of these objects was investigated via their SDSS spectra and optical colours in SDSS and Pan-STARRS 1. Some of the objects in the QSO sample that showed a nearly featureless continuum in their SDSS spectra, or line splitting identified from the Zeeman effect, were proposed to be misclassified white dwarfs. Other QSOs were in close projection to foreground stars or suggested to be lensed QSOs. However, a substantial part of the sample showed broad emission lines in their spectra and were recognised as genuine QSOs with non-zero parallaxes. Hwang et al. (2020) state that it is unclear why these QSOs have non-zero parallaxes, but propose two explanations: Gaia shows parallaxes either due to varstrometry or due to systematics like an extended host. Varstrometry is explained as an astrometric intrinsic variability pattern which causes jitter in the photocentre of an unresolved source and affects the astrometric precision.

The grid search was an interesting complement to the main focus of this project. Even though a grid search provides many IR-excess candidates with better fits (low RMSD) than most of the ML candidates, the greater part is dust-enshrouded stars, which are avoided via ML. Some candidates are also found to be YSOs, much like in the ML search. The biggest and most interesting difference is the non-zero Gaia parallax QSOs recognised in the grid search, where most targets likely suffer from astrometric noise.
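The astrometric screening described above is straightforward to apply to a candidate table. The sketch below assumes a table tbl carrying the Gaia DR2 archive columns astrometric_excess_noise, astrometric_excess_noise_sig and parallax:

% Flag candidates whose Gaia parallax is plausibly driven by astrometric
% noise rather than being a genuine distance measurement.
aen    = tbl.astrometric_excess_noise;
aenSig = tbl.astrometric_excess_noise_sig;

wellBehaved = aenSig <= 2;                    % astrometrically well behaved
noiseDriven = ~wellBehaved & aen > 3;         % parallax likely noise-driven
puzzling    = tbl.parallax > 0 & wellBehaved; % non-zero parallax, low noise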
8.5 Future prospects and improvements

One of the biggest difficulties of using ML to find DS candidates is, as signified throughout the report, the group of dust-enshrouded stars and YSOs, which share properties similar to the DS models. Even though the program was notably improved when introducing the third class (YSOs) and including the coordinates within the training, there is still room for improvement. In sections 7.1.1 & 7.1.2 we reviewed the work by Marton et al. (2016, 2019), who used ML to identify YSO candidates using Gaia & AllWISE data. In addition to brightness measurements in various bandpasses, their work also included Planck dust opacity data. YSOs are typically found in regions with higher dust opacity; thus, by including this parameter they hoped to guide the algorithm into classifying these objects accordingly. We considered including such data in this work, with the aim of helping the algorithm distinguish DS candidates from YSOs. The regions in which we find candidates and YSOs, respectively, are likely too irregular and too small-scaled for the Planck data to improve the algorithm. Nevertheless, the possibility has not been examined and could potentially improve the search.

Additional information that could supplement the current features used within the training is the monochromatic images obtained by WISE. These pictures include both clear and well-behaved fields, as well as field structures and noise which may affect the colours of the targets. Artifacts or field structures can make the algorithm identify targets behind such structures as DS candidates.

By enabling this information, we anticipate that fewer "contaminated" sources would be identified as DS candidates.

There is a balance in how many features one should include in the training of the algorithms. Too many features is commonly a bad thing in this context; however, "too many" is a relative term that depends on the domain of the problem. In general, an increased number of features increases the dimensionality of the search space and the risk of overfitting. In the final runs and training processes we have included 26 features: 10 magnitude measurements with corresponding errors, the two coordinates, and the cc-flags split into four components. The question is whether the programs would improve if further features were included, or if it would lead to further confusion and overfitting. Examples could be the above-mentioned dust opacity from Planck and the monochromatic pictures from WISE. We cannot state with certainty that the current dimensions are not too many. The algorithms are successful in identifying IR-dominated targets, even if the sample is typically small. This suggests that the models are not overfitted to an extent that makes the process ineffective. However, further analysis and tests with fewer features in the training could be proposed as an extension of this work.
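Such a test could, for instance, use sequential feature selection from MATLAB's Statistics and Machine Learning Toolbox. The sketch below assumes that X holds the 26 features as a numeric matrix and y the class labels; it is an outline of the idea, not code from this work:

% Sequential forward selection: greedily add features as long as the
% cross-validated misclassification count of a multiclass SVM improves.
critfun = @(Xtr, ytr, Xte, yte) ...
    sum(~strcmp(yte, predict(fitcecoc(Xtr, ytr), Xte)));
selected = sequentialfs(critfun, X, y, 'cv', 5);  % logical mask of features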
In the program that identifies the best fitted DS models for the DS candidates, we allow scaling. As previously noted, a large offset between the model and the candidate indicates that the target is less Dyson-sphere-like and is most likely an evolved star, given that the target is brighter than the model. Some of our targets have a relatively large offset from the DS models (e.g. Nr.10, Nr.37 and Nr.46; see figures 17, 21 & 22). One suggestion is to constrain the offset such that the program does not find a good fit with misleading DS candidates. Scaling of the models is not included within the program where the trained algorithms identify DS candidates. We expected to see candidates within approximately the same magnitudes as the DS training set when the G magnitude was included in the training features besides the colours. Evidently, the classifiers identify entries as DS candidates even though their intrinsic brightness is higher than that of the models. A further suggestion is, therefore, to force the program to find candidates with magnitudes similar to the trained models. This would certainly reduce the number of inconsistent candidates that are likely dust-enshrouded evolved stars. Within this work we have limited the DS models to be based on solar-like stars (4 < G < 6m), and thus argue that the algorithms' candidates diverge from the models' predictions. Dyson spheres are not believed to be limited to solar-like stars, but can hypothetically be built around brighter stars as well. However, by forcing the algorithms to identify candidates similar to the models, we are in more control of the program and expect cleaner results with fewer false candidates.

9 Conclusion

Within this work, we have explored the application of supervised machine learning to large astronomical data. Magnitude measurements in the respective bandpasses from the Gaia+2MASS+WISE catalogues were used to recognise entries with unusual infrared properties (IR excess). We constructed training data of stars and Dyson sphere models, in an attempt to identify objects with IR properties similar to Dyson spheres that cannot be explained by natural astronomical phenomena based on existing data. Evidently, nebulous objects and dust-enshrouded stars have properties in common with Dyson sphere models. One type of object which is particularly similar to a DS is the young stellar objects (YSOs). For these types of stars, the IR excess is caused by surrounding dust which reradiates high-energy photons from the young star at longer wavelengths. By introducing a third group consisting of stars within nebulae, MATLAB's classification learner was better suited to distinguish non-nebulous objects from relevant Dyson sphere candidates.

In total, we found 8 DS candidates that did not show any indications of natural explanations for their unusual properties. Those signs were nebulous structures in the vicinity, Hα emission, maser radiation and variability. After a deeper investigation of those 8 candidates, 6 were rejected due to stellar properties that contradict the classical view of Dyson spheres. This left us with 2 intriguing targets (Nr.1 and Nr.8) that have conflicting reported stellar properties, which would require follow-up observations and/or analyses to establish their true nature. The first target does have a significant IR excess, but a very low optical peak, which deviates slightly from the traditional DS model. The latter target's SED is closer to the classical DS model, but with a low covering fraction. Nevertheless, these are the best candidates that emerged from this project.

Machine learning algorithms can, so far, not replicate the accuracy which human astronomers can achieve in the classification of astronomical objects. The manual work which was done for 8 targets in order to establish their nature used further information that was not available to the machine learning. However, even with the data that the algorithms were provided, we argue that fewer young stellar objects would have been classified as DS candidates if done manually, given their low IR excess. Still, with the rapidly growing data, we need to introduce new methods for e.g. the classification of astronomical objects that would take too long to do manually. Furthermore, we could argue that the machine learning process reduces the number of targets which, thereafter, can be further investigated by human astronomers. Therefore, even a non-perfect classification eases the manual work.

With this work, we have shown that the classification technique is capable of identifying deviant targets with unusual properties. Evidently, entries with more similar properties are more challenging for the algorithms to distinguish, hence the occasional misclassification of the dust-enshrouded stars. We conclude that supervised machine learning is a well-suited technique for classification problems on large astronomical data, and one that we expect to see more of in the near future.

Acknowledgement

I would like to express my sincere gratitude to my supervisor Erik Zackrisson for giving me the opportunity to work with yet another inspiring and innovative project, and for instructive discussions, feedback & support at very short notice. Many thanks, Erik. Thanks also to my subject reader Eric Stempels for informative comments and intriguing questions. Finally, a big thanks to my supportive parents for their great encouragement throughout all of my academic studies.

This work was carried out with the help of AI4Research at Uppsala University. Furthermore, this work has made use of the VizieR catalogue access tool, CDS, Strasbourg, France (Piskorz D. 2016). The original description of the VizieR service was published in A&AS 143, 23. The project also made use of the "Aladin sky atlas" developed at CDS, Strasbourg Observatory, France (Bonnarel et al. 2000), the SIMBAD database, operated at CDS, Strasbourg, France, and data products from the Two Micron All Sky Survey, which is a joint project of the University of Massachusetts and the Infrared Processing and Analysis Center/California Institute of Technology, funded by the National Aeronautics and Space Administration and the National Science Foundation.

References

Ammons, S. M., Robinson, S. E., Strader, J., Laughlin, G., Fischer, D. & Wolf, A. (2006), 'The N2K Consortium. IV. New Temperatures and Metallicities for More than 100,000 FGK Dwarfs', ApJ 638(2), 1004–1017.

Anthony Gonzalez, C. (2012), 'Filter list, UF astronomy'. URL: http://www.baryons.org/ezgal/filters.php

Avenhaus, H., Schmid, H. M. & Meyer, M. R. (2012), 'The nearby population of M-dwarfs with WISE: a search for warm circumstellar dust', A&A 548, A105.

Bak Nielsen, A.-S., Hjorth, J. & Gall, C. (2018), 'Early gray dust formation in the type IIn SN 2005ip', A&A 611, A67.

Baron, D. (2019), 'Machine Learning in Astronomy: a practical overview', arXiv e-prints p. arXiv:1904.07248.

Bergstra, J. & Bengio, Y. (2012), 'Random search for hyper-parameter optimization', Journal of Machine Learning Research 13(2).

Binks, A. S. & Jeffries, R. D. (2017), 'A WISE-based search for debris discs amongst M dwarfs in nearby, young, moving groups', MNRAS 469(1), 579–593.

Boch, T. & Fernique, P. (2014), Aladin Lite: Embed your Sky in the Browser, in N. Manset & P. Forshay, eds, 'Astronomical Data Analysis Software and Systems XXIII', Vol. 485 of Astronomical Society of the Pacific Conference Series, p. 277.

Bonnarel, F., Fernique, P., Bienaymé, O., Egret, D., Genova, F., Louys, M., Ochsenbein, F., Wenger, M. & Bartlett, J. G. (2000), 'The ALADIN interactive sky atlas. A reference tool for identification of astronomical sources', A&AS 143, 33–40.

Boyajian, T. S., LaCourse, D. M., Rappaport, S. A. et al. (2016), 'Planet Hunters IX. KIC 8462852 - where's the flux?', MNRAS 457(4), 3988–4004.

Caltech (2017), 'IV. WISE Data Processing, Pipeline Science Modules, h. Photometric Calibration'. URL: https://wise2.ipac.caltech.edu/docs/release/allsky/expsup/sec4_4h.html

Carroll, B. W. & Ostlie, D. A. (2017), An Introduction to Modern Astrophysics, 5th printing 2019, 2 edn, Cambridge University Press.

Dias, W. S., Monteiro, H., Caetano, T. C., Lépine, J. R. D., Assafin, M. & Oliveira, A. F. (2014), 'Proper motions of the optically visible open clusters based on the UCAC4 catalog', A&A 564, A79.

Droege, T. F., Richmond, M. W., Sallman, M. P. & Creager, R. P. (2006), 'TASS Mark IV Photometric Survey of the Northern Sky', PASP 118(850), 1666–1678.

Dyson, F. J. (1960), 'Search for Artificial Stellar Sources of Infrared Radiation', Science 131(3414), 1667–1668.

Engels, D. (2005), 'AGB and post-AGB stars', Mem. Soc. Astron. Italiana 76, 441.

Fadely, R., Hogg, D. W. & Willman, B. (2012), 'Star-Galaxy Classification in Multi-band Optical Imaging', ApJ 760(1), 15.

Fujii, T., Nakada, Y. & Parthasarathy, M. (2002), 'BVRIJHK photometry of post-AGB candidates', A&A 385, 884–895.

Gaia Collaboration (2016), 'The Gaia mission', A&A 595, A1.

Gaia Collaboration (2018), 'Gaia Data Release 2. Summary of the contents and survey properties', A&A 616, A1.

Garcia-Lario, P., Manchado, A., Pych, W. & Pottasch, S. R. (1997), 'Near infrared photometry of IRAS sources with colours like planetary nebulae. III', A&AS 126, 479–502.

Gardner, J. P., Mather, J. C., Clampin, M., Doyon, R., Greenhouse, M. A., Hammel, H. B., Hutchings, J. B., Jakobsen, P., Lilly, S. J., Long, K. S. et al. (2006), 'The James Webb Space Telescope', Space Science Reviews 123(4), 485–606.

Gawiser, E. & Smoot, G. F. (1997), 'Contribution of extragalactic infrared sources to cosmic microwave background foreground anisotropy', The Astrophysical Journal Letters 480(1), L1.

Gómez, J. F., Rizzo, J. R., Suárez, O., Palau, A., Miranda, L. F., Guerrero, M. A., Ramos-Larios, G. & Torrelles, J. M. (2015), 'A search for water maser emission toward obscured post-AGB star and planetary nebula candidates', A&A 578, A119.

Gregorio-Hetem, J., Lepine, J. R. D., Quast, G. R., Torres, C. A. O. & de La Reza, R. (1992), 'A Search for T Tauri Stars Based on the IRAS Point Source Catalog. I.', AJ 103, 549.

Güdel, M. (2004), 'X-ray astronomy of stellar coronae', A&A Rev. 12(2-3), 71–237.

Hales, A. S., De Gregorio-Monsalvo, I., Montesinos, B. et al. (2014), 'A CO Survey in Planet-forming Disks: Characterizing the Gas Content in the Epoch of Planet Formation', AJ 148(3), 47.

Hall, P., Park, B. U., Samworth, R. J. et al. (2008), 'Choice of neighbor order in nearest-neighbor classification', The Annals of Statistics 36(5), 2135–2152.

Hartley, P., Flamary, R., Jackson, N., Tagore, A. S. & Metcalf, R. B. (2017), 'Support vector machine classification of strong gravitational lenses', MNRAS 471(3), 3378–3397.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Science & Business Media.

Henry, T. J., Jao, W.-C., Subasavage, J. P., Beaulieu, T. D., Ianna, P. A., Costa, E. & Méndez, R. A. (2006), 'The Solar Neighborhood. XVII. Parallax Results from the CTIOPI 0.9 m Program: 20 New Members of the RECONS 10 pc Sample', AJ 132(6), 2360–2371.

Hwang, H.-C., Shen, Y., Zakamska, N. & Liu, X. (2020), 'Varstrometry for Off-nucleus and Dual Subkiloparsec AGN (VODKA): Methodology and Initial Results with Gaia DR2', ApJ 888(2), 73.

Johnson, H. L. (1962), 'Infrared stellar photometry', The Astrophysical Journal 135, 69.

Jordi, C., Gebran, M., Carrasco, J. M. et al. (2010), 'Gaia broad band photometry', A&A 523, A48.

Kataria, A. & Singh, M. (2013), 'A review of data classification using k-nearest neighbour algorithm', International Journal of Emerging Technology and Advanced Engineering 3(6), 354–360.

Krabbe, A. & Casey, S. C. (2002), 'First Light SOFIA Instruments', arXiv e-prints pp. astro-ph/0207417.

Krakowski, T., Małek, K., Bilicki, M., Pollo, A., Kurcz, A. & Krupa, M. (2016), 'Machine-learning identification of galaxies in the WISE × SuperCOSMOS all-sky catalogue', A&A 596, A39.

Kun, M. (1998), 'Star Formation in the Cepheus Flare Molecular Clouds. I. Distance Determination and the Young Stellar Object Candidates', ApJS 115(1), 59–89.

Kun, M., Balog, Z., Kenyon, S. J., Mamajek, E. E. & Gutermuth, R. A. (2009), 'Pre-Main-Sequence Stars in the Cepheus Flare Region', ApJS 185(2), 451–476.

Larson, R. B. (2003), 'The physics of star formation', Reports on Progress in Physics 66(10), 1651–1697.

Lee, J., Song, I. & Murphy, S. (2020), '2MASS J15460752-6258042: a mid-M dwarf hosting a prolonged accretion disc', MNRAS 494(1), 62–68.

Li, X. & Zhao, H. (2009), 'Weighted random subspace method for high dimensional data classification', Statistics and its Interface 2(2), 153.

Lindholm, A., Wahlström, N., Lindsten, F. & Schön, T. B. (2020), Supervised Machine Learning. URL: https://smlbook.org

Małek, K., Solarz, A., Pollo, A. et al. (2013), 'The VIMOS Public Extragalactic Redshift Survey (VIPERS). A support vector machine classification of galaxies, stars, and AGNs', A&A 557, A16.

Martinez, M. A. S., Stone, N. C. & Metzger, B. D. (2019), 'Orphaned exomoons: Tidal detachment and evaporation following an exoplanet-star collision', MNRAS 489(4), 5119–5135.

Marton, G., Ábrahám, P., Szegedi-Elek, E. et al. (2019), 'Identification of Young Stellar Object candidates in the Gaia DR2 x AllWISE catalogue with machine learning methods', MNRAS 487(2), 2522–2537.

Marton, G., Tóth, L. V., Paladini, R., Kun, M., Zahorecz, S., McGehee, P. & Kiss, C. (2016), 'An all-sky support vector machine selection of WISE YSO candidates', MNRAS 458(4), 3479–3488.

MathWorks (2021), 'fitcsvm: Box constraint'. URL: https://se.mathworks.com/help/stats/fitcsvm.html

McDonald, I., Zijlstra, A. A. & Watson, R. A. (2017), 'Fundamental parameters and infrared excesses of Tycho-Gaia stars', MNRAS 471(1), 770–791.

Meng, H. Y. A., Rieke, G., Dubois, F. et al. (2017), 'Extinction and the Dimming of KIC 8462852', ApJ 847(2), 131.

Meng, H. Y. A., Su, K. Y. L., Rieke, G. H. et al. (2015), 'Planetary Collisions Outside the Solar System: Time Domain Characterization of Extreme Debris Disks', ApJ 805(1), 77.

Meng, H. Y. A., Su, K. Y. L., Rieke, G. H. et al. (2014), 'Large impacts around a solar-analog star in the era of terrestrial planet formation', Science 345(6200), 1032–1035.

Moór, A., Ábrahám, P., Szabó, G. et al. (2021), 'A New Sample of Warm Extreme Debris Disks from the ALLWISE Catalog', ApJ 910(1), 27.

Mounce, S., Ellis, K., Edwards, J., Speight, V., Jakomis, N. & Boxall, J. (2017), 'Ensemble decision tree models using RUSBoost for estimating risk of iron failure in drinking water distribution systems', Water Resources Management 31(5), 1575–1589.

Murakami, H., Baba, H., Barthel, P. et al. (2007), 'The Infrared Astronomical Mission AKARI', PASJ 59, S369–S376.

Nakanishi, K., Takata, T., Yamada, T., Takeuchi, T. T., Shiroya, R., Miyazawa, M., Watanabe, S. & Saitō, M. (1997), 'Search and redshift survey for IRAS galaxies behind the Milky Way and structure of the Local Void', The Astrophysical Journal Supplement Series 112(2), 245.

Neugebauer, G., Habing, H. J., van Duinen, R. et al. (1984), 'The Infrared Astronomical Satellite (IRAS) mission', ApJ 278, L1–L6.

Noble, W. S. (2006), 'What is a support vector machine?', Nature Biotechnology 24(12), 1565–1567.

Ochsenbein, F. et al. (2000), 'The VizieR database of astronomical catalogues'.

Oelkers, R. J., Rodriguez, J. E., Stassun, K. G. et al. (2018), 'Variability Properties of Four Million Sources in the TESS Input Catalog Observed with the Kilodegree Extremely Little Telescope Survey', AJ 155(1), 39.

Page, M. J., Brindle, C., Talavera, A. et al. (2012), 'The XMM-Newton serendipitous ultraviolet source survey catalogue', MNRAS 426(2), 903–926.

Paolo Montegriffo (The European Space Agency) (2020), 'External calibration'. URL: https://gea.esac.esa.int/archive/documentation/GDR2/Data_processing/chap_cu5pho/sec_cu5pho_calibr/ssec_cu5pho_calibr_extern.html

Pearce, L. A., Kraus, A. L., Dupuy, T. J., Mann, A. W. & Huber, D. (2021), 'Boyajian's Star B: The Co-moving Companion to KIC 8462852 A', ApJ 909(2), 216.

Piskorz, D., Benneke, B. et al. (2016), 'VizieR Online Data Catalog: HD 88133 11-yrs measurements'.

Planck Collaboration (2011a), 'Planck early results. II. The thermal performance of Planck', A&A 536, A2.

Planck Collaboration (2011b), 'Planck early results. VII. The Early Release Compact Source Catalogue', A&A 536, A7.

Planck Collaboration (2014), 'Planck 2013 results. XI. All-sky model of thermal dust emission', A&A 571, A11.

Planck Collaboration (2020a), 'Planck 2018 results. I. Overview and the cosmological legacy of Planck', A&A 641, A1.

Planck Collaboration (2020b), 'Planck 2018 results. VI. Cosmological parameters', A&A 641, A6.

Planck HFI Core Team (2011), 'Planck early results. IV. First assessment of the High Frequency Instrument in-flight performance', A&A 536, A4.

Pottasch, S. R., Bignell, C., Olling, R. & Zijlstra, A. A. (1988), 'Planetary nebulae near the galactic center. I. Method of discovery and preliminary results', A&A 205, 248–256.

Preibisch, T., Kim, Y.-C., Favata, F. et al. (2005), 'The Origin of T Tauri X-Ray Emission: New Insights from the Chandra Orion Ultradeep Project', ApJS 160(2), 401–422.

R. M. Cutri (IPAC/Caltech), E. L. Wright (UCLA) et al. (2013), 'Explanatory supplement to the AllWISE data release products'. URL: https://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec1_3.html

Ramos-Larios, G., Guerrero, M. A., Suárez, O., Miranda, L. F. & Gómez, J. F. (2009), 'Searching for heavily obscured post-AGB stars and planetary nebulae. I. IRAS candidates with 2MASS PSC counterparts', A&A 501(3), 1207–1257.

Ramos-Larios, G., Guerrero, M. A., Suárez, O., Miranda, L. F. & Gómez, J. F. (2012), 'Searching for heavily obscured post-AGB stars and planetary nebulae. II. Near-IR observations of IRAS sources', A&A 545, A20.

Reinacher, A., Graf, F., Greiner, B., Jakob, H., Lammen, Y., Peter, S., Wiedemann, M., Zeile, O. & Kaercher, H. J. (2018), 'The SOFIA Telescope in Full Operation', Journal of Astronomical Instrumentation 7(4), 1840007.

Ruiz-Dern, L., Babusiaux, C., Arenou, F., Turon, C. & Lallement, R. (2018), 'Empirical photometric calibration of the Gaia red clump: Colours, effective temperature, and absolute magnitude', A&A 609, A116.

Schütze, H., Manning, C. D. & Raghavan, P. (2008), Introduction to Information Retrieval, Vol. 39, Cambridge University Press.

Sichevskij, S. G. (2017), 'Estimates of the radii, masses, and luminosities of LAMOST stars', Astrophysical Bulletin 72(1), 51–57.

Skrutskie, M. F., Cutri, R. M., Stiening, R. et al. (2006), 'The Two Micron All Sky Survey (2MASS)', AJ 131(2), 1163–1183.

Smola, A. J. & Schölkopf, B. (1998), Learning with Kernels, Vol. 4, Citeseer.

Stassun, K. G., Oelkers, R. J., Pepper, J. et al. (2018), 'The TESS Input Catalog and Candidate Target List', AJ 156(3), 102.

Stassun, K. G., Oelkers, R. J., Paegert, M. et al. (2019), 'The Revised TESS Input Catalog and Candidate Target List', AJ 158(4), 138.

Stevens, D. J., Stassun, K. G. & Gaudi, B. S. (2017), 'Empirical Bolometric Fluxes and Angular Diameters of 1.6 Million Tycho-2 Stars and Radii of 350,000 Stars with Gaia DR1 Parallaxes', AJ 154(6), 259.

Su, K. Y. L., Jackson, A. P., Gáspár, A. et al. (2019), 'Extreme Debris Disk Variability: Exploring the Diverse Outcomes of Large Asteroid Impacts During the Era of Terrestrial Planet Formation', AJ 157(5), 202.

Suffern, K. G. (1977), 'Some Thoughts on Dyson Spheres', Proceedings of the Astronomical Society of Australia 3, 177.

Supplement, A. E. (2018), 'AllWISE Source Catalog and Reject Table'. URL: https://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec2_1a.html

Szczerba, R., Siódmiak, N., Stasińska, G. & Borkowski, J. (2007), 'An evolutionary catalogue of galactic post-AGB and related objects', Astronomy & Astrophysics 469(2), 799–806.

Tasca, L., Rouan, D., Pelat, D. et al. (2008), 'A robust morphological classification of high-redshift galaxies using support vector machines on seeing limited images. II. Quantifying morphological k-correction in the COSMOS field at 1<z<2: Ks band vs. I band', Astronomy and Astrophysics 497.

The European Space Agency (2019), 'Gaia overview'. URL: https://www.esa.int/Science_Exploration/Space_Science/Gaia_overview

Torres, G., Andersen, J. & Giménez, A. (2010), 'Accurate masses and radii of normal stars: modern results and applications', A&A Rev. 18(1-2), 67–126.

Vickers, S. B., Frew, D. J., Parker, Q. A. & Bojičić, I. S. (2015), 'New light on Galactic post-asymptotic giant branch stars. I. First distance catalogue', MNRAS 447(2), 1673–1691.

Wang, S., Li, A. & Jiang, B. W. (2015), 'Very Large Interstellar Grains as Evidenced by the Mid-infrared Extinction', ApJ 811(1), 38.

Welch, D. L. & Stetson, P. B. (1993), 'Robust Variable Star Detection Techniques Suitable for Automated Searches: New Results for NGC 1866', AJ 105, 1813.

Wenger, M., Ochsenbein, F., Egret, D. et al. (2000), 'The SIMBAD astronomical database. The CDS reference database for astronomical objects', A&AS 143, 9–22.

Whittet, D. C. B. (2007), 'A Study of the Isolated Dark Globule DC 314.8-5.1: Extinction, Distance, and a Hint of Star Formation', AJ 133(2), 622–630.

Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K. et al. (2010), 'The Wide-field Infrared Survey Explorer (WISE): Mission Description and Initial On-orbit Performance', AJ 140(6), 1868–1881.

Wright, J. T., Cartier, K. M. S., Zhao, M., Jontof-Hutter, D. & Ford, E. B. (2016), 'The Search for Extraterrestrial Civilizations with Large Energy Supplies. IV. The Signatures and Information Content of Transiting Megastructures', ApJ 816(1), 17.

Wright, J. T., Griffith, R. L., Sigurdsson, S., Povich, M. S. & Mullan, B. (2014), 'The Ĝ Infrared Search for Extraterrestrial Civilizations with Large Energy Supplies. II. Framework, Strategy, and First Result', ApJ 792(1), 27.

Wright, J. T., Mullan, B., Sigurdsson, S. & Povich, M. S. (2014), 'The Ĝ Infrared Search for Extraterrestrial Civilizations with Large Energy Supplies. I. Background and Justification', ApJ 792(1), 26.

Wright, J. T. & Sigurdsson, S. (2016), 'Families of Plausible Solutions to the Puzzle of Boyajian's Star', ApJ 829(1), L3.

Yoon, D.-H., Cho, S.-H., Kim, J., Yun, Y. J. & Park, Y.-S. (2014), 'SiO and H2O Maser Survey toward Post-asymptotic Giant Branch and Asymptotic Giant Branch Stars', ApJS 211(1), 15.

Yung, B. H. K., Nakashima, J.-i. & Henkel, C. (2014), 'Maser and Infrared Studies of Oxygen-rich Late/Post-asymptotic Giant Branch Stars and Water Fountains: Development of a New Identification Method', ApJ 794(1), 81.

Zackrisson, E., Korn, A. J., Wehrhahn, A. & Reiter, J. (2018), 'SETI with Gaia: The Observational Signatures of Nearly Complete Dyson Spheres', ApJ 862(1), 21.

Zolotukhin, I. Y., Bachetti, M., Sartore, N., Chilingarian, I. V. & Webb, N. A. (2017), 'The Slowest Spinning X-Ray Pulsar in an Extragalactic Globular Cluster', ApJ 839(2), 125.

A Appendix

A.1 Candidates

Table 3: Resulting list of intriguing objects with unusual infrared properties that are suitable for follow-up observations. The columns show the Gaia source id, the sky position ra & dec (in degrees), the parallax (in mas) and the G magnitude (in mag); all four entries are taken from Gaia DR2. The first coloured column informs whether there is any present nebulous structure around the target. The second one tells if any source has identified any variability of the star. Thereafter, we note if there is any source that has detected any Hα emission or maser radiation commonly associated with YSOs. In the final column we list eventual comments on the analysis. The notes marked in the coloured columns are based on the manual, external investigation, not incorporated within the ML process, where most of the information is taken from the VizieR database (Ochsenbein 2000). The targets that show no evidence for nebulosity, variability, Hα or maser emission, and are discussed in section 7, are highlighted in bold.

[Rows Nr. 1–31 of the table list, for each candidate, the Gaia source id, ra, dec, parallax, G magnitude, the Nebula/Variability/Hα/Maser flags (YES/NO/MAYBE) and comments such as "edge of nebula", "small nebula?", "optical: halo of a bright star", "barely visible in AllWISE", "low noise in W3 & W4" and "poor fit to DS model".]

Table 3: Continued

Nr  source id            ra [deg]  dec [deg]  parallax [mas]  G [mag]  Nebula  Variability  Hα det.  Maser det.  Comments

32  5520949222275945728  122.936  -44.086  2.526  9.798   NO  NO  YES  poor fit to DS model, low IR excess
33  5520129673798353664  123.694  -44.693  2.848  10.609  NO  YES  poor fit to DS model, low IR excess
34  5540151196420957952  125.800  -39.117  2.890  9.866   NO  NO  YES  poor fit to DS model
35  6001669793442284416  235.193  -42.498  7.424  8.207   NO  YES  YES  YES
36  3014390365402220544  79.002   -9.810   3.210  9.874   NO  YES  YES
37  3208927022126871680  81.582   -6.399   2.568  10.390  NO  NO
38  3217370485451141760  84.260   -1.623   2.502  9.211   NO  NO  YES  poor fit to DS model, low IR excess
39  3222064884704575232  80.802   0.913    2.805  10.102  NO  YES
40  3221803265361341056  81.773   0.419    2.191  9.945   NO  YES  YES
41  3234348869128542080  81.034   2.463    2.784  9.690   NO  NO  YES  NO  poor fit to DS model, low IR excess
42  3222263312193940096  81.178   1.730    2.728  10.081  NO  YES  YES
43  3340623299383790848  82.579   11.339   2.313  10.022  NO  YES  YES  poor fit to DS model, low IR excess
44  120035647206281344   54.752   29.696   2.505  10.580  NO  YES  YES
45  167380583540276864   61.498   29.944   2.906  12.076  NO  NO  YES
46  2270536045876468096  319.413  68.919   2.967  9.878   NO  NO  NO
47  4048015133392824064  276.124  -29.781  8.775  8.054   NO  NO  YES  YES
48  4278396491852225024  275.925  4.497    2.771  13.152  MAYBE  NO  low noise in AllWISE
49  1932400859866635264  339.146  40.004   6.764  13.302  NO  YES  YES
50  2277845423082083200  313.277  74.843   2.683  11.972  NO  YES
51  5825256966729127552  228.856  -64.517  2.234  15.688  NO  YES
52  3014890093437477120  81.155   -8.701   6.313  9.659   NO  YES
53  3208985468042433792  82.298   -6.135   2.892  10.322  NO  YES  YES
54  3215055566797377024  79.189   -1.856   2.678  10.593  NO  YES  NO
55  3220462655745525632  80.879   -1.073   2.821  10.279  NO  YES  YES
56  3373609228036492032  91.956   18.657   1.488  11.375  MAYBE  YES  YES  YES
57  2203217037728935296  329.775  60.681   1.151  12.458  NO  YES  YES  quasar?

A.2 Uncertainty derivations

$$\sigma_{m_x} = \sqrt{\left(1.0857\,\frac{\sigma_{f_x}}{f_x}\right)^2 + \left(\sigma_{m_{x,0}}\right)^2} \qquad (30)$$

$$\sigma_{m_x} = \sqrt{\left(1.0857\,\frac{\sigma_{f_x}}{f_x}\right)^2 + \left(\sigma_x\right)^2 + \max\Big(\big|{-2.5\log_{10}(f_x+\sigma_{f_x})} + 2.5\log_{10}(f_x)\big|,\; \big|{-2.5\log_{10}(f_x)} + 2.5\log_{10}(f_x-\sigma_{f_x})\big|\Big)^2} \qquad (31)$$

where σ_x is the error of the photometric zeropoints in the Vega magnitude system for each passband x: σ_G = 0.0018, σ_BP = 0.0014 and σ_RP = 0.0019.
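For reference, a minimal MATLAB sketch of equation (31) as reconstructed above, using hypothetical flux values (all variable names and numbers below are illustrative, not taken from the analysis):

    % Hedged sketch of Eq. (31): magnitude error from the flux error, the
    % zeropoint error and the asymmetric logarithmic term. Values are made up.
    f_x      = 1.2e5;     % flux in passband x (arbitrary units)
    sigma_fx = 3.0e2;     % flux error in the same units
    sigma_x  = 0.0018;    % zeropoint error, e.g. the Gaia G band
    logTerm  = max(abs(-2.5*log10(f_x + sigma_fx) + 2.5*log10(f_x)), ...
                   abs(-2.5*log10(f_x) + 2.5*log10(f_x - sigma_fx)));
    sigma_mx = sqrt((1.0857*sigma_fx/f_x)^2 + sigma_x^2 + logTerm^2);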

A.3 Results of various models

Figure 23: Colour-magnitude diagrams showing the DS candidates (red) found by the different algorithms: (a) bagged trees, (b) kNN, (c) linear discriminant, (d) linear SVM, (e) quadratic SVM, (f) the combined result of the linear and quadratic SVM, and (g) the combined result of the tree algorithms. Each graph also contains the training set of stars (black). The DS models on which the algorithms are trained are based on stars with G magnitudes within 4-6 mag.
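As an illustration only (this is not the code that produced Figure 23), a colour-magnitude diagram of this kind can be drawn in MATLAB roughly as follows, assuming tables stars and candidates with the illustrative columns G, parallax (in mas) and G_minus_W3:

    % Illustrative sketch of a colour-magnitude diagram in the style of Figure 23.
    absG = @(T) T.G + 5*log10(T.parallax/1000) + 5;   % absolute G magnitude from parallax
    figure; hold on
    plot(stars.G_minus_W3,      absG(stars),      'k.', 'MarkerSize', 2)
    plot(candidates.G_minus_W3, absG(candidates), 'r.', 'MarkerSize', 10)
    set(gca, 'YDir', 'reverse')                       % brighter objects towards the top
    xlabel('G - W3 [mag]'); ylabel('M_G [mag]')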

A.4 Algorithm

In this section, we present the MATLAB code for the two most favourable machine learning algorithms, the linear and the quadratic SVM, i.e. the two classifiers. The features used in the training are: ra, dec, the G magnitude, the colours (BP-G, G-RP, G-J, G-H, G-Ks, G-W1, G-W2, G-W3 & G-W4) and the corresponding errors. The cc flags are split into four binary variables cc_flag1, cc_flag2, cc_flag3 and cc_flag4. The responses are the "object types", i.e. the three classes "stars", "DSs" and "YSOs". The code presented below can be used to recreate the training of the algorithms, assuming similar training data, by giving the command [trainedClassifier, validationAccuracy] = trainClassifier(T), where T is the training data. The reader can thereafter apply the classifier to a new dataset of choice to classify its entries, again given that it includes the aforementioned parameters used in the training. This is done with the command yfit = trainedClassifier.predictFcn(T2), where T2 is the new dataset. All commands and code are adapted for MATLAB.
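Put together, the intended workflow is the following minimal sketch (assuming that T contains the predictor columns and object_type, that T2 contains at least the predictor columns, and that the variable name dsCandidates is illustrative):

    % Minimal usage sketch of the classifiers presented below.
    [trainedClassifier, validationAccuracy] = trainClassifier(T);
    yfit = trainedClassifier.predictFcn(T2);            % one class label per row of T2
    dsCandidates = T2(strcmp(yfit, 'Dyson sphere'), :); % keep the DS-classified entries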

A.4.1 linear SVM

function [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% returns a trained classifier and its accuracy. This code recreates the
% classification model trained in the Classification Learner app. Use the
% generated code to automate training the same model with new data, or to
% learn how to programmatically train models.
%
% Input:
%   trainingData: a table containing the same predictor and response
%     columns as imported into the app.
%
% Output:
%   trainedClassifier: a struct containing the trained classifier. The
%     struct contains various fields with information about the trained
%     classifier.
%
%   trainedClassifier.predictFcn: a function to make predictions on new
%     data.
%
%   validationAccuracy: a double containing the accuracy in percent. In
%     the app, the History list displays this overall accuracy score for
%     each model.
%
% Use the code to train the model with new data. To retrain your
% classifier, call the function from the command line with your original
% data or new data as the input argument trainingData.
%
% For example, to retrain a classifier trained with the original data set
% T, enter:
%   [trainedClassifier, validationAccuracy] = trainClassifier(T)
%
% To make predictions with the returned 'trainedClassifier' on new data T2,
% use
%   yfit = trainedClassifier.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
%   trainedClassifier.HowToPredict

% Auto-generated by MATLAB on 10-Jun-2021 16:03:51

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'ra', 'dec', 'G', 'BP_minus_G', 'G_minus_RP', 'G_minus_J', ...
    'G_minus_H', 'G_minus_Ks', 'G_minus_W1', 'G_minus_W2', 'G_minus_W3', ...
    'G_minus_W4', 'G_err', 'err_BP_minus_G', 'err_G_minus_RP', 'err_G_minus_J', ...
    'err_G_minus_H', 'err_G_minus_Ks', 'err_G_minus_W1', 'err_G_minus_W2', ...
    'err_G_minus_W3', 'err_G_minus_W4', 'cc_flag1', 'cc_flag2', 'cc_flag3', ...
    'cc_flag4'};
predictors = inputTable(:, predictorNames);
response = inputTable.object_type;
isCategoricalPredictor = false(1, 26);   % none of the 26 predictors is categorical

% Train a classifier
% This code specifies all the classifier options and trains the classifier.
template = templateSVM( ...
    'KernelFunction', 'linear', ...
    'PolynomialOrder', [], ...
    'KernelScale', 'auto', ...
    'BoxConstraint', 1, ...
    'Standardize', true);
classificationSVM = fitcecoc( ...
    predictors, ...
    response, ...
    'Learners', template, ...
    'Coding', 'onevsone', ...   % ECOC with one binary SVM per pair of classes
    'ClassNames', {'Dyson sphere'; 'Star'; 'YSO'});

% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
svmPredictFcn = @(x) predict(classificationSVM, x);
trainedClassifier.predictFcn = @(x) svmPredictFcn(predictorExtractionFcn(x));

% Add additional fields to the result struct
trainedClassifier.RequiredVariables = predictorNames;
trainedClassifier.ClassificationSVM = classificationSVM;
trainedClassifier.About = 'This struct is a trained model exported from Classification Learner R2017a.';
trainedClassifier.HowToPredict = sprintf(['To make predictions on a new table, T, use:\n' ...
    '  yfit = c.predictFcn(T)\n' ...
    'replacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''.\n\n' ...
    'The table, T, must contain the variables returned by:\n  c.RequiredVariables\n' ...
    'Variable formats (e.g. matrix/vector, datatype) must match the original training data.\n' ...
    'Additional variables are ignored.\n\nFor more information, see How to predict using an exported model.']);

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'ra', 'dec', 'G', 'BP_minus_G', 'G_minus_RP', 'G_minus_J', ...
    'G_minus_H', 'G_minus_Ks', 'G_minus_W1', 'G_minus_W2', 'G_minus_W3', ...
    'G_minus_W4', 'G_err', 'err_BP_minus_G', 'err_G_minus_RP', 'err_G_minus_J', ...
    'err_G_minus_H', 'err_G_minus_Ks', 'err_G_minus_W1', 'err_G_minus_W2', ...
    'err_G_minus_W3', 'err_G_minus_W4', 'cc_flag1', 'cc_flag2', 'cc_flag3', ...
    'cc_flag4'};
predictors = inputTable(:, predictorNames);
response = inputTable.object_type;
isCategoricalPredictor = false(1, 26);

% Perform cross-validation
partitionedModel = crossval(trainedClassifier.ClassificationSVM, 'KFold', 5);

% Compute validation predictions
[validationPredictions, validationScores] = kfoldPredict(partitionedModel);

% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
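As a possible extension (these lines are not part of the generated code), the validation predictions computed above can be turned into the confusion matrix used for evaluation in section 2.5.1 by appending, inside trainClassifier, after the call to kfoldPredict:

    % Hedged addition: confusion matrix from the k-fold validation predictions.
    [C, order] = confusionmat(response, validationPredictions);
    disp(order)   % class order of the rows and columns
    disp(C)       % rows: true classes, columns: predicted classes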

A.4.2 quadratic SVM

function [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% returns a trained classifier and its accuracy. This code recreates the
% classification model trained in the Classification Learner app. Use the
% generated code to automate training the same model with new data, or to
% learn how to programmatically train models.
%
% Input:
%   trainingData: a table containing the same predictor and response
%     columns as imported into the app.
%
% Output:
%   trainedClassifier: a struct containing the trained classifier. The
%     struct contains various fields with information about the trained
%     classifier.
%
%   trainedClassifier.predictFcn: a function to make predictions on new
%     data.
%
%   validationAccuracy: a double containing the accuracy in percent. In
%     the app, the History list displays this overall accuracy score for
%     each model.
%
% Use the code to train the model with new data. To retrain your
% classifier, call the function from the command line with your original
% data or new data as the input argument trainingData.
%
% For example, to retrain a classifier trained with the original data set
% T, enter:
%   [trainedClassifier, validationAccuracy] = trainClassifier(T)
%
% To make predictions with the returned 'trainedClassifier' on new data T2,
% use
%   yfit = trainedClassifier.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
%   trainedClassifier.HowToPredict

% Auto-generated by MATLAB on 10-Jun-2021 16:03:19

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'ra', 'dec', 'G', 'BP_minus_G', 'G_minus_RP', 'G_minus_J', ...
    'G_minus_H', 'G_minus_Ks', 'G_minus_W1', 'G_minus_W2', 'G_minus_W3', ...
    'G_minus_W4', 'G_err', 'err_BP_minus_G', 'err_G_minus_RP', 'err_G_minus_J', ...
    'err_G_minus_H', 'err_G_minus_Ks', 'err_G_minus_W1', 'err_G_minus_W2', ...
    'err_G_minus_W3', 'err_G_minus_W4', 'cc_flag1', 'cc_flag2', 'cc_flag3', ...
    'cc_flag4'};
predictors = inputTable(:, predictorNames);
response = inputTable.object_type;
isCategoricalPredictor = false(1, 26);   % none of the 26 predictors is categorical

% Train a classifier
% This code specifies all the classifier options and trains the classifier.
template = templateSVM( ...
    'KernelFunction', 'polynomial', ...
    'PolynomialOrder', 2, ...   % quadratic kernel
    'KernelScale', 'auto', ...
    'BoxConstraint', 1, ...
    'Standardize', true);
classificationSVM = fitcecoc( ...
    predictors, ...
    response, ...
    'Learners', template, ...
    'Coding', 'onevsone', ...   % ECOC with one binary SVM per pair of classes
    'ClassNames', {'Dyson sphere'; 'Star'; 'YSO'});

% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
svmPredictFcn = @(x) predict(classificationSVM, x);
trainedClassifier.predictFcn = @(x) svmPredictFcn(predictorExtractionFcn(x));

% Add additional fields to the result struct
trainedClassifier.RequiredVariables = predictorNames;
trainedClassifier.ClassificationSVM = classificationSVM;
trainedClassifier.About = 'This struct is a trained model exported from Classification Learner R2017a.';
trainedClassifier.HowToPredict = sprintf(['To make predictions on a new table, T, use:\n' ...
    '  yfit = c.predictFcn(T)\n' ...
    'replacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''.\n\n' ...
    'The table, T, must contain the variables returned by:\n  c.RequiredVariables\n' ...
    'Variable formats (e.g. matrix/vector, datatype) must match the original training data.\n' ...
    'Additional variables are ignored.\n\nFor more information, see How to predict using an exported model.']);

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'ra', 'dec', 'G', 'BP_minus_G', 'G_minus_RP', 'G_minus_J', ...
    'G_minus_H', 'G_minus_Ks', 'G_minus_W1', 'G_minus_W2', 'G_minus_W3', ...
    'G_minus_W4', 'G_err', 'err_BP_minus_G', 'err_G_minus_RP', 'err_G_minus_J', ...
    'err_G_minus_H', 'err_G_minus_Ks', 'err_G_minus_W1', 'err_G_minus_W2', ...
    'err_G_minus_W3', 'err_G_minus_W4', 'cc_flag1', 'cc_flag2', 'cc_flag3', ...
    'cc_flag4'};
predictors = inputTable(:, predictorNames);
response = inputTable.object_type;
isCategoricalPredictor = false(1, 26);

% Perform cross-validation
partitionedModel = crossval(trainedClassifier.ClassificationSVM, 'KFold', 5);

% Compute validation predictions
[validationPredictions, validationScores] = kfoldPredict(partitionedModel);

% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
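Finally, a hedged sketch of how the combined result of the linear and quadratic SVM (cf. Figure 23f) can be obtained: keep only the entries that both classifiers label as Dyson spheres. The struct names trainedLinearSVM and trainedQuadraticSVM are illustrative placeholders for the two trained classifiers above:

    % Hedged sketch: intersect the predictions of the two SVM classifiers.
    yfitLin  = trainedLinearSVM.predictFcn(T2);
    yfitQuad = trainedQuadraticSVM.predictFcn(T2);
    keep = strcmp(yfitLin, 'Dyson sphere') & strcmp(yfitQuad, 'Dyson sphere');
    combinedCandidates = T2(keep, :);   % candidates found by both classifiers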
