Draft version December 22, 2020 Typeset using LATEX twocolumn style in AASTeX63

Untangling the Galaxy III: Photometric Search for Pre-main Sequence Stars with Deep Learning

Aidan McBride,1 Ryan Lingg,2 Marina Kounkel,1 Kevin Covey,1 and Brian Hutchinson2, 3

1 Department of Physics and Astronomy, Western Washington University, 516 High St, Bellingham, WA 98225
2 Department of Computer Science, Western Washington University, 516 High St, Bellingham, WA 98225
3 Computing and Analytics Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99354

arXiv:2012.10463v1 [astro-ph.SR] 18 Dec 2020

ABSTRACT

A reliable census of pre-main sequence stars with known ages is critical to our understanding of early stellar evolution, but historically there has been difficulty in separating such stars from the field. We present a trained neural network model, Sagitta, that relies on Gaia DR2 and 2MASS photometry to identify pre-main sequence stars and to derive their age estimates. Our model successfully recovers populations and stellar properties associated with known star forming regions up to five kpc. Furthermore, it allows for a detailed look at the star-forming history of the solar neighborhood, particularly at age ranges to which we were not previously sensitive. In particular, we observe several bubbles in the distribution of stars, the most notable of which is a ring of stars associated with the Local Bubble, which may have common origins with the Gould’s Belt.

1. INTRODUCTION

Historically, the pre-main sequence (PMS) stars that have been the easiest to identify and classify are those that are the youngest and are still in possession of their natal envelopes and/or protoplanetary disks. Such sources could be identified on the basis of large infrared excess, and these dusty young stellar objects (YSOs) have been searched for using a number of all-sky surveys, such as IRAS, 2MASS, AKARI, and WISE (e.g., Prusti et al. 1992; Koenig et al. 2012; Tóth et al. 2014; Marton et al. 2016). Furthermore, detailed infrared maps of a large number of star forming regions have been constructed with more targeted surveys, such as with Spitzer and Herschel (e.g., Evans et al. 2009; Megeath et al. 2012; Fischer et al. 2017).

However, after a star loses its protoplanetary disk, its colors begin to resemble those of much more evolved field stars, making follow-up identification difficult.

1.1. Gaia DR2 classification of YSOs

In comparison to previously available techniques (such as spectroscopic follow up measurements, or X-ray emission), the release of Gaia DR2 (Gaia Collaboration et al. 2018) has allowed a revolution in the search and characterization of young stars. Through its unprecedented precision in the measurements of parallax and proper motions, as well as its remarkable photometric quality, two new techniques became available to the community.

First is the phase space clustering. Young stars form in the dynamically cold molecular clouds. These clouds would typically form anywhere from a few hundred to several thousands of stars in relatively close proximity that would typically have low velocity dispersion. Thus, through searching for an overdensity in the position and velocity space it is possible to identify a young comoving group of stars. Such clustering has been employed both systematically across the entire Galactic Disk (Kounkel & Covey 2019, hereafter Paper I) as well as to better constrain the membership of individual star forming regions (e.g., Kounkel et al. 2018; Galli et al. 2019; Damiani et al. 2019).

But, clustering requires that all of the stars in a comoving group retain the group’s characteristic velocity in order to be identifiable. As these groups slowly dissolve into the Galaxy and lose coherence, an increasingly small fraction of them is recoverable. Indeed, even 1 Myr populations have some stars that have already been ejected from the clusters they inhabit (McBride & Kounkel 2019; Schoettler et al. 2020; Farias et al. 2020). Searching for such high velocity YSOs may be of interest to better characterize intracluster dynamics, but it is impossible to do through clustering. Furthermore, some young populations may be too small or too diffuse to robustly identify them with clustering at all.

The second method that Gaia DR2 made possible is through examining the position of the stars on the HR diagram. YSOs are overluminous compared to their main sequence counterparts due to their still-inflated radii, and most are fainter and cooler than the red giants. If the distance is known accurately, it is possible to resolve the degeneracy between the nearby dwarfs and distant giants, and thus separate YSOs from more evolved stars.

Such an approach is rather simple to use when attempting to identify YSOs in a single star forming region with a known position on the sky and known distance, particularly if this region is only a few Myr old. In this case, the low mass PMS stars can be cleanly separated from the low mass main sequence counterparts, and it is possible to determine color cuts that would prevent contamination from red giant stars and/or high mass main sequence stars. However, extending this to multiple populations that have different ages, distances, or extinctions is difficult. Zari et al. (2018) performed a selection of YSOs in the Gaia DR2 catalog that are consistent with being younger than the 20 Myr isochrone and that are located within 500 pc. Although the catalog effectively identifies sources throughout known nearby star forming regions, at larger distances the contamination does become significant. Thus, it is necessary to reevaluate the selection criteria if one wishes to reliably extend the catalog beyond 500 pc.

Machine learning, and, in particular, the use of neural networks, is a method that facilitates the search for complex correlations in large volumes of data. Machine learning classifiers have been used to search for young stars in a number of works, from searching for infrared excess (Marton et al. 2016, 2019; Chiu et al. 2020), to using Hα in conjunction with photometry to search for Herbig Ae/Be stars (Vioque et al. 2020), to using the optical Hubble Space Telescope colors to give a probabilistic assessment of young stars in the Magellanic clouds (Ksoll et al. 2018).

1.2. Derivation of stellar ages

Beyond classifying a star as young, extracting its properties (such as its age) can be a challenge. The way this is commonly done is through comparison of photometric colors (or age-sensitive spectroscopic features) to theoretical isochrones. While this practice has a long standing history (e.g., Cohen & Kuhi 1979; Greene & Meyer 1995; Covey et al. 2010; Da Rio et al. 2012), this process is not trivial (e.g., Olney et al. 2020), especially with the inconsistencies between young stars discussed earlier.

The first issue lies with the isochrones themselves. Over the years, a number of different stellar evolution models have been developed (e.g., D’Antona & Mazzitelli 1994; Baraffe et al. 1998; Siess et al. 2000; Baraffe et al. 2015; Choi et al. 2016), and the ones that seem to have gained the most wide-spread usage in the community in the recent years are the PARSEC isochrones (Marigo et al. 2017). Due to slightly different assumptions regarding the underlying stellar physics, these models produce distinct isochrones and evolutionary tracks, and thus produce different age and mass estimates even when applied to the same stellar population (Hillenbrand et al. 2008). However, no isochrones offer a perfect match to the data, especially for the low mass stars.

For example, the M dwarfs appear to be overinflated compared to what the isochrones would suggest even in one of the best studied open clusters, the Pleiades (Jackson et al. 2018). Thus, attempting to estimate their age through isochrone fitting would yield a systematically younger age than what is appropriate for the cluster. Similarly, a cluster “birthline” (i.e., the region of the parameter space that would correspond to a 0 Myr population) is ill-defined, such that in the young populations that are just a few Myr old, the higher mass stars appear to be systematically older than their low mass counterparts (Hartmann 2009; Herczeg & Hillenbrand 2015). The presence of a protoplanetary disk further alters the photometry in such a way that makes it difficult to place a star onto the isochrones correctly.

The second issue lies with the physical properties of YSOs. They tend to be more complex than even the most advanced stellar evolution models can account for. Due to being mostly convective, a large fraction of their photospheres are covered with spots, resulting in a mismatch in the effective temperature $T_{\rm eff}$ and the expected mass of the star. Finally (albeit unavoidably) is an issue of stellar multiplicity. Among the main sequence stars, the binary sequence can be clearly seen and separated from the single stars. However, in young populations with an intrinsic age spread of even a few Myr, the visual binaries may be inferred to be systematically younger than what is appropriate (e.g., Bouma et al. 2020).

Third, there is an issue of the self-consistency of the fitting process. Even for the same populations of stars, using the same set of isochrones, but focusing on somewhat different features and using a different interpolation method, it is possible to produce an age estimate that is somewhat inconsistent between various works. Taking Orion as an example, particularly the region near the 25 Ori cluster, while different authors were able to estimate roughly comparable ages (Kounkel et al. 2018; Briceño et al. 2019; Zari et al. 2019), some infer ages that are systematically older (Kos et al. 2019). Such differences would only compound when comparing ages of individual stars.

To compensate for some of these issues, data driven models may perform better compared to the theoretical isochrone fitting. By distilling the previously existing estimates of ages for stars (both in the cases when the age can be assigned to all stars in a cluster on a population level, and in cases where measurements for individual stars are available), it is possible to construct a neural network that would assign ages to stars. While the predictions it would generate could only be as accurate as the input data on which the network is trained, through leveraging the ages derived by various methods, the systematic differences can be significantly reduced. This includes systematic differences between low and high mass stars, as well as differences between various stellar evolution models, resulting in a more self-consistent interpolation that is more faithful to the data.

In this work, we present a tiered deep learning model that we refer to as Sagitta. This model identifies PMS stars using Gaia DR2 and 2 Micron All-Sky Survey (2MASS) photometry and astrometry and estimates the ages of these stars. In Section 2 we describe the data that were used to train Sagitta, as well as the data on which we perform the evaluation. In Section 3 we detail the process of constructing and training the model. In Section 4 we test the results benchmarked against other catalogs of PMS stars, as well as known star forming regions. In Section 5 we discuss the features in the data, such as their implications on the star forming history of the solar neighborhood and the origin of the Gould’s Belt. Finally, we conclude in Section 6.

2. DATA

Neural networks are reliant on labelled training data in order to generate a model able to generalize; that is, make accurate predictions for new sources. The bulk of the training data was obtained from Kounkel et al. (2020, hereafter Paper II). This catalog contains almost 1 million stars that have been clustered together into more than 8000 different moving groups, extending up to a parallax limit of π > 0.2 mas. Average ages (ranging from < 1 Myr to 1 Gyr) have been inferred through the photometry of each moving group’s members using the Auriga neural network, which is described in full in Paper II.

To train its sister network, Sagitta, that we present in this paper, we ingest 8 parameters for each star in the sample. Six of them are the photometries in different bandpasses, namely $G$, $G_{BP}$, $G_{RP}$, $J$, $H$, and $K_s$, with the 2MASS passbands queried using the precomputed Gaia DR2 crossmatch table. We also rely on parallax, as well as an approximation of extinction $A_V$. The latter is not necessarily the intrinsic extinction the star may have (e.g., it does not account for the presence of a protoplanetary disk), but rather, it is an estimate along the line of sight to the star based on its galactic positions l, b, and π (Section 3.3).

The tasks of both classification (i.e., blind selection of PMS stars) and regression (i.e., interpolation of the ages for the identified young stars) are reliant on the different features of this catalog; furthermore, they require different augmentation procedures to improve the homogeneity of coverage.

2.1. Classification training sample

The first necessary step is to select the PMS stars in the catalog from Paper II. Massive YSOs will reach the main sequence sooner than the low mass ones: e.g., OB stars will be born directly on the main sequence, whereas M dwarfs may take as long as 100 Myr to reach it. As the main sequence stars of similar mass are difficult to distinguish from one another, regardless of their age, a simple cut in age is not sufficient to reliably separate PMS stars. Rather, we compared the HR diagrams of populations in each 0.1 dex age bin to that of the Pleiades to determine the most massive/bluest star that can still be considered pre-main sequence. We then assign the extinction-corrected $[G_{BP} - G_{RP}]_0$ color corresponding to such star as the location of the turn-off for that age bin. The extinction was estimated on a population level in Paper II. We then interpolated across all the age bins to obtain the cut of

$$[G_{BP} - G_{RP}]_0 > 49.3686 - 14.3347 \times t + 1.05042 \times t^2$$

where t is the age of the population in dex. This relation is valid for t < 7.85 dex (Figure 1). Using this age-dependent relation to identify the color at which the MS-PMS transition should occur at a given age, we assign a preliminary PMS designation to all low-mass stars redward of the critical color in each population, according to its age.

Furthermore, we imposed a cut of

$$M_{G_0} > 2.8 \times [G_{BP} - G_{RP}]_0 - 2$$

to separate possible contamination from red giants.

A total of 62,484 sources (about 6%) of the catalog in Paper II meet these criteria, and are identified as the training sample of likely PMS stars for Sagitta. The remaining sources are considered to be evolved, i.e., main sequence or post-main sequence sources. All sources receive a binary numerical flag to indicate their evolutionary state, with 1 assigned to all sources satisfying the YSO criteria and 0 assigned to the remaining evolved sources.
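The two cuts of Section 2.1 and the resulting binary flag can be written out as a short function. This is a minimal sketch, not the authors' code: the function and argument names are hypothetical, and treating every population with t ≥ 7.85 dex as evolved is an assumption based only on the stated validity range of the color relation.

```python
def turnoff_color(t_dex):
    """Bluest [G_BP - G_RP]_0 still counted as PMS for a population of age t (dex)."""
    return 49.3686 - 14.3347 * t_dex + 1.05042 * t_dex ** 2


def pms_flag(bp_rp0, abs_g0, t_dex):
    """Return 1 for a likely PMS star and 0 for an evolved source.

    bp_rp0 : extinction-corrected [G_BP - G_RP]_0 color (mag)
    abs_g0 : extinction-corrected absolute magnitude M_G0 (mag)
    t_dex  : log10 of the population age in years
    """
    # ASSUMPTION: outside the quoted validity range (t < 7.85 dex) nothing is
    # flagged as PMS; the text does not say how older populations are treated.
    if t_dex >= 7.85:
        return 0
    redward_of_turnoff = bp_rp0 > turnoff_color(t_dex)
    # Red giants are bright (small M_G0) for their color, so they fail this cut.
    not_a_giant = abs_g0 > 2.8 * bp_rp0 - 2.0
    return int(redward_of_turnoff and not_a_giant)
```

For a 10 Myr population (t = 7.0 dex) the turn-off color evaluates to about 0.50 mag, so a red, faint star such as `pms_flag(2.0, 8.0, 7.0)` is flagged, while an equally red but bright source such as `pms_flag(2.0, 1.0, 7.0)` is rejected as a likely giant.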

[Figure 1 plot: $M_{G_0}$ versus $[G_{BP} - G_{RP}]_0$; legend: t = 7.1 dex, t = 7.75 dex, Pleiades.]

Figure 1. A demonstration of a color-dependent cut-off for pre-main sequence stars as a function of age in comparison to the Pleiades. Note the presence of the binary sequence in each of the populations.

While the catalog from Paper II offers a comprehensive coverage of PMS stars, it is incomplete at the oldest age bins for stars that are more evolved. All moving groups eventually dissolve into the Galaxy, and after this happens, they are no longer recoverable through clustering. Therefore, the distribution and properties of the red giants that are found in the field are not represented by the stars in the Paper II catalog. This introduces a bias that may result in reddened red giants that scatter into the PMS parameter space (or are even located well above it) being erroneously classified as PMS, as the classifier would not know enough about red giants to reject them. To correct for this bias we took 3 million randomly selected stars from Gaia DR2 (using the random index), with the same quality cuts as the ones specified in Section 2.3, to better train the network on the colors of older field stars so that it can discriminate against them. Any stars that happened to also be included in the clustered catalog were excluded from this random selection. We assume that the remaining stars in this random catalog can all be considered to be evolved. If any true PMS stars remained in this random sample, their fraction, as well as the overall number, is expected to be so small as not to make a substantial difference for the classifier, as PMS stars are comparatively rare relative to the older stellar classes.

The spatial position of the stars (i.e., l and b) was not used directly in the training sample to avoid introducing spatial bias. However, some positions can be inferred through the combination of π and $A_V$. To prevent spatial bias, we used augmentation, i.e., the process of making the training catalog larger through artificially modifying it to reduce various biases. Augmentation is also beneficial to improve performance beyond 1 kpc, where the training sample is highly incomplete, as few PMS stars at those distances are found in the Paper II catalog due to the sensitivity limits. To augment the sample, we modified the photometry of the real stars (both the clustered and random catalogs) to simulate the effects of them being found at different distances (up to 5 kpc) and with different $A_V$ (up to 10 mag), both of which were randomly generated, and random errors in flux drawn from a normal distribution were added to all passbands. The extinction coefficients for different passbands were taken from the web interface of the PARSEC isochrones (Marigo et al. 2017) as they cover a wide range of stellar masses and ages, with various passbands applied to the photometry. Each real star was drawn multiple times, with this multiplier treated as a hyperparameter in the model for each subset (Section 3.4). These ‘synthetic’ stars were passed along to the classifier alongside the photometry of the real stars to improve generalization.

2.2. Regression training sample

The training sample for the regression to estimate ages was limited only to the stars identified as PMS stars in the catalog from Paper II from above, excluding the sources considered to be evolved. Otherwise, since the main sequence stars of similar mass share the parameter space in fluxes, retaining them in the sample could introduce considerable biases to the PMS stars.

Most of the PMS stars in Paper I and Paper II are found in stellar strings, i.e., extended populations spanning several tens or even hundreds of pc. Each individual region can sustain star formation for up to ∼10 Myr. Such a duration is hardly noticeable within the errors of the ages assigned to populations older than 100 Myr, i.e., the colors and fluxes of 90 Myr and 100 Myr stars are not going to be very different. However, the differences in fluxes between the youngest and the oldest generation of stars in the same region are more pronounced in regions that are still forming, that is to say, 1 Myr and 10 Myr old stars look very different. Therefore, assigning a single age to all the stars in a single string if there is any sort of underlying age gradient (as is the case in Orion or Sco Cen, for example) can introduce some biases. In estimating ages for the moving groups in Paper II, Auriga preferentially considers the oldest stars in a region, overestimating the ages of the younger stars. Without correcting for this effect in the training sample, Sagitta cannot accurately estimate ages of stars younger than a few Myr.
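The augmentation step described in Section 2.1 (placing a real star at a new random distance and extinction, then adding flux noise) can be sketched as follows. Everything here is illustrative rather than the authors' implementation: the function name, the uniform sampling, the 50 pc lower distance bound, the 2% fractional flux noise, and the extinction ratios (approximate $A_\lambda / A_V$ values, not the coefficients taken from the PARSEC web interface) are all assumptions.

```python
import math
import random

# ASSUMPTION: approximate A_band / A_V ratios, for illustration only.
EXT_COEFF = {"G": 0.859, "BP": 1.068, "RP": 0.652, "J": 0.287, "H": 0.181, "K": 0.117}


def augment_star(mags, parallax_mas, a_v, rng,
                 max_dist_pc=5000.0, max_av=10.0, flux_frac_sigma=0.02):
    """Return (new_mags, new_parallax_mas, new_a_v) for one synthetic copy of a star.

    mags is a dict of apparent magnitudes keyed like EXT_COEFF.
    """
    old_dist_pc = 1000.0 / parallax_mas
    # ASSUMPTION: uniform draws; the text only states the upper limits.
    new_dist_pc = rng.uniform(50.0, max_dist_pc)
    new_av = rng.uniform(0.0, max_av)
    # Change in distance modulus when moving the star to the new distance.
    dist_shift = 5.0 * math.log10(new_dist_pc / old_dist_pc)
    new_mags = {}
    for band, mag in mags.items():
        # Gaussian error applied to the flux, then converted to magnitudes.
        flux_factor = max(1.0 + rng.gauss(0.0, flux_frac_sigma), 1e-6)
        new_mags[band] = (mag + dist_shift
                          + EXT_COEFF[band] * (new_av - a_v)
                          - 2.5 * math.log10(flux_factor))
    return new_mags, 1000.0 / new_dist_pc, new_av
```

Calling this several times per real star, with a seeded `random.Random` instance for reproducibility, would give the per-star multiplier that the text treats as a hyperparameter.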

Thus, to compensate, several steps were taken. First, many of the strings can be subdivided into populations more homogeneous in age. In Paper I, some of the strings had to be manually assembled from smaller subgroups based on the coherence in phase space. Furthermore, their HR diagrams were visually examined to identify populations of different age sequences in close proximity (e.g., as is the case with ρ Oph and Upper Sco), and some attempt was made to separate them into subgroups. We used Auriga on these subgroups to generate a somewhat more granular distribution of ages in the training sample.

Second, to achieve a better consistency with the ages of well-studied star forming regions, we identified the moving groups from Paper II that correspond to the populations listed in the Handbook of Star Forming Regions (Reipurth 2008b,c), and assigned them the more appropriate ages that are reported in the Handbook (see Table 1 for group identifiers and corresponding ages).

Table 1. Ages different from Paper II assigned to young populations in the training set.

Region              Source                          Age (Myr)
Ara OB1a            Wolk et al. (2008)              3
Carina: Tr 16       Smith & Brooks (2008)           3
Cep OB2a            Kun et al. (2008)               7
Cep OB2b            Kun et al. (2008)               3.7
Cep OB3b            Kun et al. (2008)               4
Cep OB6             Kun et al. (2008)               38
Chamaeleon          Luhman (2008)                   2
CrA                 Neuhäuser & Forbrich (2008)     6
Cyg OB1             Reipurth & Schneider (2008)     7.5
Cyg OB2             Reipurth & Schneider (2008)     5
Cyg OB3             Reipurth & Schneider (2008)     8.3
IC 348              Bally (2008)                    2
IC 1396             Walawender et al. (2008)        1
IC 5146 & W4        Herbig & Reipurth (2008)        1
LK Hα 101           Andrews & Wolk (2008)           0.5
Lagoon              Tothill et al. (2008)           1
Lower Cen/Crux      Preibisch & Mamajek (2008)      16
Lupus               Comerón (2008)                  3.2
Monoceros           Carpenter & Hodapp (2008)       6
NGC 1333            Walawender et al. (2008)        1
NGC 2264            Dahm (2008)                     3
NGC 6383            Rauw & De Becker (2008)         2
NGC 6604            Reipurth (2008a)                4.5
NGC 6823            Prato et al. (2008)             5
Per OB2             Bally (2008)                    6
Rosette Nebula      Román-Zúñiga & Lada (2008)      3
Serpens             Herczeg et al. (2019)           3
Sh 2-234-Stock 8    Reipurth & Yan (2008)           2
Taurus/Auriga       Kenyon et al. (2008)            1
Upper Cen/Lup       Preibisch & Mamajek (2008)      17
Upper Sco           Preibisch & Mamajek (2008)      5
ρ Oph               Wilking et al. (2008)           0.3

Third, individual ages of some young stars are available in the literature. Namely, we have included the sources in the catalogs from Palla & Stahler (2002); Kun et al. (2009); Delgado et al. (2011); López Martí et al. (2013); Fang et al. (2013); Kumar et al. (2014); Herczeg & Hillenbrand (2014); Getman et al. (2014); Erickson et al. (2015); Azimlu et al. (2015); Fang et al. (2017); Suárez et al. (2017); Prisinzano et al. (2018); Panwar et al. (2018) that have reliable parallaxes and that meet the age-dependent criteria for a source to be identified as PMS from Section 2.1. This added 6,248 stars.

Finally, the ages have been reevaluated in the Orion Complex. As this region contains the largest number of stars of any population in the catalog, and it contains stars that span in age from <1 to 12 Myr, it is of particular importance for the training sample from Orion to achieve good accuracy across different age bins. The region does have a rather complicated morphology that could not be broken into subgroups in Paper I. However, a different, more granular analysis was performed by Kounkel et al. (2018), using hierarchical clustering of the 6-dimensional phase space to segment the Complex into 190 different groups, and an average age was estimated for each group. The sample in that work, however, is limited to the stars almost a magnitude brighter than the sample in Paper II. Excluding the fainter stars does introduce a bias in the training process. Thus, to shuffle these low mass stars into the most appropriate group, we created a simple fully connected neural network that has one layer with 300 neurons, taking in α, δ, $\mu_\alpha$, $\mu_\delta$, and π, and outputting a probability of belonging to each one of the 190 groups. The members of the Orion Complex from Paper II that were not included in the catalog from Kounkel et al. (2018) were then assigned to the group with the highest probability. Then, each star was given a label of the average age of the group it was in.

Similarly to the classification training sample, to help negate some of the potential spatial biases, the training sample for the age regression was augmented. As above, the synthetic sample was generated from the real data through simulating the effects of changing the distance and extinction.

2.3. Evaluation sample

To evaluate the performance of the classification and regression models, we downloaded the Gaia DR2 data that satisfied the following quality criteria:

phot_bp_rp_excess_factor > 1 + 0.015 × bp_rp^2
phot_bp_rp_excess_factor < 1.3 + 0.06 × bp_rp^2
ruwe < 1.4

phot_g_mean_flux_over_error > 10
phot_bp_mean_flux_over_error > 10
phot_rp_mean_flux_over_error > 10
parallax > 0.2
parallax/parallax_error > 10 or parallax_error < 0.1

These criteria are adapted from the quality cuts by Lindegren et al. (2018). They also ensure that the sources are nearby enough to have a complete coverage of the volume of space over which low mass PMS stars are detectable. All three fluxes from Gaia DR2 are required to be detected with high signal to noise. 2MASS fluxes can be undetected, in which case they are set to a constant outside of their maximum range (see Section 3.2).

The resulting sample consists of ∼139.3 million stars. We note that the models presented in the paper can be expected to work even if some of the selection restrictions are relaxed, although this is not done in this paper for the purposes of computational expediency.

3. SAGITTA NEURAL NETWORK

3.1. Network Architecture

For flexibility and effectiveness in model structure, we implement a set of three convolutional neural networks (CNNs) where each network serves a distinct purpose. The first network generates the extinction map based on galactic position. The second network assigns each star a probability of being pre-main sequence based on its photometry (using the training sample described in Section 2.1). The third network approximates the ages of PMS stars, typically ranging from < 1 Myr to > 40 Myr (using the training sample described in Section 2.2).

All three networks within Sagitta share a common architecture adapted from the MNIST configuration of an online repository of models¹, which was further adapted in Paper II for Auriga. However, there are important differences between these models. The aforementioned examples use 2-dimensional arrays as inputs (e.g., for Auriga the inputs are the fluxes across various bands for each star in a cluster). In the case of Sagitta though, the input is only a 1-dimensional array representing parameters of a single star, so the operational dimensions (such as convolution and pooling) in the network have been reduced.

¹ https://github.com/eladhoffer/convNet.pytorch/blob/master/models/mnist.py

The networks were implemented using PyTorch (Paszke et al. 2017). The model’s architecture broadly consists of three main segments (Figure 3). The first segment is made up of four 1-dimensional convolutional layers that gradually increase the number of channels to a maximum size of 128. In this segment, only the first and third convolutional layers are followed by max-pooling layers, but each of the convolutional layers or convolutional and max-pooling layer pairs is followed by ReLU and Batch-Normalization operations. The second segment consists of a convolution layer that reduces the number of channels down from the output of the first segment, followed by a data reshaping operation that transforms the convolution layer’s output into a 1-dimensional list (width × channels → width only). Once reshaped, the data are then fed into the third segment, which consists of three fully connected linear layers where each layer is followed by a ReLU operation. These final layers expand and then contract the processed data down to a single scalar output value. For the $A_V$ estimation and age estimation models the output value is kept as is (i.e., linear output activation); however, in the case of the YSO classifier model this final value is then fed into a logistic sigmoid function to bound the output probabilities between 0 and 1.

3.2. Data Handling

In the process of passing the data through the model, it is beneficial to first normalize all of the input parameters to a similar range with mean close to zero. This creates a more comparable dispersion of input parameter magnitudes and mitigates potential issues with numerical stability or inherently biasing any input because of its original scaling. Thus, all the parameters for both classification and regression were linearly scaled to the range of [−1, 1] based on the lower and upper bounds specified in Table 2. Although the overall distribution of either parallaxes or fluxes is not Gaussian, thus resulting in a skewed distribution that is dominated by values towards one of the ends of the normalization, the overall bounds were considered to be sufficiently effective.

A number of sources in the catalog may have one or more fluxes missing (most commonly in the 2MASS bands). The non-detection usually carries meaningful information, as it shows that the source is fainter than the detection limit, and limiting the catalog to only the sources for which complete data across the different bands are available would not be optimal. Nonetheless, neural networks are unable to handle null values as inputs.
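The linear scaling to [−1, 1] described in Section 3.2 amounts to one affine map per parameter. The sketch below is an illustration under stated assumptions: the bounds are copied from Table 2 of the paper, while the dictionary keys and function names are hypothetical and not taken from the Sagitta code.

```python
# (lower, upper) bounds per parameter, as listed in Table 2.
# ASSUMPTION: the key names are hypothetical labels chosen for this sketch.
BOUNDS = {
    "parallax": (0.0, 29.0),  # mas
    "av": (0.0, 5.0),         # mag
    "g": (4.0, 20.0),         # mag
    "bp": (0.0, 21.0),        # mag
    "rp": (0.0, 19.0),        # mag
    "j": (0.0, 18.0),         # mag
    "h": (0.0, 18.0),         # mag
    "k": (0.0, 18.0),         # mag
    "age": (6.0, 8.0),        # dex
}


def scale(name, value):
    """Map value linearly so that lower -> -1 and upper -> +1."""
    lower, upper = BOUNDS[name]
    return 2.0 * (value - lower) / (upper - lower) - 1.0


def unscale(name, scaled):
    """Inverse of scale(); e.g. turns a regressor output back into an age in dex."""
    lower, upper = BOUNDS[name]
    return lower + (scaled + 1.0) * (upper - lower) / 2.0
```

With these bounds an age of 7.0 dex (10 Myr) maps to 0, and a value outside the stated bounds simply maps outside [−1, 1] rather than being clipped; whether the authors clip such values is not stated.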

[Figure 2 panels. Top: sky map of the training PMS stars, color-coded by Age (Myr), 0 to 40. Bottom left: $M_G$ versus $G_{BP} - G_{RP}$. Bottom right: $M_{G_0}$ versus $[G_{BP} - G_{RP}]_0$. Both bottom panels are color-coded by Age (Myr).]

Figure 2. Top: The distribution of PMS stars on the sky that were used as part of the training process. The sources are color coded by the age assigned to them. Bottom: HR diagram of stars in the input dataset, uncorrected (left) and corrected (right) for extinction. Grey points represent evolved stars used in the classifier training, while color coded points represent pre-main sequence stars weighted by isochrone age, used to train the age regressor model.

Thus, to allow the inclusion of sources with incomplete data, we set the missing values of both the training and evaluation set to the upper limit specified in Table 2. These upper limits are typically somewhat fainter than the detection limit in each band, allowing the network to learn to give these fluxes an appropriate weight.

Another aspect that was considered when structuring the data was the exact ordering of the stellar input parameters given to the YSO Classifier and Age Regressor networks. Convolutional layers in a CNN operate with the use of a sliding filter component, wherein only data spatially close enough together to fit inside the filter can be used to directly detect patterns in that layer. The input features were ordered as $A_V$, π, $G$, $G_{BP}$, $G_{RP}$, $J$, $H$, and $K$ to preserve the rough ordering of the bandpasses with the wavelength, and to keep π and $G$ adjacent as they are measured in the same dataset; attaching π at the end would have resulted in π being associated

with occasionally incomplete 2MASS data. Similarly, $A_V$ has the strongest effect in the optical portion of the spectrum. However, as the initial width of the sliding filter is 5 elements being convolved together, the order should not have a very strong effect on the final results. Given that the features are not spatial in the traditional sense, we also tried using a fully connected deep neural network instead of the CNN, but found it to perform somewhat worse.

Table 2. Normalization constants used in training.

Parameter         Lower   Upper
π (mas)           0       29
$A_V$ (mag)       0       5
$G$ (mag)         4       20
$G_{BP}$ (mag)    0       21
$G_{RP}$ (mag)    0       19
$J$ (mag)         0       18
$H$ (mag)         0       18
$K$ (mag)         0       18
Age (dex)         6       8

3.3. $A_V$ estimate

In order to help with differentiating the PMS stars from reddened massive main sequence stars and red giants that scatter into the parameter space PMS stars inhabit, one parameter that can help is an estimate of $A_V$ along the line of sight.

This estimate was obtained from the neural network used to generate the completeness map in Paper II. The model was not modified in any way; rather, it is ported into Sagitta directly as the first step.

The $A_V$ estimator uses the same architecture as the classifier and the age regressor. It was trained on 3 million randomly chosen stars from Gaia DR2 that had the same quality constraints as the ones imposed in Section 2.3 and with measured $A_G$ reported in the catalog. The network used l, b, and π to predict $A_V$ (scaled from $A_G$ by a factor of 0.859, Marigo et al. 2017) corresponding to the particular 3D spatial position. Although transforming extinction from one bandpass to another can be a complex process, the linear transformation was done for the sake of nomenclature. As all of the parameters are also normalized through linear scaling, the net result is comparable to training on $A_G$ directly.

In training, positions were normalized from 0 to 1 for l from 0 to 360°, b from −90 to 90°, π from 0 to 5 mas, and $A_V$ from 0 to 5 mag, and the individual measurements of π or $A_V$ were allowed to exceed the maximum to be > 1 after the normalization. The training was done

[Figure 3 diagram; box labels as extracted, top to bottom: S P: A, P, G, BP, RP, J, H, K → C1 (5, 32) → MP1 (2) → RLU & BN → C1 (3, 64) → RLU & BN → C1 (3, 64) → MP1 (2) → RLU & BN → C1 (3, 128) → RLU & BN → D L → C1 (1, 10) → R → L (60, 512) → RLU → L (512, 512) → RLU → D L → L (512, 1) → O A (S I) → YSO A P.]

Figure 3. Architecture of the model used by Sagitta. This architecture is used by each of three networks in this work, independently of each other.

1/2 9

using the Adam optimizer, mean square error loss, and a learning rate of 10^-3.

The resulting estimate of A_V is consistent to within 0.3 mag with the extinction map from Green et al. (2019) over the applicable volume, as well as with the cluster A_V values estimated on the population level through pseudo-isochrone fitting with Auriga (Paper II). The resulting 3-dimensional extinction map is shown in Figure 4.

Figure 4. A slice of sphere across the galactic plane, showing the 3-dimensional distribution of the sources in the evaluation sample, color-coded by the predicted A_V values. [Axes: X (pc), Y (pc); color bar: A_V]

This is sufficiently precise for the purpose of this paper, as both the classifier and the age regressor do not depend on the absolute magnitude of A_V (or A_G) directly. Rather, they rely on the non-linear correlations that are present in the data that can be inferred with the help of this parameter, and they can learn to compensate for color-dependent systematic differences that may be present.

3.4. YSO Classifier

3.4.1. Training

The classification network was trained to perform a binary classification task, with classes 0 (not PMS) and 1 (PMS). It maps the stellar parameters A_V, π, G, G_BP, G_RP, J, H, and K to a single scalar output in (0, 1) representing the star's probability of being PMS, using a logistic sigmoid output activation. The model is trained to maximize the log-likelihood of the training data (equivalent to minimizing the cross-entropy loss).

The data from Paper II was partitioned into three disjoint sets, each serving a distinct purpose in classifier development. The training set, containing the first 80% of the sources, comprised the stars on which the model was trained. The development set, containing the next 10% of the sources, was used during training to evaluate model performance but was never shown to the model as examples. The testing set, containing the last 10% of sources, was only used after tuning to confirm generalization.

With the training set in place, we augmented the catalog in order to increase performance. The subsets that were sampled for augmentation included PMS stars from Paper II, the non-PMS stars from Paper II, and the randomly selected 3 million star sample from Gaia DR2 (see Section 2.1). Due to the difference in size of these subsets, the stars in each subset were sampled a certain number of times, changing π and A_V to affect the flux, to produce the augmented sample. The number of times each star was sampled was treated as a hyperparameter in the training process³. Multiple models were trained on all permutations of the augmentation ratios; however, adjusting the ratios did not appear to have a strong impact on the performance of the model.

Finally, the evaluation sample described in Section 2.3 (which consisted mostly of sources for which we did not have explicit a-priori labels) was used for final testing and to compare the performance of various models by examining known star forming regions, regions of high extinction, and other features of the solar neighborhood (Section 3.4.2).

To increase the classifier's performance in distinguishing PMS stars, during the data augmentation phase we oversampled PMS stars (i.e., used stratified sampling), yielding 15% PMS stars during training (compared to 1.6% in the training set). By improving class balance, the model had to focus more on correctly identifying and reducing contamination in the PMS star class. However, while the initial augmentation improved sensitivity to more distant PMS stars compared to the unaugmented sample, continuing to grow their number in training through augmentation would not necessarily result in a better classifier, as it provides diminishing returns. Once the training had completed, the output detection threshold used for extracting the PMS stars was selected based on the methods described in Section 3.4.2. Fine tuning of the PMS to non-PMS star ratio was found to not have a significant impact on the performance of the model.

Hyperparameter tuning and early stopping were also employed to help improve classifier predictions. The possible settings for each of the tuned hyperparameters are listed in Table 3. The models in the hyperparameter sweep were trained until the development set loss failed to beat its best loss for 20 successive epochs, at which point that model's training stopped. During each

Table 3. Classifier Hyperparameter Tuning Values

Hyperparameter                  Values
Optimizer                       Adadelta, Adagrad, Adam, RMSProp, SGD
Learning Rate                   0.001, 0.01, 0.1
Dropout                         0%, 10%, 30%, 50%, 70%
Minibatch Size                  5000, 10000, 25000, 50000
Weight Decay                    0, 0.00001, 0.0001, 0.001

Subset                          Star Sample Rate
PMS stars from Paper II         20, 25, 50
Non-PMS stars from Paper II     3, 5, 10
Random stars from Gaia DR2      1, 2
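The stopping rule applied during the sweep over the Table 3 values (training halts once the development set loss has failed to beat its best value for 20 successive epochs, and only the best snapshot is kept) can be illustrated with a minimal sketch. The function name `early_stopping_epoch` and the list of per-epoch development losses are hypothetical stand-ins for a real training loop; they are not taken from the Sagitta code.

```python
def early_stopping_epoch(dev_losses, patience=20):
    """Return (best_epoch, best_loss, last_epoch) for a sequence of dev losses.

    dev_losses: per-epoch development set losses (a hypothetical stand-in
    for the values a real training loop would produce).
    patience: number of successive epochs without improvement before stopping.
    """
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    for epoch, loss in enumerate(dev_losses):
        last_epoch = epoch
        if loss < best_loss:
            # New best loss: this is where a model snapshot would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # The loss has failed to beat its best for `patience` epochs.
            break
    return best_epoch, best_loss, last_epoch


# Illustrative losses only: improvement for two epochs, then a plateau.
losses = [1.0, 0.5] + [0.6] * 30
best_epoch, best_loss, last_epoch = early_stopping_epoch(losses)
```

In this sketch a loss equal to the current best does not count as an improvement, matching the wording "failed to beat its best loss".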

model's training, only the snapshot of the model with the best development set performance was saved. Once the sweep finished, each model's predicted outputs on the development set were used to visually confirm that the model was predicting desired values. The final model used in the pipeline was chosen based on its low development set loss and qualitatively good predictions.

Through the hyperparameter sweep, it was found that using very little or no weight decay consistently provided models with the best development set performance. None of the other tuned hyperparameters appeared to produce any significant improvement in model performance one way or another. The configuration of hyperparameters that produced the best model consisted of Adagrad for the optimizer, 0% dropout, a batch size of 5000, a learning rate of 0.01, and a weight decay of 10^-5.

3.4.2. Classifier validation

The Upper Sco region, which contains the ρ Oph dark cloud, is particularly useful in evaluating how reliably the classifier can discriminate between bona fide YSOs and the contamination from evolved stars that have been reddened into the PMS parameter space. It is located nearby, with π ∼ 7 mas. No other star forming regions are known to be located behind it, nor are there likely to be any distant undiscovered populations behind it, given its high elevation above the Galactic Plane, at b ∼ 20°. Therefore, if any sources identified as PMS are located far beyond, e.g., 200 pc, they are most likely to be false positives. The precision in distance that can be inferred from the parallax decreases the further away a star is, while the number of field stars in each parallax bin increases. Therefore, the large parallax of Upper Sco makes it particularly easy to separate the stars associated with this region from the false positives, to a degree that even other star forming regions along the Gould's Belt do not.

Figure 5. Identified PMS sources with > 70% probability towards ρ Oph and Upper Sco, plotted over the Gaia DR2 map of the sky (Gaia Collaboration et al. 2018). As the ρ Oph dark cloud has high extinction, it is clearly visible in this map. Note the highest confidence PMS sources are tracing the known regions of star formation. On the other hand, sources with lower probability tend to be co-located along the protruding filaments that are not actively forming stars.

As ρ Oph is very young and deeply embedded in a dusty cloud, the line of sight extinction rises significantly behind the cloud, offering an excellent test of the sensitivity of contamination to reddening (Figure 5). Similarly, the cloud has a very particular shape, with several long filaments protruding away from the center; these filaments are not actively involved in star formation. Even without considering their distance, contamination due to extinction can be apparent if the identified PMS candidates follow the outline of the cloud too well.

With this in mind, we imposed several criteria to evaluate the different classifier models trained with different hyperparameters. Ordering the model predictions from the highest PMS probability to the lowest, for sources within the box of 345° < l < 360° and 15° < b < 25°, we identified the typical probability threshold for each model where the number of sources with π < 5 mas begins to match the number of sources with π > 5 mas in a given probability bin (i.e., the point where the rate of contamination/false positives is comparable to the rate of adding bona fide PMS stars/true positives). The best model needs to

• Maximize the overall number of π > 5 mas sources that the model identifies up to that point

• Minimize the ratio [number of sources with π < 5 mas]/[number of sources with π > 5 mas], i.e., minimize the overall contamination fraction up to that point.

Although these criteria have been optimized for the selection of the Upper Sco sources, they generally yield a good selection of PMS candidates in other nearby star forming regions as well.

Selection of regions beyond 1 kpc presents a bigger challenge, as their lower mass members tend to be too faint to be within the sensitivity limits. Thus, members of various star forming regions located at those distances tend to have overall lower probability than their more nearby counterparts. Because of this, it is difficult to estimate the contamination among them.

However, almost all of the identified YSOs beyond 1 kpc should be located close to the Galactic plane. While this is generally the case, due to the scanning law of Gaia, there is a slight excess of (most likely false positive) PMS candidates that are found towards the ecliptic poles in almost all models. Such false positives are usually fainter stars that do not have 2MASS photometry. Depending on the exact limiting threshold, this usually amounts to a few hundred stars. Therefore, in evaluating the best model we also consider maximizing the number of sources with π < 1 mas and |b| < 15° and minimizing the number of sources with π < 1 mas and |b| > 20°.

With all of these considerations, we tested > 100 different models that were trained using different hyperparameters and with slightly different architectures. Most models had comparable performance in the evaluation sample, although some had greater difficulty in separating false positives from true positives.

Figure 6. HR diagram of the evaluation sample. Top: Color coded by the maximum probability of identifying a star as PMS in each 2-dimensional bin, in two distance slices. Bottom: Color coded by the mean age in each 2-dimensional bin. The bottom left panel shows the sample with PMS probability > 90%, the bottom right panel shows the sample with > 70% probability. They are plotted over the greyscale HR diagram of a random subset of the full evaluation sample. [Axes: M_G vs. G_BP − G_RP. Top panels: π > 1 and 0.5 < π < 1, color bar: PMS probability. Bottom panels: 90% probability and 70% probability, color bar: Age (Myr)]

Of all of them, however, one model had almost an order of magnitude better performance than the rest in the combined evaluation metric, although it is unclear whether this was due to more optimal tuning of the hyperparameters or to luck in the process of the stochastic gradient descent. Regardless, this classifier model was chosen to

be implemented into Sagitta. The HR diagram showing the outputs of this model is shown in Figure 6.

We note that while some data driven approaches (such as, e.g., decision trees) may be reduced to a human-readable set of conditions by which classification takes place, this is not the case with deep learning. As such, while the model can differentiate PMS and non-PMS stars based on their fluxes (most likely noting in some fashion that PMS stars tend to be redder and/or overluminous compared to main sequence stars, but not in the parameter space inhabited by red giants), it is difficult to express precisely how these fluxes are utilized by the model. Although beyond the scope of the current study, one could employ model interpretability methods (e.g., Zeiler & Fergus 2014) in an attempt to gain some insight into model behavior.

To ensure that no residual biases in distance from massive populations propagate to the model, we examine a synthetically generated sample of stars based on the sample from Paper II, similar to the one described in Section 2.1. All of the synthetic stars have randomly drawn distance and extinction, and include both the stars labeled as PMS and those that were more evolved. There are some differences in the fraction of sources recovered at distances between 20 and 5000 pc relative to the input sample (e.g., a smaller fraction of more distant sources is recovered at the same probability threshold, in part due to a smaller fraction of low mass stars). However, in a uniform sample the model does not systematically favor a specific set of distances corresponding to, e.g., the distance of the Orion Complex or Sco Cen OB2, either in the form of better recovery of true positives than for stars at other distances, or in the form of contamination from evolved stars (Figure 7).

Figure 7. Recovery of PMS stars in a synthetically generated sample as a function of distance. A similarly smooth distribution (with a higher recovery fraction at a cost of more false positives) can be seen at lower thresholds as well. [Horizontal axis: Distance (pc). Legend: Synthetic stars; True positives, P > 0.9; False positives, P > 0.9]

3.5. Age Regressor

Similarly to the classifier, the regression network to predict ages was trained on the six photometric bands, π, and A_V. In addition to the Gaia and 2MASS bands, we originally considered including the photometry from the AllWISE catalog as well, but it was determined to be too noisy.

In constructing the catalog, we used 30% real data and 70% augmented data scattered across different distances and extinctions, for a total of ∼187,000 sources. 80% of this catalog was used as a training set, 10% was used as the development set, and 10% was withheld as a test set. The training was done using the Adam optimizer, mean square error loss, a learning rate of 10^-3, and a batch size of 20,000 sources. Every 10 epochs we evaluated the performance against the development set, to ensure that the network was not overfitting on the training set (e.g., 'memorizing' it), but rather that it was learning patterns that generalize to previously unseen data.

The training continued for ∼10,000 epochs. Afterwards, we continued to train the model for ∼2,000 epochs on real data only, to minimize potential artefacts that may be present in the augmented sample. However, we note that in evaluating the ages on the test set, there were no significant systematic differences between the models with and without the additional 2,000 epochs on real data. Similarly, in various experiments, a few combinations of hyperparameters were tested, but they tended to have comparable outputs with few obvious differences in performance (in contrast to the experiments with hyperparameters in the classifier). Instead, for the age regression, the biggest gains in performance were a result of careful vetting of the labels in the training sample.

In general, the trained model is able to qualitatively reproduce the average ages of the populations in which stars are found (Section 4.2). It does improve on the ability to infer the star forming histories of different regions compared to the training sample, where usually only a single age per population was available.

We are able to benchmark the age estimates for some of the stars that have been previously observed by the APOGEE spectrograph. A recent study by Olney et al. (2020) has been able to extract calibrated log g estimates for pre-main sequence stars, which can be used as a proxy of age. The overall trend in Figure 8 does show that, as expected, log g increases as stars evolve and approach the main sequence.

3.6. Uncertainties

Convolutional neural networks by default do not account for the uncertainties in the data, nor do they output the corresponding uncertainties in the predictions. Although some other machine learning architectures may be better positioned to learn the average uncertainty in the data and provide a resulting Bayesian posterior distribution, even they struggle to accurately parse individual source-by-source, input-by-input uncertainty, such as can be present in, e.g., photometry or parallaxes.

As an alternative, we utilize the method used by Olney et al. (2020) and in Paper II, generating 1000 samples per star for the sources in Table 4, where all the inputs are scattered by adding errors drawn from a normal distribution with a width corresponding to the reported uncertainty. Each one of these realizations of the same star was passed through the network. Uncertainties were estimated by calculating the standard deviation of the outputs. By using this method, our uncertainties are indicative of both the model's stability in any given photometric regime and the underlying photometric errors present in the input data. If photometric errors were not available in any bandpass (this occasionally occurs for 2MASS data, even if the fluxes themselves are available), they were assigned an uncertainty of 0.1 mag.

Despite the efficiency of neural networks, it is still time consuming to process the entire Gaia catalog even once, let alone several times, particularly on machines without GPU acceleration. Thus the statistics were generated only on the subset of the evaluation sample that has been classified with unaltered inputs with a probability of being PMS > 70%.

When comparing the classification outputs for the unaltered sample with the mean classification from the 1000 altered iterations that account for the uncertainties, the mean classifications do appear to be somewhat more accurate and better able to filter out suspicious sources (such as those described in Section 3.4.2 as likely false positives). The mean classification probability is also typically lower; thus we are not likely to miss a significant number of sources for which the mean classification returns a higher confidence.

As expected, the scatter/errors increase for sources with fainter and more uncertain fluxes, as well as for those that are more distant and have more uncertain parallaxes. The scatter in the classification outputs also increases for sources with lower certainties (and, as a consequence, for those that are older).

The computed uncertainties in age are typically on the order of 0.1 dex (Figure 9). They do not strongly depend on the age or color of a star, although there is a slight dependence on distance, with the nearby populations having somewhat lower uncertainties.

Figure 8. Comparison of the estimated ages from Sagitta vs log g inferred from the APOGEE spectra from Olney et al. (2020). The overdensities correspond to specific discrete clusters targeted by APOGEE. The lines show the theoretical PARSEC isochrones (Marigo et al. 2017) for stars with mass from 0.4 (purple) to 1 M⊙ (red). [Axes: log g (dex) vs. Age (Myr)]

4. EVALUATION AND VALIDATION

4.1. Overall performance

Table 4 lists the sources in the evaluation catalog that can be identified as PMS sources with at least 70% confidence. This catalog consists of 197,315 sources. Figure 10 shows the distribution of the identified sources on the sky, according to the different cuts in confidence levels, color-coded by their estimated ages. Figure 11 shows the spatial distribution of stars in different age slices, to better highlight the star forming history of the solar neighborhood.

The evaluation catalog extends out to 5 kpc; however, at larger distances, increasingly fewer low mass stars can be detected. Therefore, 90% of all sources classified as PMS are located within 1 kpc, and 70% are located within 500 pc. The difference becomes more extreme at particular age ranges. Only lower mass stars can still be identified as PMS at older ages (e.g., > 30 Myr); thus, the ability to identify them at larger distances is suppressed compared to younger (e.g., < 5 Myr) stars. Similarly, the confidence with which PMS stars can be identified tends to be lower both for sources that are older and for sources that are more distant (compared to younger nearby sources), as both are located preferentially closer to the main sequence and/or the red giant branch.

We note that only ∼30,000 out of ∼200,000 candidate PMS stars presented here were also used in the training and testing sample from Paper II. Furthermore, while some of them may have also been previously included in

other studies of pre-main sequence stars (most notably in Zari et al. 2018, see Section 4.3.1), the bulk of the catalog are new identifications.

Figure 9. Errors produced for sources with YSO probability > 70% by scattering model inputs by their reported uncertainties. [Panel rows: Predicted Age Error (dex), YSO Probability Error, and Predicted A_V Error (mag). Horizontal axes: Age (dex), YSO Probability, or A_V (mag) in the left column; Parallax (mas) in the middle column; G_BP − G_RP in the right column]

A few notes of caution are warranted regarding unresolved binaries. The classifier largely avoids the binary sequence of evolved stars, particularly at higher probabilities (Figure 12). No confusion occurs for sources younger than ∼40–60 Myr, as they are located above the binary sequence. However, some contamination from binaries can be present at lower probabilities (e.g., at thresholds < 70–80%). For example, at the 70% threshold there is a hint of an overdensity from the binary stars in Praesepe, which is a 600 Myr cluster that does not contain PMS stars. At the 80% threshold, this overdensity disappears.

The reason for this is that the sources on the binary sequence have been included in the training sample; thus the classifier has learned that the colors that correspond to these main sequence binaries are likely false positives. It cannot effectively separate true single PMS stars older than 60 Myr that overlap with the binary sequence. However, as such young stars are rare in

Figure 10. The distribution of stellar positions of the evaluation sample at PMS probability thresholds of 95%, 90%, and 85%. The plots are in Galactic coordinates and are color coded by the predicted age, and the first panel has been annotated to indicate notable star forming regions. Note that older sources are more apparent at lower certainty thresholds, as they are located closer to the main sequence. [Annotated regions: Upper Sco, ρ Oph, Serpens, Vela, Taurus & Perseus, Orion. Color bar: Age (Myr), 0 to 40]

Table 4. Sample of output catalog based on DR2, with age, classification, and predicted extinction.

Gaia DR2 Source ID     l (deg.)   b (deg.)   Predicted Age (dex)   PMS Probability   Predicted A_V (3d pos., mag.)
25220048169856         176.382    -48.739    7.369 ± 0.177         0.740 ± 0.100     0.382 ± 0.008
2194034672517070720    97.309     9.949      6.817 ± 0.028         0.762 ± 0.131     0.822 ± 0.008
5224626096939825024    298.281    14.145     6.106 ± 0.032         0.955 ± 0.009     0.378 ± 0.001

Only a portion shown here. Full table is available in an electronic form.
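The ± values in Table 4 come from the Monte Carlo procedure of Section 3.6: each input is scattered by its reported uncertainty, every realization is passed through the network, and the standard deviation of the outputs is taken as the error. A minimal sketch of that idea follows. The function `monte_carlo_uncertainty` and the `toy_model` callable are hypothetical illustrations, not the Sagitta implementation, and the sketch assumes that a missing error is marked with NaN and replaced by 0.1 (the value the text assigns to missing photometric errors, in mag).

```python
import numpy as np


def monte_carlo_uncertainty(model, inputs, input_errors,
                            n_samples=1000, default_error=0.1, seed=0):
    """Scatter inputs by their errors and summarize the model outputs.

    model: callable mapping an (n_stars, n_features) array to (n_stars,)
        outputs (hypothetical stand-in for a trained network).
    inputs, input_errors: arrays of shape (n_stars, n_features).
    Returns the per-star mean and standard deviation of the outputs.
    """
    rng = np.random.default_rng(seed)
    inputs = np.asarray(inputs, dtype=float)
    errors = np.asarray(input_errors, dtype=float)
    # Assumption: missing errors are flagged as NaN and get a default value.
    errors = np.where(np.isnan(errors), default_error, errors)
    # Draw n_samples noisy realizations of every star at once.
    draws = rng.normal(loc=inputs, scale=errors,
                       size=(n_samples,) + inputs.shape)
    outputs = np.stack([model(draw) for draw in draws])
    return outputs.mean(axis=0), outputs.std(axis=0)


def toy_model(x):
    # Hypothetical stand-in for a network: sums the input features.
    return x.sum(axis=1)


inputs = np.array([[1.0, 2.0, 3.0]])
errors = np.array([[0.1, 0.1, np.nan]])  # NaN marks a missing error
mean, std = monte_carlo_uncertainty(toy_model, inputs, errors)
```

For this linear toy model the expected scatter is the quadrature sum of the three input errors, about 0.17, which the sampled standard deviation approximates.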

Figure 11. Distribution of stars at various age ranges for sources with 85% confidence threshold, color coded by parallax, to demonstrate the various features that emerge for different epochs of star forming history. Note the ring-shaped structure apparent in the last panel, discussed in Section 5.2. [Panels: Age < 2 Myr; 2 Myr < Age < 5 Myr; 5 Myr < Age < 10 Myr; 10 Myr < Age < 16 Myr; 16 Myr < Age < 25 Myr. Color bar: Parallax (mas)]
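The age slices of Figure 11 can be reproduced from a catalog such as Table 4 by converting the predicted age from dex to Myr and binning it. The sketch below is only an illustration: the name `age_slice_index` is hypothetical, and it assumes that the age in dex is log10 of the age in years, which is consistent with the 6 to 8 dex normalization range in Table 2.

```python
import numpy as np

# Bin edges in Myr, matching the panel titles of Figure 11.
AGE_EDGES_MYR = np.array([0.0, 2.0, 5.0, 10.0, 16.0, 25.0])


def age_slice_index(log_age_dex, probability, threshold=0.85):
    """Return the Figure 11 panel index (0 to 4) of each source, or -1.

    A source gets -1 if it falls below the probability threshold or is
    older than the last bin edge. Assumes log_age_dex = log10(age / yr).
    """
    age_myr = 10.0 ** (np.asarray(log_age_dex, dtype=float) - 6.0)
    # digitize returns i such that edges[i-1] <= age < edges[i].
    idx = np.digitize(age_myr, AGE_EDGES_MYR) - 1
    idx = np.where(idx < len(AGE_EDGES_MYR) - 1, idx, -1)
    return np.where(np.asarray(probability) >= threshold, idx, -1)


# The three rows of Table 4: predicted age (dex) and PMS probability.
log_age = [7.369, 6.817, 6.106]
prob = [0.740, 0.762, 0.955]
slices_85 = age_slice_index(log_age, prob)                  # 85% threshold
slices_70 = age_slice_index(log_age, prob, threshold=0.70)  # 70% threshold
```

With the 85% threshold only the third row of Table 4 is retained, in the youngest slice; at 70% all three rows are assigned to a slice.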

Figure 12. Left: HR diagram of the evaluation sample, color-coded by the average probability of a source being PMS in each 2-dimensional bin. On top of it, in blue, are the stars identified as double lined spectroscopic binaries in the APOGEE data (M. Kounkel et al., in prep). These sources preferentially trace the photometric binary sequence. Right: Distribution of probabilities of classifying these SB2s as pre-main sequence. Note that although fields associated with notable star forming regions (e.g., Orion, Upper Sco) have been excluded from the sample, some bona fide PMS stars may still persist in other fields. [Left panel axes: M_G vs. G_BP − G_RP, color bar: PMS probability, legend: This work, SB2s. Right panel horizontal axis: SB2 PMS Probability]

Table 5. Sample of output catalog based on EDR3, with age, classification, and predicted extinction.

Gaia EDR3 Source ID    l (deg.)   b (deg.)   PMS Probability   Predicted Age (dex)   Predicted A_V (3d pos., mag.)
110951890948096        176.087    -48.388    0.871 ± 0.019     7.421 ± 0.035         0.415 ± 0.008
265532058523264        175.988    -47.522    0.767 ± 0.054     7.435 ± 0.018         0.436 ± 0.027
313670051618560        174.933    -48.774    0.831 ± 0.102     7.872 ± 0.215         0.646 ± 0.050

Only a portion shown here. Full table is available in an electronic form.
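Since both Table 4 and Table 5 report each quantity with an uncertainty, the DR2 and EDR3 outputs for the same star can be compared in units of their errors. The text does not state how this comparison was computed; the sketch below assumes the two errors are independent and combined in quadrature. The function name `sigma_offset` and the input values are hypothetical and are not drawn from either catalog.

```python
import numpy as np


def sigma_offset(value_a, err_a, value_b, err_b):
    """Absolute difference of two measurements in units of combined error.

    Assumes independent errors added in quadrature; a result below 1
    means the two values agree within 1 sigma under that assumption.
    """
    diff = np.asarray(value_a, dtype=float) - np.asarray(value_b, dtype=float)
    combined = np.hypot(err_a, err_b)
    return np.abs(diff) / combined


# Illustrative values only (age in dex): 7.40 ± 0.05 versus 7.42 ± 0.04.
offset = sigma_offset(7.40, 0.05, 7.42, 0.04)
consistent = bool(offset < 1.0)
```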

comparison to main sequence stars, the classifier down-weights them both equally to minimize the loss. Thus, minimizing contamination from main sequence binaries results in a lack of stars > 60 Myr included in Table 4. With independently derived membership (such as from analyzing the distribution of stars in phase space in older regions with bona fide PMS stars, like α Per), it is nonetheless possible to estimate their ages without relying on the classifier.

Unfortunately, however, Sagitta is unable to separate pre-main sequence unresolved binaries from single PMS stars. In young populations where there is a clearly defined binary sequence for the cluster, the stars on that binary sequence get assigned preferentially younger ages, by ∼0.1–0.15 dex, than the single stars in the same cluster. This is consistent with the relative ages for single and binary stars that could be estimated with traditional isochrone fitting. Thus, the ages of sources that could be suspected to be unresolved binaries or tertiaries (such as in the cases where the binary sequence for a cluster is apparent, which can be seen in regions as young as 8 Myr with a mono-age population, e.g., Bouma et al. 2020, and especially in the younger regions without a clearly defined binary sequence) should be treated with care.

The recently released Gaia EDR3 (Gaia Collaboration et al. 2020) has changed the definition of the bandpasses compared to DR2. The parallax precision has also improved by ∼30%. There are no strong systematics in the performance of Sagitta when applied to EDR3, and the measurements of age and classification are generally consistent with those from DR2 within 1σ according to the reported errors. Applying the pipeline to all EDR3 sources that meet the same quality checks as described in Section 2.3 results in a larger catalog of stars that can be identified as likely PMS, but that is primarily driven by sources that previously did not meet the required precision in parallax and/or fluxes. The sources that are newly identified as PMS in EDR3 tend to be fainter and located at preferentially larger distances, extending the sensitivity limits of the survey. The resulting catalog is included in Table 5. We note that while the analysis in this paper, including the subsequent sections, is limited to the DR2 data, the overall conclusions are consistent in the sample derived from EDR3.

4.2. Properties of known star forming regions

Figure 13 shows the zoom-in view of various star forming regions that were used as a benchmark in evaluating the measured stellar ages. Figure 14 shows the corresponding distribution of ages for each individual SFR view.

4.2.1. Orion Complex

The Orion Complex is the closest region of ongoing massive clustered star formation, containing > 10,000 stars with ages of < 1 Myr to > 10 Myr. The youngest stars are found in the Orion A and B molecular clouds and older stars are found in Orion D and λ Ori (Kounkel et al. 2018). Sources from Orion represent a significant fraction of the input training set, and provide a valuable evaluation of the performance of both the classifier and the regression model.

The model recovers ∼7,500 sources at the > 95% threshold, and ∼9,000 at the > 90% threshold. The age gradients and the typical ages that have previously been observed by Kounkel et al. (2018) are well recovered by the model.

We note that the evaluation catalog has a hole near Trapezium, as nebulosity degrades the photometric quality. Thus, few sources met the cuts specified in Section 2.3. When Sagitta is applied to a separate catalog that is less constrained by the quality of inputs, it is able to recover both the members and the appropriate ages.

4.2.2. ρ Oph and Sco–Cen

The ρ Oph star-forming region has already been discussed as a means of evaluating the performance of the classifier, but both ρ Oph and the surrounding, slightly older Upper Scorpius region are also notable for their peculiar star forming history.

We recover the typical age of < 1 Myr for ρ Oph, and ∼5 Myr for Upper Sco (Preibisch & Mamajek 2008; Pecaut & Mamajek 2016). Although the transition between the two regions is rather sharp, without a significant age spread larger than ±1 Myr, there is some overlap between the two populations; furthermore, the eastern part of Upper Sco contains somewhat younger stars.

From Upper Sco, along the rest of Sco–Cen, we recover a relatively smooth gradient in age, reaching 15–20 Myr towards the Lower Centaurus Crux, that has previously been observed in other works. Members of the younger Lupus clouds (such as III and IV) are recovered with the characteristic age of ∼3 Myr. The southern portion of CrA that is still associated with the molecular gas has a typical age of ∼4 Myr; the northern portion that has since cleared its gas has an age of ∼6–8 Myr. Nearby, there is also a ∼20 Myr population that appears to be related to Sco–Cen.

While the average ages we measure for these regions are consistent with the literature values (as the literature values were originally used for training), we note that we are able to go from discrete region-specific estimates to a more homogeneous map of star forming history.

4.2.3. Vela

Towards Vela there are two unrelated populations found in a similar volume of space. One is Vela OB2, which is associated with γ Velorum and has a typical age of ∼10 Myr. The other population has an age of ∼30–35 Myr, and it contains NGC 2547 (Jeffries et al. 2009, 2014; Beccari et al. 2020).

Due to its youth, we recover Vela OB2 at higher confidence: the bulk of the members can be identified at the threshold of > 95%, containing ∼2,700 stars. On the other hand, NGC 2547 becomes apparent only with the threshold lowered to ∼85%, containing ∼10,000 stars. We recover the average ages of both these populations.

Vela OB2 in particular shows a curious star forming history. The stars that are located towards the northern group H (Cantat-Gaudin et al. 2019) are preferentially younger than those near the central part.

4.2.4. Taurus and Perseus

Due to their proximity and youth, the Taurus molecular clouds contain some of the best studied young stars. This region does not contain any clusters; rather, it is a collection of several diffuse clouds, some of which are up to 30 pc apart. The most up-to-date membership of this region is presented in Luhman (2018). We are able to recover most of these members within our evaluation catalog, as well as add a number of new candidates.

We are able to recover the typical age of ∼1–3 Myr for many of the previously known members. There has also been a suggestion of an older nearby ∼16 Myr population (Kraus et al. 2017), which we are also able to recover. As is the case with the younger stars, this population is an assembly of diffuse clumps of stars, resembling evolving cirrus clouds.

Along a similar sight line as Taurus (but at a somewhat larger distance) lies Perseus. We are able to recover the age of ∼6–8 Myr for Per OB2 (with some substructure), as well as ages of 1–3 Myr for the younger clusters IC 348 and NGC 1333 (Azimlu et al. 2015).

[Figure 13 panels: Orion (λ Ori, Orion A, Orion B, Orion D); Upper Sco and ρ Oph; Vela; Taurus and Perseus (Per OB2, IC 348, NGC 1333); Serpens (Serp Main, Serp NE, Serp Far South); Cygnus. Color scale: Predicted Age (Myr).]

Figure 13. Distribution of stars identified by Sagitta towards various star-forming regions, color-coded by stellar ages. Probabilities have been selected between 85% and 95% to best represent each region, based on the age and distance to each. To remove contamination from other nearby regions, the ages for Taurus and Vela have been restricted to < 15 Myr, and to highlight the more distant region of Cygnus, distance was restricted to > 500 pc.
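The cuts described in the caption of Figure 13 can be illustrated with a minimal sketch. It assumes the catalog has already been loaded into NumPy arrays; the function name `select_region` and its arguments are hypothetical and are not taken from the Sagitta code, and the distance is approximated as the inverse of the parallax, which is only reasonable for precise, positive parallaxes.

```python
import numpy as np

def select_region(pms_prob, age_myr, parallax_mas, *, min_prob=0.90,
                  max_age_myr=None, min_dist_pc=None):
    """Return a boolean mask of likely PMS stars passing optional cuts.

    All names here are illustrative; they do not mirror the Sagitta outputs.
    """
    pms_prob = np.asarray(pms_prob, dtype=float)
    age_myr = np.asarray(age_myr, dtype=float)
    parallax_mas = np.asarray(parallax_mas, dtype=float)

    # Classifier confidence threshold (e.g. 0.85 to 0.95 depending on region).
    mask = pms_prob > min_prob

    # Optional age cut, e.g. < 15 Myr to isolate the younger population.
    if max_age_myr is not None:
        mask &= age_myr < max_age_myr

    # Optional distance cut, e.g. > 500 pc for a distant region.
    if min_dist_pc is not None:
        with np.errstate(divide="ignore", invalid="ignore"):
            # Assumption: distance (pc) ~ 1000 / parallax (mas).
            dist_pc = np.where(parallax_mas > 0, 1000.0 / parallax_mas, np.nan)
            mask &= dist_pc > min_dist_pc  # NaN distances never pass
    return mask

# Toy values only, not real catalog entries.
prob = [0.97, 0.92, 0.80, 0.99]
age = [3.0, 20.0, 5.0, 8.0]
plx = [7.0, 6.5, 7.2, 1.0]
young = select_region(prob, age, plx, min_prob=0.85, max_age_myr=15)
distant = select_region(prob, age, plx, min_prob=0.85, min_dist_pc=500)
```

Keeping the threshold as a per-call parameter reflects the point made in the text that the appropriate confidence level differs from region to region.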

[Figure 14 panels: Orion Complex; ρ Oph and Upper Sco; Vela; Taurus and Perseus; Serpens; Cygnus. Horizontal axis: Age (Myr).]

Figure 14. Distribution of ages for star-forming regions in Figure 13. Note that the membership of each region was not assessed beyond the position on the sky and some distance and/or age ranges, therefore contamination from unrelated stars is present in each field of view.

4.2.5. Serpens

Serpens contains several young clusters located at a distance of ∼450 pc. Similarly to the work of Herczeg et al. (2019), we are able to recover stars towards Serpens Main (with the age of ∼2 Myr), as well as Serpens Northeast and Serpens far-South (with the age of ∼3–5 Myr). There also appears to be a substantial diffuse 5–8 Myr population surrounding them. Unfortunately, W40, likely the youngest region in this star forming complex, is too deeply embedded to be seen in the optical regime.

We recover ∼500 sources towards Serpens at the threshold of >95%, and ∼1500 at the threshold of >90%.

4.2.6. Cygnus

The star-forming regions in Cygnus (particularly Cygnus OB2) are located at a much larger distance than other regions discussed in this work. Furthermore, they are located behind a considerable layer of extinction (Wright et al. 2016). Because of this, they can only be recovered in full at lower thresholds from the classifier. Nonetheless, we are able to recover ages of ∼4–10 Myr for Cyg OB2.

4.3. Comparison to Other Catalogs

4.3.1. Zari et al. (2018)

Zari et al. (2018) identified pre-main sequence stars younger than 20 Myr within 500 pc using Gaia DR2 data. The available catalog provides three confidence intervals for pre-main sequence sources based on their kinematic distributions. Their total sample contains 43,719 sources, with 23,686 satisfying the strictest kinematic threshold.

In examining the full catalog from Zari et al. (2018), Sagitta confirms 18,488 of the objects they identify as pre-main sequence with >90% probability, and 23,115 with >80% (we note that in the full evaluation sample there are 30,689 stars younger than 20 Myr at 90% probability within 500 pc, and 44,225 at 80% probability). There is particularly good agreement in the identified stars located within ∼200 pc within the appropriate age bounds. The sources to which we assign lower confidence in classifications with Sagitta (<70%) are preferentially located at distances of >400 pc (close to the maximum distance of the Zari et al. (2018) study), and they tend to be bluer (G_BP − G_RP < 2). Most of these sources do not appear to trace known star forming regions; rather, they trace the extinction patterns. Similarly, they do not have much coherence in their radial velocities, as would be expected in young populations. It is likely that these sources are contamination from the main sequence in their sample (single or unresolved visual binary) due to extinction (Figure 15). However, it is possible that some of them do trace some patterns in the star forming history through massive stars (to which we lose sensitivity) that we cannot trace with lower mass counterparts.

[Figure 15, top panel: M_G versus G_BP − G_RP, with legend entries "All" and "YSO < 70%". Bottom panel: sky map color-coded by Parallax (mas).]

Figure 15. The top panel shows a Hertzsprung-Russell Diagram constructed from young stars in Zari et al. (2018), analyzed with Sagitta. Red points represent sources that Sagitta assigns < 70% likelihood of being pre-main sequence. The bottom panel shows the spatial distribution for the aforementioned sources with < 70% likelihood of being young, color coded by parallax.

4.3.2. Marton et al. (2016, 2019)

Marton et al. (2019) used Random Forest to classify pre-main sequence sources using Gaia DR2 and WISE photometry. Their catalog includes classifications for 101,838,724 Gaia sources, with 1,509,781 located within 5 kpc and having greater than 90% likelihood of being classified as YSOs. The classification was done only on the areas of the sky above a given opacity threshold using the Planck dust opacity map. On the surface level, this catalog recovers the underlying shape of various star forming regions (such as Orion A & B, ρ Oph, Taurus-Auriga, and others). However, this is somewhat misleading, as the opacity threshold pre-selects molecular clouds, and masks out young populations that are no longer associated with the molecular gas. Within that mask, however, the catalog is prone to contamination, even at a very high level of reported certainty.

As mentioned in Section 3.4.2, ρ Oph is a particularly useful region for evaluating contamination. We compare the performance of Sagitta versus the Marton et al. (2019) catalog in Figure 16. The left panel shows the distribution of parallaxes of the sources that both models identify with likelihood > 95% towards that star forming region. Notably, while the classification from Marton et al. (2019) does recover some sources at the appropriate parallax for the region (∼7.5 mas), the vast majority of the stars classified with high confidence as YSOs in their Gaia-ALLWISE model have distances that are more consistent with reddened background stars. In contrast, with the same confidence threshold, our classification identifies a larger number of bona fide YSOs in the appropriate parallax range overall, with only a small degree of contamination. Examining the Sagitta classifications using different confidence thresholds, and using the parallax information to assess the reliability of the classification, we see the number of likely contaminants decrease and the number of bona fide PMS candidates increase as the threshold increases. On the other hand, in the Gaia-ALLWISE catalog, the overall fraction of contaminants to bona fide YSO candidates remains mostly flat throughout the entire probability distribution, and a marginal degree of confidence is not achieved until > 99%.

A similar situation persists in the other star forming regions. The Gaia-ALLWISE catalog recovers many more candidate members of these regions than prior censuses have found, spread mostly uniformly within the outline of the clouds, regardless of the intrinsic underlying density distribution of those populations. As such, it is likely that their machine learning approach over-relies on extinction as a proxy for youth.

In total, our evaluation catalog recovers only 11,654 sources at the >70% confidence threshold in common with the catalog of Marton et al. (2019) for the sources they classify as YSOs at the 90% confidence threshold.

We note that in large part, the contamination in the Gaia-ALLWISE catalog is driven by noisy (and possibly partially mislabeled) data in the training sample, which then propagates to the noise in the predictions (to a lesser degree, this is also the issue in the work of Chiu et al. 2020, for distant sources). The issue is further compounded in applying a trained model to very uncertain data. When we analyze the full catalog of YSOs from Marton et al. (2019) that they classify at 90% confidence, out of 1.7 million stars, Sagitta would assign 1/4 of them > 70% confidence, and 1/10 of them > 90% confidence, and the outputs from Sagitta would also be strongly contaminated by red giants with a large parallax error. As various data-driven classifiers are becoming more commonplace, this highlights the importance of rigorous vetting of the data that are processed by machine learning algorithms (both in training and in evaluating), to fully understand the limitations and ensure that these algorithms are not applied indiscriminately.

Similarly, we examine the catalog from Marton et al. (2016), where they used Support Vector Machine methods to identify YSO candidates from WISE data alone, as that work predates the release of Gaia DR2. This catalog contains 133,980 Class I & II and 608,606 Class III YSOs, most of which are too reddened to be detected in the optical regime, making them difficult to evaluate. For the sources for which parallaxes are available, a significant fraction of them do appear to be reddened supergiants. Unsurprisingly, even in the subset of sources with optical emission, contamination in Class I & II sources is somewhat less prominent than it is in Class III, as the former tend to have peculiar colors from the protoplanetary disks, compared to the latter which are just naked photospheres. In total we only recover 3,722 sources from this catalog in common with our evaluation catalog at the threshold of >70%, or 17,472 without any quality checks on the data at the same threshold.

4.3.3. Vioque et al. (2020)

Vioque et al. (2020) used machine learning techniques to search for Herbig Ae/Be stars using Gaia, 2MASS, Sloan, IPHAS, VPHAS+, and WISE photometry. They identify 8,470 candidate PMS stars and 693 classical main sequence Be stars, and provide a list of 1,309 sources that could belong to either type with above 50% probability.

Their selection criteria are preferentially sensitive to stars that are more massive than the sources we are able to identify as PMS candidates in this work. Furthermore, based on the availability of IPHAS and VPHAS+ data, they are restricted to only ∼1° within the Galactic plane. Because of this, our classifier identifies only a few sources in common with this catalog. From their catalog we classify 3500 as PMS with Sagitta at the 80% threshold. These sources tend to be very young, with an average predicted age of approximately 5 Myr.

4.3.4. Kuhn et al. (2020)

[Figure 16 panels: histogram of Parallax (mas) for "Marton (2019) 95%" and "Sagitta 95%" towards ρ Oph; YSO Probability versus Parallax (mas) for Sagitta; YSO Probability versus Parallax (mas) for Marton (2019).]

Figure 16. Comparison of distribution of parallaxes for sources towards ρ Oph and Upper Sco star-forming region as identified by Marton et al. (2019) and Sagitta (this work) at different thresholds. Sources with small parallaxes are likely contamination due to extinction.

Recently, Kuhn et al. (2020) have performed a data-driven selection of dusty YSOs in the Spitzer data across the Galactic plane. As their catalog is primarily focused on very reddened sources that do not necessarily have good Gaia astrometry, the overlap of their selection with our evaluation catalog is minimal, only 456 stars. Similarly, of 36,423 sources that do have optical counterparts, regardless of data quality, our classifier would flag only 37% as likely pre-main sequence with >70% confidence. Nonetheless, the catalog does appear to be robust and complementary to our selection, identifying preferentially younger stars and providing a more complete selection, particularly in the more distant star forming regions, including some that we only barely recover (e.g., Sco OB1).

The age estimator part of Sagitta does appear to work well on this catalog, resulting in an average age of ∼4 Myr. Furthermore, it does appear to reveal some coherent age gradients in these star forming regions.

[Figure 17: X (pc) versus Y (pc), color-coded by age.]

Figure 17. Distribution of PMS stars (up to a confidence threshold of 80%) in the heliocentric rectangular reference frame, color coded by age, as a demonstration of the young stars tracing the outline of the Local Bubble. In X-Z and Y-Z projections, the distribution of PMS stars is largely planar, following the tilt of the Gould's Belt (See also Figure 18).
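As an illustration of the heliocentric rectangular frame used in Figure 17, the sketch below converts Galactic coordinates and parallax into X, Y, Z. It is not taken from the Sagitta pipeline: the function name is hypothetical, the axis convention (X towards the Galactic center, Y towards l = 90°, Z towards the north Galactic pole) is an assumption since the text does not state it, and the distance is approximated as the inverse of the parallax.

```python
import numpy as np

def galactic_to_xyz(l_deg, b_deg, parallax_mas):
    """Heliocentric X, Y, Z in pc from Galactic l, b (deg) and parallax (mas).

    Assumed convention: X towards l = 0, Y towards l = 90 deg, Z towards b = 90 deg.
    """
    l_rad = np.radians(np.asarray(l_deg, dtype=float))
    b_rad = np.radians(np.asarray(b_deg, dtype=float))
    # Assumption: distance (pc) ~ 1000 / parallax (mas); valid for precise parallaxes.
    dist_pc = 1000.0 / np.asarray(parallax_mas, dtype=float)
    x = dist_pc * np.cos(b_rad) * np.cos(l_rad)
    y = dist_pc * np.cos(b_rad) * np.sin(l_rad)
    z = dist_pc * np.sin(b_rad)
    return x, y, z

# A parallax of ~7.5 mas, quoted in the text for rho Oph, corresponds to ~133 pc.
rho_oph_dist_pc = 1000.0 / 7.5
x, y, z = galactic_to_xyz([0.0, 90.0, 0.0], [0.0, 0.0, 90.0], [10.0, 10.0, 10.0])
```

With this convention a top-down map such as Figure 17 is simply a scatter plot of X against Y, and the ring radius quoted in Section 5.1 would be measured as the in-plane distance from the origin.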

5. DISCUSSION

5.1. Local Bubble & Gould's belt

Gould's belt has long been a recognizable feature of the solar neighborhood, showing the apparent tilt of star forming regions, such as Sco-Cen, Orion, Taurus, and Perseus, relative to the Galactic plane. Over the years, there have been a number of interpretations of the causes of this tilt, whether it is caused by a series of supernova eruptions (Pöppel & Marronetti 2000), or a collision of some sort with the disk (Comeron & Torra 1994; Bekki 2009).

Recently, some of the populations on one side of the Gould's belt (such as Orion) have been associated with the Radcliffe Wave (Alves et al. 2020), a 2.7 kpc long structure that has a number of ripples protruding from the Galactic plane, up to 160 pc in amplitude. However, no interpretation has been given as a cause for these ripples.

Zari et al. (2018) have analyzed the distribution of stars younger than 20 Myr and found no evidence for a fully connected Gould's belt; rather, all of the individual populations appeared to be unrelated. Similarly, Bouy & Alves (2015) have used a census of nearby OB stars and arrived at similar conclusions. With an improved census of PMS stars that extends towards older ages, we seek to reexamine this.

Figure 17 shows the top down map of PMS stars. There is a clear ring-like structure with the radius of ∼100–150 pc that connects Sco–Cen and Taurus as well as some of the older populations, such as the α Per/Cas Tau OB association. Although hints of a complete ring are seen in stars < 20 Myr old (particularly tracing the inner rim), it is the older stars that define this ring most clearly, which may be part of the reason why Zari et al. (2018) have not identified it in their data.

Furthermore, beyond this ring, there appears to be a gap in the 3d distribution of PMS stars at the distances of ∼200–250 pc, with other populations becoming more prominent at distances of >300 pc.

This structure does not appear to be artificial, persisting even at higher confidence levels. Furthermore, as discussed in Section 3.4.2, it cannot be attributed to the classifier favoring a particular set of distances in a truly uniform distribution of stars, as the classifier does not intrinsically favor any specific distance in either the recovery fraction or in contamination (Figure 7).

Outside of very low mass populations such as TW Hya (which we do recover), a lack of significant young star forming regions within 100 pc has long been known. Indeed, the Sun is located near the center of the Local Bubble, a cavity of significantly lower density of neutral hydrogen compared to what is typically found in the interstellar medium. When the morphology of the Local Bubble is traced with various tracers, such as diffuse interstellar bands (that can trace the X-ray dissociation within the Bubble; Farhang et al. 2019) or extinction (Lallement et al. 2018; Leike & Enßlin 2019), it fits well within the identified ring of stars.

It has been noted that Ophiuchus and Taurus have velocities that are comparable in magnitude, but opposite in direction, and they would have been in proximity of one another ∼20–25 Myr ago (Rivera et al. 2015). This traceback age is comparable to the average age of the stars in the ring, although there are also a number of ∼40 Myr old stars that compose it. Unfortunately, so far only a few sources have radial velocities available to confirm a global expansion along other sight lines, or to estimate the expansion age. At a glance this appears to be an effect of a supernova explosion. It should be noted that different populations do appear to have somewhat different peculiar velocities relative to one another, at least in the proper motion space. Thus most likely, instead of a shockwave clearing the gas of a particular population (as has been the case of supernovae in young star forming regions such as Orion, and potentially Vela; Kounkel 2020; Großschedl et al. 2020; Cantat-Gaudin et al. 2019), the shock front associated with the expanding bubble may have rammed into the neighboring clouds, not dissimilar to a scenario described by Inutsuka et al. (2015) in which molecular clouds trace interaction regions between even shorter lived bubbles.

The Local Bubble does not immediately explain the lack of star forming regions at the distances of ∼200–250 pc. Such a gap can be seen in the distribution of present day molecular clouds (Zucker et al. 2020), and it is also present in the catalog from Zari et al. (2018). Although it is not impossible, it would be surprising for two unrelated events to occur in the vicinity of the Sun to form two separate rings at ∼100 pc and at ∼300 pc. Instead, it may be possible that they are produced by a related event, possibly caused by multiple shock fronts. This may suggest a common origin for the populations that are a part of the Gould's belt.

5.2. 30 Myr Bubble

There is another peculiar bubble-like structure that can be identified in the data. It can be seen in the last panel of Figure 11, primarily traced by the stars older than 25 Myr, towards the direction of the Galactic center, and it is ∼40° in radius. This bubble appears to begin at the distance of ∼200 pc and forms a hemisphere ∼400 pc in diameter (Figure 18, top). Unfortunately the distant edge of this bubble is difficult to detect, as at those distances we lose sensitivity to low mass stars that can be used as tracers for this age range.

This bubble is not caused by extinction towards the Galactic center. Although there are a number of optically thick clouds in the volume of space associated with it (e.g., the Aquila rift, for which we do recover a sizable population of stars in the 2–10 Myr range), the edge of the bubble is located far outside of those clouds.

In analyzing the proper motions of the stars located on the other edge of the bubble we identify a somewhat peculiar pattern: corrected for the average velocity of all the stars in the sample, there is a strong preference for them to move either directly radially inwards towards the center (at l ∼ 6°, b ∼ 6°) or radially outward away from it, with next to no tangential component in the proper motions (Figure 18). It is unclear what could be the cause of such a signature.

Based on the velocities of stars that are just expanding, we would estimate the expansion age of ∼13 Myr, or approximately half of their age.

6. CONCLUSIONS

One of the outstanding questions in the star formation community is how post T Tauri stars can be identified. We present an automated method of identifying PMS stars and estimating their ages (up to ∼70 Myr) through Gaia DR2 and 2MASS photometry using a neural network framework. This allows for a homogeneous analysis of large volumes of data characterizing the star forming history of the solar neighborhood. Furthermore, this approach is not reliant on a kinematic selection, making it possible to search for kinematically peculiar young stars, such as runaways (e.g., McBride & Kounkel 2019).

Applying the classifier to a curated subset of Gaia DR2 data with the most reliable astrometry and photometry, we identify 197,315 stars as likely PMS sources with confidence of > 70%, and 496,488 stars in Gaia EDR3 data. The code is made available on GitHub (https://github.com/hutchresearch/Sagitta), to enable the usage of Sagitta outside of this curated subset.

Sagitta is robust against contamination, especially when compared to a number of previous studies that also aimed to identify young stars using optical and near infrared data. The precise confidence threshold that should be used to select candidate PMS stars in a particular region depends, however, on the distance to and the age of the population that one seeks to characterize.

The estimated ages that we measure are consistent with what has been previously measured in some of the better studied star forming regions. Furthermore, in many cases, they allow for a more granular look at the evolution of various populations than what was previously available. Caution should be exercised, however, regarding the pre-main sequence binaries, as they may appear to be systematically younger than they are.

In examining the distribution of stars in the solar neighborhood, we identify various features. Most notably we identify a ring of stars at ∼100 pc with ages of up to ∼40 Myr, tracing the outer edges of the Local Bubble. It is likely that the formation of this bubble has led to the formation of the Gould's belt. We also find a second bubble consisting of ∼30 Myr old stars in the direction towards the Galactic center.

In the future, a follow up of the sample presented in this work by large spectroscopic surveys (such as SDSS-V APOGEE) would be of great benefit for confirming the candidates, as well as allowing for a better understanding of the dynamical and chemical evolution of PMS stars.

Software: TOPCAT (Taylor 2005), PyTorch (Paszke et al. 2017)

[Figure 18 panels. Top: X (pc) versus Z (pc), with legend entries "All" and "30 Myr Bubble". Middle: sky map with proper motion arrows color-coded by Radial motion (−1.0 to 1.0). Bottom: histogram of Count versus Radial Motion.]

Figure 18. Top: 3d distribution of stars of the 30 Myr bubble relative to the other stars in the catalog. Middle: distribution of the sources towards the bubble in the plane of the sky, with proper motions, in the local standard of rest, corrected for the average velocity of the stars, with the thick part of the arrow showing the current position of the sources. Arrows are color coded by the cosine of the angle that the proper motions have relative to the center of the bubble, with red (-1) showing sources moving radially inward and blue (+1) showing those moving radially outward. Bottom: a histogram showing the distribution of the aforementioned radial component of motion.
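The radial component of motion shown in Figure 18 can be sketched as follows. This is one possible way to compute such a cosine, not the authors' implementation: the function name is hypothetical, the offsets from the bubble center are computed in a flat-sky approximation (which is crude over a region ∼40° in radius), the default center of l ∼ 6°, b ∼ 6° is taken from Section 5.2, and the proper motions are assumed to be already corrected for the average velocity of the sample.

```python
import numpy as np

def radial_motion_cosine(l_deg, b_deg, pm_l_cosb, pm_b, l0_deg=6.0, b0_deg=6.0):
    """Cosine of the angle between proper motion and the outward radial direction.

    +1: moving directly away from the center; -1: moving directly towards it.
    Flat-sky approximation; inputs are Galactic coordinates in degrees and
    proper motions (mu_l * cos b, mu_b) in any common unit.
    """
    l_deg = np.asarray(l_deg, dtype=float)
    b_deg = np.asarray(b_deg, dtype=float)
    pm_l_cosb = np.asarray(pm_l_cosb, dtype=float)
    pm_b = np.asarray(pm_b, dtype=float)

    # Longitude offset wrapped to [-180, 180) so that l = 356 is 10 deg from l = 6.
    dl = (l_deg - l0_deg + 180.0) % 360.0 - 180.0
    dx = dl * np.cos(np.radians(b_deg))  # offset along l, scaled by cos b
    dy = b_deg - b0_deg                  # offset along b

    r = np.hypot(dx, dy)                 # angular distance from the center
    mu = np.hypot(pm_l_cosb, pm_b)       # total proper motion
    with np.errstate(divide="ignore", invalid="ignore"):
        # Normalized dot product; NaN for stars at the center or with zero motion.
        return (dx * pm_l_cosb + dy * pm_b) / (r * mu)

# Toy values only: three stars 10 deg from the center, moving out, in, and sideways.
cosines = radial_motion_cosine([16.0, 16.0, 356.0], [6.0, 6.0, 6.0],
                               [1.0, -1.0, 0.0], [0.0, 0.0, 1.0])
```

A histogram of the returned values, as in the bottom panel of Figure 18, would peak near −1 and +1 if the motions are predominantly radial and near 0 if they are predominantly tangential.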

ACKNOWLEDGMENTS

We thank Eleonora Zari and Keivan Stassun for giving feedback on the manuscript. A.M., M.K., and K.C. acknowledge support provided by the NSF through grant AST-1449476. This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement. The authors thank the Nvidia Corporation for their donation of GPUs used in this work.

REFERENCES

Alves, J., Zucker, C., Goodman, A. A., et al. 2020, Nature, doi: 10.1038/s41586-019-1874-z
Andrews, S. M., & Wolk, S. J. 2008, Bo Reipurth, I, 1. https://arxiv.org/abs/arXiv:0809.2276v1
Azimlu, M., Martínez-Galarza, J. R., & Muench, A. A. 2015, AJ, 150, 95, doi: 10.1088/0004-6256/150/3/95
Bally, J. 2008, Overview of the Orion Complex, ed. B. Reipurth, 459
Baraffe, I., Chabrier, G., Allard, F., & Hauschildt, P. H. 1998, A&A, 337, 403
Baraffe, I., Homeier, D., Allard, F., & Chabrier, G. 2015, A&A, 577, A42, doi: 10.1051/0004-6361/201425481
Beccari, G., Boffin, H. M. J., & Jerabkova, T. 2020, MNRAS, 491, 2205, doi: 10.1093/mnras/stz3195
Bekki, K. 2009, MNRAS, 398, L36, doi: 10.1111/j.1745-3933.2009.00702.x
Bouma, L. G., Winn, J. N., Ricker, G. R., et al. 2020, arXiv e-prints, arXiv:2005.10253. https://arxiv.org/abs/2005.10253
Bouy, H., & Alves, J. 2015, A&A, 584, A26, doi: 10.1051/0004-6361/201527058
Briceño, C., Calvet, N., Hernández, J., et al. 2019, AJ, 157, 85, doi: 10.3847/1538-3881/aaf79b
Cantat-Gaudin, T., Mapelli, M., Balaguer-Núñez, L., et al. 2019, A&A, 621, A115, doi: 10.1051/0004-6361/201834003
Carpenter, J., & Hodapp, K. 2008, I, 1. https://arxiv.org/abs/0809.1396
Chiu, Y.-L., Ho, C.-T., Wang, D.-W., & Lai, S.-P. 2020, arXiv e-prints, arXiv:2007.06235. https://arxiv.org/abs/2007.06235
Choi, J., Dotter, A., Conroy, C., et al. 2016, ApJ, 823, 102, doi: 10.3847/0004-637X/823/2/102
Cohen, M., & Kuhi, L. V. 1979, ApJS, 41, 743, doi: 10.1086/190641
Comerón, F. 2008, Handbook of Star Forming Regions, I, 295. http://www.aspmonographs.org/a/volumes/article_details/?paper_id=43
Comeron, F., & Torra, J. 1994, A&A, 281, 35
Covey, K. R., Lada, C. J., Román-Zúñiga, C., et al. 2010, ApJ, 722, 971, doi: 10.1088/0004-637X/722/2/971
Da Rio, N., Robberto, M., Hillenbrand, L. A., Henning, T., & Stassun, K. G. 2012, ApJ, 748, 14, doi: 10.1088/0004-637X/748/1/14
Dahm, S. E. 2008, The Young Cluster and Star Forming Region NGC 2264, ed. B. Reipurth, 966
Damiani, F., Prisinzano, L., Pillitteri, I., Micela, G., & Sciortino, S. 2019, A&A, 623, A112, doi: 10.1051/0004-6361/201833994
D'Antona, F., & Mazzitelli, I. 1994, ApJS, 90, 467, doi: 10.1086/191867
Delgado, A. J., Alfaro, E. J., & Yun, J. L. 2011, A&A, 531, A141, doi: 10.1051/0004-6361/201116491
Erickson, K. L., Wilking, B. A., Meyer, M. R., et al. 2015, AJ, 149, 103, doi: 10.1088/0004-6256/149/3/103
Evans, II, N. J., Dunham, M. M., Jørgensen, J. K., et al. 2009, ApJS, 181, 321, doi: 10.1088/0067-0049/181/2/321
Fang, M., Kim, J. S., van Boekel, R., et al. 2013, ApJS, 207, 5, doi: 10.1088/0067-0049/207/1/5
Fang, M., Kim, J. S., Pascucci, I., et al. 2017, AJ, 153, 188, doi: 10.3847/1538-3881/aa647b
Farhang, A., van Loon, J. T., Khosroshahi, H. G., Javadi, A., & Bailey, M. 2019, Nature Astronomy, 3, 922, doi: 10.1038/s41550-019-0814-z
Farias, J. P., Tan, J. C., & Eyer, L. 2020, ApJ, 900, 14, doi: 10.3847/1538-4357/aba699

Fischer, W. J., Megeath, S. T., Furlan, E., et al. 2017, ApJ, 840, 69, doi: 10.3847/1538-4357/aa6d69
Gaia Collaboration, Brown, A. G. A., Vallenari, A., et al. 2020, arXiv e-prints, arXiv:2012.01533
Gaia Collaboration, Brown, A. G. A., Vallenari, A., et al. 2018, A&A, 616, A1, doi: 10.1051/0004-6361/201833051
Galli, P. A. B., Loinard, L., Bouy, H., et al. 2019, A&A, 630, A137, doi: 10.1051/0004-6361/201935928
Getman, K. V., Feigelson, E. D., Kuhn, M. A., et al. 2014, ApJ, 787, 108, doi: 10.1088/0004-637X/787/2/108
Green, G. M., Schlafly, E., Zucker, C., Speagle, J. S., & Finkbeiner, D. 2019, ApJ, 887, 93, doi: 10.3847/1538-4357/ab5362
Greene, T. P., & Meyer, M. R. 1995, ApJ, 450, 233, doi: 10.1086/176134
Großschedl, J. E., Alves, J., & Meingast, S. 2020, arXiv e-prints, arXiv:2007.07254. https://arxiv.org/abs/2007.07254
Hartmann, L. 2009, Accretion Processes in Star Formation: Second Edition (Cambridge University Press)
Herbig, G., & Reipurth, B. 2008, Handbook of Star Forming Regions, Volume I, II, 108
Herczeg, G. J., & Hillenbrand, L. A. 2014, ApJ, 786, 97, doi: 10.1088/0004-637X/786/2/97
Herczeg, G. J., & Hillenbrand, L. A. 2015, ApJ, 808, 23, doi: 10.1088/0004-637X/808/1/23
Herczeg, G. J., Kuhn, M. A., Zhou, X., et al. 2019, ApJ, 878, 111, doi: 10.3847/1538-4357/ab1d67
Hillenbrand, L. A., Bauermeister, A., & White, R. J. 2008, in Astronomical Society of the Pacific Conference Series, Vol. 384, 14th Cambridge Workshop on Cool Stars, Stellar Systems, and the Sun, ed. G. van Belle, 200. https://arxiv.org/abs/astro-ph/0703642
Inutsuka, S.-i., Inoue, T., Iwasaki, K., & Hosokawa, T. 2015, A&A, 580, A49, doi: 10.1051/0004-6361/201425584
Jackson, R. J., Deliyannis, C. P., & Jeffries, R. D. 2018, MNRAS, 476, 3245, doi: 10.1093/mnras/sty374
Jeffries, R. D., Naylor, T., Walter, F. M., Pozzo, M. P., & Devey, C. R. 2009, MNRAS, 393, 538, doi: 10.1111/j.1365-2966.2008.14162.x
Jeffries, R. D., Jackson, R. J., Cottaar, M., et al. 2014, A&A, 563, A94, doi: 10.1051/0004-6361/201323288
Kenyon, S. J., Gómez, M., & Whitney, B. A. 2008, Low Mass Star Formation in the Taurus-Auriga Clouds, ed. B. Reipurth, 405
Koenig, X. P., Leisawitz, D. T., Benford, D. J., et al. 2012, ApJ, 744, 130, doi: 10.1088/0004-637X/744/2/130
Kos, J., Bland-Hawthorn, J., Asplund, M., et al. 2019, A&A, 631, A166, doi: 10.1051/0004-6361/201834710
Kounkel, M. 2020, arXiv e-prints, arXiv:2007.09160. https://arxiv.org/abs/2007.09160
Kounkel, M., & Covey, K. 2019, AJ, 158, 122, doi: 10.3847/1538-3881/ab339a
Kounkel, M., Covey, K., & Stassun, K. G. 2020, arXiv e-prints, arXiv:2004.07261. https://arxiv.org/abs/2004.07261
Kounkel, M., Covey, K., Suárez, G., et al. 2018, AJ, 156, 84, doi: 10.3847/1538-3881/aad1f1
Kraus, A. L., Herczeg, G. J., Rizzuto, A. C., et al. 2017, ApJ, 838, 150, doi: 10.3847/1538-4357/aa62a0
Ksoll, V. F., Gouliermis, D. A., Klessen, R. S., et al. 2018, MNRAS, 479, 2389, doi: 10.1093/mnras/sty1317
Kuhn, M. A., de Souza, R. S., Krone-Martins, A., et al. 2020, arXiv e-prints, arXiv:2011.12961. https://arxiv.org/abs/2011.12961
Kumar, B., Sharma, S., Manfroid, J., et al. 2014, A&A, 567, A109, doi: 10.1051/0004-6361/201323027
Kun, M., Balog, Z., Kenyon, S. J., Mamajek, E. E., & Gutermuth, R. A. 2009, ApJS, 185, 451, doi: 10.1088/0067-0049/185/2/451
Kun, M., Balog, Z., Mizuno, N., et al. 2008, MNRAS, 391, 84, doi: 10.1111/j.1365-2966.2008.13898.x
Lallement, R., Capitanio, L., Ruiz-Dern, L., et al. 2018, A&A, 616, A132, doi: 10.1051/0004-6361/201832832
Leike, R. H., & Enßlin, T. A. 2019, A&A, 631, A32, doi: 10.1051/0004-6361/201935093
Lindegren, L., Hernández, J., Bombrun, A., et al. 2018, A&A, 616, A2, doi: 10.1051/0004-6361/201832727
López Martí, B., Jiménez-Esteban, F., Bayo, A., et al. 2013, A&A, 556, A144, doi: 10.1051/0004-6361/201321217
Luhman, K. L. 2008, Chamaeleon, ed. B. Reipurth, Vol. 5, 169
Luhman, K. L. 2018, AJ, 156, 271, doi: 10.3847/1538-3881/aae831
Marigo, P., Girardi, L., Bressan, A., et al. 2017, ApJ, 835, 77, doi: 10.3847/1538-4357/835/1/77
Marton, G., Tóth, L. V., Paladini, R., et al. 2016, MNRAS, 458, 3479, doi: 10.1093/mnras/stw398
Marton, G., Ábrahám, P., Szegedi-Elek, E., et al. 2019, MNRAS, doi: 10.1093/mnras/stz1301
McBride, A., & Kounkel, M. 2019, ApJ, 884, 6, doi: 10.3847/1538-4357/ab3df9
Megeath, S. T., Gutermuth, R., Muzerolle, J., et al. 2012, AJ, 144, 192, doi: 10.1088/0004-6256/144/6/192
Neuhäuser, R., & Forbrich, J. 2008, II, 1. https://arxiv.org/abs/0808.3374
Olney, R., Kounkel, M., Schillinger, C., et al. 2020, AJ, 159, 182, doi: 10.3847/1538-3881/ab7a97
Palla, F., & Stahler, S. W. 2002, ApJ, 581, 1194, doi: 10.1086/344293
Panwar, N., Pandey, A. K., Samal, M. R., et al. 2018, AJ, 155, 44, doi: 10.3847/1538-3881/aa9f1b

Paszke, A., Gross, S., Chintala, S., et al. 2017, in NIPS-W
Pecaut, M. J., & Mamajek, E. E. 2016, MNRAS, 461, 794, doi: 10.1093/mnras/stw1300
Pöppel, W. G. L., & Marronetti, P. 2000, A&A, 358, 299
Prato, L., Rice, E. L., & Dame, T. M. 2008, Where are all the Young Stars in Aquila?, ed. B. Reipurth, Vol. 4, 18
Preibisch, T., & Mamajek, E. 2008, The Nearest OB Association: Scorpius-Centaurus (Sco OB2), ed. B. Reipurth, Vol. 5, 235
Prisinzano, L., Damiani, F., Guarcello, M. G., et al. 2018, A&A, 617, A63, doi: 10.1051/0004-6361/201833172
Prusti, T., Adorf, H. M., & Meurs, E. J. A. 1992, A&A, 261, 685
Rauw, G., & De Becker, M. 2008, II, 1. https://arxiv.org/abs/0808.3887
Reipurth, B. 2008a, The Young Cluster NGC 6604 and the Serpens OB2 Association, ed. B. Reipurth, Vol. 5, 590
—. 2008b, Handbook of Star Forming Regions, Volume I: The Northern Sky, ASP Monographs No. v. 1 (Astronomical Society of the Pacific). http://aspmonographs.org/a/volumes/table_of_contents?book_id=1
—. 2008c, Handbook of Star Forming Regions, Volume II: The Southern Sky, ASP Monographs No. v. 2 (Astronomical Society of the Pacific). http://aspmonographs.org/a/volumes/table_of_contents?book_id=2
Reipurth, B., & Schneider, N. 2008, Handbook of Star Forming Regions, Volume I, I, 36
Reipurth, B., & Yan, C. H. 2008, Star Formation and Molecular Clouds towards the Galactic Anti-Center, ed. B. Reipurth, Vol. 4, 869
Rivera, J. L., Loinard, L., Dzib, S. A., et al. 2015, ApJ, 807, 119, doi: 10.1088/0004-637X/807/2/119
Román-Zúñiga, C. G., & Lada, E. A. 2008, I, 1. https://arxiv.org/abs/0810.0931
Schoettler, C., de Bruijne, J., Vaher, E., & Parker, R. J. 2020, MNRAS, 495, 3104, doi: 10.1093/mnras/staa1228
Siess, L., Dufour, E., & Forestini, M. 2000, A&A, 358, 593
Smith, N., & Brooks, K. J. 2008, II, 1. https://arxiv.org/abs/0809.5081
Suárez, G., Downes, J. J., Román-Zúñiga, C., et al. 2017, AJ, 154, 14, doi: 10.3847/1538-3881/aa733a
Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
Tóth, L. V., Marton, G., Zahorecz, S., et al. 2014, PASJ, 66, 17, doi: 10.1093/pasj/pst017
Tothill, N. F. H., Gagné, M., Stecklum, B., & Kenworthy, M. A. 2008, II, 1. https://arxiv.org/abs/0809.3380
Vioque, M., Oudmaijer, R. D., Schreiner, M., et al. 2020, arXiv e-prints, arXiv:2005.01727. https://arxiv.org/abs/2005.01727
Walawender, J., Bally, J., Di Francesco, J., Jørgensen, J., & Getman, K. 2008, Bo Reipurth, I, 1. http://www.ifa.hawaii.edu/publications/preprints/08preprints/Walawender_08-206.pdf
Wilking, B., Gagne, M., & Allen, L. 2008, II, 1. https://arxiv.org/abs/0811.0005
Wolk, S. J., Bourke, T. L., & Vigil, M. 2008, Handbook of Star Forming Regions, II, 124. https://arxiv.org/abs/arXiv:0808.3385v1
Wright, N. J., Bouy, H., Drew, J. E., et al. 2016, MNRAS, 460, 2593, doi: 10.1093/mnras/stw1148
Zari, E., Brown, A. G. A., & de Zeeuw, P. T. 2019, A&A, 628, A123, doi: 10.1051/0004-6361/201935781
Zari, E., Hashemi, H., Brown, A. G. A., Jardine, K., & de Zeeuw, P. T. 2018, A&A, 620, A172, doi: 10.1051/0004-6361/201834150
Zeiler, M. D., & Fergus, R. 2014, in Proc. ECCV, 818–833, doi: 10.1007/978-3-319-10590-1_53
Zucker, C., Speagle, J. S., Schlafly, E. F., et al. 2020, A&A, 633, A51, doi: 10.1051/0004-6361/201936145