Photometric Redshift Estimation: An Active Learning Approach

R. Vilalta∗, E. E. O. Ishida†, R. Beck‡, R. Sutrisno∗, R. S. de Souza§, A. Mahabal¶
∗Department of Computer Science, University of Houston, 3551 Cullen Blvd, 501 PGH, Houston, Texas 77204-3010, USA. Email: {rvilalta,rasutrisno}@uh.edu
†Laboratoire de Physique Corpusculaire, Université Clermont-Auvergne, 4 Avenue Blaise Pascal, TSA 60026, CS 60026, 63178 Aubière Cedex, France. Email: [email protected]
‡Department of Physics of Complex Systems, Eötvös Loránd University, Budapest 1117, Hungary. Email: [email protected]
§Department of Physics & Astronomy, University of North Carolina, Chapel Hill, NC 27599-3255, USA. Email: [email protected]
¶Department of Astronomy, California Institute of Technology, Pasadena, CA 91125, USA. Email: [email protected]

Abstract—A long-lasting problem in astronomy is the accurate estimation of galaxy distances based solely on the information contained in photometric filters. Due to observational selection effects, the spectroscopic (source) sample lacks coverage throughout the feature space (e.g. colors and magnitudes) compared to the photometric (target) sample; this results in a clear mismatch in terms of photometric measurement distributions. We propose a solution to this problem based on active learning, a machine learning technique where a sampling strategy enables us to select the most informative instances to build a predictive model; specifically, we use active learning following a Query by Committee approach. We show that by making wisely selected queries in the target domain, we are able to increase our predictive performance significantly. We also show how a relatively small number of queries (spectroscopic follow-up measurements) suffices to improve the performance of photometric redshift estimators significantly.

I. INTRODUCTION

Galaxy redshift estimation is critical to probe the evolution of the Universe, as it provides the means to measure distances at cosmic scales via Hubble's law [1]. Precise redshifts can be determined through the high-resolution decomposition of electromagnetic radiation as a function of wavelength, i.e., through spectroscopy, which provides absorption or emission-line features in the raw optical and/or near-infrared spectrum. However, the observational cost of this procedure is prohibitive even for medium-size surveys. Acquiring spectroscopic measurements for all cataloged objects is not a viable approach, and this will only become increasingly difficult with the advent of new instruments and associated large scale surveys, e.g. Pan-STARRS [2], the Dark Energy Survey (DES) [3], and the Large Synoptic Survey Telescope (LSST) [4], among others.

Due to improved efficiency, cost, and capacity to probe distant objects, an alternative widespread tool for galaxy redshift estimation is to use photometric techniques (photo-z) [5]–[7]. Photo-z are relevant for a variety of astronomical and cosmological applications such as gravitational lensing [8], baryon acoustic oscillations [9], and studies of Type Ia supernovae [10]. Photometric measurements capture the intensity of electromagnetic radiation through a small number of broad wavelength windows (filters). Compared to spectroscopic techniques (spec-z), photometric measurements exhibit lower resolution and higher estimation uncertainty, but there are also major gains: we can substantially expedite the process and analysis of this type of observations. A major aim in astronomy is to infer spectroscopic properties from purely photometric data.

The challenge behind the construction of accurate photometric galaxy redshift estimators is twofold. One major obstacle is the (near-)degeneracy in photometric feature space of different galaxy types at differing redshifts, exacerbated by photometric measurement errors [11]. The second obstacle is the marked disparity in distribution between spectroscopic (training) and photometric (target) observations, such that the naive approach of building a model on spectroscopic data and applying it on photometric data is prone to failure [12]. This scenario may seem a perfect fit for domain adaptation techniques in machine learning [13]–[17], where the training set is drawn from a source domain, while the desired estimated set is drawn from a target domain, and both exhibit different distributions.
The main challenge is to adjust the model obtained in the source domain to account for differences exhibited in the new target domain. In our case, the source domain corresponds to spectroscopic observations, whereas the target domain corresponds to photometric observations. However, under a strong distributional discrepancy between the two domains, domain adaptation techniques alone are inadequate, and a better solution is to combine domain adaptation with active learning techniques [18]–[24].

Active learning (AL) suggests which instances lacking a value for the response variable should be queried to fill in the missing response value, and subsequently be incorporated into the training set [25]–[27]. In our case, such instances correspond to galaxies in the photometric (target) sample that can be queried for subsequent spectroscopic follow-up. The idea has been applied to astronomical problems for the determination of stellar population parameters [28], periodic variable stars [29], and supernova photometric classification [18]. A recent approach used a self-organized map to pinpoint regions in the target domain not covered in the source domain [30], but unlike AL, the methodology does not directly optimize empirical performance and the number of required queries.

In this study, we provide an empirical assessment of the effectiveness of AL on photometric redshift estimation. The idea is to build an initial regression model using a source (spectroscopy) dataset and to refine that model by sampling examples from the target (photometry) dataset. The central goal is to evaluate the efficiency of AL under the assumption that querying galaxy redshifts through spectroscopic analysis is expensive; it is then important to minimize the number of queries, while improving the accuracy of the regression model.

Our experiments use datasets specifically constructed to capture the distributional discrepancy between photometric and spectroscopic observations [12]. Results show that using AL through a Query by Committee approach yields significant gains in performance. The comparison is made with similar models built with and without the use of AL. Moreover, results show that a few hundred instances (∼300) suffice to generate a good model to predict redshifts on thousands of spectroscopic observations. The proposed methodology leverages modern regression techniques with the ability to adapt to new domains under the presence of distributional changes.

This paper is organized as follows. Section II explains basic concepts in regression and AL. Section III explains our approach to galaxy redshift estimation using AL. Section IV shows our empirical results. Lastly, Section V gives a summary and conclusions.
II. PRELIMINARY CONCEPTS

A. Basic Notation in Regression

We assume a regression algorithm receives as input a set of training examples T = {(x_i, y_i)}_{i=1}^m, where x = (x_1, x_2, ..., x_n) is a vector in the input space R^n, and y is an instance of the output space, y ∈ R, also referred to as y(x). We assume the training dataset T consists of independently and identically distributed (i.i.d.) examples obtained according to a fixed but unknown joint probability distribution P(x, y) over the input-output space R^n × R. The outcome of the algorithm is a hypothesis or function f_θ(x|T) (parameterized by θ) mapping the input space to the output space, f : R^n → R. We are interested in choosing the hypothesis that minimizes an expected error:

    \int_x E[(f_\theta(x|T) - y(x))^2 \mid x] \, g(x) \, dx    (1)

where E[·] is the expectation over the posterior distribution P(y|x) and training set T. We assume g(x) is the marginal distribution over the input space, and f_θ(x|T) is the estimation of the response for x induced by the regression algorithm. In what follows, E_T[·] refers to the expectation over the training set, and E_P[·] to the expectation over the posterior distribution P(y|x).

The expected error in Eq. 1 can be decomposed into three main components:

    E[(f_\theta(x|T) - y(x))^2 \mid x] =
        \underbrace{E_P[(y(x) - E_P[y|x])^2]}_{\text{noise}}
      + \underbrace{(E_T[f_\theta(x|T)] - E_P[y|x])^2}_{\text{squared bias}}
      + \underbrace{E_T[(f_\theta(x|T) - E_T[f_\theta(x|T)])^2]}_{\text{variance}}    (2)

The first term on the right side of Eq. 2 is independent of our estimator f_θ and captures the inherent randomness or noise in the data; it is unavoidable as long as the data are represented using the same feature (input) space. The second term is known as the squared bias, and denotes how well our model matches the data distribution. The last term quantifies the instability of our model as it wiggles around a central value; this term is known as the model variance. There is a well known trade-off between variance and bias [31]: ideally we would like to have low bias and variance, but the terms are commonly inversely correlated.¹

¹ Notice that both terms can be reduced if the space of hypotheses or functions is intentionally designed to contain a good approximation to the target function, which demands extra-evidential information lying outside the training data.
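To make the decomposition in Eq. 2 concrete, the following minimal simulation (our illustration, not part of the original study) estimates the three terms at a single query point x0 for a polynomial regressor; the data generator true_fn and the noise level are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generator: y(x) = sin(2*pi*x) + Gaussian noise.
def true_fn(x):
    return np.sin(2 * np.pi * x)

NOISE_SD = 0.3

def sample_training_set(m=30):
    x = rng.uniform(0.0, 1.0, m)
    return x, true_fn(x) + rng.normal(0.0, NOISE_SD, m)

# Fit a degree-3 polynomial on many independent training sets T and
# record the prediction f_theta(x0|T) at a fixed query point x0.
x0, n_trials = 0.25, 2000
preds = np.empty(n_trials)
for t in range(n_trials):
    x, y = sample_training_set()
    preds[t] = np.polyval(np.polyfit(x, y, deg=3), x0)

noise = NOISE_SD ** 2                         # E_P[(y - E_P[y|x])^2]
bias_sq = (preds.mean() - true_fn(x0)) ** 2   # (E_T[f] - E_P[y|x])^2
variance = preds.var()                        # E_T[(f - E_T[f])^2]
print(f"noise={noise:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
print(f"sum={noise + bias_sq + variance:.4f}  (approximates Eq. 2 at x0)")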

B. Active Learning for Regression

Active learning provides a mechanism to query those unlabeled examples deemed relevant to build an accurate predictive model [25]–[27]. Specifically, in (pool-based) AL, we assume the existence of two training sets: T = {(x_i, y_i)}_{i=1}^m and T_U = {(x_j)}_{j=1}^l, where the latter set lacks values for the response variable. To incorporate set T_U into the regression task, we selectively identify examples from T_U and query their response value to incorporate the new pair into the training set T. An assumption is made that querying instances is costly, and that we should try to minimize the number of queries while improving on model performance.

In this paper we follow an approach to AL known as Query by Committee (QBC) [32], [33]. The use of QBC originated in the realm of classification (where the output variable y takes on categorical values), but it has also been used in regression [34], [35]. In essence, the algorithm builds a committee of different models over training set T; it then selects those instances in T_U where most of the committee models disagree. This is straightforward to apply in a classification problem, where the notion of disagreement can be easily decided under a categorical response [36]–[38]. For the regression problem, AL is based on the notion of model variance. As an illustration, Figure 1 shows a sequence of two steps where examples are queried based on regions of high variance over the feature space. The left panel shows a distribution of examples approximated using a composition of two linear models. Regions of high variance are queried to yield a better approximating function (right panel).

Figure 1. Active learning for regression focuses on regions of high variance to select new training instances. Once instances are queried and incorporated into the training set, the model is re-computed to identify new regions of high variance.

In the context of QBC, variation is captured by quantifying the level of disagreement among the committee members (or regression models) through a metric called ambiguity [34]. Specifically, let the committee be made of a set of regression models² {f^k(x)}. Then the ambiguity at instance x, A(x), is defined as follows:

    A(x) = \sum_k w_k (f^k(x) - F(x))^2    (3)

where F(x) is the average estimation at x over the ensemble of regression models:

    F(x) = \sum_k w_k f^k(x)    (4)

The weights {w_k} are the same in Eqs. 3 and 4 and denote our degree of belief or importance for each model. We assume w_k ≥ 0 and that \sum_k w_k = 1.

It can be shown that the generalization error for the whole ensemble can be defined as follows:

    \int_x E[(F(x) - y(x))^2] \, g(x) \, dx = \Gamma - \Lambda    (5)

where Γ is the weighted average of the generalization errors of the regression models:

    \Gamma = \sum_k w_k E_k = \sum_k w_k \int_x (f^k(x) - y(x))^2 \, g(x) \, dx    (6)

and Λ is the weighted average of the ambiguities:

    \Lambda = \sum_k w_k A_k = \sum_k w_k \int_x (f^k(x) - F(x))^2 \, g(x) \, dx    (7)

The relevant fact about Eq. 5 is the trade-off between bias and variance in the QBC approach, which contrasts with the one defined in Section II-A. If the regression models are highly biased (e.g., linear models) the ambiguity is small, and error is shaped by the weighted average of the errors of the regression models, Γ. But if the average ambiguity Λ is high, generalization error is reduced; this is achieved by intently building a model made of a mixture of markedly different regression models.

² For simplicity, we omit parameter θ in the notation of function estimators.
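As a concrete rendering of Eqs. 3 and 4 (our sketch, not the authors' code), the ambiguity of a committee at a pool of candidate points can be computed as follows; the committee members and data in the demo are placeholders:

import numpy as np

def ensemble_prediction(committee, weights, X):
    """F(x) = sum_k w_k f^k(x) (Eq. 4), evaluated at each row of X."""
    preds = np.stack([model.predict(X) for model in committee])  # shape (k, n)
    return weights @ preds

def ambiguity(committee, weights, X):
    """A(x) = sum_k w_k (f^k(x) - F(x))^2 (Eq. 3), evaluated at each row of X."""
    preds = np.stack([model.predict(X) for model in committee])  # shape (k, n)
    F = weights @ preds                   # ensemble average, shape (n,)
    return weights @ (preds - F) ** 2     # weighted squared spread per instance

# Tiny demo with placeholder data and a two-member committee.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.1, 500)
committee = [LinearRegression().fit(X, y),
             RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)]
w = np.array([0.5, 0.5])                  # equal weights, summing to 1
print(ambiguity(committee, w, rng.normal(size=(10, 5))))

Note that under equal weights A(x) reduces to the variance of the committee predictions at x, which is precisely the disagreement signal QBC ranks candidate queries by.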
III. METHODOLOGY

Our goal is to build accurate galaxy redshift estimators using photometric data. The two major obstacles to attaining this goal, namely the lack of spectroscopic redshifts in new surveys, and the marked disparity in distribution between spectroscopic and photometric samples, can be tackled in two steps: 1) build an initial model using the (relatively small) set of galaxies with measured spectroscopic redshifts, and 2) iterate using AL by querying redshifts from the (large) set of galaxies with unknown redshifts. The first step is important to avoid the "cold start" problem [39] in AL, where the efficiency of AL is contingent on an initially accurate model, but such a model requires a representative labeled sample to be constructed in the first place; this circularity problem is broken when prelabeled source examples can participate in the model building process. The second step helps alleviate the distributional discrepancy by sampling directly from the target domain (photometric sample) using AL.

Algorithm 1: QBC with initial source model
Input: Labeled training data T, unlabeled data T_U, active learning sample size q
Output: Regression model F(x)
 1: Train model F on set T using QBC
 2: while T_U ≠ ∅ do
 3:     Rank T_U by A(x) (ambiguity, Eq. 3)
 4:     Let T_U* contain the top q instances with highest rank
 5:     Query redshift values on T_U*
 6:     T = T ∪ T_U*; T_U = T_U \ T_U*
 7:     Train model F on set T using QBC
 8: end while
 9: return model F

Algorithm 1 shows the main steps behind our approach. A committee of models F is built using a sample with known redshifts (T); this will be referred to as the source model. An iterative process then selects the q instances with highest ambiguity, and queries their redshift. Training set T is augmented with the set of queried instances (while the same instances are removed from T_U). A new model is then built with the new training set. The process repeats until set T_U is exhausted; the output is regression model F; this will be referred to as the target model. Results in Section IV quantify the advantages of our active-learning approach applied to photometric redshift estimation.
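For concreteness, a minimal Python sketch of Algorithm 1 (our rendering, not the authors' implementation), assuming the hypothetical ambiguity helper from the previous sketch and an oracle callback that stands in for spectroscopic follow-up:

import numpy as np

def qbc_active_learning(committee, weights, X_train, y_train, X_pool, oracle, q=10):
    """Sketch of Algorithm 1: repeatedly query the q most ambiguous pool
    instances, augment T, and retrain the committee until the pool is empty."""
    for model in committee:                        # step 1: initial source model
        model.fit(X_train, y_train)
    while len(X_pool) > 0:                         # step 2
        A = ambiguity(committee, weights, X_pool)  # step 3: rank pool by Eq. 3
        top = np.argsort(A)[-q:]                   # step 4: top-q ambiguous rows
        y_new = oracle(X_pool[top])                # step 5: spectroscopic follow-up
        X_train = np.vstack([X_train, X_pool[top]])    # step 6: T = T u T_U*
        y_train = np.concatenate([y_train, y_new])
        X_pool = np.delete(X_pool, top, axis=0)        #         T_U = T_U \ T_U*
        for model in committee:                    # step 7: retrain on augmented T
            model.fit(X_train, y_train)
    return committee, X_train, y_train

In practice the loop can also be stopped after a fixed query budget, which is how the diagnostics as a function of the number of queries are produced in Section IV.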
IV. EXPERIMENTS

For the purpose of testing the applicability of AL to the area of photometric redshift estimation, our experiments use data specifically designed to capture discrepancies between spectroscopic and photometric observations as reported by [12]. We select the TEDDY galaxy catalogue of [12] as a starting point, which explicitly deals with extrapolation in feature space, an issue that regularly comes up in real world photo-z applications [40].

A. Dataset

The TEDDY catalogue³ [12] was assembled from spectroscopic galaxy measurements of the SDSS Data Release 12 (DR12, [41]). There are four different subsets available, of which we utilize TEDDY A and TEDDY D, as detailed below.

The TEDDY A sample consists of galaxies that are within the bounds of a rectangular parallelepiped in the space of the four SDSS colors (1.0 < u−g < 2.2, 1.2 < g−r < 2.0, 0.1 < r−i < 0.8, and 0.2 < i−z < 0.5), with an extra constraint on r-band magnitude, r < 21. From the SDSS DR12 spectroscopic sample, ≈ 75,000 objects that satisfy these criteria were randomly selected to form TEDDY A. We refer to this sample as our source dataset, where redshifts have been accurately determined, but coverage over the feature space is limited.

TEDDY D is disjoint from TEDDY A, and has been randomly sampled from the entire SDSS DR12 spectroscopic sample to contain ≈ 75,000 galaxies. As it is not limited by the stringent color and magnitude cuts of TEDDY A, its feature space coverage is much wider; it therefore represents a case where training set coverage is not available, i.e. extrapolation is needed. We refer to this sample as our target dataset.

Figure 2 shows the SDSS photometric color and r-band magnitude distributions of our source and target datasets. The cuts in parameter space in the case of the former clearly lead to a significantly narrower coverage than that of the latter. The five features (r magnitude and the u−g, g−r, r−i, and i−z colors) comprise the 5D photometric feature space. Spectroscopic redshifts of all galaxies in the sample are also readily available for training or validation.

Figure 2. Distributions for r-band magnitude and 4 SDSS colors in our source (blue line) and target (green region) galaxy samples, i.e. samples A and D of the TEDDY catalog from [12].

Clearly, our setup models the task of spectroscopic target selection with the goal of optimizing photo-z estimation results. The datasets were explicitly chosen this way to mirror the scenario of using a limited spectroscopic sample to make inferences on an extended photometric sample. It has to be noted, though, that our target set uses spectroscopic-quality observations; therefore the photometric error distributions of our two sets can in fact be assumed similar. A full analysis, taking into account that photometric samples generally have higher photometric errors, as illustrated by the HAPPY catalogue of [12], will be presented in future work.

B. Experimental Strategy

Our experimental strategy is as follows. We begin by randomly sampling |T| = 10,000 instances from the source set (TEDDY A) to make up the initial training set. This number is large enough to guarantee an accurate model, while avoiding significant computational cost. The result is an initial ensemble model F embedding the following regression algorithms: Linear regression Model (LM), Local Linear Regression [12] with (neighborhood size) k = 100 (LLR_100), Local Linear Regression with k = 1000 (LLR_1000), Random Forest (RF), and Support Vector Machines (SVMs).

On the target set (TEDDY D), we sample 5,000 instances to use as the pool for AL (T_U). A separate (mutually exclusive) set of 30,000 instances is sampled to form a validation set. The validation set tracks the performance of photo-z estimation throughout the AL phase, and is never used for training.

Our implementation of the Query by Committee algorithm proceeds in batches of q = 10 (see Algorithm 1). Each batch is made up of the set of instances with highest ambiguity (Eq. 3), and is used to extend the size of the training set T. In this study we give the same importance to all models, such that w_i = 1/s, where s is the number of models in the ensemble (in our experiments, s = 5). The equal-weight policy was adopted simply to test the feasibility of our proposed approach.

³ Available in the COINtoolbox: https://github.com/COINtoolbox/photoz_catalogues
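As an illustration of how such a committee might be instantiated (our sketch only: scikit-learn has no stock local linear regression, so we stand in with k-nearest-neighbor averaging at the stated neighborhood sizes, and the remaining hyperparameters are placeholders):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def make_committee():
    """Five equal-weight members mirroring LM, LLR_100, LLR_1000, RF, SVM."""
    return [
        LinearRegression(),                                  # LM
        KNeighborsRegressor(n_neighbors=100),                # stand-in for LLR_100
        KNeighborsRegressor(n_neighbors=1000),               # stand-in for LLR_1000
        RandomForestRegressor(n_estimators=100, n_jobs=-1),  # RF
        make_pipeline(StandardScaler(), SVR()),              # SVM on scaled features
    ]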


Figure 4. An assessment of the performance of the ensemble model and its constituent models using active learning. Performance diagnostics are shown as a function of the number of queries.

C. Results

We first compare the quality of the predictions made by the ensemble model, and each of the regression algorithms embedded in it, with and without the use of AL. Figure 3 shows different performance plots. Each panel shows the photometric redshift (photo-z) estimates versus their spectroscopic redshift (spec-z) values; an error-free model corresponds to the dashed diagonal line. Columns correspond to different regression methods. Rows capture one of two scenarios: the first row shows the performance of the models with no AL; each model is trained on the source dataset and directly applied to the target dataset. The second row shows the performance of each of the regression models after 250 queries using the QBC approach (Section III). Without the use of AL, predictions appear markedly biased due to the distributional discrepancy between source and target domains. The effect of such disparity is more accentuated with Support Vector Machines, Random Forests and LLR_100. LLR_1000 and the ensemble model seem less affected by the discrepancy, but still produce a more accurate model after training with the selected queried examples.

For a second round of experiments, we assess the performance of different models using standard diagnostics adopted in the astronomical literature [12], [42], [43], as follows: the mean, standard deviation and median absolute deviation (MAD) of the normalized redshift error ∆z_norm = (z_spec − z_photo)/(1 + z_spec), where z_photo and z_spec are the photometric and spectroscopic redshifts, respectively, and the outlier rate. Following [12], [42], [43], outliers are defined as those galaxies where |∆z_norm| > 0.15, and the mean ∆z_norm and standard deviation σ(∆z_norm) are computed with the outliers removed from the samples.

Our experiments compare the evolution of the diagnostics described above using the ensemble model and the other committee models. In Figure 4, the dashed (orange) line shows results from the ensemble model as a function of the number of queries, while the other lines represent the different committee models.
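A small helper (our sketch, following the definitions above) that computes these diagnostics from arrays of spectroscopic and photometric redshifts; computing MAD on the full sample is our assumption, since MAD is already robust to outliers:

import numpy as np

def photoz_diagnostics(z_spec, z_photo, outlier_cut=0.15):
    """Mean, std, and MAD of dz_norm = (z_spec - z_photo) / (1 + z_spec),
    plus the outlier rate; mean and std exclude |dz_norm| > outlier_cut."""
    dz = (z_spec - z_photo) / (1.0 + z_spec)
    outlier = np.abs(dz) > outlier_cut
    kept = dz[~outlier]
    return {
        "mean": kept.mean(),
        "std": kept.std(),
        "mad": np.median(np.abs(dz - np.median(dz))),
        "outlier_rate": outlier.mean(),   # fraction with |dz_norm| > 0.15
    }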


Figure 5. Evolution of photometric redshift estimation diagnostics for 3 different labeling strategies, all using the ensemble model.

The left-most panel of Figure 4 shows the mean value of ∆z_norm (i.e. the overall bias), which converges quickly with the number of queries (∼200). Convergence is even faster for the other diagnostics, namely σ(∆z_norm), MAD and the outlier rate (from left to right). The corresponding numerical values, for each diagnostic and model, are shown in Table I.

TABLE I. RESULTS FROM APPLYING QUERY BY COMMITTEE TO THE TEDDY CATALOG.

model      n queries  mean (×10⁻²)  std (×10⁻²)  MAD (×10⁻²)  outlier rate (%)
Ensemble   0          -1.35         6.45         4.04         0.91
           50         -0.25         4.61         2.56         0.58
           100        -0.34         4.96         2.51         0.48
           150        -0.27         4.74         2.46         0.49
           200        -0.27         4.67         2.44         0.48
           250        -0.25         4.62         2.43         0.50
LM         0          -0.31         5.45         3.67         1.35
           50         -0.34         5.40         3.63         1.32
           100        -0.31         5.36         3.62         1.29
           150        -0.32         5.32         3.59         1.27
           200        -0.34         5.29         3.56         1.25
           250        -0.30         5.25         3.55         1.26
SVM        0          -2.23         8.07         6.03         4.46
           50         -0.22         4.92         3.04         1.10
           100        -0.46         4.89         2.78         0.60
           150        -0.37         4.71         2.65         0.52
           200        -0.33         4.57         2.58         0.51
           250        -0.29         4.44         2.53         0.51
RF         0          -1.20         8.28         8.09         1.86
           50         -0.03         4.93         3.44         0.66
           100        -0.21         4.70         2.89         0.47
           150        -0.14         4.50         2.74         0.48
           200        -0.15         4.38         2.64         0.49
           250        -0.16         4.28         2.60         0.51
LLR_100    0          -3.72         13.94        5.39         3.87
           50         -0.44         7.27         2.83         0.80
           100        -0.55         9.30         2.68         0.63
           150        -0.44         8.27         2.55         0.65
           200        -0.44         8.06         2.51         0.60
           250        -0.43         8.10         2.48         0.62
LLR_1000   0          -1.39         7.13         2.83         0.51
           50         -0.68         5.80         2.59         0.56
           100        -0.50         6.03         2.54         0.51
           150        -0.38         5.79         2.49         0.52
           200        -0.37         5.72         2.47         0.51
           250        -0.35         5.70         2.46         0.52

Clearly, AL helps reduce the number of catastrophic errors considerably, and lowers the scatter and bias for all competitive models after just a couple hundred queries. While such results are encouraging, we must also take into account that the increase itself in the number of points in the training set, even if randomly selected, might impact the diagnostic results.

In Figure 5, we show the evolution of the diagnostics as a function of the number of queries for 3 distinct labeling strategies. The current strategy (dot-dashed, blue) illustrates the effect of randomly selecting points following the initial training distribution⁴. Passive learning is defined as random sampling from the target distribution (dashed, orange), and AL via query by committee (full, green) as described in the previous sections. From the figure, we see that increasing the size of the training sample following the same labeling strategy produces only a marginal effect on the final diagnostics. On the other hand, passive learning improves all diagnostics almost as effectively as AL, with the exception of bias. This is expected, since in passive learning we are randomly sampling from the target distribution, thus directly sampling over regions of the feature space poorly covered by the source distribution (Figure 2).

In order to clarify the differences between passive and AL strategies, we show in Figure 6 the evolution of the r-band magnitude distribution in the query set for each strategy, superimposed on the corresponding distribution in the training (spectroscopic) and target (photometric) sets. In both panels, the light colored curves show early query sets and darker colored curves indicate later query sets, from 10 to 250 queries.

⁴ In the context of the TEDDY catalog, this means randomly querying objects from TEDDY B [12].

Figure 6. Distribution of r-band magnitude in the training/spectroscopic (black-dotted) and target/photometric (blue-dashed) samples. Also shown is the evolution of the queried sample when adopting a passive (green) or active (red) learning approach via QBC. In both panels, the light colored curves correspond to earlier queries and the dark colored ones to late queries. We show here the query set evolution from 10 to 250 queries.

It is clear that when sampling from the target distribution (left panel) all regions of the training feature space are sampled homogeneously, without taking into account the fact that there are already many training points at intermediate r-magnitude values. AL, on the other hand, focuses on querying objects with higher r-band magnitudes in order to compensate for the imbalance between training and target samples in that region. This difference is responsible for the disparity in the diagnostics shown in Figure 5, particularly along bias.

V. SUMMARY AND CONCLUSIONS

We propose active learning (AL) as the next step in the quest for improving galaxy photometric redshift estimation. Our proposed methodology comes as a response to observational constraints that prevent us from ever obtaining a representative spectroscopic sample to be used for training. In the common scenario where the available spectroscopic sample is highly biased towards closer and brighter objects, we propose the use of current spectroscopic samples (source dataset) merely as a resource from which to build an initial regression model; the same model can then be used as a starting point to iteratively build increasingly more accurate models by adding selected objects from the photometric sample (target dataset). Key to the successful application of this procedure is the use of AL to find a relatively small number of objects in the target set that, when added to the existing source set, yield highly predictive models on the target distribution.
Using an approach to AL called Query by Committee (QBC), we show experimental results on datasets designed to capture the distributional discrepancy between spectroscopic and photometric observations. Results show a significant degree of model refinement as an iterative process adds queried examples into the training set. Specifically, we conclude that (1) a few hundred queried galaxies suffice to generate accurate regression models on the photometric (target) domain; and (2) AL lowers the number of catastrophic errors (or outlier rate) for all competitive models after a few hundred queries. Overall, we observe strong evidence in favor of AL as a means to compensate for the distributional discrepancy between spectroscopic and photometric observations.

As future work we plan to look for alternative approaches to AL in regression. This is an area poorly explored in the machine learning community, where most of the attention has focused on the classification task. Additionally, we plan to look for patterns across the set of queried instances that would enable us to anticipate which galaxies are in need of spectroscopic analysis in real time, just as observations are being recorded. Finally, we plan to estimate photo-z for all SDSS galaxies as proposed here, so as to build luminosity functions, and to construct a more accurate map of the large-scale structure of the universe.

ACKNOWLEDGMENTS

This work was partly supported by the Center for Advanced Computing and Data Systems (CACDS), and by the Texas Institute for Measurement, Evaluation, and Statistics (TIMES) at the University of Houston.

We thank the IAA Cosmostatistics Initiative (COIN)⁵, where our interdisciplinary research team was formed. COIN is a non-profit organization whose aim is to nourish the synergy between astrophysics, cosmology, statistics and machine learning communities. EEOI and RSS thank Bruno Quint for suggesting the query evolution visualization.

⁵ http://cointoolbox.github.io/

REFERENCES

[1] E. Hubble, "A Relation between Distance and Radial Velocity among Extra-Galactic Nebulae," Proceedings of the National Academy of Science, vol. 15, pp. 168–173, Mar. 1929.
[2] K. C. Chambers, E. A. Magnier, N. Metcalfe, H. A. Flewelling, M. E. Huber, C. Z. Waters, L. Denneau, P. W. Draper, et al., "The Pan-STARRS1 Surveys," ArXiv e-prints, Dec. 2016.
[3] Dark Energy Survey Collaboration, T. Abbott, F. B. Abdalla, J. Aleksić, S. Allam, A. Amara, D. Bacon, E. Balbinot, et al., "The Dark Energy Survey: more than dark energy - an overview," MNRAS, vol. 460, pp. 1270–1299, Aug. 2016.
[4] Z. Ivezic, J. A. Tyson, B. Abel, E. Acosta, R. Allsman, Y. AlSayyad, S. F. Anderson, J. Andrew, et al., "LSST: from Science Drivers to Reference Design and Anticipated Data Products," ArXiv e-prints, May 2008.
[5] H. Hildebrandt, C. Wolf, and N. Benítez, "A blind test of photometric redshifts on ground-based data," A&A, vol. 480, no. 3, pp. 703–714, 2008.
[6] A. Krone-Martins, E. E. O. Ishida, and R. S. de Souza, "The first analytical expression to estimate photometric redshifts suggested by a machine," MNRAS, vol. 443, pp. L34–L38, Sep. 2014.
[7] J. Elliott, R. S. de Souza, A. Krone-Martins, E. Cameron, E. E. O. Ishida, and J. Hilbe, "The overlooked potential of Generalized Linear Models in astronomy-II: Gamma regression and photometric redshifts," A&C, vol. 10, pp. 61–72, Apr. 2015.
[8] A. Petri, M. May, and Z. Haiman, "Cosmology with photometric weak lensing surveys: Constraints with redshift tomography of convergence peaks and moments," Physical Review D, vol. 94, no. 6, p. 063534, Sep. 2016.
[9] C. Blake, A. Collister, S. Bridle, and O. Lahav, "Cosmological baryonic and matter densities from 600000 SDSS luminous red galaxies with photometric redshifts," MNRAS, vol. 374, pp. 1527–1548, Feb. 2007.
[10] R. Kessler, D. Cinabro, B. Bassett, B. Dilday, J. A. Frieman, P. M. Garnavich, S. Jha, J. Marriner, R. C. Nichol, M. Sako, M. Smith, J. P. Bernstein, D. Bizyaev, A. Goobar, S. Kuhlmann, D. P. Schneider, and M. Stritzinger, "Photometric Estimates of Redshifts and Distance Moduli for Type Ia Supernovae," ApJ, vol. 717, pp. 40–57, Jul. 2010.
[11] N. Benítez, "Bayesian Photometric Redshift Estimation," ApJ, vol. 536, pp. 571–583, Jun. 2000.
[12] R. Beck, C. A. Lin, E. E. O. Ishida, F. Gieseke, R. S. de Souza, M. Costa-Duarte, M. W. Hattab, and A. Krone-Martins, "On the realistic validation of photometric redshifts," MNRAS, vol. 468, no. 4, p. 4323, 2017.
[13] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, "Analysis of representations for domain adaptation," in Proceedings of the 19th International Conference on Neural Information Processing Systems, ser. NIPS'06. Cambridge, MA, USA: MIT Press, 2006, pp. 137–144.
[14] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, "A theory of learning from different domains," Mach. Learn., vol. 79, no. 1-2, May 2010.
[15] H. Daumé, III and D. Marcu, "Domain adaptation for statistical classifiers," J. Artif. Int. Res., vol. 26, no. 1, pp. 101–126, May 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1622559.1622562
[16] Y. Mansour, M. Mohri, and A. Rostamizadeh, Domain Adaptation: Learning Bounds and Algorithms, 2009.
[17] Y. Shi and F. Sha, "Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation," ArXiv e-prints, Jun. 2012.
[18] K. D. Gupta, R. Pampana, R. Vilalta, E. E. O. Ishida, and R. S. de Souza, "Automated supernova Ia classification using adaptive learning techniques," in 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Dec. 2016, pp. 1–8.
[19] G. Matasci, D. Tuia, and M. Kanevski, "SVM-based boosting of active learning strategies for efficient domain adaptation," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 5, pp. 1335–1343, Oct. 2012.
[20] C. Persello and L. Bruzzone, "Active learning for domain adaptation in the supervised classification of remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 11, pp. 4468–4483, 2012.
[21] P. Rai, A. Saha, H. Daumé, III, and S. Venkatasubramanian, "Domain adaptation meets active learning," in Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, ser. ALNLP '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 27–32.
[22] A. Saha, P. Rai, H. Daumé, S. Venkatasubramanian, and S. L. DuVall, "Active supervised domain adaptation," in Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part III, ser. ECML PKDD'11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 97–112.
[23] X. Shi, W. Fan, and J. Ren, Actively Transfer Domain Knowledge. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 342–357.
[24] X. Wang, T.-K. Huang, and J. Schneider, "Active transfer learning under model shift," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), T. Jebara and E. P. Xing, Eds. JMLR Workshop and Conference Proceedings, 2014, pp. 1305–1313. [Online]. Available: http://jmlr.org/proceedings/papers/v32/wangi14.pdf
[25] B. Settles, Active Learning. Morgan & Claypool, 2012.
[26] M. F. Balcan, A. Beygelzimer, and J. Langford, "Agnostic active learning," Journal of Computer and System Sciences, vol. 75, no. 1, pp. 78–89, 2009.
[27] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 129–145, 1996.
[28] T. Solorio, O. Fuentes, R. Terlevich, and E. Terlevich, "An active instance-based machine learning method for stellar population studies," MNRAS, vol. 363, pp. 543–554, Oct. 2005.
[29] J. W. Richards, D. L. Starr, H. Brink, A. A. Miller, J. S. Bloom, N. R. Butler, J. B. James, J. P. Long, and J. Rice, "Active Learning to Overcome Sample Selection Bias: Application to Photometric Variable Star Classification," ApJ, vol. 744, p. 192, Jan. 2012.
[30] D. Masters, P. Capak, D. Stern, O. Ilbert, M. Salvato, S. Schmidt, G. Longo, J. Rhodes, et al., "Mapping the Galaxy Color-Redshift Relation: Optimal Photometric Redshift Calibration Strategies for Cosmology Surveys," ApJ, vol. 813, p. 53, Nov. 2015.
[31] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma," Neural Comput., vol. 4, no. 1, pp. 1–58, Jan. 1992.
[32] H. S. Seung, M. Opper, and H. Sompolinsky, "Query by committee," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ser. COLT '92. New York, NY, USA: ACM, 1992, pp. 287–294.
[33] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, "Selective sampling using the query by committee algorithm," Mach. Learn., vol. 28, no. 2-3, pp. 133–168, 1997.
[34] R. Burbidge, J. J. Rowland, and R. D. King, "Active learning for regression based on query by committee," in Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning. Springer-Verlag, 2007, pp. 209–218.
[35] A. Krogh and J. Vedelsby, "Neural network ensembles, cross validation and active learning," in Proceedings of the 7th International Conference on Neural Information Processing Systems, ser. NIPS'94. MIT Press, 1994, pp. 231–238.
[36] N. Abe and H. Mamitsuka, "Query learning strategies using boosting and bagging," in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML '98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 1–9.
[37] P. Melville and R. J. Mooney, "Diverse ensembles for active learning," in Proceedings of the Twenty-first International Conference on Machine Learning, ser. ICML '04. New York, NY, USA: ACM, 2004, pp. 74–.
[38] I. Muslea, S. Minton, and C. A. Knoblock, "Active + semi-supervised learning = robust multi-view learning," in Proceedings of the Nineteenth International Conference on Machine Learning, ser. ICML '02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 435–442.
[39] D. Kale and Y. Liu, "Accelerating active learning with transfer learning," in 2013 IEEE 13th International Conference on Data Mining, 2013, pp. 1085–1090.
[40] C. J. MacDonald and G. Bernstein, "Photometric Redshift Biases from Galaxy Evolution," Publications of the Astronomical Society of the Pacific, vol. 122, pp. 485–489, Apr. 2010.
[41] S. Alam, F. D. Albareti, C. Allende Prieto, F. Anders, S. F. Anderson, T. Anderton, B. H. Andrews, E. Armengaud, É. Aubourg, S. Bailey, et al., "The Eleventh and Twelfth Data Releases of the Sloan Digital Sky Survey: Final Data from SDSS-III," ApJS, vol. 219, p. 12, Jul. 2015.
[42] H. Hildebrandt, S. Arnouts, P. Capak, L. A. Moustakas, C. Wolf, F. B. Abdalla, R. J. Assef, M. Banerji, et al., "PHAT: PHoto-z Accuracy Testing," A&A, vol. 523, p. A31, Nov. 2010.
[43] T. Dahlen, B. Mobasher, S. M. Faber, H. C. Ferguson, G. Barro, S. L. Finkelstein, K. Finlator, A. Fontana, et al., "A Critical Assessment of Photometric Redshift Methods: A CANDELS Investigation," ApJ, vol. 775, p. 93, Oct. 2013.