A Developmental Method that Computes Optimal Networks without Post-Selections

Juyang Weng∗†‡§
∗Department of Computer Science and Engineering, †Cognitive Science Program, ‡Neuroscience Program, Michigan State University, East Lansing, MI 48824 USA
§GENISAMA LLC, Okemos, MI 48864 USA

Abstract—This work presents a theory of the Post-Selection practices that have rarely been studied. Post-Selections are selections of systems after the systems have been trained. Post-Selections Using Validation Sets (PSUVS) are wasteful, and Post-Selections Using Test Sets (PSUTS) are wasteful and unethical. Both result in systems whose generalization powers are weak. PSUTS falls into two kinds, machine PSUTS and human PSUTS. The connectionist AI school has received criticism for its "scruffiness" due to a huge number of network parameters, and now for its machine PSUTS; the seemingly "clean" symbolic AI school appears more brittle because of its human PSUTS. This paper analyzes why error-backprop methods with random initial weights suffer from severe local minima, why PSUTS violates well-established research ethics, and why publications that used PSUTS should have transparently reported it. This paper proposes a Developmental Methodology that trains only a single but optimal network for each application lifetime, using a new standard for performance evaluation in machine learning, called developmental errors, for all networks trained in a project that the selection of the luckiest network depends on, along with Three Learning Conditions: (1) framework restrictions, (2) training experience, and (3) computational resources. This paper also discusses how the brain-inspired Developmental Networks (DNs) avoid PSUTS by reporting developmental errors and their maximum likelihood (ML) optimality under the Three Learning Conditions. DNs are not "scruffy" because they are ML-estimators of the observed Emergent Turing Machines at each time during their "lives". This implies the best performance given a limited amount of overall computational resources available to a project.

I. INTRODUCTION

In 2000, the author published with six (6) co-authors what is called "Autonomous Mental Development (AMD)" in Science [1]. The Science editor commented that developmental AI had not been known before—the article published a new direction, "task-nonspecific programs across lifetime". All programs before then were task-specific, including all published neural networks, due to the PSUTS to be explained below. A series of advances has been made since 2000. The following is only an outline of the most important events. For a more detailed account of the entire picture, the reader is referred to a book by the author [2].

In 2017, the author and his coworkers presented emergent universal Turing machines [3]. This is the first emergent model of Auto-Programming for General Purposes (APFGP), along with experimental results.

Last year, 2020, the author argued that APFGP enables conscious machines, the first model of conscious machines based on emergent universal Turing machines [4].

Here, this author presents a controversial practice called Post-Selection Using Test Sets (PSUTS). This report also explains why the author's brain model (Developmental Networks, DNs) avoids PSUTS. Using PSUTS means that the corresponding project wastes much computation and manpower, and the superficial error of the reported system is misleading because the system does not give a similar error on new data sets. This paper not only raises a controversy but also presents a solution to avoid PSUTS.

Since 2012, AI has attracted much attention from the public and the media. A large number of AI projects have been published. If the authors of these projects use the method of this report, they could benefit much from reducing the time and manpower used to reach their target systems, as well as from improving the generalization powers of their target systems.

Let us first discuss in Sec. II what PSUTS is. Sec. III reasons why error backprop needs PSUTS. Sec. IV explains why Developmental Networks (DNs) do not need and avoid PSUTS. Sec. V provides details of the methodology. Sec. VI outlines experiments. Sec. VII presents concluding remarks.
II. POST-SELECTIONS

Many machine-learning methods have been evaluated without considering how much computational resource is necessary for the development of a reported system. Thus, comparisons of system performance have been biased toward a competition in how much resource a group has, as we will see below after we define Post-Selection, regardless of how many networks have been trained and discarded and how large each network is. Worse still, test sets have been used in a controversial way.

Here we explicitly define a set of Three Learning Conditions for developing an AI system: (1) the framework restrictions, including whether the framework is task-specific or task-nonspecific, whether learning is batch or incremental, and the body's sensors and effectors; (2) the teaching experience; (3) the computational resources, including the number of hidden neurons.

A. Post-Selections

The available data set D is divided into three mutually disjoint sets: a training set T, a validation set V, and a test set T′. Two sets are disjoint if they do not share any elements. Let us consider how the validation sets and test sets are used in experiments.

A network architecture has a set of parameters represented by a vector, where each component corresponds to an architecture parameter, such as the kernel size and stride value at each level of a deep hierarchy, the neuronal learning rate, the neuronal learning momentum value, etc. Let k be the finite number of grid points over which such architecture parameter vectors need to be tried, A = {a_i | i = 1, 2, ..., k}. If there are 10 parameters in each architecture and each of them has 10 grid points to try, there is a total of k = 10^10 = 10B architecture parameter vectors to try, an extremely large number.

For each architecture vector a_i, assume n sets of random weights w_j, resulting in kn networks

{N(a_i, w_j) | i = 1, 2, ..., k, j = 1, 2, ..., n}

that are trained, each of which starts with a different set of random weights w_j, using error backprop that locally and numerically minimizes the fitting error f_{i,j} on the training set T. Graves et al. [5] seem to have mentioned n = 20. With the above example of k = 10B, kn = 200B, a huge number that requires a lot of computational resources and manpower.

Let us define Post-Selection. Suppose that the trainer is first aware of the validation sets and the test sets. He trains multiple systems using the training sets. After these systems have been trained, he post-selects a system by searching, manually or assisted by computers, among the trained systems based on the validation sets or the test sets. This is called Post-Selection—selection after training.

Obviously, a post-selection wastes all trained systems except the selected one. As we will see next, a system from post-selection has a weak generalization power.
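To make the scale of this practice concrete, the following sketch (with hypothetical helper names train_network and error_on; it is an illustration of the criticized procedure, not a recommendation) enumerates the k architecture vectors and the n random initializations and keeps only the luckiest of the kn trained networks.

```python
def post_selection(arch_grid, n_seeds, train_network, error_on, selection_set):
    """Train k*n networks and keep only the one with the lowest error on
    selection_set (a validation set V for PSUVS, a test set T' for PSUTS).
    The remaining k*n - 1 trained networks are silently discarded."""
    best_net, best_err = None, float("inf")
    for a_i in arch_grid:                        # k architecture parameter vectors
        for j in range(n_seeds):                 # n sets of random initial weights
            net = train_network(a_i, seed=j)     # error backprop on the training set T
            err = error_on(net, selection_set)   # peeking at V or T'
            if err < best_err:
                best_net, best_err = net, err
    return best_net, best_err                    # only this luckiest pair is reported
```

With k = 10^10 and n = 20, the loop body would run 2 × 10^11 times, which is why Post-Selection amounts to a competition in computational resources.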
First, consider Post-Selections Using Validation Sets.

B. PSUVS

A Machine PSUVS is defined as follows: If the test set T′ is not available, suppose the validation error of N(a_i, w_j) is e_{i,j} on the validation set V. Find the best network N(a_{i*}, w_{j*}) so that it reaches the minimum validation error,

e_{i*,j*} = min_{1 ≤ i ≤ k, 1 ≤ j ≤ n} e_{i,j},   (1)

and report only the performance e_{i*,j*} but not the performances of the remaining kn − 1 trained neural networks.

This is obviously a violation of the principle of cross-validation. Because each vector a_i uses n sets of random weights, an average of e_{i,j} over n for each a_i should be used instead of the minimum in Eq. (1) if we want to fairly eliminate luck from the result. This leads to

ē_{i*,j*} = min_{1 ≤ i ≤ k} (1/n) Σ_{j=1}^{n} e_{i,j}.   (2)

Similarly, a Human PSUVS is one by which a human selects a system from multiple trained systems using their validation errors.

Since a PSUVS procedure picks the best system based on the errors on the validation set, the resulting system might not do well on the test sets, because doing well on validation sets does not guarantee doing well on test sets.
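A minimal numeric sketch of the difference between Eq. (1) and Eq. (2), assuming the kn validation errors are arranged as a k × n array: Eq. (1) takes the minimum over all kn networks, whereas Eq. (2) first averages over the n random initializations of each architecture and only then minimizes over the k architectures.

```python
import numpy as np

def psuvs_min_vs_average(e):
    """e: k x n array with e[i, j] = validation error of N(a_i, w_j)."""
    e = np.asarray(e, dtype=float)
    eq1 = e.min()                   # Eq. (1): the luckiest of all kn networks
    eq2 = e.mean(axis=1).min()      # Eq. (2): the best architecture after averaging out luck
    return eq1, eq2

# Example with k = 3 architectures and n = 4 random initializations each.
errors = [[0.30, 0.08, 0.25, 0.28],
          [0.15, 0.14, 0.16, 0.15],
          [0.22, 0.21, 0.20, 0.23]]
print(psuvs_min_vs_average(errors))   # Eq. (1) rewards the lucky 0.08; Eq. (2) prefers 0.15
```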

Worse is Post-Selection Using Test Sets (PSUTS). There are two kinds of PSUTS, machine PSUTS and human PSUTS.

C. Machine PSUTS

A Machine PSUTS is defined as follows: If the test set T′ is available, which seems to be true for almost all neural network publications, suppose the test error of N(a_i, w_j) is e′_{i,j} on the test set T′. Find the best network N(a_{i*}, w_{j*}) so that it reaches the minimum test error,

e′_{i*,j*} = min_{1 ≤ i ≤ k, 1 ≤ j ≤ n} e′_{i,j},   (3)

and report only the performance e′_{i*,j*} but not the performances of the remaining kn − 1 trained neural networks.

Imagine that we want to remove luck from the above expression by using averages, as we did in Eq. (2):

ē′_{i*,j*} = min_{1 ≤ i ≤ k} (1/n) Σ_{j=1}^{n} e′_{i,j}.   (4)

But the above error is still unethical and misleading, since each term under the minimization over a_i has peeked into the test set!

There are some variations of Machine PSUTS in which the validation set V or the test set T′ is not disjoint with T. If T = V, we call it validation-vanished PSUTS. If T = T′, we call it test-vanished PSUTS.

A distribution of fitting errors, validation errors and test errors is defined as follows: the distributions of all kn trained networks' fitting errors {f_{ij}}, validation errors {e_{ij}}, and test errors {e′_{ij}}, i = 1, 2, ..., k, j = 1, 2, ..., n, as well as the values of k and n.

It is necessary to present some key statistical characteristics of such distributions—for example, ranked errors in decreasing order: the maximum, 75%, 50%, 25%, and minimum values of these kn values for the fitting errors, validation errors, and test errors—so that the research community can see whether error-backprop can avoid local minima in deep learning. Unfortunately, the author did not find such reports. Our experience in experiments indicated that the maximum and the minimum values of the distribution of the fitting errors alone are drastically different, around 80% and 5%, respectively. Section III will discuss why.
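The report asked for in the preceding paragraph can be produced in a few lines. The sketch below assumes the kn fitting, validation, and test errors have been collected into flat arrays and returns the maximum, 75%, 50%, 25%, and minimum values of each distribution.

```python
import numpy as np

def ranked_error_report(fitting, validation, test):
    """Five ranked statistics of the kn trained networks' error distributions,
    so readers can see how widely the local minima reached by error backprop scatter."""
    report = {}
    for name, errs in (("fitting", fitting), ("validation", validation), ("test", test)):
        e = np.asarray(errs, dtype=float)
        report[name] = {"max": e.max(),
                        "75%": np.percentile(e, 75),
                        "50%": np.percentile(e, 50),
                        "25%": np.percentile(e, 25),
                        "min": e.min()}
    return report
```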

Further, such a use of test sets to post-select networks resembles hiring a large number kn of random test takers and reporting only the luckiest N(a_{i*}, w_{j*}) after the answer-based grading. This practice could hardly be acceptable to any test agency or any agency that uses the test scores for admission, since the submitted error e′_{i*,j*} misleads due to its lack of generalization.

The architecture parameter vector a_{i*} and weights w_{j*} over-fit T, V, and T′. If an unobserved data set T″, disjoint with T′, T′ ∩ T″ = ∅, is observed from a new environment, the error rate e″_{i*,j*} of N(a_{i*}, w_{j*}) is predicted to be significantly higher than e′_{i*,j*},

e″_{i*,j*} ≫ e′_{i*,j*},   (5)

because Eq. (3) depends on the test set T′ in the Post-Selection from many networks.

Weng 2020 argued that the claims of some public speakers that such misleading errors have approached or even exceeded human performance seem to be not only questionable but also controversial, since the test sets seem to have been used without sufficient disclosure.

D. Human PSUTS

Instead of writing a search program as in Machine PSUTS, the Human PSUTS defined below typically involves fewer computational resources and programming demands.

A Human PSUTS is defined as follows: After planning or knowing what will be in the training set T and the test set T′, a human post-selects features in networks instead of using a machine to learn such features.

Unfortunately, almost all methods in the symbolic school use Human PSUTS, because it is always the same human who plans for and designs a micro-world (e.g., handcrafting a set of task-specific symbols [6] using graph neural networks) and collects the test set T′ only within the micro-world. A critical criterion for an acceptable test score lies in whether a learned machine goes beyond the micro-world—a key capability of conscious learning [4] achieved by avoiding the use of symbols.

III. WHY ERROR BACKPROP REQUIRES PSUTS

Weng 2021 [7] used mathematical analysis to prove the following theorem.

Theorem 1 (Lacks of Error-Backprop): Error-backprop lacks (1) energy conservation, (2) an age-dependent learning rate, and (3) competition-based role-determination.

The meaning of (3) in Theorem 1 is illustrated by Fig. 1. CNNs do not have a competition mechanism in hidden layers. Complete connections initialized with random weights are provided for all consecutive areas (also called layers), from the input area all the way to the output area. If the z_j neuron is in the output motor area and each output neuron is assigned a single class label, the role of z_j ("dog" in the figure) is determined by the human-supervised label "dog". However, let us assume instead that z_j is in a hidden area and is not responsible for the "dog" class. z_j still updates its input weights using the gradient. Likewise, the pre-synaptic area Y is labeled "neurons without competition". The hidden neurons in this area do not have a competition mechanism to decide the role of each neuron there.

Fig. 1. Lack of role-determination in hidden neurons due to a lack of competition. The same ideas are true for a deeper hierarchy. Color sample images courtesy of [8].

For further detail about why error-backprop requires PSUTS, the reader is referred to Sec. III of [7].
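As a toy illustration of the role-determination issue in Theorem 1(3) and Fig. 1 (the vectors, the learning rate, and the un-gated update rule below are made up for illustration and are not the actual error-backprop equations), contrast an update in which every hidden neuron is pulled toward the current "dog" input with a top-1 competition in which only the best-matching neuron adapts and thereby keeps its role.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5)              # current input feature (say, a "dog" patch)
W = rng.random((3, 5))         # weight vectors of 3 hidden neurons with random initial roles
lr = 0.1

# Without competition: every hidden neuron receives a corrective signal,
# whether or not the "dog" input is its job, so roles stay entangled.
W_no_competition = W + lr * (x - W)

# With top-1 competition: only the best-matching neuron fires and updates,
# so each neuron keeps and refines its own role.
winner = int(np.argmax(W @ x))
W_top1 = W.copy()
W_top1[winner] += lr * (x - W_top1[winner])
```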
, t i Theorem 1 (Lacks of Error-BackProp): Error-backprop lacks i=0 (1) energy conservation, (2) an age-dependent learning rate, Namely, the developmental error, unless stated otherwise and (3) competition based role-determination. for a particular time period, is the average lifetime error. The meaning (3) of Theorem 1 are illustrated by Fig. 1. For more detailed information about the process of errors CNNs do not have a competition mechanism in hidden layers. {et | t ≥ 0}, other statistical characterizations can be utilized, Complete connections initialized with random weights are such as standard deviation, variance, and ranked statistics such provided for all consecutive areas (also called layers), from as minimum, 25%, 50% (median), 75%, and maximum errors.

Because Cresceptron and DNs have a dynamic number of hidden neurons, up to a system memory limit, each new context

c_t ≜ (x_t, y_t, z_t)   (8)

may be significantly different from the nearest matched learned weight vectors of all hidden neurons. If that happens and there are still hidden neurons that have not fired, a new hidden neuron is spawned that perfectly memorizes this new context, regardless of its randomly initialized weights. When all the available hidden neurons have fired at least once, the DN updates the top-k matched neurons optimally in the sense of maximum likelihood (ML), as proven for DN-1 by [11] and for DN-2 by [12].

For more specific time periods, such as the period from time t1 to t2 during which only disjoint tests were made by the teacher and the learning agent is not motor-supervised, the average error is denoted as ē(t1 : t2). Therefore, ē(t) means ē(0 : t).

Note that a developmental system has two input areas from the environment, the sensory area X and the motor area Z. Since there is hardly any sensory input x ∈ X that exactly duplicates at two different time indices, almost all sensory inputs from X are sensory-disjoint. During motor-supervised learning, the teacher supervises the motor area Z and the learner complies. Since a teacher can make an error, the motor error that the teacher made is also recorded as a motor error of the learner, but one due to the teacher.

Fig. 2. How competition automatically assigns roles among hidden neurons without a central controller: the case of automatically constructing a mapping f : X ↦ S. (a) The number of samples in X is larger than the number of hidden neurons, such that each hidden neuron must win-and-fire for multiple inputs. (b) Error-backprop from the "dog" motor neuron asks some hidden neurons to help, but the current input feature is not their job. Thus, error-backprop messes up the role assignment guessed by the random initial weights. The same ideas are true for a deeper hierarchy. Color sample images courtesy of [8].

B. Competition

As discussed above, error-backprop learning is without competition. The main purpose of competition is to automatically assign roles among hidden neurons. Below, we consider two cases: sensory networks, which are simpler, and sensorimotor networks, which are more complex but much more powerful and brain-like.

1) Sensory networks: Let us first consider the case of feed-forward networks as illustrated in Fig. 2. Fig. 2(a) shows a situation where the number of samples in X is larger than the number of hidden neurons, which is typical and natural. Otherwise, if there are sufficiently many hidden neurons, each hidden neuron can simply memorize a single sample x ∈ X.

This means that the total number of hidden neurons must be shared through incremental learning, where each sample image-label pair (x, s) ∈ X × S arrives incrementally through time, t = 0, 1, 2, .... This is the case with Cresceptron, which conducts incremental learning by dealing with image-label pairs one at a time and updating incrementally.

Every layer in Cresceptron consists of image-feature kernels, which is very different from a DN, where each hidden neuron represents a sensorimotor feature, to be discussed later. By image-feature, we mean that each hidden neuron is centered at an image pixel. Competitions take place within the column for a receptive field centered at each pixel, at the resolution of the layer. The resolution reduces from lower layers to higher layers through what was called resolution reduction (drop-out).

The competition in incremental learning is represented by incrementally assigning a new neuronal plane (convolution plane), whose new kernel memorizes the new input pattern if the best-matched neuron in a column does not match sufficiently well. Suppose images x ∈ X arrive sequentially; the top-1 competition in the hidden layer in Fig. 2(a) enables each hidden neuron to respond to multiple features, indicated by the typically multiple upward arrows, one from each image, pointing to a hidden neuron. This amounts to incremental clustering based on top-k competition. The weight vector of each hidden Y neuron corresponds to a cluster in the X space. In Fig. 2(a), k = 1 for top-k competition in Y.
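The following sketch, with made-up parameter values and names, illustrates the incremental top-1 clustering just described together with the neuron spawning of Eq. (8): each hidden neuron's weight vector acts as a cluster center, a new neuron is spawned (up to a memory limit) when the best match is too poor, and otherwise only the winner is updated. It is an illustration, not the Cresceptron or DN implementation.

```python
import numpy as np

class Top1ClusterLayer:
    """Hidden neurons as incrementally learned cluster centers (cf. Fig. 2(a))."""
    def __init__(self, max_neurons, match_threshold=0.9):
        self.centers = []                    # one weight vector per spawned neuron
        self.ages = []                       # firing ages, for age-dependent learning rates
        self.max_neurons = max_neurons
        self.match_threshold = match_threshold

    def observe(self, x):
        x = np.asarray(x, dtype=float)
        x = x / (np.linalg.norm(x) + 1e-12)  # keep inputs on the unit sphere
        if self.centers:
            matches = [float(c @ x) for c in self.centers]
            winner = int(np.argmax(matches))
        if not self.centers or (matches[winner] < self.match_threshold
                                and len(self.centers) < self.max_neurons):
            self.centers.append(x.copy())    # spawn: the new neuron memorizes the new input
            self.ages.append(1)
            return len(self.centers) - 1
        self.ages[winner] += 1
        a = self.ages[winner]
        # winner-only update with retention 1 - 1/a and learning rate 1/a (cf. Eq. (9) below)
        self.centers[winner] = (1 - 1.0 / a) * self.centers[winner] + (1.0 / a) * x
        return winner
```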
Likewise, suppose top-1 competition in the next higher layer, say Z, namely, each time only one Z neuron is supervised to fire at 1 and all other Z neurons do not fire. This results in the connection patterns from the second layer Y to the next higher layer Z.

The Candid Covariance-free Incremental (CCI) Lobe Component Analysis (LCA) in Weng 2009 [13] proved that such automatic assignment of roles through competition results in a dually optimal neuronal layer, optimal spatially and optimal temporally. Optimal spatially means that CCI LCA incrementally computes the first principal component features of the receptive field. Optimal temporally means that the principal component vector has the least expected distance to its target—the optimal estimator in the sense of minimum variance to the true principal component vector.

Intuitively, regardless of what random weights each hidden neuron starts with, as soon as it is spawned to fire, its firing age is a = 1. Its random weight vector is multiplied by the zero retention rate w1 = 1 − 1/a = 0 and the learning rate is w2 = 1/a = 1, so that the new weight vector becomes the first input rx, with r = 1 for the firing winner:

v ← (1 − 1/a) v + (1/a) r x.   (9)

It has been proven that the above expression incrementally computes the first principal component as v. The learning rate w2 = 1/a is the optimal and age-dependent learning rate. CCI LCA is a framework for dually optimal Hebbian learning. The property "candid" corresponds to the fact that the sum of the learning rate w2 = 1/a and the retention rate w1 = 1 − 1/a is always 1, which keeps the "energy" of the response-weighted input rx unchanged (e.g., it does not explode or vanish). This dual optimality resolves the three problems in Theorem 1.
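A minimal numeric sketch of the update in Eq. (9), assuming the response r is the projection of the input onto the current (normalized) estimate, as in the candid formulation; it shows the age-dependent rates 1/a and 1 − 1/a summing to 1 at every step and the estimate converging toward the first principal-component direction. This illustrates the single-neuron update rule only, not the full CCI LCA algorithm of [13].

```python
import numpy as np

def cci_lca_single_neuron(samples):
    """Incrementally estimate the first principal-component direction of
    (roughly zero-mean) samples with the age-dependent update of Eq. (9)."""
    v = np.zeros(len(samples[0]))
    for a, x in enumerate(samples, start=1):               # a is the firing age
        x = np.asarray(x, dtype=float)
        r = 1.0 if a == 1 else float(x @ v) / (np.linalg.norm(v) + 1e-12)
        w1, w2 = 1.0 - 1.0 / a, 1.0 / a                     # retention + learning rates, w1 + w2 == 1
        v = w1 * v + w2 * r * x                             # Eq. (9)
    return v

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 3)) @ np.diag([3.0, 1.0, 0.3])   # dominant first axis
v = cci_lca_single_neuron(data)
print(v / np.linalg.norm(v))   # approximately +/- [1, 0, 0], the first principal component
```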
Fig. 2(b) shows how the three neurons in the Z area update their weights so that the weights from the second area to the third area become the probability of firing, conditioned on the firing of the post-synaptic neuron in area Z (Dog, Cat, Bird, etc.). CCI LCA guarantees that the weights of each Z neuron sum to 1. This automatic role assignment optimally solves the random-role problem in error-backprop established by Theorem 3 of [7].

However, an optimal network for incrementally constructing a mapping f : X ↦ S is too restricted, since f : X ↦ S is only part of what brains can do, not all of it. For the latter, we must address sensorimotor networks.

2) Sensorimotor networks: The main reason that Marvin Minsky [14] complained that neural networks are scruffy was that conventional neural networks lacked not only the optimality described above for sensory networks, but also the ML-optimal Emergent Universal Turing Machines (EUTM) that we now discuss.

First, each neuron in the brain corresponds not only to a sensory feature, as illustrated in Fig. 2, but also to a sensorimotor feature. By sensorimotor feature, we mean that the firing of each hidden neuron in Fig. 2 is determined not just by the current image σ, represented by a sensory vector x ∈ X, but also by the state q, represented by a motor vector z ∈ Z. It is well known that a biological brain contains not only bottom-up inputs from x ∈ X but also top-down inputs from z ∈ Z. In summary, each hidden neuron represents a sensorimotor feature in a complex brain-like network.

C. FAs as sensorimotor mappings

The sensorimotor features are easier to understand if we use symbols. Let us borrow the idea of a Finite Automaton (FA). In an FA, transitions are represented by a function δ : Q × Σ ↦ Q, where Σ is the set of input symbols and Q the set of states. Each transition is represented by (q, σ) → q′.

a) AFA as the control of any Turing machine: Weng 2015 [11] extended the definition of an FA so that it outputs its state, so the resulting FA becomes an Agent FA (AFA). Further, Weng 2015 [11] extended the action q to the machinery of a Turing machine (see Fig. 3), so that the action q includes the output symbol written to the Turing tape and the motion of the read-write head of the Turing machine. With this extension, Weng 2015 [11] proved that any Turing machine is an AFA.

Fig. 3. A Turing machine has a tape, a read-write head, and a transition function with a current state.

Here q ∈ Q is the top-down motor input to a sensorimotor feature neuron and σ is the bottom-up sensory input to the same neuron. If δ has n transitions, n hidden neurons in the Y area are sufficient to memorize all the transitions, which are observed sequentially, one transition at a time.

Since we should not use symbols like σ and q, but rather sensory vectors x ∈ X and motor vectors z ∈ Z, at discrete times t = 0, 1, 2, ... we use the hidden neurons in the Y area to incrementally learn the transitions

(Z(0), Y(0), X(0)) → (Z(1), Y(1), X(1)) → (Z(2), Y(2), X(2)) → ...,   (10)

where → means that the neurons on the right use the neurons on the left as inputs and compete to fire, as explained below, without iterations. Namely, by unfolding time, the spatially recurrent DN becomes non-recurrent in a time-unfolded and time-sampled DN that is ML-optimal and has a large constant complexity O(1), as proven in [11], suited for real-time computation with a large memory and many parallel computing elements.
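To connect the symbolic view with the hidden-neuron view, the sketch below (hypothetical names, and symbols in place of the real-valued vectors a DN actually uses) memorizes each observed transition (q, σ) → q′ in its own hidden "neuron" and retrieves the next state by top-1 matching of the (state, input) context, so n observed transitions occupy at most n neurons.

```python
class EmergentFASketch:
    """Learn an agent FA one observed transition at a time.

    Each hidden 'neuron' memorizes one (state, input) -> next-state transition,
    so n observed transitions need at most n hidden neurons."""
    def __init__(self):
        self.neurons = []                        # [((q, sigma), q_next), ...]

    def observe(self, q, sigma, q_next):
        if self.lookup(q, sigma) is None:        # spawn a neuron for a new context
            self.neurons.append(((q, sigma), q_next))

    def lookup(self, q, sigma):
        for context, q_next in self.neurons:     # top-1 match; exact for symbolic contexts
            if context == (q, sigma):
                return q_next
        return None

# Teach three transitions of a toy FA, then run it on an input string.
fa = EmergentFASketch()
for q, s, q_next in [("start", "a", "s1"), ("s1", "b", "s2"), ("s2", "a", "start")]:
    fa.observe(q, s, q_next)
state = "start"
for symbol in ["a", "b", "a"]:
    state = fa.lookup(state, symbol)
print(state)   # "start"
```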
D. DNs as ML-Optimal Universal Super-Turing Machines

The traditional Turing Machine (TM) is a human-handcrafted machine, as illustrated in Fig. 3. A Universal TM (UTM) is still a TM, but its tape contains two parts, a user-supplied program and the data that the program is applied to. The transition function of the UTM is designed to simulate any program encoded in the form of the transitions of a TM, to apply the program to the data on the tape, and finally to place the output of the program on the data onto the tape.

A UTM is a model for practical computers in commercial markets because the user can write any program, on any set of appropriate data, for the UTM to carry out. Because a DN is an ML-optimal emergent FA, Weng 2015 [11] extended a symbolic Turing machine to a super Turing machine by extending (1) the tape to the real world, (2) the input symbols to vectors from sensors, (3) the output symbols to vector outputs from effectors, and (4) the head motion to any action from the agent. Thus, a DN ML-optimally learns any TM, including a UTM, directly from the physical world.

E. DNs as ML-Optimal for APFGP

Because a DN is an ML-optimal learning engine for any TM, including a UTM, a DN ML-optimally learns any UTM from the physical world, conditioned on the Three Learning Conditions. This means that a DN ML-optimally learns Autonomous Programming for General Purposes (APFGP) [3]. Based on the capability of APFGP, Weng 2020 argued that APFGP is a characterization of conscious machines [4] that boot their skill of consciousness through conscious learning—being conscious while learning across lifetime. Hopefully, APFGP is a clearer and more precise characterization of conscious machines and animals, assuming that we allow a conscious machine to develop its consciousness from infancy.

In the following, we present details in Methods.

V. METHODS

Given the Three Learning Conditions, at each time t, t = 1, 2, ..., a DN incrementally computes the ML-estimator of its parameters that minimizes the developmental error, without doing any iterations.

Let us first review the maximum likelihood estimator for batch data. Let x be the observed data and f_θ(x, z) the probability density function that depends on a vector θ of parameters. The maximum likelihood estimator for θ corresponds to the θ that maximizes the probability density, where θ contains all parameters in the network, including those of the hidden Y area. Regardless of whether z is imposed, z is part of the parameters to be computed as a self-generated version:

(θ*, y*, z*) = argmax_{(θ, y, z)} f_θ(x).   (11)

Since the lifetime estimator is incremental, at each time t the previous state z_{t−1} is self-generated or supervised, and the observation is x_{t−1}. The incremental ML-estimator for θ_t is computed by the incremental version of Eq. (11), where f uses the context c_{t−1} = (x_{t−1}, y_{t−1}, z_{t−1}):

(θ*_t, y*_t, z*_t) = argmax_{(θ_t, y_t, z_t)} f_{θ_t}(x_{t−1}, y_{t−1}, z_{t−1}).   (12)

The DN computes the above expression for each time t in closed form, without conducting any iterations.

How about initial weights? Inside θ, the weights of the DN are initialized randomly at t = 0. There are k + 1 initial neurons in the Y area, and V = {v̇_i | i = 1, 2, ..., k + 1} is the set of current synaptic vectors in Y. Whenever the network takes an input p, it computes the pre-responses in Y. If the top-1 winner in Y has a pre-response lower than the almost-perfect match m(t), a free neuron is activated to fire. Eq. (9) showed that the initial weights of this free neuron are multiplied by zero and therefore do not affect its updated weights.

Weng [11] proved that DN-1 computes the ML-estimator of all observations from the sensory space X and the motor space Z using a large constant time complexity for each time t. Although a DN learns incrementally, such a DN is error-free for learning any complex Turing machine, including any universal Turing machine.
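The paragraphs above compress one DN update into a single rule: compute the pre-responses, compare the top-1 winner against the almost-perfect-match threshold m(t), spawn a free neuron if the match is too poor, and otherwise update the winner with the age-dependent rates of Eq. (9). The sketch below is a simplified, single-area illustration with hypothetical names; it is not the full DN-1 or DN-2 algorithm of [11], [12].

```python
import numpy as np

def dn_hidden_step(centers, ages, context, m_t):
    """One incremental, iteration-free update of a simplified hidden (Y) area.

    centers : list of unit-length synaptic vectors, one per neuron that has fired
    ages    : firing ages of those neurons
    context : current context vector, e.g. the concatenation of x_{t-1} and z_{t-1}
    m_t     : almost-perfect-match threshold for activating a free neuron
    """
    p = np.asarray(context, dtype=float)
    p = p / (np.linalg.norm(p) + 1e-12)
    pre_responses = [float(c @ p) for c in centers]
    if not centers or max(pre_responses) < m_t:
        centers.append(p.copy())          # a free neuron fires; its random init is overwritten
        ages.append(1)
        return len(centers) - 1
    winner = int(np.argmax(pre_responses))
    ages[winner] += 1
    a = ages[winner]
    centers[winner] = (1 - 1.0 / a) * centers[winner] + (1.0 / a) * p   # Eq. (9) with r = 1
    return winner
```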
VI. DN EXPERIMENTS

The recent experimental results [15], [16] of the DN work discussed here include (1) vision, including simultaneous recognition and detection as well as vision-guided navigation on MSU campus walkways; (2) audition, learning phonemes with a simulated cochlea and the corresponding behaviors; (3) acquisition of English and French in an interactive bilingual environment, as well as the corresponding behaviors; and (4) exploration in a simulated maze environment with autonomous learning for vision, path cost, planning, and selection of the least-cost plan. Due to the maximum likelihood optimality, there are no local-minima problems for these highly nonlinear systems, and the performance data were impressive [15], [16].

GENISAMA LLC, a startup that the author created, has produced a series of real-time products as human-wearable robots. They are the first products ever to exist as APFGP robots, for practitioners to produce various kinds of intelligent auto-programmed software.

VII. CONCLUSIONS

Although the public and media have gained an impression that deep learning has approached or even exceeded human-level performance on certain tasks (such as object recognition from static images), this paper has raised PSUTS, which seems to question such claims. The author hopes that the exposure of PSUTS is beneficial to AI credibility and the future healthy development of AI, especially with the concept of developmental errors and the framework of ML-optimal lifetime learning. In a model sense, each brain is probably ML-optimal if we consider the DN model here, but what affects human brains and developmental learning machines includes other factors such as the bodies, learning experiences, and computational resources.

REFERENCES

[1] Weng, J. et al. Autonomous mental development by robots and animals. Science 291, 599–600 (2001).
[2] Weng, J. Natural and Artificial Intelligence: Introduction to Computational Brain-Mind, 2nd edn. (BMI Press, Okemos, Michigan, 2019).
[3] Weng, J. Autonomous programming for general purposes: Theory. International Journal of Humanoid Robotics 17, 1–36 (2020).
[4] Weng, J. Conscious intelligence requires developmental autonomous programming for general purposes. In Proc. IEEE International Conference on Development and Learning and Epigenetic Robotics, 1–7 (Valparaiso, Chile, 2020).
[5] Graves, A. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016).
[6] Raayoni, G. et al. Generating conjectures on fundamental constants with the Ramanujan machine. Nature 590, 1–176 (2021).
[7] Weng, J. On post selections using test sets (PSUTS) in AI. In Proc. International Joint Conference on Neural Networks, 1–8 (Shenzhen, China, 2021).
[8] Russakovsky, O. et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252 (2015).
[9] Weng, J., Ahuja, N. & Huang, T. S. Learning recognition and segmentation using the Cresceptron. International Journal of Computer Vision 25, 109–143 (1997).
[10] Weng, J. Why have we passed "neural networks do not abstract well"? Natural Intelligence: the INNS Magazine 1, 13–22 (2011).
[11] Weng, J. Brain as an emergent finite automaton: A theory and three theorems. International Journal of Intelligent Science 5, 112–131 (2015).
[12] Weng, J., Zheng, Z. & Wu, X. Developmental Network Two, its optimality, and emergent Turing machines. U.S. Provisional Patent Application Serial Number 62/624,898 (2018). Published.
[13] Weng, J. & Luciw, M. Dually optimal neuronal layers: Lobe component analysis. IEEE Trans. Autonomous Mental Development 1, 68–85 (2009).
[14] Minsky, M. Logical versus analogical or symbolic versus connectionist or neat versus scruffy. AI Magazine 12, 34–51 (1991).
[15] Zheng, Z., Wu, X. & Weng, J. Emergent neural Turing machine and its visual navigation. Neural Networks 110, 116–130 (2019).
[16] Weng, J., Zheng, Z., Wu, X. & Castro-Garcia, J. Auto-programming for general purposes: Theory and experiments. In Proc. International Joint Conference on Neural Networks, 1–8 (Glasgow, UK, 2020).