
Volume 11 Number 2 Jun 2018

Published by

Airiti Press Taipei office: 18F, No. 80, Sec. 1, Chenggong Rd., Yonghe Dist., New Taipei City 23452, Taiwan, R.O.C.

International Journal of Intelligent Technologies and Applied Statistics

Vol. 11, No. 2 Jun. 2018 ISSN 1998-5010

Copyright © 2018 by Airiti Press. All rights reserved.

To subscribe write to: International Journal of Intelligent Technologies and Applied Statistics (IJITAS), 18F, No. 80, Sec. 1, Chenggong Rd., Yonghe Dist., New Taipei City 23452, Taiwan, R.O.C. Phone: (886) 2-2926-6006, Fax: (886) 2-2923-5151 E-mail: [email protected]

Printed in Taiwan.

International Journal of Intelligent Technologies and Applied Statistics

Vol. 11, No. 2 Jun. 2018

CONTENTS

Lotfi Zadeh: A Pioneer in AI, a Pioneer in Statistical Analysis, a Pioneer in Foundations of Mathematics, and a True Citizen of the World 87 Vladik Kreinovich

Why Learning Has Aha-Moments and Why We Should Also Reward Effort, Not Just Results 97 Gerardo Uranga, Vladik Kreinovich and Olga Kosheleva

Why 70/30 or 80/20 Relation Between Training and Testing Sets: A Pedagogical Explanation 105 Afshin Gholamy, Vladik Kreinovich and Olga Kosheleva

Why Skew Normal: A Simple Pedagogical Explanation 113 José Guadalupe Flores Muñiz, Vyacheslav V. Kalashnikov, Nataliya Kalashnykova, Olga Kosheleva and Vladik Kreinovich

Willingness-to-pay for FTTH and the Quality of Experience from OTT Media Streaming Services: The Case Study of Thailand 121 Tatcha Sudtasan

Optimal Initial Values in Maximum Likelihood Estimation of Logistic Regression Models 143 Seksiri Niwattisaiwong and Komsan Suriya

International Journal of Intelligent Technologies and Applied Statistics

Vol.11, No.2 (2018) pp.87-96, DOI:10.6148/IJITAS.201806_11(2).0001 © Airiti Press

Lotfi Zadeh: A Pioneer in AI, a Pioneer in Statistical Analysis, a Pioneer in Foundations of Mathematics, and a True Citizen of the World

Vladik Kreinovich*

Department of Computer Science, University of Texas at El Paso, Texas, USA

ABSTRACT Everyone knows Lotfi Zadeh (1921–2017) as the Father of Fuzzy Logic. There have been—and will be—many papers on this important topic. What I want to emphasize in this paper is that his ideas go way beyond fuzzy logic: he was a pioneer in AI; he was a pioneer in statistical analysis; and he was a pioneer in foundations of mathematics. My goal is to explain these ideas to non-fuzzy folks. I also want to emphasize that he was a true Citizen of the World.

Keywords: Fuzzy logic; AI; Statistical analysis; Foundations of mathematics; Lotfi Zadeh

1. How Lotfi Zadeh became an AI pioneer

Practical problem. Lotfi A. Zadeh (1921–2017) was a specialist in control and systems. His textbook Linear System Theory: The State Space Approach (Zadeh & Desoer, 2008) was a classic. It provided optimal solutions to many important control problems—optimal within the existing models. But, surprisingly, in many practical situations, "optimal" control was worse than control by human experts. Clearly, something was missing from the corresponding models. So Zadeh asked experts what was missing. Many experts explained what was wrong with the "optimal" control. However, these explanations were given in imprecise natural-language terms. For example, an expert driver can say: "if the car in front is close, and if this car slows down a little bit, then the driver should hit the brakes slightly." Until Zadeh, engineers would try to extract a precise strategy from the expert; they would ask an expert:

* Corresponding author: [email protected]


●● The car in front is 5 m away.
●● It slows down from 60 to 55 km/h.
●● For how long and with what force should we hit the brakes?

Problem with the traditional AI approach. Most people cannot answer the above question. Those who answer give a somewhat random number—and a different number every time.
●● If we implement exactly this force, we get a weird control—much worse than when a human drives.
●● If we instead apply optimization, the resulting control is optimal for the exact weight of the car, but if a new passenger enters the car, the problem changes: the previously optimal control is no longer optimal. As a result, this control can be really bad.
●● If we simply ignore expert rules, we also get a suboptimal control.
Also, what we want is imprecise. For example, for an elevator, we want a smooth ride, but it is difficult to describe this in precise terms.

What Zadeh proposed to solve this problem: Main idea. Zadeh had an idea: in situations when we can only extract imprecise (fuzzy) rules from the experts, instead of ignoring these rules, let us develop techniques that transform these fuzzy rules into a precise control strategy. Zadeh invented the corresponding technique—it is the technique he called fuzzy logic.

Zadeh's idea illustrated via a simple example. Zadeh's technique can be illustrated on a simple example of a thermostat:
●● If we turn the knob to the right, the temperature T increases.
●● If we turn it to the left, the temperature decreases.

●● Our goal is to maintain a comfortable temperature T0.
Experts can formulate rules on how the angle u to which we rotate the knob depends on T:
●● If the temperature is practically comfortable, no control is needed.
●● If the temperature is slightly higher than desired, cool the room a little bit.
●● If the temperature is slightly lower than desired, heat up the room a little bit, etc.
In terms of the difference x ≝ T – T0:
●● If x is negligible, u should be negligible.
●● If x is small positive, then u should be small negative.
●● If x is small negative, then u should be small positive, etc.
By using abbreviations N for "negligible," SP for "small positive," and SN for "small negative," we get: N(x) ⇒ N(u); SP(x) ⇒ SN(u); SN(x) ⇒ SP(u); ...
A control u is reasonable for a given x (R(x, u)) if one of these rules is applicable:
R(x, u) ⇔ (N(x) & N(u)) ∨ (SP(x) & SN(u)) ∨ (SN(x) & SP(u)) ∨ ...


To translate this into a precise formula, we need:
●● To translate N(x), N(u), ... into precise terms,
●● To interpret "and" and "or," and then
●● To translate the resulting property R(x, u) into a single control value u.
Since we deal with "and" and "or," this technique is related to logic. Since we deal with imprecise ("fuzzy") statements, Zadeh called it fuzzy logic; see, e.g., Belohlavek, Dauben, and Klir (2017), Klir and Yuan (1995), Mendel (2017), Nguyen and Walker (2006), Novák, Perfilieva, and Močkoř (1999), and Zadeh (1965).

Three stages of the fuzzy logic technique. Let us explain all three stages of the fuzzy logic technique.
(1) First stage: How can we interpret "x is negligible"? For traditional (precise) properties like "x > 5°," the property is either true or false. Here, to some folks, a difference of 5 degrees is negligible, while others feel a difference of 2 degrees. And no one can select an exact threshold—so that, say, 1.9° is negligible but 2.0° is not. Similarly, there is no exact threshold separating "close" from "not close." At best, an expert can mark the degree to which x is negligible on a scale from, say, 0 to 10. If an expert marks 7 on a scale from 0 to 10, we say that his degree of confidence that x is negligible is 7/10. This way, we can find the degrees of N(x), N(u), SP(x), etc.
(2) Second stage: Based on the degrees obtained on the first stage, we need to estimate the degrees of propositional combinations N(x) & N(u), etc. Ideally, we could ask the expert for the degrees of all such combinations. However, for n basic statements, there are 2^n such combinations. For n = 30, we have 2^30 ≈ 10^9 combinations. It is not possible to ask 10^9 questions. So, we need to be able to estimate the degree d(A & B) based on the degrees a = d(A) and b = d(B). The algorithm d(A & B) ≈ f&(a, b) for such an estimation is known as an "and"-operation. For historical reasons, "and"-operations are also known as t-norms. What are natural properties of "and"-operations?
●● Since A & B means the same as B & A, this operation must be commutative: f&(a, b) = f&(b, a).
●● Since A & (B & C) means the same as (A & B) & C, the "and"-operation must be associative.
●● There are also natural requirements of monotonicity, continuity, and the requirements that f&(1, 1) = 1 and f&(0, 0) = f&(0, 1) = f&(1, 0) = 0, ....

Examples of "and"- and "or"-operations. All such operations are known. We may want to also require that A & A means the same as A: f&(a, a) = a. In this case, we get f&(a, b) = min(a, b). This is one of the most widely used "and"-operations. Others include f&(a, b) = a • b, etc.
Similar properties hold for "or"-operations f∨(a, b) (a.k.a. t-conorms). For example,


if we require that A ∨ A means the same as A, we get f∨(a, b) = max(a, b). Others include f∨(a, b) = a + b – a • b, etc.
(3) Third (final) stage and resulting success stories: By applying "and"- and "or"-operations, we get, for each u, the degree R(x, u) to which u is reasonable. Based on these degrees, we need to select a single control value u. It is reasonable to use least squares, with R(x, u) as weights:

∫ R(x, u) • (u – ū)² du → min.   (1)

The resulting formula is known as centroid defuzzification:

ū = ∫ R(x, u) • u du / ∫ R(x, u) du.   (2)

This technique has led to many successes (Belohlavek et al., 2017; Klir & Yuan, 1995; Mendel, 2017; Nguyen & Walker, 2006; Novák et al., 1999; Zadeh, 1965):
(1) Fuzzy-controlled trains and elevators provide a smooth ride.
(2) Fuzzy rice cookers produce tasty rice, etc.
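To make the three stages concrete, here is a minimal Python sketch of the thermostat example. The triangular membership functions and their breakpoints are illustrative assumptions (not taken from the paper); min and max play the role of the "and"- and "or"-operations, and formula (2) is evaluated on a discretized grid of control values.

    import numpy as np

    # Membership functions for the temperature difference x = T - T0 and for the
    # control u (knob angle).  The triangular shapes and breakpoints below are
    # illustrative assumptions, not taken from the paper.
    def negligible(z):
        return np.clip(1 - abs(z) / 2.0, 0, 1)          # ~0 within about +-2 units

    def small_positive(z):
        return np.clip(np.minimum(z / 2.0, (6 - z) / 2.0), 0, 1)

    def small_negative(z):
        return small_positive(-z)

    def reasonable(x, u):
        """Degree R(x, u): one of the rules N(x)&N(u), SP(x)&SN(u), SN(x)&SP(u) applies."""
        return max(min(negligible(x), negligible(u)),
                   min(small_positive(x), small_negative(u)),
                   min(small_negative(x), small_positive(u)))

    def centroid_defuzzify(x, u_grid):
        """Formula (2): weighted average of u with weights R(x, u)."""
        weights = np.array([reasonable(x, u) for u in u_grid])
        return np.sum(weights * u_grid) / np.sum(weights)

    u_grid = np.linspace(-6, 6, 241)
    for x in [0.0, 1.5, 3.0, -3.0]:
        print(f"x = {x:+.1f}  ->  u = {centroid_defuzzify(x, u_grid):+.2f}")

For x close to 0 the resulting control is close to 0, and for a small positive x the control is small negative, as the rules prescribe.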

This was a simplified description. The above description only contains the main ideas; real-life applications are more complex. First, just like experts cannot say with what exact force they press the brakes, they cannot tell what exactly is their degree of confidence. An expert can say 7 or 8 on a scale of 0 to 10, but cannot distinguish between 70/100 and 71/100. Thus, a more adequate description of an expert's confidence is not a number but an interval of possible values. An expert may also say how confident she is about each degree—so we get a type-2 fuzzy degree. This leads to control which is closer to that of the expert—and thus better: smoother, more stable, etc.; see, e.g., Mendel (2017) and Mendel and Wu (2010). Second, centroid defuzzification does not always work. For example, if we want to avoid an obstacle in front, we can steer to the left or to the right. The situation is completely symmetric, so the defuzzified (average) value steers straight ahead—and thus leads us right into the obstacle. In such situations, we need to select only control values for which the degree of confidence exceeds some threshold. Third, we often have additional constraints—which could also be fuzzy. Finally, we often want not just to follow the experts, but to optimize—thus further improving their advice. Optimization under fuzzy uncertainty can also be handled by fuzzy logic techniques.


Fuzziness is ubiquitous. Many of our important notions are "fuzzy":
●● No one is absolutely good or bad—it is a matter of degree.
●● It is difficult to find the cause of an event—usually, many factors have different degrees of causality.
How can we describe this fuzziness?

Three levels of applied mathematics. There are three levels of applied mathematics:
(1) Level 1: Most researchers are well familiar with one formalism and use it. Statisticians use statistics, others use differential equations, etc.
(2) Level 2: Some researchers have mastered several mathematical techniques. These researchers select, for each practical problem, the most appropriate of these techniques. However, existing techniques are often not perfectly adequate for a practical problem.
(3) Level 3: A researcher designs a new formalism, especially for the given application.
Philosophers like to cite the Nobelist Eugene Wigner, who wrote about the unreasonable effectiveness of mathematics (Wigner, 1960), e.g.:
●● Quantum physics is perfectly described by Hilbert spaces.
●● General Relativity is based on pseudo-Riemannian spaces.
However,
●● Hilbert spaces were invented by John von Neumann explicitly to describe quantum physics.
●● Pseudo-Riemannian spaces were invented by A. Einstein explicitly to describe curved space-time.
Zadeh's ideas follow the same pattern:
●● Before Zadeh, researchers described human uncertainty by known mathematics, e.g., probabilities. This covered some cases well, some not so well.
●● Zadeh came up with a new technique specifically designed for describing non-probabilistic uncertainty. As a result, he got many successful applications.

Comment on simplicity. From the mathematical viewpoint, his main ideas were simple. This makes it even better: if we can get good empirical results by using simpler techniques, good!

Misunderstandings. Zadeh's AI ideas were often misunderstood.
(1) Some folks falsely believed that in fuzzy logic, d(A & B) is uniquely determined by d(A) and d(B). They thought that a simple counterexample to this straw-man belief can prove that fuzzy logic is wrong.
(2) Some falsely believed that Zadeh recommended min and max only. In reality, in his very first fuzzy paper he introduced other operations as well.
(3) Some believed that Zadeh wanted to replace probabilities with fuzzy logic. In reality, he always emphasized the need to let a hundred flowers bloom.


2. Lotfi Zadeh: A pioneer in statistical analysis

Formulation of the problem. In many practical situations,
●● We know the probabilities p1, ..., pn of individual events E1, ..., En, and
●● We would like to know the probabilities of different propositional combinations, such as E1 & E2.
To describe all such probabilities, it is sufficient to find the probabilities of all "and"-combinations

Ei1 & ... & Eim.   (3)

If the events are independent, the answer is easy:

p(Ei1 & ... & Eim) = p(Ei1) • ... • p(Eim).   (4)

However, often,
●● We know that the events are not independent, but
●● We do not have enough data to find out the exact dependence.
The traditional statistical approach was to assume some prior joint distribution. The problem is that different prior distributions lead to different answers; see, e.g., Rabinovich (2005) and Sheskin (2011). Which one should we select? In statistical analysis, we usually select the easiest-to-process distribution. However, real life is often complex—so why should we select the simplest method?

Zadeh's idea. Zadeh's revolutionary idea was to select an appropriate "and"-operation f&(a, b) for converting the probabilities a = p(A) and b = p(B) into an estimate f&(a, b) for p(A & B). The natural requirement that the estimates for A & B and B & A should be the same leads to commutativity: f&(a, b) = f&(b, a). The requirement that the estimates for A & (B & C) and (A & B) & C coincide leads to associativity. The corresponding "and"-operation should be experimentally determined.
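As a small numerical illustration (a sketch with made-up probabilities), the following Python snippet compares three commonly used "and"-operations—the minimum, the product (which is exact under independence), and the Łukasiewicz operation max(a + b – 1, 0)—as estimates f&(a, b) for p(A & B).

    # Three associative, commutative "and"-operations (t-norms) that can serve as
    # estimates f&(a, b) for p(A & B); which one is appropriate should be
    # determined experimentally, as Zadeh suggested.
    def and_min(a, b):          # upper (Frechet) bound on p(A & B)
        return min(a, b)

    def and_product(a, b):      # exact if the events are independent
        return a * b

    def and_lukasiewicz(a, b):  # lower (Frechet) bound on p(A & B)
        return max(a + b - 1.0, 0.0)

    a, b = 0.8, 0.7             # illustrative probabilities p(A) and p(B)
    for name, op in [("min", and_min), ("product", and_product),
                     ("Lukasiewicz", and_lukasiewicz)]:
        print(f"{name:12s}: estimate of p(A & B) = {op(a, b):.2f}")

The minimum and the Łukasiewicz operation are the tight upper and lower (Fréchet) bounds on p(A & B), so any joint distribution consistent with p(A) and p(B) yields a value between the two.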

Relation to MYCIN. Zadeh's idea, in effect, formalizes the procedure successfully used in Stanford's MYCIN; see, e.g., Buchanan and Shortliffe (1984). This was the world's first successful expert system—designed for diagnosing rare blood diseases. Interestingly, MYCIN's authors first thought that their "and"-operation describes general human reasoning. However, when they tried to apply it to geophysics, they realized that a different f&(a, b) is needed. This makes sense:
●● In geophysics, we start digging for oil if there is a good chance of success, even if further tests could clarify the situation.


●● In contrast, in medicine, we do not recommend surgery unless we have made all possible tests.

3. Lotfi Zadeh: A pioneer in foundations of mathematics

Main idea. From the logical viewpoint, the original fuzzy logic is simply [0, 1]-valued logic. The main formulas for this logic were proposed by Lukasiewicz in the 1920s. Zadeh succeeded in transforming this abstract theory into a successful practical tool. He also came up with an idea of how to generalize all mathematical notions to the fuzzy case, e.g.:
(1) Replace & (and ∀, the infinite &) with min, and
(2) Replace ∨ (and ∃, the infinite ∨) with max.

First example. How can we extend a data processing algorithm y = f(x1, ..., xn) to fuzzy inputs μi(xi)? Main idea:

y is reasonable ⇔ ∃x1 ... ∃xn (x1 is reasonable and ... and xn is reasonable, and y = f(x1, ..., xn)).   (5)

The above transformation leads to

μ(y) = max_{x1, ..., xn: f(x1, ..., xn) = y} min(μ1(x1), ..., μn(xn)).   (6)

This is known as Zadeh's extension principle. Instead of min, we can use other "and"-operations. The important point is that this construction is not as arbitrary as it seems to some authors: it is a particular case of a general algorithm.
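Here is a minimal sketch of formula (6) on a discrete grid; the function f(x1, x2) = x1 + x2 and the two triangular membership functions ("x1 is about 2," "x2 is about 3") are illustrative assumptions.

    # Zadeh's extension principle (formula (6)) for f(x1, x2) = x1 + x2 on an
    # integer grid.  The triangular membership functions mu1 and mu2 below are
    # illustrative assumptions.
    def mu1(x):                      # "x1 is about 2"
        return max(0.0, 1.0 - abs(x - 2) / 2.0)

    def mu2(x):                      # "x2 is about 3"
        return max(0.0, 1.0 - abs(x - 3) / 2.0)

    xs = range(0, 7)                 # grid of possible input values
    mu_y = {}
    for x1 in xs:
        for x2 in xs:
            y = x1 + x2              # f(x1, x2)
            deg = min(mu1(x1), mu2(x2))           # "and" of the two input degrees
            mu_y[y] = max(mu_y.get(y, 0.0), deg)  # "or" (max) over all pre-images

    for y in sorted(mu_y):
        if mu_y[y] > 0:
            print(f"mu({y}) = {mu_y[y]:.2f}")

Running this sketch prints the membership degrees of the resulting fuzzy output, which peak at y = 5 (= 2 + 3).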

Second example. Another example is intuitive continuity: if x and x′ are close, then y = f(x) and y′ = f(x′ ) should be close.

Let μin(x – x′) describe the closeness of the inputs; the larger |x – x′|, the smaller this degree. Closeness of the outputs may be described on a different scale: μout(y – y′) = μin(K • (y – y′)). The implication A → B can be understood as d(A) ≤ d(B). Thus, we get the condition

μin(x – x′) ≤ μout(f(x) – f(x′)) = μin(K • (f(x) – f(x′))).   (7)

Since μin decreases with the distance, this condition is equivalent to |x – x′| ≥ K • |f(x) – f(x′)|, i.e., to


|f(x) – f(x′)| ≤ L • |x – x′|,   (8)

for L ≝ K⁻¹. Thus, we get the Lipschitz condition (8).

Warning (emphasized by Elkan, 1993; Elkan et al., 1994; Nguyen, Kosheleva, & Kreinovich, 1996): we need to use the original logical formulation of the property. Indeed, e.g., A ∨ ¬A is not always true:

max(d(A), d(¬A)) = max(d(A), 1 – d(A)) < 1 whenever 0 < d(A) < 1.   (9)

Thus, a classically equivalent logical formula can lead to a different translation.

4. Lotfi Zadeh: A true citizen of the world

True citizen of the world. Who was he?
●● For Azeris, Zadeh is a national hero who passionately cared about the country of his birth.
●● To Iranians, he is a great Iranian who knew a lot and cared a lot about their country.
●● People of Russia knew him as passionate and well-informed about Russian events.
●● He was passionate about US politics—and with his wife Fay, he spoke English (although they could communicate in many other languages).
●● To many people, he was their own.
He was a true citizen of the world. He was not coldly above the struggle, no way. He was passionate about everyone; his heart bled over all the injustices of the world—as if they were his own. And he was passionately happy about the successes of everyone—as if they were his own. He was the true embodiment of Apostle Paul's famous statement: "There is no longer Jew or Greek, [...]; for all of you are one" (Galatians 3:28). This was his attitude to nations; this was his attitude to people.

A good person.
●● Zadeh's "take everything as a compliment" life stance helped him remain calm, cheerful—and successful.
●● He promoted his fuzzy ideas—but never at the expense of others. On the contrary, he always emphasized the need to combine them with other approaches—probabilistic, neural, etc. He inspired the combination of different AI directions—fuzzy, neural, etc.—into a single soft computing direction, with successful conferences, journals, and applications.
The world needs more people like Lotfi Zadeh!


Acknowledgments

This work was supported in part by the National Science Foundation grant HRD-1242122 (Cyber-ShARE Center of Excellence). The author is thankful to all the participants of the Symposium on Fuzzy Logic and Fuzzy Sets (Berkeley, California; February 5, 2018) for valuable suggestions.

References

Belohlavek, R., Dauben, J. W., & Klir, G. J. (2017). Fuzzy logic and mathematics: A historical perspective. New York, NY: Oxford University Press. doi:10.1093/oso/9780190200015.001.0001

Buchanan, B. G., & Shortliffe, E. H. (1984). Rule-based expert systems: The MYCIN experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley.

Elkan, C. (1993). The paradoxical success of fuzzy logic. In K. Ford (Ed.), Proceedings of the Eleventh National Conference on Artificial Intelligence (pp. 698–703). Menlo Park, CA: AAAI Press.

Elkan, C., Berenji, H. R., Silva, C. J. S., Attikiouzel, Y., Dubois, D., Prade, H., ... Zadeh, L. A. (1994). Fuzzy logic symposium: The paradoxical success of fuzzy logic. IEEE Expert, 9(4), 3–49. doi:10.1109/64.336150

Klir, G. J., & Yuan, B. (1995). Fuzzy sets and fuzzy logic: Theory and applications. Upper Saddle River, NJ: Prentice Hall.

Mendel, J. M. (2017). Uncertain rule-based fuzzy systems: Introduction and new directions (2nd ed.). Cham, Switzerland: Springer. doi:10.1007/978-3-319-51370-6

Mendel, J. M., & Wu, D. (2010). Perceptual computing: Aiding people in making subjective judgments. New York, NY: IEEE Press.

Nguyen, H. T., Kosheleva, O. M., & Kreinovich, V. (1996). Is the success of fuzzy logic really paradoxical?: Towards the actual logic behind expert systems. International Journal of Intelligent Systems, 11, 295–326. doi:10.1002/(SICI)1098-111X(199605)11:5<295::AID-INT4>3.0.CO;2-J

Nguyen, H. T., & Walker, E. A. (2006). A first course in fuzzy logic (3rd ed.). Boca Raton, FL: Chapman and Hall/CRC.

Novák, V., Perfilieva, I., & Močkoř, J. (1999). Mathematical principles of fuzzy logic. Boston, MA: Kluwer Academic. doi:10.1007/978-1-4615-5217-8

Rabinovich, S. G. (2005). Measurement errors and uncertainty: Theory and practice (3rd ed.). New York, NY: Springer.

Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). Boca Raton, FL: Chapman and Hall/CRC. doi:10.1201/9781420036268

Wigner, E. P. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13, 1–14. doi:10.1002/cpa.3160130102

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353. doi:10.1016/S0019-9958(65)90241-X

Zadeh, L. A., & Desoer, C. A. (2008). Linear system theory: The state space approach. New York, NY: Dover.

International Journal of Intelligent Technologies and Applied Statistics

Vol.11, No.2 (2018) pp.97-103, DOI:10.6148/IJITAS.201806_11(2).0002 © Airiti Press

Why Learning Has Aha-Moments and Why We Should Also Reward Effort, Not Just Results

Gerardo Uranga1, Vladik Kreinovich1* and Olga Kosheleva2

1Department of Computer Science, University of Texas at El Paso, Texas, USA 2Department of Teacher Education, University of Texas at El Paso, Texas, USA

ABSTRACT Traditionally, in machine learning, the quality of the result improves steadily with time (usually slowly but still steadily). However, as we start applying reinforcement learning techniques to solve complex tasks—such as teaching a computer to play a complex game like Go—we often encounter a situation in which for a long time, there is no improvement, and then suddenly, the system’s efficiency jumps almost to its maximum. A similar phenomenon occurs in human learning, where it is known as the aha-moment. In this paper, we provide a possible explanation for this phenomenon, and show that this explanation leads to the need to reward students for effort, not only for their results.

Keywords: Machine learning; Reinforcement learning; Aha-moment; Human learning; Rewards to students

1. Formulation of the problem

Need for machine learning. In many practical situations, we want to be able to find the values of a quantity y based on the values of some related quantities x1, ..., xn. For this, we can use situations k = 1, ..., N in which we know the values of both x1^(k), ..., xn^(k) and y^(k). Determining the dependence y = f(x1, ..., xn) is known as machine learning; see, e.g., Bishop (2006) and Goodfellow, Bengio, and Courville (2016). For example, we want to be able to predict volcano eruptions based on the seismic activity preceding the eruption; see, e.g., Parra, Fuentes, Anthony, and Kreinovich (2017a, 2017b).

Need for reinforcement learning. In some cases, the existing patterns are not sufficient for learning; in this case, it is desirable to come up with additional

* Corresponding author: [email protected]


patterns—and, ideally, a computer should tell us which patterns to look for to get the best learning. Such a situation is known as reinforcement learning; see, e.g., Duan et al. (2016), Sutton and Barto (1998), and Wang et al. (2017). Reinforcement learning is especially important if we want to teach a computer to play a complex game like Go. In this case, we want to find the best move y based

on the values xi that describe the current state (and, if needed, the past states and moves). In such situations, in addition to relying on the record of previous games, it is often desirable to test some new possible moves—by using a computer simulation of the corresponding game.

Aha-moments. Traditionally, in the process of machine learning, the efficiency of the resulting dependence increases as we continue learning. For example, as learning continues, the values f_cur(x1^(k), ..., xn^(k)) obtained by the current state of the system get closer and closer to the desired values y^(k). If we are learning how to play a game, then the quality of the corresponding strategy f_cur(x1, ..., xn) steadily increases as we continue learning. This improvement may be slow, it may have temporary setbacks, but overall, we tend to observe a steady improvement. However, as the learning tasks become more and more complex—e.g., to the level of playing Go—while the system eventually learns, it does not show a steady increase at all:
●● For a long time, there is no visible increase in the efficiency of the resulting system;
●● Then, in a relatively short period of time, the system's efficiency jumps to its maximum;
see, e.g., Juliani (2017) and references therein. This is similar to the so-called aha-moment observed during human learning (see, e.g., Thompson, 2010), when:
●● A student first does not grasp a concept (and thus cannot solve related problems),
●● Until suddenly, he or she gets a good understanding of it.

Why aha-moments? Why do we often observe such aha-moments? In this paper, we provide a possible explanation for this phenomenon—and we also show how this explanation can affect human learning.

2. Analysis of the problem

Case study. Let us consider a typical game-learning environment, where we want to

select the best strategy. Let a1, ..., am be parameters that describe a strategy. We want to come up with a strategy that works best “on average”, to be applied


in the real world. Of course, for individual games, with individual opponents, other strategies may work better—strategies that take into account the individual features of the corresponding players.

Analysis of the case study. Let a^(0) = (a1^(0), ..., am^(0)) be the desired strategy which is the best on average. This strategy works best against the "average" player, i.e., a player with average values of the appropriate characteristics—such as skill, memory, desire to take risks, etc. A deviation of any characteristic of a player from its average value, in general, means that the strategy a = (a1, ..., am) optimal for players with this deviation will be slightly different from a^(0): ai = ai^(0) + Δai for some small Δai.
In general, we have many different characteristics describing a player. They can be viewed as reasonably independent. So, for a randomly selected player, the deviation of the strategy optimal for this player from a^(0) is a joint effect of all the differences between this person's characteristics and the average values of these characteristics.

Resulting distribution. It is known that in general, the probability distribution of a joint effect of many similar random variables is close to Gaussian; the corresponding mathematical result is known as the Central Limit Theorem; see, e.g., Sheskin (2011). Thus, we can conclude that the strategy which is optimal for a given randomly selected player is normally distributed around a(0).

How the quality of a given strategy a = (a1, ..., am) depends on the parameters ai. A natural way to gauge the quality of a strategy is by measuring how successful it is when playing against a randomly selected opponent. In other words, as the quality of a strategy a, it is reasonable to take the probability p(a) that this strategy defeats a randomly selected opponent.
Since we have shown that the corresponding distribution is Gaussian (normal),

the dependence of this probability on ai has the Gaussian form:

p(a) = const • exp(–Q(∆a)), (1)

where Δa ≝ a – a^(0) (i.e., Δai = ai – ai^(0)), and Q(Δa) is a quadratic form:

Q(Δa) = Σ_{i=1}^{m} Σ_{j=1}^{m} q_ij • Δai • Δaj.   (2)

Comment. In a complex game like Go, a random strategy never wins. So:


●● For most of the strategies a, the value p(a) of the corresponding objective function is close to 0, and
●● There is only a small area in which p(a) is significantly different from 0.

3. Why learning has aha-moments: A possible explanation

What we plan to do. The above analysis shows how the corresponding objective function p(a) depends on the parameters ai that describe a strategy a = (a1, ..., am). Let us describe how this shape of the objective function affects the learning process.

How the objective functions are optimized in machine learning algorithms. In most machine learning algorithms, the objective function is optimized by using gradient descent; see, e.g., Bishop (2006) and Goodfellow et al. (2016) (in this case, since we are maximizing the probability of success, it is gradient ascent). Crudely speaking, at each step, instead of the original values a = (a1, ..., am), we have new values ai + δai, where

δai = λ • ∂p/∂ai.   (3)

So how does this work for our objective function? In the beginning, we do not have a good strategy for playing Go or a similar game. Thus, we start with a strategy a_init for which p(a_init) ≈ 0.
For the Gaussian function p(a), when p(a) ≈ 0, the partial derivatives ∂p/∂ai are also close to 0, and for the resulting values a + δa, we still get p(a + δa) ≈ 0. So, indeed, in the beginning, there is no visible improvement of the objective function. This does not mean, however, that the gradient method does not work: it definitely brings us closer to a^(0), and eventually, we will get close enough to a^(0) to have p(a) > 0. But we will see an improvement only when we get very close to a^(0)—within 2σ or so of the corresponding probability distribution.

Example. Let us illustrate the above behavior on the 1-D example, when a = a1, Q(a) = k • (a – a^(0))², and

p(a) = const • exp(–k • (a – a^(0))²).   (4)

(For multi-D cases, the formulas are similar.) In this case,


∂p/∂a = const • exp(–k • (a – a^(0))²) • (–2k) • (a – a^(0)),   (5)

hence

δa = λ • ∂p/∂a = λ • const • exp(–k • (a – a^(0))²) • (–2k) • (a – a^(0)),   (6)

i.e.,

δa = λ • p(a) • (–2k) • (a – a^(0)).   (7)

Since p(a) ≈ 0, we get δa ≈ 0 and thus, still p(a + δa) ≈ 0 as well. This means that we do not see any visible improvement. On the other hand, for the new value a + δa, we have

(a + δa) – a^(0) = (a – a^(0)) + δa = (a – a^(0)) – 2k • λ • p(a) • (a – a^(0)) = (a – a^(0)) • (1 – 2k • λ • p(a)).   (8)

This shows that we do get closer to a^(0). Since we get closer to a^(0), the value p(a) increases, so on the next iteration, we get an even better improvement, and eventually, we will reach a^(0).
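The following short Python simulation (a sketch; the constants k, λ, and the starting point are arbitrary choices) runs the update (7) on the 1-D objective (4) and prints p(a) every 100 iterations; for a long stretch p(a) stays close to 0, and then, within a relatively small number of iterations, it jumps almost to its maximum—exactly the behavior described above.

    import math

    # Gradient ascent (formula (7)) on the 1-D objective (4),
    # p(a) = exp(-k * (a - a0)^2).  The constants below (k, the learning
    # rate lam, and the starting point) are illustrative assumptions.
    k, a0, lam = 1.0, 0.0, 0.5
    a = 3.0                                  # start far from the optimum, p(a) ~ 0

    def p(a):
        return math.exp(-k * (a - a0) ** 2)

    for step in range(1001):
        if step % 100 == 0:
            print(f"step {step:4d}: a = {a:6.3f}, p(a) = {p(a):.5f}")
        a += lam * p(a) * (-2.0 * k) * (a - a0)   # the update delta a from (7)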

This explains the ubiquity of aha-moments. The above analysis explains the ubiquity of aha-moments:
●● For a long time, we do not see any visible improvement of the objective function,
●● Until we get close to a^(0), at which point we see an improvement.
This is exactly what we observe as the aha-moment.

4. Why we should also reward effort, not just results

What happens in human learning. The above analysis has an important consequence for human learning—in which similar aha-moments are also ubiquitous. What this analysis shows is that in the beginning, while the student is on his/her way to learning, there is no visible improvement in the student's ability to solve the corresponding problems.


What if we only reward results. If we gauge the student's success—as often happens—only by the results of learning, i.e., by the student's ability to solve the corresponding problems, we will not see any improvement—in spite of the fact that the student actively works. In this case, the resulting bad grade would discourage the student from further attempts to learn—although in reality, the learning process is on the right track.

Conclusion: We also need to reward effort. A natural conclusion is that to avoid such discouragement, we need to reward effort, not only results—otherwise students will be discouraged from learning and mastering complex topics.

Acknowledgments

This work was supported in part by the US National Science Foundation grant HRD-1242122.

References

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., & Abbeel, P. (2016). RL2: Fast reinforcement learning via slow reinforcement learning. Retrieved from https://arxiv.org/abs/1611.02779

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Juliani, A. (2017). Learning policies for learning policies―Meta reinforcement learning (RL2) in tensorflow. Retrieved from https://hackernoon.com/learning-policies-for-learning-policies-meta-reinforcement-learning-rl%C2%B2-in-tensorflow-b15b592a2ddf

Parra, J., Fuentes, O., Anthony, E., & Kreinovich, V. (2017a). Prediction of volcanic eruptions: Case study of rare events in chaotic systems with delay. In Proceedings of the 2017 IEEE Conference on Systems, Man, and Cybernetics (pp. 351–356). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Parra, J., Fuentes, O., Anthony, E., & Kreinovich, V. (2017b). Use of machine learning to analyze and—hopefully—predict volcano activity. Acta Polytechnica Hungarica, 14, 209–221. doi:10.12700/APH.14.3.2017.3.12

Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). Boca Raton, FL: Chapman and Hall/CRC.


Sutton, R. S., & Barto, A. G. (1998). Reinf orcement learning: An introduction. Cambridge, MA: MIT Press.

Thompson, E. (2010). Mind in life: Biology, phenomenology, and the sciences of mind. Cambridge, MA: Belknap Press.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., ... Botvinick, M. (2017). Learning to reinforcement learn. Retrieved from https://arxiv.org/abs/1611.05763

International Journal of Intelligent Technologies and Applied Statistics

Vol.11, No.2 (2018) pp.105-111, DOI:10.6148/IJITAS.201806_11(2).0003 © Airiti Press

Why 70/30 or 80/20 Relation Between Training and Testing Sets: A Pedagogical Explanation

Afshin Gholamy1, Vladik Kreinovich2* and Olga Kosheleva3

1 Department of Geological Sciences, University of Texas at El Paso, Texas, USA 2 Department of Computer Science, University of Texas at El Paso, Texas, USA 3 Department of Teacher Education, University of Texas at El Paso, Texas, USA

ABSTRACT When learning a dependence from data, to avoid overfitting, it is important to divide the data into the training set and the testing set. We first train our model on the training set, and then we use the data from the testing set to gauge the accuracy of the resulting model. Empirical studies show that the best results are obtained if we use 20–30% of the data for testing, and the remaining 70–80% of the data for training. In this paper, we provide a possible explanation for this empirical result.

Keywords: Training set; Testing set; Overfitting; Learning dependence from data

1. Formulation of the problem

Training a model: A general problem. In many practical situations, we have a model for a physical phenomenon, a model that includes several unknown parameters. These parameters need to be determined from the known observations; this determination is known as training the model.

Need to divide data into training set and testing set. In statistics in general, the more data points we use, the more accurate are the resulting estimates. From this viewpoint, it may seem that the best way to determine the parameters of the model is to use all the available data points in this determination. This is indeed a good idea if we are absolutely certain that our model adequately describes the corresponding phenomenon. In practice, however, we are often not absolutely sure that the current model is indeed adequate. In such situations, if we simply use all the available data to

* Corresponding author: [email protected]


determine the parameters of the model, we often get overfitting—when the model describes all the data perfectly well without being actually adequate. For example, if we observe some quantity x at n different moments of time, then it is always possible to find a polynomial f(t) = a0 + a1 • t + a2 • t² + ... + a_{n–1} • t^{n–1} that will fit all the data points perfectly well—to find such a polynomial, it is sufficient to solve the corresponding system of n linear equations with n unknowns a0, ..., a_{n–1}:

a0 + a1 • ti + a2 • ti² + ... + a_{n–1} • ti^{n–1} = x(ti),   i = 1, ..., n.   (1)

This does not mean that the resulting model is adequate, i.e., that the resulting polynomial can be used to predict the values x(t) for all t: one can easily show that if we start with noisy data, the resulting polynomial will be very different from the actual values of x(t). For example, if n = 2 and the actual value of x(t) is a constant, then, due to noise, the resulting polynomial x(t) = a0 + a1 • t will be a linear function with a1 ≠ 0. Thus, for large t, we will have x(t) → ∞, so the predicted values will be very different from the actual (constant) value of the signal; a small numerical illustration is given after the following list.
To avoid overfitting, it is recommended that we divide the observations into training and testing data:
●● First, we use the training data to determine the parameters of the model.
●● After that, we compare the model's predictions for all the testing data points with what we actually observed, and use this comparison to gauge the accuracy of our model.

Which proportion of data should we allocate for testing? Empirical analysis has shown that the best results are attained if we allocate 20–30% of the original data points for testing, and use the remaining 70–80% for training. For this division, we get accuracy estimates which are:
(1) valid—in the sense that they do not overestimate the accuracy (i.e., do not underestimate the approximation error), and
(2) the most accurate among the valid estimates—i.e., their overestimation of the approximation error is the smallest possible.

What we do in this paper. In this paper, we provide a possible explanation for this empirical fact.

2. Formal description and analysis of the problem

Training and testing: Towards a formal description. Our goal is to find the dependence of the desired quantity y on the corresponding inputs x1, ..., xn. To be more specific, we assume that the dependence has the form


y = f(a1, ..., am, x1, ..., xn), (2)

for some parameters a1, ..., am. For example, we can assume that the dependence is linear, in which case m = n + 1 and

y = a1 • x1 + ... + an • xn + an + 1. (3)

We can assume that the dependence is quadratic, or sinusoidal, etc. To find this dependence, we use the available data, i.e., we use N situations k = 1, ..., N in each of which we know both the values of the inputs x1^(k), ..., xn^(k) and the corresponding output y^(k).
Let p denote the fraction of the data that goes into the training set. This means that out of the original N patterns (x1^(k), ..., xn^(k), y^(k)):
●● N • p patterns form a training set, and
●● The remaining (1 – p) • N patterns form a testing set.

We use the training set to find estimates â1, ..., âm of the parameters a1, ..., am. Then, for each pattern (x1^(k), ..., xn^(k), y^(k)) from the testing set, we compare the desired output y^(k) with the result

ŷ^(k) = f(â1, ..., âm, x1^(k), ..., xn^(k))   (4)

of applying the trained model to the inputs. Based on the differences

dk ≝ y^(k) – ŷ^(k),   (5)

we gauge the accuracy of the trained model.

How do we gauge the accuracy of the model. Many different factors contribute to the fact that the resulting model is not perfect, such as measurement errors, the approximate character of the model itself, etc. It is known that under reasonable assumptions, the distribution of a joint effect of many independent factors is close to Gaussian (normal)—the corresponding mathematical result is known as the "central limit theorem"; see, e.g., Sheskin (2011).

Thus, we can safely assume that the differences dk are normally distributed. It is known that a 1-D normal distribution is uniquely determined by two parameters: the mean value μ and the standard deviation σ. Thus, based on the differences dk, we can estimate:
●● The mean value (bias) μ of the trained model, and
●● The standard deviation σ describing the accuracy of the trained model.


A general fact from statistics: Reminder. In statistics, it is known that when we use M values to estimate a parameter, the standard deviation of the estimate decreases by a factor of √M.

Example. The factor-of-√M decrease is the easiest to explain on the simplest example, when we have a single quantity q, and we perform several measurements of this quantity by using a measuring instrument for which the standard deviation of the measurement error is σ0. As a result, we get M measurement results q1, ..., qM. As an estimate for q, it is reasonable to take the arithmetic mean

q̂ = (q1 + ... + qM) / M.   (6)

Then, the resulting estimation error q̂ – q, i.e., the difference between this estimate and the actual (unknown) value q of the quantity of interest, has the form

q̂ – q = (q1 + ... + qM)/M – q = ((q1 – q) + ... + (qM – q))/M.   (7)

By definition, for each difference qi – q, the standard deviation is equal to σ0, and thus, the variance is equal to σ0². Measurement errors corresponding to different measurements are usually independent. It is known that the variance of the sum of independent random variables is equal to the sum of the variances. Thus, the variance of the sum (q1 – q) + ... + (qM – q) is equal to M • σ0², and the corresponding standard deviation is equal to √(M • σ0²) = √M • σ0. When we divide the sum by M, the standard deviation is also divided by the same factor. So, the standard deviation of the difference q̂ – q is equal to (√M • σ0)/M = σ0/√M.

Let us use the general fact from statistics. We estimate the parameters of the model based on the training set, with p • N elements. Thus, the standard deviation of the corresponding model is proportional to 1/√(p • N).
When we gauge the accuracy of the model, we compare the trained model with the data from the testing set. Even if the trained model were exact, because of the measurement errors, we would not get an exact match. Instead, based on (1 – p) • N measurements, we would get a standard deviation proportional to 1/√((1 – p) • N).


We want to estimate the difference dk between the trained model and the testing data. It is reasonable to assume that, in general, the errors corresponding to the training set and to the testing set are independent—we may get positive correlation in some cases and negative correlation in others, so, on average, the correlation is 0. For independent random variables, the variance of the difference is equal to the sum of the variances. Thus, on average, this variance is proportional to

(1/√(p • N))² + (1/√((1 – p) • N))² = 1/(p • N) + 1/((1 – p) • N) = 1/((p • (1 – p)) • N).   (8)

Thus, to get the smallest possible estimate for the approximation error, out of all possible values p, we need to select the value p for which the product p • (1 – p) is the largest possible.

Which values p are possible? The only remaining question is: which values p are possible? Our requirement was that we should select p for which the gauged accuracy is guaranteed not to overestimate the accuracy. In precise terms, this means that the standard deviation of the trained model—i.e., the standard deviation of the estimate ŷ^(k)—should be smaller than or equal to the standard deviation of the difference dk by which we gauge the model's accuracy:

σ[ŷ^(k)] ≤ σ[dk].   (9)

Here, dk = ŷ^(k) – y^(k) is the difference between:
●● The estimate ŷ^(k), whose inaccuracy is caused by the measurement errors of the training set, and
●● The value y^(k), whose inaccuracy is caused by the measurement errors of the testing set.
So, we must have

σ[ŷ^(k)] ≤ σ[ŷ^(k) – y^(k)].   (10)

In general, for two random variables r1 and r2 with standard deviations σ[r1] and σ[r2], the smallest possible value of the standard deviation of the difference is |σ[r1] – σ[r2]| (see, e.g., Sheskin, 2011):

σ[r1 – r2] ≥ |σ[r1] – σ[r2]|.   (11)


In particular, for the difference dk = ŷ^(k) – y^(k), the smallest possible value of its standard deviation σ[ŷ^(k) – y^(k)] is

|σ[ŷ^(k)] – σ[y^(k)]|.   (12)

Thus, to make sure that we do not underestimate the measurement error, we must guarantee that

σ[ŷ^(k)] ≤ |σ[ŷ^(k)] – σ[y^(k)]|,   (13)

i.e., that a ≤ |a – b|, where we denoted a ≝ σ[ŷ^(k)] and b ≝ σ[y^(k)]. In principle, we can have two different cases: a ≥ b and a ≤ b. Let us consider these two cases one by one.
●● If a ≥ b, then the desired inequality takes the form a ≤ a – b, which for b > 0 is impossible.
●● Thus, we must have a ≤ b. In this case, the above inequality takes the form a ≤ b – a, i.e., equivalently, 2a ≤ b.
Thus, we must have

2σ[ŷ^(k)] ≤ σ[y^(k)].   (14)

Since the inaccuracy of the estimate ŷ^(k) comes only from the measurement errors of the training set, with p • N elements, we have

σ[ŷ^(k)] = σ0/√(p • N)   (15)

for some σ0. Similarly, since the inaccuracy of the value y^(k) comes only from the measurement errors of the testing set, with (1 – p) • N elements, we have

σ[y^(k)] = σ0/√((1 – p) • N).   (16)

Thus, the above inequality takes the form


2 • σ0/√(p • N) ≤ σ0/√((1 – p) • N).   (17)

Dividing both sides of this inequality by σ0 and multiplying by √N, we conclude that

2/√p ≤ 1/√(1 – p).   (18)

Squaring both sides, we get

4/p ≤ 1/(1 – p).   (19)

By bringing both sides to a common denominator, we get 4 – 4p ≤ p, i.e., 4 ≤ 4p + p = 5p and p ≥ 0.8. Thus, to make sure that our estimates do not overestimate the accuracy, we need to select values p ≥ 0.8.

Towards the final conclusion. As we have mentioned earlier, out of all possible values p, we need to select the one for which the product p • (1 – p) is the largest possible. For p ≥ 0.8, the function p • (1 – p) is decreasing. Thus, its largest value is attained when the value p is the smallest possible—i.e., when p = 0.8. So, we have indeed explained why p ≈ 80% is empirically the best division into the training and the testing sets.
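A short numerical sketch of this final argument in Python (the grid of candidate fractions is an arbitrary choice): among the fractions p that satisfy the validity constraint 2/√(p • N) ≤ 1/√((1 – p) • N)—i.e., p ≥ 0.8—the product p • (1 – p), to which the variance (8) is inversely proportional, is largest at the smallest valid fraction.

    import numpy as np

    # Scan candidate training fractions p and keep those satisfying the
    # validity constraint 2 / sqrt(p) <= 1 / sqrt(1 - p), i.e., p >= 0.8.
    p_grid = np.linspace(0.01, 0.99, 99)
    valid = p_grid[2.0 / np.sqrt(p_grid) <= 1.0 / np.sqrt(1.0 - p_grid)]

    # Among the valid fractions, maximize p * (1 - p): the variance of the
    # gauged error is proportional to 1 / (p * (1 - p) * N).
    best = valid[np.argmax(valid * (1.0 - valid))]

    print("valid fractions start at p =", valid.min())   # ~0.80
    print("best valid fraction      p =", best)          # ~0.80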

Acknowledgments

This work was supported in part by the US National Science Foundation grant HRD-1242122.

References

Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). Boca Raton, FL: Chapman and Hall/CRC. doi:10.1201/9781420036268

International Journal of Intelligent Technologies and Applied Statistics

Vol.11, No.2 (2018) pp.113-120, DOI:10.6148/IJITAS.201806_11(2).0004 © Airiti Press

Why Skew Normal: A Simple Pedagogical Explanation

José Guadalupe Flores Muñiz1, Vyacheslav V. Kalashnikov2,3, Nataliya Kalashnykova1,4, Olga Kosheleva5 and Vladik Kreinovich6*

1 Department of Physics and Mathematics, Universidad Autónoma de Nuevo León, San Nicolás de los Garza, Mexico 2 Department of Systems and Industrial Engineering, Tecnológico de Monterrey, Monterrey, Mexico 3 Department of Experimental Economics, Central Economics and Mathematics Institute, Moscow, Russian Federation 4 Department of Computer Science, Sumy State University, Sumy, Ukraine 5 Department of Teacher Education, University of Texas at El Paso, Texas, USA 6 Department of Computer Science, University of Texas at El Paso, Texas, USA

ABSTRACT In many practical situations, we only know a few first moments of a random variable, and out of all probability distributions which are consistent with this information, we need to select one. When we know the first two moments, we can use the Maximum Entropy approach and get normal distribution. However, when we know the first three moments, the Maximum Entropy approach does not work. In such situations, a very efficient selection is a so-called skew normal distribution. However, it is not clear why this particular distribution should be selected. In this paper, we provide an explanation for this selection.

Keywords: Skew normal distribution; Maximum entropy; Normal distribution; Skewness; Third moment

1. Formulation of the problem

General problem: Need to select a probability distribution under uncertainty. Most traditional statistical techniques assume that we know the corresponding probability distribution, or at least that we know a finite-parametric family of distributions that contains the given distribution; see, e.g., Sheskin (2011). However, often, the only information that we have about the probability distribution of a quantity X is its few first moments Mk ≝ E[X^k]. In such a situation, there are many possible distributions consistent with this information. To apply the traditional statistical

* Corresponding author: [email protected]


techniques to such situations, it is therefore necessary to select, out of all possible distributions, one single distribution (or a finite-parametric family of distributions). Ideally, we should select the distribution which is, in some sense, the most realistic for the given situation and/or leads to the simplest data processing techniques.

The simplest case when we know the first two moments. In uncertain situations when the only information that we have is the first two moments, then, out of all possible distributions with these two moments, it is reasonable to select the distribution that maximally preserves uncertainty—i.e., the one for which the entropy S = –∫ ρ(x) • ln(ρ(x)) dx is the largest possible, where ρ(x) is the probability density function; see, e.g., Jaynes and Bretthorst (2003). By applying the Lagrange multiplier method to the corresponding constrained optimization problem of maximizing S under the constraints ∫ ρ(x) dx = 1, ∫ x • ρ(x) dx = M1, and ∫ x² • ρ(x) dx = M2, we can reduce this problem to the unconstrained optimization problem of maximizing the expression

–∫ ρ(x) • ln(ρ(x)) dx + λ0 • (∫ ρ(x) dx – 1) + λ1 • (∫ x • ρ(x) dx – M1) + λ2 • (∫ x² • ρ(x) dx – M2).   (1)

Differentiating this expression with respect to each unknown ρ(x) and equating the resulting derivative to 0, we conclude that

–ln(ρ(x)) – 1 + λ0 + λ1 • x + λ2 • x² = 0,   (2)

hence

ln(ρ(x)) = (λ0 – 1) + λ1 • x + λ2 • x²,   (3)

and thus, ρ(x) = exp(–Q(x)) for some quadratic expression Q(x). This is the well-known Gaussian (normal) distribution.
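As a quick numerical check of this conclusion (a sketch using scipy; the moment values are arbitrary), one can compare the differential entropy of the normal distribution with the entropy of another distribution having the same mean and variance, e.g., a uniform one: the normal distribution indeed has the larger entropy.

    import numpy as np
    from scipy import stats

    # Fix the first two moments: mean mu and variance var (illustrative values).
    mu, var = 0.0, 1.0

    # Normal distribution with these moments.
    normal = stats.norm(loc=mu, scale=np.sqrt(var))

    # A uniform distribution on [mu - a, mu + a] has variance a^2 / 3,
    # so a = sqrt(3 * var) gives the same mean and variance.
    a = np.sqrt(3.0 * var)
    uniform = stats.uniform(loc=mu - a, scale=2.0 * a)

    print("entropy of normal :", normal.entropy())    # 0.5 * ln(2 * pi * e * var)
    print("entropy of uniform:", uniform.entropy())   # ln(2 * sqrt(3 * var))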

What if we also know the third moment? What if, in addition to the first two moments M1 and M2, we also know the third moment

M3 = ∫ x³ • ρ(x) dx?   (4)


At first glance, it may seem that in this case, we can also select, out of all possible distributions with these three moments, the distribution with the largest possible value of the entropy. In this case, the corresponding constrained optimization problem of maximizing S under the constraints ∫ ρ(x) dx = 1, ∫ x • ρ(x) dx = M1, ∫ x² • ρ(x) dx = M2, and ∫ x³ • ρ(x) dx = M3 can be reduced to the unconstrained optimization problem of maximizing the expression

–∫ ρ(x) • ln(ρ(x)) dx + λ0 • (∫ ρ(x) dx – 1) + λ1 • (∫ x • ρ(x) dx – M1) + λ2 • (∫ x² • ρ(x) dx – M2) + λ3 • (∫ x³ • ρ(x) dx – M3).   (5)

Differentiating this expression with respect to each unknown ρ(x) and equating the resulting derivative to 0, we conclude that

–ln(ρ(x)) – 1 + λ0 + λ1 • x + λ2 • x² + λ3 • x³ = 0,   (6)

hence

ln(ρ(x)) = (λ0 – 1) + λ1 • x + λ2 • x² + λ3 • x³,   (7)

and thus,

ρ(x) = exp(C(x)), (8)

where we denoted

C(x) ≝ (λ0 – 1) + λ1 • x + λ2 • x² + λ3 • x³.   (9)

However, the problem with this formula is that:

●● when λ3 > 0, we get C(x) → +∞ when x → +∞, thus

ρ(x) = exp(C(x)) → +∞, (10)


and therefore, we cannot have ∫ ρ(x) dx = 1;
●● similarly, when λ3 < 0, we get C(x) → +∞ when x → –∞, thus

ρ(x) = exp(C(x)) → +∞, (11)

and therefore, we also cannot have ∫ ρ(x) dx = 1; see, e.g., Dumrongpokaphan and Kreinovich (2017).
So, the only possible case when we have ∫ ρ(x) dx = 1 is when λ3 = 0. However, in this case, we simply get a normal distribution, and normal distributions are uniquely determined by the first two moments and thus do not cover all possible combinations of three moments.

So what do we do? In the case of three moments, there is a widely used selection, called a skew normal distribution (see, e.g., Azzalini & Capitanio, 2013; Li, Shi, & Wang, 2009), when we choose a distribution with the probability density function

ρ(x) = (2/ω) • φ((x – η)/ω) • Φ(α • (x – η)/ω),   (12)

where
●● φ(x) ≝ (1/√(2π)) • exp(–x²/2) is the probability density function (PDF) of the standard Gaussian distribution, with mean 0 and standard deviation 1, and
●● Φ(x) is the corresponding cumulative distribution function

Φ(x) = ∫_{–∞}^{x} φ(t) dt.   (13)

Comment. For this distribution:
●● The first moment M1 is equal to M1 = μ = η + ω • δ • √(2/π), where

δ ≝ α/√(1 + α²);   (14)

●● The second central moment σ² = E[(X – μ)²] is equal to

σ² = ω² • (1 – 2δ²/π);   (15)

●● The third central moment m3 = E[(X – μ)³] is equal to

m3 = ((4 – π)/2) • (δ • √(2/π))³/(1 – 2δ²/π)^{3/2} • σ³.   (16)

Why? The skew normal distribution has many applications, but it is not clear why it is selected.
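For readers who want to check formulas (12)–(16) numerically, here is a small sketch using scipy.stats.skewnorm, whose shape, location, and scale parameters play the roles of α, η, and ω; the particular parameter values are arbitrary.

    import numpy as np
    from scipy import stats

    alpha, eta, omega = 3.0, 1.0, 2.0                 # illustrative parameters
    dist = stats.skewnorm(alpha, loc=eta, scale=omega)

    # Closed-form moments from formulas (14)-(16).
    delta = alpha / np.sqrt(1.0 + alpha ** 2)
    mean = eta + omega * delta * np.sqrt(2.0 / np.pi)
    var = omega ** 2 * (1.0 - 2.0 * delta ** 2 / np.pi)
    m3 = (((4.0 - np.pi) / 2.0) * (delta * np.sqrt(2.0 / np.pi)) ** 3
          / (1.0 - 2.0 * delta ** 2 / np.pi) ** 1.5 * var ** 1.5)

    print("mean :", mean, "vs scipy:", dist.mean())
    print("var  :", var, "vs scipy:", dist.var())
    print("skew :", m3 / var ** 1.5, "vs scipy:", dist.stats(moments="s"))

The closed-form values agree with the moments computed by scipy.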

What we do in this paper. In this paper, we provide a pedagogical explanation for the skew normal distribution.

2. Why skew normal: Analysis of the problem and the resulting selection

Meaning of probability density f unction: Reminder. In the above formula, the skew normal distribution is described in terms of the probability density function. To see how we can explain the above formula, let us recall the meaning of the probability density function. By definition, the probability density is equal to the limit

$$\rho(x) = \lim_{\underline{x}\to x,\ \overline{x}\to x} \frac{\mathrm{Prob}\left(X\in[\underline{x},\overline{x}]\right)}{\overline{x}-\underline{x}}. \qquad (17)$$

Limit means that when the width $\overline{x}-\underline{x}$ of the corresponding interval is small, we have

$$\rho(x) \approx \frac{\mathrm{Prob}\left(X\in[\underline{x},\overline{x}]\right)}{\overline{x}-\underline{x}}, \qquad (18)$$

and the smaller the width, the more accurate this formula. In particular, for small ε > 0, we have

$$\rho(x) \approx \frac{\mathrm{Prob}\left(X\in[x-\varepsilon,\,x+\varepsilon]\right)}{2\varepsilon}, \qquad (19)$$


i.e.,

Prob(X ∈ [x – ε, x + ε]) ≈ ρ(x) • 2ε. (20)

Thus, if we interpret X ∈ [x – ε, x + ε], or, equivalently, |X – x| ≤ ε, as "X and x are ε-equal"—and denote it by $X =_\varepsilon x$—then

$$\mathrm{Prob}(X =_\varepsilon x) \approx \rho(x)\cdot 2\varepsilon. \qquad (21)$$

What does such ε-equality mean? In practice, all the values are only measured with some accuracy δ. Thus, even if two values x₁ and x₂ are absolutely equal, all we get is their δ-approximate values $\tilde{x}_1$ and $\tilde{x}_2$, for which $|\tilde{x}_1 - x_1| \le \delta$ and $|\tilde{x}_2 - x_2| \le \delta$ imply that $|\tilde{x}_1 - \tilde{x}_2| \le |\tilde{x}_1 - x_1| + |\tilde{x}_2 - x_2| \le 2\delta$. Conversely, if $|\tilde{x}_1 - \tilde{x}_2| \le 2\delta$, then it is possible that the values $\tilde{x}_1$ and $\tilde{x}_2$ come from measuring the same value $x_1 = x_2$: namely, if we take $x_1 = x_2 = \frac{\tilde{x}_1 + \tilde{x}_2}{2}$, we get $|\tilde{x}_1 - x_1| \le \delta$ and $|\tilde{x}_2 - x_2| \le \delta$.

From this viewpoint, the ε-equality is the practically checkable version of equality. Thus, modulo a multiplicative factor, the probability density ρ(x) is the probability that the random value X is practically equal to x.

Let us start with the normal distribution. Let us start with the case when we know the first two moments and thus get a normal distribution, with probability density ρ₀(x). For simplicity, we can consider the case when the mean of the normal distribution is 0.

We want asymmetry. A normal distribution with 0 mean is symmetric with respect to the change of sign x → –x. As a result, for the normal distribution, the third moment M₃ is 0. To cover possible non-zero values of M₃, we thus need to "add" asymmetry to the normal distribution.

What we mean by “adding.” A natural interpretation of adding is that, instead of considering a simple condition X = x, we consider a modified condition “X = x and ....” How can we describe the probability of such a combined statement? We have no reason to believe that the newly added condition is positively or negatively correlated with the event X = x. Thus, it is reasonable to consider these events to be independent—this is, by the way, what the maximum entropy principle implies in such a situation; see, e.g., Jaynes and Bretthorst (2003).

Examples of such “adding.” One possibility is to add, to the original condition X = x, a somewhat modified condition X = α • x, for some constant α.


Interestingly, this addition does not change much. Indeed, the original probability density function—corresponding to X = x—has the form $\mathrm{const}\cdot\exp\left(-\frac{x^2}{\sigma^2}\right)$. Thus, the additional condition X = α • x has the form $\mathrm{const}\cdot\exp\left(-\frac{(\alpha\cdot x)^2}{\sigma^2}\right)$, and the product of these two probabilities has the form $\mathrm{const}\cdot\exp\left(-\frac{(1+\alpha^2)\cdot x^2}{\sigma^2}\right)$, i.e., the form $\mathrm{const}\cdot\exp\left(-\frac{x^2}{(\sigma')^2}\right)$, where $\sigma' \stackrel{\text{def}}{=} \frac{\sigma}{\sqrt{1+\alpha^2}}$. So, we still get a normal distribution.

What is a natural asymmetric version of equality. As we have mentioned, the probability density function of the normal distribution describes the probability of equality X = x (or, as we have just learned, X = α • x). A natural way to get asymmetry is to consider a natural asymmetric version of equality: the inequality X ≤ α • x. The probability of this inequality is equal to the corresponding cumulative distribution function. So, if we interpret "adding" this additional condition as multiplying, we get the product of:

● the original probability (i.e., the probability density function) and

● the new additional probability—which is described by the cumulative distribution function.

So, we get exactly the above formula for the skew normal distribution. Thus, we have indeed explained this formula.
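Both multiplication steps in this argument can be checked numerically. The sketch below (assuming NumPy and SciPy; α = 2 is an arbitrary illustrative value, and the standard Gaussian φ and Φ are used) verifies that multiplying two Gaussian "equality" probabilities gives, after re-normalization, another Gaussian with σ' = σ/√(1 + α²), while multiplying by the "inequality" probability Φ(α • x) gives exactly the skew normal density (12) with η = 0 and ω = 1:

```python
import numpy as np
from scipy.stats import norm

alpha = 2.0
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]

# (a) "Adding" the condition X = alpha*x: the re-normalized product of the two
#     Gaussian densities is again Gaussian, with sigma' = 1 / sqrt(1 + alpha^2).
prod = norm.pdf(x) * norm.pdf(alpha * x)
prod /= np.sum(prod) * dx
sigma_prime = 1.0 / np.sqrt(1.0 + alpha**2)
print(np.max(np.abs(prod - norm.pdf(x, scale=sigma_prime))))   # should be close to 0

# (b) "Adding" the asymmetric condition X <= alpha*x instead: the re-normalized
#     product phi(x) * Phi(alpha*x) equals the skew normal density (12)
#     with eta = 0 and omega = 1, i.e. 2 * phi(x) * Phi(alpha*x).
prod2 = norm.pdf(x) * norm.cdf(alpha * x)
prod2 /= np.sum(prod2) * dx
print(np.max(np.abs(prod2 - 2.0 * norm.pdf(x) * norm.cdf(alpha * x))))  # should be close to 0
```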

Acknowledgments

This work was supported by grant CB-2013-01-221676 from Mexico Consejo Nacional de Ciencia y Tecnologia (CONACYT). It was also partly supported by the US National Science Foundation grant HRD-1242122 (Cyber-ShARE Center of Excellence). This work was partly performed when José Guadalupe Flores Muñiz visited the University of Texas at El Paso.

References

Azzalini, A., & Capitanio, A. (2013). The skew-normal and related families. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9781139248891

Dumrongpokaphan, T., & Kreinovich, V. (2017). Why cannot we have a strongly consistent family of skew normal (and higher order) distributions. In V. Kreinovich, S. Sriboonchitta, & V.-N. Huynh (Eds.), Robustness in econometrics (pp. 69–78). Cham, Switzerland: Springer. doi:10.1007/978-3-319-50742-2_4

Jaynes, E. T., & Bretthorst, G. L. (2003). Probability theory: The logic of science. Cambridge, UK: Cambridge University Press.

Li, B.-K., Shi, D.-M., & Wang, T.-H. (2009). Some applications of one-sided skew distributions. International Journal of Intelligent Technologies and Applied Statistics, 2, 13–27. doi:10.6148/IJITAS.2009.0201.02

Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). Boca Raton, FL: Chapman and Hall/CRC. doi:10.1201/9781420036268

International Journal of Intelligent Technologies and Applied Statistics

Vol.11, No.2 (2018) pp.121-141, DOI:10.6148/IJITAS.201806_11(2).0005 © Airiti Press

Willingness-to-pay for FTTH and the Quality of Experience from OTT Media Streaming Services: The Case Study of Thailand

Tatcha Sudtasan*

Graduate School of Asia-Pacific Studies, Waseda University, Tokyo, Japan

ABSTRACT The objective of this study is to investigate the impact of over-the-top (OTT) media streaming services on consumers’ demand for fiber-to-the-home (FTTH) service. A double-hurdle model is applied to decompose consumers’ willingness-to-adopt (WTA) and willingness-to-pay (WTP) for FTTH by OTT media streaming services, individual characteristics, experience of connection problem, and expectation of media consumption through the Internet. A respondent is allowed to express his or her WTP in an interval in order to truly reflect the willingness and flexibility to pay. The results show that consumers are willing to adopt and pay for FTTH for their quality of experience from OTT media streaming consumption, particularly movie services. This study contributes to the projection of market value and the effect of OTT media streaming services along the diffusion of the FTTH market. Policy implications are discussed to set regulations and incentives that increase broadband penetration at the right price.

K eywords: OTT media streaming services; Fiber-to-the-home (FTTH); Willingness-to-adopt (WTA); Willingness-to-pay (WTP); Double-hurdle model

1. Introduction

The delivery of services over the Internet, riding on Internet Service Providers' (ISPs) networks, is defined as over-the-top (OTT) (Body of European Regulators for Electronic Communications [BEREC], 2016; Knoben, 2014; Mnakri, 2015). The continuous growth in consumption of OTT services, owing to the new value they create for societies, is expected to boost network bandwidth even more over the next few years (Ericsson, 2015a, 2015b). For ISPs, there is a strong impact on their traditional telecommunications revenues and on traffic over the Internet (Baldry, Steingröver, & Hessler, 2014; Seixas, 2015). The primary and largest impact of OTT

* Corresponding author: [email protected]


services is on the Internet traffic created by services such as YouTube, …, and TV programs on the Internet. This is because viewers are shifting towards services that are easy to use and respond well to their needs. These shifts motivate the development of cross-platform access to video content. Streaming allows users to watch content through the Internet on a real-time basis, with the data processed as it is watched, so no content download is needed (Time Consulting, 2017). OTT media streaming services have been rising and raising demands for higher Internet capacity (Baldry et al., 2014; Detecon International GmbH, 2014; Ericsson, 2015b; Greenough, 2016; International Telecommunications Union [ITU], 2016; Parks Associates, 2015). Much of the literature mentions the negative impacts of OTT services on network traffic and on the revenues of ISPs. While OTT services consume a lot of bandwidth, there is no doubt that they are also stimulating Internet penetration (Knoben, 2014). Therefore, this study attempts to link these issues to shifts in consumers' preferences in order to identify the underlying factors that drive demand for higher Internet capacity towards these services.

Optical fiber broadband is recognized to provide the highest capacity and speed among all broadband services. An optical fiber-based access network providing connections to individual end-users is referred to as fiber-to-the-home (FTTH). With its standard of connectivity, FTTH supports a wide range of new services and applications, both for entertainment and productivity, delivered right to the home or the office. The services include video communication, video-on-demand, online gaming, teleworking, e-Health services and much more (Fiber to the Home Council Europe [FTTH Council Europe], 2016b). FTTH is expected to be a potential solution for accessing an increasing amount of data and dealing with the forecasted enormous Internet traffic, which would further enhance long-run competitiveness for ISPs (Crosby, 2008; FTTH Council Europe, 2016a).

This study takes Thailand as its case because of the rapid growth of the total bandwidth usage in the country, driven by the many OTT services currently available to consumers. In 2015, Thailand was among the top 10 markets worldwide for the total hours spent on YouTube, with 70% growth in hours of content uploaded, and around 50% of the country's Internet users accessed YouTube every day (Yozzo, 2015). Moreover, through many telecommunications policies and frameworks initiated by the Thai government, the optical fiber network will be heavily invested in. Since the broadband market in Thailand has been concentrating very much on mobile broadband rather than fixed broadband, it is critical to carefully examine the demand for optical fiber as a means of Internet access for individual users.

Assuming mobile broadband as the foundation of connection, the objective of this paper is to investigate the impact of OTT media streaming services on consumers' willingness-to-adopt (WTA) and willingness-to-pay (WTP) for an FTTH connection. A double-hurdle model is applied to data retrieved from FTTH non-adopters in Thailand through interview-based questionnaires. The originality of this study lies in the framework decomposing consumers' WTA and WTP for FTTH by OTT


media streaming services, individual characteristics, experience of connection problems, and expectation of future media consumption through the Internet. Besides, respondents are allowed to express their WTP for FTTH in a range in order to truly reflect their willingness and flexibility to pay. In addition, this study also asks respondents whether mobile broadband prevents them from adopting optical fiber broadband, in order to deepen the understanding of respondents' preferences. This study will provide a better understanding of non-adopters' intention and demand for FTTH, and the WTP estimates will serve as a guideline for policy makers in regulating and bounding the right price to stimulate FTTH utilization.

1.1 Broadband market in Thailand

Thailand is ranked 74th among 167 countries worldwide on the Information and Communication Technology (ICT) Development Index. The country is viewed as one of the countries attaining fast and steady ICT development (Malisuwan, Tiamnara, & Kaewphanuekrungsi, 2016). The broadband market in Thailand has relied on mobile rather than fixed access (National Broadcasting and Telecommunications Commission [NBTC], 2015). Mobile broadband is now shifting from 3G to 4G and has been concentrated in urban areas. Its penetration per population reached 95% in the 3rd quarter of 2016, as shown in Figure 1.

Figure 1. Broadband market in Thailand. Source: ITU (2016) and NBTC (2015).


Almost 99% of fixed-line Internet access in Thailand is broadband. Fixed broadband penetration per household has been continuously increasing since 2004. In 2016, the majority of broadband Internet access, with a 69.5% technological share, was digital subscriber line (DSL) (Figure 2). Some central business districts accessed fixed broadband via fiber optic cables, fiber-to-the-premises (FTTP), which accounted for 18.5%. Cable was used by approximately 8.9%, while other technologies were used by 3.1% of the entire services. However, the penetration per household was only 35.37% and the penetration per population was only 11.15% in October 2017. Thai people's dependency on Internet applications has led to a strong demand for broadband access. The total bandwidth usage in the country is growing rapidly and the wireless network has been gradually congested (Malisuwan & Kaewphanuekrungsi, 2016). Fixed-line broadband infrastructure has become an urgent need for offloading mobile data traffic. There are also initiatives by the Thai government to offer mainstream FTTH providing speeds of up to 100 Mbit/s in Thailand by 2020. Through many telecommunications policies and frameworks, the optical fiber network will be heavily invested in. In the near future, fixed broadband infrastructure and services will shift from ADSL to FTTH countrywide in order to provide a bandwidth that meets the increasing demand. Moreover, it is expected that FTTH will boost the broadband access of Thais, supporting people's lives and further enhancing the country's competitiveness.

Figure 2. Fixed-broadband market share by technology. Source: NBTC (2016).


1.2 OTT media streaming services in Thailand

There are many types of OTT services currently available to consumers in Thailand. There are free services such as Facebook, YouTube and Line, and paid services such as Vimeo and Plus. Specifically, OTT media streaming services categorized by revenue-earning methods include "subscription video-on-demand" (SVoD), which offers unlimited access to content through a monthly or annual subscription, and "transactional video-on-demand" (TVoD), which provides limited individual content or programs to each paying customer, such as AIS TV, Netflix, and Hollywood HDTV. Moreover, there are "advertising video-on-demand" (AVoD), in which consumers can watch free content with pop-up advertisement banners and/or video, such as Dailymotion, YouTube, and Line TV, and "freemium", which offers free advertising-driven content, subscription-based ad-free content, and add-on payment for specific content, such as YouTube Red and Spotify. OTT media streaming services categorized by type of provider include individual providers, OTT from free TV providers, OTT from pay TV providers, and telecom providers (DemystifyAsia, 2016; Pornwasin, 2017; Time Consulting, 2017).

Netflix is one of the competitive OTT streaming service providers in the global market. However, in 2016, the competition in Asia became more intensive due to more regional competitors (Cunningham, 2017; Nambiar, 2017). Thailand shows some similar trends in this competition. Time Consulting (2017) and Srirasa (2017), Director of the Digital Broadcasting Bureau (DAAT), report that Facebook gains the highest revenue among AVoD OTT providers, while Hollywood HDTV is the biggest individual provider. There are many rivals among global providers, such as Netflix, an aggressive investor (Cunningham, 2017; DemystifyAsia, 2016), and HOOQ, a regional joint venture among Singtel, Sony Pictures Television, and Warner Bros, which focuses on regional content and allows subscribers to download up to 5 favorite items of content onto their registered devices for watching later, which is similar to (Channel News Asia [CNA], 2016; Cunningham, 2017; DemystifyAsia, 2016). There are also Thai providers such as DOONEE, PrimeTime, and MONOMaxxx (Cunningham, 2017; Time Consulting, 2017). Viu, a Hong Kong startup, shows optimistic potential for launching in Thailand. There are more than 800,000 likes on Viu's Thailand Facebook page even though the company has not yet entered the market. Given the popularity of Korean content among Thai people, Viu promotes its freemium model by engaging with South Korean celebrities to differentiate its services from other providers (Cunningham, 2017; Nambiar, 2017).

The overall market of OTT media streaming in Thailand is considered to be at a preliminary stage. The service has gained popularity since 2015. The main reasons are the limitations of the Internet infrastructure, especially home Internet, and low connection speeds. Moreover, Thai users prefer free platforms to SVoD (Time


Consulting, 2017). However, the competition is expected to become more intensive in the future due to the growing trend of digital advertising. At the same time, there are also some possible issues concerning good viewing quality for customers, bandwidth management, laws, and regulations. To prepare for future regulations, the NBTC is determining policies and strategies to protect consumers and ensure fairness for both broadcasting players and Internet service providers (Pornwasin, 2017; Time Consulting, 2017).

2. Relevant literature

There are several strands of literature on factors influencing high-speed Internet adoption and migration. Most of the studies focus on broadband in general and the impact of demographic variables on consumers' demand for high-speed Internet access. Representative studies such as Kwak, Skoric, Williams, and Poor (2004), Flamm and Chaudhuri (2007), Dwivedi and Lal (2007), Dwivedi, Lal, and Williams (2009), Sim, Tan, Ooi, and Lee (2011), and Srinuan and Bohlin (2013) indicate that age, income and education level have a significant impact on consumers' intention to adopt and migrate to broadband services. However, Carare, McGovern, Noriega, and Schwarz (2015) discover that income lacks predictive power, which implies that non-adopters are faced with a variety of barriers to broadband adoption. The traditional approach relies on choice experiments to analyze consumers' preferences for broadband networks and estimate the demand for broadband services (Madden & Simpson, 1997; Sunada, Noguchi, Ohashi, & Okada, 2011). Rosston, Savage, and Waldman (2010) estimate random utility models of household preferences for broadband Internet service. The work indicates that household valuations of broadband service increase with online experience, including experience with different Internet connection speeds, number of years online, and experience with different Internet-related devices. Ida and Sakahira's (2008) study includes motion-picture viewing as an independent variable to examine consumers' intention to adopt broadband. Sunada et al. (2011) employ the Nested Logit and Mixed Logit models to specify broadband access demand. The work conducts simulations to figure out the impact of FTTH coverage expansion on household switching probability from previous ADSL and ISDN access modes. The finding indicates that usage intensity plays a significant role in switching to FTTH when the service becomes available. From a macro perspective of network adoption and diffusion, Rixen and Weigand (2014) conclude that consumers will not migrate to a more advanced network technology without some push or intervention. To obtain increasingly accurate evaluations, a number of works elicit respondents' WTP in order to estimate the elasticity of the demand for broadband (Carare et al., 2015; Rappoport, Kridel, Taylor, Alleman, & Duffy-Deno, 2003) and to support future broadband infrastructure investments (Jeffcoat, Davis, & Hu, 2012). WTP can reveal


non-adopters' preferences and demand for broadband service. Many studies have also identified that price is one of the key success factors in driving broadband utilization and diffusion. However, Carare et al. (2015) point out that most of the literature does not identify the right price to stimulate technology adoption that governments could use to increase broadband penetration. A better understanding of non-adopters' WTP will make it possible for policy makers to regulate and bound the optimal price. Related bodies of research which aim to estimate WTP highlight the importance of service characteristics in terms of reliability and speed. For example, Savage and Waldman (2005) use conjoint analysis to identify Internet service attributes that are important for consumers in considering Internet access and WTP. Nakamura (2013) uses the contingent valuation method in the context of universal telecommunications service based on functionality. The paper measures Japanese customers' WTP to maintain combinations of five telecommunication services that are considered substitutes. Many studies have been conducted on factors influencing broadband preference. However, there are still few works linking the effect of OTT media streaming services to consumers' WTA and WTP for FTTH broadband access.

3. Methodology

3.1 Framework design

The framework of this study is mainly based on the stated preferences of FTTH non-adopters, measured by WTA and WTP. It extends the framework of previous studies that typically focus on the impact of demographic variables such as age, income, and online experience on consumers' intention to adopt, and WTP for, high-speed Internet service. This paper treats OTT as a push factor for the demand for higher-capacity Internet access. Since OTT media streaming services increase consumers' demand for bandwidth and FTTH provides the highest capacity among network technologies, this study presumes that, with consumption of OTT services, consumers will need FTTH and be willing to pay more. The framework of this paper examines the effects of OTT media streaming and expectations of future media consumption through the Internet, as well as individual characteristics and experiences of Internet connection problems, on WTA and WTP for an FTTH subscription. Concerning the lack of predictive ability of income for broadband adoption mentioned by Carare et al. (2015), the present paper tries to develop a variable reflecting people's purchasing power. The percentage of monthly expenditure over income is considered to be better than pure income for explaining consumers' demand in the case of fixed broadband. This is because fixed-line Internet is an additional service and people have increased their dependency on mobile rather than fixed connections. Therefore, the model uses monthly expenditure per income instead of monthly income.


Willingness to accept a more advanced service is considered to be driven by current experience and expectations of future consumption. The framework includes consumers' experiences of mobile and fixed-line Internet disconnection, delay, unstable quality of motion pictures, experience of reaching the upper limit of mobile Internet packages, and expectations of future movie, online TV, and video clip watching. For WTP, the framework focuses more on people's cost-benefit considerations. Therefore, the effects of current OTT media streaming service consumption, as well as expectations of future consumption, on WTP are quantified. The models will help us to better understand the underlying factors that influence the demand for higher bandwidth, especially the impact of OTT media streaming services, which further drives consumers' WTP for FTTH.

3.2 The survey

An interview-based questionnaire was carried out with 500 respondents during July 2016. The survey was conducted at main regional airports and bus and train stations in order to cover representatives of Thai people from all age groups, genders, income levels, and regions—Northern, Northeastern, Central, Southern, Bangkok, and the vicinities. Data were retrieved from respondents who agreed to complete all the questions in exchange for a payment. The survey asked several key questions that allowed the investigation of FTTH non-adopters' willingness to adopt and pay for FTTH service. The questionnaire consisted of the individual's profile, telecommunications service experiences, OTT services consumption, intention to adopt, and willingness to pay for FTTH. The individual profile consisted of gender, age, education, occupation, monthly income, monthly expenditure, and living region. Telecommunications service experiences included current usage of Internet connection; connection problems experienced included connection drops, upload/download delay, unstable video quality during media watching, and sound or video distortion during video calls. These experiences and problems were separated into those from mobile broadband and fixed broadband services. They also included the problem of the reduction of Internet speed due to an over-use of mobile broadband access reaching the upper limit of the respondent's mobile package. The consumption of OTT media streaming services was indicated by daily time spent on movie watching and downloading, TV programs, and video clips. This part also identified the current consumption channel and that expected within 3 years. For movies, respondents indicated their intention to watch movies through the Internet more than through traditional channels in the next 1–3 years. For TV programs on demand, respondents indicated their consumption channel between television and the Internet on a scale from 0 to 4, where zero indicated a preference for using only television while "4" indicated a preference for using only the Internet. For video clips, respondents indicated their consumption channel between mobile and fixed connections on the


scale from 0 to 4, where zero indicated that they totally preferred using only mobile while "4" indicated a preference for using only the fixed network. Finally, consumers' preference for FTTH was asked about after an explanation of optical fiber broadband network capacities. Respondents were required to decide whether or not they were willing to subscribe to FTTH. Those who had an intention to subscribe were asked about their WTP. Fuzzy data were used to derive consumers' WTP for an optical fiber broadband connection influenced by OTT services. Instead of choosing one single answer or a certain range of answers as in a traditional survey, respondents could express the degree of their feelings by using a membership function to truly reflect their range of willingness to pay (Lin, Wang, Chen, & Wu, 2016). For those who were not using a fixed network, the first answer was considered the minimum of WTP, while for those who were currently using fixed broadband, their current monthly price was used as the minimum. Adapting the contingent valuation method, respondents were then asked repeatedly, with an increment of THB 50 (approximately USD 1.5), until they were no longer willing to pay. The last accepted price was recorded as the maximum of WTP.

3.3 Model

The empirical approach of this study is based on data on consumers' demand for an optical fiber broadband connection. The approach involves analyzing data that comprise the respondents' WTA and WTP for higher Internet capacity. Respondents are assumed to first decide whether or not to pay. Conditional on the first decision, their second decision is on the amount they will pay. In this situation, the double-hurdle model is the most appropriate because it allows different sets of variables to affect the two decisions (Martínez-Espiñeira, 2006). The double-hurdle model has the advantage of allowing a more flexible framework for modelling the observed consumer behavior as a joint choice of two decisions instead of a single one (Huang, Kan, & Fu, 1999). A respondent's decision can be explained in a tree diagram as shown in Figure 3.

Figure 3. A decision tree diagram. Source: Huang et al. (1999).
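To make the two-decision structure concrete, here is a minimal sketch of how such a model can be estimated by maximum likelihood. It follows a Cragg-type independent double-hurdle specification, which is a common variant; the paper does not spell out its exact likelihood, so this is an illustrative assumption, and the data generated in the example are purely synthetic rather than the survey data of this study. NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_loglik(theta, y, Z, X):
    """Negative log-likelihood of an independent (Cragg-type) double-hurdle model.
    theta packs alpha (participation), beta (level) and log(sigma)."""
    kz, kx = Z.shape[1], X.shape[1]
    alpha = theta[:kz]
    beta = theta[kz:kz + kx]
    sigma = np.exp(theta[-1])                        # keep sigma positive
    p_part = norm.cdf(Z @ alpha)                     # first hurdle: Prob(participation)
    xb = X @ beta
    ll_zero = np.log(1.0 - p_part * norm.cdf(xb / sigma) + 1e-12)
    ll_pos = (np.log(p_part + 1e-12)
              + norm.logpdf((y - xb) / sigma) - np.log(sigma))
    return -np.sum(np.where(y > 0, ll_pos, ll_zero))

# Tiny synthetic illustration (hypothetical data, not the survey of this study).
rng = np.random.default_rng(0)
n = 500
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
d = (Z @ np.array([0.2, 0.8]) + rng.normal(size=n)) > 0
y_star = X @ np.array([6.0, 1.5]) + rng.normal(scale=2.0, size=n)
y = np.where(d & (y_star > 0), y_star, 0.0)

theta0 = np.zeros(Z.shape[1] + X.shape[1] + 1)
res = minimize(neg_loglik, theta0, args=(y, Z, X), method="BFGS")
print(res.x)
```

In the study's notation, Z would hold the connection-problem and expectation variables and X the individual-characteristic and current-consumption variables listed in Table 1.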


The double-hurdle model basically deals with two separate individual choices: participation in the market and the level of participation (Huang et al., 1999; Martínez-Espiñeira, 2006). The structure of the model is specified as

$$D_i = \begin{cases} 1 & \text{if } \alpha Z_i + \varepsilon_d > 0, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $D_i$ denotes the latent participation decision of whether or not the respondent is willing to pay, with $D_i = 1$ if a respondent is willing to pay for FTTH and $D_i = 0$ otherwise. $Z_i$ is a vector of experiences of connection problems and expectations of future media streaming service consumption, and $\alpha$ is a vector of parameters.

$$Y_i = \begin{cases} Y_i^* & \text{if } Y_i^* > 0 \text{ and } D_i^* > 0, \\ 0 & \text{otherwise,} \end{cases} \qquad Y_i^* = \beta X_i + \varepsilon_y. \qquad (2)$$

Conditional on $D_i = 1$, the respondent decides on the level of their WTP for FTTH. $X_i$ is a vector of individual characteristics, current media streaming service consumption in terms of time spent, and expectations of future media streaming service consumption within 1–3 years; $\beta$ is a vector of parameters.

The FTTH non-adopter respondents consist of two groups. The first group comprises respondents who do not want to subscribe to FTTH at any price; the analysis assigns a WTP value of THB 0 per month to them. The other group comprises respondents who indicate that they are willing to adopt the FTTH service within a range of prices they have considered. The WTP of each respondent is then calculated. This study adopts the scaling of a fuzzy number (A) onto a real number (R) proposed by Lin et al. (2016): for a fuzzy number A = [a, b],

$$\text{Fuzzy number on } R = c_x + \frac{\lVert A \rVert}{2\ln\left(e + \lVert c_x \rVert\right)}, \qquad (3)$$

where $c_x$ is the center of the interval and $\lVert A \rVert$ is the range of the interval.
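A small sketch of this defuzzification step, following Eq. (3) as reconstructed above (the interval values used below are hypothetical, not taken from the survey):

```python
import math

def fuzzy_wtp(a, b):
    """Scale an interval WTP [a, b] to a single number following Eq. (3):
    c_x + ||A|| / (2 * ln(e + ||c_x||)), with c_x the interval center and
    ||A|| = b - a the interval range."""
    c = (a + b) / 2.0
    return c + (b - a) / (2.0 * math.log(math.e + abs(c)))

# Example: a respondent whose first accepted monthly price is THB 500
# and whose last accepted price is THB 650.
print(round(fuzzy_wtp(500.0, 650.0), 2))
```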

The list of variables that are developed and included in the models is shown in Table 1.


Table 1. Description of variables.

Variable | Description | Mean | Min | Max | SD
WTP (dependent variable) | Willingness-to-pay (THB), calculated by the fuzzy number [min, max] on R suggested by Lin et al. (2016), where min = consumer's first answer of willingness to pay and max = maximum willingness to pay. | 604.15 [505.49, 670.21] | 95.70 | 1,640.07 | 208.97
Age | Age of the respondent. | 33.79 | 13.00 | 75.00 | 11.25
WTA | Willingness-to-accept to subscribe to the FTTH service. | 0.50 | 0 | 1.00 | 0.50
PercentExpdInc | Percentage of monthly expenditure over income. | 72.08 | 7.14 | 163.33 | 24.01
MovieHrs | Daily hours spent watching movies through the Internet. | 1.51 | 0 | 5.00 | 1.70
TVhrs | Daily hours spent watching TV programs through the Internet. | 1.88 | 0 | 5.00 | 1.69
ClipHrs | Daily hours spent watching video clips through the Internet. | 1.34 | 0 | 5.00 | 1.59
MBproblem | The number of mobile connection problems the respondent has experienced, including (1) Internet connection drop, (2) upload or download delay, (3) unstable video quality of motion pictures, and (4) sound distortion during video calls. | 1.54 | 0 | 4.00 | 0.94
FBproblem | The number of fixed-line connection problems the respondent has experienced, including (1) Internet connection drop, (2) upload or download delay, (3) unstable video quality of motion pictures, and (4) sound distortion during video calls. | 1.00 | 0 | 4.00 | 0.95
MobileOverUse | = 1 if the respondent has experienced reaching the upper limit of mobile Internet packages resulting in the reduction of Internet speed; = 0 otherwise. | 0.65 | 0 | 1.00 | 0.46


Table 1. Description of variables. (Continued)

Variable | Description | Mean | Min | Max | SD
FutureMovie | Consumer's portion of watching movies through traditional channels (such as theatre, DVD and TV movie channels) versus the Internet in the next 1–3 years, ranging from 0 to 4: 0 = 100% through traditional channels; 1 = 75% traditional and 25% Internet; 2 = 50% traditional and 50% Internet; 3 = 25% traditional and 75% Internet; 4 = totally watch movies through the Internet. | 2.74 | 0 | 4.00 | 1.59
FutureTV | Consumer's portion of watching TV programs through television versus the Internet in the next 1–3 years, ranging from 0 to 4: 0 = 100% through television; 1 = 75% television and 25% Internet; 2 = 50% television and 50% Internet; 3 = 25% television and 75% Internet; 4 = totally watch TV programs through the Internet. | 2.60 | 0 | 4.00 | 1.42


Table 1. Description of variables. (Continued)

Variable | Description | Mean | Min | Max | SD
FutureClip | Consumer's portion of watching video clips on demand through mobile Internet versus fixed-line Internet in the next 1–3 years, ranging from 0 to 4: 0 = 100% through mobile Internet; 1 = 75% mobile and 25% fixed-line; 2 = 50% mobile and 50% fixed-line; 3 = 25% mobile and 75% fixed-line; 4 = totally watch video clips through fixed-line Internet. | 2.18 | 0 | 4.00 | 1.65

4. Results

4.1 Data descriptions

After excluding incomplete questionnaires, this study obtained 484 effective respondents. Thus, the effective response rate was 96.80%. However, the data utilized for examining the influence of OTT media streaming services on consumers' WTA and WTP for an optical fiber broadband connection included 474 respondents, as 10 current FTTH users were excluded. Although the number of observations was only 474, the average age of the respondents was 33.79 years old, which was very close to the average age of Thais, 33.25 years old, in 2015 as reported by the National Statistical Office. Moreover, the distribution of respondents by region was in accordance with the percentage of Internet users in each region, as shown in Figure 4.

4.2 WTA for FTTH

The first result concerns the impact of FTTH non-adopters’ experience of connection problems and expectation on future media usage through the Internet


on their stated intention to adopt FTTH. Approximately half of the respondents indicated that they would consider subscribing to FTTH. The reasons for non-adoption are shown in Table 2. The results shown in Table 3 indicate that respondents who spend more time on movies through the Internet and intend to watch movies through the Internet in the next 1–3 years are more likely to adopt FTTH. Moreover, experience in reaching the upper limit of mobile Internet packages and mobile quality problems appear to be strong predictors of non-adopters' willingness to subscribe to FTTH. Quality problems of fixed-line Internet and expectations of watching online TV and video clips through fixed-line Internet are not predictors of respondents' intention to adopt the more advanced broadband technology, FTTH.

Figure 4. Distribution of Thai people, Internet users, and respondents by region.

4.3 WTP for FTTH

The results shown in Table 3 indicate that purchasing power and current Internet consumption are key influences on respondents' WTP. Respondents who bear lower monthly expenditure relative to income are willing to pay more for FTTH.

Table 2. Reasons for FTTH non-adoption.

Reasons | Percentage of respondents who do not choose FTTH
1. Current mobile technology is enough. | 58.30
2. Wi-Fi offload is enough. | 44.68
3. Current fixed-line technology is enough. | 40.85
4. No need to use the Internet so often. | 9.36
5. Willing to pay more for mobile. | 3.40


Table 3. Double-hurdle regression results.

Variable | D (WTA equation, Z) Coefficient | Y (WTP equation, X) Coefficient
Age | | -0.00391
PercentExpdInc | | -0.00919*
MovieHrs | | 0.14727*
TVhrs | | 0.07936
ClipHrs | | -0.20170**
MBproblem | 0.10226** |
FBproblem | 0.03834 |
MobileOverUse | 0.21314** |
FutureMovie | 0.08034** | 0.13649
FutureTV | 0.00043 | 0.01271
FutureClip | 0.02233 | 0.02194
Constant | -0.59237*** | 4.57055***
Dependent variable | WTA (Mean = 0.51) | WTP calculated from the reported fuzzy WTP [min, max] (hundred baht), where min = consumer's first answer of WTP and max = maximum WTP.
Note: *Significant to 90%; **Significant to 95%; ***Significant to 99%.

Respondents who currently spend more time on movies through the Internet are likely to pay more for the FTTH service. Interestingly, respondents who spend more time watching video clips are less likely to pay for the FTTH service. This result can be explained as follows: on the one hand, video clips in Thailand are usually not very long, so people consume information and shallow entertainment and do not need high-capacity Internet connectivity; on the other hand, people expect higher quality when watching movies. In summary, the demand for more bandwidth induced by OTT media streaming services, particularly the online movie service, will drive Internet users to migrate to an FTTH connection for quality of connectivity and to off-load mobile consumption. However, optical fiber broadband will not fully penetrate the country in the near future, since more than half of the representative consumers prefer the mobile network. Besides, purchasing ability restricted by expenditure per income is one of the barriers to paying for the FTTH service.

5. Discussion and implications

The results of this study imply that movie streaming over the Internet is the only OTT media streaming service that directly drives consumers to adopt FTTH access.


This study supports Knoben's (2014) finding that OTT services consume a lot of bandwidth and increase the traffic over the Internet but at the same time stimulate the diffusion of the Internet market. Movies over the Internet can directly contribute to the revenue of optical fiber broadband operators as well as to the adoption and diffusion of the market. People will still maintain the balance between traditional and online channels for the consumption of TV programs and video clips. Online TV and video clips in Thailand are usually not very long, so people just consume information and shallow entertainment. Therefore, people do not expect a high quality of experience and are less likely to pay for FTTH. In contrast, they watch movies for their pleasure and appreciate the entertainment through quality. Around 78% of the respondents show an intention to watch movies through OTT services rather than traditional channels in the next 1–3 years. Therefore, the quality of movie consumption drives respondents' WTA and WTP for FTTH for a higher capacity of Internet connection.

Another important factor that strongly affects people's decision to adopt the FTTH service is the experience of problems and over-usage of mobile Internet connections. This result is in line with the findings from Savage and Waldman's (2005) and Rosston et al.'s (2010) studies that reliability and speed of the Internet connection are important attributes for high-speed Internet users.

Besides, a further important factor is the ability to pay, restricted by monthly expenditure per income. This adds to the finding from Carare et al. (2015) that non-adopters are faced with a variety of barriers to broadband adoption for which income lacks predictive power. This is because fixed broadband is a kind of additional Internet access, while a mobile subscription is almost equivalent to an individual's identity. People prefer mobile to fixed-line connections when they have a limited ability to pay. This finding also adds to Chen and Watanabe's (2006) finding that the diffusion of fixed-line Internet is much slower than that of mobile Internet due to lower subscription potential.

Lessons from many countries show that governments have made large investments in developing broadband infrastructures to deliver high-speed Internet access to end users. However, consumers' increasing dependency on mobile connectivity makes the direction of the FTTH broadband market unclear. For policy makers, the results of this study imply that investment in optical fiber should be supported both as a means of FTTH and as the backbone of advanced mobile networks. For ISPs, a bundled package of FTTH with a subscription for movies seems to be attractive. Moreover, as applications and services that consume a lot of bandwidth and require reliable connections become increasingly available to consumers, investment in the optical fiber network seems to be good and worthwhile, because consumers would like to use it and are willing to pay for the quality of their Internet usage. However, there is also a need for the public and private sectors to provide the FTTH service at an affordable price in order to diffuse the market and make the optical fiber network available for all people to utilize.


6. Conclusions

FTTH provides the highest capacity and speed among broadband services. It is expected to be a potential solution for the forecasted enormous Internet traffic and a driver of long-run competitiveness for ISPs. This study takes Thailand as its target country since optical fiber broadband adoption is still at an early stage countrywide while the demand for bandwidth is continuously increasing. Assuming mobile broadband as the foundation of connection, the analysis examines the impact of OTT media streaming services on consumers' WTA and WTP for the FTTH service.

From the consumer's perspective, the results indicate that FTTH is an efficient off-load option for the quality of movie consumption through the Internet. The demand for more bandwidth induced by OTT media streaming services, in particular the online movie service, will drive the migration to FTTH connections, which will further enhance the diffusion of the optical fiber broadband market. However, purchasing power restricted by expenditure over income is one of the key factors that affect consumers' WTP for the FTTH service. The reported average minimum WTP is around THB 505 (USD 15.21). This requires policy makers to promote competition among ISPs for a continuous decrease in FTTH prices. As a consequence, the FTTH market can realize rapid diffusion, because people are now willing to adopt FTTH due to the low quality of current mobile connections, over-usage of mobile data, and the expectation of shifting their movie consumption from traditional platforms to OTT platforms. There is also a need for investment in upgrading the network to optical fiber, serving both as a means of FTTH and as the backbone of advanced mobile networks.

This study discovers that respondents rely very much on the mobile Internet. While the coverage of mobile broadband has spread out countrywide, fixed-broadband services are still unavailable in rural areas. In particular, the FTTH service is very limited in many city centers. Therefore, the framework of this study may be generalized to developing countries where people's demand for bandwidth grows faster than the supply of advanced Internet connections. Moreover, there is room for further study on the impact of coverage expansion on consumers' WTA and WTP for FTTH.

References

Baldry, S., Steingröver, M., & Hessler, M. A. (2014, June). The rise of OTT players: What is the appropriate regulatory response? Paper presented at the 25th European Regional Conference of the International Telecommunications Society, Brussels, Belgium.

Body of European Regulators for Electronic Communications. (2016). Report on OTT services. Retrieved from https://berec.europa.eu/eng/document_register/subject_matter/berec/reports/5751-berec-report-on-ott-services

Carare, O., McGovern, C., Noriega, R., & Schwarz, J. (2015). The willingness to pay for broadband of non-adopters in the U.S.: Estimates from a multi-state survey. Information Economics and Policy, 30, 19–35. doi:10.1016/j.infoecopol.2014.12.001

Channel News Asia. (2016). Netflix in Southeast Asia: iflix, Hooq welcome competition. Retrieved from http://www.channelnewsasia.com/news/asiapacific/ netflix-in-southeast-asia-iflix-hooq-welcome-competition-8217416

Chen, C., & Watanabe, C. (2006). Diffusion, substitution and competition dynamism inside the ICT market: The case of Japan. Technological Forecasting and Social Change, 73, 731–759. doi:10.1016/j.techfore.2005.07.008

Crosby, T. (2008). How fiber-to-the-home broadband works. Retrieved from http://computer.howstuffworks.com/fiber-to-the-home3.htm

Cunningham, S. (2017). Iflix, Hooq, Viu, Netflix, Viki, Catchplay and more: It's a crowded video-on-demand world. Retrieved from https://www.forbes.com/sites/susancunningham/2017/03/01/iflix-hooq-viu-netflix-viki-catchplay-and-more-its-a-crowded-video-on-demand-world/#4e0b321d7457

DemystifyAsia. (2016). Hooq vs iflix vs Netflix: A comparison of subscription video on-demand (SVOD) services in Asia. Retrieved from http://www.demystifyasia.com/hooq-vs-iflix-vs-netflix-comparison-subscription-video-demand-svod-services-asia/

Detecon International GmbH. (2014). Policy and regulatory framework for governing internet applications. Retrieved from https://www.detecon.com/ap/ap/files/Study_Policy_Regulatory_Framework.pdf

Dwivedi, Y. K., & Lal, B. (2007). Socio-economic determinants of broadband adoption. Industrial Management & Data Systems, 107, 654–671. doi:10.1108/02635570710750417

Dwivedi, Y. K., Lal, B., & Williams, M. D. (2009). Managing consumer adoption of broadband: examining drivers and barriers. Industrial Management & Data Systems, 109, 357–369. doi:10.1108/02635570910939380

Ericsson. (2015a). 10 hot consumer trends 2015. Retrieved from https://www.ericsson. com/res/docs/2014/consumerlab/ericsson-consumerlab-10-hot-consumer-trends-2015. pdf

Ericsson. (2015b). 10 hot consumer trends 2016. Retrieved from https://www.ericsson. com/res/docs/2015/consumerlab/ericsson-consumerlab-10-hot-consumer-trends-2016- report.pdf

Fibre to the Home Council Europe. (2016a). FTTH handbook (7th ed.). Retrieved from http://www.ftthcouncil.eu/documents/Publications/FTTH_Handbook_V7.pdf

Fibre to the Home Council Europe. (2016b). An introduction to the FTTH broadband technology: A white paper by the Deployment & Operations Committee. Retrieved from http://www.ftthcouncil.eu/documents/Publications/FTTH_introduction_whatisit_2016.pdf

Flamm, K., & Chaudhuri, A. (2007). An analysis of the determinants of broadband access. Telecommunications Policy, 31, 312–326. doi:10.1016/j.telpol.2007.05.006

Greenough, J. (2016). The top 5 digital trends for 2016. Retrieved from http://www.businessinsider.com/the-top-5-digital-trends-for-2016-2016-3

Huang, C. L., Kan, K., & Fu, T.-T. (1999). Consumer willingness-to-pay for food safety in Taiwan: A binary-ordinal probit model of analysis. Journal of Consumer Affairs, 33, 76–91. doi:10.1111/j.1745-6606.1999.tb00761.x

Ida, T., & Sakahira, K. (2008). Broadband migration and lock-in effects: Mixed logit model analysis of Japan's high-speed Internet access services. Telecommunications Policy, 32, 615–625. doi:10.1016/j.telpol.2008.07.009

International Telecommunications Union. (2016). White paper: On broadband regulation and policy in Asia-Pacific region. Retrieved from http://www.itu.int/en/ITU-D/Regional-Presence/AsiaPacific/Documents/Events/2016/APAC-BB-2016/Final_White_Paper_APAC-BB.pdf

Jeffcoat, C., Davis, A. F., & Hu, W. (2012). Willingness to pay for broadband access by Kentucky farmers. Journal of Agricultural & Applied Economics, 44, 323–334.

Knoben, W. (2014). The rise of OTT players—What is the appropriate response? Retrieved from https://www.itu.int/en/ITU-D/Regional-Presence/AsiaPacific/Documents/Events/2014/ITU-ASEAN%20Forum%20-%20Indonesia%20-%20Dec%202014/ITU%20Conference%20Jakarta%20-%20Day%202%20-%20Dr.%20Werner%20Knoben%20-%20Detecon.pdf

Kwak, N., Skoric, M. M., Williams, A. E., & Poor, N. D. (2004). To broadband or not to broadband: The relationship between high-speed internet and knowledge and participation. Journal of Broadcasting & Electronic Media, 48, 421–445. doi:10.1207/s15506878jobem4803_5

Lin, H.-C., Wang, C.-S., Chen, J. C., & Wu, B. (2016). New statistical analysis in marketing research with fuzzy data. Journal of Business Research, 69, 2176–2181. doi:10.1016/j.jbusres.2015.12.026

Madden, G., & Simpson, M. (1997). Residential broadband subscription demand: An econometric analysis of Australian choice experiment data. Applied Economics, 29, 1073–1078. doi:10.1080/000368497326462

Malisuwan, S., & Kaewphanuekrungsi, W. (2016). Analysis of roadmaps and trends for mobile communication technology in Thailand. Technology, 7, 68–79.

Malisuwan, S., Tiamnara, N., & Kaewphanuekrungsi, W. (2016). Thailand's vision in ICT development: Analysis and recommendations for the telecommunications regulatory policy 2015–2020. International Journal of Science Technology & Management, 5, 49–57.

Martínez-Espiñeira, R. (2006). A Box-Cox double-hurdle model of wildlife valuation: The citizen's perspective. Ecological Economics, 58, 192–208. doi:10.1016/j.ecolecon.2005.07.006

Mnakri, M. (2015). “Over-the-top” services: Enablers of growth & impacts on economies. Retrieved from https://www.itu.int/en/ITU-D/Regional-Presence/ ArabStates/Documents/events/2015/EFF/Pres/OTT-%20Enablers%20for%20 Growth%20%20Impacts%20on%20Economies%20m%20mnakri%20Nov%202015. pdf

Nakamura, A. (2013). Retaining telecommunication services when universal service is defined by functionality: Japanese consumers' willingness-to-pay. Telecommunications Policy, 37, 662–672. doi:10.1016/j.telpol.2012.12.008

Nambiar, R. (2017). Streaming service Viu announces launch in Thailand, furthering global expansion. Retrieved from https://www.cnbc.com/2017/05/11/streaming-service-viu-announces-launch-in-thailand-furthering-global-expansion.html

National Broadcasting and Telecommunications Commission. (2015). Annual report 2015. Bangkok, Thailand: Author.

National Broadcasting and Telecommunications Commission. (2016). Market analysis on Thai fixed broadband market. Retrieved from http://www.nbtc.go.th/

Parks Associates. (2015). The connected consumer: Top trends in IoT—A Parks Associates whitepaper. Retrieved from http://www.parksassociates.com/bento/shop/whitepapers/files/ParksAssoc-ConnectedConsumer-TopTrends-in-IoT-2015.pdf

Pornwasin, A. (2017). OTT regulatory model expected by September. Retrieved from http://www.nationmultimedia.com/news/business/EconomyAndTourism/30313713

Rappoport, P., Kridel, D. J., Taylor, L. D., Alleman, J., & Duffy-Deno, K. T. (2003). Residential demand for access to the Internet. In G. Madden (Ed.), Emerging telecommunications networks: The international handbook of telecommunications economics, Vol. II (pp. 55–72). Northampton, MA: E. Elgar.

Rixen, M., & Weigand, J. (2014). Agent-based simulation of policy induced diffusion of smart meters. Technological Forecasting and Social Change, 85, 153–167. doi:10.1016/j.techfore.2013.08.011

Rosston, G. L., Savage, S. J., & Waldman, D. (2010). Household demand for broadband Internet in 2010. The B. E. Journal of Economic Analysis & Policy, 10, 1–45. doi:10.2202/1935-1682.2541

Savage, S. J., & Waldman, D. (2005). Broadband Internet access, awareness, and use: Analysis of United States household data. Telecommunications Policy, 29, 615–633. doi:10.1016/j.telpol.2005.06.001

Seixas, P. (2015). Drivers of OTT and implications for telecommunications operators. Retrieved from https://www.itu.int/en/ITU-D/Regional-Presence/AsiaPacific/Documents/Events/2015/Dec-OTT/Presentations/Phnom%20Penh%20%20Session%202%20-%20OTT%20Drivers%20Final%20PS.pdf

Sim, J. J., Tan, G. W.-H., Ooi, K. B., & Lee, V. H. (2011). Exploring the individual characteristics on the adoption of broadband: An empirical analysis. International Journal of Network and Mobile Technologies, 2, 1–14.

Srinuan, C., & Bohlin, E. (2013). Analysis of fixed broadband access and use in Thailand: Drivers and barriers. Telecommunications Policy, 37, 615–625. doi:10.1016/j.telpol.2013.03.006

Srirasa, O. (2017). Issues and challenges in digital broadcasting transition/deployment. Retrieved from https://www.itu.int/en/ITU-D/Regional-Presence/AsiaPacific/Documents/Events/2017/June-AMS2017/S3-Orasri_Srirasa.pdf

Sunada, M., Noguchi, M., Ohashi, H., & Okada, Y. (2011). Coverage area expansion, customer switching, and household profiles in the Japanese broadband access market. Information Economics and Policy, 23, 12–23. doi:10.1016/j.infoecopol.2010.02.004

Time Consulting. (2017). Procedures for competition regulation for OTT TV. Retrieved from http://casbaa.com/wordpress/wp-content/uploads/2017/03/English-Translation-of-Summary-and-TOC-of-Time-Consulting-Report-on-OTT.docx

Yozzo. (2015). Thailand's telecom market: End of 2015. Retrieved from https://www.slideshare.net/yozzo1/thailands-telecom-market-end-of-2015

International Journal of Intelligent Technologies and Applied Statistics

Vol.11, No.2 (2018) pp.143-153, DOI:10.6148/IJITAS.201806_11(2).0006 © Airiti Press

Optimal Initial Values in Maximum Likelihood Estimation of Logistic Regression Models

Seksiri Niwattisaiwong1 and Komsan Suriya2,3,*

1Faculty of Economics, Ramkhamhaeng University, Bangkok, Thailand 2Center of Excellence in Digital Socio-economy, Chiang M ai University, Chiang M ai, Thailand 3Faculty of Economics, Chiang M ai University, Chiang M ai, Thailand

ABSTRACT This study investigates an optimization for the initial values of parameters in maximum likelihood estimation of logistic regression models, questioning whether the default of initial values set at zero in statistical software packages is the best setting. By employing quadratic interpolation, a search for other initial values is performed to find some points that may yield higher log-likelihood values. Some alternative initial values that maximize the log-likelihood are discovered through the series of experiments using the Newton-Raphson algorithm for the maximum likelihood analysis of logistic regression functions without constant terms. However, the estimated parameters of independent variables are close to those estimated by statistical software. This means zero is not the optimal initial value of parameters in maximum likelihood estimation of logistic regression. Indeed, many statistical software packages, though having initial values defaulted at zero, have come with an improved inner process for the estimation algorithm; they are able to deliver the parameters for independent variables that yield accurate maximum log-likelihood values. Given the relatively reliable results, the use of statistical software in maximum likelihood estimation of logistic regression remains relevant.

K eywords: Logistic regression model; Maximum likelihood method; Optimal initial values; Parameter estimation; Optimization

1. Background and rationale

Regarding the method of maximum likelihood (ML), the selection of initial parameter values can be problematic, as different initial values may lead to different estimates. Statistical software packages generally default the initial values of all parameters to zero; these programs therefore tend to provide similar estimates. With the Newton-Raphson iterative method, however, each initial value is set alongside two other close values in order to calculate the slope, i.e. the first derivative, of the log-likelihood function; the slope of the slope, i.e. the second derivative, is also required in the Newton-Raphson equation. Having these two additional values thus influences the result of ML estimation. This raises the question of whether zero is the optimal initial value for achieving the maximum log-likelihood, or whether other methods could arrive at higher log-likelihood values. The results of this study contribute to the optimization of initial values in logistic regression models estimated by ML. They also indicate whether the estimates produced by distributed statistical software are reliable, which matters because such estimates are often compared with parameter estimates obtained by methods other than logistic regression.
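To make the role of the two close values concrete, the following minimal sketch (not taken from the paper) shows how the first and second derivatives of a log-likelihood can be recovered from three nearby evaluations by central differences; the step size h is an illustrative choice.

```python
# Minimal sketch (not from the paper): first and second derivatives of a
# function f at x, recovered from three nearby evaluations, as one possible
# reading of the "two other close values" described above.
def numerical_derivatives(f, x, h=0.025):  # h = 0.025 mirrors the spacing later used for X_1 (an assumption)
    f_minus, f_centre, f_plus = f(x - h), f(x), f(x + h)
    first = (f_plus - f_minus) / (2.0 * h)                 # slope (first derivative)
    second = (f_plus - 2.0 * f_centre + f_minus) / h ** 2  # slope of the slope (second derivative)
    return first, second
```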

2. Literature review

According to Cramer (2002), logistic regression was developed by Alphonse Quetelet, the Belgian astronomer turned statistician, and his student Pierre-François Verhulst, who named the regression function logistic. Quetelet and Verhulst observed that pure exponential growth would lead to impossible values, as it would evolve towards infinity. They therefore introduced a function that allows exponential growth but includes an extra term so that the function converges towards a finite value.

ML estimation of logistic regression models has been shown to outperform other methods, such as the discriminant function (DF) and weighted least squares (WLS). Panichakarn (1996) calculated the root mean square errors (RMSE) of the three models using ungrouped data with dichotomous dependent variables, 0 and 1, where the dependent variables had different proportions of successes and the independent variables followed normal, exponential, and Weibull distributions. The RMSE of ML was lower than that of the other methods in almost every case; ML estimation therefore yielded better estimates than the other methods.

Even so, there have been several attempts to improve parameter estimation under the ML method. Burutnareerat (2002) compared asymptotic ML with the Monte Carlo method, using the average mean square error (AMSE) as the criterion, and concluded that the AMSE of asymptotic ML was smaller than that of the Monte Carlo method in all cases; asymptotic ML was thus the better estimation method. ML has also been found to remain the preferred method for parameter estimation in many cases when the logistic regression takes other forms, such as binomial logistic regression. In Jongketkorn's (2003) study, ML, the weighting method (WE), and prior correction (PC) were compared using the average Mahalanobis distance (AMD) as the criterion. With the average probability of success in the population at 0.1 and 0.3, the ML method yielded the lowest AMD, followed by PC and WE, respectively. However, when the average probability of success in the population was 0.5 or 0.8, ML yielded the highest AMD and PC the lowest, making the latter the most efficient parameter estimation method in those cases.

Parameter estimation by ML in logistic regression can also be inefficient in the presence of outliers, and weighted ML methods have been developed as a solution. Ketsuk (2004) compared the AMSE of parameters estimated by the weighted ML method of Croux and Haesbroeck (WMLCH) with those estimated by Rousseeuw and Christmann's method (WMLRC). The level of outliers, the proportion of contamination, and the sample size were the factors that affected parameter estimation by all three methods: the higher the outlier level and contamination, the larger the AMSE, while the AMSE decreased as the sample size grew. When the independent variables contained no outliers, ML yielded the lowest AMSE; with outlier-contaminated variables, WMLRC provided the lowest AMSE, followed by WMLCH and ML, respectively.

There has been no prior research on the optimization of initial values for parameter estimation in logistic regression by ML. However, according to Cook et al. (2001), statistical computing packages may provide different estimates due to differences in adjustments, initial values, round off, and algorithms. Initial values, being one of these four causes, thus gain importance as a source of differing ML estimates. One may also be skeptical about the accuracy of parameter estimates from ML functions assigned different initial values. This is especially problematic when the results are compared with those from other models, for example when deciding whether logistic regression or the Bass model is more compatible with S-curve data, as studied by Kanjanatarakul and Suriya (2012, 2013). In addition, parameter values estimated through ML must be reliable when used to identify the appropriate data pre-processing method, as in the study by Kanjanatarakul, Suriya, Hsiao, and Gourieroux (2014), which compared cumulative observations and rolling windows based on prediction accuracy.

3. Methodology

The process of this study was divided into five steps.

First, ML estimation using the Newton-Raphson algorithm was set up for a logistic regression model with the initial values of all parameters set at zero, the same default used by statistical software packages. The model had only two parameters, those of the independent variables X_1 and X_2; the constant term was excluded.

Second, a search for β, the appropriate initial value of the parameter for the first independent variable (X_1), was performed using quadratic interpolation, while γ, the initial value of the parameter for the second independent variable (X_2), remained zero. For the quadratic interpolation, a set of values close to zero was randomized and the corresponding log-likelihood values were calculated. Normally there is a range of values within which the log-likelihood can be computed; outside this range the computation diverges. From this process, three values close together were selected, ideally spanning both an increase and a decrease in the log-likelihood, since this bracket would contain the maximum. If no such bracket was present, the three values along which the log-likelihood kept increasing were selected, as the maximizing range would lie slightly beyond the last of them. Quadratic interpolation through the first three initial values produced a candidate value, whose log-likelihood was calculated, recorded, and compared with the original log-likelihood. Three initial values were then selected again, and the process was iterated until the value that maximized the log-likelihood was identified and recorded as the most appropriate initial value for the parameter of X_1.

Third, a search for γ, the appropriate initial value of the remaining parameter, was performed by repeating the quadratic-interpolation process used to identify β, with β fixed at the previously recorded value that maximized the log-likelihood.

Fourth, the maximum log-likelihood value obtained from the experiments was compared with the value calculated from the all-zero setting. A higher value from the experiments would mean that an initial value more optimal than zero exists; otherwise, the zero setting would be the more appropriate choice.

Fifth, the parameter values β and γ estimated from the most optimal initial values in the experiments were compared with the values estimated by statistical software. The package used in this study was PSPP, an open-source program distributed under the GNU General Public License.

The relevant functions were modeled as suggested by Greene (2003) and were based on the technical settings listed in Table 1. The logistic regression model was written

$$\Pr(y_i) = \frac{e^{\beta X_1 + \gamma X_2}}{1 + e^{\beta X_1 + \gamma X_2}}. \qquad (1)$$

The log-likelihood function was written

$$L = \sum_{i=1}^{n} \left[\, y_i \ln\!\left(\frac{e^{\beta X_1 + \gamma X_2}}{1 + e^{\beta X_1 + \gamma X_2}}\right) + (1 - y_i)\,\ln\!\left(\frac{1}{1 + e^{\beta X_1 + \gamma X_2}}\right) \right]. \qquad (2)$$
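As an illustration of Eqs. (1) and (2), a minimal Python sketch of the two-parameter, no-constant model might look as follows; x1, x2, and y stand for the reader's own data arrays, since the study's simulated sample is not published.

```python
import numpy as np

def logistic_prob(beta, gamma, x1, x2):
    """Pr(y_i = 1) of Eq. (1) for the no-constant model."""
    z = beta * x1 + gamma * x2
    return np.exp(z) / (1.0 + np.exp(z))

def log_likelihood(beta, gamma, x1, x2, y):
    """Log-likelihood of Eq. (2), summed over all observations."""
    p = logistic_prob(beta, gamma, x1, x2)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```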

To find the maximum log-likelihood L, the Newton-Raphson method subtracts from the current parameter value the quotient of the first and second derivatives of the log-likelihood function with respect to that parameter:

$$\beta_{t+1} = \beta_t - \frac{dL/d\beta}{d^2L/d\beta^2}, \qquad (3)$$

$$\gamma_{t+1} = \gamma_t - \frac{dL/d\gamma}{d^2L/d\gamma^2}. \qquad (4)$$

The search by quadratic interpolation approximated the log-likelihood locally as L = aβ² + bβ + c. Cramer's rule was adopted to find a, b, and c, and the maximizer of the quadratic was identified through the formula β* = -b/(2a). The system of equations for the three interpolation points could be expressed as

$$\begin{bmatrix} \beta_1^2 & \beta_1 & 1 \\ \beta_2^2 & \beta_2 & 1 \\ \beta_3^2 & \beta_3 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} L_1 \\ L_2 \\ L_3 \end{bmatrix}. \qquad (5)$$
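Eq. (5) and the vertex formula β* = -b/(2a) can be sketched directly; the function below is an illustrative implementation, not the authors' code.

```python
import numpy as np

def quadratic_interpolation_step(betas, log_likelihoods):
    """Fit L = a*beta^2 + b*beta + c through three (beta, L) points, Eq. (5),
    and return the vertex beta* = -b / (2a), the candidate maximizer."""
    b1, b2, b3 = betas
    A = np.array([[b1 ** 2, b1, 1.0],
                  [b2 ** 2, b2, 1.0],
                  [b3 ** 2, b3, 1.0]])
    d = np.asarray(log_likelihoods, dtype=float)
    # Cramer's rule, as in the paper (np.linalg.solve would give the same a and b).
    det_A = np.linalg.det(A)
    a = np.linalg.det(np.column_stack([d, A[:, 1], A[:, 2]])) / det_A
    b = np.linalg.det(np.column_stack([A[:, 0], d, A[:, 2]])) / det_A
    return -b / (2.0 * a)
```

For example, quadratic_interpolation_step([0.1, 0.2, 0.3], [-322.07, -300.37, -282.38]) returns about 0.7349, consistent with the first interpolated value of 0.73491 reported in Table 4 below.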

4. Results

The study arrived at the following results:

Table 1. Technical settings of ML estimation.

No.  List of technical settings                                               Set value
1    Maximum number of iterations                                             20
2    Distance between initial value of independent variable X_1 and close value   ±0.025
3    Distance between initial value of independent variable X_2 and close value   ±0.020
4    Constant                                                                 None
5    Number of samples                                                        500
6    Number of samples where dependent variable Y equals 1                    250
7    Number of samples where dependent variable Y equals 0                    250
8    Software package used for calculation                                    Octave 4.2.1


4.1 Results from the experiment with the initial values set at zero

The parameter estimates of a logistic regression model with the initial values of all parameters set at zero are presented in Table 2. With zero initial values for all parameters, the ML method using the Newton-Raphson algorithm provided the following results: the final value β* of the parameter for X_1 was 0.013238 and the final value γ* of the parameter for X_2 was 0.010590. The log-likelihood value was -347.66.
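The zero-start run reported above can be sketched as follows. This is an illustrative reconstruction, not the authors' Octave code: the 500-observation sample is not published, so x1, x2, and y are placeholders, the one-parameter-at-a-time update order is an assumption, and the step sizes and 20-iteration cap follow Table 1.

```python
import numpy as np

def newton_raphson_logit(x1, x2, y, beta0=0.0, gamma0=0.0,
                         h_beta=0.025, h_gamma=0.020, max_iter=20):
    """Sketch of the Newton-Raphson updates of Eqs. (3)-(4), one parameter at a
    time, with derivatives taken numerically from the two close values (step
    sizes and iteration cap as in Table 1). Not the authors' Octave code."""
    def ll(b, g):
        z = b * x1 + g * x2
        # Algebraically identical to Eq. (2): y*ln(p) + (1-y)*ln(1-p) = y*z - ln(1 + e^z)
        return np.sum(y * z - np.log1p(np.exp(z)))

    beta, gamma = beta0, gamma0
    for _ in range(max_iter):
        d1 = (ll(beta + h_beta, gamma) - ll(beta - h_beta, gamma)) / (2 * h_beta)
        d2 = (ll(beta + h_beta, gamma) - 2 * ll(beta, gamma)
              + ll(beta - h_beta, gamma)) / h_beta ** 2
        beta = beta - d1 / d2                                   # Eq. (3)
        d1 = (ll(beta, gamma + h_gamma) - ll(beta, gamma - h_gamma)) / (2 * h_gamma)
        d2 = (ll(beta, gamma + h_gamma) - 2 * ll(beta, gamma)
              + ll(beta, gamma - h_gamma)) / h_gamma ** 2
        gamma = gamma - d1 / d2                                 # Eq. (4)
    return beta, gamma, ll(beta, gamma)
```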

4.2 The search for the parameter value of X_1 using quadratic interpolation with the parameter value of X_2 fixed at zero

In the randomization process to identify values near the initial value of the parameter for X_1 for use in the log-likelihood calculations, it was found that only values in the range of -0.3 to 0.3 resulted in convergence. Within this range the log-likelihood values increased continuously without reaching a maximum, as presented in Table 3. Consequently, the three initial values maximizing the log-likelihood, 0.1, 0.2, and 0.3, were selected for quadratic interpolation. The first quadratic interpolation proposed an initial value of 0.73491, which lowered the log-likelihood value; since the log-likelihood could still be improved, the search continued. The approximate initial value of 0.30700 was eventually found to give the maximum log-likelihood value of -281.25, as shown in Table 4.
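Combining the two earlier sketches, the search just described might be organized as follows; the candidate grid, the rule for choosing the three interpolation points, and the number of refinements are simplifications of the procedure in the Methodology, not the authors' exact schedule.

```python
def search_initial_beta(x1, x2, y,
                        candidates=(-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3),
                        n_refinements=5):
    """Sketch of the search for the initial value of beta with gamma fixed at 0:
    evaluate the final log-likelihood for each candidate start, then repeatedly
    propose a new start by quadratic interpolation through the best three.
    Reuses newton_raphson_logit and quadratic_interpolation_step defined above."""
    evaluated = {b0: newton_raphson_logit(x1, x2, y, beta0=b0, gamma0=0.0)[2]
                 for b0 in candidates}
    for _ in range(n_refinements):
        best_three = sorted(sorted(evaluated, key=evaluated.get, reverse=True)[:3])
        new_start = quadratic_interpolation_step(
            best_three, [evaluated[b] for b in best_three])
        evaluated[new_start] = newton_raphson_logit(
            x1, x2, y, beta0=new_start, gamma0=0.0)[2]
    return max(evaluated, key=evaluated.get)
```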

4.3 The search for the parameter value of X_2 using quadratic interpolation with the parameter value of X_1 fixed at the value maximizing the log-likelihood

In this step, the initial value of the parameter for X_1 was fixed at 0.30700, the best result from the previous calculation. A search for the parameter value of X_2 was then performed using the same process: values close to zero were randomized and the corresponding log-likelihood values calculated. It should be noted that only initial values below zero were calculable, and that a maximum existed near -0.60, as presented in Table 5. Accordingly, it was presumable that the most appropriate initial and final values of the parameter for X_2 would be negative.

Table 2. Results of logistic regression with initial values set at zero.

Initial value of parameter for X_1 (β_o): 0
Initial value of parameter for X_2 (γ_o): 0
Final value of parameter for X_1 (β*): 0.013238
Final value of parameter for X_2 (γ*): 0.010590
Log-likelihood value: -347.66


Table 3. Results from experiments with randomized initial values of the parameter for X_1.

Initial β_o (X_1)   Initial γ_o (X_2)   Final β* (X_1)   Final γ* (X_2)   Log-likelihood
-0.3                0                   -0.120350         0.143720        -447.85
-0.2                0                   -0.075857         0.099314        -410.69
-0.1                0                   -0.031323         0.054942        -377.23
 0.0                0                    0.013238         0.010590        -347.66
 0.1                0                    0.057809        -0.033753        -322.07
 0.2                0                    0.102380        -0.078100        -300.37
 0.3                0                    0.146920        -0.122460        -282.38

Note: The bold-faced figures are the best initial values. Initial values less than or equal to -0.4, or greater than or equal to 0.4, were inapplicable because the estimation ended in divergence after 20 rounds.

Table 4. Results of quadratic interpolation with initial values of the parameter for X_1 set at 0.1, 0.2, and 0.3.

Initial β_o (X_1)   Initial γ_o (X_2)   Final β* (X_1)   Final γ* (X_2)   Log-likelihood
0.10000             0                     0.05780         -0.03375         -322.07
0.20000             0                     0.10238         -0.07810         -300.37
0.25928             0                     0.12879         -0.10440         -289.28
0.28742             0                     0.14132         -0.11688         -284.45
0.30000             0                     0.14692         -0.12246         -282.38
0.30083             0                     0.14729         -0.12283         -282.25
0.30700             0                     0.15004         -0.12557         -281.25
0.73491*            0                   122.58800         97.48300       -2,460.10

Note: The bold-faced figures are the best initial values. *Estimation with initial values greater than or equal to 0.4 could not complete 20 rounds; only the applicable final values are reported.

By identifying the range in which the middle value was closest to the maximum, the three initial values of -0.7, -0.6, and -0.5 were selected as the values maximizing the log-likelihood. The results of quadratic interpolation suggested that the optimal initial value lay between -0.6 and -0.5. That value was finally found to be -0.59567, which led to the maximum log-likelihood value of -232.282360, as presented in Table 6.


Table 5. Results of the search for the parameter value of X_2 with the initial value of the parameter for X_1 fixed at the value maximizing the log-likelihood.

Initial β_o (X_1)   Initial γ_o (X_2)   Final β* (X_1)   Final γ* (X_2)   Log-likelihood
0.30700             -1.0                 0.70392          -0.68246        -243.87
0.30700             -0.9                 0.64866          -0.62667        -239.23
0.30700             -0.8                 0.59339          -0.57089        -235.61
0.30700             -0.7                 0.53812          -0.51510        -233.21
0.30700             -0.6                 0.48283          -0.45933        -232.28
0.30700             -0.5                 0.42751          -0.40359        -233.16
0.30700             -0.4                 0.37215          -0.34788        -236.23
0.30700             -0.3                 0.31673          -0.29222        -241.96
0.30700             -0.2                 0.26124          -0.23661        -250.93
0.30700             -0.1                 0.20567          -0.18106        -263.79
0.30700              0.0                 0.15004          -0.12557        -281.25

Note: The bold-faced figures are the best initial values. Initial values greater than or equal to 0.1 were ineligible because the estimation ended in divergence after 20 rounds.

Table 6. Results of quadratic interpolation with initial values of the parameter for X_2 set at -0.7, -0.6, and -0.5.

Initial β_o (X_1)   Initial γ_o (X_2)   Final β* (X_1)   Final γ* (X_2)   Log-likelihood
0.30700             -0.70000             0.53812          -0.51510        -233.210000
0.30700             -0.60000             0.48283          -0.45933        -232.284178
0.30700             -0.59862             0.48207          -0.45856        -232.283452
0.30700             -0.59652             0.48091          -0.45739        -232.282688
0.30700             -0.59567             0.48044          -0.45692        -232.282360
0.30700             -0.59415             0.47960          -0.45607        -232.282703
0.30700             -0.50000             0.42751          -0.40359        -233.160000

Note: The bold-faced figures are the best initial values.


4.4 Comparing the log-likelihood estimates ascribed to different initial values

It was found that the initial values of 0.30700 for X_1 and -0.59567 for X_2, both obtained from the experiments, generated a log-likelihood value of -232.282360, higher than the value of -347.660000 calculated from initial values of zero. This represents an increase of 33.18% over the original log-likelihood, as illustrated in Table 7.

4.5 Comparing the log-likelihood value from the experiments and that from statistical software

In the final step of this research, the results calculated from the final values of the parameters for X_1 and X_2 were compared with the corresponding estimates and log-likelihood value from PSPP. The parameter values from both sources were close to each other: the values for X_1 were both positive, while the values for X_2 were both negative. However, the log-likelihood value from the experiments was lower than that from the statistical software package by 0.29%, as presented in Table 8.

Table 7. Comparison of results with initial values of all parameters set at zero and at the values found by the experiments.

Initial β_o (X_1)   Initial γ_o (X_2)   Final β* (X_1)   Final γ* (X_2)   Log-likelihood   Change of log-likelihood value
0.00000              0.00000             0.013238          0.010590       -347.660000
0.30700             -0.59567             0.480440         -0.456920       -232.282360      33.18% increase

Table 8. Comparison between experiment and PSPP results.

                                Final β* (X_1)   Final γ* (X_2)   Log-likelihood   Change of log-likelihood value
Result from the experiments^a    0.48044          -0.45692        -232.282360
Result from PSPP^b               0.47000          -0.47000        -231.603703      0.29% increase

Note: ^a Calculated with Octave; ^b calculated with PSPP.
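As a further check on software dependence, one could repeat the comparison with a package other than PSPP or Octave. The sketch below uses statsmodels purely as an illustration (it was not used in the study), fitting the no-constant model with the library's default starting values and with the starting values found in the experiments; the data here are synthetic placeholders, since the study's sample is not published.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic placeholder data (the study's 500-observation sample is not published).
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.normal(size=500)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2)))).astype(float)

X = np.column_stack([x1, x2])                               # no constant term, as in the paper
model = sm.Logit(y, X)
fit_default = model.fit()                                   # package default starting values
fit_custom = model.fit(start_params=[0.30700, -0.59567])    # starts found in Sections 4.2-4.3
print(fit_default.params, fit_default.llf)
print(fit_custom.params, fit_custom.llf)
```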


5. Conclusion

The study shows that setting the initial values of the parameters of all independent variables at zero does not automatically maximize the log-likelihood in ML estimation of logistic regression using the Newton-Raphson algorithm. A calculation through quadratic interpolation can therefore be employed to optimize the initial values, finding an initial value for the parameter of each variable in turn until the result that yields the maximum log-likelihood is reached. However, although statistical computing packages default the initial values of all parameters to zero, their estimation algorithms include other adjustments that also lead to accurate maximum log-likelihood calculations. The series of experiments therefore indicates that parameter estimation by the available software packages remains relatively reliable.

Acknowledgement

Our deepest appreciation goes to Octave developer John W. Eaton and to the Free Software Foundation, developer of PSPP. Both software packages are offered free of charge under the GNU General Public License, which saved us a tremendous amount of time and cost in this research.

References

Burutnareerat, N. (2002). การประมาณค่าพารามิเตอร์ของตัวแบบถดถอยโลจิสติคด้วยวิธีความควรจะเป็นสูงสุดแบบอะซิมโตติคและวิธีมอนติคาร์โล [Parameter estimation for logistic regression model with asymptotic maximum likelihood and Monte Carlo methods] (Unpublished master's thesis). Department of Statistics, Chulalongkorn University, Bangkok, Thailand.

Cook, D., Dixon, P., Duckworth, W. M., Kaiser, M. S., Koehler, K., ... Stephenson, W. R. (2001). Chapter 3: Binary response and logistic regression analysis. In D. Cook et al. (Eds.), Beyond traditional statistical methods. Ames, IA: Iowa State University.

Cramer, J. S. (2002). The origins of logistic regression (Tinbergen Institute Discussion Paper No. TI 2002-119/4). Retrieved from https://papers.tinbergen.nl/02119.pdf

Greene, W. H. (2003). Econometric analysis (5th ed.). New York, NY: Pearson Education.

Jongketkorn, T. (2003). การประมาณค่าพารามิเตอร์ของตัวแบบถดถอยโลจิสติคทวินาม [Parameter estimation methods of binomial logistic regression model] (Unpublished master's thesis). Department of Statistics, Chulalongkorn University, Bangkok, Thailand.

Kanjanatarakul, O., & Suriya, K. (2012). Comparison of sales forecasting models for an innovative agro-industrial product: Bass model versus logistic function. The Empirical Econometrics and Quantitative Economics Letters, 1, 89–106.

Kanjanatarakul, O., & Suriya, K. (2013, July). Forecasting the sales of an innovative agro-industrial product with limited information: A case of feta cheese from buffalo milk in Thailand. Paper presented at the EcoMod 2013 Conference, Prague, Czech Republic.

Kanjanatarakul, O., Suriya, K., Hsiao, C., & Gourieroux, C. (2014). Sales forecasting with limited information: Comparison between cumulative observations and rolling windows. The Empirical Econometrics and Quantitative Economics Letters, 3(1), 11–24.

Ketsuk, O. (2004). การประมาณค่าพารามิเตอร์ในตัวแบบการถดถอยโลจิสติค เมื่อมีค่าผิดปกติ [Estimation of parameters in logistic regression model having outliers] (Unpublished master's thesis). Department of Statistics, Chulalongkorn University, Bangkok, Thailand.

Panichakarn, K. (1996). การประมาณค่าพารามิเตอร์ในสมการถดถอยโลจิสติคด้วยภาวะน่าจะเป็นสูงสุดและฟังก์ชันจำแนกประเภท [Estimation of parameters in logistic regression by maximum likelihood and discriminant function] (Unpublished master's thesis). Department of Statistics, Chulalongkorn University, Bangkok, Thailand.
