Pacific Rim Objective Measurement Symposium 2017 5 - 9 August, Kota Kinabalu

Pre-Conference Workshop Schedule (Venue: Faculty of Psychology and Education, UMS)

Day 1 - August 5th
8:00 Bus will pick up workshop participants from Klagan Regency
08:30 - 09:00 Registration / Arrival of guests
09:00 - 10:30 Introduction to Rasch Measurement Model by Prof. Bond & Dr. Zali
10:30 - 10:45 Break
10:45 - 12:30 Workshop continues
12:30 - 1:30 Lunch
1:30 - 3:30 Workshop continues
3:30 - 3:45 Break
3:45 - 5:30 Workshop continues
5:30 Bus will send workshop participants back to Klagan Regency

Day 2 - August 6th
8:00 Bus will pick up workshop participants from Klagan Regency
08:30 - 09:00 Arrival of participants
09:00 - 10:30 Two parallel workshops:
  Introduction to Rasch Measurement Model - 2nd Day, by Prof. Bond & Dr. Zali
  Introduction to SEM and Rasch Measures, by Dr. Juliet Ling Mei Teng & Nor Irvoni Mohd Ishar
10:30 - 10:45 Break
10:45 - 12:30 Workshops continue
12:30 - 1:30 Lunch
1:30 - 3:30 Workshops continue
3:30 - 3:45 Break
3:45 - 5:30 Workshops continue
5:30 Bus will send workshop participants back to Klagan Regency

Conference Schedule Day 1 - August 7th

8:00 Bus will pick up participants from Klagan Regency
08:30 - 09:30 Registration / Arrival of guests
09:30 - 11:00 Opening Ceremony
  Welcome Speech by Host Committee Chairman
  Welcome Speech by PROMS Chairman
  Conference housekeeping
  Opening show
11:00 - 11:30 Break
11:30 - 12:00 Keynote Speech - Prof. Margaret Wu
12:00 - 12:30 Group Photo
12:30 - 1:30 Lunch

1:30 - 3:00 Parallel Session 1

Track 1 (Education) - Chairperson: Prof. Stenner
KK_001 On the measure of attitude towards science: A scientific model - Liu Huang
KK_004 Mathematics Item Quality: An Illustrative Example using Rasch Measurement Model - Ling Mei-Teng
KK_008 Multidimensional computerized adaptive testing for toddlers: a developmental screening tool - Ying-Hsien Chien
KK_010 Development and validation of the students’ ability questionnaire on science process skills - Ellyza Karim

Track 2 (Scale Development) - Chairperson: Dr. Haniza Yon
KK_009 Examining the Measurement Properties of Students' Perceptions of Assessment Scale - Wilham M. Hailaya
KK_013 Rasch-derived Measure for Assessing Student Competency in University Introductory Computer Programming (CS1) - Leela Waheed
KK_019 Rasch Analysis of the Malaysian Teachers’ Responses to the Organizational Commitment Scale - Ahmad Zamri bin Khairani
KK_022 The use of MCQ as Formative Assessment to Reflect Attainment of Desired Learning Outcome - Ximei Zhou

Track 3 (others) - Chairperson: Prof. Vincent Pang
KK_002 Internet Banking Service Quality Measurement: A Scale Development for Malaysian Banks - Mahgoub Elradi Ahmed siddig
KK_003 Face Validity Test on Validity and Reliability of ICT Procurement Officer Competency Measurement Instrument by Using Rasch Model - Azran Ahmad
KK_018 Reliability Testing Instruments for Computer Programming Learning: Applying the Rasch model - Azliza Yacob
KK_006 Application of Multi-dimensional Computerized Adaptive Test on Clinical Dementia Rating Scale using Computer-aided Technique - Ting-En Hui

3:00 - 3:30 Break

3:30 - 5:30 Parallel Session 2

Track 1 (Education) - Chairperson: Prof. Stenner
KK_014 Disciplinary Biases of a Student Evaluation of Teaching Survey in Higher Education - Billy Wai Kei Chan
KK_033 A Rasch Analysis of the Reading, Grammar, and Essay Sections of a Japanese University Entrance Examination - Kristy King Takagi
KK_038 Improving Teaching and Student Learning through Evaluation of one TOEIC Preparation Textbook - YihYeh Pan
KK_066 Develop, deploy, determine: Surveying assessment for learning in the Singapore secondary school context - Christopher C. Deneen

Track 2 (Scale Development) - Chairperson: Prof. Rob Cavanagh
KK_029 Using Rasch Model for the Development of Intention to Stay Scale (ITSS) among Medical Academics at Public Universities - Wan Ismahanini Ismail
KK_030 Development of a Model of Positive L2 Self using the Rasch Model - J. Lake
KK_031 Developing a Vocabulary Specification Equation for Second Language Learners - J. Lake
KK_032 Measuring the Validity and Reliability of Arabic Vocabulary Knowledge Test Using Rasch Model Approach - Zunita Mohamad Maskor

Track 3 (others) - Chairperson: Prof Yanzi
KK_011 We have equal intervals; now we need invariance: The next important step in Rasch measurement - Prof Trevor G Bond
KK_017 Psychometric Features of Psychosocial Safety Climate (PSC-12) - Rosnah Ismail
KK_024 Validation of Medical Statistics Exam Paper: Conventional method versus RASCH. - Azmi Mohd Tamil

6:00 - 10:00 PROMS Dinner @ Kampung Nelayan

Day 2 - August 8th

8:00 Bus will pick up participants from Klagan Regency
08:30 - 09:00 Arrival of guests
9:00 - 9:40 Keynote Speech - Prof. Cavanagh
9:40 - 10:15 Break

10:15 - 12:15 Parallel Session 3

Track 1 (Education) - Chairperson: Prof Margaret Wu
KK_015 Gender DIF in Mathematics Items among Secondary Students in a Coeducational Learning Environment - S.Kanageswari Suppiah Shanmugam
KK_016 Rasch Analysis Properties of Structural Items and Essay in Chemistry Test - Adeline Leong Suk Yee
KK_020 Development and Validation of The Plagiarism Tendency among Malaysian Post-Graduate Students. - Anis Jauharah Abd. Kadir
KK_021 Predictors of self-assessment intention and practice among primary and secondary students in Hong Kong - Zi Yan

Track 2 (Scale Development) - Chairperson: Dr. Bambang Sumintomo
KK_035 Assessing Pedagogical Content Knowledge of The Particle Theory of matter and Phase Change in Pre-service Science Teacher - Maryati
KK_036 Preliminary Report on the Development and Calibration of a Rasch Scale to Measure Chinese Reading Comprehension Ability in Singaporean 2nd Language Primary School Students, Part II - Chung Tze Min
KK_039 Development of Instruments for Measuring Mathematical Logical Thinking Ability College Students in Kapita Selekta - Novaliyosi
KK_040 Assessing competency level among SIPartners+ using Rasch Model approach - Hishamuddin Hashim

Track 3 (others) - Chairperson: Dr. Haniza Yon
KK_059 The Unreasonable Effectiveness of Theory Based Instrument Calibration in the Natural Sciences: What Can the Behavioral Sciences Learn? - Jackson Stenner
KK_012 Application of Rasch Model in Islamic Moral Value Scale for Islamic Education Teachers (INSPI) in Malaysia - Salbiah bt Mohamed Salleh @ Salleh
KK_025 The Effects of Chronic Daily Fears on Students’ Concept of Self: Towards Identifying Students being Bullied - Rense Lange

12:15 - 1:30 Lunch (PROMS Board Meeting)

1:30 - 3:00 Parallel Session 4

Track 1 (Education) - Chairperson: Prof. Vincent Pang
KK_026 Development of KKM-PPM Performance Instruments to Support Comprehension, Social Skills, and Discipline Students of Sultan Ageng Tirtayasa University - Nurul Anriani
KK_027 Development of a Diagnostic English Grammar Test for Malaysian Lower Secondary School Students - Kho Chung Wei
KK_037 Multidimensional Rasch Analysis of Teaching Role-Specific Esteem - Yu-Shu Chen
KK_064 Misconceptions in electricity via Rasch Analysis - Nazlinda Abdullah

Track 2 (Scale Development) - Chairperson: Prof. Nazlinda
KK_041 Validating Value Domain of the Facilitator Competency Profile Instrument SIPartners+-2 (FCPI- SIPartners+-2) Using Rasch Model Analysis - Raja Hamizah Raja Harun
KK_047 Global mindset: Assessing construct dimensionality - Jeffrey Durand
KK_049 Psychometrics Properties of the Tuckman Procrastination Scale in an Indonesian sample - Ngadiman Djaja

Track 3 (others) - Chairperson: Dr. Haniza Yon
KK_034 Applying Rasch Model To Identify A Contribution of Marital Status in Perceived Social Support of Merapi Volcanic Eruption Mount Survivors - Chandra C. A. Putri
KK_057 Analysing The Effect of Smart Partnership using Rasch – a case of women entrepreneurs in Tanjung Karang - Rohani Mohd
KK_065 Validating the Usability Evaluation’s Instrument of Community Learning Centre Model (UEICLC) for Aboriginal in Tasik Chini, Pahang - Mazzlida Mat Deli
KK_005 Rasch person fit statistics associated with the weighted degree indicators of Social Network Analysis - Tsair-Wei Chien

3:00 - 3:30 Break

3:30 - 5:30 Parallel Session 5

Track 1 (Education) - Chairperson: Prof Margaret Wu
KK_042 Measuring Scientific Literacy: Using the Rasch Model Analysis to Determine Student Competency Using Data from PISA 2015 - Nor Azizi bt Abdullah
KK_043 Validating Knowledge Domain of Facilitator Competency Profile Instrument – SISC+1 (FCPI-SISC+1) Using Rasch Model - Zulkifili Salleh
KK_044 Measuring The Status Of Fasilinus Current Professional Profile Using Rasch Model - Ruzita Ahmad
KK_045 Using Rasch Model to Assess the Foreign Language Speaking Anxiety Scale (FLSAS) among University Students in Salatiga - Rizki Parahita Anandi

Track 2 (Scale Development) - Chairperson: Dr. Jeff Durand
KK_054 Development and validation of a diagnostic pronunciation rating scale: A rating scale and common-item equating analysis - Yuanyue Hao
KK_055 Development of instrument in measuring cottage industry accounting practices. - Susana Narawi
KK_058 Development And Validation Of Malaysian Secondary School - Ma Chi Nan
KK_060 Facilitator Training Needs in Malaysia Schools - Mohd Kashfi Mohd Jailani

Track 3 (others) - Chairperson: Prof. Trevor Bond
KK_062 Modernizing vs Ecologizing Approaches in Measurement - William P. Fisher, Jr.
KK_007 Using social network analysis to report Rasch papers’ keyword development and association across years - Wei-Ru Jyun
KK_023 Predicting Item Difficulty of a Knowing Numbers Test Using the Inverse Partial Credit Model - Ong Yoke Mooi
KK_050 Live Grading of Essay Questions Contributing to Computer Adaptive Testing - Dr. Haniza Yon

5:30 Bus will send participants back to Klagan Regency

Day 3 - August 9th

8:00 Bus will pick up participants from Klagan Regency
08:30 - 09:00 Arrival of guests
9:00 - 9:40 Keynote Speech - Dr. Zali
9:40 - 10:15 Break

10:15 - 12:00 Parallel Session 5

Track 1 (Education) - Chairperson: Dr. Juliet Ling
KK_048 The Effectiveness Of Teacher Training Lessons - Burhanuddin Tola
KK_051 Development of Indonesia Science Literacy Test (ISLT) Instruments to Improve Criteria Validity of National Exam - Rosita Uli Sihombing
KK_061 Performance of Early Mathematics Achievement Test (UPAM) over time: Applying Rasch Measurement Racking - Dr. Connie Cassy Ompok

Track 2 (Scale Development) - Chairperson: Prof. Trevor Bond
KK_052 Rasch Model Application on Developing a Self-regulation Study Instrument for Mathematics Education Students - Wardani Rahayu
KK_053 Measuring Second Language Receptive Knowledge of Collocation Among Graduate Learners in Public Universities Malaysia Using Rasch Analysis - Lily Hanefarezan Asbulah
KK_067 Assessing Pedagogical Content Knowledge of the particle theory of matter and Phase Change in Pre-service Science Teacher - Maryati
KK_070 Exploration of the psychometric properties of Eternal Love Instrument (ELI) and validation of ELI Model: A Rasch Model Approach - Akbariah Mohd Mahdzir

Track 3 (others) - Chairperson: Prof. Rob Cavanagh
KK_056 Comparison of holistic and analytic rating methods of a writing task from the perspective of validity, reliability and practicality - Keita Nakamura
KK_063 Rasch-based Test Equating: An Application of Winsteps in - Wu Jinyu
KK_069 Modelling a Meaningful Hybrid eTraining for Diverse Learners using Rasch and SEM - Rosseni Din

12:00 - 1:00 Lunch

1:00 - 2:00 Symposium on Publishing in Conference Proceedings (Prof. Cavanagh, Prof. Bond & Prof Durand) - Auditorium

2:00 - 2:30 Panel Discussion - Auditorium

2:30 - 3:30 Closing Ceremony
  Closing speech - Host chairman
  Closing speech - PROMS chairman
  Next Year PROMS Committee

4:30 Bus will send participants back to Klagan Regency

Keynote Speakers

Professor Margaret Wu

Professor Margaret Wu was also an associate professor at the University of Melbourne. Margaret's main interests are in the statistical modeling of assessment data and the development of online teaching and learning tools. She has worked as a psychometrician at the Australian Council for Educational Research for more than ten years. She is also a co-author of the item analysis software Conquest, which has been used extensively within Australia and internationally.

Title: Rater effects and IRT models

In this talk, Dr Wu will present a few IRT-based analyses of rater effects, including the estimation of rater severity and rater discrimination. Rater severity refers to the differences between raters in terms of their tendencies to award higher or lower scores. Rater discrimination refers to the extent to which raters use the score range to separate students on the ability scale. The rating scale model, the partial credit model and the generalized partial credit model are used to analyse rater effects. A discussion on the interpretations of some measures of rater effect is provided. It is noted that a rater who shows large discrepancies from other raters may in fact be the best rater.

Professor Robert Frederick Cavanagh

Professor Robert Frederick Cavanagh received his PhD from Curtin University, Western Australia, in 1997. He is a member of numerous professional associations and is currently the Chair of the Board of Management of the Pacific Rim Objective Measurement Society (PROMS). He has been active as a reviewer for peer-refereed conferences and journals since 1999. He is also active in writing book chapters and numerous articles in renowned journals, has supervised PhD candidates since 2000, and has been a PhD thesis examiner at several universities since 2004.

Title: Invariant measurement and metrological networks in amodern measurement

Test score research tradition measurement theories (e.g. Classical Test Theory and True Score Theory) share common assumptions with a positivist philosophical orientation. This commonality renders test-score theories susceptible to critique similar to that levelled at positivism, the anti-positivist critique and post-modernism in general. An amodern theory of measurement needs to provide a constructive response to the anti-positivist critique, to move beyond positivism and the test score research tradition. The four defining characteristics of amodern measurement are: advocating measurement to enable societal and environmental renewal; the philosophical genre of hermeneutical phenomenology; application of scaling research tradition theories; and inclusion of constructs from related disciplines including metrology and network theory. This presentation builds on previous work explicating the first two characteristics of amodern measurement by examining aspects of the second two characteristics. In particular: the consonance between invariant measurement and amodern measurement theory; and the application of network theory and modeling in amodern measurement theory.

Dr Mohd Zali Mohd Nor

Dr Mohd Zali Mohd Nor is an I.T. Manager in a shipping services company. He received his B.Sc. in Mathematics from The University of Michigan, Ann Arbor, MI, USA, in 1988, a Master of Management in I.T. from Universiti Putra Malaysia in 2005, and a PhD in Management Information System in 2012. As Vice-President of myRasch, he is currently active in training and consultation on Rasch analysis and has provided assistance to postgraduate students from various local universities.

Title: Rasch in Malaysia – A Brief History, Challenges, and a Peek into Future

We looked at the progress of Rasch measurement in Malaysia during Pre-2008, 2008-2015 and Post-2015. After 2015, we have not progressed much. The majority of Rasch papers dwelt on verifying the quality of items. We do not use Rasch to its full benefit as a measurement model. We have competent trainers, but we do not have those who really understand the Rasch model and its technicalities well enough to teach advanced levels. A peek into the future: what do we need in order to progress like other countries such as Japan, Singapore, and Australia?

List of Papers

Paper No KK_001
Paper Title On the measure of attitude towards science: A scientific model
Email Address [email protected]
1st Author Liu Huang
Subsequent authors Fan HUANG, Pey-Tee Emily OON

1. Aims/Objectives of study: The present study re-examines the incorporation of the three most commonly used constructs in the measure of students' attitude towards science (SAS) using the Rasch model, which requires invariance and consistent response category functioning.

2. Sample: At least one school from each district in Guangzhou was randomly selected and invited to participate in the study. A total of eight secondary schools from different districts, which made up 10% of all schools in Guangzhou, agreed to participate. Two classes of students from each grade in each participating school completed the questionnaire. A total of 1133 7th to 11th graders who study science completed the questionnaire.

3. Method: The survey data were subjected to Rasch analysis using WINSTEPS software (Version 3.81) to examine whether they fit the scientific model. A principal component analysis of residuals was performed to explore the invariance of the data. Rasch criteria were used to verify the effectiveness of each of the 5-point response categories. Residual loadings plots were scrutinized to examine whether the three constructs were cohesive in measuring SAS.

4. Results: Though all items stayed within the acceptable fit range, the variance explained by the Rasch measures was only 30.0%, and the first three unexplained variances were 10.0, 2.0 and 1.6 in the principal component analysis (PCA) of residuals. The result indicates the existence of a secondary dimension (noise) other than SAS. We further examined which items contributed to the noise by exploring the residual loadings plot. The plot shows that Item A to Item N, which were all negatively worded items, had factor loadings greater than .40. On the other hand, Item a to Item k, all positively worded items, had factor loadings below -.60. The dimensionality of the data improved after all negatively worded items were removed. The variance explained by the Rasch measures then increased from 30.0% to 38.2%, and the first three unexplained variances decreased to 1.9, 1.7 and 1.5. The strongest contrast was evidenced between SC and IS. The disattenuated correlation between IS and SC is .66, which revealed that these two components are not highly correlated in measuring SAS (Linacre, 2014). In contrast, SE and SC reported a high correlation of .98. This strongly suggested that SE and SC are consistent in measuring students’ attitude towards science.

5. Conclusions: The PCA of residuals suggested that the negatively worded items in ASATSCS may not measure the same construct of SAS as the positively worded items. All negatively and positively framed items clustered separately. The variance explained by the Rasch measures increased significantly and the noise decreased sharply after the negatively worded items were removed. Briner and Smith (1999) stated that negatively worded items do not measure the same underlying construct as positively worded ones, and that the two kinds of items in the same calibration often cause a highly incompatible situation. We recommend the removal of negatively worded items in the measure of SAS.

The present study found that SC (confidence in science) and IS (importance of science-related activities) do not correlate well in the measure of SAS. Wang and Berlin (2010, p. 2418), quoting Dhindsa and Chung (2003), defined IS as the extent to which a student thinks their science class is an important and worthwhile class. It is related to their science class experiences. We argue that teaching methods, which contribute to IS, are an extrinsic factor that affects SAS. On the other hand, Dhindsa and Chung (2003) defined SC as the extent to which a student is confident and successful in doing science (p. 911). Confidence is highly relevant to motivational belief, which is measured by students’ belief in their ability and behavior in science class (Bryan, Glynn, & Kittleson, 2011, p. 1050; Simpkins, Price, & Garcia, 2015, p. 1387). It is an intrinsic aspect of motivation. These two aspects, the intrinsic and extrinsic factors, have different conceptions and effects, which might lead to inconsistencies in the measure of SAS but has often gone unnoticed in earlier SAS studies.

The 2015 PISA study defined SE, enjoyment of science, as ‘A measure of how much students like learning about science’. Students’ enjoyment of science has often been referred to as intrinsic motivation (Ryan & Deci, 2000), which is similar to ‘science confidence’ (SC). This explains, theoretically, why SE and SC correlated highly with each other.
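The disattenuated correlation quoted in the KK_001 results corrects an observed correlation between two sets of measures for the measurement error in each. A minimal Python sketch of the standard correction is given here; the function name and the input values are illustrative assumptions, not figures from the study, which took its values from Rasch software output.

```python
import math


def disattenuated_correlation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures.

    r_xy  : observed correlation between two sets of person measures
    rel_x : reliability of the first set of measures, in (0, 1]
    rel_y : reliability of the second set of measures, in (0, 1]
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical inputs (not taken from the abstract).
print(round(disattenuated_correlation(0.50, 0.80, 0.80), 3))  # 0.625
```

A value well below 1 after this correction, such as the .66 reported for IS and SC, suggests that the two subscales are not simply noisy measures of one shared dimension.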

Paper No KK_002
Paper Title Internet Banking Service Quality Measurement: A Scale Development for Malaysian Banks
Email Address [email protected]
1st Author Mahgoub Elradi Ahmed siddig
Affiliation UPM
Subsequent authors Prof. Dr. Rusli Abdullah, Assoc. Prof. Marzanah A. Jabar, Dr. Yusmadi Yah Jusoh

1. Aims/Objectives of study: The study has two objectives. The first is to identify the service quality dimensions of Internet banking websites based on both qualitative and quantitative research methods. The second is to develop a validated scale to measure these dimensions.

2. Sample: The sample size of the pilot study is 42 respondents.

3. Method:
1. Systematic literature review of Internet banking service quality measurement.
2. Proposed research model.
3. Interviews with both academics and bank professionals.
4. Scale development (face and content validity).
5. Scale validation through a pilot test using Rasch model software (item quality, person quality, scale category confirmation).

4. Results: The results of the analysis are as follows:
1. Person statistics. Spread: a spread of (5.5 - 0.2) = 5.3 logits is poor; the person distribution has a much higher spread than the item spread. Reliability: reliability = 0.95 is excellent and Cronbach Alpha = 0.94 is very good. Distribution: the person distribution is approximately normal but partially positively skewed, and it is also platykurtic (flat).
2. Item statistics. Spread: a spread of (1.6 - (-1.5)) = 3.1 logits is fair. Reliability: reliability = 0.85 is good. Distribution: the item distribution is approximately normal but slightly negatively skewed and leptokurtic.
3. Category functionality: looks good.
4. Principal Components Analysis (PCA): looks OK, even though the eigenvalue of the 1st contrast is about the strength of 6 items, larger than 3. There is no indication of distinctions between high contrast items (> 0.6) and low contrast items (< -0.6).
5. Item measures: no difference in item measures between the original data and the data with persons deleted.

5. Conclusions: Based on the results above, some items were deleted or reworded in the final instrument.
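Rasch software usually reports a separation index alongside each reliability coefficient. Under the usual definition the two are linked by R = G^2 / (1 + G^2), where R is the reliability and G the separation. The sketch below applies that relation to the reliabilities of 0.95 and 0.85 quoted in the KK_002 results. The function names are illustrative, and the separation values it prints are derived here, not reported in the abstract.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    """Separation index G implied by a reliability R, using G = sqrt(R / (1 - R))."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation: float) -> float:
    """Reliability R implied by a separation index G, using R = G^2 / (1 + G^2)."""
    g2 = separation ** 2
    return g2 / (1.0 + g2)


# Person and item reliabilities quoted in the abstract.
print(round(separation_from_reliability(0.95), 2))  # 4.36
print(round(separation_from_reliability(0.85), 2))  # 2.38
```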

Paper No KK_003
Paper Title Face Validity Test on Validity and Reliability of ICT Procurement Officer Competency Measurement Instrument by Using Rasch Model
Email Address [email protected]
1st Author Azran Ahmad
Affiliation
Subsequent authors Prof. Datin Dr. Noor Habibah Haji Arshad, Dr. Syaripah Ruzaini Syed Aris

1. Aims/Objectives of study: The main purpose of this paper is to determine the validity and reliability of the constructs that have been identified for the development of a new instrument for measuring the competency of ICT Procurement Officers (PO) who are appointed as members of technical evaluation committees in Public Sector ICT projects.

2. Sample: This test involved a total of 45 experts drawn from Public Sector ICT personnel with at least 3 years of ICT experience as a member of the technical evaluation committee for ICT projects.

3. Method: The data analysis was implemented using the Rasch Model for assessing the validity and reliability of the items.

4. Results: The overall Face Validity Level of Agreement was 95.56%. Referring to the Rasch Statistic Summary, the item separation was 3.26 and Cronbach Alpha was 0.97. The item polarity indicated by the point-measure correlation (PTMEA CORR) for the 49 items was positive throughout, between 0.30 and 0.70. Outfit mean square (MNSQ) values, which ranged between 0.51 and 2.20, were considered in determining the validity and reliability of each construct.

5. Conclusions: Based on the accepted MNSQ range suggested by Bond and Fox (2015), between 0.60 and 1.40, out of 49 items, 40 items were retained, another 3 items were suggested for purification, and the remaining 6 items were considered for removal in the development of the new instrument.

Paper No KK_004
Paper Title Mathematics Item Quality: An Illustrative Example using Rasch Measurement Model
Email Address [email protected]
1st Author Ling Mei-Teng
Affiliation UMS
Subsequent authors Lei-Mee Thien (USM), Mei-Yean, Ong (USM)

1. Aims/Objectives of study: This study aims to present a step-by-step data analysis procedure to validate a set of 20 released TIMSS 2007 and 2011 mathematics items.

2. Sample: A total of 113 grade eight students were selected from six secondary schools in Sabah.

3. Method:

4. Results: Findings revealed that 19 items showed acceptable Rasch analysis properties. Item 16 was found to misfit and needs to be revised. The means of the item and person measures are less than 0.5, indicating the test was on-target. However, findings revealed that the ratio of low, medium, and high item difficulty was 1:2:1 respectively, which differs from the test specification proposed by the researchers, which is 3:4:3.

5. Conclusions: This study has contributed to the process of producing a set of validated mathematics items using the Rasch model, particularly for school teachers. Implications and limitations of the study were presented.
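The mean-square (MNSQ) fit statistics used in the KK_003 abstract summarise squared residuals between observed and model-expected responses. The sketch below computes infit and outfit mean squares for a single dichotomous item under the Rasch model and checks a value against the 0.60 to 1.40 range cited from Bond and Fox (2015). The function names and the toy data are assumptions for illustration; the studies used dedicated Rasch software, and the KK_003 instrument may not be dichotomous.

```python
import math
from typing import Sequence, Tuple


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct (1) response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses: Sequence[int], thetas: Sequence[float], b: float) -> Tuple[float, float]:
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals = 0.0   # sum of (x - p)^2
    variances = 0.0           # sum of p * (1 - p)
    standardized = 0.0        # sum of (x - p)^2 / (p * (1 - p))
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        w = p * (1.0 - p)
        squared_residuals += (x - p) ** 2
        variances += w
        standardized += (x - p) ** 2 / w
    infit = squared_residuals / variances      # information-weighted mean square
    outfit = standardized / len(responses)     # unweighted mean square
    return infit, outfit


def within_range(mnsq: float, low: float = 0.60, high: float = 1.40) -> bool:
    """True when a mean square lies inside the accepted range."""
    return low <= mnsq <= high


# Toy data: four hypothetical persons answering one hypothetical item.
infit, outfit = item_fit([0, 0, 1, 1], [-1.0, 0.0, 1.0, 2.0], b=0.5)
print(within_range(infit), within_range(outfit))
```

The abstract sorts items into three groups (retained, purification, dropped), but the thresholds separating the last two are not stated, so the sketch only distinguishes inside from outside the range.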

Paper No KK_005
Paper Title Rasch person fit statistics associated with the weighted degree indicators of Social Network Analysis
Email Address [email protected]
1st Author Tsair-Wei Chien
Affiliation Chi-Mei Medical Center, Taiwan

1. Aims/Objectives of study: The purpose of latent trait and latent class analysis is to partition the sample of persons into a minimum number of homogeneous classes that is meaningful to the study conducted. Incorporating Rasch residual data into social network analysis (SNA), using graphical representations to detect persons who misfit the model, is helpful and interesting.

2. Sample: A simple polytomous dataset (Linacre, 1997) with 26 persons and 20 items was used for illustration. The 2-mode (persons in rows and items in columns) Rasch standardized residual scores were transformed into a one-mode (persons in both rows and columns) matrix to show person latent classes accordingly.

3. Method: Applying the Rasch model to calculate residual correlation coefficients for every pair of persons so as to form a one-mode matrix, we used the SNA software Gephi to draw a plot with person weighted degrees showing two distinct latent classes (i.e., fit and misfit groups) of interest. The correlation coefficients between weighted degree and the Rasch indices of Outfit and Infit mean squares, as well as the point-measure correlation (i.e., PT-MEASURE CORR in Winsteps), were reported.

4. Results: The classes according to the Rasch standardized residual patterns were easily and separately displayed using Gephi along with the Rasch fit statistics. The correlation coefficients of the weighted degree with these indices are 0.51 (Outfit), 0.48 (Infit), and -0.62 (PT-MEASURE CORR), respectively.

5. Conclusions: It is recommended to apply SNA to the Rasch standardized residual scores yielded by Winsteps or similar software, in order to obtain homogeneous classes and to further explain the data in terms of how the persons in the different classes responded differently to the items.

Paper No KK_006
Paper Title Application of Multi-dimensional Computerized Adaptive Test on Clinical Dementia Rating Scale using Computer-aided Technique
Email Address [email protected]
1st Author Ting-En Hui
Affiliation Chi-Mei Medical Center, Tainan, Taiwan
Subsequent authors Tsair-Wei Chien, Chi-Mei Medical Center, Tainan, Taiwan

1. Aims/Objectives of study: With the increasingly rapid growth of the elderly population, those aged 65 and older comprise more than 11.8% of the nation's citizens, which is defined as a super-aged society. The leading factor influencing the elderly is dementia. How to examine and diagnose subjects precisely using a specialized multidimensional computer adaptive testing (MCAT) tool is still unknown. Thus, we aim to develop a website that can help parents use their own computers, tablets, or smart phones for online screening and prediction of dementia based on responses to the Clinical Dementia Rating (CDR) Scale.

2. Sample: The CDR scale was applied to 366 outpatients in a hospital in southern Taiwan.

3. Method: We (1) used a multidimensional computer adaptive test (MCAT) with parameters for items across six dimensions, and (2) simulated responses to compare the efficiency and precision of MCAT and NAT (non-adaptive test). The number of items saved and the cutoff points for the tool were determined.

4. Results: MCAT yielded significantly more precise measurements and was significantly more efficient than NAT: it yielded a 20.19% (= (53-42.3)/53) saving in item length when measurement differences of less than 5% were allowed. Person-measure correlation coefficients were highly consistent among the five domains. The cutoff points for the overall measures were -0.7 and 0.7 logits, which were equivalent to 33 and 67 in percentile scores. Significantly fewer items were answered on MCAT than on NAT without compromising MCAT’s precision.

5. Conclusions: Developing a website to help parents use their own computers, tablets, or smart phones for online screening and prediction of dementia in elders is useful and not difficult.
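The KK_005 method turns a persons-by-items matrix of standardized residuals into a person-by-person (one-mode) correlation matrix and then reads off each person's weighted degree. A minimal pure-Python sketch of that transformation follows. Treating the absolute correlation as the edge weight is an assumption made here, since the abstract does not say how negative correlations were handled, and the function names and toy residuals are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return 0.0
    return cov / (sd_u * sd_v)


def one_mode_matrix(residuals: Sequence[Sequence[float]]) -> List[List[float]]:
    """Person-by-person correlations of the rows of a persons-by-items residual matrix."""
    n = len(residuals)
    return [
        [1.0 if i == j else pearson(residuals[i], residuals[j]) for j in range(n)]
        for i in range(n)
    ]


def weighted_degrees(corr: Sequence[Sequence[float]]) -> List[float]:
    """Sum of absolute edge weights per person, excluding the diagonal (assumed weighting)."""
    return [
        sum(abs(w) for j, w in enumerate(row) if j != i)
        for i, row in enumerate(corr)
    ]


# Toy standardized residuals: 3 hypothetical persons by 4 hypothetical items.
toy = [
    [1.0, -1.0, 1.0, -1.0],
    [0.5, -0.5, 0.5, -0.5],
    [-1.0, 1.0, -1.0, 1.0],
]
print(weighted_degrees(one_mode_matrix(toy)))
```

The resulting matrix could be exported as an edge list for a network tool; the weighted degrees can then be correlated with infit, outfit, and point-measure statistics as the abstract describes.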

Paper No KK_007 Paper No KK_008

Paper Title Using social network analysis to report Rasch papers’ keyword development and Paper Title Multidimensional computerized adaptive testing for toddlers: a developmental association across years screening tool Email [email protected] Email [email protected] Address Address 1st Author Wei-Ru Jyun 1st Author Ying-Hsien Chien, Jin San Chronic Care Hospital & Nursing Home, Tainan, Taiwan

Affiliation Chi-Mei Medical Center, Taiwan Affiliation

Subsequent authors Tsair-Wei Chien, Chi-Mei Medical Center, Tainan, Taiwan
1. Aims/Objectives of study: To compare the keywords associated with Rasch-related papers and analyse their development in recent and past years.
2. Sample:
3. Method: We selected 3,100 abstracts and their corresponding keywords, downloaded from the US National Library of Medicine, National Institutes of Health (i.e., at pubmed.com), published between 1952 and April 2017. We explored keyword development and association across years, and analysed the most outstanding authors who published health-related articles and their collaboration patterns. We used social network analysis (SNA) to explore the relations of the keywords and authors in the journals.
4. Results: In addition to the most frequent keywords (Rasch model, Rasch analysis, and item response theory), the strongest associations between authors and keywords are reported in this study. Visual representations of the development and change across years are also presented.
5. Conclusions: Rasch-related papers on health affairs are worth studying and reporting at the wonderful Rasch PROMS conference.

Subsequent authors Tsair-Wei Chien, Chi-Mei Medical Center, Taiwan
1. Aims/Objectives of study: To investigate the use of a multidimensional computer adaptive testing (MCAT) tool combined with Multidimensional Screening in Child Development (MuSiC) for toddlers' parents.
2. Sample: We retrieved 75 item parameters from the literature regarding MuSiC at https://www.ncbi.nlm.nih.gov/pubmed/25127503
3. Method: After retrieving the 75 item parameters from the MuSiC literature item bank for 1- to 3-year-olds, we simulated 1,000 person measures from a standard normal distribution to compare the efficiency and precision of MCAT and NAT (Non-Adaptive Testing) in five domains: cognitive skills, language skills, gross motor skills, fine motor skills, and socio-adaptive skills. The number of items saved and the cutoff points for the tool were determined.
4. Results: MCAT yielded significantly more precise measurements and was significantly more efficient than NAT: it yielded a 46.67% (= (75-40)/75) saving in item length when measurement differences of less than 5% were allowed. Person-measure correlation coefficients were highly consistent among the five domains. Significantly fewer items were answered on MCAT than on NAT without compromising MCAT's precision.
5. Conclusions: Developing a website to help parents use their own computers, tablets, or smart phones for online screening and prediction of developmental delays in toddlers is useful and not difficult.
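The item saving quoted in the MCAT Results is plain arithmetic on two numbers taken from the abstract's own formula: a 75-item bank and 40 items administered. A minimal sketch follows; the function name is illustrative and not part of the authors' software.

```python
def item_saving(bank_size: int, items_used: int) -> float:
    """Fraction of the item bank that the adaptive test did not need to administer."""
    if bank_size <= 0 or not 0 <= items_used <= bank_size:
        raise ValueError("items_used must lie between 0 and bank_size")
    return (bank_size - items_used) / bank_size


# Numbers from the abstract's formula (75 - 40) / 75.
saving = item_saving(75, 40)
print(f"{saving:.2%}")
```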


Paper No KK_009
Paper Title Examining the Measurement Properties of Students' Perceptions of Assessment Scale
Email Address [email protected]
1st Author Wilham M. Hailaya, Ph.D.
Affiliation College of Education, Mindanao State University at Tawi-Tawi Sanga-Sanga, Bongao, Tawi-Tawi, Philippines
Subsequent authors
1. Aims/Objectives of study: The aim of this study was to develop an instrument called the Students' Perceptions of Assessment Scale that can be utilized to measure students' perceptions of assessment, particularly on test and assignment as commonly used in the Tawi-Tawi context. Specifically, the study sought to establish the utility of the instrument by investigating its measurement properties at the macro and micro levels. Developing the said instrument was deemed vital as it can help provide important information about the subjective qualities of assessment tasks, which can also be a basis for assessment practices to be properly tailored to meet students' interests and improve their learning.
2. Sample: The samples were purposely selected as some schools or locations were difficult to access. Moreover, specific grades were targeted to ensure that the students involved in the study did experience doing tests and assignments. In total, 2,077 students from Grade Six, Second Year and Fourth Year high school classes participated in the study.
3. Method: The instrument was first examined at the macro level using the confirmatory factor analysis to ascertain the two predetermined constructs namely perceptions of test and perceptions of assignment. To carry out the confirmatory factor analysis, LISREL 8.80 was used. After which, the instrument was investigated at the micro level using the Rasch model (Rating Scale Model) to further establish the constructs and the characteristics of the items. To carry out the analysis at this level, ConQuest 2.0 was employed. The results of the two analyses were used to judge the acceptability of the instrument.
4. Results: The results indicated that both one-factor and two-factor models were appropriate for the instrument, though one-factor model was preferred due to its parsimony. Moreover, 11 items appeared to tap perceptions of test and seven (7) items appeared to reflect perceptions of assignment. Furthermore, the option categories appeared to work well.
5. Conclusions: The scale, albeit far from being perfect, has the utility in measuring students' perceptions of assessment on tests and assignments. Moreover, the scale can be a starting point for further study on the instrument and has implications for instrument development, educational assessment research, policy and practice.


Paper No KK_010
Paper Title Development and validation of the students' ability questionnaire on science process skills
Email Address [email protected]
1st Author Ellyza Karim (PhD candidate)
Affiliation
Subsequent authors Dr Jamil Ahmad & Prof Kamisah Osman
1. Aims/Objectives of study: As the instrument was newly developed, a pilot study was conducted to obtain empirical evidence of the validity and reliability of the questionnaire.
2. Sample: 340 respondents who were 2016 Malaysian Primary School Evaluation Test (UPSR) leavers. These respondents, aged 13 years, were the first cohort of the recently implemented science curriculum syllabus for primary school.
3. Method: This study assesses the ability level of science process skills using a 5-point Likert scale questionnaire ranging from unable to able. A total of 68 items were developed for the questionnaire by applying verified indicators based on the literature review. The item indicators were then justified by expert consensus using the Fuzzy Delphi Method. Respondents were given one hour to complete the instrument. Finally, Rasch Analysis for two-facet Model version 3.73 was employed to analyse the data.
4. Results: Overall, the Cronbach Alpha person reliability was 0.96 and the item reliability was 0.99. The point-measure correlations (PTMEA Corr) were positive for all items, ranging from 0.33 to 0.71, which showed that all items were measuring what they were supposed to measure in the science process skills construct. All items were accepted, as their outfit mean square (MNSQ) values ranged between 0.67 and 1.50, indicating a good measure of latent variables for item fit. The item map showed that most students were unable to design scientific steps on their own within the experimenting skill. Meanwhile, within the skill of using space and time relationships, determining the position of an object with time was the item students were most able to do.
5. Conclusions: The findings provide a more accurate insight into the construct validity and reliability of the questionnaire for measuring students' ability in science process skills.

Paper No KK_011
Paper Title We have equal intervals; now we need invariance: The next important step in Rasch measurement
Email Address [email protected]
1st Author Prof Trevor G BOND
Affiliation
Subsequent authors
1. Aims/Objectives of study: The distinctive attribute of a measurement system is the requirement for an arbitrary unit of differences that can be iterated between successive measures. Instead of focusing on constructing measures of the human condition, psychologists and others in the human sciences have focused on applying sophisticated statistical procedures to their data. In the human sciences, invariance of item and person measures remains the exception rather than the rule. Interpretations of results from many tests of common human abilities must be made exactly in terms of which sample was used to norm the test, and candidates' results for those tests depend on which test was actually used.
2. Sample: n/a
3. Method: n/a
4. Results: This presentation will demonstrate simple tests of invariance and show how invariance and DIF contribute to important understandings in the human sciences.
5. Conclusions: An important goal of early research should be the establishment of item difficulty values for important testing / data collection devices such that those values are sufficiently invariant for their intended purposes.
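The KK_011 abstract refers to "simple tests of invariance" without spelling one out. One common form, sketched here as an assumption rather than as the presenter's own procedure, compares the difficulty estimates of the same items from two samples and flags any item whose difference exceeds about 1.96 combined standard errors. All estimates and standard errors below are hypothetical.

```python
import math


def invariance_flags(est_a, est_b, se_a, se_b, z_crit=1.96):
    """Flag items whose difficulty estimates differ between two samples.

    est_a, est_b: item difficulty estimates (logits) from samples A and B.
    se_a, se_b:   the matching standard errors.
    Returns one boolean per item; True means the item is not invariant
    at the chosen critical value.
    """
    flags = []
    for d_a, d_b, s_a, s_b in zip(est_a, est_b, se_a, se_b):
        z = (d_a - d_b) / math.sqrt(s_a ** 2 + s_b ** 2)
        flags.append(abs(z) > z_crit)
    return flags


# Hypothetical estimates for three items calibrated on two samples.
flags = invariance_flags(
    est_a=[-1.0, 0.0, 1.2],
    est_b=[-0.9, 0.1, 0.3],
    se_a=[0.15, 0.15, 0.15],
    se_b=[0.15, 0.15, 0.15],
)
print(flags)
```

Only the third item, whose estimates differ by 0.9 logits, is flagged here; the first two differ by 0.1 logits, well inside the combined error.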


Paper No KK_012
Paper Title Application of Rasch Model in Islamic Moral Value Scale for Islamic Education Teachers (INSPI) in Malaysia
Email Address [email protected]
1st Author Salbiah bt Mohamed Salleh @ Salleh
Affiliation
Subsequent authors (1) Jamil bin Ahmad (2) Mohd Aderi bin Che Noh (3) Anis Jauharah bt Abdul Kadir
1. Aims/Objectives of study: This study aims to validate the Islamic Moral Value Scale for Islamic Education Teacher (INSPI) in Malaysia using the Rasch Measurement Model.
Keywords: Rasch Measurement Model, Islamic Moral Value, Islamic Education Teacher
2. Sample: Two hundred Islamic Education Teachers in primary and secondary schools participated in this study.
3. Method: This study employed a quantitative approach to data collection and analysis. A survey was used to gather information on the Islamic moral values practised by Islamic Education Teachers in primary and secondary schools in Selangor, Malaysia. The data were analysed using Winsteps 3.80 to investigate the functioning of the rating scale categories, reliability and separation index, unidimensionality, item polarity, goodness of fit and item difficulty level of the items.
4. Results: Firstly, the original five-category rating scale does function effectively. Secondly, the reliabilities for items and persons are very high, and the separation values are good (greater than two). Thirdly, the Rasch Model showed that INSPI is a unidimensional scale. Three items were deleted due to misfit. Lastly, all the items in this scale are quite easy for the respondents, who performed well on almost all the items measured in the scale.
5. Conclusions: This study produced a new Rasch measurement for moral development. It provides new insight into measurement in religious studies, especially in the Islamic religion.

Paper No KK_013
Paper Title Rasch-derived Measure for Assessing Student Competency in University Introductory Computer Programming (CS1)
Email Address [email protected]
1st Author Leela Waheed
Affiliation
Subsequent authors Rob Cavanagh
1. Aims/Objectives of study: The purpose of this paper is to report the results of a project to develop and test a linear measure of university student performance in the first course of computer programming (CS1).
2. Sample: The sample comprised 85 students (25 [Maldives National University (MNU)], 31 [Asia Pacific University of Malaysia (APU)] and 29 [Villa College, Maldives]). The students had completed their CS1 course and were in the second semester of the first year of university study.
3. Method: The validity of the CS1 measure was investigated with the theoretical frame expounded by Messick. The aspects of the frame are the content aspect, the substantive aspect, the structural aspect, the generalisability aspect, the external aspect, the consequential aspect and the aspect of interpretability added by Smith (2007). RUMM2030 was used to generate statistics and displays to exemplify the Messick aspects.
4. Results: The initial analysis of the data set with RUMM2030 demonstrated an excellent Person Separation Index (PSI) with no evidence of Differential Item Functioning (DIF) or misfit of the items or persons. However, there was some disordering of thresholds. Hence, Question 1D was rescored dichotomously, and the middle two categories of Questions 3D and 5D were collapsed. Principal Components Analysis of residuals showed no significant structure in the rotated component matrix, supporting the assumptions of local independence and unidimensionality.
5. Conclusions: There was evidence that the CS1 measure demonstrated reliability and validity in measuring student competence in fundamental CS1 concepts. This study also demonstrated application of the Rasch model both as a powerful approach for instrument construction and for the provision of evidence to argue for validity.


Paper No KK_014
Paper Title Disciplinary Biases of a Student Evaluation of Teaching Survey in Higher Education
Email Address [email protected]
1st Author Billy Wai Kei Chan
Affiliation University of
Subsequent authors Fan Huang and Emily Pey Tee Oon
1. Aims/Objectives of study: The present study aims to identify rating biases across academic discipline areas that could possibly plague the scores of the UMAC General Education Course Survey (GECS). In particular, we examined whether the four disciplines, namely Language and Communication, Science and Information Technology, Society and Culture, and Self-Development, garnered uneven scores that signified biases in student ratings, using the invariance assessment of the Rasch measurement model.
Keywords: disciplinary biases, SET, higher education
2. Sample: For the present study, GECS data from 45,361 students who enrolled in the GE programme from 2011 to 2015 (a full cycle of 8 semesters) from all disciplines were included for analysis.
3. Method:
Instrument
The GECS consisted of five rating scale items with six-point response categories (1: Strongly Disagree; 2: Disagree; 3: Slightly Disagree; 4: Slightly Agree; 5: Agree; 6: Strongly Agree). The items are:
1. This course helped you participate actively in classroom activities.
2. This course helped manage your own independent learning in this subject area in the future.
3. Did this course help you understand its application in everyday life situations?
4. This course helped you think critically about the course topics.
5. Did this course help you develop your communication skills (e.g. reading, writing, speaking and other forms of communication)?
Data Analyses
Raw scores of the GECS were imported into Rasch statistical software, namely Winsteps version 3.81.0 developed by Mike Linacre (Linacre, 2014), for construction of interval-scale data. The data were then examined for quality according to the expectation of Rasch's psychometric model, that is, the fit of the data to the Rasch model. Fit statistics, namely mean-square (MnSq) infit and outfit, were used to examine the fit. Acceptable values range between .60 and 1.40 (Wright & Linacre, 1996; Bond & Fox, 2015). Data that fit the Rasch model are unidimensional (Bond & Fox, 2015). Scores from data that misfit the Rasch expectation are not interpretable, as they might have measured more than one latent trait, which distorts the measurement of the latent trait (Bond & Fox, 2015). The latent trait for the present study is the teaching and learning quality of the GE courses. Next, analysis of Differential Item Functioning (DIF) was conducted to compare the pattern of items along the latent trait on a common scale as a function of difficulty/agreeability across the four disciplines. Disordered items spread along the scale across the four disciplines signified potential bias. This means that if the difficulty/agreeability estimates for an item do not remain identical (lack of invariance) across the four disciplines, the item might have been interpreted differently by students from different disciplines: some items could have been rated 'favorably' by a certain group of students for reasons not related to teaching and learning quality. Lack of invariance is signaled by a DIF contrast of greater than .50 logit (Bond & Fox, 2015; Linacre, 2014).
4. Results: All items reported acceptable MnSq infit (.76 to 1.18) and outfit (.71 to 1.13). The results indicated that scores for the GECS met the expectation of the Rasch model to be unidimensional (measuring only the quality of teaching and learning of the GE courses) and hence are interpretable (Bond & Fox, 2015).
Students' overall responses on the five items
The extent of agreement is indicated by 'Measure' (Rasch estimates): a lesser positive value indicates a greater extent of agreement, whereas a greater positive value indicates a lesser extent of agreement. Item 5 (Did this course help you develop your communication skills (e.g., reading, writing, speaking and other forms of communication)?) appeared to be the most difficult item to agree with (47.52 logit). On the other hand, Item 3 (Did this course help you understand its application in everyday life situations?) was the easiest item (47.36 logit). That is, relatively fewer students agreed that GE courses enhance the development of their communication skills. In contrast, many agreed that the content of the courses is relevant, as the knowledge they learn from the GE courses can be applied in daily situations. In addition, Item 1 (This course helped you participate actively in classroom activities, 47.48 logit) is more difficult to agree with than Item 2 (This course helped manage your own independent learning in this subject area in the future, 47.46 logit) and Item 3 (Did this course help you understand its application in everyday life situations, 47.36 logit). The results seem to suggest that, overall, students agreed more to items probing the value of GE courses on daily relevance, interaction enhancement and communication development.
Disciplinary differences in ratings
The ratings of students from each discipline varied across items. Some items appeared to be more difficult for certain groups of students to agree with but easier for others.
Students from the Science and Information Technology, Society and Culture, and Self-Development disciplines found Item 5 the most challenging to agree with but Items 3, 4, and 1, respectively, the easiest to agree with. In contrast, students from the Language and Communication discipline found Item 5 the easiest to agree with and Item 4 the most difficult.
Item 5 (Did this course help you develop your communication skills (e.g. reading, writing, speaking and other forms of communication)?) is reported to be least agreed with for the entire sample, as discussed above. Indeed, this item is very difficult for students from Science and Information Technology, Society and Culture, and Self-Development to agree with (p < .00), but those from the Language and Communication discipline agreed to it unequivocally and statistically (p < .00).
Item 1 (This course helped you participate actively in classroom activities), which is on enhancement of class participation, appeared to be the second most difficult item to agree with for the entire sample. However, it is the easiest item for students from the Self-Development discipline to agree with, statistically easier than for all their counterparts (p < .00).
Item 3 (Did this course help you understand its application in everyday life situations), which is about the application of content knowledge, is the easiest item to agree with for the entire sample. This item, however, appeared to be statistically the easiest item to agree with for students from Science and Information Technology and is more difficult for students from Language and Communication to agree with, with a reported statistically significant difference (p < .00).
5. Conclusions: The present study concurred with the many studies reporting that student ratings vary across academic disciplines, a variation often called 'bias in ratings' (e.g., Benton & Cashin, 2012; Kember & Leung, 2011; Royal & Stockdale, 2015; Sixbury & Cashin, 1995; Zahn & Schramm, 1992). However, the variation in ratings seemed to be reasonable. Students from the soft courses rated items relating to class participation and communication more favorably. On the other hand, students from hard sciences rated items regarding content application more favorably. The results prompted us to suggest that the variations in ratings are the consequence of pedagogical difference: soft sciences yielded better ratings as these courses tend to be more student-centered instructionally than hard sciences, where the instructional approach tends to be teacher-centered and rigid.
A point to be noted is that the aforementioned result is at variance with some other studies (e.g., Kember & McNaught, 2007; Kember et al., 2006; Murray & Renauld, 1995). These studies concluded that students' perception of good teaching is independent of academic discipline. The contradictory research results suggest that the variation in ratings is a consequence of pedagogy rather than of academic discipline (Kember & Leung, 2011; Murray & Renauld, 1995).
As the differences in ratings were found to be reasonable, institutions are not recommended to compare the student ratings of math, science and information technology courses to the ratings of other courses, e.g., language and communication courses, for human resources decisions, such as tenure, salary, and promotion, because the evidence suggested that the relatively low ratings may be determined by the academic discipline rather than by teaching effectiveness. However, if the low ratings are due to low teaching effectiveness of teachers.
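The two decision rules in the KK_014 Method (MnSq fit between .60 and 1.40, and a DIF contrast greater than .50 logit as a sign of lost invariance) can be written as a short check. This is a minimal sketch, not the Winsteps procedure itself; the per-discipline item measures below are hypothetical, because the abstract reports only whole-sample measures.

```python
from itertools import combinations


def fits_rasch_model(mnsq, low=0.60, high=1.40):
    """True when a mean-square fit statistic lies in the acceptable range."""
    return low <= mnsq <= high


def flag_dif(item_measures, threshold=0.50):
    """Return the group pairs whose measures for one item differ by more
    than `threshold` logits, with the contrast rounded to two decimals."""
    flagged = []
    for (g1, m1), (g2, m2) in combinations(item_measures.items(), 2):
        contrast = m1 - m2
        if abs(contrast) > threshold:
            flagged.append((g1, g2, round(contrast, 2)))
    return flagged


# Endpoints of the infit and outfit ranges reported in the abstract.
reported_fit_ok = all(fits_rasch_model(v) for v in (0.76, 1.18, 0.71, 1.13))

# Hypothetical measures (logits) of a single item in the four disciplines.
hypothetical_item = {
    "Language and Communication": -0.40,
    "Science and Information Technology": 0.35,
    "Society and Culture": 0.20,
    "Self-Development": 0.05,
}
dif_pairs = flag_dif(hypothetical_item)
print(reported_fit_ok, dif_pairs)
```

With these made-up values, only the contrasts between Language and Communication and the two highest-measure disciplines exceed the .50 logit threshold.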


Paper No KK_015
Paper Title Rasch Analysis Properties of Structural Items and Essay in Chemistry Test
Email Address [email protected]
1st Author Adeline Leong Suk Yee
Affiliation Universiti Malaysia Sabah
Subsequent authors Mei-Teng Ling, Lay Yoon Fah; Universiti Malaysia Sabah
1. Aims/Objectives of study: The study aimed to ascertain the Rasch analysis properties of structural and essay items using the Rasch Partial Credit Model (PCM).
2. Sample: The test was administered to a group of 76 Form Four students who took chemistry as a subject.
3. Method: The structural and essay questions were analysed using the Rasch Partial Credit Model (PCM) to ascertain item fit and the unidimensionality of the items with respect to the construct. Structural items were analysed separately from the essay items because the number of students who answered each essay item differed according to the essay item they chose to answer. Analysis of fit helps detect discrepancies between the Rasch model expectation and the data collected. The differences in student ability and item difficulty between two raters were assessed by a cross plot. The instrument used as an example in this study was the Chemistry Achievement Test (Paper 2), consisting of Part A structural items (6 items) and Part B and Part C self-selected essay questions (choose one from two in each part), developed by the researchers and a panel of one excellent teacher of chemistry and 2 experienced secondary school chemistry teachers.
4. Results: Overall, the Rasch analysis properties of the chemistry test are acceptable, with two misfitting items (MnSq more than 1.5) and two items that are too good to be true (z-Std less than -2.0). MnSq more than 1.5 indicates that the item is unproductive for the construction of measurement and that modifications are needed (Linacre & Wright, 2012), while a small number of 'too good to be true' items do not degrade measurement (Bond & Fox 2015). The responses to three items (with negative PTMEA Corr.) contradict the direction of the latent variable.
5. Conclusions: The content experts suggested that the test be given to another sample group and that item polarity be checked again, because the test in this study was taken by high-performing students. The person abilities rated by Rater 1 were substantially different from those rated by Rater 2 for essay 2 and essay 3. A third rater is suggested to resolve the disagreement between the first two raters.

Paper No KK_016
Paper Title Gender DIF in Mathematics Items among Secondary Students in a Coeducational Learning Environment
Email Address [email protected]
1st Author S.Kanageswari Suppiah Shanmugam
1. Aims/Objectives of study: This study reports on preliminary findings of gender Differential Item Functioning (DIF) in a school culture that is renowned for the 'special' teaching approaches and successful mathematics learning that produce impressive mathematics results. The main aim of this study is to identify mathematics items that function differently across gender groups in coeducational schools in order to study the relationship between gender and characteristics of mathematics items.
2. Sample: A total of 63 boys and 55 girls in form two were selected from a school for the preliminary study.
3. Method: The software WINSTEPS was used to conduct DIF analysis. Based on the Rasch model, items were flagged for DIF using the Mantel-Haenszel chi-square method, with boys forming the reference group and girls forming the focal group. Some 12 computation and 12 word problem items from the grade eight TIMSS 1999 and TIMSS 2003 released mathematics items were arranged according to the mathematical hierarchy of easy to difficult as stipulated in the mathematics curriculum. Word problem items were distinguished as items that are set in a real-world context.
4. Results: Findings revealed that two items were flagged as DIF: one computation item and one word problem item. The computation item exhibits moderate DIF, while the word problem item exhibits large DIF, but both items tend to favour the boys. These DIF items assess lower-order thinking skills in the cognitive domain of Knowing and are from the topics of decimal and percentage in the content domain of Number.
5. Conclusions: This initial exploration suggests that items which assess lower-order thinking skills tend to favour boys, which challenges the gender stereotype that items assessing higher-order thinking skills favour boys. Since the items in this test were arranged from easy to difficult, a possible explanation prompted by these findings is that the serial position of items in a test may be an item characteristic that needs to be considered. This is because items that appear at the end of the test tend to be more difficult for girls, as suggested by some studies.
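The KK_016 Method names the Mantel-Haenszel approach with boys as the reference group and girls as the focal group. A minimal sketch of the Mantel-Haenszel common odds ratio and its conversion to the ETS delta metric is given below. The counts are hypothetical, the "moderate" and "large" labels use the conventional ETS size cut-offs of 1.0 and 1.5 delta units only (the full ETS rules also involve significance tests), and this is not a reproduction of the WINSTEPS output the authors used.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across ability strata.

    Each stratum is (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    A value above 1 means the item favours the reference group.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def mh_delta(odds_ratio):
    """ETS delta metric: negative values favour the reference group."""
    return -2.35 * math.log(odds_ratio)


def dif_size(delta):
    """Size label from the absolute delta value (significance tests omitted)."""
    size = abs(delta)
    if size < 1.0:
        return "negligible"
    if size < 1.5:
        return "moderate"
    return "large"


# Hypothetical counts for one item in two ability strata (boys = reference).
strata = [(10, 10, 5, 15), (20, 5, 12, 8)]
alpha = mantel_haenszel_odds_ratio(strata)
delta = mh_delta(alpha)
print(round(alpha, 3), round(delta, 3), dif_size(delta))
```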


Paper No KK_017
Paper Title Psychometric Features of Psychosocial Safety Climate (PSC-12)
Email Address [email protected]
1st Author Rosnah Ismail, Faculty of Medicine National University of Malaysia
Affiliation
Subsequent authors Azmi Mohd Tamil, Noor Hassim Ismail
1. Aims/Objectives of study: The aim of this article is to examine the psychometric features of Psychosocial Safety Climate (PSC-12). It is defined as shared perceptions of organizational policies, practices and procedures for the protection of worker psychological health and safety, main focal from management practices in a unit or organizational level.
2. Sample:
3. Method: In a cross-sectional study, 509 male employees from multiple worksites completed self-administered questionnaires from September 2012 to May 2013. The study used the PSC-12, which consists of 12 items measuring four domains: senior management support and commitment for stress prevention; management priority to psychological health and safety versus productivity goals; organizational communication; and organizational participation and involvement. The Rasch model technique was used to examine the Cronbach Alpha value, person and item measures, item and person reliability, standard error of items and item fit before submitting the data for unidimensionality verification. The scale's unidimensionality was considered violated if 1) the raw variance explained by the measure was less than 40%; 2) the unexplained variance in the first contrast was more than 2 Eigenvalue or 15%; and 3) the scale displayed Differential Item Functioning (DIF) of more than 0.50 logit for role at the workplace, i.e. leader role vs. non-leader role in job scope. Finally, rating scale validity was examined based on 6 criteria: 1) a minimum of 10 responses per category, 2) category frequencies displaying regular distributions, 3) average measures increasing monotonically across the rating scale, 4) an advance of at least 1.0 logits between structure calibrations for a five-category rating scale, 5) a distinct probability curve graph for each response category, and 6) outfit MNSQ less than 2.
4. Results: Person-fit data from a total of 492 male employees, representing 33 companies in three major activities (service, manufacturing and agricultural) in Malaysia, were analyzed. The median (IQR) age was 33 (27; 42) years. The median (IQR) duration served in the current organization was 7 years (3; 17). The majority had tertiary education (51.1%) and had no leading responsibilities in their job scope (61.4%). The PSC-12 had a Cronbach Alpha of 0.93 with sufficient item range (0.97) and enough spread of respondent ability across the sample to answer the items (0.89). The standard error of the items is 0.14 logit, which is acceptable for a 5-point Likert response scale. Generally the items fit and were unidimensional in nature. All items showed no bias related to leader role at the workplace. All rating scales surpassed all required criteria and resembled the prototypical Likert scale probability curve. An absence of "noise" in the measurement was observed.
5. Conclusions: The PSC-12 is a psychometrically sound scale among the sampled male employees in Malaysia. It is a valid scale to measure employees' shared perception of psychological health and safety protection regardless of their role at the workplace.
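The three unidimensionality criteria listed in the KK_017 Method amount to a small rule set. The sketch below encodes them as stated in the abstract; the function and parameter names are illustrative and the input values are hypothetical, since the abstract does not report the underlying figures.

```python
def unidimensionality_violations(variance_explained_pct,
                                 first_contrast_eigenvalue,
                                 first_contrast_pct,
                                 max_dif_logit):
    """Return the criteria from the abstract that a scale fails to meet."""
    violations = []
    if variance_explained_pct < 40.0:
        violations.append("raw variance explained by the measure below 40%")
    if first_contrast_eigenvalue > 2.0 or first_contrast_pct > 15.0:
        violations.append("first contrast above 2 eigenvalue units or 15%")
    if abs(max_dif_logit) > 0.50:
        violations.append("DIF above 0.50 logit")
    return violations


# Hypothetical scale that meets every criterion.
clean = unidimensionality_violations(52.3, 1.8, 6.1, 0.21)

# Hypothetical scale that fails the first and third criteria.
problem = unidimensionality_violations(35.0, 1.9, 9.0, -0.62)
print(clean, problem)
```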


Paper No KK_018
Paper Title Reliability Testing Instruments for Computer Programming Learning: Applying the Rasch model
Email Address [email protected]
1st Author Azliza Yacob
Affiliation TATi University College
Subsequent authors Noraida Haji Ali, University Malaysia Terengganu; Noor Suhana Sulaiman, TATi University College; Nur Sukinah Aziz, TATi University College
1. Aims/Objectives of study: Assessment and measurement are among the most important processes in teaching and learning Computer Programming. As it is known to be a difficult subject to learn, it mostly results in high dropout and failure rates. This study aims to highlight the process of reliability testing in developing a learning instrument to support student understanding.
2. Sample: A pilot study of 33 samples was carried out to test the reliability of the instrument.
3. Method: The Rasch model was used to test the reliability of the measurement for each item by converting the test results into ratio-type data.
4. Results: A gap was found between the most difficult items and the rest of the items. After deleting some items, the result indicates that the instrument has a high degree of reliability and is suitable for the real data collection. In this case, the most difficult item may require further investigation, since students are either unfamiliar with the item, or it is confusing and misleading, or the question is simply too hard.
5. Conclusions: To develop a learning instrument that can support students' understanding, the construction of the corresponding items should be emphasized. Because of the factor mentioned, lecturers should decide the suitability of each item in order to provide a supportive and effective learning environment.

Paper No KK_019
Paper Title Rasch Analysis of the Malaysian Teachers' Responses to the Organizational Commitment Scale
Email Address [email protected]
1st Author Ahmad Zamri bin Khairani
Affiliation School of Educational Studies, USM
Subsequent authors Aziah binti Ismail, School of Educational Studies, USM
1. Aims/Objectives of study: The main objective of this study is to examine the psychometric characteristics of a translated version of the Organizational Commitment Scale among Malaysian teachers.
2. Sample: Data were collected from 1021 school teachers (male = 275, female = 746) from three states. Their mean age was 38.85 years (SD = 8.35 years).
3. Method: The present study employs a quantitative approach with a survey method. A 24-item Organizational Commitment Scale was used to gauge responses from the teachers. The responses were then used to provide evidence of the psychometric properties of the scale. This study employs WINSTEPS 3.63 to provide statistics and other relevant information from the Rasch Model analysis.
4. Results: Rating scale analysis showed that category 2 and category 3 of the ratings were not adequately different. A total of 6 items did not fit the model's expectation and thus were dropped from further analysis. The scale demonstrated high person reliability as well as a high separation index. No items demonstrated gender DIF. The ordering of items on the measured scale was satisfactory, and no threat to construct validity was reported.
5. Conclusions: Based on our analysis, the empirical evidence on the psychometric properties of the scale will provide important information for the future use of the scale, especially in relation to other constructs. This practical importance is essential since commitment is considered an important educational outcome in Malaysia. Even though scale validation is an ongoing process, it is perhaps not too far off the mark to say that the present research provides an important foundation for validation studies across different settings.
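Several abstracts in this program report a reliability alongside a separation index, for example KK_019 and the "greater than two" separation noted for KK_012. The two are linked by the standard relation G = sqrt(R / (1 - R)), so a separation of 2 corresponds to a reliability of 0.80. The sketch below uses illustrative reliabilities only and does not reproduce figures from any abstract.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    """Separation index G implied by a Rasch reliability R, G = sqrt(R / (1 - R))."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in the interval [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


# Illustrative reliabilities, not values taken from the abstracts.
for r in (0.80, 0.90):
    print(r, round(separation_from_reliability(r), 2))
```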


Paper No KK_020
Paper Title DEVELOPMENT AND VALIDATION OF THE PLAGIARISM TENDENCY AMONG MALAYSIAN POST-GRADUATE STUDENTS.
Email Address [email protected]
1st Author Anis Jauharah Abd. Kadir
Subsequent authors: Nur Riza Suradi, Mohd Salmi Md Noorani, Salbiah Md Salleh
1. Aims/Objectives of study: This exploratory study aimed to test the validity and reliability of an instrument developed to measure the tendency towards plagiarism among post-graduate students at a Malaysian university.
Key words: Plagiarism; attitude; knowledge; act.
2. Sample: A simple random sample of 125 Masters and Doctoral post-graduate students (40 males and 85 females).
3. Method: This study used a quantitative survey methodology, and the data were analysed using the Rasch model. One Malaysian university was used for this case study, which implies a single institutional culture. The instrument comprises 41 items that measure three constructs of plagiarism tendency: Knowledge, Act and Attitude. The instrument was evaluated with Winsteps version 3.73 in terms of reliability, item polarity, goodness of fit and unidimensionality.
4. Results: In terms of item polarity, the item correlations ranged from 0.01 to 0.71. Only one item had a negative correlation, and it was eliminated from the instrument. The person and item reliabilities were 0.69 and 0.98 respectively. In the misfit analysis, no items were eliminated: the infit mean squares ranged from 0.71 to 1.21 and the outfit mean squares from 0.70 to 1.77. Although one item had an outfit mean square above 1.50, it was retained because its infit mean square was still in a good range. For dimensionality, the raw variance explained by measures was 47.6%, the same as the modeled value, and the unexplained variance in the first contrast was 8.7%.
5. Conclusions: The study showed that Rasch modeling can help researchers refine their instruments systematically into quality instruments. It also indicated that the instrument has the potential to measure the tendency towards plagiarism among post-graduate students, with minor adjustments to the sample size and the items.

Paper No KK_021
Paper Title Predictors of self-assessment intention and practice among primary and secondary students in Hong Kong
Email Address [email protected]
1st Author Zi Yan
Subsequent authors: Gavin T. L. Brown, The University of Auckland, New Zealand; John C. K. Lee, The Education University of Hong Kong, Hong Kong
1. Aims/Objectives of study: This study aims to explore the predictors of students' self-assessment intentions and practices in the Hong Kong context.
2. Sample: The target population of the study is Primary 4 to Secondary 3 students in Hong Kong. A survey was conducted with around 1,500 students.
3. Method: The Theory of Planned Behaviour (TPB) (Ajzen, 1991) was applied as a theoretical framework for understanding students' self-assessment intentions and practices as well as their predictors. The analytical methods include Rasch analysis (Rasch, 1960), which was used to examine the psychometric properties of the scales and calibrate student measures on each of the latent traits, and path analysis, which was applied to investigate the relationships among the latent traits.
4. Results: Attitude, subjective norm, and self-efficacy were positive and significant predictors of self-assessment intention, while psychological safety was a negative predictor. Attitude and self-efficacy were positive and significant predictors of self-assessment behavior, while psychological safety was a negative predictor of behavior.
5. Conclusions: In general, TPB appeared to be an appropriate theoretical framework for explaining students' intentions and practices regarding self-assessment. Some non-TPB components (e.g., psychological safety) also played an important role in determining students' self-assessment intentions and practices.
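The person and item reliabilities reported for KK_020 (0.69 and 0.98) can be restated as separation indices. The sketch below uses the standard Rasch relationships G = sqrt(R / (1 - R)) and strata = (4G + 1) / 3. The abstract itself does not report separation values, so the numbers computed here are a derived illustration and not the authors' results, and the function names are hypothetical.

```python
import math


def separation(reliability: float) -> float:
    """Convert a Rasch reliability R into a separation index G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))


def strata(reliability: float) -> float:
    """Number of statistically distinct levels, (4G + 1) / 3."""
    return (4.0 * separation(reliability) + 1.0) / 3.0


# Reliabilities as reported in the KK_020 abstract.
person_separation = separation(0.69)
item_separation = separation(0.98)

print(f"person separation: {person_separation:.2f}, strata: {strata(0.69):.2f}")
print(f"item separation:   {item_separation:.2f}, strata: {strata(0.98):.2f}")
```

Read this way, a person reliability of 0.69 corresponds to a person separation of roughly 1.5, which is one way to interpret the abstract's remark that the sample size and items may need minor adjustment.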


Paper No KK_022
Paper Title The use of MCQ as Formative Assessment to Reflect Attainment of Desired Learning Outcome
Email Address [email protected]
1st Author Ximei Zhou
Subsequent authors: Pey-Tee Oon, William P. Fisher
1. Aims/Objectives of study: A vertically, horizontally and developmentally coherent assessment (NRC, 2006), moving away from the current unconnected and decontextualized assessment framework, offers hope of projecting a more comprehensive picture of student learning. The present study illustrates how formative assessment can be coherently developed and presented to reflect students' scientific understanding over time and to show which areas need reinforcement for learning to progress towards the desired learning outcome. This paper sets out to illustrate a developmentally coherent formative assessment.
2. Sample: The TIMSS 2015 instrument on the physics knowledge of 8th grade students was used for this purpose. A total of 4155 students from Hong Kong participated in the study, of whom 1974 (47.5%) were girls and 2181 (52.5%) were boys. The students' responses to the 26 restricted items were analysed. Only the results for the 13 MCQ items were retrieved for illustration.
3. Method: The quality of the data was examined first. Next, a scoring form for individual students was analysed and modified to show the attainment of learning outcomes, with items calibrated on a measurement scale.
4. Results: The results indicate a good fit of the data to the Rasch model. The infit and outfit MNSQ statistics of all items are 0.83-1.39, all within the acceptable range for the Rasch model (0.50-1.50) (Bond & Fox, 2007; Cheng & Oon, 2016; Oon & Subramaniam, 2011).

A scoring form for the individual student (Linacre, 1997) reflects the attainment of desired learning outcomes. Items were arranged from easiest to most difficult: Item S042182 was the easiest, with content difficulty -2.36, and Item S062044 the most difficult, with content difficulty 1.20, a spread of 3.56 logits. All students were calibrated on the measurement scale based on the probability of answering the items correctly or incorrectly; individual ability is indicated by the 'individual measure' relative to the overall class mean.

The attainment of learning outcomes can be read from the correct and incorrect responses: incorrect responses indicate conceptual understanding 'yet to be attained' and correct responses indicate understanding 'already attained'. From the items answered correctly or incorrectly, teachers know which areas of conceptual understanding need further reinforcement to help each student fully attain the desired learning outcome. This sets an entry point for teachers on what to reinforce next (Fisher, 2013).
5. Conclusions: The purpose of formative assessment is to improve learning outcomes. The present study provides an example for teachers of how to use MCQs to trace students' learning progression and to see which learning goals have and have not been attained.
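To make the logit scale in KK_022 concrete, the following minimal sketch applies the dichotomous Rasch model, P(correct) = exp(b - d) / (1 + exp(b - d)), to the two item difficulties reported in the abstract. The student ability of 0.0 logits is a hypothetical value chosen for illustration and is not taken from the study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Item difficulties (in logits) as reported in the abstract.
EASIEST = -2.36   # Item S042182
HARDEST = 1.20    # Item S062044

# Hypothetical student located at 0.0 logits (assumption for illustration only).
student_ability = 0.0

p_easiest = rasch_probability(student_ability, EASIEST)
p_hardest = rasch_probability(student_ability, HARDEST)
spread = HARDEST - EASIEST

print(f"spread between items: {spread:.2f} logits")
print(f"P(correct | easiest item) = {p_easiest:.2f}")
print(f"P(correct | hardest item) = {p_hardest:.2f}")
```

For such a student the easiest item is very likely to be 'already attained' while the hardest item is more likely 'yet to be attained', which is the reading of the scoring form that the abstract describes.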


Paper No KK_023
Paper Title Predicting Item Difficulty of a Knowing Numbers Test Using the Inverse Partial Credit Model
Email Address [email protected]
1st Author Ong Yoke Mooi
Affiliation IPG Kampus Ipoh
Subsequent authors: Lee Leh Hong (IPG Kampus Ilmu Khas), Maria Pampaka (The University of Manchester)
1. Aims/Objectives of study: The aim was to study the extent to which teacher trainers are able to predict the difficulty of items in a Knowing Numbers test.
2. Sample: A Knowing Numbers test was administered to 19 teacher trainers to record their perception of the difficulty of 22 items on a four-point Likert scale (very easy, easy, difficult, and very difficult). A dataset was drawn from 83 trainees' scores on the 22 items in the Knowing Numbers course final examination from two Teacher Training Institutes in Malaysia.
3. Method: The Rasch analysis was conducted with the Winsteps software to
1. construct a scale of teacher trainers' perception of item difficulty in the Knowing Numbers test using the Inverse Partial Credit model;
2. construct a scale of trainees' actual item difficulty in the Knowing Numbers test using the Partial Credit model (Masters, 1982);
3. compare teacher trainers' perception of item difficulty with trainees' actual item difficulty.

Hadjidemetriou and Williams (2004) used the Inverse Partial Credit Model to reveal contours of teachers' knowledge with respect to their students' graphical knowledge. To analyse the dataset with the Inverse Partial Credit model, we transpose the dataset: the rows become columns and the columns become rows, and we run the transposed data using the Partial Credit Model. In other words, the person becomes the instrument that measures the item difficulty as perceived collectively by teacher trainers.
4. Results: The Pearson correlation between teacher trainers' perception of item difficulty and trainees' actual item difficulty was 0.69 (n = 22, p = 0.0003). This shows a moderately strong correlation between the teacher trainers' perception of item difficulty and the trainees' actual item difficulty in the Knowing Numbers test. Further analysis shows that teacher trainers overestimated the difficulty of 5 items and underestimated the difficulty of 2 items.
5. Conclusions: Teacher trainers need to construct test items of varying difficulty to discriminate across the range of trainees' ability in a course. This study provides empirical evidence that teacher trainers do have the knowledge to estimate the difficulty of items by reading them. This skill is vital for teacher trainers in developing test items that match trainees' abilities.

References
Hadjidemetriou, C., & Williams, J. (2004). Using Rasch models to reveal contours of teachers' knowledge. Journal of Applied Measurement, 5(3), 243-257.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174.
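The transposition step described in the KK_023 Method can be sketched as follows. This only illustrates swapping rows and columns before the data are handed to a Partial Credit analysis; the small ratings matrix is hypothetical (3 teacher trainers by 4 items, coded 1 = very easy to 4 = very difficult) and the actual Winsteps run is not reproduced here.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical ratings: one row per teacher trainer, one column per item.
ratings_by_trainer = [
    [1, 2, 3, 4],
    [1, 3, 3, 4],
    [2, 2, 4, 4],
]

# After transposing, each row is an item and each column is a teacher trainer,
# so the trainers play the role of the "instrument" and the items are measured.
ratings_by_item = transpose(ratings_by_trainer)

for item_number, row in enumerate(ratings_by_item, start=1):
    print(f"item {item_number}: {row}")
```

In the study the matrix would have 19 teacher trainers and 22 items, so the transposed file would contain 22 rows of 19 ratings each.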


Paper No KK_024
Paper Title Validation of Medical Statistics Exam Paper: Conventional method versus RASCH.
Email Address [email protected]
1st Author Azmi Mohd Tamil, Universiti Kebangsaan Malaysia
Subsequent authors: Mohd Zali Mohd Nor, MyRASCH
1. Aims/Objectives of study: Medicine always requires researchers to validate their tools of measurement. Yet our own tools for measuring the knowledge of our students are rarely validated. The usual conventional measures, such as the difficulty index, discrimination index, reliability index and Standard Error of Measurement (SEM), are generated automatically, but they are rarely referred to or understood by examiners. The objective of this study is to compare the conventional methods against RASCH, to see which method of validation is superior.
2. Sample: The sample consists of all students taking the medical statistics exam paper, a total of 22 postgraduate students.
3. Method: The answers on the OMR sheets were scanned and converted into a flat database text file. The text file was converted into Excel format and analysed using the conventional method and RASCH. The indexes from the conventional method were also compared against the corresponding computer-generated indexes. Questions with poor discrimination and poor difficulty indexes were identified using both approaches.
4. Results: The conventional method and RASCH identified similar questions with poor discrimination and poor difficulty indexes. RASCH was also able to determine that the questions were too easy for the students, a clear item-person mismatch, and that the students could be separated into three groups, indicating that only 3 grades should be given.
5. Conclusions: RASCH is clearly superior to the conventional method in validating the exam paper. If we are able to create a work culture where lecturers always validate their exam questions, RASCH should be one of the tools utilised.

Paper No KK_025
Paper Title The Effects of Chronic Daily Fears on Students' Concept of Self: Towards Identifying Students being Bullied
Email Address [email protected]
1st Author Rense Lange
Affiliation ISLA - Vila Nova de Gaia, Portugal
Subsequent authors: Cynthia Martínez-Garrido and Alexandre Ventura
1. Aims/Objectives of study: Students may experience considerable fear and stress in school settings, and based on Dweck's (2006) notion of "mindset" we hypothesized that fear introduces qualitative changes in students' self-concepts. Moreover, these changes were expected to lead to lower student performance on academic tests of reading and mathematics.
2. Sample: Hypotheses were tested on 3847 students from nine Iberoamerican countries (Bolivia, Chile, Colombia, Cuba, Ecuador, Panama, Peru, Spain, and Venezuela).
3. Method: The 3847 students completed Murillo's (2007) adaptation of Marsh's (1988) SDQ-I. No overall (average) raw score differences were found. Questionnaire data were then analyzed with the Rasch rating scale model, using the questions' model residuals as predictors of student fear levels. In addition, these students took two assessments each in reading and mathematics (a pre- and a post-test).
4. Results: There are three classes of findings:

Psychological Distress:
As was anticipated, Rasch scaling indicated that the information content of High-Fear students' ratings was more localized across the latent dimension than was that of Low-Fear students, and their ratings also showed less cognitive variety.

Predicting Fear:
The resulting measurement distortions were captured via logistic regression over the ratings' residuals. Using training and validation samples (with 60% and 40% of all cases respectively), the changes in self-image were sufficiently strong to predict students' fear levels and their gender based on the distortions in their self-image.
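The 60/40 division into training and validation samples mentioned under "Predicting Fear" could be set up along the following lines. The abstract does not say how cases were assigned, so the random shuffle, the fixed seed, and the function name `split_cases` are all assumptions made for this sketch; the logistic regression itself is not shown.

```python
import random


def split_cases(case_ids, train_fraction=0.6, seed=0):
    """Randomly split case identifiers into training and validation lists."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)      # reproducible shuffle (assumed procedure)
    cut = int(len(ids) * train_fraction)  # size of the training sample
    return ids[:cut], ids[cut:]


# 3847 students, as reported in the abstract; IDs here are just 0..3846.
training, validation = split_cases(range(3847))

print(len(training), len(validation))
```

A model fitted on the training cases would then be evaluated only on the validation cases, so that its ability to predict fear levels is not overstated.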


Academic Performance:
Consistent with the fear effects, we found that fearful students attempted fewer items than did students without such fears, and this was the case across all levels of proficiency. As a result, low fear students performed much better on reading and mathematics than did high fear students, the overall effect being about 0.75 logits.
5. Conclusions: Fear in school changes students' self-image to the point where these distortions can predict their levels of fear, suggesting that it is possible to design early warning systems to identify students in need of attention and protection. We see the present findings as a first step towards implementing an online warning and detection system for signs of bullying and related issues among students.

Paper No KK_026
Paper Title DEVELOPMENT OF KKN-PPM PERFORMANCE INSTRUMENTS TO SUPPORT COMPREHENSION, SOCIAL SKILLS, AND DISCIPLINE STUDENTS OF SULTAN AGENG TIRTAYASA UNIVERSITY
Email Address [email protected]
1st Author Nurul Anriani
Affiliation Universitas Sultan Ageng Tirtayasa
Subsequent authors: Ahsanul Khair Asdar (Universitas Negeri Jakarta)
1. Aims/Objectives of study: To develop performance, comprehension, social skills, and discipline instruments for the implementation of KKN-PPM for students of Sultan Ageng Tirtayasa University.
2. Sample: The sample was 200 students of Sultan Ageng Tirtayasa University taking part in the KKN-PPM program, selected by simple random sampling.
3. Method: The research follows a developmental research model with the Confirmatory Factor Analysis (CFA) technique. The variables involved are comprehension, social skills, discipline, and performance, so the outcome of the research is a set of standard instruments for measuring students' comprehension, social skills, discipline and performance in the implementation of KKN-PPM. The data are primary data in the form of responses to the items of the four instruments given by the 200 students. The collected data were processed in two stages: (1) First Order Confirmatory Factor Analysis and (2) Second Order Confirmatory Factor Analysis. The whole analysis was done with Lisrel 8.80 Full Version software.
4. Results: After two rounds of trials and revisions, standard instruments were obtained to measure the comprehension, social skills, discipline, and performance of the students in the implementation of KKN-PPM.

Comprehension instrument: In the First Order CFA (maximum likelihood estimation), the item parameters (λ) ranged from 0.603 to 0.827 in the Construction dimension, from 0.580 to 0.738 in the Process dimension, and from 0.550 to 0.775 in the Drawing Conclusions dimension. All items had t-values > 1.96 at the 0.05 level, indicating that the items are valid and feasible to use, with a construct validity value of 0.969. In the Second Order CFA, the factor loadings of the dimensions on the construct were valid (> 0.5): 0.996 for construction, 0.986 for process, and 0.997 for drawing conclusions, with a construct reliability value of 0.844.

Social skills instrument: In the First Order CFA (maximum likelihood estimation), the item parameters (λ) ranged from 0.639 to 0.967 in the Peer Relationship dimension, from 0.826 to 0.955 in the Self-Management dimension, from 0.739 to 0.951 in the Academic Success dimension, from 0.909 to 0.966 in the Compliance dimension, and from 0.799 to 0.979 in the Assertiveness dimension. All items had t-values > 1.96 at the 0.05 level, indicating that the items are valid and feasible to use, with a construct reliability value of 0.996. In the Second Order CFA, the factor loadings of the dimensions were valid (> 0.5): 0.950 for peer relationship, 0.999 for self-management, 0.977 for academic success, 0.999 for compliance, and 0.999 for assertiveness, with a construct validity value of 0.962.

Discipline instrument: In the First Order CFA (maximum likelihood estimation), the item parameters (λ) ranged from 0.894 to 0.982 in the Responsibility dimension, from 0.862 to 0.978 in the Self-Development dimension, and from 0.864 to 0.986 in the Self-Control dimension. All items had t-values > 1.96 at the 0.05 level, indicating that the items are valid and feasible to use, with a construct validity value of 0.997. In the Second Order CFA, the factor loadings of the dimensions were valid (> 0.5): 0.989 for responsibility, 0.999 for self-development, and 1.000 for self-control, with a construct validity value of 0.966.

Performance instrument: In the First Order CFA (maximum likelihood estimation), the item parameters (λ) ranged from 0.715 to 0.959 in the Preparation dimension, from 0.603 to 0.928 in the Implementation dimension, and from 0.818 to 0.987 in the Reporting dimension. All items had t-values > 1.96 at the 0.05 level, indicating that the items are valid and feasible to use, with a construct validity value of 0.990. In the Second Order CFA, the factor loadings of the dimensions were valid (> 0.5): 0.999 for preparation, 0.978 for implementation, and 0.992 for reporting, with a construct validity value of 0.913.
5. Conclusions: Based on the analysis, standard instruments for comprehension, social skills, discipline, and performance were obtained. The comprehension instrument is composed of the dimensions of construction, process, and conclusion. The social skills instrument consists of the dimensions of peer relationship, self-management, academic success, compliance, and assertiveness. The discipline instrument consists of the dimensions of responsibility, self-development, and self-control. The performance instrument consists of the preparation, implementation, and reporting dimensions.
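The second-order validity criterion stated in the KK_026 Results (factor loading > 0.5) can be checked mechanically. The sketch below is a simple restatement of that rule applied to the loadings quoted in the abstract; it is not the LISREL analysis, and the dictionary layout and function name are illustrative assumptions.

```python
LOADING_THRESHOLD = 0.5  # criterion stated in the abstract

# Second-order factor loadings as reported in the abstract.
second_order_loadings = {
    "comprehension": {"construction": 0.996, "process": 0.986, "conclusion": 0.997},
    "social skills": {
        "peer relationship": 0.950,
        "self-management": 0.999,
        "academic success": 0.977,
        "compliance": 0.999,
        "assertiveness": 0.999,
    },
    "discipline": {"responsibility": 0.989, "self-development": 0.999, "self-control": 1.000},
    "performance": {"preparation": 0.999, "implementation": 0.978, "reporting": 0.992},
}


def dimensions_below_threshold(loadings, threshold=LOADING_THRESHOLD):
    """Return, per instrument, the dimensions whose loading does not exceed the threshold."""
    return {
        instrument: [name for name, value in dims.items() if value <= threshold]
        for instrument, dims in loadings.items()
    }


flagged = dimensions_below_threshold(second_order_loadings)
for instrument, weak in flagged.items():
    print(instrument, "->", weak if weak else "all dimensions above threshold")
```

With the reported values every dimension clears the threshold, which matches the abstract's statement that all dimensions are valid.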


Paper No KK_027
Paper Title Development of a Diagnostic English Grammar Test for Malaysian Lower Secondary School Students
Email Address [email protected]
1st Author Kho Chung Wei
Affiliation Faculty of Education, University of Malaya
1. Aims/Objectives of study: The study aimed to develop a diagnostic English grammar test for Malaysian lower secondary school students. Specifically, the study assessed the psychometric properties of the diagnostic test, determined whether any of the test items is biased for some examinees, and examined whether there is any significant association between the examinees' scores on the diagnostic test and their school English exam scores.
2. Sample: The study was conducted on the whole Form 2 student population in a secondary school in Sarawak. Altogether, there were 202 Form 2 students spread across six classes. No sampling was done, as it was more practical and useful for the school to administer the test to the whole population. On the day of the test administration, however, 34 students were absent from school; thus, the final sample for the study was 168. This represented a response rate of 83.16%. Using existing data, it was found that the sample was not significantly different from the population in terms of gender proportion and past English exam scores, suggesting that the non-responses did not introduce any bias in the study.
3. Method: The study was a small-scale pilot study utilizing a cross-sectional survey design. It began with observation and initial assessment of potentially problematic areas of language knowledge. This provided the basis for the preparation of a diagnostic English grammar test. The test consists of 55 multiple-choice items. After the test was prepared, it was administered to students from the research site. Existing data such as name, gender, past English exam scores and whether students had undergone an additional year of secondary schooling were obtained from the school database with permission from the school administrator. The data collected were then analysed using Winsteps version 3.66.0, jMetrik version 4.0.3 and SPSS Statistics version 21.
4. Results: Overall, the examinees' responses to the diagnostic English grammar test fit the Rasch model, with 85.45% of items showing good fit. There was also no strong evidence of a secondary dimension, suggesting that the assumption of psychometric unidimensionality of the test was not violated. The test scores were significantly, strongly, and positively associated with the school English exam scores. The test had a high reliability index and was sensitive enough to classify examinees into 3 ability levels. The difficulty of the test was suitable for the current sample, and the item difficulty can be stratified into five levels. Of the 55 items, 81.82% were able to differentiate between examinees of different ability estimates; 92.73% did not exhibit any gender DIF; and 89.09% did not indicate any DIF in terms of an additional year of secondary schooling. These findings imply that there was a substantial number of good items that can be retained.

However, 5 items appeared to be problematic in more than one aspect: Item 4 (underfit and low discrimination); Item 33 (overfit and too difficult); Item 42 (underfit and low discrimination); Item 21 (too easy, low discrimination and gender DIF); and Item 32 (underfit, low discrimination, DIF in terms of additional schooling year, and farthest from Rasch convergence). As such, these items would be eliminated from future versions of the test. This leaves approximately 90.91%, or 50 items, in the item bank. Of these 50 items, 1 would be revised for underfit; 6 would be revised to enhance their discrimination; 3 would be reviewed for gender bias; and 5 would be reviewed for bias in terms of an additional year of secondary schooling. Additional items would also need to be drafted to fill the gaps in item difficulty.
5. Conclusions: The diagnostic English grammar test had good psychometric properties, but some items needed to be reviewed and revised or eliminated. This implies that problematic items can be present in a test with overall sound psychometric properties. The detection of such items justifies the need for extensive pilot testing in the test development process, and it implies that the development of a diagnostic language test can be a cyclical and never-ending process. The findings of the study are significant to test developers and researchers who are interested in developing a diagnostic English language test for Malaysian lower secondary school students, as well as to the examination board.
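The percentages in the KK_027 Results all refer to a 55-item test, so they can be converted back into item counts as a quick consistency check. The sketch below does only that arithmetic with the figures quoted in the abstract; the variable names are illustrative.

```python
TOTAL_ITEMS = 55

# Percentages of the 55 items as reported in the abstract.
reported_percentages = {
    "good fit": 85.45,
    "adequate discrimination": 81.82,
    "no gender DIF": 92.73,
    "no schooling-year DIF": 89.09,
}

# Convert each percentage back to a whole number of items.
item_counts = {
    label: round(TOTAL_ITEMS * pct / 100) for label, pct in reported_percentages.items()
}

# Items kept after removing the 5 problematic ones, as a percentage of the bank.
retained_items = TOTAL_ITEMS - 5
retained_pct = round(100 * retained_items / TOTAL_ITEMS, 2)

# Final sample after the 34 absentees are removed from the 202 students.
final_sample = 202 - 34

for label, count in item_counts.items():
    print(f"{label}: {count} of {TOTAL_ITEMS} items")
print(f"retained: {retained_items} items ({retained_pct}%)")
print(f"final sample: {final_sample} students")
```

Each reported percentage corresponds to a whole number of items, which suggests the figures in the abstract are internally consistent.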


Paper No KK_029
Paper Title Using Rasch Model for the Development of Intention to Stay Scale (ITSS) among Medical Academics at Public Universities
Email Address [email protected]
1st Author Wan Ismahanini Ismail
Affiliation Faculty of Educational Studies, Universiti Putra Malaysia, Serdang, Malaysia
Subsequent authors: 1 Roziah Mohd. Rasdi, 3 Rahinah Ibrahim, 4 Bahaman Abu Samah (1, 3, 4 Faculty of Educational Studies; 3 Faculty of Design and Architecture, Universiti Putra Malaysia, Serdang, Malaysia)
1. Aims/Objectives of study: In this study, we discuss the initial stage of creating a new scale, the Intention to Stay Scale (ITSS), using the framework of Liu (2010) and the Rasch model for scale evaluation, to ensure that a psychometrically sound measure is created and to collect evidence of validity.
2. Sample: The survey was administered to a convenience sample of 52 medical academics from two public universities.
3. Method: Several analyses were used in the early stage of data collection on the 50-item scale: rating scale calibration, rating scale analysis, item fit, person misfit order, variable map, separation and reliability for persons and items, principal component analysis (unidimensionality) and Differential Item Functioning (DIF).
4. Results: In the first pilot test, which surveyed 52 medical academics from two public universities, a unidimensional construct of ITSS was empirically established after misfitting items and persons were removed and scale modifications were made.
5. Conclusions: The ITSS can be used in subsequent tests to measure medical academics' intention to stay in service. The scale also shows construct validity as displayed on the variable map, where the item difficulty levels are not too different from the conceptual framework of the proposed research.

Paper No KK_030
Paper Title Development of a Model of Positive L2 Self using the Rasch Model
Email Address [email protected]
1st Author J. Lake
Affiliation Fukuoka Jo Gakuin University
Subsequent authors: Keita Kikuchi (Affiliation: Kanagawa University)
1. Aims/Objectives of study: The field of positive psychology has been growing rapidly in the past few years. Interest in applying positive psychology to education is a more recent development (e.g., Furlong, Gilman, & Huebner, 2014; White & Murray, 2015). A few researchers have applied it to the field of second language (L2) learning in a variety of contexts and at a range of identity or self levels, from the general and trait-like to the specific and state-like (e.g., Gabryś-Barker & Gałajda, 2016; Lake, 2013; MacIntyre, Gregersen, & Mercer, 2016). The presenters show the process of developing a model of positive L2 self that integrates constructs of positive psychology and motivation in the context of L2 learning.
2. Sample: This study was based on questionnaires given to over 3,500 Japanese college students and case study interviews conducted with a small number of them.
3. Method: We used Winsteps for the Rasch analysis of global positive self-constructs of flourishing, curiosity, and hope; positive L2 self-constructs of interest, passion, and mastery goal orientations; and L2 self-efficacy in speaking, listening, and reading. These measures were then used with AMOS to construct a structural model of a Positive L2 Self.
4. Results: The measures fit the Rasch model, and an acceptable fit was found for a structural model of a positive L2 self that incorporated constructs from positive psychology and second language motivation.
5. Conclusions: Using both quantitative and qualitative data, the presenters discuss how these constructs can be applied by researchers and educators in developing positive identities of language learners.


Paper No KK_031
Paper Title Developing a Vocabulary Specification Equation for Second Language Learners
Email Address [email protected]
1st Author J. Lake
Affiliation Fukuoka Jo Gakuin University
1. Aims/Objectives of study: This presentation describes the development of a vocabulary specification equation for second language (L2) learners for diagnostic purposes. I will first briefly review and summarize the literature on corpus studies, wordlists, vocabulary tests, and test specifications. Next I will explain how a vocabulary test was developed based on the research reviewed and following the procedures described in the test and item specifications for Japanese university students learning English as a second language.
2. Sample: Over 350 Japanese university students were sampled.
3. Method: Winsteps was used for a Rasch analysis of the test.
4. Results: The results showed that a test with items based on frequency could be produced with the specifications as a guideline, and that the resulting item difficulties had a strong relationship to the frequencies, which was captured by the vocabulary specification equation.
5. Conclusions: These results suggest that the specification equation can be used systematically by teachers and learners to guide vocabulary study in the development of student L2 proficiency.

Paper No KK_032
Paper Title Measuring the Validity and Reliability of Arabic Vocabulary Knowledge Test Using Rasch Model Approach
Email Address [email protected]
1st Author Zunita Mohamad Maskor
Affiliation Faculty of Education, Universiti Kebangsaan Malaysia, Malaysia.
Subsequent authors: Harun Baharudin & Maimun Aqsha Lubis, Faculty of Education, Universiti Kebangsaan Malaysia, Malaysia.
1. Aims/Objectives of study: To investigate the validity and reliability of the Arabic Receptive Vocabulary Test (ARVT) using the Rasch Measurement Model.
2. Sample: Data were collected from a vocabulary test, the Arabic Receptive Vocabulary Test (ARVT), answered by 106 Form Four students at an Islamic Religious Secondary School in Perak.
3. Method: The Arabic Receptive Vocabulary Test (ARVT) was developed to measure receptive vocabulary knowledge in Arabic. The purpose of the test was to find out the number of words known to the students. The test was developed by simple random sampling from A Frequency Dictionary of Arabic (Buckwalter & Parkinson, 2011), based on a ratio of 1:100 derived from the 2000-word frequency list, and consists of 25 dichotomous items with a Yes/No answering pattern. It was answered by 106 respondents selected from among Form Four students and was administered within 15 minutes. Rasch analysis was done using Winsteps software version 3.72.3 to investigate whether the test was unidimensional and fit the model.
4. Results: The test was unidimensional and fit the Rasch model's expectations.
5. Conclusions: The findings demonstrated that item reliability and item separation met the model's expectations. The Arabic Receptive Vocabulary Test (ARVT) can be used to measure Arabic vocabulary acquisition among secondary students.


Paper No KK_033
Paper Title A Rasch Analysis of the Reading, Grammar, and Essay Sections of a Japanese University Entrance Examination
Email Address [email protected]
1st Author Kristy King Takagi
Affiliation Chuo University, Tokyo Japan
Subsequent authors
1. Aims/Objectives of study: University entrance examinations in Japan have been widely criticized, especially since the 1995 landmark studies of Brown and Yamashita. However, there are few universities that have responded to such criticisms to the extent that the exams have been significantly altered. One format that remains standard for many English entrance exams at Japanese universities is: a reading passage with comprehension questions; grammar exercises, such as fill in the blank and arrange English sentence parts in correct order, based on a Japanese translation; and an essay prompt that is related to the reading passage. The essay target length is usually short (sometimes less than 100 words), so that the final piece of writing often resembles a paragraph more than an essay. The purpose of this study is to examine whether each section of the entrance examination contributed to assessment of student applicants, and to determine the relationship among the examination sections.
Key Words: university entrance examinations in Japan, Japanese higher education
2. Sample: The data used for this study come from an actual entrance examination which was administered to 54 students of high school age, by a large private university in eastern Japan in the fall of 2016. Although student names and background information were not available, the students could generally be described as high school seniors from a variety of prefectures throughout Japan.
3. Method: The fit, difficulty, and reliability of the entrance examination components will be assessed using the Rasch model. Specifically, FACETS software will be used to examine the essay ratings of two examiners, together with dichotomous scores on the reading passage comprehension questions and two grammar sections of the test.
4. Results: Although the analysis of the entrance examination data is currently in progress, preliminary results show that, although dichotomous test items demonstrated reliability of .90, and essay ratings, .94, there were a number of problems, particularly with the dichotomous test items. First, the difficulty level of the test sections was inconsistent. The reading passage comprehension questions were considerably easier than the grammar questions. In addition, as a whole, the test was generally too easy for applicants. In other words, the distribution of test items was not a good match to the distribution of applicant ability. A number of the highest ability applicants had no test items at their level of ability, and nearly 10 of the 30 dichotomous items were too easy for all applicants. On the other hand, the essay ratings, which were the lowest of all test scores, demonstrated good fit. The essay ratings had little in common with the reading passage comprehension questions, but demonstrated medium to large correlations with the grammar test item totals.
5. Conclusions: Data from actual university entrance examinations in Japan can be nearly impossible to obtain, primarily because of the purported need to protect the privacy of students. The upshot of such policies is that these tests cannot be evaluated in an objective manner and then revised accordingly. It is hoped that the results of this small study can be useful in terms of providing ideas for test revision for the university that generously gave consent for use of these data, and that at least a small ripple effect in higher education in Japan might result as well.


Paper No KK_034
Paper Title Applying Rasch Model To Identify A Contribution of Marital Status in Perceived Social Support of Merapi Volcanic Eruption Mount Survivors
Email Address [email protected]
1st Author Chandra C. A. Putri
Affiliation Indonesia University of Education
Subsequent authors Ifa Hanifah Misbach (Indonesia University of Education)
1. Aims/Objectives of study: The aim of this study is to identify differences in perceived social support sources based on marital status among survivors of the Mount Merapi volcanic eruption.
2. Sample: Samples were selected using a convenience sampling technique and consisted of 82 survivors of the Mt. Merapi volcanic eruption (20-40 years old) who lived in Cangkringan, the sub-district with the highest number of survivors.
3. Method: This research used Rasch modelling through differential item functioning (DIF) to identify differences in response patterns for perceived social support sources based on marital status.
4. Results: There is a difference in response patterns between the 46.34% of respondents who are married and the 53.66% who are not. The DIF curve showed that respondents who are married found the items easier to answer than those who are not married, especially on items linked to support from significant others.
5. Conclusions: Marital status can differentiate the perceived social support responses of survivors of the Mount Merapi volcanic eruption.

Paper No KK_035
Paper Title Assessing Pedagogical Content Knowledge of The Particle Theory of matter and Phase Change in Pre-service Science Teacher
Email Address [email protected]
1st Author Maryati
Affiliation Yogyakarta State University
Subsequent authors Zuhdan Kun Prasetyo, Yogyakarta State University; Insih Wilujeng, Yogyakarta State University; Bambang Sumintono, Malaya University
1. Aims/Objectives of study: This research aims to assess the quality of PCK in pre-service secondary science teachers on a specified topic: the particle theory of matter and phase change.
2. Sample: The sample in this research consists of 16 pre-service secondary science teachers who are members of a professional teacher training programme, with 32 lesson plans and videotaped instructional sessions.
3. Method: This is quantitative research that measures teachers' PCK with a PCK rubric developed on the basis of Magnuson et al.'s PCK component model. Measurement involved multiple raters and was analyzed by many-facet Rasch measurement.
4. Results: Results indicate that the PCK of Indonesian pre-service secondary science teachers is still low, especially in knowledge of science curricula, knowledge of students' understanding of science, and knowledge of instructional strategies.
5. Conclusions: The PCK of science teachers in Indonesia, as a criterion of professional teachers, still needs to be improved, and the science teacher education curriculum must be reformed.


Paper No KK_036
Paper Title Preliminary Report on the Development and Calibration of a Rasch Scale to Measure Chinese Reading Comprehension Ability in Singaporean 2nd Language Primary School Students, Part II
Email Address [email protected]
1st Author Chung Tze Min
Affiliation Commontown Pte Ltd
Subsequent authors Mohd Nor, M. Z., Newstar Agencies; Yan, R. J. J., Commontown Pte Ltd; Loo, J. P. L., Commontown Pte Ltd
1. Aims/Objectives of study: To help teachers place students in their appropriate Chinese reading comprehension levels, a measurement scale of reading comprehension was created and calibrated in 2016. As misfitting items were found in the 2016 Rasch analysis, about half of the items have been refined and trialed to date. The current study evaluated and validated the measurement functioning of the reading comprehension scale.
2. Sample: A total of 19,418 students to date from 42 schools participated in the adaptive placement test. Their school grades ranged from Primary 1 to 6; ages ranged from 7 to 12 years old.
3. Method: Rasch analyses were conducted to validate the reading comprehension scale. In addition, the average item measure for each passage was calculated and correlated with the passage's Lexile measure as well as with teachers' levelling of the passage. Furthermore, the average item measure for each question type was calculated to find out if items that required more cognitive effort to answer were indeed more difficult. We also compared the number of misfitting items for outfit ZSTD with last year's number.
4. Results: Results indicated the Rasch indices were within acceptable ranges for a low-stakes standardized test (Person indices: Infit MNSQ = .99, Outfit MNSQ = .93, Separation = 2.58, Reliability = .87; Item indices: Infit MNSQ = 1.03, Outfit MNSQ = .99, Separation = 6.66, Reliability = .98). The items were, however, more difficult for the students tested (Item measure = 0, Person measure = -.86).
Average item measures and teachers' levelling of the passages were moderately correlated with the passages' Lexile measures (r = .61 and r = .70 respectively), while average item measures and teachers' levelling were strongly correlated (r = .90).
Dimensionality analysis results show that 30.3% of the variance was explained by the latent trait, which is close to the general guideline of 29.5% for computer adaptive tests (Linacre, 2014). This suggests the data can be accounted for by a single dimension, which is the latent trait of item difficulty. The eigenvalue for the unexplained variance in the first contrast is 3, which is higher than the recommended value of 2, but the variance explained by the first contrast is only .2%.
Items that require more cognitive effort to answer were indeed more difficult (e.g., average item measures for is -2.27, vocabulary, -.82, low cognitive level processing -.39, sentence structure, .88 and high cognitive level processing, 1.76) and the difficulty levelling of the various question types agreed with the findings reported by Meneghetti, Carretti, and De Beni (2006). The number of erratic items has also fallen from 12.5% (184 out of 1,462 items) of total items analyzed in 2016 to 7.6% (91 out of 1,194 items) in 2017.
5. Conclusions: The reading comprehension scale is valid in measuring children's reading comprehension abilities. As item refinement over the past year has improved the quality of the scale, we will continue to refine the remaining items and trial them in 2018.
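The separation and reliability indices reported for KK_036 can be cross-checked against each other. The sketch below assumes the standard Rasch relation between the separation index G and reliability R, namely R = G^2 / (1 + G^2); the function names are illustrative only and are not taken from Winsteps or any other package named in this book.

```python
def reliability_from_separation(g: float) -> float:
    """Return reliability R = G^2 / (1 + G^2) for a separation index G."""
    if g < 0:
        raise ValueError("separation must be non-negative")
    return g * g / (1.0 + g * g)


def separation_from_reliability(r: float) -> float:
    """Invert the relation: G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return (r / (1.0 - r)) ** 0.5


if __name__ == "__main__":
    # Separation values as reported in the KK_036 results.
    for label, g in [("person", 2.58), ("item", 6.66)]:
        print(label, round(reliability_from_separation(g), 2))
```

Under this relation the reported separations of 2.58 and 6.66 correspond to reliabilities of about .87 and .98, which agrees with the figures quoted in the abstract.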


Paper No KK_037
Paper Title Multidimensional Rasch Analysis of Teaching Role-Specific Esteem
Email Address [email protected]
1st Author Yu-Shu Chen
Affiliation National Chung Cheng University, Taiwan
Subsequent authors Yuan-Chi Lai, WuFeng University, Taiwan
1. Aims/Objectives of study: Self-esteem is a central construct in psychological theory. However, it is coupled with disagreement over how the construct is conceived. The lack of consensus has been lamented by researchers over the years (Tafarodi & Swann, 2001). Through literature review, we adopt the position that teaching role-specific esteem consists of two distinct dimensions, which are role-competence and role-liking. That is, individuals take on value both by merit of what they can do and what they appear to be as teachers. The former is founded on teaching abilities and talents, the latter on attractiveness and other aspects of teaching role worth. The main objective of the present study was to examine the validity of the teaching role-specific esteem scale using multidimensional rating scale analysis.
Key words: multidimensional Rasch model, rating scale model, role-competence, role-liking, teaching role-specific esteem.
2. Sample: Teachers from elementary, junior, and senior high schools in Taiwan comprised the sample for the current study. A total of 747 teachers were administered the 16-item teaching role-specific esteem scale consisting of 2 dimensions. Participants were invited to take part in a survey and complete the teaching role-specific esteem scale.
3. Method: Because both role-competence and role-liking were designed to measure teaching role-specific esteem, the multidimensional form of the rating scale model, in which the items within a scale are judged on the same kind of rating scale (Andrich, 1978), was used to analyze the data. To ascertain whether the scale items fit the model, two kinds of analysis were conducted. One was item fit analysis, and the other was the analysis of differential item functioning (DIF; Holland & Wainer, 1993).
4. Results: In this study, we were interested in two kinds of DIF: gender (two groups) and employment styles (two groups). Separate group analyses showed that two items have MNSQ outside the critical range (0.70, 1.30). After deleting DIF items, the remaining items of each dimension in the teaching role-specific esteem scale constituted a single construct. The correlations (.55) between dimensions obtained from the multidimensional approach were much higher than those obtained from the unidimensional approach.
5. Conclusions: The dimensionality of the teaching role-specific esteem scale is between-item multidimensionality. The results have demonstrated that the multidimensional rating scale model can be used to validate the scale, which is useful for researchers and practitioners interested in investigating teaching role-specific esteem. Limitations and directions for future research were discussed.


Paper No KK_038
Paper Title Improving Teaching and Student Learning through Evaluation of one TOEIC Preparation Textbook
Email Address [email protected]
1st Author YihYeh Pan
Affiliation Sanno University
Subsequent authors Kristy Takagi, Chuo University
1. Aims/Objectives of study: There are hundreds of TOEIC preparation textbooks used at universities, private English schools, and cram schools in Japan. But there are few studies of the quality of items used in these books. The aim of this project is to evaluate the test items in one TOEIC preparation textbook which is commonly used in university English courses in Japan. In order to determine whether the book chapters move from easiest to most difficult items, or remain stable in difficulty, as would be expected in a well-written test preparation book, the evaluation of test items will cover two areas: 1) All 100 items, from the five chapters, will be assessed for fit, difficulty and reliability, and 2) The difficulty level of items used in each chapter will be investigated.
2. Sample: The participants are all Japanese students currently studying at a university in eastern Japan. The participants are 19 to 21 years old. Most have studied English for six to seven years. According to their placement test scores derived from the entrance exam, these students were placed in the most advanced English language class in the university.
3. Method: In the first evaluation of test items, the fit, difficulty and reliability of items will be assessed using the Rasch Model. These results will also provide insight into the second evaluation, of the difficulty level of items used in each chapter. In addition, the progression of difficulty of the five chapters in the TOEIC preparation book will be considered in light of the placement of students in relation to test items on the variable maps of the five chapter tests.
4. Results: Currently we are still in the process of collecting data. However, based on past experience with student performance on and reaction to test items, we predict that the test items of these TOEIC preparation book chapters will not move from the easiest to the most difficult, or even remain stable in difficulty level, which one expects from a well-written test preparation book. In other words, we predict that the TOEIC preparation textbook will be flawed in ways related to difficulty level, and that revision is needed in order to improve both the teaching and learning experience.
5. Conclusions: This kind of analysis is needed for many reasons. Teachers need to understand the level of difficulty of test preparation book chapters more deeply in order to plan lessons and teach more effectively. As for publishers, they should pursue this kind of analysis so that they can revise and improve the texts they produce. These publishers and TOEIC test preparation textbook writers should both be more accountable for the progression of difficulty in TOEIC preparation textbooks because students' skill development and confidence are both at stake.


Paper No KK_039
Paper Title DEVELOPMENT OF INSTRUMENTS FOR MEASURING MATHEMATICAL LOGICAL THINKING ABILITY COLLEGE STUDENTS IN KAPITA SELEKTA
Email Address [email protected]
1st Author Novaliyosi
Affiliation Universitas Sultan Ageng Tirtayasa, Banten, Indonesia
Subsequent authors
1. Aims/Objectives of study: This study aims to develop instruments to measure the mathematical logical thinking ability of college students in Kapita Selekta.
2. Sample:
3. Method: The development stages used in this study include: (1) defining variables; (2) describing the variables in more detailed indicators/dimensions; (3) arranging the items; (4) conducting trials; (5) analyzing the validity and reliability.
4. Results: The legibility trial shows that the instruments are easy to read and well understood by students, and the validity and reliability tests show that the instrument for mathematical logical thinking ability falls into the category of valid and fit for use as an instrument.
5. Conclusions: The instruments can be used to measure the mathematical logical thinking ability

Paper No KK_040
Paper Title Assessing competency level among SIPartners+ using Rasch Model approach
Email Address [email protected]
1st Author Hishamuddin Hashim
Affiliation
Subsequent authors Roland@Rozaidi Abu Hajjan, Nordin Tahir, Nurul Badar Mohd Salleh, Mohd Jalani Hasan, Raja Hamizah Raja Harun, Nurulhidayah Sukiman, Siti Sarah Baharom, Ismail Mohamad, Mohd Kashfi Mohd Jailani
1. Aims/Objectives of study: With the aim of improving the management quality of school leaders in Malaysia, the Ministry of Education (MOE) has come up with the initiative of appointing selected officers as School Improvement Partners (SIPartners+). They are responsible for improving the quality of leadership through coaching and mentoring, as well as implementing interventions. They also serve as subject matter experts in school leadership development.
The aim of this research was to identify the competency profile of SIPartners+ who were attached to District Education Office/State Education Office throughout Malaysia.
2. Sample: The instrument was administered to 220 SIPartners+ in District Education Office/State Education Office throughout Malaysia. The respondents consisted of 160 males (73%) and 60 females (27%) who were selected using a stratified random sampling method. 148 of the respondents (67%) possessed postgraduate degrees while 72 (33%) were degree holders.
3. Method: The research used a self-administered questionnaire, the Facilitator Competency Profile Instrument (FCPI 2). It contained 99 items with 5 competency elements: General, Coaching & Mentoring, Subject Matter Expert, Clients Advancement Plan, and Needs Analysis/Evaluation. However, only items related to skill competency level were involved for the purpose of this research.
The data collected were analysed using Rasch Model Analysis.
Summary statistics were used to find the reliability and separation of person and


items.
The overall performance in the assessment was analysed using Person-Item Distribution.
4. Results: The summary statistics output reveals that the value of Cronbach's alpha is 0.98. The person reliability is 0.96. The separation of person (G) is 4.98.
The summary statistics for items show reliability at 0.97. They also show that the items have good difficulty measurement in measuring SIPartners+ ability. The separation of items is 6.04.
The Wright map for the (FCPI 2) shows the distribution of all persons and items on the logit measurement ruler. The majority of SIPartners+ have mastered Effective Communication, while the least mastered skill was Academic Writing.
5. Conclusions: The aim of this study was to identify the competency profile of SIPartners+. Based on the data derived from the questionnaire, it can be concluded that most SIPartners+ excel in the competency profile. This indicates that they have mastered all the skills needed as SIPartners+. However, they are still lacking in the Needs Analysis skill. Therefore, immediate action needs to be taken by relevant authorities to ensure that all SIPartners+ will acquire this skill in the future.

Paper No KK_041
Paper Title Validating Value Domain of the Facilitator Competency Profile Instrument SIPartners+-2 (FCPI-SIPartners+-2) Using Rasch Model Analysis
Email Address [email protected]
1st Author Raja Hamizah Raja Harun
Affiliation
Subsequent authors Nurulhidayah Sukiman, Siti Sarah Baharom, Ismail Mohamad, Mohd Kashfi Mohd Jailani
1. Aims/Objectives of study: The competency profile expands on three main domains, namely knowledge, skills and values, that serve as the basic requirements to be possessed by School Improvement Partners+ (SIPartners+) in enhancing their competency and potential. This research aims to validate the value domain of the FCPI-SIPartners+-2 for SIPartners+ officers' competency profile. Based on the findings of this research, the value domain of the competency profiles for the SIPartners+ group will be enhanced accordingly.
The purpose of this paper was to validate and examine the reliability and validity of FCPI-SIPartners+-2.
2. Sample: The instrument was administered to 220 SIPartners officers throughout Malaysia. Their ages ranged from 40 to more than 50 years old. The samples were selected using stratified random sampling.
3. Method: This study focused on the value domain of the FCPI-SIPartners+-2 (Likert scale 1-5), which consists of 29 items. To gauge the validity and reliability of items and respondents, Winsteps version 3.68.2 was used. The Rasch model was used because it can measure person reliability and item reliability and is more robust compared to Cronbach's Alpha. It also allows item elimination based on t-value and differential measure. Analysis was based on item and respondent validity, item and respondent reliability, item fit, item difficulty and respondent ability.
4. Results: All items have values of PTMEA CORR from 0.53 to 0.81. The Cronbach's alpha value (KR 20) of FCSI-2 for the value domain indicates high reliability of the questionnaire at 0.98, with an item reliability index value of 0.95 and a person reliability value of 0.92. Value of infit/outfit item shows there are three items which have more than 1.4 that


Item 70 PR = 1.42 logit, Item 21 U = 1.88 logit and Item 20 U = 1.48 logit. Based on the item map results, three items have been identified as difficult.
5. Conclusions: Through Rasch model analysis, researchers have obtained high validity and reliability values from the test conducted. This means that the questionnaire is valid and reliable to validate the competency profiles. The item reliability is high and this means the items are stable. All PTMEA values are positive, which shows that all the items used are parallel to the measurement in terms of validity. In examining the fit statistics, the outfit and infit MNSQ statistics used the range of 0.60 to 1.40 as the basis. Three items need to be improved. In terms of item difficulty and respondent ability, three items also need to be taken into account for improvement.

Paper No KK_042
Paper Title Measuring Scientific Literacy: Using the Rasch Model Analysis to Determine Student Competency Using Data from PISA 2015
Email Address [email protected]
1st Author Nor Azizi bt Abdullah
Subsequent authors Dr. Wan Raisuha bt Wan Ali (PhD)
1. Aims/Objectives of study: To identify factors that influence students' competency in providing responses for the items of Scientific Literacy in PISA 2015.
The objective of the study is to answer the following questions: 1. Are the items a good measure for Malaysian students' proficiency level in Scientific Literacy?; 2. Which items are considered difficult or easy by the students?; and 3. Which items show erratic responses from the students?
Hypothesis: The higher the cognitive demand of the items, the less likely students with lower ability are to respond correctly to the items.
2. Sample: 42 students were randomly selected from each (randomly selected) school based on specific strata such as school category, school type, school location and medium of instruction, using the Keyquest software provided by OECD. The total number of students sampled for the PISA 2015 study was 9,622.
3. Method: There were 184 science-related items used in PISA 2015, which were given to students in various combinations (forms). Items were coded according to the numeric order of the item followed by four letters: the first letter to indicate the competency assessed; the second letter, to indicate the type of knowledge involved; the third, the system of knowledge involved; and the fourth letter to indicate the level of cognitive demand assigned by the test designers.
The competencies assessed for the items in the Scientific Literacy of PISA 2015 were symbolized by: P for explaining phenomena scientifically; E for evaluating and designing scientific inquiry; and D for interpreting data and evidence scientifically.
The types of knowledge involved are symbolized by: C for content knowledge; P for procedural knowledge; and E for epistemic knowledge.
The systems of knowledge used in the items are symbolized by: P for physical systems; L for living systems; and E for earth and space systems.
The levels of cognitive demand assigned by the test developers were symbolized by: L for low cognitive demand; M for medium cognitive demand; and H for high cognitive demand.
The PISA data provided several codes for student responses which include single digit codes of “1”, “0”, “9”, “6”, or “7” and double digit codes of “21”, “11”, “12”, “01”,


“02”, “03”, or “04”. Codes “1”, “11”, “12” or “21” were used for credited responses, with “11” and “12” signifying partial credit and “21” full credit. Codes “0”, “01”, “02”, “03” or “04” were given for non-credit responses of students. Codes “9” or “99” were non-responses by students. Codes “6” or “96” were student responses beyond the time given, while codes “7” or “97” were for items that were hidden from students.
For the purpose of this study, credited responses for both single and double digits were re-coded as “1” while non-credit responses were re-coded as “0”. Non-response codes and the other four codes were treated as missing responses.
After re-coding the responses, analysis was carried out using the Bond and Fox software. The outcome of the analysis is used as data for this study.
4. Results: The analysis showed that the items provided a good measure of student ability based on the person reliability. However, a difference of almost 1 logit indicated that many of the items were difficult for many students.
The analysis also found that some items that were assigned as low cognitive demand were of high difficulty for the students, while some items that were assigned as high cognitive demand were fair for average students.
The item codes provided a reference to look further into the items with respect to the type of responses required or the task demanded of the students in giving responses appropriately.
5. Conclusions: The findings show that while Scientific Literacy items provided a good measure of the ability of 15+ year old students in Malaysia, the items were fairly difficult for many students. Some items that were labelled as easy were difficult even for high ability students. On the other hand, some items that were labelled as difficult could be answered even by average students. Further analysis into various aspects of the items would be required to provide a better description of students' strengths and weaknesses in solving tasks given in the Scientific Literacy items of PISA 2015.
The analysis of the items would entail identifying the characteristics of the items to provide background and a wider scope of insight for the policy makers at the Ministry of Education and teachers to strategize efforts towards improving scientific literacy among students in Malaysia. With the dissemination of information through this analysis, teachers are more able to align the teaching and learning experience based on the strengths and weaknesses of the taught curriculum. Last but not least, students are better able to engage with the learned curriculum in addressing issues arising from real life situations, a characteristic of the PISA 2015 Scientific Literacy items.

Paper No KK_043
Paper Title Validating Knowledge Domain of Facilitator Competency Profile Instrument – SISC+1 (FCPI-SISC+1) Using Rasch Model
Email Address [email protected]
1st Author Zulkifili Salleh
Subsequent authors Noraisah Jamil, Shariff Yob, Shahrizal Shaarani, Mohd Isham Embong, Raja Hamizah Raja Harun, Nurulhidayah Sukiman, Siti Sarah Baharom, Ismail Mohamad, Mohd Kashfi Mohd Jailani
1. Aims/Objectives of study: This research is based on the Facilitator Competency Profile Instrument – SISC+1 (FCPI-SISC+1) on Knowledge Domain, which consists of 34 dichotomous items focusing on 5 standards, namely general knowledge, coaching and mentoring, subject matter expert, coachees' advancement and needs analysis study/assessment. School Improvement Specialist Coaches+ (SISC+) refers to an Education Officer who is a subject matter expert in teaching and learning and able to provide support to improve teachers' quality through strategic coaching and mentoring. The purpose of this paper was to validate the reliability and validity of the Facilitator Competency Profile Instrument – SISC+1 (FCPI-SISC+1) on Knowledge Domain using the Rasch Model.
2. Sample: The instrument was administered to 587 SISC+ throughout Malaysia using stratified random sampling. Male respondents made up 47.9% of the sample (n = 281) and female respondents 52.1% (n = 306). The sample comprises 2.2% (n = 13) PhD holders, 33.4% (n = 196) master's degree holders and 64.4% (n = 378) degree holders.
3. Method: This research used the Facilitator Competency Profile Instrument – SISC+1 (FCPI-SISC+1), focusing on Knowledge Domain. The Rasch model was used because it can measure person reliability and item reliability and is more robust compared to Cronbach's Alpha. It also allows item elimination based on t-value and differential measure. Winsteps version 3.68.2 was used in the process.
4. Results: The person reliability value obtained was 0.64 while the item reliability value was 0.95 for dichotomous items. The Winsteps analysis revealed a positive value in PTMEA CORR, a high item reliability and person reliability index and the ordering of items in terms of hierarchy according to level of difficulty.
5. Conclusions: Results of the data analysis using Winsteps recorded a high level of person and item reliability index. Therefore, the instrument FCPI-SISC+1 on knowledge domain has high validity and reliability. Hence, the use of the Rasch Model in analyzing the instrument has significantly contributed to its validity.
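The response re-coding described in the KK_042 method (credited codes to 1, non-credit codes to 0, all other listed codes treated as missing) can be sketched as follows. This is a minimal illustration under stated assumptions: the code sets are copied from the abstract, while the function name `recode_response` and the use of `None` for missing data are hypothetical choices, not the authors' actual procedure or any PISA tool.

```python
from typing import Optional

# Code sets as listed in the KK_042 abstract. Codes are kept as strings
# because "01" (no credit) and "1" (credit) must stay distinct.
CREDIT = {"1", "11", "12", "21"}
NO_CREDIT = {"0", "01", "02", "03", "04"}
MISSING = {"9", "99", "6", "96", "7", "97"}


def recode_response(code: str) -> Optional[int]:
    """Map a raw response code to 1 (credit), 0 (no credit) or None (missing)."""
    code = code.strip()
    if code in CREDIT:
        return 1
    if code in NO_CREDIT:
        return 0
    if code in MISSING:
        return None
    # Anything else is not described in the abstract, so fail loudly.
    raise ValueError(f"unrecognised response code: {code!r}")


if __name__ == "__main__":
    raw = ["21", "11", "0", "02", "9", "96", "7"]  # made-up example row
    print([recode_response(c) for c in raw])
```

Note that partial-credit codes "11" and "12" collapse to 1 here, following the abstract's statement that all credited responses were re-coded as "1"; a partial credit analysis would need a different mapping.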


Paper No KK_044
Paper Title Measuring The Status Of Fasilinus Current Professional Profile Using Rasch Model
Email Address [email protected]
1st Author Ruzita Ahmad
Affiliation
Subsequent authors Muhammad Faizal Razali, Noor Azam Asmaran, Wan Nor Anita Wan Hassan, Raja Hamizah Raja Harun, Nurulhidayah Sukiman, Siti Sarah Baharom, Ismail Mohamad, Mohd Kashfi Mohd Jailani
1. Aims/Objectives of study: The role of FasiLINUS is to realise the Ministry Of Education (MOE) goal through the LINUS2.0 Programme in ensuring all level 1 pupils (year 1, year 2 and year 3), except for pupils with Special Educational Needs (SEN), achieve basic literacy for Bahasa Malaysia, English and Numeracy through screening conducted in primary schools. This research aims to identify the current job profile of FasiLINUS in District Education Office / State Education Department in Malaysia.
2. Sample: The study involved 377 FasiLINUS as sample. The respondents consisted of 173 males (46%) and 204 females (54%) who were selected using stratified random sampling method. 65 of the respondents (17%) possessed post graduate degrees while 312 (83%) were degree holders.
3. Method: The Competency Profile of Validation Study for Education Officers of Group Facilitator instrument was used in this study. It contained 70 items with four FasiLINUS competency standards, namely coaching and mentoring, clients’ advancement, subject matter expertise and needs analysis. The data collected were analysed using Rasch Model. Analysis on item map was carried out to identify the level of difficulties of item-person.
4. Results: From the analysis, a list of tasks was agreed by FasiLINUS as their current professional profile. There were 2 items less agreed by respondents namely B10 Quality Procedure MS ISO 9001:2008 .
5. Conclusions: From the results it requires the implementation of intervention programme PK18 procedure for Quality control compliance; and B9 Finance Circular which involved conducting briefing/courses and it requires OS 21000 and OS 29000 compliance. However these items need to be reconsidered as they are the added values to previous knowledge and professionalism development for FasiLINUS.

Paper No KK_045
Paper Title USING RASCH MODEL TO ASSESS THE FOREIGN LANGUAGE SPEAKING ANXIETY SCALE (FLSAS) AMONG UNIVERSITY STUDENTS IN SALATIGA
Email Address [email protected]
1st Author Rizki Parahita Anandi
Affiliation Faculty of Education, University of Malaya
Subsequent authors Resa Syafitri (Faculty of Education, University of Malaya), Bambang Sumintono (Institute of Educational Leadership, University of Malaya)
1. Aims/Objectives of study: The aim of this study is to examine the validity and reliability of the Foreign Language Speaking Anxiety Scale (FLSAS) by using the Rasch Measurement Model approach.
2. Sample: Forty-six Arabic Language Education students from a university in Salatiga participated in this research.
3. Method: The survey research design is used in this study. The data is collected by distributing the Foreign Language Speaking Anxiety Scale (FLSAS) to the students of Arabic Language Education. The data is then analyzed using Rasch Model to measure the validity by referring to its value of raw variance and the unexplained variance, while its reliability is measured by analyzing its person-item reliability and Cronbach’s Alpha value. The Wright person-item map is also used to analyze the items of the questionnaire.
4. Results: The result showed that the value of raw variance of the questionnaire is 40.9% and the unexplained variance of it does not exceed 15%. This result indicated that the questionnaire is valid in terms of its construct validity. The reliability of the questionnaire is measured by referring to the person reliability (.88), item reliability (.90) and the value of Cronbach’s Alpha (.90) of the questionnaire as well. The Wright item-person map showed 3 items placed at the top of the map, representing the situation where students did not feel anxious while speaking Arabic language. On the other hand, the map also showed an item placed at the bottom of the map, which means that most students get anxious when they cannot express their thoughts while speaking Arabic language.
5. Conclusions: It can be concluded that this instrument is valid and reliable to measure the students’ level of anxiety in speaking Arabic language


Paper No KK_047
Paper Title Global mindset: Assessing construct dimensionality
Email Address [email protected]
1st Author Jeffrey Durand
Affiliation Toyo Gakuen University, Japan
Subsequent authors Jason Pratt, Toyo Gakuen University, Japan; Sarah Louisa Birchley, Toyo Gakuen University, Japan
1. Aims/Objectives of study: The number of Japanese university students studying abroad has been decreasing, and at the same time, more and more are interested in working for a traditional Japanese company. On the other hand, the number of Japanese tourists abroad is increasing, as is overseas production by Japanese companies. We are interested in preparing our students for work in a global economy and having a connection to a global community. To this end we have developed a questionnaire to understand what kind of global mindset our students have. Global mindset, the interest to be involved in an international community, is developed from previous work on motivation and the L2 self system (Kikuchi, 2016) and international posture (Yashima 1998). Nine constructs resulted. The goal of this research is to compare the dimensionality of Global mindset with that of the motivation research. A further goal is to determine whether some constructs are theoretically and statistically similar enough that they might be considered as one.
2. Sample: Initial sampling consists of 40 students at a Japanese university. Further sampling is being conducted to increase this number. Students are recruited from a number of required language and other classes across all majors at the university. Student language abilities range from very low to very high.
3. Method: Rasch partial credit analyses were conducted separately on the nine constructs of global mindset. Item and person fit was examined and principal components analyses were conducted.
4. Results: Early results suggest that most items in each construct fit well. Construct dimensionality, however, is sometimes suspect. These results will be updated when more data is included. Dimensionality of global mindset constructs may not coincide with that of the similar language motivation constructs from which they were derived.
5. Conclusions: If these results stand, ‘general’ concepts of motivation may not be applicable to different topics, i.e. language and global mindset.

Paper No KK_048
Paper Title THE EFFECTIVENESS OF TEACHER TRAINING LESSONS
Email Address [email protected]
1st Author Burhanuddin Tola
Affiliation State University of Jakarta
Subsequent authors
1. Aims/Objectives of study: To provide baseline data on the teacher training conducted by DGDETEP and TPG/TSMD and on its effectiveness. This relates to the BERMUTU program, which will support the training organized by TPG/TSMD.
2. Sample: The population of this study was JHS teachers in the subject fields of Math, IPA (natural science), Indonesian Language, and English who attended the training conducted by DGDETEP (math, IPA, and language) and by TPG/TSMD.
3. Method: This study measured teacher competence, both teaching (academic) competence and non-academic competence, the latter consisting of personality, social competence, and pedagogic competence. It also examined the effect of the training on work satisfaction, teacher attitude while teaching, and teaching efficacy.
The instrument was tried out on JHS teachers in various regions of Indonesia and refined jointly by a team of researchers from State University of Yogyakarta, Universitas Pendidikan Ganesha, Universitas Negeri Makassar and Universitas Indonesia. It was refined through a content validity test, and coefficient alpha (α) was calculated to assess reliability.
Besides the measurement already mentioned, a questionnaire was distributed and studied qualitatively to see the participants' responses to the training already conducted, especially the benefits they gained from it and their suggestions for it.
To test the effectiveness of the training, the study used an experimental pre-test and post-test group design. The effectiveness of the training was measured twice, by giving the questionnaire before and after the training. And the model of the training


explained as follow: The data obtained depended on when the training was conducted. Moreover, the writer could not set a fixed schedule, because the training schedule was decided by TPG/TSMD or DGDETEP. Data were collected before the training started and after it finished, so that pre-test and post-test scores could be taken cleanly. The statistical analysis used to test the effectiveness of the training was an independent-samples t-test on the differences in gain score, that is, the differences in average score between before and after training. In addition, the pre-test scores of the TPG/MGMP and DGDETEP groups were compared. The data were computed with SPSS version 11.
4. Results: There was a significant difference in the professional competence of Math teachers before and after the training, in both the TPG/TSMD and the DGDETEP training. Moreover, teachers’ competence improved after the training program, as can be seen from the differences between pre-test and post-test scores. From the previous table it can also be seen that there was a significant difference in professional competence in math between teachers who attended the TPG/MGMP training and those who attended DGDETEP, where the teachers who attended the TPG/MGMP training had higher competence than those who attended the DGDETEP training (comparison of the pre-test and comparison of the post-test between TPG/MGMP and DGDETEP).
The professional competence of biology teachers in DGDETEP was slightly above average (pre-test score). The scores of teachers in the TPG/MGMP training could not be processed, possibly because of an error in filling in the section, so the pre-test scores of teachers who attended the TPG/MGMP training and P4TL of Jakarta cannot be compared.
There are significant differences in the professional competence of physics teachers before and after the TPG/TSMD training. However, this cannot simply be interpreted as a significant difference between the two scores, because professional competence decreased after training. This could be caused by a mistake or lack of time in answering the questions in the post-test. In addition, there is a significant difference in physics competence between teachers trained by TPG/TSMD and by DGDETEP: the competence scores of teachers trained in DGDETEP are significantly higher than those of teachers trained in TPG/TSMD.
Teachers have average competence. From the results of statistical tests, there appears to be a reduction in scores for Indonesian Language, but the reduction is not significant. This may be caused by the short training time (two days) and training content that does not provide enough material to equip teachers in academic terms. The same result was found for English, using data from the TPG/TSMD training, as the English training managed by DGDETEP was not held (canceled).
The personal competence of teachers is above the average score. Personal competence highlights teacher characteristics such as being solid, honest, confident and responsible toward teacher ethics.
The social competence of teachers is classified as above average. The social competence statements in the questionnaire relate to the ability to develop social relationships with fellow teachers, parents, and students. The statistical analysis shows a significant difference in social competence between teachers who attended the TPG/TSMD training and those who attended DGDETEP: teachers in the DGDETEP training have higher social competence than teachers in the TPG/TSMD training.
The pedagogical competence of teachers is classified as above average. The pedagogical competence statements in the questionnaire relate to the ability to prepare and organize learning in the classroom.
The results for pedagogical competence are inconsistent. In the TPG/TSMD training group, teachers' pedagogical competence scores increased, while those of the DGDETEP group decreased. Looking at the material provided in the training, teachers in the TPG/TSMD training are taught to master classroom preparation, such as how to create a KTSP syllabus, while the DGDETEP training places more emphasis on teaching material.
5. Conclusions: From the research on the training organized by TPG/MGMP and DGDETEP, the following can be concluded. In the TPG/MGMP training, there was significant improvement in: academic competence in Math, personality competence, and teaching efficacy. In the training conducted by DGDETEP, improvement occurred in: academic competence in Math, attitude in teaching, and teaching efficacy. Scores decreased in some aspects, and only in the training conducted by DGDETEP, namely: academic competence in physics, personality competence, social competence, and pedagogic competence. In the baseline (pre-test) data, there were significant differences between the two groups in some aspects: Math, Physics, Personality Competence, Social Competence, Satisfaction of Work, Teaching Attitude, and Teaching Efficacy.
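The Method of paper KK_048 describes an independent-samples t-test on gain scores (post-test minus pre-test) for two training groups. A minimal sketch of that calculation follows. The scores are hypothetical and the equal-variance (pooled) form of the statistic is an assumption; the abstract does not say which variant was used in SPSS.

```python
from math import sqrt
from statistics import mean, variance


def gain_scores(pre, post):
    """Per-teacher gain: post-test score minus pre-test score."""
    return [after - before for before, after in zip(pre, post)]


def pooled_t(sample_a, sample_b):
    """Student's t statistic and degrees of freedom for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical pre-test and post-test scores for two training groups.
group_a_pre, group_a_post = [50, 55, 60, 52, 58], [60, 63, 66, 61, 65]
group_b_pre, group_b_post = [51, 54, 59, 53, 57], [54, 58, 61, 57, 60]

t_value, dof = pooled_t(gain_scores(group_a_pre, group_a_post),
                        gain_scores(group_b_pre, group_b_post))
print(f"t = {t_value:.2f} on {dof} degrees of freedom")
```

Comparing gains rather than raw post-test scores adjusts for the baseline differences between groups that the abstract reports at pre-test.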


Paper No KK_049
Paper Title Psychometrics Properties of the Tuckman Procrastination Scale in an Indonesian sample
Email Address [email protected]
1st Author Ngadiman Djaja
Affiliation Krida Wacana Christian University (UKRIDA)
Subsequent authors
1. Aims/Objectives of study: This research aimed to establish how well items from Tuckman Procrastination Scale fit a Rasch Measurement Model. The analysis aimed to identify unreliable items and items displaying poor fit to procrastination construct.
2. Sample: A sample of 47 participants were recruited via email and social media (facebook), which comprises a cohort of men and women aged 21-57 years from the population of Jakarta, Indonesia in 2017.
3. Method: All 47 participants completed the 16 items measuring procrastination; only items related to procrastination were used in this study. Using Rating Scale Model (RSM), the item and overall test parameters and ability parameters of participants were estimated. Acceptable mean square statistics (MNSQ) parameters were defined as 0.6 and 1.4. Items with MNSQ outside this range are considered to under-fit or over-fit with the model.
4. Results: Item calibration using RSM showed 13 items had acceptable mean square statistics ranging from 0.62 to 1.34; three items were removed for further analysis. Item measures for the 13 items ranged from theta -1.38 to 0.94. Only 13 items fitted the single construct, implying that the Tuckman Procrastination Scale is a unidimensional measure of procrastination. Overall, the scale has good psychometric properties with a person reliability of 0.82.
5. Conclusions: This analysis provided evidence for the Rasch measurement qualities of the Indonesian version of the Tuckman Procrastination Scale

Paper No KK_050
Paper Title Live Grading of Essay Questions Contributing to Computer Adaptive Testing
Email Address [email protected]
1st Author Dr. Haniza Yon
Affiliation MIMOS Berhad
Subsequent authors Rense Lange (ISLA), Norsyahida Abd Kadir (MIMOS Berhad) & Nur Ayu Johar (MIMOS Berhad)
1. Aims/Objectives of study: Design and implement a Computer Adaptive Testing (CAT) that integrates multiple-choice (MC) and more complex question types requiring considerable real-time analysis to evaluate test-takers’ answers. Examples include test-enhanced learning, hybrid questions, and the grading of student written answers to essay questions. We outline the design of CAT using “open-ended” (OE) essay questions that are “scored” by an extension of the CAT system in real-time using item calibrations obtained for a third grade reading test. Using a Rasch partial credit model, the essay score is used to estimate test-takers’ performance on a reading test.
2. Sample: Items were calibrated based on sample of 534 Malaysian third-graders who took a test MC + OE designed to assess their comprehension of a reading passage. Five teachers graded each answer to the OE.
3. Method: The OE questions were analyzed using Latent Semantic Analysis (LSA) using term weighting (information-based TF-IDF) followed by Singular Value Decomposition (SVD) to obtain a semantic space in which to locate student answers. The MC items were analyzed using the standard binary Rasch model to derive student reading measures. We found that a rather low number of dimensions (50-100) provided the best results in predicting students' reading measures. It proved possible to predict students’ OE using linear regression, logistic regression, and discriminant analyses. By ranges of predicted values, OE answers could now be treated as if they were ratings on a rating scale. Results obtained via LSA and those obtained from teacher ratings correlated highly.
4. Results: The simplest approach, linear regression, proved adequate to identify 3-5 ordered categories to represent student performance on the OE question (R² > 0.7). After anchoring the MC, the OE was then calibrated as a partial credit item


to yield an augmented set of CAT items. Using the Rasch-based CAT system described in Lange (2007) which allows for partial credit items, it is now possible to mix and match MC and OE questions in the same CAT system. In this system, OE answers are (1) “graded” using the LSA approach, (2) this grade is treated analogous to the “grade” a teacher might have given, (3) given the OE item’s difficulty and step values it is thus possible to treat OE and MC in the exact same way to guide the CAT system.
5. Conclusions: We have designed and implemented an augmented CAT system in which MC and OE items can be mixed as desired. Various OE grading methods can be used, and this would also accommodate other types of items that produce student behavior requiring further evaluation to yield a “grade.” We have implemented a prototype system that we hope to demonstrate live during the talk at the conference.

Paper No KK_051
Paper Title DEVELOPMENT OF INDONESIA SCIENCE LITERACY TEST (ISLT) INSTRUMENTS TO IMPROVE CRITERIA VALIDITY OF NATIONAL EXAM
Email Address [email protected]
1st Author Rosita Uli Sihombing
Affiliation
Subsequent authors
1. Aims/Objectives of study: This study aims to develop the Indonesian Science Literacy Test (ISLT) instrument that can measure the science literacy of 15 year old students (grade 9 or 10), especially in an effort to increase the criteria validity of national exam.
2. Sample:
3. Method: The methods used through 5 stages of the ADDIE development model are: (1) analyse; (2) design; (3) development; (4) implementation; (5) evaluate. Rasch model is used to get the good item for ISLT.
4. Results: The results of the validity and reliability test indicate that the developed ISLT instrument
5. Conclusions: ISLT instrument belongs to a valid category and deserves to be used as an instrument to measure students' literacy skills.
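Paper KK_050 states that, given an OE item's difficulty and step values, OE and MC items can guide the CAT system in the same way. A minimal sketch of why this works under the Rasch partial credit model follows: a dichotomous MC item is the one-step special case, so a single information function serves both item types. The item names, step values, and the maximum-information selection rule are all assumptions made for illustration; this is not the system described in Lange (2007).

```python
from math import exp


def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the Rasch partial credit model.

    A dichotomous item has a single step, giving the binary Rasch model.
    """
    cumulative = [0.0]  # category 0 has an empty sum
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, step_difficulties):
    """Fisher information of the item at theta, equal to the score variance."""
    probs = pcm_probabilities(theta, step_difficulties)
    expected = sum(k * p for k, p in enumerate(probs))
    return sum((k - expected) ** 2 * p for k, p in enumerate(probs))


def most_informative(theta, item_bank):
    """Pick the item with maximum information at the current ability estimate."""
    return max(item_bank, key=lambda name: item_information(theta, item_bank[name]))


# Hypothetical bank: two MC items (one step each) and one OE item with four categories.
item_bank = {"MC_1": [0.3], "MC_2": [1.5], "OE_1": [-0.5, 0.4, 1.2]}
print(most_informative(0.3, item_bank))
```

Because a polytomous item spreads information across several steps, it can be more informative than any single MC item near its step values, which is one reason to mix the two types in one bank.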


Paper No KK_052
Paper Title Rasch Model Application on Developing a Self-regulation Study Instrument for Mathematics Education Students
Email Address [email protected]
1st Author Wardani Rahayu
Affiliation Universitas Negeri Jakarta
Subsequent authors
1. Aims/Objectives of study: To develop a Mathematics Education students’ self learning instrument using a Rasch model on item response theory
2. Sample: The sample in the first trial comprised 249 Mathematics Education students whilst in the second trial there were 260 Mathematics Education students in Jakarta and Tangerang.
3. Method: This is development research with two trials. The analysis of results using a Rasch model on the first and second trial is that the Mathematics Education students’ self learning instrument has a significant item reliability and respondent reliability; the item’s unidimension requirement and local independence are fulfilled as an assumption on the item response theory
4. Results: The Mathematics Education students’ standard model instrument consists of 11 items that measure cognitive aspects, 7 items that measure motivation, 8 items that measure behaviour, and 4 items that measure meta-cognitive aspects.
5. Conclusions: The Mathematics Education students’ self-learning instrument that has been developed can be used to measure students' motivation, cognitive dimension, behaviour, and meta-cognitive dimension.

Paper No KK_053
Paper Title Measuring Second Language Receptive Knowledge of Collocation Among Graduate Learners in Public Universities Malaysia Using Rasch Analysis
Email Address [email protected]
1st Author Lily Hanefarezan Asbulah
Affiliation Fakulti Pendidikan, Universiti Kebangsaan Malaysia
Subsequent authors Maimun Aqsha Lubis. Fakulti Pendidikan, Universiti Kebangsaan Malaysia
1. Aims/Objectives of study: This study investigated AFL graduate learners' knowledge of verb-particle, noun-noun and noun-adjectives collocations at the first four 1000 word frequency levels. A 40-item collocation test was used to measure receptive knowledge of verb-noun and adjective-noun collocations that are made up of words taken from 1000, 2000, 3000, and 4000-word frequency levels in Arabic.
2. Sample: Since the data were collected at a single duration of time, this study employed a cross-sectional sample survey field study. By adopting non-probability sampling techniques, a total of 345 graduate learners from Bachelor Degree of Arabic Language were involved in this study from seven public universities in Malaysia which were UKM, UPM, UM, UiTM, UIAM, USIM, UPSI and UniSZA.
3. Method: A 40-item multiple choice question test of verb-noun and adjective-noun collocations was given to the participants. There were three options of distractor (collocates) that the learners had to choose from; “I Don’t know” (أنا لا أعرف) was also offered as the fourth option to prevent respondents from simply guessing a matching collocate.
4. Results: Rasch analysis shows that all items on a test are reliable and unidimensional, that the participants’ responses fit the parameters of the Rasch model and there is no item bias (differential item functioning) occurring.
5. Conclusions: Surprisingly, the result revealed that after at least 9 years of formal language instruction, the respondents were close to a level of mastery of collocational knowledge for category noun-noun and noun-adjective but not for verb-particle.


Paper No KK_054
Paper Title Development and validation of a diagnostic pronunciation rating scale: A rating scale and common-item equating analysis
Email Address [email protected]
1st Author Yuanyue Hao
Affiliation Fudan University
Subsequent authors
1. Aims/Objectives of study: Assessment of pronunciation has long been established as an integral component of speaking assessment, usually combined with other dimensions such as fluency, lexical resource and topic development to generate an overall score for the speaking section in major English tests. Few studies focus on assessment of pronunciation per se, which plays a critical role in pedagogical context such as pre-service teacher training and teaching assistant selection. This study attempts to develop a diagnostic rating scale of pronunciation for the purpose of pronunciation instruction in a formal pedagogical practice and provide evidence for the construct validity of the scale by a many-facets Rasch analysis.
2. Sample: Participants of this study were selected from the students who were receiving both formal pronunciation instruction in the course of English Pronunciation and the subsequent peer tutoring in a major university in China which is specialized in and renowned for pre-service teacher training. A total of 88 students were rated by 10 raters who were also the peer tutors and 2 raters who were instructors of the pronunciation course on 23 items in the diagnostic rating scale.
3. Method: 23 items in the rating scale were developed in a theoretically-informed and empirically-based fashion. Theories from the linguistic studies of phonetics and phonology inform the design of the rating scale. In the meantime, tutoring records of each student were collected, coded and analyzed to supplement the rating dimensions extracted from phonetic and phonological theories. 23 items were compiled into the final version of the diagnostic rating scale. The scale was analyzed by Minifac (version 3.80.0) in the model of rating scale to investigate the construct of the scale. In a subsequent stage, ratings of two groups of students who were enrolled in different majors in the Department of English were analyzed by Ministep (version 3.93.2) to conduct a common-item equating analysis to explore the validity of rating scale across different groups of students.
4. Results: Initial construct validation by Minifac reveals satisfactory quality of the scale construction, indicated by all the Infit and Outfit mean-square ranging between 0.5 and 1.5, though further analysis of the use of category suggests that category collapse be necessary. Results from the common-item equating analysis corroborate the validity of this diagnostic rating scale of pronunciation, considering that the empirical line is parallel to the identity line. Findings from the two analyses have provided preliminary evidence for the validity of the rating scale, which is subject to further investigation.
5. Conclusions: This study employs a rating scale Rasch analysis and common-item equating analysis to investigate the construct validity of a diagnostic rating scale of pronunciation and its validity across different groups of students. Results suggest that this scale can be considered a valid one in assessing pronunciation and diagnosing students’ difficulty in the process of learning. This study has implications for the development and validation of diagnostic rating scale of other sub-dimensions of language proficiency and pronunciation teaching and assessment. With that being said, many questions remain unsolved, such as the contention between nativeness-like principle and intelligibility principle in the practice of teaching pronunciation, especially against the backdrop of global Englishes, longitudinal study of the validity of this rating scale, the learning trajectory of students over the one-year pronunciation learning experience, and the linkage between the diagnosis and instructional resources.
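The Results of paper KK_054 rest on the empirical line through the common items being parallel to the identity line. A minimal sketch of one way to check this follows, assuming the empirical line is taken as the best-fit line whose slope is the ratio of the standard deviations of the common-item measures in the two groups. The item measures are hypothetical, and how close to 1 the slope must be is a judgment call the abstract does not specify.

```python
from statistics import mean, pstdev


def equating_line(measures_a, measures_b):
    """Slope and intercept of the line B = slope * A + intercept through common items.

    The slope is the ratio of standard deviations; a slope near 1 means the
    empirical line runs parallel to the identity line.
    """
    slope = pstdev(measures_b) / pstdev(measures_a)
    intercept = mean(measures_b) - slope * mean(measures_a)
    return slope, intercept


# Hypothetical difficulty measures (logits) of five common items in two groups.
group_a_items = [-1.0, -0.5, 0.0, 0.5, 1.0]
group_b_items = [-0.8, -0.3, 0.2, 0.7, 1.2]

slope, intercept = equating_line(group_a_items, group_b_items)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f} logits")
```

A slope near 1 with a nonzero intercept indicates that the two groups share the same unit of measurement and differ only by a constant shift, which is the situation common-item equating corrects for.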


Paper No KK_055
Paper Title Development of instrument in measuring cottage industry accounting practices.
Email Address [email protected]
1st Author Susana Narawi
Affiliation Faculty of Accountancy, Universiti Teknologi MARA, Sarawak
Subsequent authors Bambang Sumintono, Institute of Educational Leadership, University Malaya, Kuala Lumpur.
1. Aims/Objectives of study: The main objective of this paper was to discuss issues pertinent to the development of instrument to measure cottage industry accounting practices. Keywords: instrument, accounting practices, Partial Credit Rasch Model.
2. Sample: Based on preliminary investigation, purposive sampling technique was utilised. The sample was obtained through the assistance of an agency which provides micro credit facility. Pilot and actual data collection involved 31 and 117 cottage industry owners respectively who had obtained micro credit facility.
3. Method: The initial stage involved several preliminary interviews with cottage industry owners, followed by expert content validation. The next stage involved face to face pretesting with 11 respondents and also a pilot test with 31 respondents after the pretest data was analysed using WINSTEPS software. Questionnaires administered personally in groups were used for final data collection. Five cottage industry owners were concurrently interviewed to obtain qualitative responses to justify the quantitative findings. This research used the Partial Credit Rasch Model as it involved polytomous data.
4. Results: Findings from the data collection clearly show that it was indeed vital to have the non-applicable option in the measuring scale for the level of accounting practices among the cottage industry. The analysis showed 15% of responses were non-applicable and treated as missing data. In the development of the instrument from the pilot test instrument to the final data collection instrument, all the conditions to fit the model were satisfied and improved.
5. Conclusions: The result of this study reveals that the Partial Credit Rasch Model is capable of handling the presence of non-applicable accounting practices and at the same time is able to satisfy all the conditions to fit the model although it involves polytomous data. As a conclusion, it is expected that discussion from this paper will add new knowledge on the importance of instrument quality to be examined using Rasch Measurement Model or Partial Credit Rasch Model, particularly for future study involving accounting practices.

Paper No KK_056
Paper Title Comparison of holistic and analytic rating methods of a writing task from the perspective of validity, reliability and practicality
Email Address [email protected]
1st Author Keita Nakamura
Affiliation Eiken Foundation of Japan
Subsequent authors
1. Aims/Objectives of study: In language testing, the debate between holistic and analytic scoring of writing tasks has been long and well-documented (Zhang et al, 2015). Holistic scales respond to language performance as a whole, and each score on a scale represents an overall impression, while analytic scales are composed of separate aspects of performance and each aspect is scored separately (Li et al, 2015). As previous literature has shown little consensus on which method yields more reliable and valid results (Harsch et al, 2013), it has been argued that the purpose of the writing task, whether diagnosis, selection, or achievement, is significant in deciding which method is chosen (Bacha, 2001).
Weigle (2002) has shown that from the reliability perspective, analytic scale is better than holistic scale. From the practicality perspective, however, analytic scale is more time consuming and expensive than holistic scale. As for the validity perspective, holistic scale assumes that different aspects of the writing ability develop at the same rate, while analytic scale assumes those aspects develop at different rates. However, there have been only a few studies which investigated all three perspectives of reliability, validity and practicality when comparing the holistic and analytic scales.
2. Sample: In this paper, the author presents the result of a study in which 371 Grade 8 and 204 Grade 7 students took a test for a grade-appropriate EFL writing task. Nine trained raters rated the papers using both holistic and analytic scales by counterbalancing the rating order effect.
3. Method: Analytic scale contained three criteria, namely content, vocabulary and grammar. Each rater was asked to at first rate a set of 20 common anchor papers for both Grade 7 and 8 tasks. At the same time, each rater was asked to measure their


rating time for each paper.
4. Results: Using many-facet Rasch model analysis (Linacre, 2015) for the Grade 7 and 8 tasks separately, the results showed that the analytic scale had higher reliability than the holistic scale for both tasks. It was also found that even the same rater varied in severity across the tasks and scales. In addition, raters' fit indices also varied within individual raters across the tasks and scales. The estimated participants' proficiencies from the two scales showed a high correlation (r = .97) for both tasks. Finally, it was found that rating with the analytic scale took twice as long as rating with the holistic scale for both tasks.
5. Conclusions: The implications will be discussed in terms of the validity, reliability and practicality perspectives for choosing the appropriate rating scale. The author would argue that the purpose of the test, or the validity issue, should always come first when designing a test and also when choosing a rating scale, and that reliability and practicality should sometimes be compromised. However, the degree of compromise should always be investigated by empirical studies before making decisions.

Paper No KK_057
Paper Title Analysing The Effect of Smart Partnership using Rasch – a case of women entrepreneurs in Tanjung Karang
Email Address [email protected]
1st Author Rohani Mohd
Affiliation UITM
Subsequent authors Salwana Hassan, Geetha Subramaniam, and Badrul Hisham
1. Aims/Objectives of study: The purpose of the paper is to investigate the impact of collaboration between women micro-entrepreneurs and major retailers.
2. Sample: The purposive sampling technique was used for this study. All 17 women micro-entrepreneurs who participated in the Smart Partnership with Mydin Hypermarket were selected. The data were obtained via a self-administered questionnaire using an adapted quality of life instrument.
3. Method: A case study was conducted on women entrepreneurs in Tanjung Karang, Selangor who participated in a collaborative program called Smart Partnership with Mydin Hypermarket. Rasch analysis was conducted to answer the research objectives.
4. Results: Based on the analysis, there were 5 groups of micro-entrepreneurs, identified by the 5 strata scored. Two rulers were generated from the analysis: profiling and gaps. The profiling ruler identifies 5 groups based on their level of satisfaction with different aspects of quality of life. The ruler on effectiveness measures how effective the program was, depending on the size of the gap between items before and after the program.
5. Conclusions: The findings indicated that the collaborative program has successfully transformed the lives of the poor to a better quality of life. The discussion and recommendations are also included in the paper.



Paper No KK_058
Paper Title Development And Validation Of Malaysian Secondary School
Email Address [email protected]
1st Author Ma Chi Nan
Affiliation UMS
Subsequent authors Vincent Pang
1. Aims/Objectives of study:
Purpose: The purpose of this study is to develop and evaluate an instrument that measures Malaysian students' national identity.
Objectives:
(a) To develop an instrument to measure Malaysian students' national identity.
(b) To assess the validity of the Malaysian students' national identity instrument.
(c) To assess the reliability of the Malaysian students' national identity instrument.
2. Sample:
Population: All secondary school students in Penampang District, Sabah
Sampling Method: Stratified sampling (there are six secondary schools in Penampang; 105 students from each school were selected as the sample, with 35 students from Form 1, 35 students from Form 2 and 35 students from Form 4).
Sampling Size: 630 students
3. Method: Research Procedure:
Stage 1: Instrument Development
(a) Developing conceptual and operational definitions of the construct of student's national identity
(b) Generating an item pool for the instrument
(c) Determining the format or selecting a scaling technique for the measurement.
Stage 2: Instrument Testing & Refining
(a) Establishing content validity of the instrument
(b) Performing back-to-back translation
(c) Preparing a revised draft of the questionnaire
(d) Testing construct validity and reliability
4. Results:
1) Item Misfit Diagnosis: Two items from the National Heritage construct, one item from the Cultural Homogeneity construct, three items from the Emotional Attachment to Malaysian Nation construct, and two items from the Collective Self-Esteem construct had infit and outfit mean squares outside the accepted range. The Wright maps were checked, and outlying items were identified from the Guttman scalogram of responses, in order to determine whether those items should be retained, revised or omitted.
2) Item Polarity Diagnosis: One item from the Collective Self-Esteem construct had a negative point-measure correlation.
3) Principal Component Analysis of Rasch Residual (PCAR): For the Collective Self-Esteem construct, the raw variance explained by measures was lower than 40%. The Standardized Residual Contrast 1 Plot was checked to identify the cause. From the item misfit diagnosis and PCAR diagnosis, it was suggested that the negatively worded items EA43, CS68 and CS74 be omitted. In addition, it was suggested that the negative item EA58 be retained and changed to a positively worded item.
4) Separation Diagnosis: All the items had item separation greater than 3.0 and item reliability greater than 0.90.
5) Category Function Diagnosis: The Belief System construct, the Emotional Attachment to Malaysian Nation construct, and the Nationalism construct showed step disordering. For these constructs it was suggested that category 1 be collapsed into category 0, which means the researcher reduced the categories from a 5-point rating scale to a 4-point rating scale.
5. Conclusions: Three items (EA43, CS68 and CS74) were negatively worded, and most probably the respondents were confused by them. After discussion with content experts and with reference to the misfit of the negative items, the researcher decided to drop the three negative items. Deleting the items narrowed the gap between the variance explained by the measures and that expected under the Rasch model. The total number of items in the inventory was therefore reduced to 73 items (from 76 items).
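The category collapsing described under Category Function Diagnosis can be illustrated with a small sketch. It assumes, hypothetically, that the 5-point scale is coded 0 to 4 and that missing responses are stored as `None`; the function name, the mapping and the sample data are illustrative and are not taken from the study.

```python
# Hypothetical sketch: collapse category 1 into category 0 on a 0-4 rating
# scale, then renumber so the remaining categories stay consecutive (0-3).
COLLAPSE_1_INTO_0 = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3}


def collapse_categories(responses, mapping=COLLAPSE_1_INTO_0):
    """Recode every rating with `mapping`; None (missing) is left untouched."""
    return [
        [None if rating is None else mapping[rating] for rating in person]
        for person in responses
    ]


# Illustrative data only: three respondents answering four items.
raw = [
    [0, 1, 2, 4],
    [1, 1, 3, None],
    [4, 2, 0, 3],
]
recoded = collapse_categories(raw)
print(recoded)
```

After recoding, the data would normally be re-analysed to check whether the step thresholds of the affected constructs are now ordered.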


Paper No KK_059
Paper Title The Unreasonable Effectiveness of Theory Based Instrument Calibration in the Natural Sciences: What Can the Behavioral Sciences Learn?
Email Address [email protected]
1st Author Jackson Stenner
Affiliation Chief Scientist, MetaMetrics, Inc., Durham, North Carolina, USA
Subsequent authors Mark Stone, William Fisher
1. Aims/Objectives of study:
Abstract.
In his classic paper entitled "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" Eugene Wigner addresses the question of why the language of Mathematics should prove so remarkably effective in the physical [natural] sciences. He marvels that "the enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and that there is no rational explanation for it" [1]. We have been similarly struck by the outsized benefits that theory based instrument calibrations convey on the natural sciences, in contrast with the almost universal practice in the social sciences of using data to calibrate instrumentation.
1. Introduction
In our ongoing exploration of the differences between the way the natural sciences and social sciences invoke, define and engage in measurement we have identified a number of differences. We have, to some benefit, contrasted human temperature thermometry (e.g. Nextemp thermometers) with the testing of mathematical ability and the measurement of English language reading ability. Although cataloging these differences has been useful, we now believe they are all traceable to a common cause. Physical science measurement virtually without exception is founded on well-developed substantive theory. These theories are not just compelling stories about the relationships between measurement outcomes (count of cavities turning black on a Nextemp thermometer), measures (degrees Celsius) and measurement mechanisms (chemical specification equation). They are sufficiently elaborated and precise in their specifications that they can be used to calibrate instrumentation. In contrast, throughout the behavioral and social sciences instrument calibration depends on data and is typically devoid of theory. We hypothesize that most of the observed differences between behavioral and physical science measurement are traceable to this foundational difference. Further, we offer the Lexile Framework for Reading as an example of a theory referenced measurement system in the educational sciences that mimics key features of human thermometry. Finally, we review the affordances shared by human thermometry and the Lexile Framework for Reading.
2. A Reading Framework
A consensus unit is typical of most natural science measurement. Sometimes, as in temperature measurement, the unification process is not fully completed but for the vast majority of natural science attributes/constructs a unification process has resulted in diverse instrument makers sharing a unit of measure even when the measurement mechanisms vary from manufacturer to manufacturer. Mercury-in-a-glass-tube thermometers for human temperature measurement can be contrasted with Nextemp™ technology. Although the measurement mechanisms are drastically different they both report out in either Fahrenheit or Celsius units. In the case of Nextemp™ thermometry a chemical specification equation calibrates the instrument in °C or °F. The chemical specification equation enforces the unit. In the Lexile Framework for Reading a text complexity specification equation enforces the 'Lexile' unit and ensures that 100L of difference between two readers, two texts or a reader/text encounter is invariant over any of 100+ English reading tests that, at present, employ the Lexile unit.
Strictly parallel instruments are typical in the natural sciences. Such instruments share a common correspondence table that links a measurement outcome (count of cavities turning black on a Nextemp thermometer) to a °C or °F. The ability to manufacture essentially identical instruments in large quantities is a hallmark of natural science measurement. The specification equation is the recipe for manufacturing and calibrating clones of an instrument. The social sciences borrow the concept and talk about 'parallel' instruments or 'alternate forms' and advertise that, say, form A and B produce exchangeable measures. Of course, without a specification equation it is impossible to manufacture copies or clones that share the same correspondence table. The Lexile Framework for Reading and its specification equation can be used to build strictly parallel clones of any reading test. No such capability exists, for example, for the Quantile Framework for mathematics, and this is so precisely because, at present, there exists no specification equation for mathematical ability that can calibrate mathematics test items. Different mathematics tests are empirically linked to the Quantile


scale through large scale, expensive, field studies typically involving thousands of students.
Typical Rasch model applications in the social sciences are singly prescriptive. The major prescription that data must meet is non-intersecting item characteristic curves (ICCs) which relate probability of a correct response to the difference between person ability and item difficulty. The data are used to estimate person and item parameters with no a priori constraints on the item parameters. The Quantile Framework for mathematics is typical of much social science measurement. Because there is no strong substantive theory for 'mathematical ability' there is no specification equation and, thus, no potential for theoretically calibrating items/instruments. Instrument calibrations depend on sample data and a property of the Rasch model: when data fit the model, differences between persons and differences between items are independent of items and persons, respectively. Contrast this singly prescriptive measurement framework with the doubly prescriptive models underlying Nextemp™ human thermometry and the Lexile framework for reading. In both these cases strong substantive theory coupled with either a Guttman model or a causal Rasch model requires not just data fit to the model but also data fit to the theory specified item/instrument calibrations. For Nextemp a chemical specification equation is used as a recipe for the chemical compound that fills each cavity. By precisely varying the amount of additive the difference between any two adjacent cavities in sensitivity to the green component of light is precisely 0.2 degrees Fahrenheit. The chemical specification equation enforces this common unit difference for each of the 44 adjacent cavity differences across the 9°F operating range for the instrument.
When data fit a doubly prescriptive Rasch model absolute person measures (not merely differences) are independent of items and instruments and are independent of person sample precisely because no person data figures in the instrument calibration process. Theory calibrated Rasch models are, thus, doubly prescriptive: prescriptive as to Rasch model requirements and prescriptive as to the substantive theory i.e. item/instrument calibrations. Person misfit to a doubly prescriptive model signals that the measurement mechanism that transmits variation in the attribute to the measurement outcome (often a count) is not working as intended for that individual. Frequent failures of theoretical invariance force reexamination of the substantive theory, the measurement mechanism and instrument calibration procedures. Theoretical invariance can be tested within person over time (e.g. reading ability growth trajectories) and when intra individual theoretical invariance holds across persons then inter-individual theoretical invariance necessarily holds i.e. the attribute is homologous (Borsboom, also Molenaar). Molenaar has shown that inferences moving in the reverse direction, inferring from inter-individual factor structures something about intra-individual factor structures, is fraught with complications. The fact that so much of social and psychological measurement is based upon factor analysis of inter-individual variation prompted Molenaar to call for a Kuhnian revolution. This paper is intended as a contribution to this revolution.
Unification of measurement refers to a 200 year old process whereby dozens if not hundreds of distinct scales for measuring a common attribute are, sometimes quickly and more often slowly, reduced to one, two or three exchangeable units of measure. The history of temperature measurement is a paradigmatic case (Chang, Sherry) that parallels many contemporary measurement movements in the social and behavioral sciences. Typically, an attribute (construct) captures the imagination of a community of scholars and engineers and different tests / instruments / mechanisms and scales are proposed for measuring the attribute and each is uniquely named. Once there is consensus that the selfsame attribute is being measured across these various devices small scale linking studies are undertaken to build conversion tables to re-express one unit in one or more other units. More advanced linking studies reduce the link to an equation, °F = °C × 9/5 + 32, making for quick and easy conversions. Since at this stage there is often not much to elevate one scale above the competition the market place takes over and 'unification', with all its time and cost savings, eventually prevails. Sometimes unification is swift and decisive but more often, particularly in the social sciences, metrology is poorly understood and unification plods along.
A useful case study of unification in the social sciences is the Lexile Framework for Reading which has linked 100+ English language reading tests across the world, 250,000 book measures and 200 million article measures to the Lexile scale. The unification process is 27 years old and is accelerating but is far from complete. This effort drew inspiration and strategies from the unification of temperature (Chang, Sherry, Fisher, Stenner 2016).
Rather than using factor analysis of inter individual data to define an attribute structure and then asking if this structure obtains when examining intra individual data we suggest the use of substantive theory (in the form of specification / calibration equations) to establish the universality of attribute structure and measurement mechanism at the individual level. Once this is accomplished there is no puzzle about whether between person differences have the same structure as within person differences – of course they do. So, what this analysis reveals is


that it is problematic to study between person variation at one point in time to glimpse truths about within person structures over time. But the surprise is that if we start with within person theory referenced measurement, where in the extreme no two persons have any items in common over 5 years of measurement, then we would not stop for a moment to puzzle about the validity of the claim that at the end of year 1 Jane was higher than Bob but at the end of year 5 Bob was higher than Jane (i.e. a claim about inter individual variation). This is yet another benefit of theory based instrument calibration.
Note 1
"The NexTemp Thermometer is a thin, flexible, paddle-shaped plastic strip containing multiple cavities. In the Fahrenheit version, the 45 cavities are arranged in a double matrix at the functioning end of the unit. The columns are spaced 0.2°F intervals covering the range of 96°F to 104.8°F…. Each cavity contains a chemical composition comprised of three cholesteric liquid crystal compounds and a varying concentration of a soluble additive. These chemical compositions have discrete and repeatable change-of-state temperatures consistent with an empirically established formula to produce a series of change-of-state temperatures consistent with the indicated temperature points on the device. The chemicals are fully encapsulated by a clear polymeric film, which allows observation of the physical change but prevents any user contact with the chemicals. When the thermometer is placed in an environment within its measure range, such as 98.6°F (37.0°C), the chemicals in all of the cavities up to and including 98.6°F (37.0°C) change from a liquid crystal to an isotropic clear liquid state. This change of state is accompanied by an optical change that is easily viewed by a user. The green component of white light is reflected from the liquid crystal state but is transmitted through the isotropic liquid state and absorbed by the black background. As a result, those cavities containing compositions with threshold temperatures up to and including 98.6°F (37.0°C) appear black, whereas those with transition temperatures of 98.6°F (37.0°C) and higher continue to appear green" (Medical Indicators, 2006, pp. 1-2). Thus, the observed outcome is a count of cavities turned black. The measurement mechanism is an encased chemical compound that includes a varying soluble agent that changes optical properties according to changes in temperature. Amount of soluble agent can be traded off for change in human temperature to hold number of black cavities constant.
Note 2
The Edsphere™ technology for measuring English language reading ability employs computer generated, four-option, multiple choice cloze items built on the fly for any prose text. Counts correct on these items are converted into Lexile measures via an applicable Rasch model. Individual cloze items are one off and disposable; an item is used only once. The cloze and foil selection protocol ensures that the correct answer (cloze) and incorrect answers (foils) match the vocabulary demands of the target text. The Lexile text complexity measure and the expected spread of the cloze items are given by a proprietary text theory and associated equations. Thus, the observed outcome is a count of correct answers. The measurement mechanism is a text with a specified Lexile text complexity and an item generation protocol consistent with that text complexity measure. The text complexity measure can be traded off for a change in reading ability to hold constant the number of items answered correctly.
Note 3
The Quantile Framework® consists of a common supplemental metric – the Quantile – that is employed to scientifically measure a student's ability to think mathematically and locate them in a taxonomy of mathematical skills, concepts, and applications. In order to develop the Quantile Framework, several tasks were undertaken: (1) develop a structure of mathematics that spans the developmental continuum from first grade content through Algebra I, Geometry, and Algebra II content, (2) develop a bank of items that have been field tested, (3) develop the Quantile scale (multiplier and anchor point) based on the calibrations of the field-test items, (4) validate the measurement of mathematics ability as defined by the Quantile Framework, and (5) link extant tests of mathematical ability to the Quantile scale. The process of scale unification for mathematics ability is well underway.
At present the attribute "mathematical ability" is unspecified i.e. there is no specification equation and associated Quantile analyzer that can be used to locate 'math text' on the Quantile scale. Rather, data intensive methods are employed to calibrate instrumentation and human intensive qualitative analysis is employed to locate math text (e.g. a chapter on adding fractions with uncommon denominators) on the Quantile® scale. The vast majority of social science attributes are similarly unspecified. By contrasting Nextemp™ Thermometry, the Lexile Framework for Reading and The Quantile Framework for Mathematics we hope to illuminate the chasm of difference between instrumentation that employs strong substantive theory and that which does not.
For the vast majority of measurement systems it is the case "that the difference


between any two points for one individual is qualitatively the same as a corresponding difference between two individuals at one time point" (Borsboom, D., Cramer, A. O. J., Kievit, R. A., Scholten, A. Z., and Franic, S., 2009) that is, the attribute is homologous. The same cannot be said for many measurement systems used in the social sciences. We propose that the routine adoption of theory based instrument calibrations will pave the way for homologous attributes in the social sciences, thus, assuring that the attribute on which I differ from myself over time is the same attribute on which I differ from my brother (Borsboom, 2005).
4. References
[1] Valsiner J, Molenaar P C M, Lyra M C D P, Chaudry N (Eds) 2009 Dynamic Process Methodology in the Social and Developmental Sciences (New York: Springer)
[2] Fisher W P Jr 2009 Measurement 42 1278-1287
[3] Stenner A J, Fisher W P Jr, Stone M H, Burdick D S 2013 Frontiers in Psychology: Quantitative Psychology and Measurement 4 doi: 10.3389/fpsyg.2013.00536
[4] Fisher W P Jr, Stenner A J 2011 A technology roadmap for intangible assets metrology. International Measurement Confederation (IMEKO) TC1-TC7-TC13 Joint Symposium. Jena, Germany. hhtp://www.db-thueringen.de/servlets/DerivateServlet/Derivate-24493/ilm1-2011imeko-018.pdf
[5] Borsboom D, Dolan C V 2007 Measurement 5 236-263
[6] Molenaar P C M 2004 Measurement 2 201-218
[7] Hamaker E L, Nesselroade J R, Molenaar P C M 2007 Journal of Research in Personality 41 295-315
[8] Molenaar P C M, Newell K M American Psychological Association. doi: 10.1037/12146-006
[9] Rasch G 1960 Probabilistic models for some intelligence and attainment tests (Reprint, with Foreword and Afterword by B. D. Wright, Chicago: University of Chicago Press, 1980) (Copenhagen, Denmark: Danmarks Paedogogiske Institut)
[10] Stenner A J, Burdick H, Sanford E E, Burdick D S 2006 J Appl Meas 7 307-322
[11] Bond T, Fox C 2007 Applying the Rasch model (Mahwah, New Jersey: Lawrence Erlbaum)
[12] Engelhard G Jr 2012 Invariant measurement (New York: Routledge Academic)
[13] Latour B 1987 Science in action (New York: Cambridge University Press)
[14] Latour B 2005 Reassembling the social (Oxford: Oxford University Press)
[15] Heilbron J L 1993 Historical Studies in the Physical and Biological Sciences 24 1-337
[16] Nersessian N J 2002 in Essays in the History and Philosophy of Science and Mathematics ed. Malament D (Lasalle, Illinois: Open Court) 129-166
[17] Wright B D 1999 in The New Rules of Measurement: What Every Educator and Psychologist Should Know, ed. Embertson S E and Hershberger S L (Hillsdale, New Jersey, Lawrence Erlbaum Associates) 65-104
[18] Engelhard G 2001 J App Meas 2 1-26
[19] Stenner A J, Smith M 1982 Perceptual and Motor Skills 55 415-426
[20] Stenner A J, Smith M, Burdick D S 1983 J Ed Meas 20 305-316
[21] Williamson G L, Fitzgerald J, Stenner A J 2013 Educational Researcher 42 59-69
[22] Burdick D S, Stone M H, Stenner A J 2006 Rasch Measurement Transactions 20 1059-1060
[23] Stenner A J, Stone M 2010 J App Meas 11 244-252
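As an illustration of the correspondence table and unit conversion discussed in paper KK_059, the following is a minimal sketch, not the manufacturer's method. It uses only the figures quoted in Note 1 (45 cavities, 0.2°F spacing, range 96°F to 104.8°F) and the conversion equation °F = °C × 9/5 + 32. It assumes, as one reading of the quoted description, that a count of k black cavities corresponds to the threshold of the k-th cavity; all function and constant names are hypothetical.

```python
# Figures quoted in Note 1; temperatures are held in tenths of a degree
# so that the 0.2 °F steps stay exact.
N_CAVITIES = 45
FIRST_CAVITY_TENTHS_F = 960  # 96.0 °F
STEP_TENTHS_F = 2            # 0.2 °F between adjacent cavities


def count_to_fahrenheit(black_count):
    """Map a count of black cavities to a reading in °F (assumed rule:
    the reading is the threshold of the last cavity that turned black)."""
    if not 1 <= black_count <= N_CAVITIES:
        raise ValueError("count is outside the instrument's measuring range")
    return (FIRST_CAVITY_TENTHS_F + STEP_TENTHS_F * (black_count - 1)) / 10


def fahrenheit_to_celsius(deg_f):
    """Invert the linking equation °F = °C * 9/5 + 32."""
    return (deg_f - 32) * 5 / 9


def celsius_to_fahrenheit(deg_c):
    """Apply the linking equation °F = °C * 9/5 + 32."""
    return deg_c * 9 / 5 + 32


# The same table applies to every instrument built to the same specification.
correspondence_table = {
    count: count_to_fahrenheit(count) for count in range(1, N_CAVITIES + 1)
}

reading_f = correspondence_table[14]
print(reading_f, round(fahrenheit_to_celsius(reading_f), 1))
```

Because the table follows from the specification alone, no person data enter the calibration, which is the point the paper makes about theory-calibrated instruments.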


Paper No KK_060
Paper Title Facilitator Training Needs in Malaysia Schools
Email Address [email protected]
1st Author Mohd Kashfi Mohd Jailani
Subsequent authors Raja Hamizah Raja Harun, Nurulhidayah Sukiman, Siti Sarah Baharom, Ismail Mohamad
1. Aims/Objectives of study: Facilitator positions were created to guide and consult school leadership, to guide teachers in implementing interesting, creative and innovative pedagogy, and to ensure that pupils in Year 3 acquire basic literacy (Bahasa Malaysia and English) and numeracy. This study aimed to identify the facilitator training needs and how these training needs differed among groups of facilitators.
2. Sample: The instrument was administered to facilitators including 220 School Improvement Partners (SIPartners+), 587 School Improvement Specialist Coaches+ (SISC+) and 377 FasiLINUS in District Education Offices/State Education Offices throughout Malaysia.
3. Method: The overall facilitator training needs in this study were analyzed using Rasch analysis. The research used a self-administered questionnaire, the Facilitator Competency Instrument. Analysis of the item-person map was carried out to identify the level of difficulty of items relative to persons.
4. Results: The results show that the majority of the facilitators performed well in all skills, indicating that they were able to master all the competencies of a facilitator. However, they are still lacking in Needs Analysis. The analysis identified the items that were most difficult to agree with. These items were grouped into several main focus areas: research methods, training design, implementation of needs analysis, and interpersonal courses / effective communication.
5. Conclusions: This study aimed to identify the facilitator training needs and how these training needs differed among groups of facilitators. Suggested training for SIPartners+ was a coaching and mentoring course, education research courses and effective communication. Suggested training for SISC+ was a Standard Quality of Malaysia Education (SQME) Standard 4 workshop, workshops on ICT, courses related to education policies and a hands-on workshop to analyze and interpret SQME Standard 4 data. Suggested training for FasiLINUS was a workshop on research methods, effective interpersonal and communication courses, and an ICT course.

Paper No KK_061
Paper Title Performance of Early Mathematics Achievement Test (UPAM) over time: Applying Rasch Measurement Racking
Email Address [email protected]
1st Author Dr. Connie Cassy Ompok
Affiliation UMS
Subsequent authors Dr. Ling Mei-Teng, Prof Vincent Pang
1. Aims/Objectives of study: This study aimed to see what performance indicators have changed over time.
2. Sample: The sample consisted of 170 P1-preschool children.
3. Method: Rasch Measurement Racking
4. Results: The instrument showed good Rasch analysis properties (PTMEA Corr. > 0; Infit and outfit mean square >2; Item separation and reliability = 6.55, 0.94; Person separation and reliability = 3.21, 0.91). The results show generally consistent changes in item difficulties after intervention. The difficulty logit values of all items decreased at Time 2. The mean of pre-test item difficulty was 1.61 and the mean of post-test item difficulty was -1.61, a difference of 3.22 logits. The effect size of the difference between the post-test and pre-test item difficulty was -1.28, which is considered large.
5. Conclusions: Rack analysis provides information at item level, allowing a distinction between items that have become easier, become more difficult or been maintained. Measuring change of difficulty at item level allows the researchers to identify the functioning items for both tests.
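The racking comparison in paper KK_061 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: item difficulties are taken as already estimated, the data are hypothetical, the 0.5-logit tolerance for calling an item "maintained" is an arbitrary choice, and Cohen's d with a pooled standard deviation is only one possible effect size (the abstract does not say which formula produced -1.28).

```python
from math import sqrt
from statistics import mean, stdev


def rack(pre_rows, post_rows):
    """Rack two administrations: each person stays one row, and the Time 1
    and Time 2 responses sit side by side as separate item columns."""
    return [pre + post for pre, post in zip(pre_rows, post_rows)]


def item_shifts(pre_difficulty, post_difficulty):
    """Post-test minus pre-test difficulty (logits) for each item."""
    return {
        item: post_difficulty[item] - pre_difficulty[item]
        for item in pre_difficulty
    }


def classify(shift, tolerance=0.5):
    """Label an item by its shift; `tolerance` is an assumed cut-off."""
    if shift < -tolerance:
        return "easier"
    if shift > tolerance:
        return "more difficult"
    return "maintained"


def cohens_d(pre_values, post_values):
    """Standardised mean difference using the pooled standard deviation."""
    pooled_sd = sqrt((stdev(pre_values) ** 2 + stdev(post_values) ** 2) / 2)
    return (mean(post_values) - mean(pre_values)) / pooled_sd


# Hypothetical item difficulties in logits, not the study's data.
pre = {"A": 2.0, "B": 1.0, "C": 0.0}
post = {"A": -1.0, "B": 1.0, "C": 1.0}

shifts = item_shifts(pre, post)
labels = {item: classify(shift) for item, shift in shifts.items()}
effect = cohens_d(list(pre.values()), list(post.values()))
racked = rack([[1, 0, 0]], [[1, 1, 0]])
print(labels, round(effect, 2))
```

A negative shift means the item became easier at Time 2, which matches the direction reported in the abstract (mean difficulty moving from 1.61 to -1.61 logits).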



Paper No KK_062
Paper Title Modernizing vs Ecologizing Approaches in Measurement
Email Address [email protected]
1st Author William P. Fisher, Jr.
Affiliation BEAR Center, University of California, Berkeley
Subsequent authors A. Jackson Stenner; MetaMetrics, Inc.
1. Aims/Objectives of study: Education, health care, human resource management, social services, and many other areas of life are marked by a kind of schizophrenia (Star & Ruhleder, 1996; Bateson, 1972) that emerges in terms of the dissonance between a caring focus on individual needs for learning and healing, on the one hand, and demands for accountability focused on standards and comparability. Support for the irrevocable concerns with the individual student's and patient's spontaneous processes of development and healing stands as an immovable thesis that is increasingly in opposition to the larger social antithesis of an imposed demand for evidence proving the achievement of quality standards. How might new institutional forms of social life resolving the schizophrenic break be formed at a higher level of system complexity? How might those forms of life synthetically integrate the necessary concern for development and healing with a new ecologizing bottom-up approach to accountability and standards that authentically embodies individual uniqueness?
2. Sample: Philosophers have long sought to grasp the multilevel nature of meaning (Star & Ruhleder, 1996; Bateson, 1972). Linguistic communication systems incorporate within-individual processes distinct from, but interacting with, mid-level processes between individuals, and which in turn are distinct from but interacting with high-level group processes. These levels of complexity in communication have informed practical applications in epidemiology (Susser & Susser, 1996) leading to new, productive relationships between clinical medicine and public health efforts (Bizouarn, 2016). Might not a similar kind of productivity be possible in education if we apply similar approaches to developmental, horizontal, and vertical coherence (Gorin & Mislevy, 2013; National Research Council, 2006; Wilson, 2004) issues in educational assessment? This kind of ecological approach to problems of coherence in educational assessment may be key to understanding learning in each distinct contextualizing niche of the various social environments in which it lives. Learning varies across these levels of complexity in ways that cannot be grasped directly from individual measures. What form might conceivably be taken by communications systems capable of supporting broad-scale efforts at sorting out the sources of distinct classes of effects on learning?
3. Method: Philosophers contrast modern, postmodern, and unmodern conceptions of science as being positivist, antipositivist, and postpositivist (Galison, 1997; Latour, 1990). Philosophically modernist conceptions of science are positivist in the sense of prioritizing a focus on data as the ultimate criterion of objectivity. Postmodernism, in contrast, is sensitized by historical changes in what data count as worthy of attention and so is concerned with the role of theory in making data salient. Unmodern (also known as amodern) postpositivist perspectives (Dewey, 2012; Latour, 1990; Latour, 1993) assert that the debate between modern and postmodern focuses too exclusively on the mutual implication of theory and data, and so will remain unresolved as long as the roles of instruments and knowledge technologies are not taken into account. Instruments encapsulate what is learned from data and what can be explained by theory. Unmodern philosophical perspectives and research in the history of science (Galison, 1997; Latour, 1990; Latour, 1993; Bud & Cozzens, 1992; Wise, 1995; Dear, 2012; O'Connell, 1993) focus on the collective cognition and team-based coordinations made possible when this embodied form is expressed in a uniform language distributed throughout a community of practice. Metrology's concern with measuring instruments traceable to unit standards then becomes a matter of focal interest as a way in which everyday model-based reasoning has been extended productively into science (Nersessian, 2012). Recent developments suggesting metrological paths forward for the constructs of psychology and the social sciences (Mari & Wilson, 2014; Pendrill & Fisher, 2015; Pendrill, 2014; Wilson et al., 2015; Fisher & Stenner, 2016) also extend everyday model-based reasoning (Fisher, 2004; Fisher, 2010) and open up new possibilities for enhanced innovation in education, health care and other fields. A significant problem that remains unaddressed is how varying levels of information complexity can be integrated into a new metrological culture encompassing all of the arts and sciences.
4. Results: Reading measures are linked together in an ecosystem that has capitalized on the literacy form of life that consistently asserts itself across samples of students,


texts, test items, time, and space (Fisher & Wilson, 2015). Niches in this ecosystem span a wide range of classrooms, schools, homes, libraries, testing agencies, and book publishers. Reading test item difficulties have been shown to be remarkably stable over decades of use (He & Kingsbury, 2016) and moreover can be predicted by an explanatory theory accounting for over 90 percent of the observed variance (Fisher & Stenner, 2016).

More than 100 English language reading tests across the world measure in a common unit. Over 30 million student measures annually are interpreted relative to 250,000 book measures and 200 million article measures, where matching student and text measures predict a 75 percent comprehension rate. Books, articles, assessments, and students have been brought into a common frame of reference in a process now over 27 years old and still accelerating. Text complexity corresponds with reading learning progressions such that student measures enable the individualization of instruction. Student measures are tracked over time and across grade levels, instantiating developmental coherence. Teachers are able to compare learning outcomes across their own and each other’s classes, realizing horizontal coherence. And in many locations, state end-of-year or graduation tests report in the common unit, providing parents, students, teachers, principals, librarians, researchers, and the public with the vertical coherence needed for connecting classroom formative assessments with accountability standards.

5. Conclusions: Instead of demanding strict conformity with item-based equating standards, then, it is likely more realistic and productive to think of standards in terms of shared information contextualized in a common theoretical and explanatory frame of reference (Stenner, et al., 2013). Instead of expecting all student measures to be produced from one set of items that fit one measurement model, individual response patterns can be displayed in instructionally relevant and developmentally coherent kidmaps with no need for reporting any scaling or statistics. Horizontally coherent statistical summaries of measures over time, within and across classrooms, will be reported to teachers and administrators in support of the local community of practice, in a unit comparable with summative accountability measures. These informationally coherent links stand as “potential” universals in partially interconnected, resonant, and multilevel traceability network ecosystems, bypassing the strictly local item-based problem-solution dependency and the universal problem-solution independence at the same time (Latour, 2005, p. 229). These ecologized “glocal” media, simultaneously local and global, are characterized by Ricoeur (Ricoeur, 1992, p. 289) as potential or inchoate universals. Dewey (1954, p. 215) similarly held that "The local is the ultimate universal, and as near an absolute as exists." The interconnections of metrological networks supporting local approximations and translations of standards is pointed to by Golinski (2012, p. 35) as replacing the uniform universality assumed in modern science. And Haraway (1996, pp. 439-440) suggests another account as to how locally embedded relationships offer an alternative to both relativism and transcendence.

The sustainability opportunities created within an ecologizing paradigm stem from the co-evolution of (a) concepts embodied in linguistic and measurement technologies and (b) the institutional rules, roles, and responsibilities within multilevel social, political, and economic ecologies (Hutchins, 2014; Miller & O'Leary, 2007). The end results are systems of tools embodying individually unique problem-solution unities that are useful in negotiating local particularities while still recognizable as belonging to an identifiable general class. These results suggest potentially large payoffs of new analogies from existing online engineering models of global cooperation enabling intelligent metrology applications (Durakbasa, Bauer, Bas & Riepl, 2015). Perhaps caring for our measuring technologies in education and other fields the same way we care for our children will yet lead to creation of forms of social life sensitive to the values and experiences of those who inhabit them.
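The KK_062 Results state that matching student and text measures predict a 75 percent comprehension rate. A minimal sketch of how such a forecast could work follows. It assumes a Rasch-type logistic link with a constant offset on a shared logit scale; the abstract does not give the functional form, and the function name and offset are hypothetical, chosen only so that a matched reader and text yields 0.75.

```python
import math

# Assumption: Rasch-type logistic link with a constant offset. The abstract
# does not state the functional form; this is illustrative only.
OFFSET = math.log(3.0)  # ln(3) makes a matched reader/text pair come out at 0.75


def forecast_comprehension(reader_logit: float, text_logit: float) -> float:
    """Forecast comprehension rate for a reader/text pair on one logit scale."""
    gap = reader_logit - text_logit + OFFSET
    return 1.0 / (1.0 + math.exp(-gap))


# Hypothetical measures: a matched pair, then the same reader with a harder text.
matched = forecast_comprehension(0.0, 0.0)
harder_text = forecast_comprehension(0.0, 1.0)
```

Under this assumption the forecast depends only on the gap between reader and text measures, which is what allows books, articles, assessments, and students to share one frame of reference.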


Paper No KK_063
Paper Title Rasch-based Test Equating: An Application of Winsteps in China
Email Address [email protected]
1st Author Wu Jinyu
Affiliation City , SAR
Subsequent authors Zhang Quan, City University of Macau, SAR

1. Aims/Objectives of study: In today’s testing practice, equating plays a central role and is held to be the prerequisite for item banking in computerized as well as Internet-based testing. Through equating, changes in item difficulty across test forms can be observed and the corresponding ability estimates from different occasions can be adjusted. Equating is a complicated process that requires enormous data processing, so manual calculation is by no means feasible. Our self-developed software Gitest, a Rasch-based DOS program, is limited in its ability to process a jumbled data matrix in a single run, which motivated the authors to seek another effective tool. Among the various kinds of computer software available for estimating item parameters and ability parameters and for test equating, Winsteps is a strong program to consider. This paper presents one significant aspect of Winsteps: parallel test equating based on a minimal yet representative group of data. It indicates a wide range of applications of Winsteps to practical test equating problems; the program assumes binary scoring of item responses and gives stable and accurate estimates of item parameters and scale scores for both long and short tests and for classroom exercises.

2. Sample: The results are based on 40 Chinese non-English-major students at a university in Zhongshan, Guangdong Province, China.

3. Method: The method used herein for test equating is the linking of separate test forms through common (linking) items, so that scores derived from tests administered separately to different test takers on different occasions will, after conversion (in our presentation, by Rasch analysis), be comparable on the same scale (Hambleton & Swaminathan, 1985; Gui, Li and Zhang, 1989).

4. Results: The key point is that the results obtained from Winsteps turn out to be exactly the same as those obtained from other software such as Gitest on the same data.

5. Conclusions: This shows that Winsteps is another good choice for Chinese scholars of language testing to consider.
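The common-item linking described in the KK_063 Method can be illustrated with a minimal sketch. It assumes the simplest Rasch linking rule, a mean shift computed over the common items, and uses made-up item difficulties. It is not the authors' procedure, nor the internal method of Winsteps or Gitest, and all names and values are hypothetical.

```python
from statistics import mean


def linking_constant(items_a: dict, items_b: dict) -> float:
    """Mean difficulty difference (form A minus form B) over the common items."""
    common = sorted(set(items_a) & set(items_b))
    if not common:
        raise ValueError("the two forms share no common (linking) items")
    return mean(items_a[name] - items_b[name] for name in common)


def to_form_a_scale(measures_b: dict, shift: float) -> dict:
    """Move form B measures (items or persons) onto the form A logit scale."""
    return {name: value + shift for name, value in measures_b.items()}


# Hypothetical item difficulties in logits; q1 to q3 appear on both forms.
form_a_items = {"q1": -0.50, "q2": 0.25, "q3": 1.00}
form_b_items = {"q1": -0.75, "q2": 0.00, "q3": 0.75, "q9": 1.50}

shift = linking_constant(form_a_items, form_b_items)
linked_b = to_form_a_scale(form_b_items, shift)
```

Because the Rasch scale is fixed only up to an origin, a single additive constant is enough under this assumption; the same shift would be applied to the person measures from form B.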


Paper No KK_064
Paper Title Validating the Usability Evaluation’s Instrument of Community Learning Centre Model (UEICLC) for Aboriginal in Tasik Chini, Pahang
Email Address [email protected]
1st Author Mazzlida Mat Deli
Affiliation Faculty of Education, National University of Malaysia
Subsequent authors Ruhizan Mohammad Yasin, Siti Mariam Dasman

1. Aims/Objectives of study: This study aims to produce an instrument for the usability evaluation of the elements of the Community Learning Centre Model (UEICLC) whose reliability and validity are established using the Rasch model.
Keywords: Rasch Measurement Model, evaluation, usability, community learning centre

2. Sample: Sixty community members of the Orang Asli in Tasik Chini, Pahang participated in this study.

3. Method: This study employed a quantitative approach to data collection and analysis. A survey was used to gather information on the perceptions of the Jakun Orang Asli community in Tasik Chini, Pahang towards the elements of the CLC Model. The data were analysed using Winsteps 3.80 to investigate the functioning of the rating scale categories, reliability and separation indices, unidimensionality, item polarity, goodness of fit, and the difficulty level of the items.

4. Results: Firstly, the original five-category rating scale does not function effectively; categories 1 and 2 should be combined to improve the threshold estimates between categories.
Secondly, the reliability values for items and persons are acceptable, and the separation indices are good, being greater than two.
Thirdly, the Rasch model showed that the UEICLC is a unidimensional scale: the raw variance explained by measures was more than 60% and the unexplained variance in the 1st contrast was below 15%.
Fourthly, all the items fit the model; however, 6 persons were deleted due to misfit.
Lastly, the mean of the items was slightly below the persons’ ability. The items in this scale are quite easy for the respondents, and there was also a big gap between some items.

5. Conclusions: This study produced a new Rasch-based measure for evaluating CLC programs, which provide an opportunity and space for the Aboriginal community to gain knowledge and skills in line with their beliefs and traditions.

Paper No KK_065
Paper Title Misconceptions in electricity via Rasch Analysis
Email Address [email protected]
1st Author Nazlinda Abdullah
Affiliation Universiti Teknologi MARA
Subsequent authors

1. Aims/Objectives of study: This preliminary study focuses on determining the suitability of DIRECT, a test developed in the United States that covers the topic of electricity, for identifying the misconceptions of Malaysian students. In addition, this study aims to compare the performance of various groups of Malaysian students.

2. Sample: This study involves 104 Malaysian students from various colleges, institutions and schools in the Klang Valley.

3. Method: This is preliminary research that uses the descriptive research method. A test on electricity named DIRECT was distributed to various colleges, institutions and schools at different periods of time.

4. Results: In general, results show that the person reliability is 0.52 while the item reliability is 0.95. The person mean shows a negative value at -0.30 logit, which means that the students found the test challenging. The targeting was found to be acceptable, with the item spread at 5 logits and the person spread over 3 logits. There is a need to increase the number of students with a wider range of ability. From the item frequency measure order table, the misconceptions which the students have on each item and area of electricity were identified. The misconceptions identified were like those found among American students. Potential difference and current were two common areas in which the students had problems. In addition, Malaysian students also faced challenges in energy and power. As for the performance of each group, the comparison can be read from the Wright Map by arranging each student according to their group. Among the six groups of students, those who were doing A-levels and pursuing a medical degree were the most capable.

5. Conclusions: The findings show that DIRECT is suitable for identifying Malaysian students’ misconceptions of electricity. In addition, the performance of each group was easily identified, including areas of strength and weakness.
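The reliability and separation figures reported in KK_064 and KK_065 are two views of the same quantity. The sketch below uses the standard Rasch relationship G = sqrt(R / (1 - R)) between a reliability R and its separation index G; the function names are illustrative and are not taken from Winsteps.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    """Separation index G implied by a reliability coefficient R."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation: float) -> float:
    """Reliability R implied by a separation index G (inverse of the above)."""
    return separation ** 2 / (1.0 + separation ** 2)


# Values reported in the KK_065 Results.
person_sep = separation_from_reliability(0.52)  # about 1.04
item_sep = separation_from_reliability(0.95)    # about 4.36

# A separation of two, the benchmark mentioned in KK_064, corresponds to R = 0.80.
benchmark_reliability = reliability_from_separation(2.0)
```

Read this way, a person reliability of 0.52 implies a person separation of roughly one, which is consistent with the KK_065 remark that a sample with a wider range of ability is needed.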


Paper No KK_066
Paper Title Develop, deploy, determine: Surveying assessment for learning in the Singapore secondary school context
Email Address [email protected]
1st Author Christopher C. Deneen
Affiliation National Institute of Education, Nanyang Technological University
Subsequent authors Gavin Fulmer, University of Iowa

1. Aims/Objectives of study:
Aim: To develop, utilize and obtain results from a survey instrument on assessment for learning (AfL) in the Singaporean secondary school context.
Objectives: To report findings on the:
1. Construction, piloting, adjustment and deployment of a survey instrument focusing on AfL perceptions, values, practices and proficiencies.
3. Outcomes of research that may inform the survey target (AfL) and the practice of objective measurement in educational contexts.

2. Sample: Pilot participants (n=163) consisted of Singaporean secondary school teachers enrolled in a master's degree program at the Singaporean National Institute of Education (NIE). NIE is the sole provider of teacher education degree programs in Singapore. The master's degree program is in curriculum and teaching and is not subject specific. This allowed pilot sampling to include responses from secondary school teachers teaching within multiple subjects at a broad cross-section of Singaporean schools. Only responses from teachers working at the secondary school level were included. Thus, pilot sampling was highly congruent with the intended sample/participant group for full survey deployment.

Participants in the full survey deployment (n=913, post data cleaning) consisted of teachers at 13 Singaporean secondary schools. School selection was planned as a distributed national representation. Three criteria were used: 1) academic achievement as evidenced through school-level performance on standardized tests, 2) socio-economic status of students, and 3) the stage the school had reached in Singapore-wide AfL policy implementation. The survey was deployed strategically and in cooperation with ministry officials and school leaders to allow for responses across disciplines and assure maximum response rate. An average response rate of 80% was achieved.

3. Method:
Pilot
Development of the pilot survey drew upon a) a review of relevant literature, b) the underlying theoretical framework of the research, c) careful analysis of similar surveys into AfL, and d) a context-specific analysis of Singaporean secondary school assessment and learning culture(s). In the full presentation, this last point will be discussed in some detail, as it is critical to understanding the piloting process.

Data from the pilot were subjected to both Rasch analysis (R software with the TAM package) and factor analysis (AMOS). The use of Rasch and factor analysis as complementary methods for survey development has been reported on previously by the first author. Results from this process were used to adjust items, scales, item groupings and parameters/factors.

From this was developed:
1. Stable factors: Alignment, Grading/Reporting, Actively Involving Students, and Sustaining Engagement.
2. A finalized structure for the survey with three main areas:
A. Demographic
B. Core I: Purposes of assessment
C. Core II: Values, practices and proficiencies in assessment

Full deployment
Once data from the full deployment were obtained, the following actions were taken:
1. Data cleaning and missing values analysis
2. EFA/CFA
3. Descriptive statistics for Cores I & II

4. Results: Inspection of exploratory results suggested that four factors were plausible. These models were tested for fit in AMOS, and after trimming, acceptable fit was found. Nine items across the three constructs align to a common factor, while another 13 are in the same factor for two of the three constructs. We conclude that, while not identical, the factors are conceptually similar. Based on content review of these items, we defined them as: (1) Alignment; (2) Sustaining Engagement; (3) Involving Students, including peer- and self-assessment (PASA);


and (4) Grading/Reporting.

Following this, we computed factor scores for each factor and examined the patterns of teachers’ mean agreement. The factors of Alignment, Sustaining Engagement, and Involving Students (PASA) are more valued by teachers than they are practiced; teachers report yet lower rates of self-proficiency. The one exception is Grading/Reporting, for which the teachers report similar practices and values, but again relatively lower levels of proficiency. Differences between the highest and lowest means are generally large (d>.80).

A clear pattern emerged from the data: To the degree that aligning assessment with curriculum, sustaining student engagement, and involving students in assessment are part of AfL, Singapore teachers in these schools already value AfL. This is an important realisation and corresponds with much research into teacher beliefs about assessment: teachers endorse assessment that helps improve teaching and student learning outcomes. However, teachers reported valuing three factors considerably more than they reported having proficiency or opportunity to carry them out. This suggests potential impedance to endorsement translating into impact.

A converse pattern emerged around Grading/Reporting, with the highest mean for Practice, but lowest mean for Value. Given the public examination structure in Singapore schooling, it is unsurprising that teachers emphasised frequency of summative activity. This would seem to confirm the impetus for attempting to boost AfL. Low assigned value also corresponds to research in other national contexts suggesting that teachers tend to negatively view assessment practices that could be used to label or blame students for poor performance. This tension may create a significant challenge for policy and practice initiatives attempting to achieve a balanced assessment approach. Attempting to achieve summative/formative balance in an environment in which formal examination, grading, and reporting are maintained tends to result in one policy being ‘hard’ (i.e., formal external accountability) and the other ‘soft’ (i.e., formative assessment for learning) (Kennedy, Chan, & Fok, 2011). This has implications well beyond Singapore, as educational systems in the United States and elsewhere attempt to achieve balance and resolve tensions in assessment (Berry, 2011).

Analysis demonstrates that while factors are not perfectly identical, they are sufficiently similar conceptually to allow common nomenclature and comparison. More importantly, some of the recovered factors mapped directly onto the intended structure. Notably, this occurred in the sustaining engagement, alignment, and involving students in assessment factors, which used sets of items that had been grouped in the original design. It is only the Grading/Reporting factor that draws items from across the original factors of Doing and Accountability. Upon inspection, the connection of items is logical. Values, proficiency, and practice of assessment exist in a complex relationship that may be studied further in these data.

5. Conclusions: This paper presents conclusions significant to survey use and to the topic under study: assessment for learning. In the full paper, the following conclusions are discussed in the context of research and development agendas of interest to a global measurement audience.

Assessment for Learning Conclusions
This study lends clarity to the tensions that have been identified in research into teacher perceptions of assessment; importantly, the differences in factor means provide potential directions for professional development in assessment for learning. The challenge now is to determine to what extent the lack of proficiency arises from deficient personal competencies and skills or from policy and priority conditions that are inimical to AfL. By focusing on the key element of stakeholder perceptions, we may be able to create links in AfL perceptions, policies and practices that have research, practice and development implications in Singapore as well as in any educational system negotiating the challenges of balancing assessment priorities.

Survey Use Conclusions
Complex surveys can be utilized in school settings, but they require significant work and support. This work and support includes:
1. Assembling a competent team able to shepherd the process from piloting, through full deployment and into interpretation of results.
2. Negotiating with schools to allow not only access but high, valid response rates.
3. Garnering the support of high-level decision-makers. Developing and utilizing a survey this complex would not have been possible without 'hard and soft' support from the Ministry of Education. This point is discussed in some detail as it especially has global impact on objective measurement in school settings.
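The KK_066 Results describe differences between the highest and lowest factor means as generally large (d>.80). As a rough illustration of that effect-size threshold, the sketch below computes Cohen's d with a pooled standard deviation. The abstract does not say which variant of d was used (a paired-samples version may be more appropriate for within-teacher comparisons), and the ratings shown are invented for illustration.

```python
import math
from statistics import mean, stdev


def cohens_d(x: list, y: list) -> float:
    """Cohen's d for two groups of scores, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


# Hypothetical agreement ratings (not data from the study).
value_ratings = [4.0, 5.0, 4.0, 5.0, 4.0, 5.0]
proficiency_ratings = [3.0, 4.0, 3.0, 4.0, 3.0, 4.0]

d = cohens_d(value_ratings, proficiency_ratings)
is_large = d > 0.80  # conventional threshold for a large effect
```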


Paper No KK_067
Paper Title Assessing Pedagogical Content Knowledge of the Particle Theory of Matter and Phase Change in Pre-service Science Teachers
Email Address [email protected]
1st Author Maryati
Affiliation Universitas Negeri Yogyakarta
Subsequent authors Zuhdan Kun Prasetyo ([email protected]), Universitas Negeri Yogyakarta
Insih Wilujeng ([email protected]), Universitas Negeri Yogyakarta
Bambang Sumintono ([email protected]), Universiti Malaya

1. Aims/Objectives of study: This research aims to assess the quality of pedagogical content knowledge (PCK) in pre-service secondary science teachers on a specified topic, the particle theory of matter and phase change, using many-facet Rasch measurement (MFRM).

2. Sample: The sample in this research consists of 16 pre-service secondary science teachers who are members of a professional teacher training programme, with 32 lesson plans and videotaped instructional sessions. These pre-service teachers were assessed by three expert assessors (lecturers), using an instrument prepared by the researchers.

3. Method: This is quantitative research that measures teachers’ PCK with a PCK rubric developed on the basis of Magnusson et al.’s component model. Measurement involved multiple raters, and the data were analyzed by many-facet Rasch measurement.

4. Results: Results indicate that the PCK of Indonesian pre-service secondary science teachers is still low, especially in knowledge of science curricula, knowledge of students’ understanding of science, and knowledge of instructional strategies.

5. Conclusions: Science teachers’ PCK in Indonesia, as a criterion of professional teaching, still needs to be improved, and the science teacher education curriculum must be reformed.

Paper No KK_069
Paper Title Modelling a Meaningful Hybrid eTraining for Diverse Learners using Rasch and SEM
Email Address [email protected]
1st Author Rosseni Din
Affiliation Universiti Kebangsaan Malaysia
Subsequent authors

1. Aims/Objectives of study: This study aimed at designing, developing and implementing a new hybrid meaningful e-training system, which was tested to generate a two-stage model for meaningful hybrid e-training. The early framework of the model guided development of a questionnaire to measure the meaningfulness of hybrid e-training. The questionnaire has three sections which assess (i) meaningful learning, (ii) hybrid e-training and (iii) learning style preference. Overall reliability analyses using Cronbach’s Alpha and the Rasch Model, in addition to expert reviews for the content validation of the questionnaire, suggested that the questionnaire is reliable and valid for measuring a meaningful hybrid e-training program. Data collected from 213 ICT trainers were subsequently tested with confirmatory factor analysis using AMOS software to obtain three best-fit measurement models from the three latent variables. Finally, structural equation modeling was applied to test the hypotheses.

2. Sample: 213 ICT trainers

3. Method: An iterative triangulation participatory design and validation method is used to structure the research, to show how all of the major parts of the research project (the respondents, the system, the measures) work together to address the central research questions. Various complementary research paradigms were engaged, because the research procedures used in educational research are multidisciplinary and multimethod. The emphasis given to a particular paradigm depends on the objective and on the six phases of the research design, which consist of the design, development and validation of the I-MeT system, measuring instruments and models using participative design and


validation method. Data collected were analysed using Rasch and Structural Equation Modelling.

4. Results: The results showed (i) distribution of major learning style preference among respondents, (ii) evidence of a five-dimension measurement model for hybrid e-training, (iii) evidence of a five-dimension measurement model for meaningful e-training, (iv) evidence of a five-dimension measurement model for learning style preference, (v) a strong relationship between hybrid e-training and meaningful e-training, (vi) a positive relationship between learning style preference and hybrid e-training and (vii) a negative relationship between learning style preference and meaningful learning.

5. Conclusions: This section consists of three parts. The first part presents the method contribution; the second part presents the implications for future research related to the theoretical or conceptual framework of a meaningful hybrid e-training. The third part provides several implications for the practical developments of theory, practice, and policy.

Paper No KK_070
Paper Title Exploration of the psychometric properties of Eternal Love Instrument (ELI) and validation of ELI Model: A Rasch Model Approach
Email Address [email protected]
1st Author Akbariah Mohd Mahdzir
Subsequent authors Norhayati Mohd Nor

1. Aims/Objectives of study: According to statistics provided by the Syariah Judiciary Department Malaysia (JKSM), the number of Muslim couples getting divorced rose in the past years from 20,916 in 2004 to 59,712 in 2014, 63,463 in 2015 and to 48,077 till 10th July 2016. Research to further understand this phenomenon is crucial to guide informed intervention. Hence, the Eternal Love Instrument (ELI) is proposed as one of the marriage status assessment instruments designed specifically for use with married couples in marriage counselling. The aim of this study was thus to investigate the psychometric properties of the Malay version of the Eternal Love Instrument (ELI) in a sample consisting of Malay married individuals (N = 500).

2. Sample: A sample consisting of Malay married individuals (N = 500)

3. Method: The Rasch Model was applied in the development process since it has been proven to help in the construction of valid, reliable and unbiased items pertaining to the attributes to be measured. A qualitative approach was used during the first stage, since this is the most suitable approach for conceptualizing what experts in Malaysia mean by everlasting marriage. During the second stage, important constructs that the researcher believes are able to measure everlasting marriage were identified. Items were created based on the Instrument Blueprint. ELI consisted of seven major constructs. ELI was distributed and the data were analysed.

4. Results: The findings were based on these aspects: (a) Construct definition: the spread of the items; (b) Summary of item difficulty and person ability: item-person map; (c) Item polarity: point-measure correlation index; (d) Fit statistics: infit and outfit; (e) Unidimensionality: RPCA; (f) Results that are consistent with the aims of measurement: Reliability and Separation; (g) Instrument usefulness: SEM as the guidance; (h) Test targeting: item and person mean within ± 2.0 SE; (i) Person fit; and (j) Usability of the measurement scale: category and step calibrations.

5. Conclusions: Items were then improved and the test was run again. The respondents consisted of married Malay individuals in Malaysia.
