Chinese Language Learning Resources, A Warehouse Architecture and An Information Exchange Repository

Gee Kin Yeo 杨宜瑾
School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543, Republic of Singapore
Email: [email protected]

Abstract: Warehousing learning resources will become increasingly important as computer and communications technology becomes pervasive among teachers and learners. Practitioners in business information systems have been dealing with the problem of warehousing data for at least a decade. This paper first reviews the lessons that can be learned from them when looking at Chinese language learning resources. There is also a general consensus that learning resources should be shared. We therefore also look at a resource sharing framework and discuss how it can be set up as an information exchange repository with semi-automated management capability. In conclusion, a call is made for voluntary providers to contribute to an information exchange repository of Chinese language learning resources.

Key Words: Resource warehouse, web engineering, knowledge repository, information retrieval, domain-specific ontology

Introduction

Learning systems that leverage new technologies have to deal with what is now called knowledge media as learning resources, even as these resources become more massive and heterogeneous. There are growing concerns about organizing and accessing massively heterogeneous yet related learning resources, whether on a school network or on the Internet; see, e.g., the work on the Semantic Web [SemanticWeb, Web] and Learning Object Metadata [LOM, Web]. In the business world, the same concerns about organizing and accessing massively heterogeneous yet related data have given rise to the concept of data warehousing. With data warehouses, business people are able to quickly obtain the status knowledge that supports their decision making. While speed may not be as critical for teachers accessing learning resources, the quality and applicability of the resources obtained are just as important as they are in business decision making. Since its inception slightly more than a decade ago, data warehousing has gone from theory to conventional wisdom; its techniques, however, have never been discussed outside the realm of business data. The first part of this paper reports our analysis of using a data warehouse architecture for warehousing Chinese language learning resources in a school network.

At the same time, more and more information about learning resources is now being disseminated on the Internet. We have proposed a resource sharing framework called DSIC (Domain Specific Information Clearinghouse) [Wong, 2001] as an exchange repository. The second part of this paper focuses on the mechanism of DSIC in managing information about Chinese language learning resources. Quality concerns suggest that ontology-based knowledge engineering should be adopted.

From Data Warehouse to Resource Warehouse

Bill Inmon, who pioneered the concept and popularized the term Data Warehouse, once said that a data warehouse is an architecture rather than a technology [Inmon, 1996]. According to the Merriam-Webster dictionary, an architecture is the manner in which the components of a computer or computer system are organized and integrated. A data warehouse is therefore not so much driven by developing technologies as it guides the use of those technologies in its organization. The concept of warehousing stems from the requirement for knowledge and information sharing and from the demand for quick access to quality data and resources. However, data warehousing in business relates closely to decision making, whereas resource warehousing in Chinese language learning applies purely to academic extraction. Resources retrieved from a warehouse are either selected for inclusion in a teaching package or discarded altogether; they do not serve as criteria for decision making.

A data warehouse is defined as a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions. We shall soon see that not all of these concepts apply to learning resources and, in particular, to Chinese language learning resources. A schematic drawing of an architecture for resource warehousing appears in Figure 1, which also illustrates the processes involved in building a warehouse.

[Figure 1 diagram. Labels recoverable from the drawing: Source (three instances), Resource Converter, Streaming of converted resources, Cleaning and Transformation, Resource Filtering and Data Integration, Integrated Resourcebase, Updates, Resource Warehouse, Resource Mart (two instances), Resource Access Environment.]

Figure 1 A Resource Warehouse Architecture

Chinese language learning resources are expected to come from linguists, teachers, and CAI developers. They may cover different aspects of the language, be presented in different formats, and possibly reside on different media. For example, there could be a text file containing Chinese characters and their stroke counts, a spreadsheet (Figure 2) containing a list of words with their 汉语拼音 (Hanyu Pinyin) and some examples of usage, and a video clip illustrating the correct stroke order for writing a Chinese character.

It is easy to recognize that while a learning resource is just as non-volatile as business operational data, the time dimension, which is important in data warehousing and requires special handling there, loses its significance here. In language resources particularly, most aspects of the language remain stable even in usage. It is, however, relatively more important to note the source of a particular piece of language resource; for example, the Chinese idiom ‘己所不欲,勿施于人’ (do not do to others what you would not want done to yourself) can be traced back to what Confucius said in 论语 (the Analects). For learning purposes, it would be more useful to label each resource with its application level. Thus, a dimension indicating ‘difficulty level’ would be more appropriate. Summary statistics are routinely generated from data warehouses to provide a bird's-eye view of cross-sectional data on which business decisions can be based. For example, a manager could decide to promote a salesperson because summary statistics show that he has secured the highest sales figures for the last five months. With learning resources, a summary report of how many poems in 红楼梦 (Dream of the Red Chamber) are written about flowers would not be as likely to make a teacher decide to assign 林黛玉's (Lin Daiyu's) poems as compulsory reading. It could, however, interest teachers in exploring further how many of these poems have 牡丹 (peony) as the flower.
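
The shift from a time dimension to source and difficulty-level dimensions can be illustrated with a small Python sketch. The record layout, the field names and the numeric difficulty scale below are assumptions made for illustration only; the paper does not prescribe a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageResource:
    content: str           # the language item itself
    source: str            # where the item originates (replaces the time dimension)
    difficulty_level: int  # hypothetical scale: 1 is the easiest


def at_or_below(resources, level):
    """Select the resources whose difficulty does not exceed `level`."""
    return [r for r in resources if r.difficulty_level <= level]


# The difficulty values here are invented for the example.
resources = [
    LanguageResource("己所不欲,勿施于人", source="论语", difficulty_level=5),
    LanguageResource("清楚", source="word list", difficulty_level=2),
]

print([r.content for r in at_or_below(resources, 3)])
```

Keeping the source as a field of every record lets a teacher trace an item back to its origin, while the difficulty level supports selection by learner level.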

Figure 2 An Example Chinese Language Resource File

The whole idea of data integration in data warehousing is to combine data selected from different data sources, resolve conflicts among the data, and bring the data into a consistent form. Tedious but important steps include data conversion or transformation, and data scrubbing. A data converter is responsible for transforming data into the warehouse format. It also unifies data references, e.g., “prft_marg” and “prof_mrg” would both be unified to “profit_margin”. For Chinese language resources, this would, among other things, translate to (1) unifying the internal code representation, from BIG5 to Unicode and from GB to Unicode; and (2) unifying “汉语” with “华语” and “中文” (all meaning ‘Chinese language’), or “试题” with “题目” (both meaning ‘test question’), etc. A data scrubber cleans data and assigns default values to missing cells. In our example Chinese language resource in Figure 2, line 1 contains a mistake, "一便" (which should be 一遍), and line 3 contains corrupted data. The resource scrubber would detect these errors before the usage examples go into the resource warehouse. Another essential step in data integration is to establish “fact” and “dimension” tables that consolidate data from different sources. Dimension tables identify, for example, products, suppliers, customers, and salespersons. Fact tables detail, for example, the sale of a product to a customer by a salesperson on a certain date. Integration of Chinese language learning resources would amount to establishing relationships between different language items. These relationships would eventually help a teacher, for example, to find example 20 in Figure 2 as a usage example for ‘清楚’, even though the original source may list that example only for ‘毕竟’.
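
The conversion, scrubbing and integration steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any actual resource warehouse: the choice of “汉语” and “试题” as preferred labels, the field names, the default value and the sample sentence are all hypothetical, and plain substring matching stands in for real word segmentation.

```python
# Hypothetical synonym table: variant label -> preferred label.
PREFERRED_LABEL = {"华语": "汉语", "中文": "汉语", "题目": "试题"}


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode legacy bytes (e.g. 'big5' or 'gb2312') into a Unicode string."""
    return raw.decode(encoding)


def unify_label(label: str) -> str:
    """Map a variant label to its preferred form; unknown labels pass through."""
    return PREFERRED_LABEL.get(label, label)


def scrub(record: dict, defaults: dict) -> dict:
    """Return a copy of `record` with missing or empty cells set to defaults."""
    cleaned = dict(record)
    for field, default in defaults.items():
        if not cleaned.get(field):
            cleaned[field] = default
    return cleaned


def index_examples(examples, vocabulary):
    """Map each vocabulary word to every example sentence that contains it."""
    index = {}
    for sentence in examples:
        for word in vocabulary:
            if word in sentence:  # stand-in for proper word segmentation
                index.setdefault(word, []).append(sentence)
    return index


print(to_unicode("漢語".encode("big5"), "big5"))
print(unify_label("华语"))
print(scrub({"word": "清楚", "pinyin": ""}, {"pinyin": "(missing)"}))

# An invented usage example that contains both words.
examples = ["他毕竟是老师,讲得很清楚。"]
print(index_examples(examples, ["毕竟", "清楚"]))
```

Because the index is built over every word in the vocabulary, a sentence that the source filed only under ‘毕竟’ also becomes retrievable under ‘清楚’, which is the kind of relationship the integration step is meant to establish. Detecting a miswritten word such as "一便" would need a dictionary or language model and is not attempted here.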

The ultimate presentation to the end user of a data warehouse, possibly through a data mart, is multi-dimensional, allowing data slicing and drilling to bring out different views and different granularities. For our Chinese language learning resource warehouse, we can think of a teacher being given the flexibility of looking up a 成语 (idiom) containing a certain 词 (word), together with an example of usage containing words that come from, say, the Primary 4 level. Obviously, this implies that word segmentation is a necessary part of the data integration process, along with extensive indexing based on its results.

A Resource Exchange Information Repository

The previous section discussed what needs to be done when learning resources are centralized for sharing. What about learning resources distributed over different school networks and even the Internet? Searching for appropriate resources is becoming one of the major activities of educational practitioners. The requirement for quick access to quality information has led us to develop a web-centric model of information acquisition and dissemination, called DSIC [Wong, 2001]. In a nutshell, a DSIC is a Web-based clearinghouse of information on domain-specific resources. In the DSIC, a proper taxonomy, or classification scheme, for the specific domain is first decided by the DSIC designers, who include domain experts, after which the resources are catalogued by the authors themselves. This ensures that the descriptions and categorization of the resources are adequate and meaningful, since the providers of the resources are themselves domain experts. In addition, because the DSIC is domain specific, the resources catalogued are relevant to the domain, thus resolving the irrelevance problem of search engines. At the same time, the scalability problem of web directories is resolved by an information agent built to automatically harvest, categorize and abstract information resources from the WWW. The information agent differs from the web indexing spiders employed by search engines in that, in addition to automatically categorizing and indexing the harvested resources, it also attempts to identify the providers of the harvested resources and invites them to catalogue the resources, thus reducing the problem of irrelevance to a minimum. Moreover, because a DSIC is restricted to a much smaller subset of the whole information space than web directories and search engines are, more regular checks can be performed automatically on the catalogued resources to ensure that they are still valid.

A schematic view of a DSIC in operation with resource providers and resource consumers is depicted in Figure 3. A prototype DSIC can be found on SGX [SGX, Web]. Figure 4 shows a first attempt at a taxonomy that may be applicable to Chinese language learning resources. It should be noted that we have extended the notion of a learning resource beyond the basic language learning items used in the illustrations in the last section. A learning resource could now be a packaged lesson, or an information source regarding conferences such as GCCCE2002.
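
The duties of the information agent described in this section (categorizing harvested resources, inviting providers to catalogue them, and rechecking validity) can be sketched as below. This is only an illustrative Python sketch and not the DSIC toolkit itself: the class, the function names, the keyword lists and the example.org URLs are hypothetical, keyword counting is a deliberately naive stand-in for automatic categorization, and the reachability test is passed in as a function so that no network access is assumed.

```python
from dataclasses import dataclass


@dataclass
class CatalogueEntry:
    url: str
    category: str                      # one value from the agreed taxonomy
    described_by_provider: bool = False
    valid: bool = True


def auto_categorize(text, keywords_by_category, fallback="其他"):
    """Pick the category whose keywords occur most often in the harvested text."""
    scores = {
        category: sum(text.count(k) for k in keywords)
        for category, keywords in keywords_by_category.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else fallback


def pending_invitations(entries):
    """URLs whose providers have not yet catalogued the resource themselves."""
    return [e.url for e in entries if not e.described_by_provider]


def recheck(entries, is_reachable):
    """Refresh the validity flag of every entry and return the valid ones."""
    for entry in entries:
        entry.valid = is_reachable(entry.url)
    return [e for e in entries if e.valid]


# Hypothetical keyword lists for two categories of 表现形式.
keywords = {"试题": ["试题", "选择题"], "论文": ["论文"]}
harvested = {
    "http://example.org/quiz": "本试题库包含选择题",
    "http://example.org/misc": "hello",
}
entries = [
    CatalogueEntry(url, auto_categorize(text, keywords))
    for url, text in harvested.items()
]
print([(e.url, e.category) for e in entries])
print(pending_invitations(entries))
print([e.url for e in recheck(entries, lambda url: url.endswith("quiz"))])
```

Entries start out as agent-categorized and stay on the invitation list until the provider confirms or corrects the description, which mirrors the division of labour between the agent and the resource authors.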

Figure 3. A DSIC in operation

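A set of toolkits has been developed to allow anyone to set up a clearinghouse for any specific domain and maintain it with minimal administration drudgery. Details can be found in [Wong, 2001] and [Yeo, 2002].

Column headings: 应用级别 (application level), 学习理论 (learning theory), 运行环境 (operating environment), 表现形式 (form of presentation), 专业课题 (professional topic).

应用级别 | 学习理论 | 运行环境 | 表现形式 | 专业课题
通用 | 行为理论 | 通用 | 课件 | 评测
学前教育 | 认知模型 | 因特网 | 课程 | 实践
小学 | 协作学习 | 局网 | 试题 | 研究
中学 | 创建架构 | 课室 | 题库 | 会议
大专院校 | 实验学习 | 实验室 | 多媒体素材 | 咨询
成人教育 | 探索学习 | 多媒体 | 软件工具 | 资源
师资训练 | 导航学习 | 虚拟现实 | 案例 | 演示
职业教育 | 其他 | 模拟 | 论文 | 设计
特殊教育 | | 协作 | 论文集 | 出版
其他 | | 竞赛 | 会议通告 | 管理
 | | 互动 | 期刊 | 编辑
 | | 游戏 | 其他 | 学会
 | | 角色扮演 | | 参考书目
 | | 其他 | | 其他

Figure 4. An Organized Information Exchange Repository for Learning Resources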

Metadata, Ontology-based Web Engineering and a Call for Participation

The importance of metadata [Milstead, 1999] in the effective management of large volumes of information resources has been noted by many. One of the biggest challenges facing the metadata manager in the data warehouse environment is crossing the technological barriers found in that environment: metadata needs to be shared across the many technologies in use. Without central control of metadata, there will never be any uniformity in the definition of data. This means that there will never be any consistency of processing across the organization, and the data warehouse will not be sharable. The same argument carries over to resource warehouses. Researchers around the world are already working on metadata standards for so-called Learning Objects [Suthers, 2001]. For an end user, the most important aspect of metadata in a resource warehouse is the thesaurus for identifying the kinds of resources in the warehouse. A thesaurus, or the taxonomy used in the classification, is also required in an information exchange repository such as a DSIC. As a well-known example, ERIC [ERIC, Web], the largest US-based information network for educators, uses a Thesaurus to disseminate information on the descriptors used in indexing its resources. Figure 4 in the last section presents a proposed top-level taxonomy, or classification scheme, for describing information about Chinese language learning resources. Our experience with the DSIC in experimental use [SGX, Web] shows that even a carefully designed taxonomy can be subject to massive misinterpretation, resulting in misclassified information. This eventually leads to disorganization of the exchange repository and degrades quality, which is the primary concern of this entire discussion.
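
Central control of metadata can be pictured as checking every submitted description against one shared taxonomy. The Python sketch below assumes a record is a simple mapping from facet to value and uses only a few of the facet values listed in Figure 4; the function name and the record layout are hypothetical.

```python
# Two facets and a few of their values, taken from the taxonomy in Figure 4.
TAXONOMY = {
    "应用级别": {"通用", "学前教育", "小学", "中学", "大专院校"},
    "表现形式": {"课件", "课程", "试题", "题库", "多媒体素材"},
}


def validate(record):
    """Return the (facet, value) pairs that the shared taxonomy does not recognise."""
    problems = []
    for facet, allowed in TAXONOMY.items():
        value = record.get(facet)  # None when the facet is missing
        if value not in allowed:
            problems.append((facet, value))
    return problems


# "试卷" is not a value of 表现形式 in the taxonomy, so it is reported.
print(validate({"应用级别": "小学", "表现形式": "试卷"}))
print(validate({"应用级别": "中学", "表现形式": "题库"}))
```

A check of this kind catches only values that fall outside the vocabulary. It cannot catch a valid term that a contributor has misinterpreted, which is why community agreement on the meaning of each term still matters.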

The notion of an ontology as a formally specified conceptualization shared by a community of practice is now well established and is applied in many areas, including knowledge management [Alavi, 1999] and information retrieval. A DSIC is community-based. It will not be effective without a community-accepted classification scheme. This suggests that ontology building [Kietz, 2000] with community effort is in order. We are starting to build a DSIC for Chinese language learning resources and would like to call for contributions of resources. Contributors are also expected to give feedback on the appropriateness of the classification taxonomy, such as the one shown in Figure 4. Ontology building can be a never-ending task. We hope, however, that with the active participation of colleagues, a steady state will be reached sooner, with the following results: an effective information exchange repository for Chinese language learning resources, and a working ontology of Chinese language learning resources that could be used in other applications. Please log on to http://www.comp.nus.edu.sg/~yeogk/ChineseDSIC.

References

[Alavi, 1999] Alavi, M and Leidner, D E “Knowledge Management Systems: Issues, Challenges, and Benefits”, Communications of the AIS, Vol. 1, 1999, pp. 1-37.

[ERIC, Web] ERIC http://www.eric.ed.gov/

[Inmon, 1996] Inmon, W H, Building the Data Warehouse (2nd Edition). Wiley, New York (1996)

[Kietz, 2000] Kietz, J, Volz, R and Maedche, A “Extracting a Domain-Specific Ontology from a Corporate Intranet”, Proceedings of CoNLL-2000 and LLL-2000, pp. 167-175, Portugal, 2000.

[Lawrence, 1999] Lawrence, S and Giles, L “Accessibility and Distribution of Information on the Web”, Nature, Vol. 400, pp. 107-109, 1999.

[LOM, Web] IEEE Std. 1484.12 (draft). Draft 4.1 of the Learning Objects Metadata (LOM), IEEE, Piscataway, N.J. USA, 2001. Available: http://ltsc.ieee.org/doc/wg12/LOM_WD4.,htm

[Milstead, 1999] Milstead, J. & Feldman, S. (1999). Metadata: Cataloging by any other name. Online 23(1), 24-31. Available: http://www.onlineinc.com/onlinemag/OL1999/milstead1.html

[SGX, Web] SGX http://sg.comp.nus.edu.sg/

[Suthers, 2001] Suthers, D D “Evaluating the Learning Object Metadata for K-12 Educational Resources”, Proceedings of the IEEE International Conference on Advanced Learning Technologies (CALT 2001), Madison, Wisconsin 2001.

[Wong, 2001] Wong, P Y “A Scalable Framework for Collaborating Web Clearinghouses”, Proceedings of the 10th International World Wide Web Conference, May 2001.

[SemanticWeb, Web] Semantic Web. http://www.w3.org/2001/sw

[Yeo, 2002] Yeo, G K “Organized Exchange Repositories for Learning Resources”, accepted for ICCE2002, Auckland, New Zealand, December 2002.