PROGRAM GUIDE

The 5th APWeb-WAIM International Joint Conference on Web and Big Data (APWeb-WAIM 2021), August 23-25, 2021, Guangzhou, China

Organizing Committee

General Chairs Yi Cai South China University of Technology, China Tom Gedeon Australian National University, Australia Qing Li Hong Kong Polytechnic University, China Baltasar Fernández Manjón UCM, Spain

Program Committee Chairs Leong Hou U University of Macau, China Marc Spaniol Université de Caen Normandie, France Yasushi Sakurai Osaka University, Japan

Workshop Chairs Yunjun Gao Zhejiang University, China An Liu Soochow University, China Xiaohui Tao University of Southern Queensland, Australia

Demo Chair Yanghui Rao Sun Yat-Sen University, China

Tutorial Chair Raymond Chi-Wing Wong Hong Kong University of Science and Technology, China

Industry Chair Jianming Lv South China University of Technology, China

Publication Chair Junying Chen South China University of Technology, China

Publicity Chairs Xin Wang Tianjin University, China Jianxin Li Deakin University, Australia

Local Arrangement Chairs Guohua Wang South China University of Technology, China Junying Chen South China University of Technology, China

Webmaster Jianwei Lu South China University of Technology, China

APWeb-WAIM Steering Committee Representative Yanchun Zhang Victoria University, Australia

Schedule at a Glance

Welcome Message from the General Chairs

On behalf of the Organizing Committee, it is our great pleasure to welcome you to The Fifth Asia Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data (APWeb-WAIM 2021).

APWeb and WAIM are two separate leading international conferences on research, development, and applications of Web technologies and database systems. Previous APWeb conferences were held in Beijing (1998), Hong Kong (1999), Xi’an (2000), Changsha (2001), Xi’an (2003), Hangzhou (2004), Shanghai (2005), Harbin (2006), Huangshan (2007), Shenyang (2008), Suzhou (2009), Busan (2010), Beijing (2011), Kunming (2012), Sydney (2013), Changsha (2014), Guangzhou (2015), and Suzhou (2016). Previous WAIM conferences were held in Shanghai (2000), Xi’an (2001), Beijing (2002), Chengdu (2003), Dalian (2004), Hangzhou (2005), Hong Kong (2006), Huangshan (2007), Zhangjiajie (2008), Suzhou (2009), Jiuzhaigou (2010), Wuhan (2011), Harbin (2012), Beidaihe (2013), Macau (2014), Qingdao (2015), and Nanchang (2016). Starting in 2017, the two conference committees agreed to launch a joint conference. The First APWeb-WAIM conference was held in Beijing (2017), the Second in Macau (2018), the Third in Chengdu (2019), and the Fourth in Tianjin (2020). With the increased focus on big data, the joint conference is expected to attract more professionals from different industrial and academic communities, not only from the Asia Pacific countries but also from other continents.

APWeb-WAIM 2021 will enable you to enjoy an outstanding program, exchange your ideas with leading researchers in various disciplines, and make new friends in the international science community. Highlights include four keynote talks on the latest exciting topics in Web and big data, ranging from the fundamental topic of core database systems to fast-growing artificial intelligence applications; a diverse range of tutorials and workshops; technical sessions with exciting talks and demonstrations; and social events.

We are grateful for the strong support of the Steering Committee of APWeb and WAIM, and we are honored to serve as General Chairs for such a unique joint conference. The conference would not have been possible without the dedication and hard work of all members of the Organizing Committee. The Program Committee Chairs, Leong Hou U (University of Macau, China), Marc Spaniol (Université de Caen Normandie, France), and Yasushi Sakurai (Osaka University, Japan), put tremendous effort into the creation of an exciting program. Many other individuals and organizations contributed to the success of this conference. We would like to acknowledge the efforts of the Workshop Chairs (Yunjun Gao, An Liu, and Xiaohui Tao), Tutorial Chair (Raymond Chi-Wing Wong), Demo Chair (Yanghui Rao), Industry Chair (Jianming Lv), Publication Chair (Junying Chen), Local Arrangement Chairs (Guohua Wang and Junying Chen), and Publicity Chairs (Xin Wang and Jianxin Li). In addition to the members of the Organizing Committee, many volunteers contributed to the success of the conference: they helped edit this conference booklet, assisted with conference arrangements and on-site setup, and handled many other important tasks. While it is difficult to list all their names here, we would like to take this opportunity to sincerely thank them all.

Last but not least, we would like to extend our most sincere congratulations to all authors and speakers for a job well done. We look forward to welcoming you in person, and we hope that you will enjoy APWeb-WAIM 2021!

General Chairs

Yi Cai (South China University of Technology, China) Tom Gedeon (Australian National University, Australia) Qing Li (Hong Kong Polytechnic University, China) Baltasar Fernández Manjón (UCM, Spain)

Welcome Message from the Program Committee Chairs

On behalf of the APWeb-WAIM 2021 Program Committee, we are delighted to welcome you to the virtual conference! For more than 20 years, Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) have attracted professionals from different communities related to Web and big data who share interests in interdisciplinary research, providing a venue to exchange ideas, experiences, and the underlying techniques and applications, including Web technologies, database systems, information management, software engineering, and big data.

The technical program of APWeb-WAIM 2021 features four keynotes by Prof. M. Tamer Özsu (David R. Cheriton School of Computer Science, University of Waterloo, Canada), Prof. Huan Liu (School of Computing and Augmented Intelligence, Arizona State University, USA), Prof. X. Sean Wang (School of Computer Science, Fudan University, China), and Prof. Xiaokui Xiao (School of Computing, National University of Singapore, Singapore), as well as a tutorial by Prof. Hongzhi Wang and Xiaoou Ding (both Harbin Institute of Technology, China). We are grateful to these distinguished scientists for their invaluable contributions to the conference program.

Our gratitude goes to the Program Committee members and external reviewers, whose technical expertise and dedication were crucial for the thorough assessment and selection of papers, and whose engagement made the whole process all the more pleasurable. During the double-blind review process, each paper submitted to APWeb-WAIM 2021 received at least three high-quality review reports. Based on these reviews, our Senior Program Committee members provided recommendations for each paper to support the difficult task of making acceptance decisions. Finally, out of 184 submissions in total, the conference accepted 44 full research papers (23.91%), 24 short research papers, and 6 demonstration papers. The peer-review process and selection secured the quality of the publications. The contributed papers address a wide range of topics, such as storage and indexing, data mining, data management, graph data, neural network applications, machine learning, knowledge graphs, text analysis, information extraction and retrieval, recommender systems, social networks, and spatio-temporal databases. Among a number of highly rated manuscripts, several candidates for best papers have been shortlisted for awards, with the final selection to be decided during the conference. In particular, we would like to thank Springer for its cash sponsorship of the APWeb-WAIM 2021 Awards.

In addition to the main conference program, we would also like to thank Qingpeng Zhang (City University of Hong Kong, China) and Xin Wang (Tianjin University, China) for organizing The Fourth International Workshop on Knowledge Graph Management and Applications (KGMA 2021); Ge Yu (Northeastern University, China), Baoyan Song (Liaoning University, China), Xiaoguang Li (Liaoning University, China), Linlin Ding (Liaoning University, China), and Yuefeng Du (Liaoning University, China) for organizing The Third International Workshop on Semi-structured Big Data Management and Applications (SemiBDMA 2021); and Tae-Sun Chung (Ajou University, Korea), Jianming Wang (Tiangong University, China), and Zhetao Li (Xiangtan University, China) for organizing The Second International Workshop on Deep Learning in Large-scale Unstructured Data Analytics (DeepLUDA 2021), all held in conjunction with APWeb-WAIM 2021.

We thank the General Co-Chairs, Yi Cai (South China University of Technology, China), Tom Gedeon (Australian National University, Australia), Qing Li (Hong Kong Polytechnic University, China), and Baltasar Fernández Manjón (UCM, Spain), for their patience, and we especially thank the organizing team of Prof. Yi Cai for their support. Yanchun Zhang, representing the Steering Committee of APWeb and WAIM, provided guidance and advice. Many thanks also go to all members of the Organizing Committee for their full support in preparing the conference, especially with respect to the website, publications, registration, and local arrangements, without which the conference could not have been put together.

Finally, the high-quality program would not have been possible without the authors who chose APWeb-WAIM for disseminating their findings. We would like to thank our authors whose valuable and novel contributions are essential for both the continued success of APWeb-WAIM and the advancement of technology for humanity.

Program Committee Co-Chairs

Leong Hou U University of Macau, China Marc Spaniol Université de Caen Normandie, France Yasushi Sakurai Osaka University, Japan

Keynotes

Keynote Speech 1: Approaches to Distributed RDF Data Management and SPARQL Processing Time: 09:00-10:20, August 23, 2021, Monday

Abstract: Resource Description Framework (RDF) has been proposed for modelling Web objects as part of developing the "semantic web", but its usage has extended beyond this original objective. As the volume of RDF data has increased, the usual scalability issues have arisen and solutions have been developed for distributed/parallel processing of SPARQL queries over large RDF datasets. RDF has also gained attention as a way to accomplish data integration, leading to federated approaches. In this talk I will provide an overview of work in these two areas.
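To make the data model concrete, here is a minimal, hypothetical Python sketch of RDF's triple model and of matching a SPARQL-style basic graph pattern against it. The `ex:` URIs and the `match` helper are invented for illustration and are not from the talk:

```python
# RDF data is a set of (subject, predicate, object) triples; a SPARQL
# basic graph pattern matches triples against terms, where '?x'-style
# terms are variables. All URIs below are made-up examples.

triples = {
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob",   "ex:knows", "ex:Carol"),
    ("ex:Alice", "ex:name",  '"Alice"'),
}

def match(pattern, triples):
    """Match one triple pattern; return a variable binding per matching triple."""
    results = []
    for t in triples:
        binding = {}
        for p_term, t_term in zip(pattern, t):
            if p_term.startswith("?"):
                binding[p_term] = t_term   # bind the variable
            elif p_term != t_term:
                break                      # constant term mismatch
        else:
            results.append(binding)
    return results

# Analogous to: SELECT ?who WHERE { ex:Alice ex:knows ?who }
print(match(("ex:Alice", "ex:knows", "?who"), triples))
# → [{'?who': 'ex:Bob'}]
```

Distributed SPARQL engines of the kind surveyed in the talk evaluate such patterns over triple sets partitioned across many machines.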

M. Tamer Özsu Professor, David R. Cheriton School of Computer Science, University of Waterloo, Canada Speaker Bio: M. Tamer Özsu is a University Professor of Computer Science at the David R. Cheriton School of Computer Science of the University of Waterloo. His research is in data management focusing on large-scale data distribution and management of non-traditional data. His publications include the book Principles of Distributed Database Systems (co-authored with Patrick Valduriez), which is now in its fourth edition. He has also edited, with Ling Liu, the Encyclopedia of Database Systems which is in its second edition. He is a Fellow of the Royal Society of Canada, American Association for the Advancement of Science, Association for Computing Machinery, and Institute of Electrical and Electronics Engineers. He is an elected member of the Science Academy of Turkey, and a member of Sigma Xi. He is the recipient of the 2022 IEEE Innovation in Societal Infrastructure Award, Computer Science Association Lifetime Achievement Award (2019), ACM SIGMOD Test-of-Time Award (2015), ACM SIGMOD Contributions Award (2006), and The Ohio State University College of Engineering Distinguished Alumnus Award (2008).

Keynote Speech 2: Democratizing the Full Data Analytics Software Stack Time: 10:40-12:00, August 23, 2021, Monday

Abstract: Data analysis and machine learning are complex tasks, involving a full stack of hardware and software systems: the usual compute systems, cloud computing and supercomputing systems, data collection systems, data storage and database systems, data mining and machine learning systems, and data visualization and interaction systems. A realistic and highly efficient data analytics and AI application often requires smooth collaboration among these different systems, which becomes a big technical hurdle, especially for non-computing professionals. The history of computing may be viewed as a technical democratizing process, which in turn brings huge benefits to society and its economy. The democratizing process for data analysis and machine learning has started to appear in various aspects, but it still needs research and development in multiple directions, including human-machine natural interaction, automated system selection and deployment, and automated workflow execution and optimization. It can be expected that this democratizing process will continue, and the research and development efforts of computer scientists are much needed.

X. Sean Wang Professor, School of Computer Science, Fudan University, China Speaker Bio: X. Sean Wang is a Professor at the School of Computer Science, Fudan University, a CCF Fellow, ACM Member, and IEEE Senior Member. His research interests include data analytics and data security. He received his PhD degree in Computer Science from the University of Southern California, USA. Before joining Fudan University in 2011 as the dean of its School of Computer Science and Software School, he served as the Dorothean Chair Professor in Computer Science at the University of Vermont, USA, and as a Program Director at the National Science Foundation, USA. He has published widely in the general area of databases and information security, and is a recipient of the US National Science Foundation CAREER award. He is currently Editor-in-Chief of the Springer journal Data Science and Engineering, serves on the steering committees of the IEEE ICDE and IEEE BigComp conference series, and is past Chair of the WAIM Steering Committee.

Keynote Speech 3: Striving for Socially Responsible AI in Data Science Time: 09:00-10:20, August 24, 2021, Tuesday

Abstract: AI has never been this pervasive and effective. AI algorithms are used in news feeds, friend/purchase recommendation, making hiring and firing decisions, and political campaigns. Data empowers AI algorithms and is then collected again for further training AI algorithms. We come to realize that AI algorithms have biases, and some biases might result in deleterious effects. Facing existential challenges, we explore how socially responsible AI can help in data science: what it is, why it is important, and how it can protect and inform us, and help prevent or mitigate the misuse of AI. We show how socially responsible AI works via use cases of privacy preservation, cyberbullying identification, and disinformation detection. Knowing the problems with AI and our own conflicting goals, we further discuss some quandaries and challenges in our pursuit of socially responsible AI.

Huan Liu Professor, School of Computing and Augmented Intelligence, Arizona State University, USA Speaker Bio: Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He was recognized for excellence in teaching and research in Computer Science and Engineering at ASU. His research interests include AI, data mining, machine learning, social computing, and data science, investigating problems that arise in real-world applications with high-dimensional data of disparate forms. He is a co-author of the textbook Social Media Mining: An Introduction, published by Cambridge University Press. He is Founding Field Chief Editor of Frontiers in Big Data and Specialty Chief Editor of its Data Mining and Management section, Editor in Chief of ACM TIST (Aug. 1, 2021 - ), and a founding organizer of the International Conference Series on Social Computing, Behavioral-Cultural Modeling, and Prediction. He is a Fellow of ACM, AAAI, AAAS, and IEEE.

Keynote Speech 4: Efficient Network Embeddings for Large Graphs Time: 10:40-12:00, August 24, 2021, Tuesday

Abstract: Given a graph G, network embedding maps each node in G into a compact, fixed-dimensional feature vector, which can be used in downstream machine learning tasks. Most of the existing methods for network embedding fail to scale to large graphs with millions of nodes, as they either incur significant computation cost or generate low-quality embeddings on such graphs. In this talk, we will present two efficient network embedding algorithms for large graphs with and without node attributes, respectively. The basic idea is to first model the affinity between nodes (or between nodes and attributes) based on random walks, and then factorize the affinity matrix to derive the embeddings. The main challenges that we address include (i) the choice of the affinity measure and (ii) the reduction of space and time overheads entailed by the construction and factorization of the affinity matrix. Extensive experiments on large graphs demonstrate that our algorithms outperform the existing methods in terms of both embedding quality and efficiency.
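As a rough reading of the recipe sketched in the abstract (not the speaker's actual algorithms), the affinity-construction step might look like the following plain-Python toy: sum the 1..k-step random-walk transition probabilities between nodes, yielding a matrix that could then be factorized into embeddings. The step count and the use of simple transition probabilities are our own simplifying assumptions:

```python
def transition_matrix(adj):
    """Row-stochastic transition matrix of a random walk on the graph."""
    n = len(adj)
    return [[adj[i][j] / max(sum(adj[i]), 1) for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def walk_affinity(adj, steps=3):
    """Affinity = sum of 1..steps-step random-walk transition probabilities."""
    p = transition_matrix(adj)
    power = p
    affinity = [row[:] for row in p]
    for _ in range(steps - 1):
        power = matmul(power, p)
        affinity = [[affinity[i][j] + power[i][j] for j in range(len(p))]
                    for i in range(len(p))]
    # Factorizing this matrix (e.g., by truncated SVD) would yield embeddings.
    return affinity

# Tiny path graph 0-1-2, given as an adjacency matrix
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
aff = walk_affinity(adj, steps=2)
print(aff[0])  # → [0.5, 1.0, 0.5]
```

The talk's challenge (ii) is precisely that materializing and factorizing such a dense n-by-n matrix is infeasible at millions of nodes, which is what the presented algorithms avoid.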

Xiaokui Xiao Professor, School of Computing, National University of Singapore, Singapore Speaker Bio: Xiaokui Xiao is a Dean's Chair Associate Professor at the School of Computing, National University of Singapore (NUS). His research focuses on data management, with special interests in data privacy and algorithms for large data. He received a Ph.D. in Computer Science from the Chinese University of Hong Kong in 2008. Before joining NUS in 2018, he was an associate professor at the Nanyang Technological University, Singapore.

Tutorial

Tutorial 1: Industrial Big Data Cleaning Time: 9:00-12:00, August 25, 2021, Wednesday

Abstract: Nowadays, industrial data is recognized as a valuable intangible asset. At the same time, the data collected in industrial settings suffers from various quality problems, so practicable data cleaning techniques are urgently needed in industrial applications. In this tutorial, we will synthesize and survey research and development in industrial data cleaning, including the industrial data cleaning workflow, error detection and error repair in industrial data, knowledge modelling, state-of-the-art data cleaning tools and systems, and the new challenges and opportunities brought by industrial applications.
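As a toy illustration of the detect-then-repair workflow the tutorial covers (the range rule and the interpolation strategy below are invented examples, not the tutorial's methods):

```python
# Flag sensor readings that violate a simple range rule, then repair each
# flagged reading by averaging its nearest valid neighbors.

def detect_errors(readings, lo, hi):
    """Indices of readings outside the valid range [lo, hi]."""
    return [i for i, v in enumerate(readings) if not (lo <= v <= hi)]

def repair(readings, lo, hi):
    """Replace out-of-range readings with the mean of valid neighbors."""
    fixed = list(readings)
    for i in detect_errors(readings, lo, hi):
        left = next((fixed[j] for j in range(i - 1, -1, -1)
                     if lo <= fixed[j] <= hi), None)
        right = next((fixed[j] for j in range(i + 1, len(fixed))
                      if lo <= fixed[j] <= hi), None)
        candidates = [v for v in (left, right) if v is not None]
        if candidates:
            fixed[i] = sum(candidates) / len(candidates)
    return fixed

temps = [21.0, 22.0, 999.0, 24.0]        # 999.0 is a sensor spike
print(repair(temps, lo=-40.0, hi=85.0))  # → [21.0, 22.0, 23.0, 24.0]
```

Real industrial cleaning pipelines replace the hard-coded range rule with learned or domain-knowledge constraints, which is one of the topics the tutorial surveys.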

Hongzhi Wang and Xiaoou Ding Harbin Institute of Technology

Conference Sessions

Research Session 1: Graph Mining Time: 13:30-15:10, August 23, 2021, Monday Chair: Qingbao Huang

Co-Authorship Prediction Based on Temporal Graph Attention Dongdong Jin, Peng Cheng, Xuemin Lin, and Lei Chen Hong Kong University of Science and Technology Abstract. Social network analysis has recently received significant interest from researchers, and co-authorship prediction is an important link prediction problem. Traditional models use multi-relational information inefficiently to enhance topological features. In this paper, we focus on co-authorship prediction in co-authorship knowledge graphs (KGs) to show that multi-relation graphs can enhance feature expression ability and improve prediction performance. Currently, the main models for link prediction in KGs are based on KG embedding learning, such as several models using convolutional neural networks and graph neural networks. These models capture rich and expressive embeddings of entities and relations, and obtain good results. However, real co-authorship KGs contain much temporal information, which cannot be integrated by these models since they are aimed at static KGs. Therefore, we propose a temporal graph attention network to model the temporal interactions between neighbors and encapsulate the spatiotemporal context information of the entities. In addition, we also capture the semantic information and multi-hop neighborhood information of the entities to enrich the expression ability of the embeddings. Finally, our experimental evaluations on all datasets verify the effectiveness of our approach based on the temporal graph attention mechanism, which outperforms the state-of-the-art models.

Degree-specific Topology Learning for Graph Convolutional Network Jiahou Cheng, Mengqing Luo, Xin Li, and Hui Yan Nanjing University of Science and Technology Abstract. Graph Convolutional Networks (GCNs) have gained great popularity in various graph data learning tasks. Nevertheless, GCNs inevitably encounter over-smoothing as the network depth increases, and over-fitting, especially on small datasets. In this paper, we present an experimental investigation which clearly shows that the misclassification rates of nodes with high and low degrees are distinctly different. Thus, previous methods such as DropEdge, which randomly removes edges between arbitrary pairs of nodes, are far from optimal or even unsatisfactory. To bridge the gap, we propose degree-specific topology learning for GCNs to find a hidden graph structure that augments the initial graph structure. In particular, instead of training with uncertain connections, we remove edges between high-degree nodes and their first-order neighbors with low similarity in the attribute space; in addition, we add edges between low-degree nodes and their unconnected neighbors with high similarity in the attribute space. Experiments conducted on several popular datasets demonstrate the effectiveness of our topology learning method.
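A hypothetical sketch of the rewiring rule described in the abstract, with invented degree and similarity thresholds and cosine similarity standing in for the paper's attribute-space similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rewire(edges, feats, high_deg=3, low_deg=1, drop_below=0.3, add_above=0.9):
    deg = {n: 0 for n in feats}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Drop: edges touching a high-degree node whose endpoints are dissimilar.
    kept = {(u, v) for u, v in edges
            if not ((deg[u] >= high_deg or deg[v] >= high_deg)
                    and cosine(feats[u], feats[v]) < drop_below)}
    # Add: connect low-degree nodes to highly similar non-neighbors.
    for u in feats:
        if deg[u] <= low_deg:
            for v in feats:
                if u != v and (u, v) not in kept and (v, u) not in kept \
                        and cosine(feats[u], feats[v]) > add_above:
                    kept.add((min(u, v), max(u, v)))
    return kept

feats = {0: [1.0, 0.0], 1: [1.0, 0.1], 2: [0.0, 1.0], 3: [1.0, 0.0]}
edges = [(0, 1), (0, 2), (0, 3)]
print(sorted(rewire(edges, feats)))  # → [(0, 1), (0, 3), (1, 3)]
```

In the toy run, hub node 0 loses its edge to the dissimilar node 2, while the low-degree, highly similar nodes 1 and 3 gain a new edge, mirroring the two rules stated in the abstract.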

Simplifying Graph Convolutional Networks as Matrix Factorization Qiang Liu, Haoli Zhang, and Zhaocheng Liu Tsinghua University, China Abstract. In recent years, substantial progress has been made on Graph Convolutional Networks (GCNs). However, computing a GCN usually requires a large memory space to keep the entire graph. In consequence, GCN is not flexible enough, especially for large-scale graphs in complex real-world applications. Fortunately, for transductive graph representation learning, methods based on Matrix Factorization (MF) naturally support constructing mini-batches, and thus are more friendly to distributed computing than GCN. Accordingly, in this paper, we analyze the connections between GCN and MF, and simplify GCN as matrix factorization with unitization and co-training. Furthermore, under the guidance of our analysis, we propose an alternative model to GCN named Unitized and Co-training Matrix Factorization (UCMF). Extensive experiments have been conducted on several real-world datasets. On the task of semi-supervised node classification, the experimental results illustrate that UCMF achieves similar or superior performance compared with GCN. Meanwhile, distributed UCMF significantly outperforms distributed GCN methods, which shows that UCMF can greatly benefit complex real-world applications.
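To illustrate the intuition behind connecting GCNs and matrix factorization: with nonlinearities removed, a K-layer GCN collapses to multiplying the node features by a normalized adjacency matrix K times, and it is this linear-algebraic form that makes an MF reformulation possible. A toy sketch of that propagation (row normalization is used here for simplicity; this is our illustration, not the paper's UCMF model):

```python
def normalize_adj(adj):
    """Row-normalized adjacency with self-loops: D^-1 (A + I)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    return [[a_hat[i][j] / sum(a_hat[i]) for j in range(n)] for i in range(n)]

def propagate(adj, feats, k=2):
    """Apply k rounds of (linear) GCN-style neighborhood averaging."""
    p = normalize_adj(adj)
    h = [row[:] for row in feats]
    for _ in range(k):
        h = [[sum(p[i][m] * h[m][j] for m in range(len(p)))
              for j in range(len(h[0]))] for i in range(len(p))]
    return h

adj = [[0, 1], [1, 0]]       # two connected nodes
feats = [[1.0], [0.0]]       # one scalar feature per node
print(propagate(adj, feats, k=1))  # → [[0.5], [0.5]]
```

Because the propagated matrix depends only on fixed graph structure, it can be precomputed or factorized node-by-node, which is what makes mini-batching natural for MF-style methods.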

GRASP: Graph Alignment through Spectral Signatures Judith Hermanns, Anton Tsitsulin, Marina Munkhoeva, Alex Bronstein, Davide Mottin, and Panagiotis Karras Aarhus University Abstract. What is the best way to match the nodes of two graphs? This graph alignment problem generalizes graph isomorphism and arises in applications from social network analysis to bioinformatics. Existing solutions either require auxiliary information such as node attributes, or provide a single-scale view of the graph by translating the problem into aligning node embeddings. In this paper, we transfer the shape-analysis concept of functional maps from the continuous to the discrete case, and treat the graph alignment problem as a special case of the problem of finding a mapping between functions on graphs. We present GRASP, a method that captures multiscale structural characteristics from the eigenvectors of the graph’s Laplacian and uses this information to align two graphs. Our experimental study, featuring noise levels higher than anything used in previous studies, shows that GRASP outperforms state-of-the-art methods for graph alignment across noise levels and graph types.

FANE: A Fusion-based Attributed Network Embedding Framework Guanghua Li, Qiyan Li, Jingqiao Liu, Yuanyuan Zhu, and Ming Zhong Wuhan University Abstract. Network embedding, which learns a low-dimensional representation for each node in a network, has been proved to be highly effective for a variety of downstream tasks. In this paper, we propose a novel Fusion-based Attributed Network Embedding framework (FANE), which consists of two modules. The first is the feature-learning module, in which we propose a general and scalable method SparseAE to embed different types of information (structure, attribute, etc.) into separate low-dimensional vectors. The second is the feature-fusion module, which learns the fused embedding vector to capture the underlying relationships between different types of information for downstream prediction tasks. Extensive experiments on multiple real-world datasets show that our method can outperform several state-of-the-art methods in many downstream tasks, including node classification and link prediction.

Research Session 2: Data Mining Time: 13:30-15:10, August 23, 2021, Monday Chair: Raymond WONG

What Have We Learned from Open Review? Gang Wang, Qi Peng, Yanfeng Zhang, and Mingyang Zhang Northeastern University Abstract. Anonymous peer review is used by the great majority of computer science conferences. OpenReview is a platform that aims to promote openness in the peer review process: papers, (meta-)reviews, rebuttals, and final decisions are all released to the public. We collect 5,527 submissions and their 16,853 reviews from the OpenReview platform. We also collect these submissions’ citation data from Google Scholar and their non-peer-reviewed versions from arXiv.org. From a deep analysis of these data, we draw several interesting findings that help in understanding the effectiveness of the publicly accessible double-blind peer review process. Our results can potentially help with writing a paper, reviewing it, and deciding on its acceptance.

Unsafe Driving Behavior Prediction for Electric Vehicles Jiaxiang Huang, Hao Lin, and Junjie Yao East China Normal University Abstract. Electric vehicles have become increasingly available in recent years. With the revolutionary motors and electric modules within electric vehicles, their instant reactions bring not only an improved driving experience but also unexpected unsafe driving incidents. Unsafe driving behavior prediction is a challenging task due to the complex spatial and temporal scenarios involved. However, the rich sensor data collected in electric vehicles sheds light on possible driving behavior profiling. In this paper, based on a recent electric vehicle dataset, we analyze and categorize unsafe driving behaviors into several classes. We then design a deep learning based multi-feature fusion approach for the unsafe driving behavior prediction framework. The proposed approach is able to distinguish unsafe behaviors from normal ones. Improved performance is also demonstrated in the feature analysis of different unsafe behaviors.

Resource Trading with Hierarchical Game for Computing-Power Network Market Qingzhong Bao, Xiaoxu Ren, Chunfeng Liu, Xin Wang, and Xiaofei Wang Tianjin University Abstract. As the Internet of Things (IoT) and artificial intelligence (AI) continue to evolve, they have spawned a proliferation of heterogeneous data volumes and model complexity. Centralized cloud computing can no longer meet the computing needs of an intelligent society. Currently, the development of edge computing and next-generation network technologies has enabled people to gradually utilize the computing power at the edge of the network to address the above challenges. As a novel resource-integration scheme, the computing-power network has emerged, using high-efficiency networks to integrate computing resources from the end, edge, and cloud to provide unified services to the outside world. However, most current research focuses on the end-edge-cloud collaborative computing framework, and there is a lack of in-depth discussion of the market models and resource trading of the computing-power network. In this paper, we propose a computing-power market framework and formulate resource trading as a three-stage Stackelberg game. We prove the existence of a Stackelberg equilibrium (SE) in the game. Then the dynamic-game reinforcement learning (DG-RL) algorithm is designed to solve the optimization problem. The experimental results verify the feasibility of the framework and the excellent performance of the proposed algorithm.

Analyze and Evaluate Database-Backed Web Applications with WTool Zhou Zhou and XuJia Yao Shanghai Jiao Tong University Abstract. Web applications demand low latency. In database-backed web applications, latency is sensitive to the efficiency of database access. Hence, previous works have proposed various techniques to optimize database access performance. However, the effectiveness of these techniques remains unverified when applied to real-world applications. The reason is twofold. First, the benchmarks used to evaluate the methods in the literature differ from real applications. Second, the diversity of applications makes it hard to predict whether a specific application will benefit from a certain technique. To this end, this paper presents WTool, a tool that can automatically analyze and evaluate the database access of a specific web application. It first collects SQL queries in the application, then extracts information about the queries for analysis purposes. Furthermore, WTool is also able to generate configurable benchmark scripts based on the collected queries. Users can run these scripts to simulate the database access of a specific application for performance evaluation. To demonstrate the usage of WTool, we analyze 16 open-source web applications. We introduce several simple optimizations based on the analysis and evaluate them with the generated benchmark scripts. The results show that query throughput can be improved by up to 7x.
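As a small illustration of the query-collection step the abstract describes (a generic Python/sqlite3 toy, not WTool itself), a database driver's trace hook can record every SQL statement an application issues for later analysis or replay:

```python
import sqlite3

collected = []

# isolation_level=None puts the connection in autocommit mode, so the trace
# sees only the statements we issue (no implicit BEGINs).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.set_trace_callback(collected.append)  # record every executed statement

conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
rows = conn.execute("SELECT name FROM users WHERE id = 1").fetchall()
conn.close()

print(rows)       # → [('alice',)]
print(collected)  # the raw SQL text of the statements, in execution order
```

A tool like WTool would then analyze such a query log and turn it into configurable benchmark scripts that replay the application's access pattern.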

Semi-supervised Variational Multi-view Anomaly Detection Shaoshen Wang, Ling Chen, Farookh Hussain, and Chengqi Zhang University of Technology Sydney Abstract. Multi-view anomaly detection (multi-view AD) is a challenging problem due to inconsistent behaviors across multiple views. Meanwhile, learning useful representations with little or no supervision has attracted much attention in machine learning. There have been many recent advances in representation learning based on deep generative models, such as the Variational Auto-Encoder (VAE). In this study, by utilizing the representation learning ability of VAE and manipulating the latent variables properly, we propose a novel Bayesian generative model as a semi-supervised multi-view anomaly detector, called MultiVAE. We conduct experiments to evaluate the performance of MultiVAE on multi-view data. The experimental results demonstrate that MultiVAE outperforms the state-of-the-art competitors across popular datasets for semi-supervised multi-view AD. As far as we know, this is the first work that applies VAE-based deep models to multi-view AD.

A Graph Attention Network Model for GMV Forecast on Online Shopping Festival Qianyu Yu, Shuo Yang, Zhiqiang Zhang, Ya-Lin Zhang, Binbin Hu, Ziqi Liu, Kai Huang, Xingyu Zhong, Jun Zhou, and Yanming Fang Ant Group Abstract. In this paper, we present a novel Graph Attention Network based framework for GMV (Gross Merchandise Volume) forecast on online shopping festivals, called GAT-GF. Based on well-designed retailer-customer and retailer-retailer graphs, we employ a graph neural network based encoder with multi-head attention and self-attention mechanisms to comprehensively capture the complicated structure between consumers and retailers, followed by a two-way regression decoder for effective prediction. Extensive experiments on real promotion datasets demonstrate the superiority of GAT-GF.

Suicide Ideation Detection on Social Media during COVID-19 via Adversarial and Multi-task Learning Jun Li, Zhihan Yan, Zehang Lin, Xingyun Liu, Hong Va Leong, Nancy Xiaonan Yu, and Qing Li The Hong Kong Polytechnic University Abstract. Suicide ideation detection on social media is a challenging problem due to its implicitness. In this paper, we present an approach to detect suicide ideation on social media based on a BERT-LSTM model with Adversarial and Multi-task learning (BLAM). More specifically, BLAM combines a BERT model with a Bi-LSTM model to extract deeper and richer features. Furthermore, emotion classification is utilized as an auxiliary task for multi-task learning, which enriches the extracted features with emotion information that enhances the identification of suicide ideation. In addition, BLAM generates adversarial noise through adversarial learning, improving the generalization ability of the model. Extensive experiments on our collected Suicide Ideation Detection (SID) dataset demonstrate the competitive superiority of BLAM compared with state-of-the-art methods.

Research Session 3: Data Management Time: 13:30-15:10, August 23, 2021, Monday Chair: Haibo HU

An Efficient Bucket Logging for Persistent Memory Xiyan Xu and Jiwu Shu Tsinghua University Abstract. Logging is widely used to provide atomicity and durability for transactions in database management systems (DBMSs). For decades, traditional logging protocols for disk-oriented database storage engines have focused on trading off data persistence against performance loss, due to the large performance gap and access-granularity mismatch between dynamic random-access memory (DRAM) and disks. With the development of persistent memory (PM), especially the release of the commercial Optane DC Persistent Memory Module (Optane DCPMM), a new class of storage engine that employs PM as its primary storage has emerged. The disk-based logging protocol is not suitable for these PM-aware storage engines, since PM provides data persistence and has latency comparable to DRAM. In this paper, we design and implement an efficient logging protocol for PM-aware storage engines: Bucket Logging (BKL). BKL uses a per-transaction log structure (i.e., a bucket) to store logs internally and ensures efficient writing of metadata and logs. Benefiting from multi-version concurrency control, BKL only records small fixed-size log entries to implement fast logging and crash recovery. Moreover, we optimize our design based on a basic performance evaluation of Optane DCPMM. We implement a micro storage engine in MariaDB and use YCSB to evaluate BKL's performance on Optane DCPMM. The results show that the storage engine with BKL achieves 1.5×-7.1× higher throughput than InnoDB under write-heavy workloads. Compared with other logging protocols, BKL achieves higher throughput and better scalability, and reduces recovery time by 1.4×-11.8×.

Data Poisoning Attacks on Crowdsourcing Learning Pengpeng Chen, Hailong Sun, and Zhijun Chen Beihang University Abstract. Understanding and assessing the vulnerability of crowdsourcing learning against data poisoning attacks is key to ensuring the quality of classifiers trained from crowdsourced labeled data. Existing studies on data poisoning attacks focus only on exploring the vulnerability of crowdsourced label collection. In fact, the main concern in crowdsourcing learning is the performance of the trained classifier rather than the quality of the labels themselves. Nonetheless, the impact of data poisoning attacks on the final classifiers remains underexplored to date. We aim to bridge this gap. First, we formalize the problem of poisoning attacks, where the objective is to sabotage the trained classifier maximally. Second, we transform the problem into a bilevel min-max optimization problem for the typical learning-from-crowds model and design an efficient adversarial strategy. Extensive validation on real-world datasets demonstrates that our attack can significantly decrease the test accuracy of trained classifiers. We verify that the labels generated with our strategy transfer to attacks on a broad family of crowdsourcing learning models in a black-box setting, indicating its applicability and potential to be extended to the physical world.

Dynamic Environment Simulation for Database Performance Evaluation Chunxi Zhang, Rong Zhang, Qian Su, and Aoying Zhou Nankai University Abstract. The wide popularity and maturity of cloud platforms promote the development of cloud-native database systems. On-demand resource configuration is an attractive feature of cloud platforms, but its complexity in resource management challenges the benchmarking of database performance, which is no longer performed in a stand-alone test environment. Sharing or contention of resources aggravates the dynamics of the environment, which can significantly influence database performance. In order to expose the real performance in production environments, environment simulation is a prerequisite for benchmarking databases. Although Docker containers have been promoted to isolate resources, true resource isolation still cannot be achieved. In this paper, we first define four kinds of workload generators corresponding to the key environmental dimensions, then build a multi-order polynomial regression model to calculate the correlation among workloads and simulate the dynamic changes of the environment. This is the first work to provide a complete and dynamic simulation of the environment. We conduct comprehensive experiments on an open-source DBMS by running standard benchmarks to verify the effectiveness of our work.
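The workload-correlation modeling step can be illustrated with a toy polynomial regression. The measurements, units, and polynomial degree below are invented; the paper's actual model and its dimensions are not reproduced here:

```python
import numpy as np

# Hypothetical measurements: background workload intensity vs. the
# observed drop in database throughput (arbitrary units).
intensity = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
tput_drop = np.array([0.0, 0.05, 0.18, 0.41, 0.72, 1.1])

# Fit a second-order polynomial model of the interference effect.
coeffs = np.polyfit(intensity, tput_drop, deg=2)
model = np.poly1d(coeffs)
print(model(0.5) > model(0.1))  # True: more contention, more slowdown
```

A fitted model like this can then drive a workload generator that reproduces the interference observed in production.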

LinKV: an RDMA-enabled KVS for High Performance and Strict Consistency under Skew Xing Wei, Huiqi Hu, Xuan Zhou, and Aoying Zhou East China Normal University Abstract. We present LinKV, a novel distributed key-value store that leverages RDMA networking to simultaneously provide high performance and strict consistency (i.e., per-key linearizability) for skewed workloads. To avoid the potential performance loss caused by load imbalance under skew, existing solutions replicate popular items into different nodes' caches to support quick and even accesses. But for writes hitting the cache, multiple consistency actions are required to guarantee linearizability, which degrades overall performance. In this paper, we present a batch method that lets multiple writes amortize the overhead of a round of consistency actions. For reads, we introduce a lease-based scheme that lets them quickly return the most recently completed batches of writes. Compared to state-of-the-art solutions, LinKV with the above strategies improves throughput by 1.5×-3× and reduces latency to about 10% under skew across different write ratios.

Cheetah: An Adaptive User-space Cache for Non-volatile Main Memory File Systems Tian Yan, Linpeng Huang, and Shengan Zheng Shanghai Jiao Tong University Abstract. Over the past decade, most NVMM file systems were designed without detailed knowledge of real NVDIMMs. With the release of Intel Optane DC Persistent Memory, researchers found that the performance characteristics of real NVMM differ greatly from their expectations. The design decisions they made lead to limited scalability, significant software overhead, and severe write amplification. We present Cheetah, a user-level cache designed to improve the overall performance of existing NVMM file systems. Cheetah leverages the unique characteristics of Intel Optane DC Persistent Memory to design a fine-grained data block allocation policy that reduces write amplification. To minimize the impact of NVMM's long write latency, Cheetah absorbs asynchronous writes in DRAM rather than NVMM. Our experimental results show that Cheetah provides up to 3.5× throughput improvement over state-of-the-art NVMM file systems on write-intensive workloads.

Research Session 4: Topic Model and Language Model Learning Time: 13:30-15:10, August 23, 2021, Monday Chair: Yanghui Rao

Chinese Word Embedding Learning with Limited Data Shurui Chen, Yufu Chen, Yuyin Lu, Yanghui Rao, Haoran Xie, and Qing Li Sun Yat-sen University Abstract. With the increasing demand for high-quality Chinese word embeddings in natural language processing, Chinese word embedding learning has attracted wide attention in recent years. Most existing research focuses on capturing word semantics on large-scale datasets. However, these methods struggle to obtain effective word embeddings from the limited data available in some specific fields. Observing the rich semantic information of Chinese fine-grained structures, we develop a model that fully fuses Chinese fine-grained structures as auxiliary information for word embedding learning. The proposed model views word context information as a combination of words, characters, pronunciations, and components. Besides, it adds the semantic relationship between pronunciations and components as a constraint to exploit auxiliary information comprehensively. Based on the decomposition of the shifted positive pointwise mutual information (SPPMI) matrix, our model can effectively generate Chinese word embeddings from small-scale data. The results of word analogy, word similarity, and named entity recognition tasks on two public datasets show the effectiveness of the proposed model for capturing Chinese word semantics with limited data.
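The SPPMI-decomposition step the abstract mentions can be sketched generically on a toy co-occurrence matrix. This is the standard SPPMI factorization, not the paper's model, which additionally fuses character, pronunciation, and component information:

```python
import numpy as np

def sppmi_embeddings(cooc, k=1, dim=2):
    """Factorize a shifted positive PMI (SPPMI) matrix into word embeddings.

    cooc: symmetric word co-occurrence count matrix (V x V).
    k: shift (number of negative samples); dim: embedding size.
    """
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    sppmi = np.maximum(pmi - np.log(k), 0.0)   # shift and clip at zero
    u, s, _ = np.linalg.svd(sppmi)
    return u[:, :dim] * np.sqrt(s[:dim])       # symmetric factorization

cooc = np.array([[0, 4, 1],
                 [4, 0, 2],
                 [1, 2, 0]], dtype=float)
emb = sppmi_embeddings(cooc)
print(emb.shape)  # (3, 2)
```

With limited data, the co-occurrence counts (and hence the matrix to factorize) can be augmented with sub-word structures, which is the direction the paper takes.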

Sparse Biterm Topic Model for Short Texts Bingshan Zhu, Yi Cai, and Huakui Zhang South China University of Technology, China Abstract. Extracting meaningful and coherent topics from short texts is an important task for many real-world applications. The biterm topic model (BTM) is a popular topic model for short texts that explicitly models word co-occurrence patterns at the corpus level. However, BTM ignores the fact that a topic is usually described by only a few words in a given corpus; in other words, the topic-word distribution should be highly sparse. Modeling this sparsity may yield more coherent topics and improve the performance of BTM. In this paper, we propose a sparse biterm topic model (SparseBTM) which incorporates a spike-and-slab prior into BTM to explicitly model topic sparsity. Experiments on two short text datasets show that our model achieves comparable topic coherence scores and higher classification and clustering performance than BTM.

EMBERT: A Pre-trained Language Model for Chinese Medical Text Mining Zerui Cai, Taolin Zhang, Chengyu Wang, and Xiaofeng He East China Normal University Abstract. Medical text mining aims to learn models that extract useful information from medical sources. A major challenge is obtaining large-scale labeled data in the medical domain for model training, which is highly expensive. Recent studies show that pre-training language models on massive unlabeled corpora alleviates this problem through self-supervised learning. In this paper, we propose EMBERT, an entity-level knowledge-enhanced pre-trained language model, which leverages several distinct self-supervised tasks for Chinese medical text mining. EMBERT captures fine-grained semantic relations among medical terms through three self-supervised tasks: i) context-entity consistency prediction (whether entities are equivalent in meaning given certain contexts), ii) entity segmentation (segmenting entities into fine-grained semantic parts), and iii) bidirectional entity masking (predicting the atomic or adjective terms of long entities). The experimental results demonstrate that our model achieves significant improvements over five strong baselines on six public Chinese medical text mining datasets.

Self-Supervised Learning for Semantic Sentence Matching with Dense Transformer Inference Network Fengying Yu, Jianzong Wang, Dewei Tao, Ning Cheng, and Jing Xiao Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China Abstract. Semantic sentence matching concerns predicting the relationship between a pair of natural language sentences. Recently, many methods based on interaction structure have been proposed, usually involving encoder, matching, and aggregation parts. Although some of them obtain impressive results, a simple encoder trained from scratch cannot extract the global features of sentences effectively, and the transmission of information through the stacked network causes a certain loss. In this paper, we propose a Densely-connected Inference-Attention network (DCIA) that maximizes the use of features from each layer of the network through a dense connection mechanism, and obtains a robust encoder via contrastive self-supervised learning (SSL), which maximizes the mutual information between global and local features of the input data. We conduct experiments on the Quora, MRPC, and SICK datasets; the results show that our method achieves competitive accuracies of 89.13%, 78.1%, and 87.7%, respectively. In addition, the accuracy of DCIA with SSL surpasses that of DCIA without SSL by about 2%.

An Explainable Evaluation of Unsupervised Transfer Learning for Parallel Sentences Mining Shaolin Zhu, Chenggang Mi, and Xiayang Shi Zhengzhou University of Light Industry Abstract. Parallel sentences are important resources for training cross-lingual natural language processing applications such as machine translation (MT) systems. However, these resources are not available for many low-resource language pairs. Existing methods mine parallel sentences using transfer learning. Although several attempts achieve good performance, they cannot explain why transfer learning helps mine parallel sentences for low-resource language pairs. In this paper, we propose an explainable evaluation to quantify why transfer learning is useful for parallel sentence mining. Besides, we propose a novel unsupervised transfer learning method that maintains the robustness of transfer learning. Experiments show that our proposed method improves the quality of mined parallel sentences compared with previous methods on a standard evaluation set. In particular, we achieve good results on two real-world low-resource language pairs.

Research Session 5: Text Analysis Time: 15:20-17:10, August 23, 2021, Monday Chair: Xiaohui TAO

Leveraging Syntactic Dependency and Lexical Similarity for Neural Relation Extraction Yashen Wang China Academy of Electronics and Information Technology of CETC Abstract. Relation extraction is an important task in knowledge graph completion, information extraction, and retrieval. Recent neural models (especially those with attention mechanisms) have been shown to perform reasonably well. However, they sometimes fail to: (i) understand the semantic similarity of words with the given entities; and (ii) capture long-distance dependencies among words and entities, such as co-reference. To address these issues, this paper proposes a novel relation extraction model that leverages syntactic dependency and lexical similarity to enhance the attention mechanism and reduce the dependence on rich labeled training data. We conduct experiments on widely-used real-world datasets, and the experimental results demonstrate the effectiveness of the proposed model, even compared with the latest state-of-the-art Transformer-based models.

A Novel Capsule Aggregation Framework for Natural Language Inference Chao Sun, Jianzong Wang, Fengying Yu, Ning Cheng, and Jing Xiao Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China Abstract. Recent advances have advocated the use of complex attention mechanisms to capture interactive information between the premise and hypothesis in Natural Language Inference (NLI). However, few studies have focused on the further processing of matching features, i.e., information aggregation. In this paper, we investigate a novel capsule network for NLI, referred to as Gcap. Gcap utilizes a gated enhanced fusion operation to obtain richer features from the massive soft alignment information, and then aggregates those features through routing algorithms. Benefiting from the routing mechanism of the capsule network, Gcap can dynamically generate feature vectors for the subsequent classifier. Evaluation results demonstrate that our model achieves accuracies of 89.1%, 88.2%, and 79.6% (79.3%) on the SNLI, SciTail, and MultiNLI datasets respectively, outperforming the strong baseline with gains of 0.2%, 1.4%, and 0.3% (0.6%). In particular, we compare the runtime inference efficiency of BERT and our model: our model attains up to 33.3× speedup in online inference time. Thanks to dynamic aggregation, Gcap shows a strong ability to distinguish cases that are easily confused.

Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering Ze Fu, Changmeng Zheng, Yi Cai, Qing Li, and Tao Wang South China University of Technology Abstract. Visual Question Answering (VQA) is a typical multimodal task with significant development prospects in web applications. In order to answer a question based on the corresponding image, a VQA model needs to utilize information from different modalities efficiently. Although multimodal fusion methods such as attention mechanisms have made significant contributions to VQA, these methods try to co-learn the multimodal features directly, ignoring the large gap between modalities and thus aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning, aiming to learn modality-invariant features for better semantic alignment and higher answer prediction accuracy. The model achieves an accuracy of 70.81% on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.

Difficulty-controllable Visual Question Generation Feng Chen, Jiayuan Xie, Yi Cai, Tao Wang, and Qing Li South China University of Technology Abstract. Visual Question Generation (VQG) aims to generate questions from images. Existing studies on this topic focus on generating questions solely based on images while neglecting question difficulty. However, to engage users, an automated question generator should produce questions with a level of difficulty that is tailored to a user's capabilities and experience. In this paper, we propose a Difficulty-controllable Generation Network (DGN) to alleviate this limitation. We borrow the difficulty index from the education domain to define a difficulty variable representing question difficulty, and fuse it into our model to guide difficulty-controllable question generation. Experimental results demonstrate that our proposed model not only achieves significant improvements on several automatic evaluation metrics, but can also generate difficulty-controllable questions.
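The difficulty index borrowed from educational testing can be illustrated directly. The thresholds used below to bucket the index into a coarse difficulty variable are illustrative assumptions, not the paper's:

```python
def difficulty_index(num_correct, num_answered):
    """Classical item difficulty index from educational testing:
    the fraction of respondents answering correctly (higher = easier item)."""
    return num_correct / num_answered

def difficulty_level(p, easy=0.7, hard=0.3):
    """Bucket the index into a coarse difficulty variable
    (thresholds are illustrative)."""
    if p >= easy:
        return "easy"
    return "hard" if p <= hard else "medium"

p = difficulty_index(30, 40)
print(p, difficulty_level(p))  # 0.75 easy
```

A variable of this kind can then be fed into a generation model as a control signal, which is the role the difficulty variable plays in DGN.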

Incorporating Typological Features into Language Selection for Multilingual Neural Machine Translation Chenggang Mi, Shaolin Zhu, Yi Fan, and Lei Xie Northwestern Polytechnical University Abstract. In this paper, we propose to use rich semantic and typological information about languages to improve language selection for multilingual NMT. In particular, we first use a graph-based model to output the most semantically similar languages; then, a random forest model is built which integrates features such as data size, language family, word formation, morpheme overlap, word order, POS tags, and syntactic similarity to predict the final target language(s). Experimental results on several datasets show that our method achieves consistent improvements over existing approaches in both language selection and multilingual NMT.

Removing Input Confounder for Translation Quality Estimation via a Causal Motivated Method Xuewen Shi, Heyan Huang, Ping Jian, and Yi-Kun Tang Beijing Normal University Abstract. Most state-of-the-art quality estimation (QE) systems built upon neural networks have achieved promising performance on benchmark datasets. However, the performance of these methods can be easily influenced by inherent features of the model input, such as the length of the input sequence or the number of unseen tokens. In this paper, we introduce a causal-inference-based method to eliminate the negative impact of input characteristics on a QE system. Specifically, we propose an iterative denoising framework for multiple confounding features, where the confounder elimination operation at each iteration step is implemented by a Half-Sibling Regression based method. We conduct our experiments on the official datasets and submissions from the WMT 2020 Quality Estimation Shared Task of Sentence-Level Direct Assessment. Experimental results show that the denoised QE results achieve better Pearson correlation with human assessments than the original submissions.
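Half-Sibling Regression, the building block of the denoising framework, can be sketched on synthetic data (the confounder and features below are invented for illustration): regress the quantity of interest on "sibling" variables that depend only on the confounder, and keep the residual as the denoised signal:

```python
import numpy as np

rng = np.random.default_rng(2)

def half_sibling_denoise(y, siblings):
    """Half-sibling regression: regress y on sibling variables that share
    only the confounder with it, and return the residual."""
    X = np.column_stack([np.ones(len(y)), siblings])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

n = 200
confounder = rng.normal(size=n)                  # e.g. input-length effect
signal = rng.normal(size=n)                      # true quality score
y = signal + 2.0 * confounder                    # observed QE score, contaminated
sibling = confounder + 0.1 * rng.normal(size=n)  # feature driven by the confounder
clean = half_sibling_denoise(y, sibling)
# The residual correlates far less with the confounder than y does.
print(abs(np.corrcoef(clean, confounder)[0, 1])
      < abs(np.corrcoef(y, confounder)[0, 1]))  # True
```

Iterating this operation over several confounding features is the idea behind the paper's denoising framework.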

Research Session 6: Text Classification Time: 15:20-17:10, August 23, 2021, Monday Chair: Junying Chen

Learning Refined Features for Open-World Text Classification Zeting Li, Yi Cai, Xingwei Tan, Guoqiang Han, Haopeng Ren, Xin Wu, and Wen Li South China University of Technology Abstract. Open-world classification requires a classifier not only to classify samples of the observed classes but also to detect samples that cannot be classified as any of the known classes. State-of-the-art methods first train a network to extract features that separate the known classes. Then strategies such as outlier detectors are used to reject samples from unknown classes based on this feature space. However, such a feature extractor cannot model comprehensive features of known classes in an open-world scenario due to limited training data, so these strategies are unable to separate unknown classes from known classes accurately in the feature space. Motivated by theories from psychology and cognitive science, we utilize class descriptions summarized by humans to refine discriminative features and propose a regularization based on class descriptions. The regularization is incorporated into DOC (one of the state-of-the-art models) to improve the performance of open-world classification. Experiments on two text classification datasets demonstrate the effectiveness of the proposed method.

Emotion Classification of Text Based on BERT and Broad Learning System Sancheng Peng, Rong Zeng, Hongzhan Liu, Guanghao Chen, Ruihuan Wu, Aimin Yang, and Shui Yu Guangdong University of Foreign Studies Abstract. Emotion classification is one of the most important tasks of natural language processing (NLP). It focuses on identifying each kind of emotion expressed in text. However, most existing models are based on deep learning methods, which often suffer from long training times and difficulties in convergence and theoretical analysis. To address these problems, we propose a method for emotion classification of text based on bidirectional encoder representations from transformers (BERT) and a broad learning system (BLS). The texts are fed into the pre-trained BERT model to obtain context-related word embeddings, and all word vectors are averaged to obtain a sentence embedding. The feature nodes and enhancement nodes of BLS are used to extract the linear and nonlinear features of the text, and three cascading BLS structures are designed to transform the input data and improve text feature extraction. The two groups of features are fused and fed into the output layer to obtain a probability distribution over the emotions, thus achieving emotion classification. Extensive experiments on datasets from SemEval-2019 Task 3 and SMP2020-EWECT show that our proposed method reduces training time and improves classification performance compared with the baseline methods.
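The BLS feature/enhancement-node pipeline described above can be sketched in a minimal form: random feature nodes, nonlinear enhancement nodes, and a closed-form ridge-regression output layer. The cascading structures and BERT inputs of the paper are not reproduced, and the data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def bls_fit(X, Y, n_feat=10, n_enh=20, reg=1e-3):
    """Minimal broad learning system: random feature nodes, nonlinear
    enhancement nodes, and a ridge-regression output layer."""
    Wf = rng.normal(size=(X.shape[1], n_feat))
    Z = X @ Wf                                  # linear feature nodes
    We = rng.normal(size=(n_feat, n_enh))
    H = np.tanh(Z @ We)                         # nonlinear enhancement nodes
    A = np.hstack([Z, H])                       # fused features
    # closed-form ridge solution for the output weights (no backprop)
    Wo = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ Y)
    return Wf, We, Wo

def bls_predict(X, Wf, We, Wo):
    Z = X @ Wf
    A = np.hstack([Z, np.tanh(Z @ We)])
    return A @ Wo

X = rng.normal(size=(50, 8))              # stand-in for averaged BERT embeddings
Y = np.eye(3)[rng.integers(0, 3, 50)]     # one-hot emotion labels
params = bls_fit(X, Y)
scores = bls_predict(X, *params)
print(scores.shape)  # (50, 3)
```

The closed-form output layer is what gives BLS its short training time compared with end-to-end deep models.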

Improving Document-level Sentiment Classification with User-Product Gated Network Bing Tian, Yong Zhang, and Chunxiao Xing Tsinghua University Abstract. Document-level sentiment classification is a fundamental task in Natural Language Processing (NLP). Previous studies have demonstrated the importance of personalized sentiment classification, which takes user preferences and product characteristics into consideration when predicting sentiment ratings. The state-of-the-art approaches incorporate such information via attention mechanisms, where the attention weights are calculated after the texts are encoded into low-dimensional vectors by LSTM-based models. However, user and product information may be discarded in the process of generating these semantic representations. In this paper, we propose a novel User-Product gated LSTM network (UP-LSTM), which incorporates user and product information into LSTM cells while generating text representations. Therefore, UP-LSTM can dynamically produce user- and product-aware contextual representations of texts. Moreover, we devise another version of it to improve training efficiency. We conduct a comprehensive evaluation on three real-world datasets. Experimental results show that our model outperforms previous approaches by a clear margin.

Integrating RoBERTa Fine-Tuning and User Writing Styles for Authorship Attribution of Short Texts Xiangyu Wang and Mizuho Iwaihara Waseda University Abstract. Authorship Attribution (AA) is a fundamental branch of text classification, aiming at identifying the authors of given texts. However, authorship attribution of short texts faces many challenges, such as short length, feature sparsity, and non-standard casual wording. Recent studies have shown that deep learning methods can greatly improve the accuracy of AA tasks; however, they still represent user posts using a set of predefined features (e.g., word n-grams and character n-grams) and adopt text classification methods to solve the task. In this paper, we propose a hybrid model for authorship attribution of short texts. The first part is a pretrained language model based on RoBERTa that produces post representations aware of tweet-related stylistic features and their contexts. The second part is a CNN model built on a number of feature embeddings to represent users' writing styles. Finally, we assemble these representations for the final AA classification. Our experimental results show that our model achieves the state-of-the-art result on a known tweet AA dataset.

Dependency Graph Convolution and POS Tagging Transferring for Aspect-based Sentiment Classification Zexin Li, Linjun Chen, Tiancheng Huang, and Jiagang Song Guangxi Normal University Abstract. Aspect-based sentiment classification (ABSC) is a fine-grained task in natural language processing that recognizes the sentiment polarity of the various aspects mentioned in a sentence. Most existing work ignores the syntactic constraints of the local context, and few studies use feature enhancement when dealing with ABSC problems. To solve these problems, this paper proposes a new transfer learning model for aspect-based sentiment analysis, namely LCF-TDGCN. It is based on the local context focus and self-attention mechanisms, and uses Part-of-Speech (POS) tagging as an auxiliary task to enhance sentiment polarity prediction. It further utilizes dependency graph convolution (DGC) to analyze the syntactic constraints of the local context and capture long-term word dependencies. In addition, it integrates the pre-trained BERT model, improving ABSC performance by exploiting syntactic information and word dependencies. Experimental results on five different datasets show that LCF-TDGCN produces good results.

Research Session 7: Machine Learning 1 Time: 15:20-17:10, August 23, 2021, Monday Chair: Zhenguo Yang

DTWSSE: Data Augmentation with a Siamese Encoder for Time Series Xinyu Yang, Xinlan Zhang, Zhenguo Zhang, Yahui Zhao, and Rongyi Cui Yanbian University Abstract. Access to labeled time series data is often limited in the real world, which constrains the performance of deep learning models in the field of time series analysis. Data augmentation is an effective way to address small sample sizes and class imbalance in time series datasets. The two key factors of data augmentation are the distance metric and the choice of interpolation method. SMOTE does not perform well on time series data because it uses a Euclidean distance metric and interpolates directly on the objects. Therefore, we propose a DTW-based synthetic minority oversampling technique using a Siamese encoder for interpolation, named DTWSSE. To reasonably measure the distance between time series, DTW, which has been verified to be an effective method for time series, is employed as the distance metric. To adapt to the DTW metric, we use an autoencoder trained in an unsupervised self-training manner for interpolation. The encoder is a Siamese neural network that maps the time series data from the DTW hidden space to the Euclidean deep feature space, and the decoder maps the deep feature space back to the DTW hidden space. We validate the proposed method on a number of balanced and unbalanced time series datasets. Experimental results show that the proposed method leads to better performance of the downstream deep learning model.
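The DTW distance that DTWSSE adopts as its metric is a standard dynamic-programming computation, sketched here for 1-D sequences:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j]: cost of the best warping path aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# Warping absorbs local stretching that Euclidean distance would penalize.
print(dtw_distance([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.0
```

Because DTW allows elastic alignment, direct linear interpolation between two series is no longer meaningful, which motivates the paper's learned Siamese-encoder interpolation.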

PT-LSTM: Extending LSTM for Efficient Processing of Time Attributes in Time Series Prediction Yongqiang Yu, Xinyi Xia, Bo Lang, and Hongyu Liu Beihang University Abstract. Long Short-Term Memory (LSTM) has been widely applied in time series prediction. Time attributes are important factors in time series prediction. However, existing studies often ignore the influence of time attributes when splitting time series data, and seldom utilize the time information in LSTM models. In this paper, we propose a novel method named Position encoding and Time gate LSTM (PT-LSTM). We first propose a position-encoding-based time attribute integration method, which obtains the vector representation of time attributes through position encoding and integrates it with the observed value vectors of the data. Moreover, we propose an LSTM variant with a new time gate specially designed to process time attributes. PT-LSTM can therefore make good use of time attributes in the key phases of prediction. Experimental results on three public datasets show that our PT-LSTM model outperforms state-of-the-art methods in time series prediction.
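The position encoding of time attributes can be sketched with a sinusoidal scheme; the exact encoding, period, and dimensions below are assumptions, not the paper's formulation. The encoded time vector is concatenated with the observed values:

```python
import numpy as np

def time_position_encoding(t, d_model=8, period=24):
    """Sinusoidal encoding of a time attribute (e.g. hour of day)."""
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        freq = (2 * np.pi / period) / (10000 ** (i / d_model))
        pe[i] = np.sin(t * freq)
        pe[i + 1] = np.cos(t * freq)
    return pe

x_obs = np.array([0.3, 0.7])                              # observed values at one step
vec = np.concatenate([x_obs, time_position_encoding(9)])  # hour = 9
print(vec.shape)  # (10,)
```

Feeding such a vector to the recurrent cell gives the model an explicit, smooth representation of when each observation occurred.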

Loss Attenuation for Time Series Prediction Respecting Categories of Values Jialing Zhang, Zheng Liu, Yanwen Qu, and Yun Li Nanjing University of Posts and Telecommunications Abstract. Forecasting future values is a core task in many applications dealing with multivariate time series data. In pollution monitoring, for example, forecasting future PM2.5 values in air is very common, as PM2.5 is a crucial indicator of the air quality index (AQI). The values in a time series are sometimes accompanied by category information for easy understanding. As an illustration, it is common to categorize PM2.5 values to indicate levels of health concern or health risk based on predefined category intervals. Forecasting future values without considering the categories leads to potential inconsistency between the categories of predicted values and real values. The underlying reason is that the training objective minimizes the overall prediction error, e.g., mean square error, which does not respect the category information. We propose a category-adaptive loss attenuation method that reweights training samples in stochastic gradient descent for multi-horizon time series forecasting. The proposed weighting strategy considers each training sample's closeness to category boundaries in a parameterized, cost-sensitive manner. Results from extensive experiments demonstrate that the weighting method can improve the overall performance of category-aware time series prediction.
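The boundary-aware weighting idea can be sketched as follows; the PM2.5 breakpoints, weight function, and hyperparameters are illustrative assumptions, not the paper's parameterization:

```python
import numpy as np

# Hypothetical PM2.5 category boundaries (AQI-style breakpoints).
BOUNDS = np.array([35.0, 75.0, 115.0, 150.0])

def boundary_weight(y, alpha=2.0, tau=5.0):
    """Up-weight samples whose true value lies near a category boundary,
    where a small error can flip the predicted category."""
    dist = np.min(np.abs(y[:, None] - BOUNDS[None, :]), axis=1)
    return 1.0 + alpha * np.exp(-dist / tau)

def weighted_mse(y_true, y_pred):
    w = boundary_weight(y_true)
    return np.mean(w * (y_true - y_pred) ** 2)

y_true = np.array([34.0, 100.0])   # first sample sits near the 35 boundary
y_pred = np.array([36.0, 102.0])
print(weighted_mse(y_true, y_pred) > np.mean((y_true - y_pred) ** 2))  # True
```

Near a boundary the same absolute error is more likely to change the predicted category, so penalizing it more steers the model toward category-consistent forecasts.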

PFL-MoE: Personalized Federated Learning Based on Mixture of Experts Binbin Guo, Yuan Mei, Danyang Xiao, and Weigang Wu Sun Yat-sen University Abstract. Federated learning (FL) is an emerging distributed machine learning paradigm that avoids data sharing among training nodes so as to protect data privacy. Under the coordination of the FL server, each client conducts model training using its own computing resources and private data set. The global model can be created by aggregating the training results of the clients. To cope with highly non-IID data distributions, personalized federated learning (PFL) has been proposed to improve overall performance by allowing each client to learn a personalized model. However, one major drawback of a personalized model is the loss of generalization. To achieve model personalization while maintaining good generalization, in this paper we propose a new approach, named PFL-MoE, which mixes the outputs of the personalized model and the global model via the MoE architecture. PFL-MoE is a generic approach and can be instantiated by integrating existing PFL algorithms. In particular, we propose the PFL-MF algorithm, an instance of PFL-MoE based on the freeze-base PFL algorithm. We further improve PFL-MF by enhancing the decision-making ability of the MoE gating network, yielding a variant algorithm, PFL-MFE. We demonstrate the effectiveness of PFL-MoE by training the LeNet-5 and VGG-16 models on the Fashion-MNIST and CIFAR-10 datasets with non-IID partitions.
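The core mixing step, a gating value blending the outputs of the global and the personalized model, can be sketched as follows (the gate here is a given scalar; in PFL-MoE it would be produced per-sample by a trained gating network):

```python
def moe_mix(global_out, personal_out, gate):
    """Blend per-class outputs of the global and personalized models.
    gate in [0, 1]: 1.0 means trusting the global model entirely."""
    assert 0.0 <= gate <= 1.0
    return [gate * g + (1.0 - gate) * p
            for g, p in zip(global_out, personal_out)]
```

A well-trained gate can thus fall back to the global expert on inputs that look unlike the client's local data, recovering generalization.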

A New Density Clustering Method Using Mutual Nearest Neighbor Yufang Zhang, Yongfang Zha, Lintao Li, and Zhongyang Xiong University Abstract. Density-based clustering algorithms have become a popular research topic in recent years. However, most of these algorithms have difficulty identifying all clusters with greatly varying densities and arbitrary shapes, or have considerable time complexity. To tackle this issue, we propose a novel density assessment method using mutual nearest neighbors, and then propose a relative density clustering algorithm (RDC). RDC can find the right number of clusters for datasets that include varying densities and arbitrary shapes; in addition, its time complexity is O(n log n).
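A minimal illustration of the mutual-nearest-neighbor relation that such a density assessment builds on (brute-force, O(n^2) per query; an O(n log n) algorithm as claimed above would necessarily use a faster neighbor search):

```python
def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (brute-force Euclidean)."""
    order = sorted(range(len(points)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(points[i], points[j])))
    return set(order[1:k + 1])  # order[0] is point i itself

def mutual_neighbors(points, i, k):
    """j is a mutual nearest neighbor of i if each lies in the other's kNN set."""
    return {j for j in knn(points, i, k) if i in knn(points, j, k)}
```

The size of the mutual-neighbor set gives a density estimate that adapts to local scale, which is what lets clusters of very different densities be compared on equal footing.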

Research Session 8: Machine Learning 2 Time: 15:20-17:10, August 23, 2021, Monday Chair: Hao Wang

Unsupervised Deep Hashing via Adaptive Clustering Shuying Yu, Xian-Ling Mao, Wei Wei, and Heyan Huang Beijing Institute of Technology Abstract. Similarity-preserving hashing has become a popular technique for large-scale image retrieval because of its low storage cost and high search efficiency. Unsupervised hashing has high practical value because it learns hash functions without any annotated labels. Previous unsupervised hashing methods usually obtain the semantic similarities between data points by making use of deep features extracted from pre-trained CNNs. The semantic structure learned from fixed embeddings is often not optimal, leading to sub-optimal retrieval performance. To tackle this problem, in this paper we propose a deep-clustering-based unsupervised hashing architecture, called DCUH. The proposed model can simultaneously learn the intrinsic semantic relationships and hash codes. Specifically, DCUH first clusters the deep features to generate pseudo classification labels. Then, DCUH is trained with both a classification loss and a discriminative loss. Concretely, the pseudo class label is used as the supervision for classification, and the learned hash code should be invariant under different data augmentations with the local semantic structure preserved. Finally, DCUH updates the cluster assignments and trains the deep hashing network iteratively. Extensive experiments demonstrate that the proposed model outperforms state-of-the-art unsupervised hashing methods.

FedMDR: Federated Model Distillation with Robust Aggregation Yuxi Mi, Yutong Mu, Shuigeng Zhou, and Jihong Guan Fudan University Abstract. This paper presents FedMDR, a federated model distillation framework with a novel, robust aggregation mechanism that exploits transfer learning and knowledge distillation. FedMDR adopts a weighted geometric-median-based aggregation with trimmed prediction accuracy on the server side, which orchestrates communication-efficient training on both heterogeneous model architectures and non-i.i.d. data. The aggregation provides resilience to the sharp accuracy drops of corrupted models. We also extend FedMDR to support differential privacy by adding Gaussian noise to the aggregated consensus. Results show that FedMDR achieves significant robustness gains and satisfactory accuracy, and outperforms existing techniques.
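Much of the robustness comes from the geometric median, which, unlike the mean, is insensitive to a minority of corrupted prediction vectors. A plain Weiszfeld-iteration sketch (unweighted, and omitting the trimming step described above):

```python
def geometric_median(points, iters=100, eps=1e-7):
    """Weiszfeld iteration for the geometric median of a list of vectors
    (e.g., clients' prediction vectors). Robust to a minority of outliers."""
    dim = len(points[0])
    m = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        total_w = 0.0
        acc = [0.0] * dim
        for p in points:
            dist = max(eps, sum((a - b) ** 2 for a, b in zip(p, m)) ** 0.5)
            w = 1.0 / dist  # closer points dominate the update
            total_w += w
            acc = [s + w * a for s, a in zip(acc, p)]
        m = [s / total_w for s in acc]
    return m
```

With three honest clients at the origin and one corrupted client far away, the median stays near the honest consensus, whereas the mean would be dragged a quarter of the way to the outlier.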

Data Augmentation for Graph Convolutional Network on Semi-Supervised Classification Zhengzheng Tang, Ziyue Qiao, Xuehai Hong, Yang Wang, Fayaz Ali Dharejo, Yuanchun Zhou, and Yi Du Beijing University of Chinese Academy of Sciences Abstract. Data augmentation aims to generate new, synthetic features from the original data, which can identify a better representation of the data and improve the performance and generalizability of downstream tasks. However, data augmentation for graph-based models remains a challenging problem, as graph data is more complex than traditional data and consists of two features with different properties: graph topology and node attributes. In this paper, we study the problem of graph data augmentation for Graph Convolutional Networks (GCNs) in the context of improving node embeddings for semi-supervised node classification. Specifically, we conduct a cosine-similarity-based cross operation on the original features to create new graph features, including new node attributes and new graph topologies, and we combine them as new pairwise inputs for specific GCNs. Then, we propose an attentional integrating model to compute a weighted sum of the hidden node embeddings encoded by these GCNs, producing the final node embeddings. We also impose a disparity constraint on these hidden node embeddings during training to ensure that non-redundant information is captured from the different features. Experimental results on five real-world datasets show that our method improves classification accuracy by a clear margin (+2.5% to +84.2%) over the original GCN model.
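One plausible reading of the cosine-similarity-based construction of a new topology: connect node pairs whose attribute vectors are sufficiently similar. A sketch under that assumption (the threshold rule is illustrative; the paper's cross operation may differ in detail):

```python
def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def attribute_graph(features, threshold):
    """New topology: an edge between every pair of nodes whose attribute
    vectors have cosine similarity at or above the threshold."""
    n = len(features)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine(features[i], features[j]) >= threshold}
```

The resulting attribute-derived edge set can then be paired with the original attributes (and, symmetrically, topology-derived attributes with the original edges) as the new pairwise GCN inputs.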

Generating Long and Coherent Text with Multi-Level Generative Adversarial Networks Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen Renmin University of China Abstract. In this paper, we study the task of generating long and coherent text. In the literature, Generative Adversarial Network (GAN) based methods have been one of the mainstream approaches to generic text generation. We aim to improve two aspects of GAN-based methods in generic text generation, namely long-sequence optimization and semantic coherence enhancement. For this purpose, we propose a novel Multi-Level Generative Adversarial Network (MLGAN) for long and coherent text generation. Our approach explicitly models the text generation process at three different levels, namely paragraph-, sentence- and word-level generation. At the top two levels, we generate continuous paragraph vectors and sentence vectors as semantic sketches to plan the entire content. At the bottom level, we generate discrete word tokens to realize the sentences. Furthermore, we utilize a conditional GAN architecture to enhance inter-sentence coherence by injecting paragraph vectors into sentence vector generation. Extensive experimental results demonstrate the effectiveness of the proposed model.

A Reasonable Data Pricing Mechanism for Personal Data Transactions with Privacy Concern Zheng Zhang, Wei Song, and Yuan Shen Wuhan University Abstract. In the past few years, more and more data marketplaces for personal data transactions have sprung up. However, it is still very challenging to estimate the value of the privacy contained in personal data, especially when the buyer already has some related datasets and can obtain more private information by combining and analyzing the purchased data with the data they already have. The main motivation of this work is to reasonably price data with privacy concerns. We propose a reasonable data pricing mechanism that prices personal privacy data from three aspects. Different from existing work, we propose a new concept named 'privacy cost' to quantitatively measure the privacy information increment after a data transaction, rather than directly measuring the privacy information contained in a single dataset. In addition, we use information entropy as an important index to measure the information content of data. We conduct a set of experiments on our personal data pricing method, and the results show that our pricing method performs better than the alternatives.
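Using information entropy as an index of information content is standard: for a data column with empirical value distribution p, the Shannon entropy is H = -sum(p * log2 p). A minimal implementation:

```python
import math

def shannon_entropy(values):
    """Information content of a data column, measured as the Shannon
    entropy (in bits) of its empirical value distribution."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A uniform column over 2^k distinct values scores k bits, while a constant column scores 0, matching the intuition that more varied data carries more information (and, in this pricing setting, potentially more privacy risk).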

Research Session 9: Knowledge Graph Time: 13:30-15:10, August 24, 2021, Tuesday Chair: Lu Chen

A Probabilistic Inference Based Approach for Querying Associative Entities in Knowledge Graph JianYu Li, Kun Yue, Jie Li, and Liang Duan University Abstract. Querying associative entities aims to provide top-ranked entities in a knowledge graph (KG). Many entities are not linked explicitly in a KG but are actually associated when incorporating outside user-generated data, which can enrich entity associations for KG query processing. In this paper, we leverage user-entity interactions (called user-entity data) to improve the accuracy of querying associative entities in a KG. Upon the association rules obtained from the user-entity data, we construct the association entity Bayesian network (AEBN), which facilitates the representation and inference of the dependencies among entities. Consequently, we formulate the problem of querying associative entities as probabilistic inference over the AEBN. To rank the associative entities, we propose an approximate method to evaluate the association degree between entities. Extensive experiments on various datasets verify the effectiveness and efficiency of our method, and the experimental results show that it outperforms several state-of-the-art competitors.

BOUNCE: An Efficient Selective Enumeration Approach for Nested Named Entity Recognition Liujun Wang and Yanyan Shen Shanghai Jiao Tong University Abstract. The scenario in which one entity contains other entities is known as nested entities. Nested named entity recognition (NER) is a fundamental and challenging task in various NLP applications. The state-of-the-art nested NER approach first enumerates all the text spans in a sentence and then performs classification. We observe that a large proportion of entities contain only one token and cannot be nested, and that most text spans in a sentence are not entities, so full enumeration is costly and unnecessary. In this paper, we propose an efficient selective enumeration approach named BOUNCE. We decompose the nested NER task into two subtasks that identify unit-length entities and the remaining entities, respectively. We develop a carefully designed model for each subtask and train both jointly. To improve efficiency, we employ a head detection module to locate the start points of entities, which acts as a filtering step before enumeration. We provide a detailed analysis of the time complexity of existing nested NER techniques and conduct extensive experiments on two datasets. The results demonstrate that BOUNCE outperforms various nested NER techniques and achieves higher efficiency than the state-of-the-art method with comparable accuracy.

PAIRPQ: An Efficient Path Index for Regular Path Queries on Knowledge Graphs Baozhu Liu, Xin Wang, Pengkai Liu, Sizhuo Li, and Xiaofei Wang Tianjin University Abstract. With the growing popularity and application of knowledge-based artificial intelligence, the scale of knowledge graph data is increasing dramatically. A Regular Path Query (RPQ) retrieves vertex pairs connected by paths that satisfy a given regular expression. As an essential type of query for RDF graphs, RPQs have been attracting increasing research effort. Since the complexity of RPQs is polynomial in the scale of the knowledge graph, there is currently no efficient method to process RPQs on large-scale knowledge graphs. In this paper, we propose a novel indexing solution that leverages frequent path mining. Unlike existing RPQ processing methods, our approach makes full use of frequent paths as the basic indexing facility: frequent paths extracted from data graphs are indexed to accelerate RPQs. Meanwhile, since no RPQ benchmark is available, we create a micro-benchmark on synthetic and real-world data sets. The experimental results show that PAIRPQ improves query efficiency by orders of magnitude over state-of-the-art RDF storage engines.

A Hybrid Semantic Matching Model for Neural Collective Entity Linking Baoxin Lei, Wen Li, Leung-Pun Wong, Lap-Kei Lee, Fu Lee Wang, and Tianyong Hao South China Normal University Abstract. The task of entity linking aims to correctly link mentions in a text fragment to a reference knowledge base. Most existing methods apply a single neural network model to learn semantic representations at all granularities of contextual information, neglecting the traits of different granularities. Also, these purely representation-based methods measure semantic matching on abstract vector representations, which frequently miss concrete matching information. To better capture contextual information, this paper proposes a new neural network model called Hybrid Semantic Matching (HSM) for the entity linking task. The model captures two different aspects of semantic information via representation-based and interaction-based neural semantic matching models. Furthermore, to account for the global consistency of entities, a recurrent random walk is applied to propagate entity linking evidence among related decisions. Evaluation was conducted on three publicly available standard datasets. Results show that our proposed HSM model is more effective than a list of baseline models.

Multi-Space Knowledge Enhanced Question Answering over Knowledge Graph Ye Ji, Bohan Li, Yi Liu, Yuxin Zhang, and Ken Cai Nanjing University of Aeronautics and Astronautics Abstract. Knowledge graph question answering (KG-QA) accesses substantial knowledge to return comprehensive answers in a user-friendly way. Recently, embedding-based methods for KG-QA have been a hot topic. Traditional embedding-based methods cannot make full use of the knowledge since they incorporate it by using a single semantic translation model to embed entities and relations. Semantic translation models based on non-Euclidean spaces can capture more kinds of latent information because they can focus on different aspects of the knowledge. In this paper, we propose a multi-space knowledge enhanced question answering model that mines the latent information of knowledge in different embedding spaces to improve KG-QA. In addition, a Transformer is used to replace the traditional Bi-LSTM to obtain the vector representation of the question, and a specially designed attention mechanism is used to dynamically calculate the scores of candidate answers. Experiments conducted on the WebQuestions dataset show that, compared with other state-of-the-art QA systems, our method can effectively improve accuracy.

Research Session 10: Emerging Data Processing Techniques Time: 13:30-15:10, August 24, 2021, Tuesday Chair: Bo Tang

A Distribution-Aware Training Scheme for Learned Indexes Youyun Wang, Chuzhe Tang, and Xujia Yao Shanghai Jiao Tong University Abstract. The recent proposal of the learned index leads us to a new direction for optimizing indexes. With the help of learned models, it has demonstrated promising performance improvements compared with traditional indexes. However, the skewed query distributions and ever-changing data distributions common in real-world workloads pose additional challenges to the learned index. Failing to consider these distributions can notably undermine the learned index's high performance. To solve this issue, we propose a Distribution-Aware Training scheme for the learned index (called DATum). DATum can produce a tuned model for a specific query and data distribution. Central to DATum are two designs. First, it stretches the training data according to access frequencies to incorporate skewed query patterns. Second, it combines a model cache and a classic grid search to efficiently find the best model architecture for ever-changing datasets. Our experimental results show that DATum can improve the learned index's performance by 51.1% and reduce its model rebuilding time to less than 1%.
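The first design, stretching the training data according to access frequency, can be pictured as replicating hot keys before fitting the model, so the learned cumulative-distribution approximation is most accurate where queries actually land (an illustrative sketch; DATum's actual stretching scheme may differ):

```python
def stretch_by_frequency(keys, freqs):
    """Replicate each key in proportion to its access frequency so a
    learned index model fits hot keys more accurately."""
    out = []
    for k, f in zip(keys, freqs):
        out.extend([k] * max(1, f))
    return out
```

Fitting, say, a linear model over the stretched list rather than the raw key list biases its accuracy toward the frequently queried regions.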

AIR Cache: A Variable-size Block Cache Based on Fine-grained Management Method Yuxiong Li, Yujuan Tan, Congcong Xu, Duo Liu, Xianzhang Chen, Chengliang Wang, Mingliang Zhou, and Leong Hou U Chongqing University Abstract. Recently, adopting large cache blocks has received widespread attention in server-side storage caching. Besides reducing the management overhead of cache blocks, it can significantly boost I/O throughput. However, although large blocks have advantages in management overhead and I/O performance, existing fixed-size block management schemes in storage caches cannot handle them effectively under complicated real-world workloads. We find that existing fixed-size block management methods suffer from fragmentation within the cache block and fail to identify hot/cold cache blocks correctly when adopting large blocks for caching. To solve this problem, we propose AIR cache, a variable-size block cache based on a fine-grained management method. AIR cache contains three major parts: Multi-Granularity Writer (MGW), Multi-Granularity Eviction (MGE), and Fine-Grained Recorder (FGR), where FGR is dedicated to recording data popularity using fine-grained data sections, MGW writes data at different granularities, and MGE is responsible for evicting data at dynamic granularities. Our experiments with real-world traces demonstrate that AIR cache can increase the read cache hit ratio by up to 6.97X and the cache space utilization rate by up to 3.63X over traditional fixed-size block management methods.

Learning an Index Advisor with Deep Reinforcement Learning Sichao Lai, Xiaoying Wu, Senyang Wang, Yuwei Peng, and Zhiyong Peng Wuhan University Abstract. Indexes are crucial for the efficient processing of database workloads, and an appropriately selected set of indexes can drastically improve query processing performance. However, the selection of beneficial indexes is a nontrivial problem and remains challenging. Recent work in deep reinforcement learning (DRL) may bring a new perspective to this problem. In this paper, we study the index selection problem in the context of reinforcement learning and propose an end-to-end DRL-based index selection framework. The framework poses the index selection problem as a series of 1-step single-index recommendation tasks and can learn from data. Unlike most existing DRL-based index selection solutions, which focus on selecting single-column indexes, our framework can recommend both single-column and multi-column indexes for the database. A set of comparative experiments with existing solutions demonstrates the effectiveness of our proposed method.

SardineDB: A Distributed Database on the Edge of the Network Min Dong, Haozhao Zhong, Boyu Sun, Sheng Bi, and Yi Cai South China University of Technology Abstract. In the past few years, the number of sensors on the edge of the network has been rapidly increasing. IoT (Internet of Things) devices act not only as data producers but also as data consumers, so it is valuable to deploy a distributed database on the edge of the network. However, flash memory is the mainstream storage medium on the edge, which differs from the cloud environment: flash memory wears out through repeated writes, while a large amount of data is written on the edge every day. Thus, in this paper, SardineDB is presented, a decentralized distributed database optimized for the edge. The engine of SardineDB is SardineCore, a flash-optimized key-value separation storage based on LevelDB. SardineCore has a low GC (garbage collection) burden, which lowers write amplification and improves write performance on the edge. Evaluation results show that the write performance and random read performance of SardineDB have great advantages over existing distributed databases on the edge. As a result, SardineDB is very suitable for the edge because it has high write performance, a low GC burden, and low write amplification.

DLSM: Distance Label Based Subgraph Matching on GPU Shijie Jiang, Yang Wang, Guang Lu, and Chuanwen Li Northeastern University Abstract. Graphs have been prevalently used to represent complex data, such as social networks, citation networks, and biological protein interaction networks. The subgraph matching problem has wide applications in the graph data computing area. Recently, many parallel matching algorithms have been proposed to speed up subgraph matching queries, among which the filter-join framework has been attracting increasing attention. Existing filtering strategies are able to compress candidate vertex sets to a certain size. However, quite a few invalid vertices are still left, leading to unnecessary computation in the later joining phases. We observe that the shortest distance between vertices can act as an important condition to further refine the candidate set. In this paper, we propose a shortest-distance estimation method based on this observation and design a new filtering method based on distance coding, thereby improving the efficiency of subgraph matching. The experimental results suggest that our method is more efficient and scalable than the state-of-the-art method.

Research Session 11: Information Extraction and Retrieval Time: 13:30-15:10, August 24, 2021, Tuesday Chair: Tianyong Hao

Distributed Top-k Pattern Mining Xin Wang, Mingyue Xiang, Huayi Zhan, Zhuo Lan, Yuang He, Yanxiao He, and Yuji Sha Southwest Petroleum University Abstract. Frequent pattern mining (FPM) on a single large graph has been receiving increasing attention, since it is crucial to applications in a variety of domains including, e.g., social network analysis. The FPM problem is defined as finding all the subgraphs (a.k.a. patterns) that appear frequently in a large graph according to a user-defined frequency threshold. In recent years, a host of techniques have been developed, but most of them suffer from high computational cost and inconvenient result inspection. To tackle these issues, in this paper we propose an approach to mining top-k patterns from a single graph G in the distributed scenario. We formalize the distributed top-k pattern mining problem by incorporating viable support and interestingness metrics. We then develop a parallel algorithm, which preserves the early-termination property, to efficiently discover top-k patterns. Using real-life and synthetic graphs, we experimentally verify that our algorithm is effective and outperforms traditional counterparts in both efficiency and scalability.

SQKT: A Student Attention-based and Question-aware Model for Knowledge Tracing Qize Xie, Liping Wang, Peidong Song, and Xuemin Lin East China Normal University Abstract. The goal of Knowledge Tracing (KT) is to trace a student's knowledge states in relation to different knowledge concepts and predict the student's performance on new exercises. With the growing number of online learning platforms, personalized learning is ever more urgently required, and KT has accordingly been widely explored in recent decades. Traditional machine learning based methods and deep neural network based methods have been constantly introduced to improve the prediction accuracy of KT models and have achieved some positive results. However, there are still challenges for KT research, such as the information representation of high-dimensional question data and the consideration of personalized learning ability. In this paper we propose a novel Student attention-based and Question-aware model for KT (SQKT), which addresses these challenges by estimating student attention on different types of questions through the student's historical exercise trajectory. First, we devise a weighted graph and propose a weighted deepwalk method to obtain the question embedding, which is combined with the correlated skills as the question representation. Second, we propose a novel student attention mechanism dedicated to updating the student's knowledge state. Finally, comprehensive experiments are conducted on four real-world datasets; the results demonstrate that our SQKT model outperforms state-of-the-art KT models on all datasets.

Comparison Question Generation Based on Potential Compared Attributes Extraction Jiayuan Xie, Wenhao Fang, Yi Cai, and Zehang Lin South China University of Technology Abstract. Question generation (QG) aims to automatically generate questions from a given passage and is widely used in education. Existing studies on the QG task mainly focus on answer-aware QG, which only asks about an independent object related to the expected answer. However, to prompt students to develop comparative thinking skills, multiple objects need to be focused on simultaneously in the QG task, which can be used to attract students to explore the differences and similarities between them. Towards this end, we consider a new task named comparison question generation (CQG). In this paper, we propose a framework that includes an attribute extractor and an attribute-attention seq2seq module. Specifically, the attribute extractor is based on the Stanford CoreNLP toolkit and recognizes the attributes related to the multiple objects that can be used for comparison. Then, the attribute-attention seq2seq module utilizes an attention mechanism to generate questions with the assistance of the attributes. Extensive experiments conducted on the HotpotQA dataset demonstrate the effectiveness of our framework, which outperforms a neural-based model and generates reliable comparison questions.

Multimodal Encoders for Food-Oriented Cross-Modal Retrieval Ying Chen, Dong Zhou, Lin Li, and Jun-mei Han University of Science and Technology Abstract. The task of retrieving across different modalities plays a critical role in food-oriented applications. Modality alignment remains a challenging component of the whole process, in which a common embedding feature space between two modalities can be learned for effective comparison and retrieval. Recent studies mainly utilize adversarial loss or reconstruction loss to align different modalities. However, insufficient features may be extracted from the different modalities, resulting in low-quality alignments. Unlike these methods, this paper proposes a method combining multimodal encoders with adversarial learning to learn improved and efficient cross-modal embeddings for retrieval purposes. The core of our proposed approach is directional pairwise cross-modal attention that latently adapts representations from one modality to another. Although the model is not particularly complex, experimental results on the benchmark Recipe1M dataset show that our proposed method is superior to current state-of-the-art methods.

Data Cleaning for Indoor Crowdsourced RSSI Sequences Jing Sun, Bin Wang, Xiaoxu Song, and Xiaochun Yang Northeastern University Abstract. Received Signal Strength Indication (RSSI) has been increasingly deployed in indoor localization and navigation. Compared with traditional fingerprint-based methods, crowdsourced methods can collect RSSIs cheaply and efficiently, without expert surveyors or designated fingerprint collection points. However, the crowdsourced RSSIs may contain false and incomplete data. In this paper, we focus on two quality issues in indoor crowdsourced RSSI sequences: missing values and false values. For the received signal strength values, we propose an RSSI sequence alignment and matching method to complete the missing values. For the location labels, we construct an indoor logical graph to capture the indoor topology and spatial consistency. To repair the missing and false location labels, we design an AP-distribution-based mapping method that maps crowdsourced RSSIs to the floor plan.

Research Session 12: Recommender System Time: 13:30-15:10, August 24, 2021, Tuesday Chair: Hui Li

A Behavior-aware Graph Convolution Network Model for Video Recommendation Wei Zhuo, Kunchi Liu, Taofeng Xue, Beihong Jin, Beibei Li, Xinzhou Dong, He Chen, Wenhai Pan, Xuejian Zhang, and Shuo Zhou MX Media Co., Ltd., Singapore Abstract. Interactions between users and videos are the major data source for performing video recommendation. Despite the many existing recommendation methods, user behaviors on videos, which imply the complex relations between users and videos, are still far from being fully explored. In this paper, we present a model named Sagittarius. Sagittarius adopts a graph convolutional neural network to capture the influence between users and videos. In particular, Sagittarius differentiates between different user behaviors by weighting, and fuses the semantics of user behaviors into the embeddings of users and videos. Moreover, Sagittarius combines multiple optimization objectives to learn user and video embeddings and then achieves video recommendation using the learned embeddings. The experimental results on multiple datasets show that Sagittarius outperforms several state-of-the-art models in terms of recall, unique recall and NDCG.

GRHAM: Towards Group Recommendation Using Hierarchical Attention Mechanism Nanzhou Lin, Juntao Zhang, Xiandi Yang, Wei Song, and Zhiyong Peng Wuhan University Abstract. Group recommendation extends individual user recommendation to groups and has become one of the most prevalent topics in recommender systems, widely applied in catering, tourism, movies, and many other fields. The key to group recommendation is how to aggregate the preferences of different group members and compute the group's preferences. However, existing aggregation strategies are static and simple. First, they ignore that the preferences of members change over time. Second, they fail to consider that a member's influence differs when the group makes decisions on different activities. To this end, this paper proposes a novel model for Group Recommendation using a Hierarchical Attention Mechanism (GRHAM), which can dynamically adjust the weight of members in group decision-making. Our model consists of two layers of attention neural networks: the first attention layer learns the influence weights of members when the group makes a decision, and the second attention layer learns the influence weights between group members. Besides aggregating the preferences of group members, we further learn group topic preferences from historical data. We conduct experiments on two real datasets, and the experimental results show that our model outperforms other group recommendation models.

Multi-Interest Network Based on Double Attention for Click-through Rate Prediction Xiaoling Xia, Wenjian Fang, and Xiujin Shi Donghua University Abstract. Whether in personalized advertising or in recommender systems, click-through rate (CTR) prediction is a very important task. In recent years, Alibaba Group has done a lot of advanced research on CTR prediction and proposed a technical route comprising several deep learning models. For example, in the Deep Interest Network (DIN) proposed by Alibaba Group, the sequence of a user's browsing behaviors, made up of the items the user clicked, is used to express the user's interest features. Usually, each item is mapped into a static vector, but a fixed-length embedding vector can hardly express the user's dynamic interest features. To solve this problem, Alibaba Group introduced the attention mechanism from the field of natural language processing (NLP) into DIN and designed a unique activation unit to extract the important information in the user's historical behavior sequence, using it to express the user's dynamic interest features. In this paper, we propose a novel deep learning model, the Multi-Interest Network Based on Double Attention for Click-Through Rate Prediction (DAMIN), which is based on DIN combined with a multi-head attention mechanism. In DIN, the attention weight between the candidate item vector and each item vector of the user's historical behavior sequence is learned by a fully connected neural network. Different from DIN, we design a new method that uses the reciprocal of the Euclidean distance to represent the attention weight between two item vectors.
Then, the item vectors in user’s historical behavior sequence are weighted by the attention weights and meanwhile the candidate item vectors are also weighted by the attention weights. In the next, we can obtain new item vectors by add the weighted item vectors of user’s historical behavior sequence and weighted candidate item vectors, and those new item vectors are used to represent user’s dynamic interest feature vectors. In the end, the user’s dynamic interest features are send into the three multi-head attention layers, which can extract users’ various interest features. We have conducted a lot of experiments on three real-world datasets of Amazon and the results show that the model proposed by this paper acquires a better performance than some classical models. Compared with DIN, the model proposed in this paper improves the average of AUC by 4% - 5%, which proves that the model proposed in this paper is effective. In addition, a large number of ablation experiments have been carried out to prove that each module of the proposed model is effective.
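The reciprocal-Euclidean-distance attention described in the abstract can be illustrated as follows. The normalization of the weights and the per-item weighting of the candidate vector are our assumptions about details the abstract leaves open.

```python
import numpy as np

def recip_dist_attention(candidate, history, eps=1e-8):
    """Attention weights as the reciprocal of the Euclidean distance
    between the candidate item vector and each item vector in the
    user's behavior history; closer items receive larger weights."""
    d = np.linalg.norm(history - candidate, axis=1)
    w = 1.0 / (d + eps)          # reciprocal distance as attention score
    w = w / w.sum()              # normalize to sum to 1 (assumption)
    weighted_history = w[:, None] * history
    weighted_candidate = w[:, None] * candidate
    # Per-item "dynamic interest" vectors: weighted history + weighted candidate.
    return weighted_history + weighted_candidate
```

In the full model these per-item vectors would then pass through the three multi-head attention layers; that stage is omitted here.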

Self-residual Embedding for Click-Through Rate Prediction Jingqin Sun, Yunfei Yin, Faliang Huang, Mingliang Zhou, and Leong Hou U Chongqing University Abstract. On the Internet, categorical features are high-dimensional and sparse; to obtain their low-dimensional, dense representations, the embedding mechanism plays an important role in click-through rate prediction for recommendation systems. Prior works have shown that residual networks help improve the performance of deep learning models, but few works learn and optimize the embedded representation of raw features through the residual idea in recommendation systems. Therefore, we design a self-residual embedding structure that learns the difference between the randomly initialized embedding vector and the ideal embedding vector by computing a self-correlation score, and apply it to our proposed SRFM model. Extensive experiments on four real datasets show that SRFM achieves satisfactory performance compared with state-of-the-art models. Moreover, the self-residual embedding mechanism can improve the prediction performance of some existing deep learning models to a certain extent.

GCNNIRec: Graph Convolutional Networks with Neighbor complex Interactions for Recommendation Teng Mei, Tianhao Sun, Renqin Chen, Mingliang Zhou, and Leong Hou U Chongqing University Abstract. In recent years, tremendous efforts have been made to explore the features contained in user-item graphs for recommendation based on Graph Neural Networks (GNNs). However, most existing GNN-based recommendation methods use only a weighted sum of the features of directly linked nodes, assuming that neighboring nodes are independent individuals and neglecting possible correlations between them, which may result in a failure to capture co-occurrence signals. Therefore, in this paper, we propose a novel Graph Convolutional Network with Neighbor complex Interactions for Recommendation (GCNNIRec), focused on capturing possible co-occurrence signals between node neighbors. Specifically, GCNNIRec contains two types of modules: the Linear-Aggregator module and the Interaction-Aggregator module. The former linearly aggregates the features of neighboring nodes to obtain the representation of the target node. The latter utilizes the interactions between neighbors to aggregate the co-occurrence features of nodes. Furthermore, empirical results on three real datasets confirm not only the state-of-the-art performance of GCNNIRec but also the performance gains achieved by introducing the Interaction-Aggregator module into GNNs.
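The contrast between the two aggregators can be sketched as below. Summing element-wise products over neighbor pairs is a common way to model pairwise interactions; whether GCNNIRec uses exactly this form is our assumption.

```python
import numpy as np

def linear_aggregate(neighbors):
    """Linear-Aggregator sketch: mean of directly linked neighbor features."""
    return neighbors.mean(axis=0)

def interaction_aggregate(neighbors):
    """Interaction-Aggregator sketch: sum of element-wise products over
    all neighbor pairs, capturing co-occurrence between neighbors.
    Uses the identity sum_{i<j} n_i*n_j = ((sum n)^2 - sum n^2) / 2."""
    s = neighbors.sum(axis=0)
    return 0.5 * (s * s - (neighbors * neighbors).sum(axis=0))
```

With two neighbors the interaction term reduces to their element-wise product, which makes the pairwise nature of the aggregation easy to verify.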

Research Session 13: Spatial and Spatio-Temporal Databases Time: 13:30-15:10, August 24, 2021, Tuesday Chair: Edison TZE

Velocity-Dependent Nearest Neighbor Query Xue Miao, Xi Guo, Xiaochun Yang, Lijia Yang, Zhaoshun Wang, and Aziguli Wulamu University of Science and Technology Beijing Abstract. Location-based services recommend points of interest (POIs) that are near the user's position q. In practice, when the user is moving with a velocity v, he may prefer nearer POIs that match his moving direction. In this paper, we propose the velocity-dependent nearest neighbor query (VeloNN query), which selects the POIs that are nearer and best match the user's moving direction. In the VeloNN query, if the direction of a POI o highly matches the direction of v, o is likely to be preferred. Since computing the directional preferences of all POIs is time-consuming, we propose rules to filter out POIs with low directional preferences. We also divide the space into tiles, i.e., rectangular areas, and compute a candidate set for each tile in advance. The VeloNN candidates can then be prepared quickly after finding the tile where the user is located. We conduct experiments on both synthetic and real datasets, and the results show that the proposed algorithms support VeloNN queries efficiently.
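A minimal version of "nearer and matching the moving direction" can be scored by blending cosine similarity with distance. The exact scoring function in the VeloNN paper may well differ; the weight `alpha` and the linear blend are illustrative assumptions.

```python
import numpy as np

def directional_preference(q, v, poi, alpha=0.5):
    """Score a POI by combining closeness to the user's position q with
    how well its direction from q matches the velocity v (illustrative)."""
    d = poi - q
    dist = np.linalg.norm(d)
    # Cosine between the direction to the POI and the moving direction.
    cos = d @ v / (dist * np.linalg.norm(v) + 1e-9)
    # Higher is better: reward direction match, penalize distance.
    return alpha * cos - (1 - alpha) * dist
```

For two POIs at equal distance, the one ahead of the user (along v) scores higher than the one behind, which is the behavior the query is after.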

Finding Geo-Social Cohorts in Location-Based Social Networks Muhammad Aamir Saleem, Toon Calders, Torben Bach Pedersen, and Panagiotis Karras Aalborg University Abstract. Given a record of geo-tagged activities, how can we suggest groups, or cohorts of likely companions? A brute-force approach is to perform a spatio-temporal join over past activity traces to find groups of users recorded as moving together; yet such an approach is inherently unscalable. In this paper, we propose that we can identify and predict such cohorts by leveraging information on social ties along with past geo-tagged activities, i.e., geo-social information. In particular, we find groups of users that (i) form cliques of friendships and (ii) maximize a function of common pairwise activities on record among their members. We show that finding such groups is an NP-hard problem, and propose a nontrivial algorithm, COVER, which works as if it were enumerating maximal social cliques, but guides its exploration by a pruning-intensive activity-driven criterion in place of a clique maximality condition. Our experimental study with real-world data demonstrates that COVER outperforms a brute-force baseline in terms of efficiency and surpasses an adaptation of previous work in terms of prediction accuracy regarding groups of companions, including groups that do not appear in the training set, thanks to its use of a social clique constraint.

Modeling Dynamic Spatial Influence for Air Quality Prediction With Atmospheric Prior Dan Lu, Le Wu, Rui Chen, Qilong Han, Yichen Wang, and Yong Ge Harbin Engineering University Abstract. Air quality prediction is an important task benefiting both individual outdoor activities and urban emergency response. To account for complex temporal factors that influence long-term air quality, researchers have formulated this problem using an encoder-decoder framework that captures the non-linear temporal evolution. Besides, as air quality presents natural spatial correlation, researchers have proposed to learn the spatial relation with either a graph structure or an attention mechanism. As well supported by atmospheric dispersion theories, air quality correlation among different monitoring stations is dynamic and changes over time due to atmospheric dispersion, leading to the notion of dispersion-driven dynamic spatial correlation. However, most previous works treated spatial correlation as a static process, and nearly all models relied on only data-driven approaches in the modeling process. To this end, we propose to model dynamic spatial influence for air quality prediction with atmospheric prior. The key idea of our work is to build a dynamic spatial graph at each time step with physical atmospheric dispersion modeling. Then, we leverage the learned embeddings from this dynamic spatial graph in an encoder-decoder model to seamlessly fuse the dynamic spatial correlation with the temporal evolution, which is key to air quality prediction. Finally, extensive experiments on real-world benchmark data clearly show the effectiveness of the proposed model.

Learning Cooperative Max-Pressure Control by Leveraging Downstream Intersections Information for Traffic Signal Control Yuquan Peng, Lin Li, Qing Xie, and Xiaohui Tao Wuhan University of Technology Abstract. Traffic signal control problems are critical at urban intersections. Recently, deep reinforcement learning has demonstrated impressive performance in the control of traffic signals. However, the design of state and reward functions is often heuristic, which leads to highly vulnerable performance. To solve this problem, some studies introduce transportation theory into deep reinforcement learning to support the design of the reward function, e.g., max-pressure control, and have yielded promising performance. We argue that the constant changes of intersections' pressure can be better represented by considering downstream neighboring intersections. In this paper, we propose CMPLight, a deep reinforcement learning traffic signal control approach with a novel cooperative max-pressure-based reward function that leverages the vehicle queue information of neighborhoods. The approach employs cooperative max-pressure to guide the design of the reward function in deep reinforcement learning. We theoretically prove that it is stabilizing when the average traffic demand is admissible and the traffic flow is stable in the road network. The state of deep reinforcement learning is enhanced by neighboring information, which helps to learn a detailed representation of the traffic environment. Extensive experiments are conducted on synthetic and real-world datasets. The experimental results demonstrate that our approach outperforms traditional heuristic transportation control approaches and state-of-the-art learning-based approaches in terms of the average travel time of all vehicles in the road network.
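The classic (non-cooperative) max-pressure policy that CMPLight builds on is simple to state: the pressure of a phase is its total upstream queue length minus its total downstream queue length, and the controller activates the phase with the largest pressure. CMPLight's cooperative variant additionally folds in downstream neighbors' queues; the sketch below shows only the classic rule.

```python
def pressure(incoming_queues, outgoing_queues):
    """Pressure of a phase: total upstream queue length minus total
    downstream queue length (classic max-pressure definition)."""
    return sum(incoming_queues) - sum(outgoing_queues)

def max_pressure_phase(phases):
    """Pick the phase with the largest pressure.
    phases: dict mapping phase name -> (incoming queues, outgoing queues)."""
    return max(phases, key=lambda p: pressure(*phases[p]))
```

In CMPLight this quantity guides the reward function of the reinforcement learner rather than being applied directly as the control rule.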

Privacy-Preserving Healthcare Analytics of Trajectory Data Carson K. Leung, Anifat M. Olawoyin, and Qi Wen University of Manitoba Abstract. Technological advancements have led to the generation and collection of big data from various sources, including mobile devices. For instance, to prevent, combat, and detect COVID-19, citizens of many countries were encouraged to use contact tracing apps on their mobile devices. The collected trajectories can be analyzed and mined for social good; at the same time, user privacy also needs to be preserved. In other words, the advent of COVID-19 has made releasing patient records imperative, and yet the privacy of individuals must be protected. Releasing spatio-temporal COVID-19 data plays a significant role in contact tracing and may help reduce the spread of the disease by increasing adherence to social distancing and other health-related guidelines among people around the cluster of the released data. In this paper, we examine the problem of preserving the privacy of spatio-temporal trajectory data and introduce a hierarchical temporal representative point (HTRP) differential privacy model. We evaluate our framework using a South Korean COVID-19 patient route dataset. Empirical results show the balance of utility and privacy provided by our framework with HTRP for privacy-preserving healthcare data analytics.
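The standard building block behind differential privacy models like this one is the Laplace mechanism on released counts; HTRP's actual construction (hierarchical temporal representative points) is more elaborate, so treat this only as background on the privacy primitive.

```python
import numpy as np

def laplace_counts(counts, epsilon, sensitivity=1.0, seed=0):
    """Epsilon-differentially-private release of histogram counts via
    the Laplace mechanism: add Laplace(sensitivity / epsilon) noise to
    each count, then clamp to zero since counts cannot be negative."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(0.0, sensitivity / epsilon, size=len(counts))
    return [max(0.0, c + n) for c, n in zip(counts, noise)]
```

Smaller `epsilon` means stronger privacy and noisier counts, which is exactly the utility/privacy trade-off the abstract refers to.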

Demo Session Time: 13:30-15:10, August 24, 2021, Tuesday Chair: Yanghui Rao

PARROT: An Adaptive Online Shopping Guidance System Da Ren, Yi Cai, Zhicheng Zhong, Zhiwei Wu, Zeting Li, Weizhao Li, and Qing Li The Hong Kong Polytechnic University Abstract. With the development of e-commerce, it is necessary to build an online shopping guidance system to help users choose the products they desire. Task-oriented dialogue systems can serve as online shopping guidance systems on e-commerce websites. Current dialogue systems can only extract basic attributes, i.e., the inherent attributes of products. They cannot process users' requests containing high-level attributes, which describe products' functions and user experience, yet such requests appear frequently in real scenarios. To solve this problem, we build PARROT, an adaptive online shopping guidance system. PARROT can extract both basic and high-level attributes from dialogues and recommend suitable products to users. The novel features of PARROT are as follows: (1) We propose a new architecture for task-oriented dialogue systems that can extract both basic and high-level product attributes (functional attributes and experience attributes). (2) We construct a knowledge base that maps high-level attributes to basic-level attributes or products. (3) We build a task-oriented dialogue system that can complete the task of shopping guidance on websites. We test PARROT in three main scenarios, and these tests demonstrate that PARROT can successfully recommend suitable products to users by extracting both basic and high-level attributes.

gStore-C: A Transactional RDF Store With Light-Weight Optimistic Lock Zhe Zhang and Lei Zou Peking University Abstract. RDF systems are widely applied in many areas such as knowledge bases, the semantic web, and social networks. Traditional RDF systems focus on speeding up SPARQL queries on large RDF data while disregarding the performance of updates and transaction processing. In this demonstration, we propose a new transactional RDF system based on multi-versioning and MVCC. We introduce a lightweight optimistic lock built on atomic variables and operations that provides fine-grained locking and avoids scalability issues. The methods are fully implemented in the open-source RDF system gStore, and it outperforms other state-of-the-art RDF systems on transactional workloads.

Deep-gAnswer: A Knowledge Based Question Answering System Yinnian Lin, Minhao Zhang, Ruoyu Zhang, and Lei Zou Peking University Abstract. In this demonstration, we present Deep-gAnswer, a knowledge-based question answering system. gAnswer is based on semantic parsing and heuristic rules for entity recognition, relation recognition, and SPARQL generation. Making use of a pre-trained model, we implement new entity and relation recognition networks. We also found that the traditional method works better when entity and relation information is correctly given. Therefore, we combine the entity and relation recognition networks with the previous SPARQL generation process to obtain Deep-gAnswer. Experimental results show that Deep-gAnswer outperforms its predecessor, especially on the Chinese dataset.

ALMSS: Automatic Learned Index Model Selection System Rui Zhu, Hongzhi Wang, Yafeng Tang, and Bo Xu Harbin Institute of Technology Abstract. An index is an indispensable part of a database. As we enter the era of big data, traditional index structures have been found not to support large-scale data well. Although many index structures, such as machine-learning-based learned indexes, have been proposed to solve these problems, it remains a great challenge to select the most suitable learned index for a specific application. To solve this problem, we design ALMSS, an automatic learned index model selection system, which provides a user-friendly interface and helps users automatically select a learned index model. In this paper, we introduce the overall architecture and main technologies of ALMSS, and present a demonstration of the system.

GPKRS: A GPU-Enhanced Product Knowledge Retrieval System Yuming Lin, Hao Song, Chuangxin Fang, and You Li Guilin University of Electronic Technology Abstract. In this demonstration, we present a GPU-enhanced product knowledge retrieval system called GPKRS, which stores product knowledge based on sparse matrix compression and introduces a query transformation module to transform query operations into the corresponding matrix operations. In this way, we can take advantage of the powerful parallel computing capability of GPUs to accelerate SPARQL query processing. Further, GPKRS adopts an optimized pipeline query strategy to speed up query execution. The experiments show that GPKRS achieves state-of-the-art query performance on the LUBM dataset and a synthetic product knowledge dataset.
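The general idea of answering graph-pattern queries with matrix operations can be shown in miniature: encode all triples sharing a predicate as an adjacency matrix, so that a one-hop SPARQL pattern becomes a row lookup (and multi-hop patterns become matrix products). This mirrors the abstract's sparse-matrix idea only loosely; GPKRS's actual compressed storage layout and GPU pipeline are not reproduced here.

```python
import numpy as np

def build_predicate_matrix(triples, predicate, n_entities):
    """Encode all (s, predicate, o) triples as a boolean adjacency
    matrix over integer entity IDs (dense here for simplicity; a real
    system would use a compressed sparse format)."""
    M = np.zeros((n_entities, n_entities), dtype=bool)
    for s, p, o in triples:
        if p == predicate:
            M[s, o] = True
    return M

def objects_of(M, subject):
    """Answer `SELECT ?o WHERE { subject predicate ?o }` as a row lookup."""
    return np.flatnonzero(M[subject]).tolist()
```

Chaining two predicates then reduces to a boolean matrix product, which is the kind of operation a GPU parallelizes well.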

Standard-oriented Standard Knowledge Graph Construction and Applications System Haopeng Ren, Yi Cai, Mingying Zhang, Wenjian Hao, and Xin Wu South China University of Technology Abstract. Standards are important normative documents in many industries that can effectively guide the production process and ensure product quality. However, the establishment and application of standards are time-consuming and labor-intensive. Motivated by this, we observe that knowledge graph techniques can effectively model text data as sets of triples. Considering the special characteristics of standard documents, we propose an architectural framework for the construction of a standard knowledge graph and design two applications in our system.

Workshops

The 4th International Workshop on Knowledge Graph Management and Applications (KGMA 2021)

Time: 9:00-12:00, August 25, 2021, Wednesday Co-Chairs: Qingpeng Zhang, City University of Hong Kong, China Xin Wang, Tianjin University, China

Invited Talk: Make It Easy to Query Knowledge Graphs: Issues and Technologies Abstract: As large semantic repositories, knowledge graphs have gained extensive attention from both academic and industrial communities. Generally, huge knowledge graphs are built automatically from structured, semi-structured, and/or even unstructured data. Querying these knowledge graphs accurately often demands the ability to use a structured query language, e.g., SPARQL or Cypher. However, it is very difficult for ordinary users to query and explore knowledge graphs, and the schema-less problem further increases the difficulty of accessing them. In this talk, we will discuss the topic of knowledge graph usability, as well as current issues in this direction. In particular, we will share some recent progress on querying knowledge graphs using natural language questions.

Weiguo Zheng Professor, Fudan University Speaker Bio: Dr. Weiguo Zheng is an associate professor at the School of Data Science, Fudan University, China. He has a broad interest in data management and understanding. His research focuses on graph data, such as knowledge graphs, natural language question answering, and social networks. More specifically, he is now conducting research into techniques that make it easier to explore and query knowledge graphs. He is a regular invited reviewer for journals including IEEE TKDE, World Wide Web Journal, Information Sciences, and Information Systems, and serves on program committees including KDD, VLDB, IJCAI, AAAI, DASFAA, and ICDM.

Accepted Papers: ·Should I Stay or Should I Go: Predicting Changes in Cluster Membership Evangelia Tsoukanara, Georgia Koloniari and Evaggelia Pitoura

·SMat-J: A Sparse Matrix-based Join for SPARQL Query Processing Ximin Sun, Ming Liu, Shuai Wang, Xiaoming Li, Bin Zheng, Dan Liu and Hongshen Yu

·A Distributed Engine for Multi-query Processing Based on Predicates with Spark Bin Zhang, Ximin Sun, Liwei Bi, Changhao Zhao, Xin Chen, Xin Li and Lei Sun

·Product clustering analysis based on the retail product knowledge graph Yang Ye and Qingpeng Zhang

The Third International Workshop on Semi-structured Big Data Management and Applications (SemiBDMA 2021)

Time: 9:00-12:00, August 25, 2021, Wednesday Chairs: Qun Chen, Northwestern Polytechnical University, China Jianxin Li, Deakin University, Australia

Accepted Papers: ·Survival effect of Internet macroscopic topology evolution He Tian, Kaihong Guo, Mingxi Cui, and Zheng Wu (Liaoning University; Liaoning Institute of Science and Technology)

·The method for image noise detection based on the amount of knowledge associated with intuitionistic fuzzy sets Kaihong Guo and Yongzhi Zhou (Liaoning University)

·Link Prediction of Heterogeneous Information Networks Based on Frequent Subgraph Evolution Dong Li, Haochen Hou, Tingwei Chen, Xiaoxue Yu, Xiaohuan Shan, and Junlu Wang (Liaoning University)

·Memory Attentive Cognitive Diagnosis for Student Performance Prediction Congjie Liu and Xiaoguang Li (Liaoning University)

The 2nd International Workshop on Deep Learning in Large-scale Unstructured Data Analytics (DeepLUDA 2021)

Time: 14:00-17:00, August 25, 2021, Wednesday Organizers: Tae-Sun Chung, Ajou University, Korea Jianming Wang, Tiangong University, China Zhetao Li, Xiangtan University, China Chair: Rize Jin, Tiangong University, China

Invited Speech: Accurate Deep Medical Image Segmentation Speaker: Prof. Zhenghua Xu, Hebei University of Technology Abstract: Accurate medical image segmentation is one of the important tasks in computer-aided diagnosis and treatment, and is of great significance in clinical practice. Generally, existing deep learning based medical image segmentation techniques face three kinds of problems: a small number of samples, incomplete annotation, and small objects. Accordingly, this talk will first present our recent research on using an advanced CycleGAN model for generative data augmentation, then introduce a semi-supervised multi-modality contrastive mutual learning method to remedy the problem of incomplete annotation. Finally, a novel deep medical image segmentation framework that integrates a hybrid learning signal, an efficient multi-dimensional attention mechanism, and complicated multi-scale convolution for accurate small-object segmentation will be discussed.

Accepted Papers: ·A Study on the Privacy Threat Analysis of PHI-code Dongyue Cui, Yanji Piao

·Logistics Policy Evaluation Model Based on Text Mining Guangwei Miao, Shuaiqi Wang, Chengyou Cui

·MRRVOS: Modular Refinement Referring Video Object Segmentation Zhijiang Duan, Yukuan Sun and Jianming Wang