AQEyes: Visual Analytics for Anomaly Detection and Examination of Air Quality Data

Dongyu Liu (CSE, HKUST, Hong Kong, China)
Kalyan Veeramachaneni (LIDS, MIT, Cambridge, MA, USA)
Alexander Geiger (LIDS, MIT, Cambridge, MA, USA)
Victor O.K. Li (EEE, HKU, Hong Kong, China)
Huamin Qu (CSE, HKUST, Hong Kong, China)

arXiv:2103.12910v1 [cs.HC] 24 Mar 2021

Abstract—Anomaly detection plays a key role in air quality analysis by enhancing situational awareness and alerting users to potential hazards. However, existing anomaly detection approaches for air quality analysis have their own limitations regarding parameter selection (e.g., the need for extensive domain knowledge), computational expense, general applicability (e.g., requiring labeled data), interpretability, and the efficiency of analysis. Furthermore, the poor quality of collected air quality data (inconsistently formatted and sometimes missing) also increases the difficulty of analysis substantially. In this paper, we systematically formulate design requirements for a system that can solve these limitations and then propose AQEyes, an integrated visual analytics system for efficiently monitoring, detecting, and examining anomalies in air quality data. In particular, we propose a unified end-to-end tunable machine learning pipeline that includes several data pre-processors and featurizers to deal with data quality issues. The pipeline integrates an efficient unsupervised anomaly detection method that works without the use of labeled data and overcomes the limitations of existing approaches. Further, we develop an interactive visualization system to visualize the outputs from the pipeline. The system incorporates a set of novel visualization and interaction designs, allowing analysts to visually examine air quality dynamics and anomalous events at multiple scales and from multiple facets. We demonstrate the performance of this pipeline through a quantitative evaluation and show the effectiveness of the visualization system using qualitative case studies on real-world datasets.

Index Terms—anomaly detection, air quality, multiple time-series, visualization

I. INTRODUCTION

The rapid processes of industrialization and urbanization have greatly improved the economy while also intensifying air pollution issues, which cause tremendous physical and psychological harm to humans. A report from WHO has shown that ambient air pollution caused around 4.2 million premature deaths worldwide in 2016¹. Thus, monitoring the dynamics of air pollutants is an important and pressing task. Failure to detect and respond to unusual changes (e.g., an outbreak) of air pollutants will both create enormous risks to human health and cause great loss to our economy. Anomaly detection, therefore, becomes an essential part of air quality analysis. Detecting anomalies is useful in quickly identifying an air pollution event, which is defined as a valid observation of unexpected air pollutant concentrations at a certain time period and place compared with previous observations from that location [1]. Based on experience, air quality events can result from unusual weather conditions or some local sources such as trash burning.

¹ https://www.who.int/en/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health

Currently, anomaly detection for air quality data primarily utilizes statistical and threshold-based [2], density-based [3], learning-based (needing labeled data) [4], [5], and visualization-based approaches [6], [7]. Each of these approaches has drawbacks. Statistical and threshold-based methods require extensive human knowledge to specify model parameters and become problematic when parametric assumptions are violated. Density-based methods are computationally intensive, and it is challenging to define the distance between multivariate measurement data. Both statistics-based and density-based methods are unable to capture anomalies that are characterized by temporal trends. Moreover, learning-based methods require high-quality labeled data that is often unavailable or too time-consuming to collect. Visualization-based methods allow analysts to flexibly and adaptively identify and interpret anomalies, but they also require analysts to manually observe multiple air quality variables, which becomes more impractical as the amount of data increases. Hence, to ensure efficiency, accuracy, and interpretability, a system that integrates a more accurate, scalable, and intelligent anomaly detection method with application-tailored visualization techniques is needed.

The following four key technical challenges for designing that system can be identified. First, air quality analysis usually involves pollutant and weather data collected at multiple air quality monitoring stations and at different time granularities, leading to inconsistently formatted data. The data is also often missing due to unavoidable factors (e.g., malfunction of sensors). Hurdles like these increase the difficulty of obtaining high-quality inputs for anomaly detection. Second, the lack of labeled anomalies necessitates the use of unsupervised or semi-supervised approaches. Third, a single model is insufficient to handle all situations, because stations in different regions have different environments and standards concerning anomalies. Lastly, the large scale of the data introduces many obstacles in designing a visualization system to support efficient anomaly pattern exploration and examination.

In this paper, we systematically formulate the system design requirements and then propose AQEyes, an integrated visual analytics system for efficiently monitoring, detecting, and examining anomalies in air quality data. The major contributions of our work are summarized as follows:

• We propose a unified end-to-end tunable machine learning pipeline, which solves the problems of missing and differently-granularized data and integrates an efficient unsupervised anomaly detection method adapted and extended from other domains.
• We propose several novel visualization and interaction designs to cooperate seamlessly with the machine learning pipeline. The designs enable analysts to efficiently explore and examine air quality dynamics and anomalous events from different perspectives and levels of detail.
• We evaluate the machine learning pipeline and visualization designs through both quantitative and qualitative case studies on two real-world air quality data sets.

II. RELATED WORK

A. Air Quality Analysis

For many years, researchers from various domains have spent a great deal of effort on the development of data analysis techniques for understanding air quality. Several excellent surveys summarize the techniques well [3], [8]–[10]. In the following, we focus on the most relevant work.

1) Anomaly detection is a key task in air quality analysis: The general approach of anomaly detection is to find unexpected patterns in data. The simplest anomaly detection approaches are out-of-limits methods, which flag locations surpassing predefined thresholds on raw values. A number of other more complex anomaly detection techniques have been proposed as improvements on out-of-limits approaches. These can be divided into four categories, including statistics-based [2], [11], [12], density-based [13]–[16], learning-based (usually requiring labeled data) [4], [5], and visualization-based

introduce a modular end-to-end machine learning pipeline that incorporates various preprocessing steps, an LSTM model, and dynamic error processing to detect anomalous sequences in the time series in an unsupervised manner. The advantage of the pipeline is the ease with which any single module of the pipeline can be changed, allowing analysts to easily develop multiple pipelines and evaluate their performance.

2) Visual analytics is an important tool for air quality analysis: Visualization exploits humans' excellent ability to perceive visual patterns, thereby seamlessly connecting humans to the data analysis process. Visualizations incorporating appropriate data reduction techniques can provide analysts with a straightforward and natural way to monitor multiple air quality variables and their evolution [6]. Qu et al. [22] present a comprehensive system to analyze the air pollution problem in Hong Kong, where a series of novel visualizations, such as circular bar charts and weighted complete graphs, are integrated to investigate the correlation between multiple attributes. Chen et al. [23] introduce a novel tree structure to organize the correlations among air quality variables, enabling analysts to monitor the evolving correlations among these variables. Quinan and Meyer [24] propose a set of encoding choices and interaction methods to interpret how multiple weather features relate to forecasting outcomes for more precise results. Du et al. [25] develop an interactive visualization system to support efficient exploration of air quality data at multiple scales.

To the best of our knowledge, our system is the first comprehensive visual analytics system that is primarily designed for anomaly detection, exploration, and assessment of air quality data. The system is built on a machine learning pipeline, which can not only produce comprehensible results more efficiently but also provide various ways for analysts to interact with the results in rich spatiotemporal context.

B. Multivariate Spatial Time-series Visualization

Air quality data can be regarded as multivariate time-series