Quda: Natural Language Queries for Visual Data Analytics

SIWEI FU, Zhejiang Lab, China
KAI XIONG, Zhejiang University, China
XIAODONG GE, Zhejiang University, China
SILIANG TANG, Zhejiang University, China
WEI CHEN, Zhejiang University, China
YINGCAI WU, Zhejiang University, China

[Figure 1: three-stage pipeline — 1. Expert Query Collection (36 data tables from 11 domains; interviews with 20 experts; 920 expert queries), 2. Paraphrase Generation (crowdsourcing; 20 paraphrases per expert query), 3. Paraphrase Validation (crowd and machine validation results).]

Fig. 1. Overview of the data acquisition procedure, which consists of three stages. In the first stage, we collect 920 queries by interviewing 20 data analysts. Second, we expand the corpus by collecting paraphrases using crowd intelligence. Third, we use both crowd workers and a machine model to score paraphrases and reject those with low quality.

The identification of analytic tasks from free text is critical for visualization-oriented natural language interfaces (V-NLIs) to suggest effective visualizations. However, it is challenging due to the ambiguity and complexity of human language. To address this challenge, we present a new dataset, called Quda, that aims to help V-NLIs recognize analytic tasks from free-form natural language by training and evaluating cutting-edge multi-label classification models. Our dataset contains 14,035 diverse user queries, each annotated with one or multiple analytic tasks. We achieve this goal by first gathering seed queries with data analysts and then employing extensive crowd force for paraphrase generation and validation. We demonstrate the usefulness of Quda through three applications. This work is the first attempt to construct a large-scale corpus for recognizing analytic tasks. With the release of Quda, we hope it will boost the research and development of V-NLIs in data analysis and visualization.

CCS Concepts: • Human-centered computing → Visualization.

Additional Key Words and Phrases: natural language, dataset, analytical tasks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2018 Association for Computing Machinery. Manuscript submitted to ACM


ACM Reference Format: Siwei Fu, Kai Xiong, Xiaodong Ge, Siliang Tang, Wei Chen, and Yingcai Wu. 2018. Quda: Natural Language Queries for Visual Data Analytics. In Woodstock ’18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
Research in visualization-oriented natural language interfaces (V-NLIs) has attracted increasing attention in the visualization community. Data analytics or dataflow systems such as Articulate [42] and FlowSense [47] leverage V-NLIs to provide an easy-to-use, convenient, and engaging user experience, and to facilitate the flow of data analysis. Existing V-NLIs, however, face two key challenges [47]. On the one hand, making effective design decisions, e.g., choosing a proper visualization, is challenging because of the large design space. V-NLIs can address this issue by employing findings of empirical studies that provide design recommendations given analytic tasks and datasets [36]. However, addressing this issue requires V-NLIs to distill analytic tasks from queries before making a design choice. On the other hand, due to the complex nature of human language, understanding free-form natural language input (or “query” in the following article) is non-trivial. Recent research in V-NLI has focused mainly on rule-based language parsers to understand users’ intent [39, 41, 47]. Although effective, these approaches are narrow in usage scenarios because of limited grammar rules. Moreover, establishing rules is cumbersome and error-prone, and may require extensive knowledge in both visualization and natural language processing (NLP).
Learning-based approaches have reached state-of-the-art performance on various NLP tasks and have great potential for understanding free-form queries in V-NLIs. Large-scale training examples catered to visual data analytics are critical for developing and benchmarking learning-based models to be deployed in V-NLIs. For example, multi-label text classification techniques trained with a large number of query-task pairs are able to categorize user queries into analytic tasks, which contributes to the choice of a proper visualization. Admittedly, the acquisition of such a corpus is difficult for two reasons. First, the target users of V-NLIs are often identified as data analysts who are knowledgeable in analytic tasks [41, 47]. Therefore, a high-quality corpus should reflect how analysts interact with data under various tasks. Second, compared to general-purpose datasets, such as existing corpora collected via social media [24] or online news corpora [48], expert queries for V-NLIs are not naturally occurring.
In this paper, we present a textual corpus, Quda1, that is the first attempt to bridge learning-based NLP techniques with V-NLIs from the dataset perspective. Quda consists of a large number of query-task pairs, aiming to train and evaluate deep neural networks that categorize queries into analytic tasks. Our corpus features high-quality and large-volume queries designed in the context of V-NLIs. To accomplish this goal, we first design a lab study to collect seed queries by interviewing data analysts. Next, to augment the data volume and diversify the ways of expressing each query, we employ the crowd to collect a number of sentential paraphrases for each query. These paraphrases are restatements with approximately the same meaning. Finally, we design a validation procedure for quality control. The Quda dataset contains 920 utterances created by data analysts, and each is accompanied by 14 paraphrases on average. All queries are annotated with at least one of ten analytic tasks.
To present the usefulness of Quda, we run three experiments. In the first experiment, we train and evaluate a multi-label classification model on Quda with different data splits. We reveal that Quda is a challenging dataset for multi-label classification, and leaves room for developing new classification models for V-NLIs. Second, we use the model trained on Quda to augment an existing V-NLI toolkit, showing that the model is more robust in task inference.

1We make our corpus and related materials publicly available on https://freenli.github.io/quda/

Third, with the help of Quda, we build a query recommendation technique that enables V-NLIs to suggest valuable queries given partial or vague user input. To summarize, the primary contributions of this paper are as follows:

• By employing both expert and crowd intelligence, we construct a large-scale and high-quality corpus, named Quda, that contains 14,035 queries in the domain of visual data analytics.
• Building upon existing research on V-NLIs and task taxonomies, we define the space of queries for V-NLIs and highlight the scope of our corpus along six characteristics.
• We conduct three experiments to show that Quda can benefit a wide range of applications, including 1) training and benchmarking deep neural networks, 2) improving the robustness of task inference for V-NLIs, and 3) providing query recommendation for V-NLIs.

With the release of Quda, we hope it will stimulate the publication of similar large-scale datasets, and boost research and development of V-NLIs in data analysis and visualization.

2 RELATED WORK
2.1 NLI for Data Visualization
Our research is motivated by recent progress in V-NLIs, which provide an engaging and effective user experience by combining direct manipulation and natural language as input.
Some research emphasizes the multi-modal interface that interacts with users using natural language. Back in 2010, Articulate [42] identified nine task features and leveraged keyword-based classification methods to understand how a user's imprecise query weighs in each feature. Sisl [14] is a multi-modal interface that accepts various input modes, including a point-and-click interface, NL input in textual form, and NL using speech. Analyza [17] is a system that combines a V-NLI with a structured interface to enable effective data exploration. To address the ambiguity and complexity of natural language, DataTone [19] presents a mixed-initiative approach to manage ambiguity in the user query. FlowSense [47] is a V-NLI designed for a dataflow visualization system. It applies semantic parsing to support natural language queries and allows users to manipulate multiple views in a dataflow visualization system. Several commercial tools, such as Power BI [2], IBM Analytics [1], Wolfram Alpha [5], Tableau [3], and ThoughtSpot [4], integrate V-NLIs to provide a better analytic experience for novice users.
Another line of research targets the conversational nature of V-NLIs. Fast et al. proposed Iris [18], a conversational user interface that helps users with data science tasks. Kumar et al. [23] aimed to develop a data analytics system that automatically generates visualizations using a full-fledged conversational interface. Eviza [39] and Evizeon [21] are visualization systems that enable natural language interactions to create and manipulate visualizations in a cycle of visual analytics. Similarly, Orko [41] is a prototype visualization system that combines a natural language interface and direct manipulation to assist visual exploration and analysis of graph data. The aforementioned approaches are mainly implemented using rule-based language parsers, which provide limited support for free-form NL input. Instead of generating visualizations, Text-to-Viz [15] aims to generate infographics from natural language statements with proportion-related statistics. The authors sampled 800 valid proportion-related statements and built a machine learning model to parse utterances. Although promising, Text-to-Viz does not support queries for a broad range of analytic activities.
The performance and usage scenarios of V-NLIs highly depend on language parsers. Cutting-edge NLP models have reached close-to-human accuracy in various tasks, such as semantic parsing, text classification, paraphrase generation, etc. However, few have been applied to V-NLIs. We argue that the release of a high-quality textual dataset would assist the training and evaluation of NLP models in the domain of data analytics.

2.2 Datasets for Visualization Research
An increasing number of studies in the visualization community have employed supervised learning approaches in various scenarios such as mark type classification [38], reverse engineering [33], color extraction, etc. The capability of these approaches, however, highly relies on the availability of massive datasets that are used for training and evaluation [27]. Some datasets consisting of visual diagrams have been constructed and open-sourced for visualization research. For example, Savva et al. [38] compiled a dataset of over 2,500 chart images labeled by chart type. Similarly, the Massvis dataset [11] contains over 5,000 static chart images, of which over 2,000 are labeled with chart type information. Beagle [10] extracts more than 41,000 SVG-based visualizations from the web, all labeled by visualization type. Viziometrics [28] collects about 4.8 million images from scientific literature and classifies them into five categories, including equation, diagram, photo, plot, and table. Poco and Heer [33] compiled a chart corpus where each chart image is coupled with bounding boxes and transcribed text from the image. Instead of compiling a corpus containing chart images, Quda includes queries accompanied by analytic tasks for V-NLIs. VizNet, on the other hand, collects over 31 million real-world datasets that provide a common baseline for comparing visual designs. Though containing rich textual information, VizNet can hardly be applied to provide training samples for learning-based NLP models. In contrast, Quda is designed for helping V-NLIs understand utterances describing analytic tasks in information visualization.

2.3 Corpora for Multi-label Text Classification

            # Instances   # Categories   Cardinality   Density   Diversity
20 NG       19,300        20             1.029         0.051     0.003
Ohsumed     13,930        23             1.663         0.072     0.082
IMDB        120,900       28             2.000         0.071     0.037
TMC2007     28,600        22             2.158         0.098     0.047
Quda        14,035        10             1.331         0.133     0.054
Table 1. Compare Quda with four well-known multi-label text classification corpora on five criteria, including the number of instances, the number of categories, cardinality, density, and diversity.

Our corpus is primarily designed for helping V-NLIs classify queries into multiple analytic tasks. Therefore we survey related corpora for multi-label text classification, which is the task of assigning a piece of textual content to a subset of n possible categories, where the subset may contain anywhere from zero to n categories. The repository [6] lists a large number of datasets for multi-label text classification. We selectively introduce four datasets that are comparable to Quda. For example, the 20NG (20 Newsgroups) dataset [25] includes about 20,000 documents collected and categorized from 20 different newsgroups. Ohsumed [22] contains 13,930 medical abstracts belonging to 23 cardiovascular disease categories. The IMDB dataset contains 120,919 text summaries of movie plots, each labeled with one or more genres. The TMC2007 dataset contains 28,569 aviation safety reports that are associated with 22 problem types. Different from the aforementioned datasets, Quda is the first corpus targeting queries in V-NLIs.


Table 1 shows the statistical comparison between Quda and the aforementioned datasets. Inspired by prior work [29], we use five metrics to characterize each dataset. Besides the number of instances and categories, we also compute cardinality (i.e., the average number of categories for each instance), density (i.e., the cardinality divided by the number of categories), and diversity (i.e., the percentage of category sets present in the dataset divided by the number of possible category sets). Detailed definitions and calculations are described in [29]. We notice that Quda is comparable with the other corpora in most metrics. In particular, Quda has the highest density among all corpora, which means instances in Quda have a higher probability of being assigned multiple categories.
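For concreteness, the following is a minimal sketch of how these three metrics can be computed from a list of per-instance label sets. The helper name, the toy labels, and the interpretation of "possible category sets" as all 2^n subsets are our own illustrative assumptions, not part of the released corpus tooling.

```python
def corpus_stats(label_sets, n_categories):
    """Compute cardinality, density, and diversity for a multi-label corpus.

    label_sets: list of frozensets, one set of category labels per instance.
    n_categories: total number of distinct categories in the corpus.
    """
    n = len(label_sets)
    cardinality = sum(len(s) for s in label_sets) / n   # avg labels per instance
    density = cardinality / n_categories                # cardinality / #categories
    distinct_sets = len(set(label_sets))                # label combinations observed
    diversity = distinct_sets / (2 ** n_categories)     # observed / possible subsets (assumption)
    return cardinality, density, diversity

# Toy example with 3 categories and 4 queries (labels are illustrative).
labels = [frozenset({"Filter"}), frozenset({"Filter", "Find Extremum"}),
          frozenset({"Sort"}), frozenset({"Sort", "Find Extremum"})]
print(corpus_stats(labels, n_categories=3))
```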

3 PROBLEM DEFINITION
Analytic tasks play a pivotal role in visualization design and evaluation. The understanding of tasks is beneficial for a wide range of applications in the visualization community, such as visualization recommendation, computational linguistics for data analytics, etc. The goal of this paper is to propose a corpus in the domain of visual data analytics that facilitates the deployment of learning-based NLP techniques for V-NLIs. Specifically, our corpus focuses on queries that reflect how data analysts ask questions in V-NLIs. Due to the variation and ambiguity of natural language, the space of queries is intangibly large. To narrow down the scope of our corpus, we borrow ideas from prior work on visualization frameworks [7, 30, 35, 40] and V-NLIs for visualization [41, 43] and characterize Quda along six dimensions, i.e., abstraction level, composition level, perspective, type of data, type of task, and context-dependency.
Abstraction: Concrete. The abstraction level is one characteristic of analytic tasks describing the concreteness of a query [35]. Abstract queries are generic and are useful for generalizing the concept behind a concrete task (an example is "Find maximum"). These queries can be addressed in multiple ways based on the interpretation. However, to obtain a reasonable response from V-NLIs, an analyst should author a query that provides sufficient information [41], such as "Find top movies with most stars" and "Retrieve the country with the most population". Therefore, our corpus focuses on queries at a low abstraction level that express both tasks and values explicitly.
Composition: Low. The composition level describes the extent to which a query encompasses sub-queries [35]. Composition is a continuous scale from low-level to high-level. A query with a high composition level consists of multiple sub-queries, and a V-NLI may answer it in multiple steps. For example, for "For the movie with the most stars, find the distribution of salaries of the filming team," a V-NLI first needs to identify the movie with the most stars and then display the salary distribution of all staff in the filming team. As our corpus is the first attempt to collect queries for V-NLIs, we focus on queries that are low in composition level at this stage. We plan to include queries with high composition levels in future research.
Perspective: Objectives. Queries in V-NLIs can be classified into two categories, i.e., objectives and actions [35]. Objectives are queries raised by analysts seeking to satisfy a curiosity or solve a problem. Actions in V-NLIs are executable steps towards achieving objectives, and usually relate to interactive features of visualization artifacts, such as "Show the distribution using a bar chart," or "Map the popularity of applications to the x-axis." However, assuming that V-NLI users are knowledgeable in constructing effective visualizations is hardly justified. Instead of enumerating actions, we aim to collect queries that are objectives raised by data analysts.
Type of Data: Table. The type of dataset affects the syntax and semantics of queries. For example, analysts may seek to identify the shortest path between two nodes in a network dataset. However, such queries rarely occur for tabular datasets because links between items are not represented explicitly. Data and datasets can be categorized into five types [30], including tables, networks & trees, fields, geometry, and clusters & sets & lists.
At the time of paper submission, our corpus targets queries based on tabular datasets. Nevertheless, we argue that our corpus can support other types of data to some extent. For example, the task of "Find Nodes" in networks is similar to "Retrieve Value" in tabular data. We plan to extend our research to support other data types comprehensively in the future.
Type of Tasks: 10 Low-level Tasks. Collecting a corpus for all possible tasks is not feasible. Therefore, we turned to related studies to identify analytic tasks on which to focus. In this study, we adopt the taxonomy proposed by Amar et al. [7] that categorizes ten low-level analytic activities, such as Retrieve Value, Sort, etc. These tasks serve as a good starting point for our research despite not being comprehensive.
Context Dependency: Independent. The conversational nature of V-NLIs may result in queries that rely on contextual information [41, 43]. For example, a query "Find the best student in the class" may be followed by "Obtain the English score of the student." The second query appears incomplete on its own and is a follow-up to the first query. Such queries are referred to as contextual queries, which are not the focus of our research. Our corpus focuses on queries that have complete references to tasks or values associated with a task.

4 CONSTRUCTING QUDA
In this work, we incorporate both expert and crowd intelligence in data acquisition, which can be divided into three stages. In the first stage, we conduct an interview study involving 20 data analysts who author queries based on 36 data tables and 10 low-level tasks. We derive 920 expert queries from the study results. Some example queries are shown in Figure 1 (Expert Query Collection). In the second stage, we perform a large-scale crowd-sourced experiment to collect sentential paraphrases for each expert query, which are restatements of the original query with approximately the same meaning [45]. Example crowd-crafted queries are shown in Figure 1 (Paraphrase Generation). That is, for an expert sentence "what is the most liked video?", five paraphrases (a1–a5) are crafted by the crowd. Finally, we design a validation stage to estimate the semantic equivalence score for crowd-generated queries, and accept those with high scores. The third stage results in 13,115 paraphrases, where each expert sentence is accompanied by 14 paraphrases on average. An example is depicted in Figure 1 (Paraphrase Validation), where the five queries are re-ordered and the last one is filtered out due to low semantic similarity.

5 EMPLOYING EXPERT INTELLIGENCE
In this section, we describe expert interviews that aim to collect professional queries from data analysts, given analytic tasks and data tables in different domains.

5.1 Participants and Apparatus
To understand how data analysts raise questions from data tables, we recruit 20 individuals, 10 females and 10 males, with ages ranging from 24 to 31 years (μ = 25.45, σ = 2.80). In our study, we identify "data analysts" as people who are experienced in data mining or visual analytics and have at least one publication in related fields. Most participants (16/20) are postgraduate students majoring in Computer Science or Statistics, and the rest work as data specialists in IT companies. All of them are fluent in English and use it as their working language. The experiments are conducted on a laptop (2.8GHz 4-Core Intel Core i7, 16 GB memory) on which participants read documents and create queries.

5.2 Tasks and Data Tables
Tasks play a vital role in authoring queries. Our study begins with the taxonomy of 10 low-level visualization tasks [7], including Retrieve Value, Compute Derived Value, Find Anomalies, Correlate, etc. The pilot study shows that participants may be confused about some tasks. For example, one commented, "Determine Range is to find the maximum and minimum values in one data field, which is similar to Find Extremum." To help participants clarify the scope of each task, we compiled a document presenting each task from three aspects, i.e., general description, definition, and example sentences. The document is included in the supplementary material.
Data tables provide a rich context for participants to create queries. Hence, to diversify the semantics and syntax of queries, we prepared 36 real-world data tables covering 11 different domains, including health care, sports, entertainment, etc. Tables with insufficient data fields may not support some types of queries. For example, assume a table about basketball players has two columns, i.e., players' names and their nationalities. Participants may find it hard to author queries in the "Find Extremum" or "Correlate" categories, which usually require numeric fields. Hence, instead of skipping such tasks, we allow participants to revise tables by adding new columns or editing existing ones if necessary. Moreover, we selectively choose data tables that have rich background information and explanations for each column so that participants can get familiar with them in a short time. All data were collected from Kaggle2, Data World3, and Google Dataset Search4.

5.3 Methodology and Procedure
The interview began with a brief introduction to the purpose of the study and the rights of each participant. We then collected the demographic information of each participant, such as age, sex, and experience in data mining and visual analytics. After that, participants were asked to familiarize themselves with the 10 analytic tasks by reading the document. In the training stage, participants were asked to author queries for the 10 tasks based on a sample data table, which differs from those used in the main experiment. We instructed them to think aloud and resolved any difficulties they encountered. We encouraged participants to author queries with diverse syntactic structure and semantics. The pilot results indicate that participants were more enthusiastic about generating sentences with diverse syntax if they were interested in the context of the table. At the beginning of the main experiment, we presented 10 tables with a detailed explanation of the context and data fields to participants and encouraged them to choose two based on their interest. We randomized the presentation order of the 10 tasks. Participants were guided to author at least two queries given a table and a task, and no time limit was given to complete each query. To summarize, the target number of queries collected at this stage is

20 (participants) × 2 (tables) × 10 (tasks) × 2 (queries) = 800 (queries)

However, because some of the participants enjoyed the authoring experience and generated more sentences for some tasks, the total number of queries exceeded 800. After the main experiment, participants were asked to re-assign task labels to each query, because some queries naturally belong to multiple tasks. For example, the sentence "For applications developed in the US, which is the most popular one?" falls in both the "Filter" and "Find Extremum" categories. The identification of all task labels is important to reflect the characteristics of each query. Finally, a post-interview discussion was conducted to collect their feedback. Interviews were audio recorded and we took notes of the participants' comments for further analysis. Each interview lasted approximately two hours, and we sent a $15 gift card to interviewees for their participation.

2https://www.kaggle.com/datasets
3https://data.world/
4https://datasetsearch.research.google.com/


5.4 Results
The result of our interview study is a corpus containing 920 queries generated by data analysts. The length of queries ranges from 4 to 35 words, with an average of 11.9 words (σ = 3.9). We observed that participants found it enjoyable to write queries for these tasks. For example, one participant commented in the post-interview discussion, "Find Anomalies is an interesting task because it requires an in-depth understanding of the data table." She further added, "I asked 'Based on the relationship among rating, installs, and the number of reviews, which app's rating does not follow the trend?' because this kind of abnormal apps is worth investigating."

6 BORROWING CROWD INTELLIGENCE
A corpus of only 920 queries offers limited linguistic variety. To construct a high-quality corpus that reflects the ambiguity and complexity of human language, we extend our corpus by collecting paraphrases for expert sentences using crowd intelligence. This stage can be divided into two steps, i.e., paraphrase generation and validation. All experiments were conducted on MTurk.

6.1 Paraphrase Acquisition at Scale
In this step, we aim to acquire at least 20 paraphrases for each expert sentence. We followed the crowdsourcing pipeline of [12] to collect paraphrases using crowd intelligence. We developed a web-based system that the crowd can access through MTurk. Our system records both crafted queries and user behavior, e.g., keystrokes and duration, for further analysis. Following [34], we describe our task as "Rewrite the original text found below so that the rewritten version has the same meaning, but a completely different wording and phrasing" to help crowd workers understand our task. The interface also encourages the crowd to craft paraphrases that differ from the original one in terms of sentence structure. We demonstrate both valid and invalid examples to explain our requirements. The interface displays an expert's sentence and instructs workers to rewrite it. After a sentence is submitted, workers are allowed to author paraphrases for other expert sentences, skip sentences, or quit the interface by clicking "Finish". Finally, we use an open-ended question to collect workers' comments and suggestions for this job. We sent $0.10 as a bonus for each sentence after validation.
Based on prior work in corpus collection [12, 26] and the pilot experiments, we established a set of rules in the main experiment to ensure the quality of the collected sentences. First, to collect paraphrases that are diverse in structure, we limited the maximum number of sentences one worker could create (20 in our study) to involve more crowd intelligence. Second, to obtain high-quality sentences, workers were informed that the results would be reviewed and validated before they receive a bonus. Third, the interface recorded and analyzed their keystrokes and duration in crafting sentences to reject invalid inputs; for example, we rejected sentences crafted within 5 seconds or with fewer than 10 keystrokes. Fourth, to avoid duplicates with other workers, we compared each input with all sentences in the database.
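The following is a minimal sketch of how such automatic checks could be applied to a submission before it enters the corpus. The 5-second and 10-keystroke thresholds come from the rules above; the function name, its arguments, and the duplicate check by normalized string match are illustrative assumptions rather than the actual system's implementation.

```python
def accept_submission(sentence, duration_s, keystrokes, existing_sentences):
    """Apply the automatic checks described above before a paraphrase enters the corpus."""
    if duration_s < 5 or keystrokes < 10:       # too fast or too little typing
        return False
    normalized = " ".join(sentence.lower().split())
    if normalized in existing_sentences:        # duplicate of an already collected paraphrase
        return False
    return True

# Example usage with an (illustrative) set of previously collected sentences.
seen = {"what is the most liked video?"}
print(accept_submission("Which video has the most likes?",
                        duration_s=14, keystrokes=31, existing_sentences=seen))  # True
```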

6.2 Validation via Pairwise Comparison
The result of the first step is a corpus containing 18,400 crowd-generated paraphrases accompanied by 920 expert-generated sentences. The goal of this step is to filter out paraphrases that are semantically inequivalent to the corresponding expert sentence.
Due to the ambiguity of human language, asking workers directly whether a paraphrase is semantically equivalent to the expert sentence would be restrictive and unreliable. For example, given an expert's sentence, "How many faculties are there at Harvard University?", and a worker's, "How many departments are there at Harvard?", it is not clear whether the two sentences have the same meaning because "faculty" could mean both "department" and "professor". Instead of asking whether a paraphrase is semantically equivalent, we ask comparison questions to capture the equivalence strength of one paraphrase with respect to others. To be specific, given an expert sentence, a worker is shown a pair of paraphrases written by the crowd and asked to choose which one has a closer meaning to the expert's sentence. Pairwise comparison is effective in capturing semantic relationships [32]. However, exhaustive comparison, i.e., comparing each paraphrase to all other paraphrases using a large number of workers, is unnecessary and cost-prohibitive. Hence, borrowing the idea of relative attributes [32], we design a strategy to collect a pairwise comparison dataset for all paraphrases, and train scoring functions to estimate their semantic equivalence scores.

6.2.1 Collecting Pairwise Dataset. Each expert sentence is associated with 20 paraphrases, and thus we collect pairwise data among them. The total number of comparisons for each expert sentence is (m · n)/2, where m is the number of paraphrases and n is the number of pairwise comparisons per paraphrase. In our experiment, m = 20 and n = 5. To alleviate randomness, each comparison is completed by 5 unique workers, yielding (20 · 5)/2 · 5 = 250 comparison tasks for each expert sentence. Given 920 expert sentences, this produces 250 × 920 = 230,000 tasks completed by 6,304 unique workers.
We set up a web-based system to assist crowd workers in conducting comparison jobs. The system displays job descriptions, examples, and requirements. Each job consists of 10 comparison tasks based on distinct expert sentences, along with two gold-standard instances for quality control. We manually crafted a set of gold-standard tasks that are unambiguous and have a clear answer. We rejected any job in which a gold-standard task was answered incorrectly. To measure subtle differences in semantics, we used a two-alternative forced-choice design for each comparison, in which workers cannot answer "no difference". The workers were paid $0.15 per job.

6.2.2 Estimating Semantic Equivalence Score. In this step, we aim to estimate a semantic equivalence score for each paraphrase and filter out those with low scores. After collecting the set of pairwise comparison data, we employed relative attributes [32] to learn ranking functions that estimate the strength of the semantic equivalence of a paraphrase with respect to other paraphrases.
Given an expert sentence $e$, we have 20 training paraphrases $\{p_i\}$ represented by feature vectors $\{x_i\}$ in $\mathbb{R}^n$. We have collected a set of ordered pairs of paraphrases $O = \{(p_i, p_j)\}$ such that $(p_i, p_j) \in O \Rightarrow p_i \succ p_j$, i.e., paraphrase $i$ is closer to the expert sentence $e$ in semantics than $j$. Our goal was to learn a scoring function:

$s_e(x_i) = w^\top x_i$   (1)

such that the maximum number of the following constraints is satisfied:

$\forall (p_i, p_j) \in O : w^\top x_i > w^\top x_j$   (2)

To capture the semantic features of each paraphrase, we employed the Universal Sentence Encoder [13] and obtained a 512-dimensional feature vector $x_i$ for each paraphrase. We used an SVM-based solver [32] to approximate $w$, and obtained the estimated scores for the 20 paraphrases using Equation 1. We set a threshold $T$, so that paraphrases with scores lower than $T$ are filtered out. The scores range from −0.8 to 1.05, and we set $T = 0$ to reject all paraphrases with negative scores. Finally, we obtained 13,115 high-quality paraphrases for the 920 expert sentences.
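A rough sketch of this scoring step is shown below. It embeds paraphrases with the publicly released Universal Sentence Encoder on TensorFlow Hub and approximates the relative-attributes solver of [32] with a RankSVM-style stand-in (a linear SVM fit on pairwise difference vectors). The paraphrase list, the preference pairs, and the use of scikit-learn's LinearSVC are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.svm import LinearSVC

# Embed the paraphrases of one expert sentence into 512-d vectors.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
paraphrases = ["Which video is the most liked?", "What video has the most likes?",
               "This is the most like video"]                  # illustrative subset
X = np.asarray(embed(paraphrases))

# Ordered pairs (i, j) meaning paraphrase i was judged semantically closer than j.
pairs = [(0, 2), (1, 2)]

# RankSVM-style trick: treat x_i - x_j as a positive example and x_j - x_i as a
# negative one; the separating hyperplane's normal plays the role of w in Eq. (1).
diffs = np.vstack([X[i] - X[j] for i, j in pairs] + [X[j] - X[i] for i, j in pairs])
y = np.array([1] * len(pairs) + [-1] * len(pairs))
w = LinearSVC(fit_intercept=False).fit(diffs, y).coef_.ravel()

scores = X @ w                                                  # Eq. (1): s_e(x_i) = w^T x_i
keep = [p for p, s in zip(paraphrases, scores) if s > 0.0]      # threshold T = 0
print(keep)
```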


7 QUDA APPLICATIONS
We present three cases showing that Quda can benefit the construction of V-NLIs in various scenarios. The first case demonstrates the characteristics of Quda by training and benchmarking a well-performing multi-label classification model. We then present an example to showcase how Quda can improve an existing NLI toolkit in task inference. Finally, we explore how Quda can assist query recommendation in a V-NLI interface.

7.1 Multi-label Text Classification
Assigning all task labels to a query is critical for V-NLIs. Therefore, we conduct an experiment to evaluate how a well-performing multi-label classification model performs on our corpus, and reveal the characteristics of our corpus under different experimental settings.

7.1.1 Model. Recent research in pre-trained models [16, 46] has achieved state-of-the-art performance on many NLP tasks. These models are trained on large corpora, e.g., Wikipedia articles, and allow us to fine-tune them on downstream NLP tasks, such as multi-label text classification. Here, we conduct our experiment with a well-known pre-trained model, BERT, which obtained new state-of-the-art results on eleven NLP tasks when it was proposed [16]. We trained the model for 3 epochs, with a batch size of 2 and a maximum sequence length of 41, i.e., the maximum query length in our corpus. The learning rate was set to 3e-5, the same as in the original paper.
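A minimal fine-tuning sketch in the spirit of this setup is given below, using the Hugging Face transformers implementation of BERT with a multi-label classification head. The hyperparameters (3 epochs, batch size 2, maximum length 41, learning rate 3e-5) follow the text; the task ordering, the toy training example, and the data loading code are illustrative assumptions rather than the authors' exact pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizerFast, BertForSequenceClassification

TASKS = ["Retrieve Value", "Filter", "Compute Derived Value", "Find Extremum", "Sort",
         "Determine Range", "Characterize Distribution", "Find Anomalies", "Cluster", "Correlate"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(TASKS),
    problem_type="multi_label_classification")   # sigmoid output with BCE loss per label

# Illustrative training pair: a query and its multi-hot task labels (Filter + Find Extremum).
queries = ["For applications developed in the US, which is the most popular one?"]
labels = torch.tensor([[0, 1, 0, 1, 0, 0, 0, 0, 0, 0]], dtype=torch.float)

enc = tokenizer(queries, padding="max_length", truncation=True, max_length=41, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels), batch_size=2)

optim = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for _ in range(3):                                # 3 epochs, as stated in the text
    for input_ids, attention_mask, y in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```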

7.1.2 Dataset and Evaluation Metrics. Different data splits (train and test) may affect the performance of models. We split Quda in two ways. First, we randomly split it into train (11,228 samples) and test (2,807 samples) sets, and name this split Quda_random. Second, we split the dataset by experts (Quda_expert). That is, the 20 experts are split into a train set (16 experts) and a test set (4 experts), and all queries created by the same expert, together with their paraphrases, fall into the same set. We obtain 11,065 samples for training and 2,970 samples for testing. Following prior work [44], we employ precision and recall as evaluation metrics.
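One way to reproduce an expert-wise split of this kind is to group samples by the expert who authored the seed query, so that a seed query and all of its paraphrases land on the same side of the split. The sketch below uses scikit-learn's GroupShuffleSplit for this purpose; the example queries and expert ids are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

# Each sample carries the id of the expert who authored its seed query;
# paraphrases inherit the id of that seed query's author.
queries = ["Which video is the most liked?", "What video has the most likes?",
           "Rank the countries by population in 2000"]
expert_ids = [3, 3, 7]

# Hold out 20% of the experts (4 of 20 in the paper) for the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(queries, groups=expert_ids))
print(sorted(train_idx), sorted(test_idx))
```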

7.1.3 Results. The precision and recall of the model on Quda_random are 97.7% and 96.0%, respectively. The model trained on Quda_expert, on the other hand, only achieves 84.2% (precision) and 74.3% (recall). Since the train and test samples of Quda_expert are crafted based on different data tables and by different experts, the syntax space may differ. As a result, Quda_expert is more challenging. Though models trained on Quda_random perform better, the performance on Quda_expert is closer to real-world scenarios, where the data tables and users vary. This leaves room for developing new methods that are specifically designed for this task.
To understand how the model performs on Quda_expert across different analytic tasks, we investigate the per-class precision and recall. As shown in Table 2, the model performs well on Find Extremum, Sort, and Cluster. However, it performs poorly on Filter and Characterize Distribution. On some tasks, such as Compute Derived Value (precision: 0.82, recall: 0.51) and Retrieve Value (precision: 0.92, recall: 0.64), the model is high in precision but low in recall. This means the model returns fewer labels than the ground truth, but most of the predicted labels are correct. Understanding why the performance varies across analytic tasks is beyond the scope of this paper; we leave the problem for future research, which involves a deep understanding of deep learning models as well as the syntax space of different tasks.


Task                        Precision   Recall
Retrieve Value              0.92        0.64
Filter                      0.79        0.79
Compute Derived Value       0.82        0.51
Find Extremum               0.86        0.88
Sort                        0.94        0.81
Determine Range             0.90        0.70
Characterize Distribution   0.73        0.69
Find Anomalies              0.82        0.75
Cluster                     0.88        0.91
Correlate                   0.90        0.76
Table 2. The per-class precision and recall of BERT on Quda_expert.
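For reference, per-class precision and recall of this kind can be computed directly from multi-hot ground-truth and prediction matrices, e.g., with scikit-learn. The small matrices below are illustrative placeholders, not the actual model outputs.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Multi-hot ground truth and predictions for three queries over the ten tasks (illustrative).
y_true = np.array([[1,0,0,0,0,0,0,0,0,0],
                   [0,1,0,1,0,0,0,0,0,0],
                   [0,0,0,0,1,0,0,0,0,0]])
y_pred = np.array([[1,0,0,0,0,0,0,0,0,0],
                   [0,1,0,0,0,0,0,0,0,0],
                   [0,0,0,0,1,0,0,0,0,0]])

# average=None returns one precision/recall value per task column, as in Table 2.
print(precision_score(y_true, y_pred, average=None, zero_division=0))
print(recall_score(y_true, y_pred, average=None, zero_division=0))
```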

7.2 Improving an Existing V-NLI Toolkit
The multi-label classification model trained on Quda can benefit existing NLI toolkits in task inference for free-form natural language. We choose NL4DV [31] as an example, which is the first Python-based library supporting the prototyping of V-NLIs. NL4DV takes NL queries and a dataset as input, and returns an ordered list of Vega-Lite specifications that respond to these queries. Its query interpretation pipeline includes four modules: query parsing, attribute inference, task inference, and visualization generation. The task inference module of NL4DV focuses on the detection of 1) the task type, 2) the parameters associated with a task, and 3) ambiguities of operators and values. The toolkit currently supports five base tasks (i.e., Correlation, Distribution, Derived Value, Trend, Find Extremum) plus Filter. To infer tasks, the authors set rules based on task keywords, POS tags, the dependency tree, and data values. Although effective in some scenarios, the rule-based approach is hard to extend and limited for free-form natural language.
We aim to improve the precision and range of task type inference with the help of Quda. We first use BERT [16] trained on Quda_expert (details are introduced in Section 7.1) to detect task types. Our model supports ten low-level tasks and can assign multiple task labels to each query. Though our model cannot explicitly detect the Trend task supported by NL4DV, Trend is usually identified as Correlation or Distribution, and appropriate visualizations are recommended. For example, "Illustrate the distribution of the population of Ashley across the five years." is identified as Distribution and Filter. Second, we set up a set of rules to generate visualizations for all tasks. Similar to NL4DV, the ten tasks are divided into 9 base tasks plus Filter. For base tasks, we take into account the type of data attributes and the tasks to identify proper visualizations and visual encodings. For Filter and its associated attributes, we highlight information in the bar chart rather than filtering down to a few bars, following the design guideline in [20].
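A sketch of the learned task-inference step is shown below: the fine-tuned model from Section 7.1 scores a free-form query, a sigmoid converts the logits to per-task probabilities, and every task above a threshold is returned. The 0.5 threshold and the helper name are illustrative assumptions, and the wiring of these labels into the downstream visualization-generation rules is not shown.

```python
import torch

def infer_tasks(query, model, tokenizer, tasks, threshold=0.5):
    """Predict all analytic task labels for a free-form query (threshold is illustrative)."""
    model.eval()
    enc = tokenizer([query], padding="max_length", truncation=True,
                    max_length=41, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.sigmoid(logits)[0]            # one probability per analytic task
    return [t for t, p in zip(tasks, probs) if p >= threshold]

# e.g. infer_tasks("How does the budget distribute across different genres?",
#                  model, tokenizer, TASKS)  # expected: ["Characterize Distribution"]
```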

7.2.1 Case Study. We demonstrate that Quda can improve the robustness of NL4DV in task inference. We compare the results generated by NL4DV and the augmented library (named NL4DV_Quda) using a movie dataset. Admittedly, NL4DV can generate reasonable visualizations in most cases. However, it fails when tasks are not correctly detected. In the first example, we demonstrate that incorrect task inference leads to ineffective visualizations. Figure 2(a) shows the results for "How does the budget distribute across different genres?" NL4DV identifies the sentence as Derived Value according to the type of attributes, and calculates the mean budget for each genre. However, the mean budget cannot characterize the budget distribution for each genre. NL4DV fails to detect the query as Distribution because "distribute" is not included in its task keywords. NL4DV_Quda, on the other hand, detects the task as Distribution and responds to the query with a strip plot.

[Figure 2: four example queries answered by NL4DV and NL4DV_Quda — (a) How does the budget distribute across different genres? (b) How many drama movies are there in the table? (c) What is the range of budget? (d) Are there any outliers in the budget?]
Fig. 2. Four examples showcase that NL4DV_Quda improves NL4DV by identifying tasks with higher precision (a and b) and by supporting more analytic tasks (c and d).
We observe that the budget of action movies is distributed evenly between 20 million and 150 million, while most comedy movies have a budget of around 20 million.
The second example shows that NL4DV may miss tasks associated with a query. The query "How many Drama movies are there in this table?" belongs to Derived Value and Filter, because drama movies should be filtered from all genres and the amount is derived through a count operator. The attribute inference module of NL4DV detects only one data value (drama). Hence, NL4DV merely identifies the query as Filter and shows a bar chart with severe visual clutter (Figure 2(b)). On the other hand, NL4DV_Quda correctly recognizes both tasks. It shows a bar chart with a count operator to respond to the Derived Value task, and further highlights drama movies to respond to Filter.
Third, we illustrate that NL4DV_Quda supports a broader range of analytic tasks than NL4DV, including Determine Range, Retrieve Value, Sort, and Find Anomalies. Taking the query "what is the range of budget" as an example, NL4DV detects it as Distribution based on the type of data attribute, and shows a histogram of the production budget (Figure 2(c)). However, the binning operation hinders users from identifying the minimum and maximum values. Similarly, NL4DV detects the query "Are there any outliers in the budget" as Distribution and depicts a histogram by default (Figure 2(d)). The histogram, however, fails to reveal movies with an abnormal budget. NL4DV_Quda can correctly detect the tasks and present an appropriate visualization in each case. As shown in Figure 2(c) and (d), NL4DV_Quda shows boxplots with different parameter settings to demonstrate the range and the outliers, respectively.


To conclude, due to defects in task inference, NL4DV may produce ineffective visualizations. The model trained on Quda, on the other hand, is more robust in multi-label task classification. Further, the model supports more analytic tasks, which extends NL4DV to a broader range of usage scenarios.

7.3 Query Recommendation for V-NLIs
Some V-NLIs, such as Eviza [39], Articulate [42], and FlowSense [47], have embedded template-based auto-completion to improve the user experience [19]. That is, after a user query is partially completed, the interface provides hints for possible queries. Though useful, these techniques are tailored to specific applications and grammars, and can hardly be generalized to other V-NLIs. Query recommendation techniques [8, 9, 49] suggest a list of related queries based on previously issued queries; they are independent of applications and have wide usage scenarios. With the help of Quda, we aim to develop a query recommendation technique for V-NLIs that recommends semantically meaningful queries based on the user input, the data table, and Quda.

7.3.1 Technique Overview. We first present the data annotation step that adapts Quda to our query recommendation technique. Then, we introduce our technique, which comprises three modules: (1) candidate generation, (2) candidate refinement, and (3) the interactive interface.
In data annotation, we manually label 290 expert-generated queries for demonstration, and plan to annotate all 14,035 queries in Quda in future research. For each query, we mark phrases or words that are data attributes or data values. We further annotate the type of attributes and values as quantitative, nominal, ordinal, or temporal. For example, given the query "what is the birth distribution of the different districts in 2017", we annotate "birth" as a data attribute of type quantitative, "districts" as a data attribute of type nominal, and "2017" as a data value of type temporal.
The three modules of our technique interact with each other. Given a user input, the goal of the candidate generation module is to identify candidate queries among the 290 annotated queries that are semantically equivalent to the input. We use the Universal Sentence Encoder [13] to vectorize the 290 queries and the user input, and compute the cosine similarity between them. We then accept the top 5 candidate queries that are semantically "close" to the user input. Since candidate queries do not relate to the data table, the recommendation may not be meaningful for users. The candidate refinement module therefore refines the candidate queries by replacing data attributes (or values) in candidate queries with those in the data table. We set up two criteria for the replacement: 1) the type of data attributes (or values) should be the same, and 2) the semantics should be as similar as possible. Finally, we use the NL-driven Vega-Lite editor [31] as a testbed to showcase our query recommendation technique. The interface is a Vega-Lite editor [37] enhanced with NL4DV, which consists of an input bar, an editor for the Vega-Lite specification, a panel for visualization results, and a panel for design alternatives. When a user stops typing in the input bar, the input text is sent to the candidate generation module. After candidate queries are tuned by the candidate refinement module, the results are shown in a dropdown menu.
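A minimal sketch of the candidate generation module is given below: the annotated queries and the partial user input are embedded with the Universal Sentence Encoder and ranked by cosine similarity, keeping the top five. The three example queries are illustrative stand-ins for the 290 annotated queries, and the candidate refinement against the active data table is not shown.

```python
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# In practice these would be the 290 annotated Quda queries; three are shown for illustration.
annotated = ["What is the birth distribution of the different districts in 2017?",
             "Are there any outliers in the ratings?",
             "Is the release year correlated with the production budget?"]
A = np.asarray(embed(annotated))
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # unit-normalize for cosine similarity

def candidates(user_input, k=5):
    """Return the k annotated queries semantically closest to the (partial) user input."""
    v = np.asarray(embed([user_input]))[0]
    v = v / np.linalg.norm(v)
    order = np.argsort(-(A @ v))                   # cosine similarity, descending
    return [annotated[i] for i in order[:k]]

print(candidates("any outlier", k=2))
```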

7.3.2 Usage Scenario. Suppose Jane is a fan of Hollywood movies, and she is interested in analyzing a movie dataset containing 709 records with 10 fields, including Running Time, Title, Genre, etc. (Figure 3(a)). After selecting the Movies dataset, Jane wants to see how IMDB ratings are distributed, so she types "what is the distribution". A dropdown menu is then displayed showing the top 5 queries recommended by our technique (Figure 3(b)). The recommendation accelerates the input process, and Jane simply selects the third query for investigation. Next, Jane wants to find abnormal movies in the dataset. However, she has no idea where to start because there are multiple ways to define whether a movie is abnormal.


Fig. 3. Given a data table (a), our technique recommends queries for three user inputs (b, c, d).
She types "any outlier" and pauses. Our technique returns five queries to inspire her with possible definitions (Figure 3(c)). The first sentence catches her eye: IMDB rating is a commonly-used criterion to evaluate the quality of a movie. She then chooses the first sentence for detailed inspection. Finally, Jane wonders whether there are correlations between any data attributes, and types "correlation". The dropdown menu shows five possible results (Figure 3(d)). Two of them suggest a correlation between release year and production budget. Hence, Jane decides to start with the first query. Finally, she concludes, "The query recommendation not only accelerates my exploration process, but also gives me valuable suggestions when I do not exactly know what to see."

8 DISCUSSION
Quda is the first attempt to construct a large-scale corpus for V-NLIs, and we have demonstrated that Quda can benefit the construction of V-NLIs in various scenarios. Admittedly, some limitations exist in the current version of the corpus.
First, according to the six characteristics discussed in Section 3, our corpus covers a limited scope of queries. For example, Quda does not include contextual queries, which frequently appear in V-NLIs and systems with multimodal interaction [41]. Quda is a starting point for corpus collection in the domain of V-NLIs. We plan to construct more corpora to facilitate the research and development of V-NLIs.
Second, data tables have a strong influence on queries. Data tables from various domains diversify queries in semantics and syntax. Our corpus is constructed with 36 data tables from 11 domains. Though enumerating all data tables is infeasible, we should collect queries based on the most representative ones and cover as many domains as possible. In future research, we will explore the datasets used in the domain of visual data analytics, and extend Quda with more data tables.
Third, paraphrase validation ensures the quality of crowd-authored queries. However, our approach requires extensive crowd force for pairwise comparisons, which would not be scalable if the number of paraphrases increases. As Quda is a large-scale corpus with human validation, we plan to train a machine learning model on our corpus to distinguish paraphrases with low quality, which would alleviate the scalability issue.

9 CONCLUSION AND FUTURE WORK
In this work, we present a large-scale corpus, named Quda, to help V-NLIs categorize analytic tasks by training and benchmarking machine/deep learning techniques. Quda contains 14,035 queries, each labeled with some of 10 low-level analytic tasks. To construct Quda, we first collect 920 expert queries by recruiting 20 data analysts. Then, we employ crowd force to author 20 paraphrases for each expert query. Next, to ensure the quality of paraphrases, we employ the crowd to collect a pairwise comparison dataset and compute semantic equivalence scores using relative attributes [32]. Finally, paraphrases with low scores are filtered out. We present three experiments to illustrate the usefulness and significance of Quda in different usage scenarios.
In the future, we plan to explore more research opportunities based on Quda. First, we plan to continuously expand the scope of Quda in several aspects, such as 1) supporting abstract queries, 2) including queries with a high composition level, 3) including queries with actions, and 4) involving more data tables and experts. Second, we plan to improve the process of corpus construction to balance data quality and cost. The cost of expert query collection and paraphrase generation is reasonable. However, the process of paraphrase validation is expensive, and its cost increases quadratically with the number of paraphrases. With the help of Quda, we plan to train a classification model to investigate scalable approaches for quality control. Third, a promising direction for V-NLIs is the support of multi-turn conversations, which involves inferring the contextual information of a query. We plan to investigate how Quda can support multi-turn conversations.

REFERENCES
[1] Accessed February 3, 2020. IBM Watson Analytics. https://www.ibm.com/watson-analytics.
[2] Accessed February 3, 2020. Microsoft Power BI. https://powerbi.microsoft.com/.
[3] Accessed February 3, 2020. Tableau Software. http://www.tableausoftware.com/.
[4] Accessed February 3, 2020. Thoughtspot. http://www.thoughtspot.com/.
[5] Accessed February 3, 2020. Wolfram Alpha. http://www.wolframalpha.com/.
[6] Accessed Sept 3, 2020. Multi-Label Classification Dataset Repository. http://www.uco.es/kdis/mllresources/.
[7] R. Amar, J. Eagan, and J. Stasko. 2005. Low-level components of analytic activity in information visualization. In IEEE Symposium on Information Visualization (INFOVIS 2005). IEEE, 111–117. https://doi.org/10.1109/INFVIS.2005.1532136
[8] Aris Anagnostopoulos, Luca Becchetti, Carlos Castillo, and Aristides Gionis. 2010. An Optimization Framework for Query Recommendation. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (New York, New York, USA) (WSDM ’10). Association for Computing Machinery, New York, NY, USA, 161–170. https://doi.org/10.1145/1718487.1718508
[9] Ricardo Baeza-Yates, Carlos Hurtado, and Marcelo Mendoza. 2005. Query Recommendation Using Query Logs in Search Engines. In Current Trends in Database Technology - EDBT 2004 Workshops, Wolfgang Lindner, Marco Mesiti, Can Türker, Yannis Tzitzikas, and Athena I. Vakali (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 588–596.
[10] Leilani Battle, Peitong Duan, Zachery Miranda, Dana Mukusheva, Remco Chang, and Michael Stonebraker. 2018. Beagle: Automated Extraction and Interpretation of Visualizations from the Web. In Proceedings of the CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, Article 594, 8 pages. https://doi.org/10.1145/3173574.3174168
[11] Michelle A. Borkin, Azalea A. Vo, Zoya Bylinskii, Phillip Isola, Shashank Sunkavalli, Aude Oliva, and Hanspeter Pfister. 2013. What Makes a Visualization Memorable? IEEE Transactions on Visualization and Computer Graphics 19, 12 (Dec. 2013), 2306–2315. https://doi.org/10.1109/TVCG.2013.234
[12] Steven Burrows, Martin Potthast, and Benno Stein. 2013. Paraphrase Acquisition via Crowdsourcing and Machine Learning. ACM Trans. Intell. Syst. Technol. 4, 3, Article 43 (July 2013), 21 pages. https://doi.org/10.1145/2483669.2483676
[13] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. arXiv:1803.11175 [cs.CL]
[14] Kenneth Cox, Rebecca E. Grinter, Stacie L. Hibino, Lalita Jategaonkar Jagadeesan, and David Mantilla. 2001. A Multi-Modal Natural Language Interface to an Information Visualization Environment. International Journal of Speech Technology 4, 3/4 (July 2001), 297–314. https://doi.org/10.1023/A:1011368926479
[15] Weiwei Cui, Xiaoyu Zhang, Yun Wang, He Huang, Bei Chen, Lei Fang, Haidong Zhang, Jian-Guan Lou, and Dongmei Zhang. 2020. Text-to-Viz: Automatic Generation of Infographics from Proportion-Related Natural Language Statements. IEEE Transactions on Visualization and Computer Graphics 26, 1 (Jan 2020), 906–916. https://doi.org/10.1109/TVCG.2019.2934785
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[17] Kedar Dhamdhere, Kevin S. McCurley, Ralfi Nahmias, Mukund Sundararajan, and Qiqi Yan. 2017. Analyza: Exploring Data with Conversation. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (Limassol, Cyprus) (IUI ’17). Association for Computing Machinery, New York, NY, USA, 493–504. https://doi.org/10.1145/3025171.3025227
[18] Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S. Bernstein. 2018. Iris: A Conversational Agent for Complex Tasks. In Proceedings of the CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, Article 473, 12 pages. https://doi.org/10.1145/3173574.3174047
[19] Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, and Karrie G. Karahalios. 2015. DataTone: Managing Ambiguity in Natural Language Interfaces for Data Visualization. In Proceedings of the Annual Symposium on User Interface Software and Technology (Charlotte, NC, USA) (UIST ’15). Association for Computing Machinery, New York, NY, USA, 489–500. https://doi.org/10.1145/2807442.2807478
[20] M. Hearst, M. Tory, and V. Setlur. 2019. Toward Interface Defaults for Vague Modifiers in Natural Language Interfaces for Visual Analysis. In 2019 IEEE Visualization Conference (VIS). 21–25.
[21] Enamul Hoque, Vidya Setlur, Melanie Tory, and Isaac Dykeman. 2018. Applying Pragmatics Principles for Interaction with Visual Analytics. IEEE Transactions on Visualization and Computer Graphics 24, 1 (Jan 2018), 309–318. https://doi.org/10.1109/TVCG.2017.2744684
[22] Thorsten Joachims. 1998. Text categorization with Support Vector Machines: Learning with many relevant features. In Machine Learning: ECML-98, Claire Nédellec and Céline Rouveirol (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 137–142.
[23] Abhinav Kumar, Jillian Aurisano, Barbara Di Eugenio, Andrew Johnson, Alberto Gonzalez, and Jason Leigh. 2016. Towards a dialogue system that supports rich visualizations of data. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, Los Angeles, 304–309. https://doi.org/10.18653/v1/W16-3639
[24] Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A Continuously Growing Dataset of Sentential Paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1224–1234. https://doi.org/10.18653/v1/D17-1126
[25] K. Lang. 1995. NewsWeeder: Learning to Filter Netnews. In Proceedings of the Twelfth International Conference on Machine Learning (ICML 1995).
[26] Walter Stephen Lasecki, Ece Kamar, and Dan Bohus. 2013. Conversations in the crowd: Collecting data for task-oriented dialog learning. In First AAAI Conference on Human Computation and Crowdsourcing.
[27] B. Lee, K. Isaacs, D. A. Szafir, G. E. Marai, C. Turkay, M. Tory, S. Carpendale, and A. Endert. 2019. Broadening Intellectual Diversity in Visualization Research Papers. IEEE Computer Graphics and Applications 39, 4 (July 2019), 78–85. https://doi.org/10.1109/MCG.2019.2914844
[28] Po-Shen Lee, Jevin D. West, and Bill Howe. 2018. Viziometrics: Analyzing Visual Information in the Scientific Literature. IEEE Transactions on Big Data 4, 1 (March 2018), 117–129. https://doi.org/10.1109/TBDATA.2017.2689038
[29] Jose M. Moyano, Eva L. Gibaja, and Sebastián Ventura. 2017. MLDA: A tool for analyzing multi-label datasets. Knowledge-Based Systems 121 (2017), 1–3. https://doi.org/10.1016/j.knosys.2017.01.018
[30] Tamara Munzner. 2014. Visualization Analysis and Design. CRC Press.
[31] Arpit Narechania, Arjun Srinivasan, and John Stasko. 2020. NL4DV: A Toolkit for Generating Analytic Specifications for Data Visualization from Natural Language Queries. arXiv:2008.10723 [cs.HC]
[32] Devi Parikh and Kristen Grauman. 2011. Relative attributes. In International Conference on Computer Vision. IEEE, 503–510. https://doi.org/10.1109/ICCV.2011.6126281
[33] Jorge Poco and Jeffrey Heer. 2017. Reverse-Engineering Visualizations: Recovering Visual Encodings from Chart Images. Comput. Graph. Forum 36, 3 (June 2017), 353–363. https://doi.org/10.1111/cgf.13193
[34] Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An Evaluation Framework for Plagiarism Detection. In Proceedings of the International Conference on Computational Linguistics: Posters (Beijing, China) (COLING ’10). Association for Computational Linguistics, USA, 997–1005.
[35] Alexander Rind, Wolfgang Aigner, Markus Wagner, Silvia Miksch, and Tim Lammarsch. 2016. Task Cube: A three-dimensional conceptual space of user tasks in visualization design and evaluation. Information Visualization 15, 4 (July 2016), 288–300. https://doi.org/10.1177/1473871615621602
[36] Bahador Saket, Alex Endert, and Cagatay Demiralp. 2019. Task-Based Effectiveness of Basic Visualizations. IEEE Transactions on Visualization and Computer Graphics 25, 7 (July 2019), 2505–2512. https://doi.org/10.1109/TVCG.2018.2829750
[37] A. Satyanarayan, D. Moritz, K. Wongsuphasawat, and J. Heer. 2017. Vega-Lite: A Grammar of Interactive Graphics. IEEE Transactions on Visualization and Computer Graphics 23, 1 (Jan 2017), 341–350. https://doi.org/10.1109/TVCG.2016.2599030
[38] Manolis Savva, Nicholas Kong, Arti Chhajta, Li Fei-Fei, Maneesh Agrawala, and Jeffrey Heer. 2011. ReVision: Automated Classification, Analysis and Redesign of Chart Images. In Proceedings of the Annual Symposium on User Interface Software and Technology (Santa Barbara, California, USA) (UIST ’11). Association for Computing Machinery, New York, NY, USA, 393–402. https://doi.org/10.1145/2047196.2047247
[39] Vidya Setlur, Sarah E. Battersby, Melanie Tory, Rich Gossweiler, and Angel X. Chang. 2016. Eviza: A Natural Language Interface for Visual Analysis. In Proceedings of the Annual Symposium on User Interface Software and Technology (Tokyo, Japan) (UIST ’16). Association for Computing Machinery, New York, NY, USA, 365–377. https://doi.org/10.1145/2984511.2984588
[40] Arjun Srinivasan and John Stasko. 2017. Natural Language Interfaces for Data Analysis with Visualization: Considering What Has and Could Be Asked. In Proceedings of the Eurographics/IEEE VGTC Conference on Visualization: Short Papers (Barcelona, Spain) (EuroVis ’17). Eurographics Association, Goslar, DEU, 55–59. https://doi.org/10.2312/eurovisshort.20171133
[41] Arjun Srinivasan and John Stasko. 2018. Orko: Facilitating Multimodal Interaction for Visual Exploration and Analysis of Networks. IEEE Transactions on Visualization and Computer Graphics 24, 1 (Jan 2018), 511–521. https://doi.org/10.1109/TVCG.2017.2745219
[42] Yiwen Sun, Jason Leigh, Andrew Johnson, and Sangyoon Lee. 2010. Articulate: A Semi-Automated Model for Translating Natural Language Queries into Meaningful Visualizations. In Proceedings of the 10th International Conference on Smart Graphics (Banff, Canada) (SG ’10). Springer-Verlag, Berlin, Heidelberg, 184–195.
[43] Melanie Tory and Vidya Setlur. 2019. Do What I Mean, Not What I Say! Design Considerations for Supporting Intent and Context in Analytical Conversation. In IEEE Conference on Visual Analytics Science and Technology (VAST).
[44] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. 2016. CNN-RNN: A Unified Framework for Multi-label Image Classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2285–2294. https://doi.org/10.1109/CVPR.2016.251
[45] Su Wang, Rahul Gupta, Nancy Chang, and Jason Baldridge. 2018. A task in a suit and a tie: paraphrase generation with semantic augmentation. CoRR abs/1811.00119 (2018). arXiv:1811.00119 http://arxiv.org/abs/1811.00119
[46] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 5753–5763. http://papers.nips.cc/paper/8812-xlnet-generalized-autoregressive-pretraining-for-language-understanding
[47] Bowen Yu and Claudio T. Silva. 2020. FlowSense: A Natural Language Interface for Visual Data Exploration within a Dataflow System. IEEE Transactions on Visualization and Computer Graphics 26, 1 (Jan 2020), 1–11. https://doi.org/10.1109/TVCG.2019.2934668
[48] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-Level Convolutional Networks for Text Classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (Montreal, Canada) (NIPS ’15). MIT Press, Cambridge, MA, USA, 649–657.
[49] Zhiyong Zhang and Olfa Nasraoui. 2006. Mining Search Engine Query Logs for Query Recommendation. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland) (WWW ’06). Association for Computing Machinery, New York, NY, USA, 1039–1040. https://doi.org/10.1145/1135777.1136004