Incorporating User's Preference Into Attributed Graph Clustering

Total Page:16

File Type:pdf, Size:1020Kb

Incorporating User's Preference Into Attributed Graph Clustering Incorporating User’s Preference into Attributed Graph Clustering Wei Ye1, Dominik Mautz2, Christian Böhm2, Ambuj Singh1,Claudia Plant3 1University of California, Santa Barbara 2Ludwig-Maximilians-Universität München, Munich, Germany 3University of Vienna, Vienna, Austria {weiye,ambuj}@cs.ucsb.edu {mautz,boehm}@dbs.ifi.lmu.de [email protected] ABSTRACT 1 INTRODUCTION Graph clustering has been studied extensively on both plain graphs Data can be collected from multiple sources and modeled as attrib- and attributed graphs. However, all these methods need to partition uted graphs (networks), in which vertices represent entities, edges the whole graph to find cluster structures. Sometimes, based on represent their relations and attributes describe their own char- domain knowledge, people may have information about a specific acteristics. For example, proteins in a protein-protein interaction target region in the graph and only want to find a single cluster network may be associated with gene expressions in addition to concentrated on this local region. Such a task is called local clus- their interaction relations; users in a social network may be asso- tering. In contrast to global clustering, local clustering aims to find ciated with individual attributes such as interests, residence and only one cluster that is concentrating on the given seed vertex demographics in addition to their friendship relations. (and also on the designated attributes for attributed graphs). Cur- One of the major data mining tasks in graphs (networks) is the rently, very few methods can deal with this kind of task. To this detection of clusters. Existing methods for cluster detection in at- end, we propose two quality measures for a local cluster: Graph tributed graphs can be divided into two categories, i.e., full space Unimodality (GU) and Attribute Unimodality (AU). The former attributed graph clustering methods [1, 52] and subspace attrib- measures the homogeneity of the graph structure while the latter uted graph clustering methods [6, 11, 47]. The methods belonging measures the homogeneity of the subspace that is composed of the to the first category treat all attributes equally important tothe designated attributes. We call their linear combination as Compact- graph structure, while the methods belonging to the second cate- ness. Further, we propose LOCLU to optimize the Compactness gory consider varying relevance of attributes to the graph structure. score. The local cluster detected by LOCLU concentrates on the All these methods need to partition the whole graph to find clus- region of interest, provides efficient information flow in the graph ter structures. However, based on domain knowledge, sometimes and exhibits a unimodal data distribution in the subspace of the people may have information about a specific target region in the designated attributes. graph and are only interested in finding a cluster surrounding this local region. Such a task is called local cluster detection, which KEYWORDS has aroused a great deal of attention in many applications, e.g., targeted ads, medicine, etc. Without considering scalability, one Local clustering, user’s preference, attributed graphs, dip test, uni- may think we can first use full space or subspace attributed graph modal, NCut, power iteration. clustering techniques and then return the cluster that contains the target region. However, it is hard to set the number of clusters in ACM Reference Format: real-world graphs. And the cluster content depends on the chosen Wei Ye1, Dominik Mautz2, Christian Böhm2, Ambuj Singh1,Claudia Plant3. number of clusters. 2020. Incorporating User’s Preference into Attributed Graph Clustering. In To deal with this task, several recent works [3, 20] use short arXiv:2003.11079v1 [cs.LG] 24 Mar 2020 Woodstock ’20: ACM Symposium on Neural Gaze Detection, June 03–05, 2020, random walks starting from the target region to find the local cluster. Woodstock, NY. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/ Also, some approaches [2, 18] focus on using the graph diffusion 1122445.1122456 methods to find the local cluster. However, these methods are only suitable for detecting local clusters in plain graphs whose vertices have no attributes. Recently, FocusCO [34] has been proposed to find a local cluster of interest to users in attributed graphs. Given Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed an examplar set, it first exploits a metric learning method to learn a for profit or commercial advantage and that copies bear this notice and the full citation projection vector that makes the vertex in the examplar set similar on the first page. Copyrights for components of this work owned by others than ACM to each other in the projected attribute subspace, then updates the must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a graph weight and finally performs the focused cluster extraction. fee. Request permissions from [email protected]. FocusCO cannot infer the projection vector if the examplar set has Woodstock ’20, June 03–05, 2020, Woodstock, NY only one vertex. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-XXXX-X/20/06...$15.00 https://doi.org/10.1145/1122445.1122456 Woodstock ’20, June 03–05, 2020, Woodstock, NY We Ye et al. In this paper, given user’s preference, i.e., the seed vertex and the • We propose Compactness, a new quality measure for clus- designated attributes, we develop a method that can automatically ters in attributed graphs. Compactness measures the ho- find the vertices that are similar to the given seed vertex. Thesimi- mogeneity (unimodality) of both the graph structure and larity is measured by the homogeneity both in the graph structure subspace that is composed of the designated attributes. and the subspace that is composed of the designated attributes. To • We propose LOCLU to optimize the Compactness score. this end, we first propose Compactness to measure the unimodal- • We demonstrate the effectiveness and efficiency of LOCLU ity1 of the clusters in attributed graphs. Compactness is composed by conducting experiments on both synthetic and real-world of two measures: Graph Unimodality (GU) and Attribute Unimodal- attributed graphs. ity (AU). GU measures the unimodality of the graph structure, and AU measures the unimodality of the subspace that is composed 2 PRELIMINARIES of the designated attributes. To consider both the graph structure and attributes, we first embed the graph structure into vector space. 2.1 Notation Then we consider the graph embedding vector as another desig- In this work, we use lower-case Roman letters (e.g. a;b) to denote nated attribute and apply the local clustering technique separately scalars. We denote vectors (column) by boldface lower case letters on each designated attribute. We call the procedure to find a local (e.g. x) and denote its i-th element by x¹iº. Matrices are denoted by cluster as LOCLU. boldface upper case letters (e.g. X). We denote entries in a matrix by Let us use a simple example to demonstrate our motivation. non-bold lower case letters, such as xij . Row i of matrix X is denoted Figure 1 shows an example social network, in which the vertices by X¹i; :º, column j by X¹:; jº. A set is denoted by calligraphic capital represent students in a middle school, the edges represent their letters (e.g. S). An undirected attributed graph is denoted by G = friendship relations, and the attributes associated to each vertex ¹V; E; Xº, where V is a set of graph vertices with number n = jVj are age (year), sport time (hour) per week, studying time (hour) per of vertices, E is a set of graph edges with number e = jEj of edges week and playing mobile game (M.G.) time (hour) per week. Given and X 2 Rn×d is a data matrix of attributes associated to vertices, vertex 4 and the designated attribute M.G., the task is to find a local where d is the number of attributes. An adjacency matrix of vertices n×n cluster around the vertex 4. (This task is of interest to mobile game is denoted by A 2 R with aij = 1¹i , jº and aij = 0¹i = jº. producers.) Conventional diffusion-based local clustering method The degree matrix D is a diagonal matrix associated with A with Í such as HK [18] finds a cluster C1 = f1; 3; 4; 5; 6; 7g. However, this dii = j aij . The random walk transition matrix W is defined as cluster is not homogeneous in the subspace of the M.G. attribute. D−1A. The Laplacian matrix is denoted as L = I − W, where I is the Compared with C1, the cluster C2 = f1; 2; 3; 4g is more local, which identity matrix. An attributed graph cluster is a subset of vertices is concentrated on the vertex 4 and the M.G. attribute. C ⊆ V with attributes. The indicator function is denoted by 1¹xº. Age: 12 2.2 The Dip Test Age: 17 Sport: 2 Age: 18 Sport: 0.5 Before introducing the concept of the dip test, let us first clarify the Study: 23 Sport: 0.8 definitions of unimodal distribution and multimodal distribution. In Study: 40 Age: 11 M.G.: 5.5 Study: 42 statistics, a unimodal distribution refers to a probability distribution M.G.: 0.9 Sport: 1.5 1 M.G.: 1 that only has a single mode (i.e. peak). If a probability distribution Study: 25 5 has multiple modes, it is called multimodal distribution. From the 7 M.G.: 5.5 behavior of the cumulative distribution function (CDF), unimodal 8 distribution can also be defined as: if the CDF is convex for x < m 2 3 4 Age: 18 and concave for x > m (m is the mode), then the distribution is Age: 13 6 Sport: 1 unimodal.
Recommended publications
  • How to Find out the IP Address of an Omron
    Communications Middleware/Network Browser How to find an Omron Controller’s IP address Valin Corporation | www.valin.com Overview • Many Omron PLC’s have Ethernet ports or Ethernet port options • The IP address for a PLC is usually changed by the programmer • Most customers do not mark the controller with IP address (label etc.) • Very difficult to communicate to the PLC over Ethernet if the IP address is unknown. Valin Corporation | www.valin.com Simple Ethernet Network Basics IP address is up to 12 digits (4 octets) Ex:192.168.1.1 For MOST PLC programming applications, the first 3 octets are the network address and the last is the node address. In above example 192.168.1 is network address, 1 is node address. For devices to communicate on a simple network: • Every device IP Network address must be the same. • Every device node number must be different. Device Laptop EX: Omron PLC 192.168.1.1 192.168.1.1 Device Laptop EX: Omron PLC 127.27.250.5 192.168.1.1 Device Laptop EX: Omron PLC 192.168.1.3 192.168.1.1 Valin Corporation | www.valin.com Omron Default IP Address • Most Omron Ethernet devices use one of the following IP addresses by default. Omron PLC 192.168.250.1 OR 192.168.1.1 Valin Corporation | www.valin.com PING Command • PING is a way to check if the device is connected (both virtually and physically) to the network. • Windows Command Prompt command. • PC must use the same network number as device (See previous) • Example: “ping 172.21.90.5” will test to see if a device with that IP address is connected to the PC.
    [Show full text]
  • Disk Clone Industrial
    Disk Clone Industrial USER MANUAL Ver. 1.0.0 Updated: 9 June 2020 | Contents | ii Contents Legal Statement............................................................................... 4 Introduction......................................................................................4 Cloning Data.................................................................................................................................... 4 Erasing Confidential Data..................................................................................................................5 Disk Clone Overview.......................................................................6 System Requirements....................................................................................................................... 7 Software Licensing........................................................................................................................... 7 Software Updates............................................................................................................................. 8 Getting Started.................................................................................9 Disk Clone Installation and Distribution.......................................................................................... 12 Launching and initial Configuration..................................................................................................12 Navigating Disk Clone.....................................................................................................................14
    [Show full text]
  • Problem Solving and Unix Tools
    Problem Solving and Unix Tools Command Shell versus Graphical User Interface • Ease of use • Interactive exploration • Scalability • Complexity • Repetition Example: Find all Tex files in a directory (and its subdirectories) that have not changed in the past 21 days. With an interactive file roller, it is easy to sort files by particular characteristics such as the file extension and the date. But this sorting does not apply to files within subdirectories of the current directory, and it is difficult to apply more than one sort criteria at a time. A command line interface allows us to construct a more complex search. In unix, we find the files we are after by executing the command, find /home/nolan/ -mtime +21 -name ’*.tex’ To find out more about a command you can read the online man pages man find or you can execute the command with the –help option. In this example, the standard output to the screen is piped into the more command which formats it to dispaly one screenful at a time. Hitting the space bar displays the next page of output, the return key displays the next line of output, and the ”q” key quits the display. find --help | more Construct Solution in Pieces • Solve a problem by breaking down into pieces and building back up • Typing vs automation • Error messages - experimentation 1 Example: Find all occurrences of a particular string in several files. The grep command searches the contents of files for a regular expression. In this case we search for the simple character string “/stat141/FINAL” in all files in the directory WebLog that begin with the filename “access”.
    [Show full text]
  • What Is UNIX? the Directory Structure Basic Commands Find
    What is UNIX? UNIX is an operating system like Windows on our computers. By operating system, we mean the suite of programs which make the computer work. It is a stable, multi-user, multi-tasking system for servers, desktops and laptops. The Directory Structure All the files are grouped together in the directory structure. The file-system is arranged in a hierarchical structure, like an inverted tree. The top of the hierarchy is traditionally called root (written as a slash / ) Basic commands When you first login, your current working directory is your home directory. In UNIX (.) means the current directory and (..) means the parent of the current directory. find command The find command is used to locate files on a Unix or Linux system. find will search any set of directories you specify for files that match the supplied search criteria. The syntax looks like this: find where-to-look criteria what-to-do All arguments to find are optional, and there are defaults for all parts. where-to-look defaults to . (that is, the current working directory), criteria defaults to none (that is, select all files), and what-to-do (known as the find action) defaults to ‑print (that is, display the names of found files to standard output). Examples: find . –name *.txt (finds all the files ending with txt in current directory and subdirectories) find . -mtime 1 (find all the files modified exact 1 day) find . -mtime -1 (find all the files modified less than 1 day) find . -mtime +1 (find all the files modified more than 1 day) find .
    [Show full text]
  • Linux Command Line Cheat Sheet by Davechild
    Linux Command Line Cheat Sheet by DaveChild Bash Commands ls Options Nano Shortcuts uname -a Show system and kernel -a Show all (including hidden) Files head -n1 /etc/issue Show distribution -R Recursive list Ctrl-R Read file mount Show mounted filesystems -r Reverse order Ctrl-O Save file date Show system date -t Sort by last modified Ctrl-X Close file uptime Show uptime -S Sort by file size Cut and Paste whoami Show your username -l Long listing format ALT-A Start marking text man command Show manual for command -1 One file per line CTRL-K Cut marked text or line -m Comma-separated output CTRL-U Paste text Bash Shortcuts -Q Quoted output Navigate File CTRL-c Stop current command ALT-/ End of file Search Files CTRL-z Sleep program CTRL-A Beginning of line CTRL-a Go to start of line grep pattern Search for pattern in files CTRL-E End of line files CTRL-e Go to end of line CTRL-C Show line number grep -i Case insensitive search CTRL-u Cut from start of line CTRL-_ Go to line number grep -r Recursive search CTRL-k Cut to end of line Search File grep -v Inverted search CTRL-r Search history CTRL-W Find find /dir/ - Find files starting with name in dir !! Repeat last command ALT-W Find next name name* !abc Run last command starting with abc CTRL-\ Search and replace find /dir/ -user Find files owned by name in dir !abc:p Print last command starting with abc name More nano info at: !$ Last argument of previous command find /dir/ - Find files modifed less than num http://www.nano-editor.org/docs.php !* All arguments of previous command mmin num minutes ago in dir Screen Shortcuts ^abc^123 Run previous command, replacing abc whereis Find binary / source / manual for with 123 command command screen Start a screen session.
    [Show full text]
  • Node and Edge Attributes
    Cytoscape User Manual Table of Contents Cytoscape User Manual ........................................................................................................ 3 Introduction ...................................................................................................................... 48 Development .............................................................................................................. 4 License ...................................................................................................................... 4 What’s New in 2.7 ....................................................................................................... 4 Please Cite Cytoscape! ................................................................................................. 7 Launching Cytoscape ........................................................................................................... 8 System requirements .................................................................................................... 8 Getting Started .......................................................................................................... 56 Quick Tour of Cytoscape ..................................................................................................... 12 The Menus ............................................................................................................... 15 Network Management ................................................................................................. 18 The
    [Show full text]
  • Networking TCP/IP Troubleshooting 7.1
    IBM IBM i Networking TCP/IP troubleshooting 7.1 IBM IBM i Networking TCP/IP troubleshooting 7.1 Note Before using this information and the product it supports, read the information in “Notices,” on page 79. This edition applies to IBM i 7.1 (product number 5770-SS1) and to all subsequent releases and modifications until otherwise indicated in new editions. This version does not run on all reduced instruction set computer (RISC) models nor does it run on CISC models. © Copyright IBM Corporation 1997, 2008. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents TCP/IP troubleshooting ........ 1 Server table ............ 34 PDF file for TCP/IP troubleshooting ...... 1 Checking jobs, job logs, and message logs .. 63 Troubleshooting tools and techniques ...... 1 Verifying that necessary jobs exist .... 64 Tools to verify your network structure ..... 1 Checking the job logs for error messages Netstat .............. 1 and other indication of problems .... 65 Using Netstat from a character-based Changing the message logging level on job interface ............. 2 descriptions and active jobs ...... 65 Using Netstat from System i Navigator .. 4 Other job considerations ....... 66 Ping ............... 7 Checking for active filter rules ...... 67 Using Ping from a character-based interface 7 Verifying system startup considerations for Using Ping from System i Navigator ... 10 networking ............ 68 Common error messages ....... 13 Starting subsystems ........ 68 PING parameters ......... 14 Starting TCP/IP .......... 68 Trace route ............ 14 Starting interfaces ......... 69 Using trace route from a character-based Starting servers .......... 69 interface ............ 15 Timing considerations ........ 70 Using trace route from System i Navigator 15 Varying on lines, controllers, and devices .
    [Show full text]
  • GNU Findutils Finding Files Version 4.8.0, 7 January 2021
    GNU Findutils Finding files version 4.8.0, 7 January 2021 by David MacKenzie and James Youngman This manual documents version 4.8.0 of the GNU utilities for finding files that match certain criteria and performing various operations on them. Copyright c 1994{2021 Free Software Foundation, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled \GNU Free Documentation License". i Table of Contents 1 Introduction ::::::::::::::::::::::::::::::::::::: 1 1.1 Scope :::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 1 1.2 Overview ::::::::::::::::::::::::::::::::::::::::::::::::::::::: 2 2 Finding Files ::::::::::::::::::::::::::::::::::::: 4 2.1 find Expressions ::::::::::::::::::::::::::::::::::::::::::::::: 4 2.2 Name :::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 4 2.2.1 Base Name Patterns ::::::::::::::::::::::::::::::::::::::: 5 2.2.2 Full Name Patterns :::::::::::::::::::::::::::::::::::::::: 5 2.2.3 Fast Full Name Search ::::::::::::::::::::::::::::::::::::: 7 2.2.4 Shell Pattern Matching :::::::::::::::::::::::::::::::::::: 8 2.3 Links ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 8 2.3.1 Symbolic Links :::::::::::::::::::::::::::::::::::::::::::: 8 2.3.2 Hard Links ::::::::::::::::::::::::::::::::::::::::::::::: 10 2.4 Time
    [Show full text]
  • Partition.Pdf
    Linux Partition HOWTO Anthony Lissot Revision History Revision 3.5 26 Dec 2005 reorganized document page ordering. added page on setting up swap space. added page of partition labels. updated max swap size values in section 4. added instructions on making ext2/3 file systems. broken links identified by Richard Calmbach are fixed. created an XML version. Revision 3.4.4 08 March 2004 synchronized SGML version with HTML version. Updated lilo placement and swap size discussion. Revision 3.3 04 April 2003 synchronized SGML and HTML versions Revision 3.3 10 July 2001 Corrected Section 6, calculation of cylinder numbers Revision 3.2 1 September 2000 Dan Scott provides sgml conversion 2 Oct. 2000. Rewrote Introduction. Rewrote discussion on device names in Logical Devices. Reorganized Partition Types. Edited Partition Requirements. Added Recovering a deleted partition table. Revision 3.1 12 June 2000 Corrected swap size limitation in Partition Requirements, updated various links in Introduction, added submitted example in How to Partition with fdisk, added file system discussion in Partition Requirements. Revision 3.0 1 May 2000 First revision by Anthony Lissot based on Linux Partition HOWTO by Kristian Koehntopp. Revision 2.4 3 November 1997 Last revision by Kristian Koehntopp. This Linux Mini−HOWTO teaches you how to plan and create partitions on IDE and SCSI hard drives. It discusses partitioning terminology and considers size and location issues. Use of the fdisk partitioning utility for creating and recovering of partition tables is covered. The most recent version of this document is here. The Turkish translation is here. Linux Partition HOWTO Table of Contents 1.
    [Show full text]
  • Partitioning Disks with Parted
    Partitioning Disks with parted Author: Yogesh Babar Technical Reviewer: Chris Negus 10/6/2017 Storage devices in Linux (such as hard drives and USB drives) need to be structured in some way before use. In most cases, large storage devices are divided into separate sections, which in Linux are referred to as partitions. A popular tool for creating, removing and otherwise manipulating disk partitions in Linux is the parted command. Procedures in this tech brief step you through different ways of using the parted command to work with Linux partitions. UNDERSTANDING PARTED The parted command is particularly useful with large disk devices and many disk partitions. Differences between parted and the more common fdisk and cfdisk commands include: • GPT Format: The parted command can create can be used to create Globally Unique Identifiers Partition Tables (GPT), while fdisk and cfdisk are limited to msdos partition tables. • Larger disks: An msdos partition table can only format up to 2TB of disk space (although up to 16TB is possible in some cases). A GPT partition table, however, have the potential to address up to 8 zebibytes of space. • More partitions: Using primary and extended partitions, msdos partition tables allow only 16 partitions. With GPT, you get up to 128 partitions by default and can choose to have many more. • Reliability: Only one copy of the partition table is stored in an msdos partition. GPT keeps two copies of the partition table (at the beginning and end of the disk). The GPT also uses a CRC checksum to check the partition table's integrity (which is not done with msdos partitions).
    [Show full text]
  • From DOS/Windows to Linux HOWTO from DOS/Windows to Linux HOWTO
    From DOS/Windows to Linux HOWTO From DOS/Windows to Linux HOWTO Table of Contents From DOS/Windows to Linux HOWTO ........................................................................................................1 By Guido Gonzato, ggonza at tin.it.........................................................................................................1 1.Introduction...........................................................................................................................................1 2.For the Impatient...................................................................................................................................1 3.Meet bash..............................................................................................................................................1 4.Files and Programs................................................................................................................................1 5.Using Directories .................................................................................................................................2 6.Floppies, Hard Disks, and the Like ......................................................................................................2 7.What About Windows?.........................................................................................................................2 8.Tailoring the System.............................................................................................................................2 9.Networking:
    [Show full text]
  • (Router) Address Open the DOS Command Prompt Window As Follows
    1 – Find the Gateway (Router) Address Open the DOS Command Prompt window as follows - Windows XP Click the Start icon in the bottom left corner of the screen and then select Run... The Run dialog box will open. Type cmd in the box and press Enter or click OK. The DOS Command Prompt window will open. Windows Vista / Windows 7 Click on the Windows icon in the bottom left corner of the screen. Type cmd into the search box and press Enter. The DOS Command Prompt window will open. Windows 8 Push the Windows button on your keyboard or click the Windows icon in the bottom left corner of the screen if there is one visible. Type cmd once you see the Start screen and press Enter. The DOS Command Prompt window will open. In the DOS Command Prompt window, type ipconfig and then press Enter. You will see information relating to the computer’s network settings. The IPv4 Address is the local IP address of the computer you are using. The Gateway address is the router’s address. The DVR's IP address needs to be in the same range as these – i.e. the first three numbers need to be the same, but the last one different. If ipconfig returned a Gateway address of 10.10.0.1, the DVR would have an address of 10.10.0.xxx where xxx is between 2 and 254. In the above screenshot, the PC's IP address is 192.168.1.77, the Subnet Mask is 255.255.255.0 and the Gateway is 192.168.1.254.
    [Show full text]