<<

Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations

Bo Kang1, Jefrey Lijffijt1, Raúl Santos-Rodríguez2, Tijl De Bie1,2 1 Ghent University, 2 University of Bristol

What is SICA? Case study: Images and lighting • Displays subjectively interesting structure in the data • Dataset: ,1684 gray-scale frontal images (32 x 32 pixels) of 31 human subjects under 64 lighting • By means of linear projections of the data conditions, compiled from the Extended Yale Face Subjectively interesting projection? Database B • Prior expectation: Images with same lighting are similar; • Dimensionality reduction research: consists of 64 cliques a. Prediction vs. exploration • Results: The top PCA eigenfaces reflect lighting conditions, while the top SICA ones reflect facial features b. Linear vs. non-linear Top Five Eigenfaces of PCA Top Five Eigenfaces of SICA c. Objective vs. subjective • Projections that are interesting to the end-user

How to measure? Case study: World economy • Given dataset with data points in -dimensional Euclidean • Dataset: , 44 years of GDP per capita of 110 countries space between year 1970 and 2013. Countries are categorized into seven • Model user’s belief state as background distribution: regions • Prior expectation: GDP value between adjacent years are unlikely to have drastic changes • Allow user to specify prior expectations: • Results: Projections on to first PCA and SICA components show a - Scale of data: similar and smooth increase over the years. The projection on to second SICA component shows more local fluctuations

Projection on to First PCA and SICA Component Projection on to Second PCA and SICA Component - Pairwise similarities defined by graph , pairs in are 2200 PCA PCA 2000 SICA 300 SICA expected to be similar: 1800 200 1600

1400 100

1200 2007-08 global 0 1000 financial crisis

800 -100

600 -200 400 1970s energy crisis -300 200 Linearly Combined GDP Value • In 1-dimensional case, find projection that maximize the Linearly Combined GDP Value 1970 1973 1976 1979 1982 1985 1988 1991 1994 1997 2000 2003 2006 2009 2012 1970 1973 1976 1979 1982 1985 1988 1991 1994 1997 2000 2003 2006 2009 2012 Year information content of projection pattern with Year

0.4 Per country weights given by PCA 2nd component respect to : • PCA second component distributes 0.2 0

-0.2 East Asia & Pacific Europe & Central Asia weights equally to different regions Weights -0.4 Latin America & Caribbean Middle East & North Africa North America -0.6 South Aisa Sub-Saharan Africa -0.8 USA Qatar Japan Iceland Norway

• SICA second component mainly gives Australia 0.4 Eq.Guinea where is a resolution parameter, e.g., smallest distance that is Per country weights given by SICA 2nd component 0.2 positive weights to European countries, 0 visually resolvable -0.2

Weights -0.4 and negative weights to Middle-Eastern -0.6 -0.8 Qatar Iceland Norway Australia • Multi-dimensional case can be extended straightforwardly countries Denmark Luxembourg Saudi Arabia Case study: Synthetic grid Case study: Spatial socio-economics • Dataset: , 20 data points on a rectangular grid. The 1st • Dataset: , age demographics of 412 districts (Landkreise) in attribute varies strongly along one diagonal direction of the grid. The 2nd . It contains five categories: Elder (age > 64), Old (between 45 attribute alternates between -1 and +1. Remaining attributes are and 64), Middle Aged (between 25 and 44), Young (between 18 and 24), standard Gaussian noise and Children (age < 18) (a)(a) (b)(b) (c)(c)

3 10 5 5 5 • Prior expectation: A smooth 8 2

6 4 4 4 • Prior expectation: Historically, population density and birth rate in

4 1

2 variance along the grid, encoded 3 3 3 0 0 eastern Germany are lower than the rest of the country. This information

2 2 -2 2 -1 by a graph connecting adjacent -4 -2 1 1 -6 1 corresponds to a graph constraint with two cliques

-8 -3 nodes (Fig. a) 1234 1234 1234 • Results: PCA confirms prior expectation (Fig. a). SICA instead highlights • Results: Top component of PCA (Fig. b) confirms user’s expectation, not the large cities, whose demographics are different from less urban areas: informative. Top SICA component reveals another underlying property still some contrast between East and West Germany remains (Fig. b) (Fig. c), complementing the prior belief (a)(a) (b)(b) 8 8 Case study: Synthetic social graph 6 6 Prignitz 4 Prignitz Cloppenburg L¨¹chow-Dannenberg st Vechta Vechta 4 • Dataset: , a social network of 100 people. The 1 attribute 2 Wittenberg Wittenberg BorkenMuenster Dessau-Rosslau Muenster Dessau-Rosslau Osterode am Harz 2 nd Mansfeld-Suedharz Oberspree Mansfeld-Suedharz separates the data into two communities. The 2 attribute uniformly 0 Goerlitz Burgenlandkreis Goerlitz Burgenlandkreis GeraAltenburger Land GeraAltenburger Land Saalfeld-RudolstadtGreiz Suhl Saalfeld-RudolstadtGreiz 0 assigns -1 and +1 to data, reflecting, e.g., sentiments towards topics. Suhl Vogtlandkreis -2 Vogtlandkreis MainzFrankfurtOffenbach am am Main Main MainzFrankfurt am Main Wuerzburg -2 -4 Heidelberg Remaining attributes are standard Gaussian noise Eichstaett -4 -6 (a) (b) (c) Tuebingen (a) (b) (c) FreisingErding Freising Munich 3 im Freiburg im Breisgau -6 1 -8

2 • Prior expectation: 0.5

1

0 People are alike if they 0

-0.5 • Interpret results by inspecting the elements of the first PCA and SICA -1

-1 are connected (Fig. a) -2 component

-3 -1.5 Elder Old Mid-Age Young Children • Results: PCA confirms our expectation (Fig. b). SICA finds alternative PCA 1st Component -0.61 -0.42 0.43 0.09 0.51 nd ‘community’ structure corresponding to 2 feature (Fig. b) SICA 1st Component -0.62 -0.32 0.69 0.19 0.06