Spatial Analysis Clustering. Petteri Nurmi

Spatial Analysis Clustering Petteri Nurmi 28.3.2014 1

Questions What kind of preprocessing steps are useful for GPS measurements? What different classes of spatial clustering exist? What is the difference between partitioning algorithms and density-based clustering? What is a place? How places can be detected? 28.3.2014 2

Spatial Analysis Process of inspecting geographical data with the aim of extracting useful information Spatial data analysis process Preprocessing Cleaning the data, perform transformations (if needed) Analysis Exploratory: data is searched for models that describe it well without clear hypothesis Confirmatory: hypotheses about data are tested empirically Post-processing 28.3.2014 3

Measurement Noise Location measurements are inherently noisy Reference point geometry Atmospheric effects Multipath effects Measurement errors (clock or reference point errors) See Lecture IV Preprocessing attempts to reduce the effect of noise before data is being analyzed further Data cleaning: ensures good quality of measurements Check the validity of the data 28.3.2014 4

Preprocessing - GPS GPS requires at least 4 satellites for estimating position (4 unknowns: 3D position + time offset) GPS uncertainty affected by range error and satellite geometry Dilution of Precision gives an estimate of the influence of satellite geometry Horizontal Dilution of Precision (HDOP) most important for applications Cold/warm start can cause outliers in measurements 28.3.2014 5

Preprocessing GPS Example RAW GPS measurements 28.3.2014 6

Preprocessing GPS Example Points with satellites < 4 removed 28.3.2014 7

Preprocessing GPS Example Points with satellites < 4 and HDOP > 6.0 removed 28.3.2014 8

Preprocessing Removing Extreme Values 28.3.2014 9

Spatial Clustering Clustering refers to the process of grouping similar objects into classes Points within same cluster more similar to each other than to those in other clusters Spatial clustering refers to clustering that is applied on data with a geographical component Identifying similar geographical areas, e.g., in terms of crime rate or another statistic Merging of regions with similar weather patterns 28.3.2014 10

Spatial Clustering Four main categories of algorithms Partitioning methods (e.g., K-means, K-medoids) Hierarchical methods (e.g., BIRCH) Density-based methods (e.g., DBScan) Grid-based methods (e.g., CLIQUE) Optimal technique depends on various factors Application goal Trade-off between clustering quality and speed Characteristics and dimensionality of data Amount of noise in data 28.3.2014 11

Spatial Clustering - Partitioning Algorithms Partition data into k clusters so that total deviation of points from their cluster center is minimized Parameter k determines the number of clusters, given usually beforehand Various ways to measure total deviation: Squared distance (K-Means) Posterior of data (Gaussian Mixture Models) 28.3.2014 12

Partitioning Algorithms K-Means One of the best-known clustering algorithms Iterative relocation algorithm, optimizes squared loss m i corresponds to the center of a cluster, C i is the set of points allocated to cluster i Basic structure: Initialization: generate k cluster centers according to some criterion (e.g., random selection from data) During each iteration: Allocate each point to the cluster that is closest Revise cluster centers based on the points that are assigned to the cluster Repeat until no change in values 28.3.2014 13

K-Means Algorithm guaranteed to find a local optimum of the objective function (squared loss) Sensitive to the initial choice of cluster centers Clustering typically repeated multiple times with different initial values and solution with smallest total deviation used Initial values can be determined, e.g., using Random sampling Select fraction of data, perform clustering on that, use resulting clusters as initial values Data spectroscopy: analyze spectral characteristics of data values to determine a good initial guess 28.3.2014 14

K-Means - Example 15 15 Stopped after 2 iterations 10 10 5 5 0 0-5 -5-10 -10-15 -8-6 -4-2 0 2 4 6 8 10 12-15 -8-6 -4-2 0 2 4 6 8 10 12 28.3.2014 15

K-Means Example 28.3.2014 16

Partitioning Algorithms Probabilistic Clustering Generative: data assumed to be generated according to some model Parameters of the model unknown and need to be estimated from data Returns a probability distribution over the parameter values Two possible assignments of points to cluster Hard: each point belongs exactly to one cluster Soft: allow multiple (or all) clusters to contribute to the generation of the point 28.3.2014 17

Partitioning Algorithms Mixture Models Mixture Models provide a flexible and generic approach to probabilistic clustering Data generated by k random variables, each variable X i characterized by probability density function f i (θ i ) For each point i, a hidden and unobservable variable c i determines the cluster where i belongs to The clusters are called mixture components Probability of a point is a (convex) combination of the mixture component densities defines the weight or contribution of a component 28.3.2014 18

Partitioning Algorithms Gaussian Mixture Models Mixture model where mixture components are assumed to have a Gaussian distribution Mean μ i determines the center of the cluster Covariance matrix i determines shape of the cluster Assuming Euclidean distances: Shape is circle if variance of all dimensions is equal Shape is an ellipse aligned with coordinate axes when covariance matrix is diagonal Shape is a tilted ellipse when full covariance matrix used K-means can be understood as a Gaussian mixture model where variance is equal 28.3.2014 19

Partitioning Algorithms Gaussian Mixture Models Cluster parameters can be determined using the expectation maximization (EM) algorithm Iterative algorithm for finding optimal parameter values in models with latent (i.e., unobservable) variables Consists of two steps (E and M) which are iterated until solution converges Algorithm outline: Initialization: draw initial parameter values E-step: compute expectation of log-likelihood using current estimates M-step: compute parameters that maximize the expected log-likelihood computed in the E-step 28.3.2014 20

Partitioning Algorithms Infinite Mixture Models A generalization of mixture models where number of mixture models is assumed infinite (but countable) Example: Chinese restaurant process Customers arrive to a restaurant with an infinite number of circular tables, each having infinite capacity As new customer arrives (s)he selects the table to sit Either one of the partially occupied tables Or completely new table 28.3.2014 21

Partitioning Algorithms K-Medoids Partitioning algorithm that represents a cluster using the most centrally located measurement Instead of updating all centers during an iteration, typically updates only a single medoid How to determine the new medoid? How to evaluate effectiveness of clustering? Covered in more detail during Lecture VIII 28.3.2014 22

Density-Based Algorithms Class of algorithms that represent clusters as dense regions of objects In contrast to partitioning algorithms, can derive clusters of arbitrary shape Areas with low-density of objects are considered noise Basic concepts Epsilon neighborhood: collection of points that are within distance Eps from a point Dense neighborhood: Epsilon neighborhood that contains at least MinPts points 28.3.2014 23

Density-Based Algorithms Radius-Based Clustering Predecessor to density-based clustering Cluster all points with distance Eps of each other to the same cluster MinPts or some other criterion can be used to prune the resulting clusters 28.3.2014 24

Density-Based Algorithms DBScan A point that has at least MinPts within its Epsilon neighborhood is called a core object Object can only belong to a cluster if it is within the Epsilon neighborhood of at least one core object Core object o within Epsilon neighborhood of another core object p must belong to the same cluster as p Non-core object belonging to the Epsilon neighborhood of some core objects must belong to the same cluster as one of these core objects Non-core objects which do not belong to the Epsilon neighborhood of any core objects are noise 28.3.2014 25

Density-Based Algorithms DBScan Non-core object Core object Outlier / noise Core object Clusters A,B and C can be merged since they share a core object 28.3.2014 26

Density-Based Algorithms DBScan Algorithm that recursively merges Epsilon neighborhoods together to identify dense regions Let c be a core object, within the Epsilon neighborhood of c considered as seed points Cluster expanded with (previously unallocated) points that are within the Epsilon neighborhood of a seed point 28.3.2014 27

1.5 Density-Based Clustering Example DBScan: MinPts = 20, Eps = 0.102 1.5 DBScan: MinPts = 20, Eps = 0.124 15 DBScan: MinPts = 20, Eps = 0.077 1 0.5 0-0.5-1 -2-1 0 1 2 3 1 0.5 0-0.5-1 -1.5-2 -1 0 1 2 10 5 0-5 -10-15 -10-5 0 5 10 15 1 DBScan: MinPts = 20, Eps = 0.187 10 DBScan: MinPts = 15, Eps = 0.234 1 DBScan: MinPts = 20, Eps = 0.176 0.8 5 0.8 0.6 0 0.6 0.4 0.4 0.2-5 0.2 0 0 0.2 0.4 0.6 0.8 1-10 -10-5 0 5 10 0 0 0.2 0.4 0.6 0.8 1 28.3.2014 28

Density-Based Algorithms DJCluster Variant of DBScan where cluster expansion performed iteratively instead of recursively Better suited for large datasets Basic idea: Find Epsilon neighborhood of a point Assign all points within the neighborhood into cluster Check if cluster shares a core point with any of the previous clusters If so, clusters can be merged 28.3.2014 29

Notion of Place Location systems tend to provide information in coordinate form (absolute or relative) People refer to locations using semantic (or symbolic) descriptions Descriptions for the same place can vary between different people Place Representation of location that is consistent with the way people communicate location information 28.3.2014 30

Notion of Place Monastery Petra, Jordan Church Royal Tombs Hotel Treasury Ticket Office 28.3.2014 31

Notion of Place Definitions for place originate from the field of humanistic geography Roots in phenomenology and philosophy Especially philosophy of Martin Heidegger Places entities that relate physical locations with human experiences and meanings Relph: places physical locations that are linked with meanings and activities Tuan: places are spaces (i.e., physical locations) that are embodied with meanings 28.3.2014 32

Notion of Place The meanings attributed to places vary: Activities: swimming hall, movie theater, gym Social: friend s home, regular place to meet friends Generic: library, grocery store, train station Multiple meanings can be attributed to a place Relate to different activities (and times) at the place Places can be perceived as public or private Note: space can be public even if place is private! Depends on the activity, time of day etc. Influences preferences regarding location disclosure 28.3.2014 33

Why place matters? Personalized information delivery E.g., associate notes/to-do lists with places Select advertisements or other information to provide E.g., provide train or bus schedules Depends on stability of information and familiarity of place Awareness cue Places often a cue of activity and availability Automated status messages, e.g., in phone contact list Support user studies Differentiating meaningful situations in analysis phase 28.3.2014 34

Detecting places Locations correlate strongly with activities What are you doing? often answered with location during mobile phone calls People assign activity-related labels to places Places correlate with time Humans spend the majority of time in a few places Probability of labeling a place increases with time But traffic stops (traffic jams, traffic lights) seldom labeled èplaces can be detected from location traces Activity information can help (if available) 28.3.2014 35

Place Identification Place Identification = the process of detecting places from data A data analysis step with four steps Preparation: clean data, transform data Preprocessing: making data ready for analysis Analysis: performing the actual analysis Post-processing: refining the results Additionally a labeling step Assign semantics with the detected places Can take place before or after analysis 28.3.2014 36

Labeling Common choice is to prompt the user to label a place after it has been detected Alternative to label first and learn the places automatically based on the labels Some labels can be assigned automatically Geographic databases can be used to mine information about the type of building Time information can be used to identify home and workplace Different modalities: text, photo, photo + text 28.3.2014 37

Detecting Places Overview Most place detection algorithms operate on coordinate data Pruning: remove measurements that are unlikely to be meaningful Clustering: apply spatial clustering on the data Post-processing: determine which clusters are likely to correspond to meaningful places Spatial criteria: matching against Geo-databases, considering size of clusters etc. Temporal criteria: requiring a minimum stay duration 28.3.2014 38

Detecting Places Velocity Pruning Measurements where the user is moving are unlikely to correspond to significant places Velocity can be used to prune measurements and clustering applied on remaining data 28.3.2014 39

Place Detection Further Topics Coordinate algorithms unable to separate between different places within the same indoor space Radio fingerprinting based place detection uses stability of signal environment to detect places Current state-of-the-art in mobile phone based place detection Performance decreases in areas with limited signal environment Hybrid algorithms Combine coordinate-based techniques with radio fingerprinting based place detection 28.3.2014 40

Fingerprint-based Place Detection Basic idea is to compare similarity of fingerprint information over time If radio environment sufficiently similar, over a time window t, the user is assumed to be a in a place Many possible ways to measure similarity of RF environments Rank Correlation (NearMe) Extended Tanimoto (SensLoc) Normalized Euclidean distance 28.3.2014 41

Fingerprint-based Place Detection - Example Mac address: 1 2 A. -82-74 B. -84-79 C. -40-40 Consider the data on the left: ExtTanimoto(A,B) = (-82 * -84 + -74 * -79) / (82^2 + 74^2 + 84^2 + 79^2 - (-82 * -84 + -74 * -79)) = 0.9977 ExtTanimoto(A,C) = 0.68 A and B from same location with high probability, C likely from a different location If we get successive similar measurements for, e.g., 5 minutes or 10 minutes, we are assumed to be in a place 28.3.2014 42

Case Study: Zero Interaction Authentication (ZIA) B A Fingerprint similarity generic tool that has many other applications, as an example we consider ZIA Assume device B unlocks automatically whenever device A is in close proximity (zero user interaction) Car locks Token -based authentication for laptops / terminals Susceptible to relay attacks where another device pretends to be A If A and B compare their WiFi environments, the similarity of these environments can be used to resist against relay attacks 28.3.2014 43

Summary Spatial analysis refers to the process of inspecting geographical data Preprocessing: cleaning and preparing data for analysis Analysis: exploratory or confirmatory Post-processing: validating, pruning results Spatial clustering Grouping of similar (spatial) objects together Partitioning algorithms: divide data optimally to clusters Density-based algorithms: identify dense spatial regions 28.3.2014 44

Summary Place Representation of location that is consistent with the way people communicate location information Semantic / symbolic Place detection Process of identifying places from location measurements On coordinate data, can be solved using spatial clustering and temporal + spatial pruning 28.3.2014 45

Literature Ester, M.; Kriegel, H.-P.; Sander, J. & Xu, X., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), AAAI, 1996, 226-231 Sander, J.; Ester, M.; Kriegel, H.-P. & Xu, X., Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications, Data Mining and Knowledge Discovery, 1998, 2, 169-194 Zhou, C.; Frankowski, D.; Ludford, P.; Shekhar, S. & Terveen, L., Discovering Personally Meaningful Places: An Interactive Clustering Approach, ACM Transactions on Information Systems, 2007, 25, 12 Ashbrook, D. & Starner, T., Learning significant locations and predicting user movement with GPS, Proceedings of the 6th International Symposium on Wearable Computers (ISWC), IEEE, 2002, 101-108 Kang, J.; Welbourne, W.; Stewart, B. & Borriello, G., Extracting places from traces of locations, Proceedings of the 2nd ACM international workshop on Wireless mobile applications and services on WLAN hotspots (WMASH), ACM Press, 2004, 110-118 28.3.2014 46

Literature Liao, L.; Fox, D. & Kautz, H., Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields, International Journal of Robotics Research, 2007, 26, 119-134 Marmasse, N. & Schmandt, C., A user-centered location model, Personal and Ubiquitous Computing, 2002, 6, 318-321 Nurmi, P. & Bhattacharya, S., Identifying Meaningful Places: The Nonparametric Way, Proceedings of the 6th International Conference on Pervasive Computing (Pervasive), Springer, 2008, 5013, 111-127 Tuan, Y.-F., Space and Place: The Perspective of Experience, University of Minnesota Press, 2001 Relph, E., Place and Placelessness, Pion Books, 1976 Han, J.; Kambar, M. & Tung, A. K. H., Spatial Clustering Methods in Data Mining: A Survey, Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001 28.3.2014 47

Literature Kim, D. H.; Kim, Y.; Estrin, D. & Srivastava, M. B. SensLoc: sensing everyday places and paths using less energy, Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems (SenSys), ACM, 2010, 43-56 Hightower, J.; Consolvo, S.; LaMarca, A.; Smith, I. & Hughes, J. Learning and Recognizing the Places We Go, Proceedings of the 7th International Conference on Ubiquitous Computing (UBICOMP), Springer-Verlag, 2005, 3660, 159-176 Truong, H. T. T.; Gao, X.; Shrestha, B.; Saxena, N.; Asokan, N. & Nurmi, P. Comparing and Fusing Different Sensor Modalities for Relay Attack Resistance in Zero-Interaction Authentication, Proceedings of the 12th International Conference on Pervasive Computing and Communications (PerCom), 2014 28.3.2014 48