Spatial Analysis Clustering. Petteri Nurmi


Spatial Analysis: Clustering
Petteri Nurmi, 24.11.2016

Questions
- How can GPS measurements be preprocessed?
- What different classes of spatial clustering exist?
- What is the difference between partitioning algorithms and density-based clustering?
- What is a place?
- How can places be detected?

Spatial Analysis
- Process of inspecting geographical data with the aim of extracting useful/meaningful information
- Spatial data analysis process:
  - Preprocessing: cleaning the data and performing transformations (if needed)
  - Analysis
    - Exploratory: the data is searched for models that describe it well, without a clear hypothesis
    - Confirmatory: hypotheses about the data are tested empirically
  - Post-processing
    - Cleaning noise in the identified patterns
    - Determining which of the detected patterns are meaningful

Measurements: Sampling
- Three main ways to collect measurements (referred to as sampling):
  - Periodic: every x seconds
    - Helps to save battery and reduce storage requirements
    - E.g., car and public transportation measurements are typically collected every 10 minutes (or even less often)
  - Distance-based: every x meters/miles
  - Continuous: as fast as possible
    - Depends on the location system
    - Typically around 1 Hz with most systems
- What happens between samples?

Interpolation
- Method of constructing new data points within the range of a discrete set of known points
- Assumes that two points (x0, y0) and (x1, y1) (i.e., time and sensor value) are given
- Linear interpolation: y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
  - Effectively a weighted average where the weights depend on the distance from the known values
- Spline interpolation
  - Intervals between two points are modelled with low-order polynomials
  - Polynomial pieces for the intervals are selected so that they fit smoothly with each other
- Needed for ensuring a consistent spatial and/or temporal sampling rate
  - Spatial interpolation: ensure measurements exist for every x meters
  - Temporal interpolation: ensure measurements exist for every t seconds
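The temporal case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resample_linear`, the `(t, value)` tuple layout and the requirement of strictly increasing timestamps are not from the slides.

```python
def resample_linear(samples, step):
    """Linearly interpolate (t, value) samples onto a regular time grid.

    Assumes at least two samples with strictly increasing timestamps
    after sorting (hypothetical input format, not from the slides).
    """
    samples = sorted(samples)
    if len(samples) < 2:
        return list(samples)
    start, end = samples[0][0], samples[-1][0]
    out = []
    i = 0   # index of the left end of the current segment
    n = 0   # index of the current grid point
    while start + n * step <= end:
        t = start + n * step
        # Advance to the segment [t0, t1] that contains t.
        while samples[i + 1][0] < t:
            i += 1
        (t0, y0), (t1, y1) = samples[i], samples[i + 1]
        w = (t - t0) / (t1 - t0)            # 0 at t0, 1 at t1
        out.append((t, (1 - w) * y0 + w * y1))  # weighted average
        n += 1
    return out
```

For a location trace, the same function could be applied separately to the latitude and longitude series, for example to bring an irregular trace to a 10 s interval.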

Interpolation - Example
[Figure: trace with a varying sample rate (37 to 122 seconds), interpolated to a 10 s interval]
Note: map matching (covered later) can be used to ensure that the interpolated measurements obey physical constraints

Measurements: Noise
- Location measurements are inherently noisy
  - Reference point geometry
  - Atmospheric effects
  - Multipath effects
  - Measurement errors (clock or reference point errors)
- Preprocessing attempts to reduce noise before the data is analyzed further
- Data cleaning: ensure the quality of the measurements
  - Check the validity of the data

Preprocessing - GPS
- GPS requires at least 4 satellites for estimating position (4 unknowns: 3D position + time offset)
- GPS uncertainty is affected by range error and satellite geometry
  - Dilution of Precision gives an estimate of the influence of satellite geometry
  - Horizontal Dilution of Precision (HDOP) is the most important for applications
- A cold/warm start can cause outliers in the measurements

Preprocessing - GPS Example
[Figure: raw GPS measurements]

Preprocessing - GPS Example
[Figure: points with fewer than 4 satellites removed]

Preprocessing - GPS Example
[Figure: points with fewer than 4 satellites or HDOP > 6.0 removed]
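A minimal sketch of this filtering step, using the thresholds from the example (at least 4 satellites, HDOP at most 6.0). The `GpsFix` record and its field names are hypothetical; a real receiver or log format will expose these values differently.

```python
from dataclasses import dataclass


@dataclass
class GpsFix:
    """One GPS measurement (hypothetical record layout)."""
    lat: float
    lon: float
    satellites: int   # number of satellites used for the fix
    hdop: float       # horizontal dilution of precision


def filter_fixes(fixes, min_satellites=4, max_hdop=6.0):
    """Keep only fixes with enough satellites and acceptable geometry."""
    return [
        f for f in fixes
        if f.satellites >= min_satellites and f.hdop <= max_hdop
    ]
```

The thresholds are parameters because suitable values depend on the receiver and on how much data loss the application tolerates.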

Preprocessing - Removing Extreme Values

Preprocessing - Other Location Techniques
- Similar preprocessing techniques are required for other location systems
  - GPS is slightly special, as parameters provided by GPS can be used to estimate the magnitude of errors
  - For most other techniques, errors need to be detected automatically
- Extreme value detection and interpolation are beneficial for any location system
  - A simple way to detect extreme values is to calculate the speed between successive measurements and to remove those that would require excessive speed
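The speed check can be sketched as follows. The `(t, lat, lon)` tuple layout, the 50 m/s default threshold and the choice to trust the first sample are assumptions made for this illustration, not values from the slides.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def remove_speed_outliers(track, max_speed_mps=50.0):
    """Drop samples that would require an excessive speed to reach.

    track: list of (t_seconds, lat, lon) sorted by time. The first sample
    is trusted (an assumption); each later sample is compared against the
    most recent sample that was kept.
    """
    if not track:
        return []
    kept = [track[0]]
    for t, lat, lon in track[1:]:
        t_prev, lat_prev, lon_prev = kept[-1]
        dt = t - t_prev
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp
        speed = haversine_m(lat_prev, lon_prev, lat, lon) / dt
        if speed <= max_speed_mps:
            kept.append((t, lat, lon))
    return kept
```

Comparing against the last kept sample, rather than the immediately preceding one, prevents a single outlier from also invalidating the good sample that follows it.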

Preprocessing - Example
[Figure: data from indoor localization (retail); two potential error areas can be observed]

Spatial Clustering
- Clustering refers to the process of grouping similar objects into classes
  - Points within the same cluster are more similar to each other than to those in other clusters
- Spatial clustering refers to clustering that is applied to data with a geographical component
  - Identifying similar geographical areas, e.g., in terms of crime rate or another statistic
  - Merging regions with similar weather patterns

Spatial Clustering
- Four main categories of algorithms
  - Partitioning methods (e.g., K-means, K-medoids)
  - Hierarchical methods (e.g., BIRCH)
  - Density-based methods (e.g., DBScan)
  - Grid-based methods (e.g., CLIQUE)
- The optimal technique depends on various factors
  - Application goal
  - Trade-off between clustering quality and speed
  - Characteristics and dimensionality of the data
  - Amount of noise in the data

Spatial Clustering - Partitioning Algorithms
- Partition the data into k clusters so that the total deviation of points from their cluster center is minimized
- The parameter k determines the number of clusters and is usually given beforehand
- Various ways to measure total deviation:
  - Squared distance (K-Means)
  - Posterior of the data (Gaussian Mixture Models)

Partitioning Algorithms - K-Means
- One of the best-known clustering algorithms
- Iterative relocation algorithm that optimizes the squared loss: J = sum_{i=1..k} sum_{x in C_i} ||x - m_i||^2
  - m_i corresponds to the center of cluster i, and C_i is the set of points allocated to cluster i
- Basic structure:
  - Initialization: generate k cluster centers according to some criterion (e.g., random selection from the data)
  - During each iteration:
    - Allocate each point to the cluster whose center is closest
    - Revise the cluster centers based on the points that are assigned to each cluster
  - Repeat until the assignments no longer change
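The basic structure above can be sketched in plain Python. This is a minimal illustration: the function names, the random-selection initialization with a fixed seed, and the handling of empty clusters (the old center is kept) are assumptions of this sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples.

    Returns (centers, assignment, cost), where cost is the sum of squared
    distances of the points to their assigned cluster centers.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # initialization: random data points
    assignment = None
    for _ in range(max_iter):
        # Allocation step: each point goes to the closest center.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break                        # no change: converged
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:                  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, cost
```

Because the result depends on the initial centers, the function would typically be called with several different `seed` values and the run with the smallest `cost` kept.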

K-Means
- The algorithm is guaranteed to find a local optimum of the objective function (squared loss)
- Sensitive to the initial choice of cluster centers
  - Clustering is typically repeated multiple times with different initial values, and the solution with the smallest total deviation is used
- Initial values can be determined, e.g., using
  - Random sampling
  - Selecting a fraction of the data, performing clustering on that, and using the resulting clusters as initial values
  - Data spectroscopy: analyze spectral characteristics of the data values to determine a good initial guess

K-Means - Example

K-Means - Determining k
- The most common method is to examine changes in the objective function as a function of k
  - Cluster with different values of k and select the one that optimizes a selection metric
- KL index: measures the relative change between two successive values of k
  - Cost refers to the objective function, in the case of k-means the sum of squares
- Scree plot: plot the error as a function of k and select the knee or dip point
  - The point where there is a clear change in the error
  - Not guaranteed to exist, and often chosen heuristically based on visual inspection

Partitioning Algorithms - Probabilistic Clustering
- Generative: the data is assumed to be generated according to some model
  - The parameters of the model are unknown and need to be estimated from the data
  - Returns a probability distribution over the parameter values
- Two possible ways to assign points to clusters
  - Hard: each point belongs to exactly one cluster
  - Soft: multiple (or all) clusters are allowed to contribute to the generation of the point

Partitioning Algorithms - Mixture Models
- Mixture models provide a flexible and generic approach to probabilistic clustering
- The data is generated by k random variables, each variable X_i characterized by a probability density function f_i(θ_i)
- For each point i, a hidden and unobservable variable c_i determines the cluster to which i belongs
- The clusters are called mixture components
- The probability of a point is a (convex) combination of the mixture component densities
  - Each component has a mixing weight that defines its weight or contribution; the weights are non-negative and sum to one

Partitioning Algorithms - Gaussian Mixture Models
- Mixture model where the mixture components are assumed to have a Gaussian distribution
  - The mean μ_i determines the center of the cluster
  - The covariance matrix Σ_i determines the shape of the cluster
- Assuming Euclidean distances:
  - The shape is a circle if the variance of all dimensions is equal
  - The shape is an ellipse aligned with the coordinate axes when the covariance matrix is diagonal
  - The shape is a tilted ellipse when a full covariance matrix is used
- K-means can be understood as a Gaussian mixture model where the variance is equal

Partitioning Algorithms - Gaussian Mixture Models
- Cluster parameters can be determined using the expectation maximization (EM) algorithm
  - Iterative algorithm for finding optimal parameter values in models with latent (i.e., unobservable) variables
  - Consists of two steps (E and M), which are iterated until the solution converges
- Algorithm outline:
  - Initialization: draw initial parameter values
  - E-step: compute the expectation of the log-likelihood using the current estimates
  - M-step: compute the parameters that maximize the expected log-likelihood computed in the E-step
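The outline can be illustrated with a one-dimensional Gaussian mixture. This is a sketch under stated assumptions: the data is one-dimensional, the initial means are taken deterministically from evenly spaced quantiles rather than drawn at random, a fixed number of iterations replaces a convergence test, and variances are floored at a small value to avoid degenerate components. None of these choices come from the slides.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances).
    """
    n = len(data)
    ordered = sorted(data)
    # Initialization (assumption): means at evenly spaced quantiles,
    # all variances equal to the overall variance, uniform weights.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)   # floor to avoid collapse
    return weights, means, variances
```

The responsibilities computed in the E-step correspond to soft assignments; taking the component with the largest responsibility for each point gives a hard assignment.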

Partitioning Algorithms - Infinite Mixture Models
- A generalization of mixture models where the number of mixture components is assumed to be infinite (but countable)
- Example: Chinese restaurant process
  - Customers arrive at a restaurant with an infinite number of circular tables, each having infinite capacity
  - As a new customer arrives, (s)he selects the table to sit at:
    - Either one of the partially occupied tables
    - Or a completely new table
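The seating process can be simulated directly. The slide does not give the seating probabilities, so this sketch assumes the standard formulation: an occupied table is chosen with probability proportional to its number of customers, and a new table with probability proportional to a concentration parameter `alpha` (a name introduced here).

```python
import random


def chinese_restaurant_process(n_customers, alpha=1.0, seed=0):
    """Simulate table assignments under a Chinese restaurant process.

    Returns (seating, table_sizes): the table index chosen by each
    customer and the final number of customers at each table.
    """
    rng = random.Random(seed)
    table_sizes = []
    seating = []
    for _ in range(n_customers):
        # Occupied tables are weighted by size, a new table by alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)       # open a completely new table
        else:
            table_sizes[table] += 1     # join a partially occupied table
        seating.append(table)
    return seating, table_sizes
```

Interpreting tables as clusters, the number of clusters is not fixed in advance but grows with the data; larger values of `alpha` tend to produce more tables.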

Partitioning Algorithms - K-Medoids
- Partitioning algorithm that represents a cluster using its most centrally located measurement
- Instead of updating all centers during an iteration, typically updates only a single medoid
  - How to determine the new medoid?
  - How to evaluate the effectiveness of the clustering?
- Covered in more detail during Lecture VIII

Density-Based Algorithms
- Class of algorithms that represent clusters as dense regions of objects
  - In contrast to partitioning algorithms, can derive clusters of arbitrary shape
  - Areas with a low density of objects are considered noise
- Basic concepts
  - Epsilon neighborhood: collection of points that are within distance Eps from a point
  - Dense neighborhood: Epsilon neighborhood that contains at least MinPts points

Density-Based Algorithms - Radius-Based Clustering
- Predecessor to density-based clustering
- Assign all points that are within distance Eps of each other to the same cluster
- MinPts or some other criterion can be used to prune the resulting clusters

Radius-Based Clustering - Example

Density-Based Algorithms - DBScan
- A point that has at least MinPts points within its Epsilon neighborhood is called a core object
- An object can only belong to a cluster if it is within the Epsilon neighborhood of at least one core object
- A core object o within the Epsilon neighborhood of another core object p must belong to the same cluster as p
- A non-core object belonging to the Epsilon neighborhood of some core objects must belong to the same cluster as one of these core objects
- Non-core objects which do not belong to the Epsilon neighborhood of any core object are noise

Density-Based Algorithms - DBScan
[Figure: core objects, a non-core object and an outlier / noise point]
Clusters A, B and C can be merged since they share a core object

Density-Based Algorithms - DBScan
- Algorithm that recursively merges Epsilon neighborhoods together to identify dense regions
- Let c be a core object; the points within the Epsilon neighborhood of c are considered seed points
- The cluster is expanded with (previously unallocated) points that are within the Epsilon neighborhood of a seed point that is itself a core object
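A compact sketch of the expansion procedure follows. It uses an explicit list of seed points instead of recursion and a brute-force neighborhood search; the names `dbscan`, `region_query` and `NOISE` are chosen for this illustration, and a point counts as a member of its own neighborhood (an assumption about how MinPts is counted).

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (brute force)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue
        labels[i] = cluster               # i is a core object: start a cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # noise reached from a core: border point
            if labels[j] is not None:
                continue                  # already allocated
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core object: keep expanding
        cluster += 1
    return labels
```

The brute-force neighborhood search makes this quadratic in the number of points; a spatial index would normally replace `region_query` for larger datasets.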

DBScan - Example
[Figure: detected clusters and noise points]

Density-Based Algorithms - DJCluster
- Variant of DBScan where cluster expansion is performed iteratively instead of recursively
  - Better suited for large datasets
- Basic idea:
  - Find the Epsilon neighborhood of a point
  - Assign all points within the neighborhood to a cluster
  - Check if the cluster shares a core point with any of the previous clusters
  - If so, the clusters can be merged
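The basic idea can be sketched as below. This is an interpretation of the four steps above rather than a reference implementation: neighborhoods are computed by brute force, only dense neighborhoods (at least `min_pts` points, counting the point itself) form clusters, and the function name and return format are assumptions.

```python
import math


def dj_cluster(points, eps, min_pts):
    """Iteratively build clusters by joining dense Epsilon neighborhoods.

    Returns (clusters, noise): a list of sets of point indices and the
    set of indices that do not belong to any cluster.
    """
    neighborhoods = [
        {j for j, q in enumerate(points) if math.dist(p, q) <= eps}
        for p in points
    ]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    clusters = []
    for i in sorted(core):
        merged = set(neighborhoods[i])    # new cluster from the neighborhood
        remaining = []
        for c in clusters:
            if c & merged & core:         # shares a core point: merge
                merged |= c
            else:
                remaining.append(c)
        clusters = remaining + [merged]
    clustered = set().union(*clusters)
    noise = set(range(len(points))) - clustered
    return clusters, noise
```

Since the loop only ever joins sets, no recursion or seed stack is needed, which is what makes this formulation easy to run incrementally over a large dataset.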

Notion of Place
- Location systems tend to provide information in coordinate form (absolute or relative)
- People refer to locations using semantic (or symbolic) descriptions
  - Descriptions for the same place can vary between different people
- Place: a representation of location that is consistent with the way people communicate location information

Notion of Place
[Figure: Petra, Jordan, with labeled places: Monastery, Church, Royal Tombs, Hotel, Treasury, Ticket Office]

Notion of Place
- Definitions for place originate from the field of humanistic geography
  - Roots in phenomenology and philosophy
  - Especially the philosophy of Martin Heidegger
- Places are entities that relate physical locations with human experiences and meanings
  - Relph: places are physical locations that are linked with meanings and activities
  - Tuan: places are spaces (i.e., physical locations) that are embodied with meanings

Notion of Place
- The meanings attributed to places vary:
  - Activities: swimming hall, movie theater, gym
  - Social: friend's home, regular place to meet friends
  - Generic: library, grocery store, train station
- Multiple meanings can be attributed to a place
  - They relate to different activities (and times) at the place
- Places can be perceived as public or private
  - Note: a space can be public even if the place is private!
  - Depends on the activity, time of day, etc.
  - Influences preferences regarding location disclosure

Why do places matter?
- Personalized information delivery
  - E.g., associate notes/to-do lists with places
  - Select advertisements or other information to provide, e.g., train or bus schedules
  - Depends on the stability of the information and the familiarity of the place
- Awareness cue
  - Places are often a cue of activity and availability
  - Automated status messages, e.g., in a phone contact list
- Support for user studies
  - Differentiating meaningful situations in the analysis phase

Detecting places
- Locations correlate strongly with activities
  - "What are you doing?" is often answered with a location during mobile phone calls
  - People assign activity-related labels to places
- Places correlate with time
  - Humans spend the majority of their time in a few places
  - The probability of labeling a place increases with time
  - But traffic stops (traffic jams, traffic lights) are seldom labeled
- Hence, places can be detected from location traces
  - Activity information can help (if available)

Place Identification
- Place identification = the process of detecting places from data
- A data analysis process with four steps
  - Preparation: clean data, transform data
  - Preprocessing: making the data ready for analysis
  - Analysis: performing the actual analysis
  - Post-processing: refining the results
- Additionally, a labeling step
  - Assign semantics to the detected places
  - Can take place before or after the analysis

Labeling
- A common choice is to prompt the user to label a place after it has been detected
- An alternative is to label first and learn the places automatically based on the labels
- Some labels can be assigned automatically
  - Geographic databases can be used to mine information about the type of building
  - Time information can be used to identify home and workplace
- Different modalities: text, photo, photo + text

Detecting Places - Overview
- Most place detection algorithms operate on coordinate data
  - Pruning: remove measurements that are unlikely to be meaningful
  - Clustering: apply spatial clustering to the data
  - Post-processing: determine which clusters are likely to correspond to meaningful places
    - Spatial criteria: matching against geo-databases, considering the size of clusters, etc.
    - Temporal criteria: requiring a minimum stay duration

Detecting Places - Velocity Pruning
- Measurements where the user is moving are unlikely to correspond to significant places
- Velocity can be used to prune measurements, with clustering applied to the remaining data
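A minimal sketch of velocity pruning, assuming the trace has already been projected to planar coordinates in meters and that speed is estimated from consecutive samples. The `(t, x, y)` layout and the 0.5 m/s default threshold are illustrative assumptions, not values from the slides.

```python
import math


def velocity_prune(track, max_speed_mps=0.5):
    """Keep only samples at which the user is (nearly) stationary.

    track: list of (t_seconds, x_m, y_m) sorted by time. A sample is kept
    if the speed from the previous sample to it is at most max_speed_mps.
    The first sample has no predecessor and is therefore not kept.
    """
    stationary = []
    for prev, cur in zip(track, track[1:]):
        dt = cur[0] - prev[0]
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp
        speed = math.dist(prev[1:], cur[1:]) / dt
        if speed <= max_speed_mps:
            stationary.append(cur)
    return stationary
```

The remaining samples could then be passed to a spatial clustering algorithm, followed by post-processing such as a minimum stay duration.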

Place Detection - Further Topics
- Coordinate-based algorithms are unable to separate different places within the same indoor space
- Radio fingerprinting based place detection uses the stability of the signal environment to detect places
  - Current state of the art in mobile phone based place detection
  - Performance decreases in areas with a limited signal environment
- Hybrid algorithms
  - Combine coordinate-based techniques with radio fingerprinting based place detection

Fingerprint-based Place Detection
- The basic idea is to compare the similarity of fingerprint information over time
  - If the radio environment is sufficiently similar over a time window t, the user is assumed to be in a place
- Many possible ways to measure the similarity of RF environments
  - Rank correlation (NearMe)
  - Extended Tanimoto (SensLoc)
  - Normalized Euclidean distance

Fingerprint-based Place Detection - Example
Fingerprints A, B and C, each with values for two MAC addresses:

      MAC 1   MAC 2
  A    -82     -74
  B    -84     -79
  C    -40     -40

- ExtTanimoto(A,B) = ((-82)(-84) + (-74)(-79)) / (82^2 + 74^2 + 84^2 + 79^2 - ((-82)(-84) + (-74)(-79))) = 0.9977
- ExtTanimoto(A,C) = 0.68
- A and B are from the same location with high probability; C is likely from a different location
- If we get successive similar measurements for, e.g., 5 minutes or 10 minutes, we are assumed to be in a place
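The computation in this example can be reproduced with a few lines of Python. The sketch assumes that both fingerprints list values for the same MAC addresses in the same order; the function name is chosen for this illustration.

```python
def ext_tanimoto(a, b):
    """Extended Tanimoto similarity: a.b / (|a|^2 + |b|^2 - a.b).

    a and b must contain values for the same MAC addresses in the same
    order (an assumption of this sketch) and must not both be all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    return dot / (norm_a + norm_b - dot)


A = (-82, -74)
B = (-84, -79)
C = (-40, -40)
print(round(ext_tanimoto(A, B), 4))  # 0.9977
print(round(ext_tanimoto(A, C), 2))  # 0.68
```

A place detector built on this measure would compare successive fingerprints and report a place once the similarity has stayed above a chosen threshold for the whole time window.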

Case Study: Zero Interaction Authentication (ZIA)
[Figure: devices A and B]
- Fingerprint similarity is a generic tool that has many other applications; as an example we consider ZIA
- Assume that device B unlocks automatically whenever device A is in close proximity (zero user interaction)
  - Car locks
  - Token-based authentication for laptops / terminals
- Susceptible to relay attacks, where another device pretends to be A
- If A and B compare their WiFi environments, the similarity of these environments can be used to resist relay attacks

Summary
- Spatial analysis refers to the process of inspecting geographical data
  - Preprocessing: cleaning and preparing data for analysis
  - Analysis: exploratory or confirmatory
  - Post-processing: validating and pruning results
- Spatial clustering
  - Grouping of similar (spatial) objects together
  - Partitioning algorithms: divide data optimally into clusters
  - Density-based algorithms: identify dense spatial regions

Summary
- Place
  - Representation of location that is consistent with the way people communicate location information
  - Semantic / symbolic
- Place detection
  - Process of identifying places from location measurements
  - On coordinate data, can be solved using spatial clustering and temporal + spatial pruning

Literature
- Ester, M.; Kriegel, H.-P.; Sander, J. & Xu, X., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), AAAI, 1996, 226-231
- Sander, J.; Ester, M.; Kriegel, H.-P. & Xu, X., Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications, Data Mining and Knowledge Discovery, 1998, 2, 169-194
- Zhou, C.; Frankowski, D.; Ludford, P.; Shekhar, S. & Terveen, L., Discovering Personally Meaningful Places: An Interactive Clustering Approach, ACM Transactions on Information Systems, 2007, 25, 12
- Ashbrook, D. & Starner, T., Learning significant locations and predicting user movement with GPS, Proceedings of the 6th International Symposium on Wearable Computers (ISWC), IEEE, 2002, 101-108
- Kang, J.; Welbourne, W.; Stewart, B. & Borriello, G., Extracting places from traces of locations, Proceedings of the 2nd ACM international workshop on Wireless mobile applications and services on WLAN hotspots (WMASH), ACM Press, 2004, 110-118

Literature (continued)
- Liao, L.; Fox, D. & Kautz, H., Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields, International Journal of Robotics Research, 2007, 26, 119-134
- Marmasse, N. & Schmandt, C., A user-centered location model, Personal and Ubiquitous Computing, 2002, 6, 318-321
- Nurmi, P. & Bhattacharya, S., Identifying Meaningful Places: The Nonparametric Way, Proceedings of the 6th International Conference on Pervasive Computing (Pervasive), Springer, 2008, 5013, 111-127
- Tuan, Y.-F., Space and Place: The Perspective of Experience, University of Minnesota Press, 2001
- Relph, E., Place and Placelessness, Pion Books, 1976
- Han, J.; Kamber, M. & Tung, A. K. H., Spatial Clustering Methods in Data Mining: A Survey, Geographic Data Mining and Knowledge Discovery, Taylor & Francis, 2001

Literature (continued)
- Kim, D. H.; Kim, Y.; Estrin, D. & Srivastava, M. B., SensLoc: sensing everyday places and paths using less energy, Proceedings of the 8th ACM Conference on Embedded Networked Sensor Systems (SenSys), ACM, 2010, 43-56
- Hightower, J.; Consolvo, S.; LaMarca, A.; Smith, I. & Hughes, J., Learning and Recognizing the Places We Go, Proceedings of the 7th International Conference on Ubiquitous Computing (UBICOMP), Springer-Verlag, 2005, 3660, 159-176
- Truong, H. T. T.; Gao, X.; Shrestha, B.; Saxena, N.; Asokan, N. & Nurmi, P., Comparing and Fusing Different Sensor Modalities for Relay Attack Resistance in Zero-Interaction Authentication, Proceedings of the 12th International Conference on Pervasive Computing and Communications (PerCom), 2014