Lecture14-Note
Data Lifecycle
Define Questions
- Utility of a question
- Understand your resources
- Be realistic
- Prioritize
Collect/find data
- Frequency
- Granularity
- Cost
- money, time, storage, processing, effort, etc.
- Utility
Store data
- Compression vs plain
- Publish/Open access
- Metadata
- Security
Extract data
- Data queries to extract useful subsets or slices of the data
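As a rough illustration of extracting a slice with a data query, here is a minimal sketch using Python's built-in `sqlite3` module. The `readings` table, its columns, and the sample rows are hypothetical and only serve the example.

```python
import sqlite3

# Hypothetical in-memory table; schema and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, hour INTEGER, temp_f REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("Boston", 9, 50.0),
        ("Boston", 11, 60.0),
        ("Austin", 9, 71.0),
        ("Austin", 12, 80.0),
    ],
)


def extract_slice(conn, city, min_hour):
    """Return (hour, temp_f) rows for one city from min_hour onward."""
    cur = conn.execute(
        "SELECT hour, temp_f FROM readings "
        "WHERE city = ? AND hour >= ? ORDER BY hour",
        (city, min_hour),
    )
    return cur.fetchall()


# Only the Boston readings taken at 10am or later.
boston_late = extract_slice(conn, "Boston", 10)
```

The query parameters are passed separately from the SQL text, which keeps the subset definition easy to change without rewriting the query.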
Pre-process data
- Reformatting: changing the format or encoding of the data
- E.g.: Changing an image from JPG to PDF
- Conversion: changing the unit of measurement or representation
- E.g.: Average temperatures in different countries, where some data is in °C and some in °F
- Cleaning: detecting and correcting errors
- E.g.: Temperature time series: 70, 68, 65, 500, 59 (the value 500 is an error)
- Imputation: hypothesizing missing values
- E.g.: Temperature time series: 9am 50, 10am --, 11am 60, 12pm 65
- Integration: mapping objects across datasets and merging them
- E.g.: Use data from two separate social networks
- Feature generation: formulating new features based on the initial given features
- E.g.: Eliminating morphological variation from a document
- Initial features: investor, investors, investing, invested
- Generated feature: “invest”
- Feature construction: creating a new feature by combining other features
- E.g.: Characterizing animal behaviors
- Initial features: movement, sitting, laying down, eating, running, sleeping…
- Constructed feature: “hunting” = running followed by eating
- Feature selection: deciding which subset of the initial given features should be used
- E.g.: Characterizing customer music preferences
- Initial features: age, height, favorite artist, car brand, address,…
- Selected features: age, address, favorite artist
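Several of the pre-processing steps above can be sketched in a few lines of Python, reusing the examples from these notes. This is only an illustration under stated assumptions: the function names, the median-based outlier rule and its `max_dev` threshold, the use of linear interpolation for imputation, and reading "followed by" as "immediately followed by" are all choices made for the sketch, not methods prescribed by the lecture.

```python
from statistics import median


def fahrenheit_to_celsius(temp_f):
    """Conversion: express a Fahrenheit reading in Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def flag_outliers(values, max_dev=30.0):
    """Cleaning: replace values farther than max_dev from the median with None.

    The threshold is an assumption chosen for this example.
    """
    mid = median(values)
    return [v if abs(v - mid) <= max_dev else None for v in values]


def impute_linear(values):
    """Imputation: fill interior None gaps by linear interpolation.

    Leading or trailing None values are left untouched.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def find_hunting(behaviors):
    """Feature construction: positions where running is immediately followed by eating."""
    return [
        i
        for i in range(len(behaviors) - 1)
        if behaviors[i] == "running" and behaviors[i + 1] == "eating"
    ]


def select_features(record, keep):
    """Feature selection: keep only the chosen features of a record."""
    return {name: record[name] for name in keep}


# Cleaning followed by imputation on the temperature series from the notes.
cleaned = flag_outliers([70, 68, 65, 500, 59])
repaired = impute_linear(cleaned)
```

Applying cleaning before imputation means the erroneous reading is first turned into a gap and then filled from its neighbours, rather than being allowed to distort the interpolation.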
Analyze data
- Basic statistics
- Classification
- Clustering
- Pattern mining
- Event detection
Present results
- Data visualization
- Explanation
- Drill-down to details
Privacy and Ethics in Data Science
Privacy
Sensitive Data
-
Data about individuals and organizations that should not be freely disseminated and publicized
- Health
- Criminal
- Finance
-
Privacy concerns: Desire to limit the dissemination of sensitive data
-
Sensitive Data: Identifying values + Sensitive attribute
Sensitive Data in Data lifecycle
- Collect/find data
- Consent, State purpose/use, Decent quality, Error corrections
- Store data
- Physical safety
- Personnel training
- Access control
- Encryption
- Extract data, Pre-process data, Analyze data, Present results, Publish data
- Limit data use based on the purpose expressed in the original consent
- Secure data transmission
- Anonymization
Simple Anonymization Techniques
- Replace identifiers with random identifiers
- Abstraction: Replace values by ranges
- E.g.: 3/1/16 -> Spring 2016; replace zip code by state
- Cluster data points and replace individuals by their cluster centroid
- Remove values
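Two of the simple techniques above, random identifiers and abstraction, might look like the following in Python. This is a hedged sketch: the function names are hypothetical, the season mapping assumes meteorological seasons in the Northern Hemisphere, and the age-range helper is an added illustration of "replace values by ranges" rather than an example from the lecture.

```python
import secrets
from datetime import date

SEASONS = ("Winter", "Spring", "Summer", "Fall")


def pseudonymize(identifiers):
    """Map each distinct identifier to a random token.

    The mapping itself is sensitive: it must be protected or discarded,
    otherwise the replacement can be reversed.
    """
    mapping = {}
    for ident in identifiers:
        if ident not in mapping:
            mapping[ident] = secrets.token_hex(8)
    return mapping


def date_to_season(d):
    """Abstraction: replace an exact date by a season and year.

    Assumes meteorological seasons (Dec-Feb is Winter, Mar-May is Spring, ...).
    """
    return f"{SEASONS[(d.month % 12) // 3]} {d.year}"


def age_to_range(age, width=10):
    """Abstraction: replace an exact age by a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


tokens = pseudonymize(["alice", "bob", "alice"])
season = date_to_season(date(2016, 3, 1))  # "Spring 2016"
```

Note that none of these steps gives a guarantee on its own: abstracted values can still be combined with outside data to re-identify individuals.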
Addressing the Problems of Simple Anonymization Techniques
- Provide guarantees that re-identification will not be possible within some bounds:
- E.g.: can only map a given individual to a set of 50 individuals
- k-anonymity: A dataset has k-anonymity if every combination of identifying values that appears in it is shared by at least k individuals
- l-diversity: A dataset has l-diversity if the individuals that share the same identifying values have at least l distinct values for the sensitive attribute
- t-closeness: A dataset has t-closeness if, within every group of individuals that share the same identifying values, the distribution of the sensitive attribute is within a threshold t of its distribution in the whole dataset
- Differential privacy: the only method listed here that provides a formal mathematical guarantee of privacy
- Differential privacy adds “noise” to the retrieval process so that comparing query results (e.g., with and without a given individual) does not reveal the actual sensitive attribute information
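The definitions above can be made concrete with a small Python sketch. It measures the k and l that a toy table achieves, and answers a counting query with Laplace noise, which is one common way to add noise for differential privacy. The table, column names, and function names are hypothetical; the noise scale 1/epsilon assumes a counting query, where adding or removing one individual changes the result by at most 1.

```python
import random
from collections import defaultdict


def group_by_identifying(rows, identifying):
    """Group rows that share the same identifying values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in identifying)].append(row)
    return groups


def k_anonymity(rows, identifying):
    """Largest k the table satisfies: the size of its smallest group."""
    groups = group_by_identifying(rows, identifying)
    return min(len(g) for g in groups.values())


def l_diversity(rows, identifying, sensitive):
    """Largest l the table satisfies: fewest distinct sensitive values in any group."""
    groups = group_by_identifying(rows, identifying)
    return min(len({r[sensitive] for r in g}) for g in groups.values())


def noisy_count(rows, predicate, epsilon, rng):
    """Counting query answered with Laplace noise of scale 1/epsilon.

    The difference of two exponential draws with rate epsilon
    is Laplace-distributed with scale 1/epsilon.
    """
    true_count = sum(1 for r in rows if predicate(r))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical, already-abstracted records.
rows = [
    {"age": "30-39", "state": "MA", "diagnosis": "flu"},
    {"age": "30-39", "state": "MA", "diagnosis": "asthma"},
    {"age": "30-39", "state": "MA", "diagnosis": "flu"},
    {"age": "40-49", "state": "NY", "diagnosis": "flu"},
    {"age": "40-49", "state": "NY", "diagnosis": "flu"},
]

k = k_anonymity(rows, ["age", "state"])               # 2
l = l_diversity(rows, ["age", "state"], "diagnosis")  # 1
answer = noisy_count(rows, lambda r: r["diagnosis"] == "flu", 0.5, random.Random())
```

The toy table is 2-anonymous but only 1-diverse: everyone in the second group has the same diagnosis, so knowing someone belongs to that group reveals the sensitive attribute. In `noisy_count`, a smaller epsilon means more noise and stronger privacy, at the cost of less accurate answers.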
Research Ethics
Institutional Review Board
- Reviews research to ensure ethical treatment of human subjects
- Levels of review: Full board; Expedited; Exempt; Non-human subjects research
All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.