Data Lifecycle

Define Questions

  • Utility of a question
  • Understand your resources
  • Be realistic
  • Prioritize

Collect/find data

  • Frequency
  • Granularity
  • Cost
    • money, time, storage, processing, effort, etc.
  • Utility

Store data

  • Compression vs plain
  • Publish/Open access
  • Metadata
  • Security

Extract data

  • Data queries to extract useful subsets or slices of the data
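
As a minimal sketch of such a query, the example below uses Python's built-in sqlite3 module to pull one slice out of a small table. The table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical in-memory table of temperature readings (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, hour INTEGER, temp_f REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("Austin", 9, 70.0),
        ("Austin", 10, 68.0),
        ("Boston", 9, 50.0),
        ("Boston", 10, 52.0),
    ],
)


def extract_city(connection, city):
    """Return the (hour, temp_f) rows for one city, ordered by hour."""
    cursor = connection.execute(
        "SELECT hour, temp_f FROM readings WHERE city = ? ORDER BY hour",
        (city,),  # parameterized query: the value is never pasted into the SQL text
    )
    return cursor.fetchall()


boston = extract_city(conn, "Boston")
print(boston)
```

The WHERE clause selects the subset of rows and the column list selects the slice of attributes, so only the data needed for the question leaves the store.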

Pre-process data

  • Reformatting: Changing the format or encoding of the data

    • E.g.: Changing an image from JPG to PDF
  • Conversion: changing the unit of measurement or representation

    • E.g.: Average temperatures in different countries, where some data is in C and some in F
  • Cleaning: Detecting and correcting errors

    • E.g.: Temperature time series: 70, 68, 65, 500, 59 (500 is likely an error)
  • Imputation: hypothesizing missing values

    • E.g.: Temperature time series: 9am 50, 10am --, 11am 60, 12pm 65
  • Integration: mapping objects across datasets, merging them

    • E.g: Use data from two separate social networks
  • Feature generation: formulating new features based on initial given features

    • E.g.: Eliminating morphological variation from a document
      • Initial features: investor, investors, investing, invested
      • Generated feature: “invest”
  • Feature construction: creating new features by combining other features

    • E.g.: Characterizing animal behaviors
      • Initial features: movement, sitting, lying down, eating, running, sleeping…
      • Constructed features: “hunting” = running followed by eating
  • Feature selection: decision on what subset of the initial given features should be used

    • E.g.: Characterizing customer music preferences
      • Initial features: age, height, favorite artist, car brand, address,…
      • Selected features: age, address, favorite artist
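
The conversion, cleaning, and imputation examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the outlier rule (distance from the median with a fixed cutoff of 30) is one simple choice among many, and linear interpolation is just one way to hypothesize a missing value.

```python
from statistics import median


def fahrenheit_to_celsius(temp_f):
    """Conversion: express a Fahrenheit reading in Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def flag_outliers(values, max_dev=30.0):
    """Cleaning: replace readings far from the median with None.

    max_dev is an assumed cutoff, not a value taken from the notes.
    """
    present = [v for v in values if v is not None]
    mid = median(present)
    return [
        v if v is not None and abs(v - mid) <= max_dev else None
        for v in values
    ]


def impute_linear(values):
    """Imputation: fill None gaps by linear interpolation between the
    nearest known neighbors (gaps at the edges copy the nearest known value)."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    if not known:
        return out  # nothing to interpolate from
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out


cleaned = flag_outliers([70, 68, 65, 500, 59])   # the 500 reading becomes None
repaired = impute_linear(cleaned)                # the gap is filled from 65 and 59
morning = impute_linear([50, None, 60, 65])      # 9am, 10am (missing), 11am, 12pm
print(cleaned, repaired, morning)
```

Keeping cleaning and imputation as separate steps makes it explicit which values were observed and which were hypothesized.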

Analyze data

  • Basic statistics
  • Classification
  • Clustering
  • Pattern mining
  • Event detection

Present results

  • Data visualization
  • Explanation
    • Drill-down to details

Privacy and Ethics in Data Science

Privacy

Sensitive Data

  • Data about individuals and organizations that should not be freely disseminated and publicized

    • Health
    • Criminal
    • Finance
  • Privacy concerns: Desire to limit the dissemination of sensitive data

  • Sensitive Data: Identifying values + Sensitive attribute

Sensitive Data in Data lifecycle

  • Collect/find data
    • Consent, State purpose/use, Decent quality, Error corrections
  • Store data
    • Physical safety
    • Personnel training
    • Access control
    • Encryption
  • Extract data, Pre-process data, Analyze data, Present results, Publish data
    • Limit data use based on the purpose expressed in the original consent
    • Secure data transmission
    • Anonymization

Simple Anonymization Techniques

  • Replace identifiers with random identifiers
  • Abstraction: Replace values by ranges
    • E.g.: 3/1/16 -> Spring 2016; Replace zip code by state
  • Cluster data points and replace individuals by their cluster centroid
  • Remove values
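
A rough sketch of the first two techniques follows. All names are hypothetical; the month-to-season mapping and the tiny zip-to-state lookup table are assumptions made only to reproduce the examples above, and a real system would need a complete lookup and a decision about whether the identifier mapping is stored or discarded.

```python
import secrets

# Assumed season boundaries by calendar month (index 0 = January).
SEASONS = [
    "Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
    "Summer", "Summer", "Fall", "Fall", "Fall", "Winter",
]

# Illustrative lookup only; a real table would cover every zip code.
ZIP_TO_STATE = {"78701": "TX", "02139": "MA", "10001": "NY"}


def pseudonymize(identifiers):
    """Replace each distinct identifier with a random token."""
    mapping = {}
    for identifier in identifiers:
        if identifier not in mapping:
            mapping[identifier] = secrets.token_hex(8)
    return mapping


def date_to_season(date_str):
    """Abstraction: turn 'M/D/YY' into a coarse 'Season YYYY' label."""
    month, _day, year = (int(part) for part in date_str.split("/"))
    full_year = 2000 + year if year < 100 else year
    return f"{SEASONS[month - 1]} {full_year}"


def zip_to_state(zip_code):
    """Abstraction: replace a zip code by its state, if known."""
    return ZIP_TO_STATE.get(zip_code, "UNKNOWN")


tokens = pseudonymize(["alice", "bob", "alice"])
print(tokens)
print(date_to_season("3/1/16"), zip_to_state("78701"))
```

Note that the same person always maps to the same token, so records can still be linked to each other even though the original identifier is gone.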

Addressing the problem of Simple Anonymization Techniques

  • Provide guarantees that re-identification will not be possible within some bounds:
    • E.g.: can only map a given individual to a set of 50 individuals
  • k-anonymity: A dataset has k-anonymity if every combination of identifying values that appears in it is shared by at least k individuals
  • l-diversity: A dataset has l-diversity if the individuals that share the same identifying values have at least l distinct values for the sensitive attribute
  • t-closeness: A dataset has t-closeness if, within each group of individuals that share the same identifying values, the distribution of the sensitive attribute is within a threshold t of its distribution in the whole dataset
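  • Differential privacy: Provides a formal mathematical guarantee that the result of a query changes very little whether or not any single individual's data is included
    • Differential privacy adds “noise” to the retrieval process so that comparing query results (for example, with and without a given individual) does not give us the actual sensitive attribute information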
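
The definitions above can be checked on a toy table. The sketch below is an illustration, not a production implementation: the function names, column names, and rows are hypothetical, and the noisy count uses the Laplace mechanism for a counting query (sensitivity 1), which is one common way to add the “noise” described above.

```python
import random
from collections import defaultdict


def group_by_identifying_values(rows, identifying):
    """Group rows that share the same identifying values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in identifying)].append(row)
    return groups


def k_anonymity(rows, identifying):
    """Largest k for which the table is k-anonymous (smallest group size)."""
    groups = group_by_identifying_values(rows, identifying)
    return min(len(group) for group in groups.values())


def l_diversity(rows, identifying, sensitive):
    """Largest l for which the table is l-diverse (fewest distinct
    sensitive values found in any one group)."""
    groups = group_by_identifying_values(rows, identifying)
    return min(len({row[sensitive] for row in group}) for group in groups.values())


def noisy_count(rows, predicate, epsilon, rng=random):
    """Count matching rows and add Laplace noise with scale 1/epsilon.

    The difference of two independent Exponential(epsilon) draws is
    Laplace-distributed with scale 1/epsilon.
    """
    true_count = sum(1 for row in rows if predicate(row))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical, already-abstracted records.
rows = [
    {"age": "20-29", "state": "TX", "diagnosis": "flu"},
    {"age": "20-29", "state": "TX", "diagnosis": "asthma"},
    {"age": "30-39", "state": "MA", "diagnosis": "flu"},
    {"age": "30-39", "state": "MA", "diagnosis": "flu"},
]

print(k_anonymity(rows, ["age", "state"]))               # smallest group size
print(l_diversity(rows, ["age", "state"], "diagnosis"))  # fewest distinct diagnoses per group
print(noisy_count(rows, lambda r: r["diagnosis"] == "flu", epsilon=1.0))
```

In this toy table every group has two members, yet the second group contains a single diagnosis, so knowing someone's age range and state is enough to learn their diagnosis. That gap is what l-diversity and t-closeness are meant to address.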

Research Ethics

Institutional Review Board

  • Reviews research to ensure ethical treatment of human subjects
  • Levels of review: Full board; Expedited; Exempt; Non-human subjects research