Correlation

What is Correlation

  • Two variables are correlated (associated) when their values are not independent
    • Probabilistically speaking
  • Examples:
    • When people buy chips, they are very likely to buy beer
    • When people have yellow fingers, they are very likely to smoke
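One way to check whether two observations are "not independent" is to compare P(A and B) with P(A) × P(B). The sketch below computes that ratio (often called lift) for a handful of made-up shopping baskets. The data and the function name `lift` are illustrative assumptions, not part of the original notes.

```python
def lift(observations, a, b):
    """Return P(a and b) / (P(a) * P(b)) over a list of item sets.

    A value near 1 suggests independence; a value well above 1
    suggests a positive association between a and b.
    Assumes both a and b occur at least once.
    """
    n = len(observations)
    p_a = sum(1 for basket in observations if a in basket) / n
    p_b = sum(1 for basket in observations if b in basket) / n
    p_ab = sum(1 for basket in observations if a in basket and b in basket) / n
    return p_ab / (p_a * p_b)


# Made-up baskets, for illustration only.
baskets = [
    {"chips", "beer"},
    {"chips", "beer"},
    {"chips", "beer", "milk"},
    {"milk"},
    {"bread"},
    {"bread", "milk"},
    {"chips"},
    {"beer"},
]

print(lift(baskets, "chips", "beer"))  # 1.5: bought together more often than chance
```

A lift above 1 only indicates association in this sample; it says nothing about which purchase, if either, drives the other.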

Variables that are predictive

  • Some variables are predictive because they are correlated with other target variables
    • Smoking and coughing are predictive variables for respiratory disease

Cause and Effect

  • A variable v1 is a cause for variable v2 if changing v1 changes v2
  • A variable v3 is an effect of variable v2 if changing v2 changes v3, while changing v3 does not change v2

Correlation and Causation

  • Correlation
    • Knowledge of v1 provides information for v2
      • Eg: yellow fingers, cough, smoking, lung cancer
    • Can use any data collected and do statistical analysis
  • Causation
    • Run an Experiment (or randomized controlled trial) - The best way to show causality
      • From sample (eg 1000), randomly assign:
        • 500 (control condition)
          • Eg smokers
        • 500 (treatment/experimental condition)
          • Eg stop smoking
      • If there is a difference, then evidence for causality
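The random assignment step can be sketched in a few lines. The helper names `randomly_assign` and `difference_in_rates` are hypothetical, and outcomes are assumed to be recorded as 0/1 values. A real trial would also need a significance test before treating a difference as evidence.

```python
import random


def randomly_assign(participants, seed=0):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_rates(control_outcomes, treatment_outcomes):
    """Return the treatment outcome rate minus the control outcome rate."""
    control_rate = sum(control_outcomes) / len(control_outcomes)
    treatment_rate = sum(treatment_outcomes) / len(treatment_outcomes)
    return treatment_rate - control_rate


# A sample of 1000 participant ids, split into 500 control and 500 treatment.
control, treatment = randomly_assign(range(1000))
print(len(control), len(treatment))  # 500 500
```

Because assignment is random, any other factor that affects the outcome should be spread evenly over both groups, which is what lets a difference count as evidence for causality.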

Latent Variables

  • Latent variables are variables that cannot be directly observed, only inferred through a model
    • Eg DNA damage
    • Eg Carbon monoxide inhalation
  • Latent variables can be hard to identify, even harder to learn automatically from data

Graphical Model

  • A graphical model, a graph that captures dependencies among variables, is an alternative approach

What is Graphical Model

  • Graph that captures dependencies among variables
    • Nodes are variables
    • Links indicate dependencies
    • Probabilities represent how the dependencies work

Two Graphical Models

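  • Directed graphical models (Bayesian networks)
    • Graph links have a direction
    • Cycles are not allowed
  • Undirected graphical models (Markov networks)
    • Graph links do not have a direction
    • Cycles are allowed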

Bayesian Networks

  • A Bayesian network is a graph
    • Directed edges show how variables influence others (No cycles allowed)
    • Probability tables (one for each node) show the probability of the value of a variable given the values of its parent variables
    • A variable depends directly only on its parent variables; given its parents, it is conditionally independent of its earlier ancestors

Causal Models

  • A causal model is a Bayesian network where all the relationships among variables are causal
  • Causal models represent how independent variables have an effect on dependent variables
  • Causal reasoning uses the probabilities in the causal model to make inferences about the value of variables given the values of others
    • Eg: Given that the grass is wet, what is the probability that it rained?

Bayesian Inference

  • Bayesian inference is used to reason over a Bayesian network to determine the probabilities of some variables given some observed variables
    • Eg: Given that the grass is wet, what is the probability that it has been raining?
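The wet-grass question can be answered by enumerating the joint distribution of a tiny network. The sketch below assumes a structure with two independent parents, Rain and Sprinkler, of WetGrass. The structure and every probability in it are made-up assumptions for illustration.

```python
from itertools import product

# Assumed priors and conditional probability table (illustrative values only).
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {  # P(wet = True | rain, sprinkler)
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.00,
}


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) as the product of each node's table entry."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    p *= p_wet if wet else 1 - p_wet
    return p


def prob_rain_given_wet():
    """P(rain | wet) = P(rain, wet) / P(wet), summing out the sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


print(prob_rain_given_wet())
```

Enumeration is exponential in the number of variables, so it only suits very small networks; larger ones need algorithms such as variable elimination or sampling.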

Markov Networks

  • A Markov network is an undirected graphical model that includes a potential function for each clique of interconnected nodes
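To make the role of potential functions concrete, here is a minimal sketch of a three-node chain A - B - C with one potential per pairwise clique. The potential values (2.0 when neighbors agree, 1.0 otherwise) and all names are assumptions chosen for illustration.

```python
from itertools import product

VARIABLES = ["A", "B", "C"]
CLIQUES = [("A", "B"), ("B", "C")]  # undirected links of the chain


def agreement_potential(x, y):
    """Assumed potential: neighbors that agree get a higher score."""
    return 2.0 if x == y else 1.0


def unnormalized(assignment):
    """Product of the clique potentials for one full assignment."""
    score = 1.0
    for u, v in CLIQUES:
        score *= agreement_potential(assignment[u], assignment[v])
    return score


def joint_distribution():
    """Normalize the scores so they sum to 1 over all binary assignments."""
    states = [dict(zip(VARIABLES, values)) for values in product((0, 1), repeat=3)]
    z = sum(unnormalized(s) for s in states)  # the partition function
    return {tuple(s[v] for v in VARIABLES): unnormalized(s) / z for s in states}


for state, prob in joint_distribution().items():
    print(state, round(prob, 3))
```

Unlike the tables in a Bayesian network, potentials are not probabilities on their own; they only become a distribution after dividing by the partition function.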

Learning Graphical Models

  1. Parameter Learning
    • Learning the parameters (probabilities) of the model
  2. Structure Learning
    • Learning the structure of the model (like the structure of the graph)
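When the structure is already known and the data is fully observed, parameter learning can be as simple as counting. The sketch below estimates P(child | parent) for a single link from made-up samples; `learn_cpt` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def learn_cpt(samples):
    """Estimate P(child | parent) from (parent_value, child_value) pairs.

    Returns a nested dict: parent value -> child value -> probability.
    """
    counts = defaultdict(Counter)
    for parent, child in samples:
        counts[parent][child] += 1
    return {
        parent: {child: n / sum(children.values()) for child, n in children.items()}
        for parent, children in counts.items()
    }


# Made-up observations of (weather, grass state), for illustration only.
samples = [("rain", "wet")] * 3 + [("rain", "dry")] + [("no_rain", "dry")] * 4

cpt = learn_cpt(samples)
print(cpt["rain"]["wet"])  # 0.75
```

Structure learning is harder, because it has to search over possible graphs rather than fill in tables for a fixed one.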

Analyzing Network Data

Network Structure

  • Graphs of nodes connected by links
  • Nodes: entities of interest (eg, a person, a protein, a Web page…)
  • Links: relation between two nodes

Dynamic Networks – Representing Network Behavior

  • Behaviors of the entities over time
  • Network structure may also change over time
  • Weak vs strong links change
  • Changes almost always have trickle effects on the rest of the network

Sources of Networked Data

  1. Messages across people
  2. Social network sites
  3. Social media
  4. Constructing networks from other data

Types of Networks

  1. Homogeneous networks
  2. Heterogeneous networks
  3. Bipartite networks
  4. Social networks

Homogeneous vs Heterogeneous Networks

  • Homogeneous networks: all nodes have the same type
  • Heterogeneous networks: nodes have different types

Bipartite networks

  • Bipartite networks: nodes have two types, links are between the two types of nodes

Social Networks

  • People as nodes, links represent interactions
  • Major issue: networks in social sites are often not publicly accessible

Random Network & Scale-Free Network

Random Network

  1. Every node tends to have a similar number of connections
  2. Links form at random

Scale-Free Network

  1. Most nodes have few connections, and the number varies from node to node
  2. A few nodes may have many connections
  3. Example: The Internet and the Web are scale-free networks; so are sexual partnership networks among humans

Cliques and Connected Components

  • A clique is a subgraph where all the nodes are connected to all the other nodes in the clique (every pair of nodes is directly linked)
  • A connected component is a subgraph where for any two nodes in the subgraph there is a path that connects them (at least one)
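Both definitions can be checked directly on a small graph. The sketch below stores an undirected graph as a dict from node to its set of neighbors; the graph itself and the function names are illustrative assumptions.

```python
def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first walk from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(adjacency, nodes):
    """True if every node in `nodes` is linked to all the others."""
    nodes = set(nodes)
    return all(nodes - {n} <= adjacency[n] for n in nodes)


# Triangle A-B-C, with D hanging off C, plus a separate pair E-F.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}

print(connected_components(graph))          # two components
print(is_clique(graph, ["A", "B", "C"]))     # True
print(is_clique(graph, ["A", "B", "C", "D"]))  # False: D is linked only to C
```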

Bridge

  • A bridge is a link between two nodes that if removed would result in the nodes being in disconnected components of the graph
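A direct, if slow, way to find bridges follows the definition literally: remove each link in turn and see whether the number of connected components goes up. This is a brute-force sketch on a made-up graph, assuming no duplicate links; faster linear-time algorithms exist.

```python
def count_components(nodes, edges):
    """Count connected components of an undirected graph."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbors[node] - seen)
    return count


def find_bridges(nodes, edges):
    """Return the links whose removal disconnects part of the graph."""
    baseline = count_components(nodes, edges)
    bridges = []
    for i, edge in enumerate(edges):
        remaining = edges[:i] + edges[i + 1:]
        if count_components(nodes, remaining) > baseline:
            bridges.append(edge)
    return bridges


# Two triangles (A-B-C and D-E-F) joined by the single link C-D.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]

print(find_bridges(nodes, edges))  # [('C', 'D')]
```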

Centrality

  • “Degree centrality” measures a node by its number of links; the node with the most links has the highest degree centrality
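Degree centrality is straightforward to compute from a list of links. The sketch below uses one common normalization, dividing each node's degree by the largest possible degree (n - 1); that normalization and the star-shaped example graph are assumptions for illustration.

```python
def degree_centrality(nodes, edges):
    """Return each node's degree divided by (number of nodes - 1)."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    scale = len(nodes) - 1
    return {n: d / scale for n, d in degree.items()}


# Star graph: "hub" is linked to every other node.
star_nodes = ["hub", "a", "b", "c"]
star_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]

centrality = degree_centrality(star_nodes, star_edges)
print(max(centrality, key=centrality.get))  # hub
```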

Analyzing Time Series Data

Why Time Series Data

  • Forecasting: prediction of upcoming events
  • Signal or anomaly detection: get rid of noise and find unexpected patterns that might signal an event
  • Used in IoT systems to track system health and trends
  • Used in research to study questions that unfold over time

Collecting Time Series Data

  • Source of Time Series Data:
    • Sensors: Environment, body, traffic…
    • Economic: Government reports,…
    • Commercial: Customers, products, usage,…
    • Social: Activities, emails, tweets,…
  • Some terms
    • Sampling rate: how often data is collected over time
    • Granularity: the time period covered by a single, aggregated data point
    • Adaptive sampling: Changing the sampling rate when something interesting is detected
    • Streaming data: When data is collected and stored continuously
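The difference between sampling rate and granularity shows up when raw readings are aggregated. The sketch below averages readings sampled every few seconds into one data point per fixed-size time bucket; the readings and the `aggregate` helper are made up for illustration.

```python
def aggregate(readings, bucket_seconds):
    """Average (timestamp_seconds, value) readings into fixed-size buckets.

    Returns a dict mapping each bucket's start time to the mean value
    of the readings that fall inside it.
    """
    buckets = {}
    for timestamp, value in readings:
        start = (timestamp // bucket_seconds) * bucket_seconds
        buckets.setdefault(start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Readings sampled at irregular second offsets, aggregated per minute.
readings = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0)]
print(aggregate(readings, 60))  # {0: 2.0, 60: 6.0}
```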

Pre-Processing Time Series Data

  • Data cleaning: correcting errors
  • Imputation: hypothesizing missing values (eg, using the average to fill in missing values)
  • Rescaling: converting to another sampling rate or granularity
  • Decomposition: separating major components
    • 3 components:
      • Trend: slowly changing over time
      • Seasonal: periodicity
      • Remainder: random noise, irregularities
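The three components can be separated with a simple additive scheme: a centered moving average gives the trend, the average detrended value at each position in the cycle gives the seasonal part, and whatever is left is the remainder. This is a minimal sketch under stated assumptions (additive model, odd period, series at least two periods long), not the method of any particular library.

```python
def decompose(series, period):
    """Split a series into trend, seasonal and remainder lists.

    Assumes series = trend + seasonal + remainder and an odd period.
    Trend and remainder are None at the edges, where the moving
    average window does not fit.
    """
    half = period // 2
    n = len(series)
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half: i + half + 1]
        trend[i] = sum(window) / period
    # Average the detrended values at each position within the period.
    by_phase = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            by_phase[i % period].append(series[i] - trend[i])
    phase_means = [sum(vals) / len(vals) for vals in by_phase]
    center = sum(phase_means) / period  # make the seasonal part sum to zero
    seasonal = [phase_means[i % period] - center for i in range(n)]
    remainder = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic series: a rising line plus a repeating pattern of length 3.
pattern = [0, 3, -3]
series = [i + pattern[i % 3] for i in range(12)]

trend, seasonal, remainder = decompose(series, 3)
print(seasonal[:3])
```

On this synthetic series the remainder comes out as zero, because the data contains no noise; real data would leave the irregular part there.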

Analyzing Time Series Data

Variable Tracking

  • Given a variable, account for its values over time
  • Get the Tracking coefficient
  • The variable may be unobservable, in which case it must be derived from observed variables (aka tracking a “latent” variable)

Alert Systems

  • Given a variable, track its values over time and generate an alert when it goes over a threshold
    • The variable may not be directly observable; instead it could be derived from observed variables
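A threshold alert can be sketched as a single pass over the series. The version below raises one alert each time the value crosses from at-or-below the threshold to above it, instead of alerting on every point above it; that design choice and the sample values are assumptions.

```python
def alerts(series, threshold):
    """Return (index, value) for each upward crossing of the threshold."""
    was_above = False
    crossings = []
    for i, value in enumerate(series):
        is_above = value > threshold
        if is_above and not was_above:
            crossings.append((i, value))
        was_above = is_above
    return crossings


print(alerts([1, 5, 6, 2, 7], threshold=4))  # [(1, 5), (4, 7)]
```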

Event Detection

  • Given: an event pattern specified as a (set of) variable(s) and their value ranges over timeframes
  • Find: a match of the event pattern against the time series
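For a single variable, an event pattern can be written as a list of allowed value ranges, one per consecutive time step, and matched with a sliding window. The representation and the name `find_events` are illustrative assumptions.

```python
def find_events(series, pattern):
    """Return the start indices where the series matches the pattern.

    `pattern` is a list of (low, high) ranges; step k of a match must
    satisfy low <= value <= high for the k-th range.
    """
    width = len(pattern)
    matches = []
    for start in range(len(series) - width + 1):
        window = series[start:start + width]
        if all(low <= value <= high for value, (low, high) in zip(window, pattern)):
            matches.append(start)
    return matches


# Event: two consecutive readings between 8 and 10.
series_values = [1, 2, 9, 10, 3, 8, 9, 1]
print(find_events(series_values, [(8, 10), (8, 10)]))  # [2, 5]
```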

Detection of Trigger Events

  • Given: an event pattern composed of a set of variables and their value ranges over timeframes
  • Find: an event pattern that occurs earlier in the time series where there is correlation or causation with the given event pattern

Causality Detection

  • Given: a time series

  • Find: events that may be temporally related through a causal relation

  • Only one direction!

Granger Causality

  • Time series X “Granger-causes” time series Y if past values of X can be used to predict future values of Y
    • above and beyond the information contained in past values of Y
  • Example: X is rain and Y is umbrella sales. Spikes in rain are going to have a causal effect on the purchasing of umbrellas. If we did a Granger causality analysis, it would say that there is evidence for Granger causality between rain and umbrella purchases.
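The rain and umbrella example can be sketched by comparing two least-squares models of Y: one that uses only the previous value of Y, and one that also uses the previous value of X. If adding X shrinks the prediction error a lot, the F statistic is large. Everything below is an assumption for illustration: the synthetic data, a single lag, and the helper names. A real analysis would convert F to a p-value and consider several lags.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(rows, targets):
    """Residual sum of squares of the least-squares fit of targets on rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(x, y):
    """F statistic for: lag-1 x helps predict y beyond lag-1 y alone."""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r, rss_u = rss(restricted, targets), rss(unrestricted, targets)
    dof = len(targets) - 3  # observations minus fitted coefficients
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data: umbrella sales follow the previous day's rain, plus noise.
rng = random.Random(0)
rain = [rng.random() for _ in range(200)]
umbrellas = [5 + rng.gauss(0, 1)] + [
    5 + 10 * rain[t - 1] + rng.gauss(0, 1) for t in range(1, 200)
]

print("rain -> umbrellas:", granger_f(rain, umbrellas))
print("umbrellas -> rain:", granger_f(umbrellas, rain))
```

The first statistic should be far larger than the second, which matches the "only one direction" point: past rain helps predict umbrella sales, but past umbrella sales do not help predict rain.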

Discovery of Unexpected Events

  • Given a time series, track its variables and report any unusual values or patterns
    • System must have some definition of “unusual”
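One possible definition of "unusual" is a value that lies more than a chosen number of standard deviations from the mean. The sketch below applies that rule to made-up readings; the threshold of 3 and the function name are assumptions, and other definitions (rolling windows, seasonal baselines) may fit real data better.

```python
from statistics import mean, stdev


def unusual_points(series, z_threshold=3.0):
    """Return (index, value) pairs lying beyond z_threshold standard deviations."""
    mu, sigma = mean(series), stdev(series)
    return [
        (i, value)
        for i, value in enumerate(series)
        if abs(value - mu) > z_threshold * sigma
    ]


# Made-up readings with one obvious spike.
sensor = [10, 11, 9, 10, 11, 9, 10, 50, 10, 9, 11, 10]
print(unusual_points(sensor))  # [(7, 50)]
```

Note that a large outlier inflates the standard deviation it is measured against, so a global rule like this can miss smaller anomalies in the same series.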

Pattern Mining

  • Given a time series, identify patterns that are:
    • predictive of all or some of the variable values
    • predictive of correlations among variables
    • characteristic of the data (even those that are complex, non-periodic, irregular, chaotic)