Correlation

What is Correlation

Two variables are correlated (associated) when their values are not independent
- Probabilistically speaking
Examples:
- When people buy chips, they are very likely to buy beer
- When people have yellow fingers, they are very likely to smoke
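
A minimal sketch of checking for association: if two variables were independent, P(A and B) would equal P(A) * P(B). The basket counts below are made up for illustration and are not data from these notes.

```python
# Hypothetical shopping baskets: (bought_chips, bought_beer)
baskets = ([(True, True)] * 40 + [(True, False)] * 10 +
           [(False, True)] * 10 + [(False, False)] * 40)

def association(pairs):
    """Return P(A and B) and P(A) * P(B), both estimated from observed pairs."""
    n = len(pairs)
    p_a = sum(1 for a, _ in pairs if a) / n
    p_b = sum(1 for _, b in pairs if b) / n
    p_ab = sum(1 for a, b in pairs if a and b) / n
    return p_ab, p_a * p_b

p_joint, p_if_independent = association(baskets)
# A large gap between the two numbers suggests the variables are associated.
print(p_joint, p_if_independent)
```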

Variables that are predictive

Some variables are predictive because they are correlated with other target variables
- Smoking and coughing are predictive variables for respiratory disease

Cause and Effect

A variable v1 is a cause for variable v2 if changing v1 changes v2
A variable v3 is an effect of variable v2 if changing v2 changes v3, while changing v3 does not change v2

Correlation and Causation

Correlation
- Knowledge of v1 provides information about v2
  - Eg: yellow fingers, cough, smoking, lung cancer
- Can be found by statistical analysis of any collected (observational) data
Causation
- Run an experiment (randomized controlled trial): the best way to show causality
  - From a sample (eg 1000 people), randomly assign:
    - 500 to the control condition
      - Eg continue smoking
    - 500 to the treatment/experimental condition
      - Eg stop smoking
  - If the outcomes of the two groups differ, that is evidence for causality
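
The random assignment step can be sketched as follows. The outcome rule `coughs` is entirely hypothetical and only exists to exercise the code; a real trial would measure outcomes and test whether the difference is statistically significant.

```python
import random

def randomized_trial(subjects, outcome, seed=0):
    """Randomly split subjects into two equal groups and compare outcome rates."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # random assignment removes selection bias
    half = len(shuffled) // 2
    control, treatment = shuffled[:half], shuffled[half:]
    control_rate = sum(outcome(s, False) for s in control) / len(control)
    treatment_rate = sum(outcome(s, True) for s in treatment) / len(treatment)
    return control_rate, treatment_rate

def coughs(subject_id, stopped_smoking):
    # Hypothetical, made-up outcome rule (not real data).
    return subject_id % 10 == 0 if stopped_smoking else subject_id % 2 == 0

control_rate, treatment_rate = randomized_trial(range(1000), coughs)
print(control_rate, treatment_rate)
```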

Latent Variables

Latent variables are variables that cannot be directly observed, only inferred through a model
- Eg DNA damage
- Eg Carbon monoxide inhalation
Latent variables can be hard to identify, even harder to learn automatically from data

Graphical Model

A graphical model, a graph that captures dependencies among variables, is an alternative approach

What is a Graphical Model

Graph that captures dependencies among variables
- Nodes are variables
- Links indicate dependencies
- Probabilities that represent how the dependencies work

Two Graphical Models

Bayesian networks (directed)
- Graph links have a direction
- Cycles are not allowed

Markov networks (undirected)
- Graph links do not have a direction
- Cycles are allowed

Bayesian Networks

A Bayesian network is a graph
- Directed edges show how variables influence others (No cycles allowed)
- Probability tables (one for each node) show the probability of the value of a variable given the values of its parent variables
- A variable depends directly only on its parent variables: given its parents, it is conditionally independent of its earlier ancestors

Causal Models

A causal model is a Bayesian network where all the relationships among variables are causal
Causal models represent how independent variables have an effect on dependent variables
Causal reasoning uses the probabilities in the causal model to make inferences about the value of variables given the values of others
- Eg: Given that the grass is wet, what is the probability that it rained?

Bayesian Inference

Bayesian inference is used to reason over a Bayesian network to determine the probabilities of some variables given some observed variables
- Eg: Given that the grass is wet, what is the probability that it has been raining?
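
For a two-node network (rain -> wet grass), the query above reduces to Bayes' rule. A small sketch follows; the three probabilities are assumed values chosen for illustration, not figures from these notes.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(H | E) from P(H), P(E | H) and P(E | not H), using Bayes' rule."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Assumed numbers: P(rain) = 0.2, P(wet | rain) = 0.9, P(wet | no rain) = 0.1
p_rain_given_wet = posterior(prior=0.2, likelihood=0.9, likelihood_if_not=0.1)
print(round(p_rain_given_wet, 3))
```

Larger networks need general inference algorithms (eg variable elimination), but each step still combines the probability tables in this way.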

Markov Networks

A Markov network is an undirected graphical model that includes a potential function for each clique of interconnected nodes
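
A minimal sketch of how potential functions define a distribution, assuming a chain A - B - C of binary variables with cliques {A, B} and {B, C}. The potential `phi` is a made-up example; the joint probability is the product of the clique potentials divided by a normalizing constant.

```python
from itertools import product

def phi(x, y):
    """Hypothetical pairwise potential: neighbouring nodes prefer to agree."""
    return 2.0 if x == y else 1.0

def joint_distribution():
    """Joint distribution over binary A - B - C from the two clique potentials."""
    scores = {(a, b, c): phi(a, b) * phi(b, c)
              for a, b, c in product([0, 1], repeat=3)}
    z = sum(scores.values())  # normalizing constant (partition function)
    return {state: score / z for state, score in scores.items()}

dist = joint_distribution()
print(dist[(0, 0, 0)], dist[(0, 1, 0)])
```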

Learning Graphical Models

Parameter Learning
- Learning the parameters (probabilities) of the model
Structure Learning
- Learning the structure of the model (like the structure of the graph)
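
Parameter learning for a known structure can be as simple as counting. The sketch below estimates one probability table, P(child = True | parent), from observations; the (rain, wet grass) samples are hypothetical.

```python
from collections import Counter

def learn_cpt(samples):
    """Estimate P(child=True | parent) from (parent, child) pairs by counting."""
    totals, positives = Counter(), Counter()
    for parent, child in samples:
        totals[parent] += 1
        positives[parent] += int(child)
    return {parent: positives[parent] / totals[parent] for parent in totals}

# Hypothetical observations of (rain, wet_grass)
samples = ([(True, True)] * 9 + [(True, False)] * 1 +
           [(False, True)] * 2 + [(False, False)] * 8)
cpt = learn_cpt(samples)
print(cpt)
```

Structure learning is harder: it has to search over possible graphs, not just fill in tables.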

Analyzing Network Data

Network Structure

Graphs of nodes connected by links
Nodes: entities of interest (eg, a person, a protein, a Web page…)
Links: relations between two nodes

Dynamic Networks – Representing Network Behavior

Behaviors of the entities over time
Network structure may also change over time
Link strength may change (weak links become strong, and vice versa)
Changes almost always have ripple effects on the rest of the network

Sources of Networked Data

Messages across people
Social network sites
Social media
Constructing networks from other data

Types of Networks

Homogeneous networks
Heterogeneous networks
Bipartite networks
Social networks

Homogeneous vs Heterogeneous Networks

Homogeneous networks: all nodes have the same type
Heterogeneous networks: nodes have different types

Bipartite networks

Bipartite networks: nodes have two types, links are between the two types of nodes

Social Networks

People as nodes, links represent interactions
Major issue: networks in social sites are often not publicly accessible

Random Network & Scale-Free Network

Random Network

Every node tends to have a similar number of connections
Links form at random

Scale-Free Network

Most nodes have few connections, and the number differs from node to node
A few nodes (hubs) have many connections
Examples: the Internet and the Web are scale-free networks, as are sexual partnership networks among humans

Cliques and Connected Components

A clique is a subgraph where all the nodes are connected to all the other nodes in the clique (every pair is directly linked)
A connected component is a subgraph where for any two nodes in the subgraph there is a path that connects them (at least one)
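
Both definitions can be checked directly on a small graph. A sketch, assuming the graph is stored as a dict mapping each node to the set of its neighbours (the example graph is made up):

```python
def is_clique(graph, nodes):
    """True if every pair of the given nodes is directly linked."""
    nodes = list(nodes)
    return all(v in graph[u] for i, u in enumerate(nodes) for v in nodes[i + 1:])

def connected_components(graph):
    """Group nodes so that any two nodes in a group are joined by some path."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first search from `start`
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
         "d": {"c"}, "e": {"f"}, "f": {"e"}}
components = connected_components(graph)
print(is_clique(graph, {"a", "b", "c"}), components)
```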

Bridge

A bridge is a link between two nodes that if removed would result in the nodes being in disconnected components of the graph
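
A simple (not the fastest) way to find bridges follows from the definition: drop each link in turn and check whether its two endpoints can still reach each other. The sketch assumes an undirected graph stored as a dict of neighbour sets; the example graph is made up.

```python
def reachable(graph, start, skip_edge):
    """Nodes reachable from `start` without crossing `skip_edge`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if {node, neighbour} != skip_edge and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

def bridges(graph):
    """Links whose removal disconnects their two endpoints."""
    edges = {frozenset((u, v)) for u in graph for v in graph[u]}
    found = []
    for edge in edges:
        u, v = sorted(edge)
        if v not in reachable(graph, u, edge):
            found.append((u, v))
    return sorted(found)

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(bridges(graph))
```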

Centrality

“Degree centrality” scores each node by its number of links; the node with the most links has the highest degree centrality
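
A short sketch of the computation, assuming a dict-of-neighbour-sets graph. Dividing by n - 1 (the maximum possible number of links for one node) is a common normalization, used here as an assumption.

```python
def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
scores = degree_centrality(graph)
most_central = max(scores, key=scores.get)
print(most_central, scores)
```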

Analyzing Time Series Data

Why Time Series Data

Forecasting: prediction of upcoming events
Signal or anomaly detection: get rid of noise and find unexpected patterns that might signal an event
Used in IoT systems to track system health and trends
Used in research to study questions that unfold over time

Collecting Time Series Data

Sources of time series data:
- Sensors: Environment, body, traffic…
- Economic: Government reports,…
- Commercial: Customers, products, usage,…
- Social: Activities, emails, tweets,…
Some terms
- Sampling rate: how often data is collected over time
- Granularity: period to generate a single, aggregated data point
- Adaptive sampling: Changing the sampling rate when something interesting is detected
- Streaming data: When data is collected and stored continuously
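
To illustrate the difference between sampling rate and granularity, the sketch below averages raw readings into coarser buckets. The readings (one every 20 seconds) and the 60-second granularity are assumed values.

```python
from collections import defaultdict

def aggregate(readings, granularity):
    """Average (timestamp, value) readings into buckets of `granularity` seconds."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[timestamp - timestamp % granularity].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Hypothetical sensor readings: (seconds since start, value)
readings = [(0, 10.0), (20, 12.0), (40, 14.0), (60, 20.0), (80, 22.0)]
per_minute = aggregate(readings, granularity=60)
print(per_minute)
```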

Pre-Processing Time Series Data

Data cleaning: correcting errors
Imputation: hypothesizing missing values (eg, filling in missing values with the average)
Rescaling: converting to another sampling rate or granularity
Decomposition: separating major components
- 3 components:
  - Trend: slowly changing over time
  - Seasonal: periodicity
  - Remainder: random noise, irregularities
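
A minimal sketch of an additive decomposition (series = trend + seasonal + remainder), assuming a known, odd seasonal period so that a simple centred moving average can serve as the trend. Real tools handle even periods and edge effects more carefully; the input series is synthetic.

```python
def decompose(series, period):
    """Additive decomposition; assumes an odd `period`. Edges of trend are None."""
    half, n = period // 2, len(series)
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period  # moving average
    # Seasonal component: average detrended value at each position in the cycle
    by_phase = []
    for phase in range(period):
        vals = [series[t] - trend[t] for t in range(n)
                if trend[t] is not None and t % period == phase]
        by_phase.append(sum(vals) / len(vals))
    seasonal = [by_phase[t % period] for t in range(n)]
    remainder = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder

# Synthetic series: linear trend plus a repeating pattern of period 3
series = [t + [1, 0, -1][t % 3] for t in range(9)]
trend, seasonal, remainder = decompose(series, period=3)
print(trend, seasonal[:3], remainder)
```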

Analyzing Time Series Data

Variable Tracking

Given a variable, account for its values over time
Get the Tracking coefficient
The variable may be unobservable, in which case it must be derived from observed variables (aka tracking a “latent” variable)

Alert Systems

Given a variable, track its values over time and generate an alert when it goes over a threshold
- The variable may not be directly observable; instead it could be derived from observed variables
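
A threshold alert can be sketched in a few lines. Raising one alert per crossing (instead of one per reading above the threshold) is a design choice assumed here; the readings and threshold are made up.

```python
def alerts(series, threshold):
    """Indices where the value rises above the threshold (one alert per crossing)."""
    fired, above = [], False
    for t, value in enumerate(series):
        if value > threshold and not above:
            fired.append(t)
        above = value > threshold
    return fired

# Hypothetical temperature readings with an alert threshold of 80
alert_times = alerts([70, 72, 81, 85, 79, 90], threshold=80)
print(alert_times)
```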

Event Detection

Given: an event pattern specified as a (set of) variable(s) and their value ranges over timeframes
Find: a match of the event pattern against the time series
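
One simple way to represent an event pattern for a single variable is a list of (low, high) value ranges, one per consecutive time step. This representation is an assumption for illustration; the sketch returns every position where the series matches.

```python
def find_events(series, pattern):
    """Start indices where consecutive values fall inside the pattern's ranges."""
    width = len(pattern)
    return [start for start in range(len(series) - width + 1)
            if all(low <= series[start + i] <= high
                   for i, (low, high) in enumerate(pattern))]

# Pattern: a value in [4, 7] followed immediately by a value in [8, 10]
matches = find_events([1, 5, 9, 2, 6, 9, 9], pattern=[(4, 7), (8, 10)])
print(matches)
```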

Detection of Trigger Events

Given: an event pattern composed of a set of variables and their value ranges over timeframes
Find: an event pattern that occurs earlier in the time series where there is correlation or causation with the given event pattern

Causality Detection

Given: a time series
Find: events that may be temporally related through a causal relation
A causal relation runs in only one direction: the cause comes before the effect

Granger Causality
- Time series X “Granger-causes” time series Y if past values of X can be used to predict future values of Y
  - above and beyond the information contained in past values of Y
- Example: X is rain and Y is umbrella sales. Spikes in rain are going to have a causal effect on umbrella purchases. If we ran a Granger causality analysis, it would find evidence for Granger causality between rain and umbrella purchases.
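
The idea behind the test can be sketched by fitting two least-squares models with one lag each and comparing their residual errors: one predicts Y from its own past only, the other also uses the past of X. This is a simplified illustration with made-up, noise-free data; a real analysis would use more lags and an F-test on the two errors.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(rows, targets):
    """Residual sum of squares of a least-squares fit (via normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_rss(x, y):
    """Error predicting y[t] from y[t-1] alone vs from y[t-1] and x[t-1]."""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    full = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    return rss(restricted, targets), rss(full, targets)

# Hypothetical data: umbrella sales are twice the previous day's rain
rain = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0, 3.0, 8.0, 1.0, 5.0]
umbrellas = [0.0] + [2.0 * r for r in rain[:-1]]
rss_restricted, rss_full = granger_rss(rain, umbrellas)
print(rss_restricted, rss_full)
```

A much smaller error for the full model is evidence that past rain helps predict umbrella sales beyond what past sales already tell us. Note that this shows predictive usefulness, which is weaker than a causal relation established by an experiment.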

Discovery of Unexpected Events

Given a time series, track its variables and report any unusual values or patterns
- System must have some definition of “unusual”
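
One possible definition of “unusual” is a value more than a chosen number of standard deviations from the mean (a z-score rule). The cut-off of 2 and the data below are assumptions for illustration; other definitions may suit other data better.

```python
from statistics import mean, stdev

def unusual(series, z=2.0):
    """Indices of values more than `z` sample standard deviations from the mean."""
    mu, sigma = mean(series), stdev(series)
    return [t for t, value in enumerate(series) if abs(value - mu) > z * sigma]

# Hypothetical readings with one obvious outlier
outliers = unusual([10, 11, 9, 10, 12, 10, 30, 11, 10, 9])
print(outliers)
```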

Pattern Mining

Given a time series, identify patterns that are:
- predictive of all or some of the variable values
- predictive of correlations among variables
- characteristic of the data (even those that are complex, non-periodic, irregular, chaotic)
