Data Transformation

flowchart TD
  A[Types] --> B[Scaling]
  A --> C[Standardisation]
  A --> D[Normalisation]

  classDef level1 fill:#add8e6,stroke:#333,stroke-width:1px,font-weight:normal;
  classDef level2 fill:#fffacd,stroke:#333,stroke-width:1px,font-weight:normal;

  class A level1;
  class B,C,D level2;

Scaling

Scaling changes the range of your feature data.

It’s crucial when features have different units or scales because many machine learning algorithms perform better or converge faster when features are on a relatively similar scale.

Min-Max Scaling

Min-max scaling transforms data to a specified range, typically 0 to 1.

# Rescale x linearly to the range [0, 1]; assumes max(x) > min(x)
min_max_scaling <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
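A minimal usage sketch with made-up values (the vector x below is hypothetical):

x <- c(2, 5, 10, 20)
min_max_scaling(x)
# [1] 0.0000000 0.1666667 0.4444444 1.0000000

The smallest value maps to 0 and the largest to 1.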

Max-Abs Scaling

Max-abs scaling scales the data to the range [-1, 1] by dividing by the maximum absolute value. It is useful for data that is already centred on zero and free of large outliers.

# Scale x into [-1, 1] by dividing by the largest absolute value;
# assumes x contains at least one non-zero value
max_abs_scaling <- function(x) {
  x / max(abs(x))
}
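As a quick check on hypothetical zero-centred data:

x <- c(-4, -1, 0, 2)
max_abs_scaling(x)
# [1] -1.00 -0.25  0.00  0.50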

Range Adjustment

Range adjustment rescales the data to an arbitrary interval [a, b], maintaining the relative distances between values.

# Rescale x linearly to the interval [a, b]; assumes max(x) > min(x)
range_adjustment <- function(x, a, b) {
  (b - a) * (x - min(x)) / (max(x) - min(x)) + a
}
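For example, mapping the same hypothetical values as above onto [-1, 1]:

x <- c(2, 5, 10, 20)
range_adjustment(x, -1, 1)
# [1] -1.0000000 -0.6666667 -0.1111111  1.0000000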

Standardisation

Standardisation is another scaling technique; it transforms a feature to have a mean of zero and a variance of one. Note that it only shifts and rescales the data without changing the shape of its distribution, so standardised data follows a standard normal distribution only if the original data was normally distributed.

Z-score Standardisation

Also known as the standard score, it involves subtracting the mean and dividing by the standard deviation.

# Centre x on zero and scale it to unit variance (z-score);
# assumes sd(x) > 0
z_score_standardisation <- function(x) {
  (x - mean(x)) / sd(x)
}
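A quick sanity check on hypothetical values, confirming the mean and standard deviation of the result:

x <- c(2, 5, 10, 20)
z <- z_score_standardisation(x)
round(mean(z), 10) # 0 (zero up to floating-point error)
sd(z)              # 1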

Normalisation

In this context, normalisation means adjusting the data so that each data point (or sample), rather than each feature, is brought to a common scale.

This prevents any single feature or data point from dominating others purely because of its scale, which matters for many machine learning and statistical algorithms.

Sum Normalisation ([0,1])

This divides each value by the sum of all values, so the transformed values sum to one (and, for non-negative data, each value lies in [0, 1]). Unlike min-max scaling, which rescales a feature across observations, this is typically applied within an individual data point.

# Express each value as a proportion of the total (values sum to 1);
# assumes sum(x) != 0
range_normalisation <- function(x) {
  x / sum(x)
}
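A brief, hypothetical example showing that the result is a set of proportions:

x <- c(2, 5, 10, 20)
p <- range_normalisation(x)
round(p, 3) # 0.054 0.135 0.270 0.541
sum(p)      # 1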

Vector Normalisation (Unit Length)

Introduction

Vector normalisation adjusts the length of a vector so that it becomes exactly one unit long while retaining its direction.

This is useful in data science and statistics, especially in algorithms where the scale of the data impacts the outcome (e.g., in clustering or principal component analysis).

The reason for normalising a vector to unit length is to remove the influence of the magnitude of the data, allowing different data points to be compared based solely on their direction. This can help identify patterns where the direction of the data is more informative than its size or length.

Process

1. Calculate the magnitude (norm) of the vector, which is the square root of the sum of the squared components of the vector.
2. Divide each component of the vector by its magnitude.

This scales the length of the vector to 1, known as a unit vector, without altering its direction.

# Scale x to unit Euclidean length (L2 norm);
# assumes x is not the zero vector
vector_normalisation <- function(x) {
  x / sqrt(sum(x^2))
}
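A minimal check with made-up vectors: two vectors pointing in the same direction normalise to the same unit vector, illustrating that the magnitude has been removed and only the direction remains.

v1 <- c(3, 4)
v2 <- c(30, 40) # same direction, ten times the magnitude
vector_normalisation(v1)
# [1] 0.6 0.8
vector_normalisation(v2)
# [1] 0.6 0.8
sqrt(sum(vector_normalisation(v1)^2))
# [1] 1 (unit length)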

Additional Technique

Logarithmic Normalisation

This method can help manage right-skewed data: the logarithm compresses large values more than small ones, transforming the data into a more Gaussian-like distribution.

# log1p(x) computes log(1 + x); it is used for numerical stability
# with small x and remains defined at x = 0
logarithmic_normalisation <- function(x) {
  log1p(x)
}
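A short, hypothetical illustration on right-skewed values (note that log1p handles zeros, where a plain log would return -Inf):

x <- c(0, 9, 99, 999)
logarithmic_normalisation(x)
# [1] 0.000000 2.302585 4.605170 6.907755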

Shiny app - data_transformation.r