Module 2
Data Quality Assessment
Data quality
assessment is the process of evaluating the accuracy, completeness,
reliability, and overall suitability of data for a specific purpose or use
case.
Ensuring data
quality is crucial because poor-quality data can lead to incorrect insights,
flawed decision-making, and wasted resources.
Here are the key
steps and considerations involved in data quality assessment:
Define Data Quality Objectives:
Clearly define the goals and objectives of your data quality assessment. What are you trying to achieve with it?
Data Profiling:
Conduct data profiling to gain a better understanding of your data. This involves
summarizing key statistics and characteristics of the data, such as data types,
distributions, and value ranges.
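A minimal profiling sketch in Pandas (the DataFrame and column names here are illustrative, not from the course dataset):
import pandas as pd
# Hypothetical example data; replace with your own DataFrame
df = pd.DataFrame({'Age': [25, 32, None, 41], 'City': ['NY', 'LA', 'NY', None]})
print(df.dtypes)                                # data types of each column
print(df.describe())                            # summary statistics for numeric columns
print(df['City'].value_counts(dropna=False))    # value distribution of a categorical column
print(df.isnull().mean())                       # fraction of missing values per column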
Data Accuracy:
Evaluate the
accuracy of the data by comparing it to trusted sources or known standards.
Check for errors, inconsistencies, and outliers in the data.
Data
Completeness:
Assess whether the
data is complete and whether it contains all the required information for your
analysis or application. Look for missing values and determine how to handle
them.
Data
Consistency:
Ensure that the
data is consistent within itself and across different data sources or datasets.
Check for naming conventions, data formats,
and data definitions.
Data
Reliability:
Evaluate the
reliability of the data sources and the methods used to collect and process the
data. Consider the reputation and trustworthiness of the sources.
Data Relevance:
Determine if the
data is relevant to your specific use case. Irrelevant or outdated data can
lead to incorrect conclusions.
Data Timeliness:
Assess whether the
data is up-to-date and whether it meets your required timeframes. Outdated data
may not reflect current conditions.
Data Integrity:
Ensure that data
has not been tampered with or corrupted during storage or transmission.
Implement data integrity checks, such as checksums or hashing.
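As a sketch of such a check, the standard-library hashlib module can compute a checksum for a data file; the file path below is a placeholder:
import hashlib
def file_checksum(path):
    # Read the file in chunks and compute its SHA-256 hash
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
# Compare the computed checksum with the one published by the data provider
# print(file_checksum("employees.csv"))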
Data
Documentation:
Maintain
comprehensive documentation of the data, including metadata, data lineage, and
data transformation processes. This documentation helps users understand the
data's context and history.
Data Quality
Metrics:
Define specific
data quality metrics and thresholds that align with your objectives. Common
metrics include accuracy, completeness, consistency, and reliability.
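A small sketch of computing such metrics with Pandas (the columns, the plausible age range, and the 95% threshold are illustrative assumptions):
import pandas as pd
df = pd.DataFrame({'Age': [25, None, 40, 200], 'Gender': ['M', 'F', None, 'F']})
completeness = 1 - df.isnull().mean()            # share of non-missing values per column
age_validity = df['Age'].between(0, 120).mean()  # share of Age values in a plausible range (NaN counts as invalid)
print(completeness)
print("Age validity:", age_validity)
print("Completeness threshold met:", (completeness >= 0.95).all())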
Data Cleaning:
If you identify
data quality issues, implement data cleaning and data transformation processes
to correct or remove problematic data points.
Data Validation
and Verification:
Develop validation
and verification procedures to ensure that data quality is maintained over
time. Regularly monitor and validate data as it is collected or updated.
Data Quality
Tools:
Utilize data
quality tools and software to automate data quality checks and validations.
These tools can help streamline the assessment process.
Data Governance:
Establish data governance policies and procedures to keep data quality standards consistent across the organization. Continuously work on improving data quality by addressing identified issues, refining data collection processes, and enhancing data quality controls.
Data quality
assessment is an ongoing process that should be integrated into the data
management lifecycle. It requires collaboration between data analysts, data
engineers, data scientists, and business stakeholders to ensure that data is
fit for its intended purpose. Regularly monitoring and improving data quality
is essential for data-driven decision-making and successful business outcomes.
Handling Missing Values
Program 1: It will show the features which have null values
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

# creating subset of dataset, bool series with Gender value NaN
# filtering data
bool_series = pd.isnull(data["Gender"])
missing_values = pd.isnull(data["Gender"]).sum()
print(missing_values)

# displaying data only with Gender = NaN
data[bool_series]

# display the details of dataset
data.info()
output:
Program 2:
# Heatmap Visualization of Null value data
import seaborn as sns
import matplotlib.pyplot as plt

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

print(data.isnull())
sns.heatmap(data.isnull(), cbar=True, cmap='viridis')
plt.show()
Description:
The code uses Seaborn
and Matplotlib to create a heatmap visualization of missing values in a
DataFrame named data. Here's a breakdown of what the code does:
import seaborn as sns and import matplotlib.pyplot as plt:
These lines import the Seaborn and Matplotlib libraries, which are used for
data visualization.
sns.heatmap(data.isnull(), cbar=True, cmap='viridis'):
data.isnull(): This part of the code creates a DataFrame of the same shape as your
original DataFrame data, where each cell is True if the corresponding cell in
data is missing (i.e., contains a NaN or None value) and False otherwise.
sns.heatmap(): This function from Seaborn is used to create a heatmap. It takes
several parameters:
data.isnull(): The DataFrame of boolean values indicating missing values.
cbar=True: This parameter specifies whether to show the colorbar. Here it is set to True so the colorbar is displayed; set it to False to hide it.
cmap='viridis': This parameter specifies the colormap to use for coloring the heatmap.
'viridis' is a popular choice, but you can choose other colormaps as well.
plt.show(): This line displays the heatmap using Matplotlib.
Program 3: It will show the features which do not have any null values.
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

# creating bool series True for non-NaN values
bool_series = pd.notnull(data["Gender"])

# display the number of not-null values of a specific feature
without_missing_values = pd.notnull(data["Gender"]).sum()
print(without_missing_values)

# filtering data
# displaying data only with Gender = Not NaN
data[bool_series]

# display the details of dataset
data.info()
Output: As
shown in the output image, only the rows having Gender = NOT NULL are
displayed.
Handling Missing Values
Once
you've identified the missing values, you can choose one or more of the
following methods to handle them:
1. Removing Rows with Missing Values
If the
missing values are relatively few and won't significantly affect your analysis:
Program 4: Dropping Rows with at least 1 null value in CSV file
# importing pandas module
import pandas as pd

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
new_data

print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value: ", (len(data)-len(new_data)))
Description: Using Dropna
function. data: This is assumed to be a Pandas DataFrame containing your data.
.dropna(axis=0, how='any'): This is a DataFrame method
that removes rows containing missing values from the original DataFrame.
axis=0: This parameter specifies that you want to remove
rows (axis 0) containing missing values. Rows are dropped when at least one NaN
(missing value) is found in any of the row's cells.
how='any': This parameter specifies that you want to drop
rows if any missing value is found in the row. In other words, if any cell in a
row contains a NaN, that entire row will be removed from the DataFrame.
2. Filling
Missing Values
You can
fill missing values with a specific value, such as the mean, median, or a
constant
Program 5: Filling null values in CSV File
# importing pandas package
import pandas as pd
import numpy as np   # needed for np.nan in the replace() example below

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

# Printing rows 10 to 24 of the data frame for visualization
data[10:25]

# Fill missing values with the mean/mode/median of the column
#data['Bonus %'].fillna(data['Bonus %'].mean(), inplace=True)

# Fill missing values with a constant
#data.fillna(0, inplace=True)

# will replace NaN values in the dataframe with the value -99
#print(data.replace(to_replace = np.nan, value = -99))
3. Filling Categorical Missing Values
For categorical variables, you can fill missing values with the most frequent category or a specific label:
Program 6:
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

# filling null values using fillna() with a specific value
data["Gender"].fillna("No Gender", inplace = True)
missing_values = pd.isnull(data["Gender"]).sum()
print(missing_values)
Program 7:
# importing pandas package
import pandas as pd

# reading data from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/employees.csv")

# Fill missing categorical values with the most frequent category
data['Gender'].fillna(data['Gender'].mode()[0], inplace=True)
missing_values = pd.isnull(data["Gender"]).sum()
print(missing_values)
4. Handling Time Series Data
When dealing with time series data, you may need to handle missing values differently. Methods like forward fill, backward fill, and interpolation are often appropriate.
Forward fill
(ffill) and backward fill (bfill) are techniques used to fill missing values in
a DataFrame by propagating values from neighboring rows in a specific
direction. These methods are often used when dealing with time-series data or
any data where values have a natural order or sequence.
You can choose between forward fill and
backward fill based on your specific needs and the nature of your data. These
methods are useful for preserving temporal or sequential patterns in your data
when filling missing values.
Here's how
to use forward fill and backward fill in Pandas:
Forward
Fill (ffill):
Forward
fill replaces missing values with the most recent non-missing value in the
column. It propagates values forward in the column.
Program 8:
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
data = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6]})

# Forward fill missing values in column 'A'
data['A'].fillna(method='ffill', inplace=True)

# Display the DataFrame after forward fill
print(data)
Backward
Fill (bfill):
Backward fill replaces missing
values with the next non-missing value in the column. It propagates values
backward in the column.
Program 9:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6]})
# Backward fill missing values in column 'A'
data['A'].fillna(method='bfill', inplace=True)
# Display the DataFrame after backward fill
print(data)
Interpolation
Interpolation is a method for filling
missing values in a DataFrame by estimating them based on the values of
adjacent data points. It can be a useful technique when dealing with
time-series or continuous data where the order of data points is significant.
In Python, you can perform interpolation using the interpolate() method in Pandas.
Here's how to use interpolation to
fill missing values in a DataFrame:
Program 10:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan, 6]})
# Interpolate missing values in column 'A' using linear interpolation
data['A'].interpolate(method='linear', inplace=True)
# Display the DataFrame after interpolation
print(data)
output:
     A
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0
5  6.0
In this example, we have a DataFrame
with missing values in column 'A'. We use the interpolate() method with the method='linear' parameter to perform linear interpolation
specifically on column 'A'. The inplace=True
parameter modifies the DataFrame in place.
5. Imputation with Machine Learning Models
For more
advanced scenarios, you can use machine learning models like k-Nearest
Neighbors (KNN) or regression to impute missing values based on other features
in your dataset.
After running the code below, the DataFrame will
contain imputed values for missing data, where each missing value has been
replaced by an estimated value based on the two nearest neighbors in the
feature space. This imputation technique can be useful when you want to
preserve the overall structure and relationships in your data while filling in
missing values.
Program 11:
# importing pandas package
import pandas as pd
from sklearn.impute import KNNImputer

# making data frame from csv file
data = pd.read_csv("C:/Users/sraba/OneDrive/Desktop/applied data science/module2/salary.csv")
data.info()

imputer = KNNImputer(n_neighbors=2)
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
data.info()
The code uses scikit-learn's KNNImputer to impute missing values in the Pandas DataFrame data using a k-Nearest Neighbors (KNN) imputation approach. Here's a breakdown of the code:
from sklearn.impute import KNNImputer: This line imports the KNNImputer class from scikit-learn's impute module. The KNNImputer is a machine learning-based imputation method that replaces missing values with estimates based on the values of their nearest neighbors.
imputer = KNNImputer(n_neighbors=2): This line creates an instance of the KNNImputer class with n_neighbors set to 2. This means that it will consider the values of the two nearest neighbors to impute missing values.
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns): This line applies the KNN imputation to the DataFrame data and replaces the original DataFrame with the imputed one. Here's what's happening within this line:
imputer.fit_transform(data): This fits the KNNImputer model to the data and then transforms the DataFrame to impute missing values. The fit_transform method returns a NumPy array with imputed values.
pd.DataFrame(...): It converts the NumPy array with imputed values back into a Pandas DataFrame.
columns=data.columns: This sets the column names of the new DataFrame to be the same as the original DataFrame.
Using
LinearRegression
Regression-based
imputation is a technique that uses regression models to predict missing values
in a dataset based on the relationships between the missing variable and other
variables in the dataset. In Python, you can use various regression models from
libraries like Scikit-Learn or StatsModels for this purpose.
Here's an example using Scikit-Learn's
LinearRegression for regression-based imputation:
Program 12:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample dataset with missing values
data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Target': [2, 4, 6, 8, 10, None, None, None, None, None]
}
df = pd.DataFrame(data)
print(df)

# Separate rows with and without missing values
# (.copy() avoids a SettingWithCopyWarning when filling the predictions later)
df_missing = df[df['Target'].isna()].copy()
df_not_missing = df[df['Target'].notna()]

# Create a linear regression model
model = LinearRegression()

# Fit the model on data with no missing values
model.fit(df_not_missing[['Feature1']], df_not_missing['Target'])

# Predict missing values
predicted_values = model.predict(df_missing[['Feature1']])

# Fill missing values with predicted values
df_missing['Target'] = predicted_values

# Combine the filled and non-missing rows
df = pd.concat([df_not_missing, df_missing], ignore_index=True)
print(df)
In this
example:
We have a
dataset with a missing target variable ('Target').
We split the
dataset into two parts: one with missing values (df_missing) and one without
missing values (df_not_missing).
We create a
linear regression model and fit it using the available data in df_not_missing.
We then use
the trained model to predict missing values in df_missing.
Finally, we
combine the filled rows and non-missing rows to obtain the complete dataset.
Data imputation methods:
Data
imputation methods are techniques used to fill in missing values in a dataset.
Handling missing data is a crucial step in data preprocessing because many
machine learning algorithms and statistical analyses require complete datasets.
Here are some common data imputation methods:
Mean/Median/Mode
Imputation:
Mean
Imputation: Fill missing values with the mean (average) of the non-missing
values in the column.
Median
Imputation: Fill missing values with the median value of the non-missing values
in the column. This is less sensitive to outliers than the mean.
Mode
Imputation: Fill missing values with the mode (most frequent value) of the
column for categorical data.
Forward
Fill and Backward Fill:
Forward Fill
(ffill): Replace missing values with the most recent non-missing value before
them (propagating values forward).
Backward
Fill (bfill): Replace missing values with the next non-missing value after them
(propagating values backward). These methods are useful for time-series data.
Interpolation:
Linear
Interpolation: Estimate missing values based on a linear relationship between
neighboring data points.
Polynomial
Interpolation: Estimate values using polynomial functions fitted to the
surrounding data points.
Spline
Interpolation: Use piecewise-defined polynomials or splines to fill missing
values.
Time-based
Interpolation: Interpolate values based on time intervals or timestamps.
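A brief sketch of these options in Pandas (the series below is illustrative; the polynomial and spline methods require SciPy to be installed):
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, 9.0, np.nan, 25.0])
print(s.interpolate(method='linear'))               # straight-line estimates between neighbors
print(s.interpolate(method='polynomial', order=2))  # quadratic fit through the observed points
# Time-based interpolation weights the estimates by the spacing of a DatetimeIndex
ts = pd.Series([10.0, np.nan, 30.0],
               index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04']))
print(ts.interpolate(method='time'))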
K-Nearest
Neighbors (KNN) Imputation:
Estimate
missing values by averaging or weighting values from the k-nearest neighbors in
a multidimensional space.
Regression
Imputation:
Use
regression models to predict missing values based on relationships with other
variables in the dataset.
Multiple
Imputation:
Generate
multiple complete datasets with imputed values and then perform analyses on
each dataset. This accounts for uncertainty in imputed values and can provide
more robust results.
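One way to sketch this in scikit-learn is to run IterativeImputer several times with posterior sampling and different random seeds, producing multiple completed datasets (a full multiple-imputation workflow would also pool the downstream estimates; the data here is illustrative):
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables IterativeImputer
from sklearn.impute import IterativeImputer
df = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0, 5.0],
                   'B': [2.0, np.nan, 6.0, 8.0, 10.0]})
imputed_sets = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_sets.append(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))
# Each element of imputed_sets is one completed dataset
print(imputed_sets[0])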
Deep
Learning Imputation:
Train neural
networks or deep learning models to predict missing values based on the
relationships in the data. Techniques like autoencoders can be used for this
purpose.
Domain-specific
Imputation:
Use domain
knowledge or specific rules to impute missing values. For example, imputing
missing age values based on common age-group characteristics.
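A small sketch of a rule-based fill (the columns and the "median age per job level" rule are assumptions for illustration):
import pandas as pd
import numpy as np
df = pd.DataFrame({'Job_Level': ['Junior', 'Junior', 'Senior', 'Senior', 'Senior'],
                   'Age': [24, np.nan, 45, 51, np.nan]})
# Fill missing Age values with the median age of the same Job_Level group
df['Age'] = df.groupby('Job_Level')['Age'].transform(lambda s: s.fillna(s.median()))
print(df)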
Random
Sampling Imputation:
Fill missing
values by randomly selecting values from the observed data. This can introduce
some randomness but is useful in certain scenarios.
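A possible Pandas sketch of random sampling imputation (the series is illustrative):
import pandas as pd
import numpy as np
s = pd.Series([10, 20, np.nan, 40, np.nan, 60])
# Draw random values from the observed data and place them at the missing positions
observed = s.dropna()
fill_values = observed.sample(s.isna().sum(), replace=True, random_state=0)
fill_values.index = s[s.isna()].index
print(s.fillna(fill_values))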
Listwise
Deletion:
Discard rows
with missing values entirely. This should be used cautiously, as it can lead to
a loss of information and potentially biased results.
The choice
of imputation method depends on the nature of your data, the underlying
assumptions, and the specific goals of your analysis. It's essential to
understand the characteristics of your data and carefully consider the
potential impact of the chosen imputation method on your results. It's also
good practice to document and report any imputation methods used in your data
analysis to maintain transparency.
Feature
Aggregation:
Feature
aggregation is a data preprocessing technique used in machine learning and data
analysis to reduce the dimensionality of a dataset by combining or summarizing
multiple related features (variables) into a single feature or a set of new
features. Feature aggregation is often employed to simplify complex datasets,
improve model performance, and reduce computational complexity.
When
performing feature aggregation, it's essential to consider the domain
knowledge, the goals of the analysis or modelling task, and the potential
impact on the interpretability of the data. Careful feature selection and
aggregation can lead to more efficient and accurate machine learning models
while avoiding information loss. However, inappropriate aggregation can
introduce biases or obscure valuable insights, so it should be done
judiciously.
Here are
some common methods and scenarios where feature aggregation is applied:
Summation
or Aggregation Functions:
Calculate
the sum, mean, median, maximum, minimum, or other statistical measures of a set
of related features.
For example,
if you have daily sales data, you can aggregate it into monthly or yearly
totals.
Time-Based
Aggregation:
Aggregate
time-series data into different time intervals (e.g., hourly, daily, weekly) to
capture patterns at different levels of granularity. This is useful for trend
analysis and forecasting.
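A short sketch illustrating both of the ideas above with hypothetical daily sales data (groupby for aggregation functions, resample for time-based aggregation):
import pandas as pd
import numpy as np
dates = pd.date_range('2023-01-01', periods=90, freq='D')
sales = pd.DataFrame({'store': np.random.choice(['A', 'B'], size=90),
                      'revenue': np.random.rand(90) * 100},
                     index=dates)
# Aggregation functions: total, average and maximum revenue per store
print(sales.groupby('store')['revenue'].agg(['sum', 'mean', 'max']))
# Time-based aggregation: roll daily revenue up to monthly totals
print(sales['revenue'].resample('M').sum())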
Categorical
Feature Aggregation:
Combine
categories within a categorical feature to create broader categories or reduce
the number of distinct values. For example, group rare categories into an
"Other" category.
Feature
Engineering:
Create new
features by combining or transforming existing features to capture specific
relationships or interactions in the data. Feature engineering can include
operations like adding ratios, creating interaction terms, or using
mathematical functions.
Text Data
Aggregation:
In natural
language processing (NLP) tasks, aggregate text data by calculating various
statistics such as term frequency-inverse document frequency (TF-IDF) scores,
word embeddings, or topic modelling representations.
Feature
Selection and Dimensionality Reduction:
Use feature
aggregation as a dimensionality reduction technique to reduce the number of
features in high-dimensional datasets. Dimensionality reduction can help
prevent overfitting and reduce computational complexity.
Image and
Sensor Data Aggregation:
Aggregate
pixel values or sensor readings within regions of interest to extract
meaningful features from images or sensor data. For example, compute histograms
or texture features from image regions.
Feature
Scaling and Normalization:
Combine
feature scaling and normalization techniques to ensure that features are on a
consistent scale. Aggregating features and then scaling them can help in some
cases.
Principal
Component Analysis (PCA):
PCA can be
considered a form of feature aggregation that combines the original features
into orthogonal components while preserving as much variance as possible. It's
a powerful dimensionality reduction technique.
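A brief scikit-learn sketch of PCA as feature aggregation (the data is synthetic and the choice of two components is arbitrary):
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two features correlated
pca = PCA(n_components=2)              # aggregate 5 features into 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance preserved by each component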
Feature
Extraction in Deep Learning:
In deep
learning, feature aggregation layers, such as pooling layers in convolutional
neural networks (CNNs), aggregate features within local regions of input data
to reduce spatial dimensions while retaining essential information.
Feature
encoding:
Feature encoding is a crucial step in preparing data for machine learning models, especially when dealing with categorical variables.
Encoding categorical variables means converting them into numerical representations that machine learning algorithms can work with.
Python provides several methods and libraries for feature encoding.
A few common techniques are described below; they can be implemented using libraries like Pandas and Scikit-Learn.
Here's how
you can perform feature encoding in Python:
1. Label Encoding: Label encoding assigns a unique
integer to each category in a categorical variable. In this example, 'Red' gets
encoded as 2, 'Green' as 1, and 'Blue' as 0.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']}
df = pd.DataFrame(data)
print(df)
label_encoder = LabelEncoder()
df['Color_encoded'] = label_encoder.fit_transform(df['Color'])
print(df['Color_encoded'])
2. One-Hot Encoding: One-hot encoding creates binary
columns for each category in the categorical variable, where each column
represents the presence (1) or absence (0) of a category. This will create
three columns: 'Color_Red', 'Color_Green', and 'Color_Blue', where each row
will have 1 in the corresponding column for the color and 0 in the others.
import pandas as pd
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']}
df = pd.DataFrame(data)
print(df)
df_encoded = pd.get_dummies(df, columns=['Color'], prefix=['Color'])
print(df_encoded)
3. Custom Encoding: You can also perform custom
encoding for categorical variables if needed, especially if the variable has
some ordinal relationship. For example, you might assign values based on the
order of importance.
This custom encoding maps 'Small' to 1, 'Medium' to 2, and 'Large' to 3.
import pandas as pd
data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_encoded'] = df['Size'].map(size_mapping)
print(df['Size_encoded'])
4. Binary Encoding: Binary encoding is an efficient
method for encoding categorical
features by converting the categories into binary code.
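A plain-Pandas sketch of the idea: map each category to an integer code and write that code out as binary digit columns (dedicated libraries such as category_encoders provide a ready-made BinaryEncoder; the data here is illustrative):
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Integer codes assigned alphabetically: Blue=0, Green=1, Red=2
codes = df['Color'].astype('category').cat.codes.to_numpy()
n_bits = max(int(codes.max()).bit_length(), 1)
for bit in range(n_bits):
    df[f'Color_bin_{bit}'] = (codes >> bit) & 1
print(df)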
5. Count Encoding: Count encoding replaces each category
with the count of occurrences in the dataset. It's useful when the frequency of
each category is informative.
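A minimal sketch of count encoding with Pandas (illustrative data):
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Replace each category with how often it occurs in the dataset
df['Color_count'] = df['Color'].map(df['Color'].value_counts())
print(df)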
6. Target Encoding (Mean Encoding): Target encoding is used for encoding
categorical features based on the mean of the target variable within each
category.
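A minimal sketch of target encoding (the 'Purchased' target column is a hypothetical example). Note that in practice the category means should be computed on training data only, or with cross-validation, to avoid target leakage:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green'],
                   'Purchased': [1, 0, 0, 1, 1]})
# Replace each category with the mean of the target within that category
category_means = df.groupby('Color')['Purchased'].mean()
df['Color_target_enc'] = df['Color'].map(category_means)
print(df)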
7. Feature Hashing: Feature hashing, also known as the hashing trick, is a technique used for dimensionality reduction and feature encoding in machine learning.
It's particularly useful when you have a large
number of categorical features, and you want to convert them into a fixed
number of numerical features, reducing the dimensionality of the data.
Feature hashing works by
applying a hash function to the original features and then using the hash
values as new features.
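A short sketch using scikit-learn's FeatureHasher (the choice of 8 output features is arbitrary):
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Hash each category string into a fixed number of numeric columns
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[c] for c in df['Color']])
print(hashed.toarray())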
Remember that the choice of encoding method depends on the nature of your
data and the machine learning algorithm you intend to use.
Label encoding may introduce
unintended ordinal relationships, so it's generally safer to use one-hot
encoding for non-ordinal categorical variables. Custom encoding can be useful
when you have domain knowledge that suggests a specific encoding scheme.
Additionally, Scikit-Learn provides tools like LabelEncoder and
OneHotEncoder for these purposes, making it convenient to integrate feature
encoding into your machine learning pipelines.
Normalization techniques in python
Normalization, also known as feature
scaling, is a crucial preprocessing step in data science and machine learning.
It involves scaling numerical
features to a standard range to ensure that they have similar magnitudes.
Normalization helps prevent certain
features from dominating others and can lead to more stable and effective
machine learning models.
Choose the
appropriate normalization technique based on the nature of your data and the
requirements of your machine learning algorithm.
Some
algorithms, like k-means clustering, are sensitive to the scale of features and
may require standardization, while others, like decision trees or random
forests, are less affected by feature scaling.
Three common techniques for normalization in Python are Min-Max scaling, Z-score scaling (standardization), and custom scaling.
1.
Min-Max Scaling:
Min-Max
scaling scales features to a specific range, typically [0, 1]. It's especially
useful when you want to transform your data into a bounded range.
After applying Min-Max scaling, both features will be in the range [0, 1].
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = {'Feature1': [10, 20, 30, 40],
        'Feature2': [1, 2, 3, 4]}
df = pd.DataFrame(data)
print(df)

# Create a MinMaxScaler instance
scaler = MinMaxScaler()

# Fit the scaler to your data and transform it
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])
print(df[['Feature1', 'Feature2']])
2.
Z-Score Scaling (Standardization):
Z-score
scaling (Standardization) scales features to have a mean of 0 and a standard
deviation of 1. It's particularly useful when your data follows a normal
distribution.
After
applying Z-score scaling, both features will have a mean of 0 and a standard
deviation of 1.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample data
data = {'Feature1': [10, 20, 30, 40],
        'Feature2': [1, 2, 3, 4]}
df = pd.DataFrame(data)
print(df)

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the scaler to your data and transform it
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])
print(df[['Feature1', 'Feature2']])
3. Custom
Scaling
You can also
perform custom scaling if needed, depending on the specific requirements of
your dataset and analysis. This may involve scaling to a different range or
using domain-specific knowledge to determine the scaling factor.
Custom
scaling allows you to define the scaling logic based on your domain knowledge
or specific data characteristics.
import pandas as pd

# Sample data
data = {'Feature1': [10, 20, 30, 40],
        'Feature2': [1, 2, 3, 4]}
df = pd.DataFrame(data)
print(df)

# Custom scaling (for example, scaling to a range of [0, 100])
df['Feature1'] = (df['Feature1'] - df['Feature1'].min()) / (df['Feature1'].max() - df['Feature1'].min()) * 100
print(df[['Feature1', 'Feature2']])
Standardization
is commonly used when the distribution of features is not known, while Min-Max
scaling is suitable when you want to map features to a specific range. Custom
scaling can be useful when you have domain-specific constraints or knowledge
about the data.
Data Visualization using matplotlib
In 2002, John Hunter introduced Matplotlib.
It is a
popular Python library for creating static, animated, and interactive
visualizations in various formats.
It provides a wide range of tools for creating
plots and charts, making it a valuable tool for data visualization and
scientific computing.
Matplotlib
offers many additional features, such as 3D plotting, annotations, and
interactive widgets when used in Jupyter Notebooks.
Matplotlib
works well with other Python libraries like Seaborn and Pandas for enhanced
data visualization capabilities.
Steps to use Matplotlib
Installation: Use pip to install Matplotlib
pip install matplotlib
Import
Matplotlib: Import
the library in your Python script or Jupyter Notebook
import matplotlib.pyplot as plt
Different
Plot Types: Matplotlib
supports various plot types, including scatter plots, bar plots, histograms,
pie charts, and more.
You can choose the appropriate type depending
on your data and visualization needs.
Code to do a simple line chart
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()
Subplots: You can create multiple subplots
within a single figure using plt.subplot() or plt.subplots().
plt.subplot(2, 2, 4) means:
nrows = 2: There are 2 rows in the grid.
ncols = 2: There are 2 columns in the grid.
index = 4: You are creating a subplot in the fourth position (bottom-right) of the 2x2 grid.

plt.subplot(2, 2, 1)
plt.plot(x, y)
plt.title('Subplot 1')

plt.subplot(2, 2, 2)
plt.scatter(x, y)
plt.title('Subplot 2')

plt.subplot(2, 2, 3)
plt.bar(x, y)
plt.title('Subplot 3')

plt.subplot(2, 2, 4)
plt.hist(y, bins=5)
plt.title('Subplot 4')

plt.tight_layout()  # Ensures proper spacing
plt.show()
Saving
Plots: You can save
your plots to various file formats (e.g., PNG, PDF, SVG) using plt.savefig().
plt.savefig('my_plot.png')
Functions
or Submodules:
Matplotlib
is organized into several submodules, each of which serves a specific purpose
or provides a particular set of functionality.
matplotlib.pyplot
(pyplot): This
submodule provides a high-level interface for creating and customizing plots
and charts.
It is commonly used for most basic plotting
tasks and simplifies the process of creating visualizations.
matplotlib.figure: The figure module is responsible for
creating and managing figure objects, which serve as the top-level container
for all the elements of a plot.
matplotlib.axes: The axes module contains classes for
creating and customizing the coordinate systems (i.e., the x and y axes) within
a figure. You can create subplots and customize them using this submodule.
matplotlib.axis: This submodule handles the
properties and formatting of axis objects, including tick marks, labels, and
scales.
matplotlib.lines: The lines module provides classes
and functions for working with line plots. It allows you to create and
customize lines in your plots.
matplotlib.patches: The patches module offers various
geometric shapes (e.g., rectangles, circles, polygons) that you can use to
annotate or highlight regions in your plots.
matplotlib.text: This submodule deals with adding
text annotations to your plots. It includes classes for working with text
elements and labels.
matplotlib.image:
The image module is
used for displaying and manipulating images in Matplotlib plots.
matplotlib.colors: This submodule provides classes and
functions for working with colors, including colormaps and color conversions.
matplotlib.colorbar: The colorbar module allows you to
add colorbars to your plots, which are useful for interpreting the color
mapping in various types of plots.
matplotlib.ticker: The ticker module is responsible for
controlling the formatting of tick locations and labels on the axes.
matplotlib.gridspec: This submodule provides a flexible
way to create complex subplot layouts.
matplotlib.transforms: The transforms module is used for
coordinate transformations and is often used internally by Matplotlib.
matplotlib.widgets: The widgets module allows you to
add interactive widgets to your Matplotlib figures in Jupyter notebooks and
other interactive environments.
matplotlib.backends: The backends module contains code
for handling different rendering backends, such as rendering to a window,
saving to an image file, or embedding plots in GUI applications.
import matplotlib
matplotlib.use('TkAgg')
matplotlib.dates:
This submodule is
used for working with date and time data in Matplotlib.
mpl_toolkits:
This is not a single
module but a directory containing various toolkit submodules. One of the
commonly used toolkits is mpl_toolkits.mplot3d, which provides 3D plotting
functionality.
Descriptive Statistics
Descriptive statistics are measures that summarize important
features of data, often with a single number. Producing descriptive statistics
is a common first step to take after cleaning and preparing a data set for
analysis.
Measures of
center
Measures of center are statistics that give us a sense of the
"middle" value of a numeric variable.
Common measures of center include the
mean, median and mode.
The mean is simply an average: the sum
of the values divided by the total number of records. We can use df.mean() to
get the mean of each column in a DataFrame.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

mtcars = pd.read_csv("../input/mtcars/mtcars.csv")
mtcars = mtcars.rename(columns={'Unnamed: 0': 'model'})
mtcars.index = mtcars.model
del mtcars["model"]

mtcars.head()
                    mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
model
Mazda RX4          21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag      21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710         22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive     21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
MEAN & MEDIAN
mtcars.mean()   # Get the mean of each column
mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64

mtcars.mean(axis=1)   # Get the mean of each row

mtcars.median()   # Get the median of each column
mpg      19.200
cyl       6.000
disp    196.300
hp      123.000
drat      3.695
wt        3.325
qsec     17.710
vs        0.000
am        0.000
gear      4.000
carb      2.000
dtype: float64
Although the mean and median both give us some sense of the
center of a distribution, they aren't always the same. The median always gives
us a value that splits the data into two halves while the mean is a numeric
average so extreme values can have a significant impact on the mean. In a
symmetric distribution, the mean and median will be the same.
norm_data = pd.DataFrame(np.random.normal(size=100000))

norm_data.plot(kind="density", figsize=(10,10));

plt.vlines(norm_data.mean(),     # Plot black line at mean
           ymin=0, ymax=0.4,
           linewidth=5.0);

plt.vlines(norm_data.median(),   # Plot red line at median
           ymin=0, ymax=0.4,
           linewidth=2.0,
           color="red");
In the plot above the mean and median
are both so close to zero that the red median line lies on top of the thicker
black line drawn at the mean.
In skewed distributions, the mean
tends to get pulled in the direction of the skew, while the median tends to
resist the effects of skew:
skewed_data = pd.DataFrame(np.random.exponential(size=100000))

skewed_data.plot(kind="density", figsize=(10,10), xlim=(-1,5));

plt.vlines(skewed_data.mean(),     # Plot black line at mean
           ymin=0, ymax=0.8,
           linewidth=5.0);

plt.vlines(skewed_data.median(),   # Plot red line at median
           ymin=0, ymax=0.8,
           linewidth=2.0,
           color="red");
The mean is also influenced heavily by outliers, while the
median resists the influence of outliers:
norm_data = np.random.normal(size=50)
outliers = np.random.normal(15, size=3)
combined_data = pd.DataFrame(np.concatenate((norm_data, outliers), axis=0))

combined_data.plot(kind="density", figsize=(10,10), xlim=(-5,20));

plt.vlines(combined_data.mean(),     # Plot black line at mean
           ymin=0, ymax=0.2,
           linewidth=5.0);

plt.vlines(combined_data.median(),   # Plot red line at median
           ymin=0, ymax=0.2,
           linewidth=2.0,
           color="red");
Since the median tends to resist the effects of skewness and outliers, it is known as a "robust" statistic. The median generally gives a better sense of the typical value in a distribution with significant skew or outliers.
MODE:
The mode of a variable is simply the value that appears most
frequently. Unlike mean and median, you can take the mode of a categorical
variable and it is possible to have multiple modes. Find the mode with
df.mode():
mtcars.mode()
    mpg  cyl   disp     hp  drat    wt   qsec   vs   am  gear  carb
0  10.4  8.0  275.8  110.0  3.07  3.44  17.02  0.0  0.0   3.0   2.0
1  15.2  NaN    NaN  175.0  3.92   NaN  18.90  NaN  NaN   NaN   4.0
2  19.2  NaN    NaN  180.0   NaN   NaN    NaN  NaN  NaN   NaN   NaN
3  21.0  NaN    NaN    NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
4  21.4  NaN    NaN    NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
5  22.8  NaN    NaN    NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
6  30.4  NaN    NaN    NaN   NaN   NaN    NaN  NaN  NaN   NaN   NaN
The columns with multiple modes (multiple values with the
same count) return multiple values as the mode. Columns with no mode (no value
that appears more than once) return NaN.
Measures of Spread
Measures of spread (dispersion) are
statistics that describe how data varies. While measures of center give us an
idea of the typical value, measures of spread give us a sense of how much the
data tends to diverge from the typical value.
One of the simplest measures of spread
is the range. Range is the distance between the maximum and minimum
observations:
max(mtcars["mpg"]) - min(mtcars["mpg"])
23.5
As noted earlier, the median represents the 50th percentile
of a data set. A summary of several percentiles can be used to describe a
variable's spread. We can extract the minimum value (0th percentile), first
quartile (25th percentile), median, third quartile(75th percentile) and maximum
value (100th percentile) using the quantile() function:
five_num = [mtcars["mpg"].quantile(0),
            mtcars["mpg"].quantile(0.25),
            mtcars["mpg"].quantile(0.50),
            mtcars["mpg"].quantile(0.75),
            mtcars["mpg"].quantile(1)]

five_num
[10.4, 15.425, 19.2, 22.8, 33.9]
Since these values are so commonly used to describe data,
they are known as the "five number summary". They are the same
percentile values returned by df.describe():
mtcars["mpg"].describe() count 32.000000mean 20.090625std 6.026948min 10.40000025% 15.42500050% 19.20000075% 22.800000max 33.900000Name: mpg, dtype: float64
Interquartile (IQR) range is another common measure of
spread. IQR is the distance between the 3rd quartile and the 1st quartile:
mtcars["mpg"].quantile(0.75) - mtcars["mpg"].quantile(0.25)7.375
The boxplots we learned to create in the lesson on plotting
are just visual representations of the five number summary and IQR:
mtcars.boxplot(column="mpg", return_type='axes', figsize=(8,8))

plt.text(x=0.74, y=22.25, s="3rd Quartile")
plt.text(x=0.8, y=18.75, s="Median")
plt.text(x=0.75, y=15.5, s="1st Quartile")
plt.text(x=0.9, y=10, s="Min")
plt.text(x=0.9, y=33.5, s="Max")
plt.text(x=0.7, y=19.5, s="IQR", rotation=90, size=25);
Measures of Spread: Variance and Standard Deviation
Variance
Variance and standard deviation are
two other common measures of spread.
The variance of a distribution is the
average of the squared deviations (differences) from the mean.
It quantifies the amount of variation
or dispersion in a set of data points. It represents the average squared
deviation of data points from the mean.
Use df.var() to check variance:
mtcars["mpg"].var()
6.026948052089105
Standard deviation
The standard deviation is the square root of the variance.
Standard deviation can be more interpretable than variance, since the standard
deviation is expressed in terms of the same units as the variable in question
while variance is expressed in terms of units squared. It represents
the average deviation or spread of data points from the mean.
The standard deviation is a
statistical measure that quantifies the amount of variation or dispersion in a
set of data points.
A low standard deviation indicates
that the data points tend to be close to the mean, while a high standard
deviation suggests that the data points are more widely dispersed from the
mean.
Use df.std() to check the standard deviation:
mtcars["mpg"].std()
6.026948052089105
Median absolute deviation
Since variance and standard deviation are both derived from
the mean, they are susceptible to the influence of data skew and outliers.
Median absolute deviation is an alternative measure of spread based on the
median, which inherits the median's robustness against the influence of skew
and outliers. It is the median of the absolute value of the deviations from the
median:
abs_median_devs = abs(mtcars["mpg"] - mtcars["mpg"].median())

abs_median_devs.median() * 1.4826
5.411490000000001
Skewness and Kurtosis
Descriptive statistics also include measures that give you a sense of the shape of a distribution. Skewness measures the skew or asymmetry of a distribution, while kurtosis measures how much data is in the tails of a distribution vs. the center.
If the skewness is close to 0, it
suggests that the data is approximately symmetric (normally distributed).
If the skewness is negative, it
indicates a left-skewed (negatively skewed) distribution, meaning the tail on
the left side is longer or fatter than the right side.
If the skewness is positive, it indicates a right-skewed (positively skewed) distribution, meaning the tail on the right side is longer or fatter than the left side.
Kurtosis values can provide
information about the shape of the data distribution:
If the kurtosis is close to 0, it
indicates that the distribution has a similar shape to a normal distribution
(mesokurtic).
If the kurtosis is negative (less than
0), it indicates a flatter distribution with lighter tails (platykurtic).
If the kurtosis is positive (greater than 0), it indicates a more peaked distribution with heavier tails (leptokurtic).
Pandas has built in functions for checking skewness and
kurtosis, df.skew() and df.kurt() or df.kurtosis() respectively:
mtcars["mpg"].skew() #
Check skewness
Out[18]:
0.6723771376290805
In [19]:
mtcars["mpg"].kurt() # Check kurtosis
Out[19]:
-0.0220062914240855