MTH 522-01B – Mathematical Statistics
Student Name – Gummedelli Saarika
Student ID – 02082642
Exploration of data
Data exploration is an important phase in the data science and analytics process. It entails inspecting and visualising a dataset to comprehend its qualities, trends, and relationships. Data exploration's main purpose is to obtain insights into the data, find trends, anomalies, and potential problems, and prepare the data for further analysis.
There are two types of statistics –
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics – Analysis that summarises and describes the sample itself is called descriptive statistics.
Inferential Statistics – Analysis that uses the sample to draw conclusions about the wider population is called inferential statistics.
Measures of Central Tendency –
Measures of central tendency are statistical measures used to describe the centre or average of a data set. They provide insights into the “typical” or “central” value within a data distribution.
Mean – The mean, often referred to as the average, is calculated by summing up all the values in a data set and then dividing by the total number of values.
Median – The median is the middle value in a data set when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.
Mode – The mode is the value that appears most frequently in a data set.
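The three measures above can be computed directly with Python's standard `statistics` module; the data set here is just a small illustrative sample.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(data)  # even count, so the middle values 3 and 5 average to 4.0
mode = statistics.mode(data)      # 3 occurs most often
print(mean, median, mode)
```

Note that because the data set has an even number of values, the median is the average of the two middle values, exactly as described above.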
Measures of Dispersion/ Variability –
Measures of dispersion, also known as measures of variability or spread, are statistical measures used to describe the extent to which data points in a dataset deviate from the central tendency.
Variance – Variance is the average of the squared differences between each data point and the mean. Because the differences are squared, it is expressed in squared units of the data and is sensitive to outliers.
Standard Deviation – The standard deviation is the square root of the variance. It measures the typical deviation of data points from the mean and is in the same units as the data.
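Both measures are available in the standard `statistics` module; the population versions (`pvariance`, `pstdev`) below divide by n, matching the "average squared difference" definition.

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]

# population variance: the mean of the squared deviations from the mean
variance = statistics.pvariance(data)
# standard deviation: the square root of the variance, in the data's own units
std_dev = statistics.pstdev(data)
print(variance, std_dev)
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` divide by n − 1 instead.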
Shape Statistics –
These are measures used in statistics to describe the shape of a probability distribution or data set. They quantify how a distribution departs from a normal distribution and can help identify patterns and characteristics of the data.
Skewness
Skewness is a measure of the asymmetry of a probability distribution or data set. It indicates the degree to which the data is skewed or tilted to one side.
Skewness can be positive or negative:
Positive Skewness – If the tail on the right-hand side (higher values) of the distribution is longer or fatter than the left-hand side (lower values), the distribution is said to be positively skewed. This means that most of the data points are concentrated on the left side, and there are some extreme values on the right side.
Negative Skewness – If the tail on the left-hand side (lower values) of the distribution is longer or fatter than the right-hand side (higher values), the distribution is said to be negatively skewed. In this case, most data points are concentrated on the right side, with some extreme values on the left side.
Kurtosis
Kurtosis is a measure of the heaviness of the tails of a probability distribution or data set, often described as its peakedness. It tells us whether the data has heavier tails or a sharper peak compared to a normal distribution.
Excess kurtosis (kurtosis measured relative to a normal distribution) can be positive or negative:
Positive Kurtosis – If a distribution has positive kurtosis, it means that it has heavier tails and is more peaked at the centre compared to a normal distribution. This indicates the presence of more extreme values or outliers in the data.
Negative Kurtosis – A distribution with negative kurtosis has lighter tails and is flatter at the centre compared to a normal distribution. This suggests that the data has fewer extreme values than a normal distribution.
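Both shape statistics can be computed from the central moments of the data. The sketch below uses the standard moment-based formulas (skewness as m3 / m2^1.5, excess kurtosis as m4 / m2^2 − 3); the helper name `moments` and the sample data are illustrative.

```python
def moments(data):
    """Skewness and excess kurtosis computed from central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5           # > 0: longer right tail; < 0: longer left tail
    excess_kurt = m4 / m2 ** 2 - 3  # > 0: heavier tails than normal; < 0: lighter tails
    return skew, excess_kurt

# a few large values stretch the right tail, so the skewness comes out positive
right_tailed = [1, 2, 2, 3, 3, 4, 10]
skew, kurt = moments(right_tailed)
print(skew, kurt)
```

A perfectly symmetric data set such as [1, 2, 3, 4, 5] would give a skewness of exactly zero under these formulas.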
Heteroscedasticity –
Heteroscedasticity describes a situation where the variability of the errors (residuals) in a regression model changes across different levels of the independent variables. In simpler terms, the spread or dispersion of the data points around the regression line is not constant across the entire range of predictor values.
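One simple way to see heteroscedasticity is to simulate data whose error spread grows with the predictor, fit a line, and compare the residual spread at low and high predictor values. This is only an illustrative sketch using NumPy with a fixed random seed; the split point 5.5 is an arbitrary midpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# the error standard deviation grows with x, so the data are heteroscedastic
y = 2 * x + rng.normal(0, 0.5 * x)

# fit a straight line and compare residual spread in the two halves of the x range
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
low_spread = residuals[x < 5.5].std()
high_spread = residuals[x >= 5.5].std()
print(low_spread, high_spread)  # residuals fan out at larger x
```

In practice the same idea underlies a residuals-versus-fitted plot: under homoscedasticity the spread looks constant, while a funnel shape signals heteroscedasticity.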
Linear Least Squares model –
The Linear Least Squares model, often referred to as Linear Regression, is a fundamental statistical and machine learning technique used for modelling the relationship between a dependent variable and one or more independent variables.
The primary goal of this model is to find the best-fitting linear equation that minimizes the sum of the squared differences (residuals) between the observed data points and the predictions made by the model. In other words, it finds the line (or hyperplane in higher dimensions) that best represents the data.
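A minimal example of the least squares fit described above: the points below lie exactly on the line y = 3x + 1, so minimising the squared residuals recovers that slope and intercept. The design-matrix construction and data are illustrative.

```python
import numpy as np

# points that lie exactly on y = 3x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1

# design matrix: one column for the slope, a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)
```

With noisy data the same call returns the line that minimises the sum of squared residuals rather than an exact fit.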