08 December, 2023
Recommendation systems, also known as recommender systems, are algorithms and techniques designed to provide personalized suggestions or recommendations to users. These systems are widely used in various online platforms, such as e-commerce websites, streaming services, social media, and more. There are several types of recommendation systems, and they can be categorized into three main approaches:
- Collaborative Filtering:
- Collaborative filtering makes recommendations based on the preferences and behavior of similar users. There are two main types:
- User-based collaborative filtering: Recommends items based on the preferences of users who are similar to the target user.
- Item-based collaborative filtering: Recommends items similar to those the target user has liked or interacted with.
- Collaborative filtering doesn’t require knowledge about the items or users and relies on user-item interactions.
- Content-Based Filtering:
- Content-based filtering recommends items by analyzing the characteristics of the items and matching them with the user’s preferences. It focuses on the attributes or features of the items and the user’s profile.
- For example, in a movie recommendation system, if a user has liked action movies in the past, the system may recommend other action movies.
- Hybrid Methods:
- Hybrid recommendation systems combine both collaborative and content-based filtering approaches to overcome the limitations of each method. By leveraging the strengths of both, hybrid systems can provide more accurate and diverse recommendations.
- There are various ways to combine these methods, such as using collaborative filtering to generate a preliminary list of recommendations and then refining them with content-based filtering.
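As a rough illustration of the collaborative approach, here is a minimal item-based collaborative filtering sketch using cosine similarity; the user-item ratings matrix is made up, and it assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item ratings matrix (rows = users, columns = items); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Item-based CF: compute similarity between item columns.
item_sim = cosine_similarity(ratings.T)

def score_items(user_idx):
    """Score items for a user as a similarity-weighted sum of their known ratings."""
    user_ratings = ratings[user_idx]
    scores = item_sim @ user_ratings / (np.abs(item_sim).sum(axis=1) + 1e-9)
    scores[user_ratings > 0] = -np.inf  # do not re-recommend items already rated
    return scores

print("Recommended item for user 0:", int(np.argmax(score_items(0))))
```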
December 08, 2023 MTH 522 Project 1 – Punchline Report
06 December, 2023
Exponential smoothing is a time series forecasting method that is widely used for predicting future data points based on past observations. It is particularly useful when there is a trend or seasonality in the data. The method assigns exponentially decreasing weights to past observations, with more recent observations receiving higher weights.
There are variations of exponential smoothing, including:
- Simple Exponential Smoothing (SES)
- Double Exponential Smoothing (Holt’s Method)
- Triple Exponential Smoothing (Holt-Winters Method)
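A minimal sketch of simple exponential smoothing in plain Python/NumPy follows; the series and the smoothing factor alpha are made up for illustration. Libraries such as statsmodels provide full implementations of Holt and Holt-Winters.

```python
import numpy as np

def simple_exp_smoothing(series, alpha=0.3):
    """Each smoothed point is alpha * observation + (1 - alpha) * previous smoothed value."""
    smoothed = [series[0]]                      # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

data = np.array([30, 32, 31, 35, 36, 38, 37, 40], dtype=float)  # hypothetical series
fitted = simple_exp_smoothing(data, alpha=0.3)
one_step_forecast = fitted[-1]                  # SES forecasts are flat: the last smoothed value
print(fitted.round(2), round(one_step_forecast, 2))
```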
04 December, 2023
Text mining, also known as text analytics, is a crucial component of data science that focuses on extracting meaningful information and insights from textual data. Text mining techniques enable the analysis and interpretation of unstructured text, making it possible to derive valuable knowledge from large volumes of text data.
1. Text Preprocessing:
Before mining insights, raw text data often needs preprocessing. This involves tasks such as removing irrelevant characters, converting text to lowercase, stemming or lemmatization to reduce words to their base or root form, and eliminating stop words (common words like “the,” “and,” etc.).
2. Tokenization:
Breaking down text into smaller units, such as words or phrases (tokens), is an essential step. Tokenization forms the basis for various text mining analyses.
3. Text Representation:
Converting text into a format suitable for analysis is crucial. Common techniques include:
– Bag of Words: Represents text as an unordered set of words, disregarding grammar and word order.
– Term Frequency-Inverse Document Frequency (TF-IDF): Weighs the importance of each word in a document relative to its frequency in the entire dataset.
4. Named Entity Recognition (NER):
Identifying and classifying entities (such as names of people, organizations, locations) within the text is essential for understanding the context and relationships in the data.
5. Sentiment Analysis:
Determining the sentiment expressed in text data (positive, negative, neutral) is a common application. This is especially valuable for analyzing customer reviews, social media comments, and other sources of opinion.
6. Topic Modeling:
Discovering latent topics within a collection of documents is done through techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF). Topic modeling helps in understanding the main themes present in the text.
7. Text Classification:
Assigning predefined categories or labels to text documents is a common application. Machine learning algorithms, such as Naive Bayes, Support Vector Machines, or deep learning models, can be used for text classification.
8. Text Clustering:
Grouping similar documents together based on their content is another task. Clustering algorithms like k-means can be applied to achieve this.
9. Text Summarization:
Creating concise and informative summaries of large volumes of text is valuable for quickly understanding the content.
10. Information Extraction:
Extracting specific pieces of information, such as key phrases, from text can be useful for generating structured data from unstructured text.
Text mining is applied across various industries, including finance, healthcare, marketing, and social media analysis. In data science, it plays a critical role in extracting actionable insights from the vast amounts of textual information available today.
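To make a couple of these steps concrete, here is a small sketch (with a tiny made-up corpus and labels) showing TF-IDF representation followed by Naive Bayes text classification using scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled documents (1 = positive sentiment, 0 = negative).
docs   = ["great product, works well", "terrible quality, broke fast",
          "really happy with this", "waste of money, very bad"]
labels = [1, 0, 1, 0]

# TF-IDF turns each document into a weighted bag-of-words vector;
# MultinomialNB then learns a simple classifier on those vectors.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["bad product, broke quickly"]))  # likely [0] given this tiny corpus
```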
01 December, 2023
ARIMA, which stands for AutoRegressive Integrated Moving Average, is a popular time series forecasting model. It combines autoregression (AR), differencing (I), and moving averages (MA) to capture patterns and trends in time series data. Here’s a brief explanation:
- AutoRegressive (AR): This component models the relationship between the current observation and its past values. The term “autoregressive” signifies that the model uses past observations as predictors for future values.
- Integrated (I): The differencing step involves transforming a non-stationary time series into a stationary one. Stationarity simplifies modeling as it assumes that statistical properties, such as mean and variance, remain constant over time.
- Moving Average (MA): This part captures the relationship between the current observation and a residual error from a moving average model applied to past observations.
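A minimal sketch of fitting an ARIMA(p, d, q) model with statsmodels on a made-up series; the order (1, 1, 1) is an arbitrary illustrative choice, not a tuned one.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical series with a gentle upward drift plus noise.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.5, 1.0, size=100))

# order=(p, d, q): 1 autoregressive lag, 1 difference, 1 moving-average lag.
model = ARIMA(y, order=(1, 1, 1))
result = model.fit()

print(result.forecast(steps=5))   # next 5 predicted values
```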
29 November, 2023
Forecasting with data-driven models involves using historical data to make predictions about future trends or values. Key concepts and approaches include:
1. Time Series Data: Forecasting often deals with time series data, where observations are collected sequentially over time.
2. Data Preprocessing: Cleaning and preparing the data by handling missing values, outliers, and transforming variables if needed.
3. Feature Selection: Identifying relevant features that contribute to the forecasting task.
4. Model Selection: Choosing an appropriate data-driven model, such as linear regression, ARIMA (Auto Regressive Integrated Moving Average), or more advanced models like machine learning algorithms (e.g., Random Forests, Gradient Boosting) for complex patterns.
5. Training and Validation: Splitting the dataset into training and validation sets to train the model and assess its performance.
6. Hyperparameter Tuning: Adjusting model parameters to optimize performance.
7. Evaluation Metrics: Using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), or others to evaluate the accuracy of the model’s predictions.
8. Ensemble Methods: Combining multiple models for improved accuracy and robustness.
9. Time Series Decomposition: Separating the time series into components like trend, seasonality, and residual for better understanding and modeling.
10. Exogenous Variables: Considering external factors that may influence the forecast and incorporating them into the model.
11. Long Short-Term Memory (LSTM) Networks: For deep learning enthusiasts, LSTMs are effective for sequential data and time series forecasting.
12. Cross-Validation: Assessing model performance across different subsets of the data to ensure generalization.
Data-driven forecasting models are widely used in finance, supply chain management, weather prediction, and many other fields. Continuous monitoring and updating of models with new data are essential for maintaining accuracy over time. Choosing the most suitable model depends on the specific characteristics of the data and the nature of the forecasting problem.
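As an illustration of the train/validation and evaluation steps above for time-ordered data, here is a hedged sketch using scikit-learn's TimeSeriesSplit with a simple one-lag linear model and MAE on made-up data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical setup: predict y at time t from its previous value (a one-lag feature).
rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(0, 1, 200))
X, y = series[:-1].reshape(-1, 1), series[1:]

errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print("MAE per fold:", np.round(errors, 3))
```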
27 November, 2023
Image classification is a computer vision task where algorithms categorize images into predefined classes or labels. It involves training a model on a dataset of labeled images to learn patterns and features associated with each class. Key points:
- Training Data: A labeled dataset is used for model training, with images and corresponding class labels.
- Convolutional Neural Networks (CNNs): Commonly employed for image classification, CNNs are deep learning architectures that automatically learn hierarchical features from images.
- Layers: CNNs consist of layers like convolutional, pooling, and fully connected layers, allowing the model to understand spatial hierarchies in image data.
- Activation Functions: Non-linear functions like ReLU introduce non-linearity, enabling the network to learn complex relationships.
- Loss Functions: Cross-entropy loss is often used for classification tasks, measuring the difference between predicted and actual class probabilities.
- Transfer Learning: Leveraging pre-trained models on large datasets (e.g., ImageNet) accelerates training and improves performance on specific tasks.
Image classification finds applications in various fields, including object recognition, medical imaging, and autonomous vehicles. It’s a fundamental task in computer vision, with advancements driven by deep learning techniques.
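A minimal sketch of a CNN classifier in PyTorch, assuming 28x28 grayscale inputs and 10 classes (MNIST-style data); the layer sizes are illustrative, not tuned.

```python
import torch
import torch.nn as nn

# Convolution + pooling layers learn spatial features; the final linear layer maps to class scores.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 1x28x28 -> 16x14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 32x7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                                                # 10 class scores
)

images = torch.randn(8, 1, 28, 28)          # a fake batch of 8 images
logits = model(images)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
print(logits.shape, loss.item())
```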
17 November, 2023
Natural Language Processing (NLP) is a branch of AI focusing on computer-human language interaction. It involves tasks like tokenization, part-of-speech tagging, and named entity recognition to understand, interpret, and generate human language. NLP applications include chatbots, translation, sentiment analysis, and speech recognition. Techniques range from rule-based systems to advanced deep learning models like BERT. NLP plays a crucial role in diverse fields such as virtual assistants, information retrieval, and text summarization, continually advancing to enhance language comprehension and generation by machines.
15 November, 2023
Random Forest, an ensemble learning technique, builds multiple decision trees during training and combines their outputs for robust predictions. Each tree is constructed with a random subset of the training data and a random subset of features, reducing overfitting. The final result is determined by aggregating the predictions of individual trees, providing improved accuracy and generalization compared to a single tree. Random Forests are versatile, handling classification and regression tasks, and they excel in diverse domains like finance, healthcare, and image analysis. Their ability to mitigate overfitting and enhance predictive power makes them a widely used and effective machine learning approach.
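A short scikit-learn sketch of a Random Forest classifier on the library's built-in iris dataset; the hyperparameters are illustrative defaults.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each of the 100 trees sees a bootstrap sample of rows and a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_.round(3))
```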
13 November, 2023
Decision trees, a popular machine learning algorithm, form tree-like models for classification or regression. Nodes represent decisions based on feature values, and leaves signify final outcomes. Training involves recursive dataset splits to create rules for homogeneous subsets. Common splitting criteria are Gini impurity and mean squared error. Pruning mitigates overfitting. Decision trees are interpretable and adept at handling diverse data types. Despite simplicity, they may not capture complex relationships and are sensitive to noise.
- Advantages:
- Easy to understand and interpret, making them suitable for visual representation.
- Require minimal data preparation.
- Can handle both numerical and categorical data.
- Disadvantages:
- Prone to overfitting, especially with deep trees.
- May not capture complex relationships in the data as effectively as other algorithms.
- Can be sensitive to small variations in the training data.
November 11, 2023 MTH 522 Project 2 – Punchline Report
8 November, 2023
A decision tree is a supervised machine learning algorithm that is used for both classification and regression tasks. It’s a popular tool for decision-making and predictive modeling because it can handle both categorical and numerical data, is easy to understand, and can be visualized effectively. Decision trees are constructed through a process of recursive partitioning, where the dataset is repeatedly split into subsets based on certain criteria or features.
How a Decision Tree Is Built:
- Feature Selection: The algorithm selects the best feature from the dataset to split the data at each internal node. The goal is to choose the feature that best separates the data into homogeneous groups, minimizing impurity or error.
- Splitting Criteria: The algorithm uses a splitting criterion to determine how to divide the data based on the chosen feature. Common splitting criteria for classification tasks include Gini impurity and entropy, while for regression tasks, it’s often mean squared error (MSE).
- Recursive Process: The process of feature selection, splitting, and creation of child nodes is repeated recursively until a stopping criterion is met. This can include a maximum tree depth, a minimum number of samples required to create a node, or a maximum number of leaf nodes.
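A minimal scikit-learn sketch tying these steps together: the splitting criterion and the maximum depth (one possible stopping criterion) are passed as parameters; the values here are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion='gini' is the splitting criterion; max_depth=3 is one possible stopping criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))  # prints the learned if/else splitting rules
```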
6 November, 2023
The Chi-Square test is a statistical test used to evaluate whether two categorical variables are associated or dependent on one another. It is especially useful for determining whether there is a significant link between variables and whether the observed frequencies in a contingency table differ from what would be expected under the assumption of independence.
The Chi-Square test has various versions, including:
The Chi-Square Test for Independence (also known as the χ² Test for Independence) is used to examine whether there is a significant relationship between two categorical variables. It is frequently used in research to determine whether one variable is dependent on another.
Chi-Square Goodness-of-Fit Test: This test is used to examine if observed data follows a given distribution, such as the normal, uniform, or any other expected distribution. It is frequently used to evaluate the fit of a model.
The Chi-Square Test for Homogeneity is used to examine if the distribution of a categorical variable is consistent across groups or populations.
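A short SciPy sketch of the Chi-Square test for independence on a made-up 2x2 contingency table.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = group A/B, columns = outcome yes/no.
table = [[30, 10],
         [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
# A small p-value suggests the two categorical variables are not independent.
```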
3 November, 2023
Other Clustering Techniques –
Mean Shift Clustering:
Mean Shift is a mode-seeking clustering algorithm that aims to find the modes (peaks) of the data density.
It doesn’t require specifying the number of clusters in advance and can find clusters of different shapes and sizes.
Spectral Clustering:
Spectral clustering treats the data as a graph and uses the eigenvalues and eigenvectors of the similarity matrix to partition the data into clusters.
It can handle non-convex clusters and is effective for image segmentation and community detection.
Fuzzy Clustering (Fuzzy C-Means):
Fuzzy clustering allows data points to belong to multiple clusters with varying degrees of membership.
It’s suitable when data points are not clearly separable into distinct clusters.
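A brief scikit-learn sketch of Mean Shift and Spectral Clustering on synthetic blob data; the bandwidth estimate and the number of clusters are illustrative choices.

```python
from sklearn.cluster import MeanShift, SpectralClustering, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Mean Shift: no cluster count needed; bandwidth controls the size of the density window.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms_labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

# Spectral clustering: builds a similarity graph and partitions it (cluster count given here).
sc_labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

print("Mean Shift clusters found:", len(set(ms_labels)))
print("Spectral labels (first 10):", sc_labels[:10])
```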
1 November, 2023
ANOVA, or Analysis of Variance, is a statistical test used to compare the means of different groups in a sample. It’s especially handy when you have three or more groups or conditions and want to see if there are any statistically significant differences between them. ANOVA can assist you in determining whether the variation between group means is greater than the variation within groups, which is useful in a variety of research and experimental contexts.
Types of ANOVA –
When you have one categorical independent variable with more than two levels (groups) and want to compare their means, you use one-way ANOVA.
Two-Way ANOVA extends one-way ANOVA to two independent variables, allowing you to investigate their interaction.
Multifactor ANOVA is used when there are more than two independent variables or factors that may interact in complex ways.
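A minimal one-way ANOVA sketch with SciPy on three made-up groups.

```python
from scipy.stats import f_oneway

# Hypothetical scores from three groups (e.g., three teaching methods).
group_a = [85, 88, 90, 86, 87]
group_b = [78, 82, 80, 79, 81]
group_c = [91, 89, 93, 94, 90]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F={f_stat:.2f}, p={p_value:.4f}")
# A small p-value indicates at least one group mean differs from the others.
```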
30 October, 2023
“K-nearest neighbours imputation” is a method for filling in missing values in a dataset by predicting those missing values based on the values of their K-nearest neighbours.
Steps for K-nearest Neighbours Imputation –
1. Identify Missing Values: First, you must identify the missing values in your dataset. These are often identified by “Nan” or another placeholder.
2. Select a Value for K: You must select a value for K, which is the number of nearest neighbours that will be used to impute the missing value. The value of K is a hyperparameter that you can change based on your data and problem.
3. Calculate Distances: Calculate the distance between each missing value and all other data points in your dataset. Depending on your data and situation, common distance measures include Euclidean distance, Manhattan distance, and others.
4. Locate K-Nearest Neighbours: Choose the K data points that are closest to the missing value. These are the neighbours who are closest to K.
5. Impute the Missing Value: The average (for numerical data) or mode (for categorical data) of the appropriate characteristic from the K-nearest neighbours can be used to impute the missing value. Alternatively, you can utilise weighted averages depending on distances, giving closer neighbours more weight.
6. Repeat for All Missing Values: Go through steps 3-5 again for all missing values in your dataset.
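The steps above are essentially what scikit-learn's KNNImputer automates; here is a hedged sketch on a tiny made-up table with one missing value.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the np.nan in row 2 will be filled from its 2 nearest neighbours.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [1.5, np.nan, 3.5],
    [8.0, 9.0, 10.0],
])

# weights='distance' gives closer neighbours more influence, as described in step 5.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```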
27 October, 2023
A t-test is a statistical method used to determine whether there is a significant difference in the means of two groups. The one-tailed and two-tailed t-tests are two forms of t-tests that differ in their hypotheses and the types of results they allow.
One Tail T-Test:
The one-tailed t-test, also known as a directional test, is performed when there is a specific hypothesis concerning the direction of the difference in means between the two groups being compared. In essence, it is used to assess whether the mean of one group is significantly greater than or less than the mean of the other.
The null hypothesis in a one-tailed t-test states that no significant difference exists between the means of the two groups. The alternative hypothesis, on the other hand, indicates the expected direction of the difference. For example, if two distinct treatments are being compared for their efficacy on a certain condition, and there is a previous opinion that one treatment is superior, a one-tailed t-test can be used to test this specific hypothesis.
Two Tail T Test:
The two-tailed t-test, also known as a non-directional test, is used when there is no specific hypothesis indicating the direction of the difference between the means of the two groups. It is concerned with determining whether any meaningful difference, regardless of direction, exists.
The null hypothesis claims that no significant difference exists between the means of the two groups in a two-tailed t-test. In this scenario, the alternative hypothesis simply proposes the existence of a significant difference without stating its direction. A two-tailed t-test, for example, can detect if a significant difference between the means exists when comparing the average scores of two different groups on a test without a prior belief about which group will outperform the other.
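A brief SciPy sketch contrasting the two forms on made-up samples; the `alternative` argument selects a one-tailed ('greater' or 'less') or two-tailed test.

```python
from scipy.stats import ttest_ind

# Hypothetical effect durations (minutes) for two treatments.
medicine_a = [60, 65, 58, 62, 64, 61]
medicine_b = [90, 95, 88, 92, 96, 91]

# Two-tailed: is there any difference in means, in either direction?
t2, p2 = ttest_ind(medicine_b, medicine_a, alternative="two-sided")

# One-tailed: is medicine B's mean specifically greater than medicine A's?
t1, p1 = ttest_ind(medicine_b, medicine_a, alternative="greater")

print(f"two-tailed p={p2:.4f}, one-tailed p={p1:.4f}")
```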
23 October, 2023
K-Medoids is a clustering algorithm similar to K-Means, but it uses actual data points (called “medoids”) to represent clusters. It works as follows:
1. Initialization: Select K data points as initial medoids.
2. Assignment: Assign data points to the nearest medoid, forming K clusters.
3. Update Medoids: Choose the data point within each cluster that minimizes total dissimilarity as the new medoid.
4. Repeat: Iterate steps 2 and 3 until convergence.
K-Medoids is robust to outliers and is suitable for non-numeric or categorical data. It’s used in applications like image compression, document clustering, and customer segmentation to find representative points within clusters.
20 October, 2023
DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a density-based clustering algorithm commonly used in machine learning and data analysis. It is designed to identify clusters of data points based on their density distribution within the dataset. DBSCAN has several advantages, including the ability to discover clusters of varying shapes and sizes, as well as its robustness in handling noise and outliers. Here’s how DBSCAN works:
1. Density-Based Clustering: DBSCAN defines clusters based on the density of data points. A cluster is a dense region of data points separated by areas of lower data point density.
2. Core Points: In DBSCAN, a core point is a data point that has at least a specified number of other data points (a predefined minimum number of data points, known as MinPts) within a certain distance (specified by a parameter ε or epsilon) from it. In other words, a core point is surrounded by other data points, making it the center of a cluster.
3. Border Points: Border points are data points that are within ε distance of a core point but do not meet the MinPts requirement themselves. They are part of a cluster but not core points.
4. Noise Points: Noise points are data points that are neither core points nor border points. They do not belong to any cluster and are typically outliers.
5. Cluster Formation: DBSCAN starts with an arbitrary data point and identifies all connected core points (and their reachable data points) to form a cluster. This process continues until no more core points can be added to the cluster.
6. Repeat for All Data Points: The algorithm repeats the cluster formation process for all data points, classifying them as core, border, or noise points.
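A short scikit-learn sketch of DBSCAN on two synthetic moon-shaped clusters; eps and min_samples here are illustrative values of the ε and MinPts parameters described above.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius (epsilon); min_samples is MinPts.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points
print("clusters:", n_clusters, "| noise points:", list(labels).count(-1))
```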
18 October, 2023
K-Means clustering is a popular and widely used clustering algorithm in machine learning and data analysis. It is a partitional clustering method that divides a dataset into K clusters, where K is a predefined number chosen by the user. The goal of K-Means is to minimize the variance (sum of squared distances) within each cluster. Here’s how the K-Means algorithm works:
1. Initialization: Choose K initial cluster centroids (representative points) either randomly or using some other strategy.
2. Assignment: Assign each data point to the nearest cluster centroid based on a distance metric (usually Euclidean distance).
3. Update Centroids: Recalculate the cluster centroids as the mean of all data points assigned to that cluster.
4. Iteration: Repeat the assignment and centroid update steps until convergence or a predefined stopping criterion is met. Common stopping criteria include a maximum number of iterations or when the assignments no longer change significantly.
K-Means is a simple and efficient clustering algorithm, but it has some limitations:
– It requires specifying the number of clusters (K) in advance, which may not always be known.
– K-Means is sensitive to the initial choice of cluster centroids. Different initializations can lead to different results.
– It assumes that clusters are spherical, equally sized, and have similar density, which may not always hold in real-world data.
– K-Means can be sensitive to outliers, as a single outlier can significantly affect the cluster centroids.
Despite its limitations, K-Means is widely used for tasks like customer segmentation, image compression, and data reduction, where its simplicity and efficiency make it a practical choice. Various improvements and adaptations of K-Means have been developed to address its limitations, such as K-Means++, Mini-Batch K-Means, and hierarchical versions of the algorithm.
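A compact scikit-learn K-Means sketch on synthetic blob data; K=3 is chosen to match how the data were generated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 reruns the algorithm from 10 random initializations and keeps the best result,
# which mitigates the sensitivity to the initial centroid choice noted above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("inertia (within-cluster sum of squares):", round(kmeans.inertia_, 2))
print("centroids:\n", kmeans.cluster_centers_.round(2))
```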
16 October, 2023
Clustering is an unsupervised machine learning technique that groups together comparable data items based on their intrinsic properties or patterns. Clustering seeks to uncover underlying structures within a dataset that lacks predetermined labels or classifications. The following is how clustering works:
1. Data Preparation: Begin with a dataset containing a collection of data points. These data points might be anything from customer profiles to genetic sequences.
2. Similarity Measurement: Define a metric for comparing the similarity or dissimilarity of data points. Euclidean distance, cosine similarity, and other domain-specific distance functions are common metrics.
3. Clustering Algorithm: Select a clustering algorithm to divide the data into clusters. K-Means, Hierarchical Clustering, DBSCAN, and many other methods are common.
4. Clustering Process: The selected algorithm assigns data points to clusters iteratively based on their similarity and changes the cluster centroids or structures until a stopping criterion is satisfied. The number of clusters might be set (as in K-Means) or automatically determined (as in DBSCAN).
5. Interpretation: Once the clustering is complete, the findings can be interpreted by analysing the properties of the data points inside each cluster. This analysis aids in comprehending the data’s underlying patterns or structures.
Clustering has numerous applications in a variety of fields. It is used in consumer segmentation for targeted marketing, image processing to group comparable images, biological data analysis for gene expression pattern identification, and a variety of other industries where uncovering hidden patterns or correlations inside data is important.
13 October, 2023
A geo histogram typically refers to a histogram-like visualization that represents data within a geographic or spatial context. Instead of representing data along one-dimensional axes, as in a traditional histogram, a geo histogram displays data in a spatial format, often using a map as a backdrop.
Here is how a geo histogram works and what it can be used for:
1. Geographic Data Representation: In a geo histogram, data points are associated with specific geographical locations or areas. Each location or area is represented on a map.
2. Binning or Clustering: Just like in a traditional histogram where data is divided into bins or categories along an axis, in a geo histogram, data points are often grouped into geographic regions or bins on the map. These regions can be defined based on various criteria, such as administrative boundaries, proximity to certain points of interest, or density.
3. Data Visualization: The resulting map will display regions or areas, each filled or shaded according to the density or count of data points within it. In this way, you can visualize how data is distributed spatially.
4. Color or Intensity: The level of shading or color intensity in each region typically corresponds to the frequency or density of data points within that region. Darker shades often indicate higher values or concentrations of data.
11 October, 2023
“Geo locations” and “geo positions” are related terms that pertain to geographic information and spatial data. They refer to specific points or coordinates on the Earth’s surface, often expressed in latitude and longitude. However, there are subtle differences in how these terms are used:
Geo Locations:
– Geo locations typically refer to specific places or points of interest on the Earth’s surface. They are often associated with named locations, such as cities, landmarks, addresses, or any area with a distinct identity.
– Geo locations are commonly used in navigation, mapping applications, and location-based services to help people find and identify places. For example, finding a restaurant, getting directions to a city, or locating a tourist attraction all involve geo locations.
Geo Positions:
– Geo positions are more general and refer to specific geographic coordinates on the Earth’s surface, usually expressed in terms of latitude and longitude. These are numerical representations of a point’s exact position in relation to the Earth’s grid system.
– Geo positions are often used in geospatial data analysis, geographic information systems (GIS), and mapping technologies to precisely pinpoint and analyze locations. For example, recording the coordinates of a geological survey point or mapping out the boundaries of a specific region on the Earth’s surface involves geo positions.
9 October, 2023
The importance of variables (features or attributes) in a dataset can vary significantly depending on the context of data analysis or machine learning:
1. Relevance to the Context: Variables’ significance depends on their relevance to the specific task. Not all variables matter, with some directly impacting the outcome and others being noise.
2. Feature Selection: To streamline models and enhance performance, it’s common to select essential variables using techniques like correlation analysis, feature importance scores, or domain expertise.
3. Collinearity: Highly correlated variables can make individual importance hard to discern. Eliminating one can improve model interpretability.
4. Data Exploration: Exploratory data analysis helps understand variable relationships and importance through techniques like visualization and data mining.
5. Machine Learning Models: Variable importance varies between models, with some models offering feature importance for assessment.
6. Domain Expertise: Prior knowledge aids in identifying important variables not evident in data alone.
7. Outliers and Anomalies: Addressing outliers is crucial to prevent them from unduly affecting variable importance.
8. Data Preprocessing: Techniques like scaling and encoding impact variable importance and model performance.
9. Model Interactions: Some models capture variable interactions, making the importance of a single variable dependent on others.
10. Model Objective: Variable importance differs based on the analysis goal; predictive models prioritize variables with strong predictive power, while causal analysis focuses on uncovering causal relationships.
October 8, 2023 MTH 522 Project 1 – Punchline Report
October 6, 2023
Added discussion to the project report –
Discussion –
When examining our regression models, it becomes apparent that the highest combined contribution of ‘% INACTIVE’ and ‘% OBESE’ variables in predicting ‘% DIABETES,’ accounting for 38.5%, may not fully capture the intricacies involved in diabetes prediction.
In light of these findings, our conclusion is that a more comprehensive analysis is necessary, one that takes into account a broader spectrum of factors that influence diabetes. This emphasizes the significance of incorporating additional variables to delve deeper into the underlying complexities, thereby providing a more comprehensive and holistic perspective on diabetes prediction.
October 4, 2023
Framed issues and findings for the report –
The Issues –
The dataset utilized in this study is derived from the Centers for Disease Control and Prevention (CDC), a prominent public health agency based in the United States. Specifically, this dataset contains valuable information related to diabetes, focusing on the data for the year 2018. The dataset in question contains a wide range of features, including important attributes like ‘% INACTIVE,’ ‘% OBESE,’ and the target variable ‘% DIABETES’.
We address the questions:
- Can we predict the impact of the variables ‘INACTIVE’ and ‘% OBESE’ on the target variable ‘% DIABETES’?
- Do these variables significantly contribute to predicting the occurrence of diabetes, and if they do, how accurate is their contribution to the predictive model?
- Is there any correlation between the variables ‘% INACTIVE’ and ‘% OBESE’?
- Do these two variables, ‘% INACTIVE’ and ‘% OBESE,’ individually or collectively influence the outcome?
Findings –
The quadratic regression diabetes prediction model yielded a total contribution of 38.5% when the variables ‘% INACTIVE’ and ‘% OBESE’ were included. This contribution decreased to 36.5% with the addition of an interaction term.
Additionally, the contribution decreased to 30.1% when support vector regression was used. These results emphasize the need for a thorough assessment of other variables in the predictive model and point out the existence of additional relevant factors that influence diabetes.
October 2, 2023
Today’s discussion was about the topics –
Thesis Structure –
What is my question? (Research Question)
Why it is worth doing? (Rationale) -> What impact
Who has done anything like this? (Literature Review)
How will I go about it? (Method)
What did I find? (Results)
So what? (conclusions)
Capstone Project –
You just have to give a 10-minute talk or present a poster.
September 29, 2023
Analysing data doesn’t have to be complicated, and there are simple steps you can follow to gain insights –
Understand Your Data: Start by looking at your data. Understand what each column represents. What is the data about? What are the variables?
Clean Your Data: Check for missing values and outliers. Remove or fill in missing data if necessary.
Descriptive Statistics:
Use basic descriptive statistics to summarize the main characteristics of your data:
- Mean: The average value.
- Median: The middle value.
- Mode: The most frequent value.
- Range: The difference between the maximum and minimum values.
Visualization:
– Create simple visualizations to understand the distribution of your data.
- Histograms: Show the distribution of a single variable.
- Scatter plots: Show the relationship between two variables.
- Bar charts: Compare categorical variables.
Correlation Analysis: If you have multiple variables, check for correlations. This helps you understand how changes in one variable relate to changes in another.
Grouping and Aggregation: Group your data by a categorical variable and calculate summary statistics for each group. This can reveal patterns and trends.
Simple Trend Analysis: If your data spans different time periods, look for trends over time. This can be done through line charts.
Ask Questions: Formulate specific questions about your data. For example, “Are there differences between groups?” or “Does variable A seem to influence variable B?”
Use Simple Tools: If you’re not familiar with programming, use spreadsheet software like Microsoft Excel or Google Sheets. These tools often have built-in functions for basic data analysis.
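For readers who prefer code over spreadsheets, the same simple steps can be done with pandas; this sketch uses a small made-up table.

```python
import pandas as pd

# Hypothetical dataset.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "score": [85, 90, 78, 82, 80],
    "hours": [5, 6, 3, 4, 4],
})

print(df.isna().sum())                      # check for missing values
print(df.describe())                        # mean, median (50%), min/max, etc.
print(df.groupby("group")["score"].mean())  # grouping and aggregation
print(df[["score", "hours"]].corr())        # correlation analysis
```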
September 27, 2023
Cross-validation is a machine learning approach used to evaluate a model’s performance and make sure it generalises effectively to fresh, untried data. K-fold cross-validation is one popular type of cross-validation. The original dataset is randomly divided into k folds of equal size, and in k-fold cross-validation, the model is trained and tested k times, with each test set being a new fold and the remaining folds serving as the training set.
Mean Squared Error (MSE) is a common metric used to measure the average squared difference between the actual and predicted values in a regression problem. It is calculated as the average of the squared differences between the predicted and actual values.
September 25, 2023
Today’s discussion was about the topics –
K-fold cross-validation is a popular machine learning technique for evaluating the performance of a predictive model and decreasing the danger of overfitting. It is especially useful when you have a limited amount of data and want to make the most of it while still generating a credible estimate of the performance of your model. The following is how K-fold cross-validation works:
Dataset Splitting: Begin with your dataset and divide it into K subgroups of nearly similar size. These subsets are frequently referred to as “folds.”
Training and Testing: You train and test your model K times, using a different fold as the test set and the remaining K-1 folds as the training set each time. If you use 5-fold cross-validation, for example, you will repeat the process five times, each time using a different fifth of the data as the test set.
Performance Metrics: After each run, you compute a performance metric (such as accuracy, F1 score, or mean squared error) to assess how well your model did on the test set.
Average Performance: Average the metrics from all K runs to obtain an overall evaluation of your model’s performance. When compared to a single train-test split, this average provides a more trustworthy estimate of how well your model is expected to perform on unseen data.
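These steps are essentially what scikit-learn's cross_val_score wraps up; below is a hedged 5-fold sketch on scikit-learn's built-in diabetes regression dataset, where scoring returns negative MSE by convention.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold CV: each fold serves once as the test set while the other folds train the model.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")

print("MSE per fold:", (-scores).round(1))
print("Average MSE:", (-scores).mean().round(1))
```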
September 22, 2023
Today’s discussion was about the topics –
A heatmap is a graphical representation of data that depicts values using a colour scale. It is frequently used to display data in a two-dimensional matrix or grid format, with each cell in the grid representing a different data point or value. The magnitude or intensity of the data represented by each cell determines its colour.
Heatmaps are often used to reveal patterns, trends, and relationships in data in a variety of domains, including statistics, data analysis, and data visualisation. They are especially effective for locating areas of high or low concentration, correlation, or variation within a dataset.
Polynomial regression is a sort of regression analysis used in statistics and machine learning to fit a polynomial equation to data to represent the connection between a dependent variable and one or more independent variables. Whereas simple linear regression represents interactions as straight lines, polynomial regression allows for the capturing of more complex, nonlinear correlations.
Support Vector Regression is a machine learning technique for regression tasks in which the goal is to predict a continuous numeric output based on one or more input features.
SVR’s central concept is to locate a hyperplane in a high-dimensional feature space that best captures the relationship between the input variables and the target variable while minimising prediction error. SVR, in contrast to classic linear regression, can handle both linear and nonlinear correlations between variables by employing a technique known as the “kernel trick.”
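A hedged sketch of polynomial regression and SVR side by side on made-up nonlinear data, using scikit-learn; the degree, C, and epsilon values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

# Hypothetical nonlinear relationship: y = x^2 + noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 100)

# Polynomial regression: expand x into [x, x^2] and fit a linear model on those features.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# SVR with an RBF kernel handles the nonlinearity via the kernel trick.
svr_model = SVR(kernel="rbf", C=10, epsilon=0.1).fit(X, y)

print("poly R^2:", round(poly_model.score(X, y), 3))
print("SVR  R^2:", round(svr_model.score(X, y), 3))
```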
September 20, 2023
A t-test is a statistical hypothesis test that is used to assess whether there is a significant difference in the means of two groups or populations. It is widely used when comparing the means of two samples to assess whether the observed differences are statistically significant or if they may have occurred by chance.
The independent samples t-test and the paired samples t-test are the two most used forms of t-tests:
Independent Samples T-Test (Two-Sample T-Test) – This type of t-test is used to compare the means of two separate groups or populations to see if they differ substantially. You could, for example, use an independent samples t-test to compare the average test scores of two groups of students who were taught using different approaches.
The null hypothesis (Ho) typically assumes that there is no significant difference between the means of the two groups, while the alternative hypothesis (Ha) suggests that there is a significant difference.
Paired Samples T-Test (Dependent T-Test) – When comparing the means of two related groups or when you have matched pairs of observations, this sort of t-test is utilised. For example, you could use a paired samples t-test to compare people’s blood pressure before and after a given treatment.
In this case, the null hypothesis (Ho) assumes that there is no significant difference in the means of the paired observations, while the alternative hypothesis (Ha) suggests that there is a significant difference.
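A small SciPy sketch of both forms on made-up data: ttest_ind for independent samples and ttest_rel for paired (before/after) measurements.

```python
from scipy.stats import ttest_ind, ttest_rel

# Independent samples: test scores from two groups taught with different approaches.
group_1 = [72, 75, 78, 71, 74, 77]
group_2 = [80, 82, 79, 85, 81, 83]
t_ind, p_ind = ttest_ind(group_1, group_2)

# Paired samples: blood pressure for the same people before and after a treatment.
before = [140, 135, 150, 145, 138]
after  = [132, 130, 144, 139, 133]
t_rel, p_rel = ttest_rel(before, after)

print(f"independent: p={p_ind:.4f} | paired: p={p_rel:.4f}")
```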
September 18, 2023
Today’s discussion was about the topics –
Multiple linear regression is a statistical method used for modelling the relationship between a dependent variable and two or more independent variables by fitting a linear equation to observed data. It is an extension of simple linear regression, which deals with the relationship between two variables.
R-squared (R²) and adjusted R-squared (R²_adj) are statistical measures used to assess the goodness of fit of a regression model, particularly in the context of multiple linear regression. They provide insights into how well the independent variables in a regression model explain the variability in the dependent variable. These values are used to evaluate the overall performance and appropriateness of the model.
R-squared (R²) –
- R-squared is a measure of the proportion of the total variance in the dependent variable that is explained by the independent variables in the model.
- It ranges between 0 and 1, where 0 indicates that the independent variables do not explain any of the variance, and 1 indicates that they explain all of it.
- In the context of linear regression, R² is calculated as the ratio of the explained variance to the total variance: R² = 1 − SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares.
R-squared is often interpreted as the goodness of fit of the model. A higher R² indicates a better fit, meaning that a larger proportion of the variance in the dependent variable is explained by the independent variables. However, high R² values don’t necessarily imply causation or the absence of omitted variables.
Adjusted R-squared (R²_adj) –
- Adjusted R-squared is a modification of R-squared that considers the number of independent variables in the model.
- It penalizes the addition of unnecessary variables to the model, preventing overfitting.
- As more independent variables are added to the model, R-squared may artificially increase, even if the additional variables do not provide significant explanatory power.
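A hedged statsmodels sketch of multiple linear regression on made-up data, reading off both R² and adjusted R² from the fitted model.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: y depends on x1 and x2 plus noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=1.0, size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # add the intercept term
model = sm.OLS(y, X).fit()

print("R-squared:         ", round(model.rsquared, 3))
print("Adjusted R-squared:", round(model.rsquared_adj, 3))
```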
September 15, 2023
Today’s discussion was about the topics –
Collinearity – Collinearity occurs when two or more independent variables in a regression analysis are closely linked, making determining the individual influence of each variable on the dependent variable difficult. Collinearity can result in unstable coefficient estimates, decreased statistical power, and difficulty comprehending the model.
Polynomial Regression – The link between the independent variables and the dependent variable is treated as an nth-degree polynomial in polynomial regression. In other words, instead of a straight line, it’s a linear regression with a polynomial equation. When the relationship between the variables is not linear but follows a curve, polynomial regression is useful.
Log Transformations – Log transformations are a frequent data transformation technique used in data analysis and modelling, especially when working with exponential or multiplicative data. The data is transformed with a log transformation to make it more symmetric and to lessen the influence of extreme numbers.
I have tried Auto EDA using pandas profiling for data visualization so that I might find more insights from it.
Auto EDA – Auto EDA (Automated Exploratory Data Analysis) refers to the process of automatically analysing and summarizing a dataset to gain insights and understand its characteristics without significant manual intervention.
September 13, 2023
Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions based on experimental data. A hypothesis is essentially an assumption that we make about a population parameter.
- It is a validation technique.
- It allows us to validate whether the inferences we are making about the population are true or not.
There are two hypothesis statements –
- Null Hypothesis (Ho)
- Alternative Hypothesis (Ha)
Null Hypothesis – There is no significant difference between process A and process B.
Alternative Hypothesis – There is a significant difference between process A and process B.
Example: Medicine B for treating headaches, newly developed by a pharmaceutical company, has a 30-minute longer effect than the existing Medicine A.
Ho – Medicines A and B have the same effect.
Ha – Medicine B has a 30-minute longer effect than Medicine A.
P value – The probability, under the assumption of no effect or no difference, of obtaining a result equal to or more extreme than what was observed.
R-squared value, also known as the coefficient of determination, is a statistical measure used to assess the goodness of fit of a regression model. It provides information about how well the independent variables in a regression model explain the variability in the dependent variable.
September 11, 2023
MTH 522-01B – Mathematical Statistics
Student Name – Gummedelli Saarika
Student ID – 02082642
Exploration of data
Data exploration is an important phase in the data science and analytics process. It entails inspecting and visualising a dataset to comprehend its qualities, trends, and relationships. Data exploration’s major purpose is to obtain insights into the data, find trends, anomalies, and potential problems, and prepare the data for future study.
There are two types of Statistics –
- Descriptive Statistics
- Inferential Statistics
Descriptive Statistics – The analysis done on the sample itself is called Descriptive Statistics.
Inferential Statistics – The analysis done on the sample and then used to infer back to the population is called Inferential Statistics.
Measures of Central Tendency –
Measures of central tendency are statistical measures used to describe the centre or average of a data set. They provide insights into the “typical” or “central” value within a data distribution.
Mean – The mean, often referred to as the average, is calculated by summing up all the values in a data set and then dividing by the total number of values.
Median – The median is the middle value in a data set when the values are arranged in ascending or descending order. If there is an even number of values, the median is the average of the two middle values.
Mode – The mode is the value that appears most frequently in a data set.
Measures of Dispersion/ Variability –
Measures of dispersion, also known as measures of variability or spread, are statistical measures used to describe the extent to which data points in a dataset deviate from the central tendency.
Standard Deviation – The standard deviation is the square root of the variance. It measures the average deviation of data points from the mean and is in the same units as the data.
Variance – Variance measures the average squared difference between each data point and the mean. It provides a more comprehensive view of the spread of the data but is sensitive to outliers.
Shape Statistics –
These are measures used in statistics to describe the shape of a probability distribution or data set. They provide information about the departure of a distribution from a normal distribution and can help identify patterns and characteristics of the data.
Skewness
Skewness is a measure of the asymmetry of a probability distribution or data set. It indicates the degree to which the data is skewed or tilted to one side.
Skewness can be positive or negative:
Positive Skewness – If the tail on the right-hand side (higher values) of the distribution is longer or fatter than the left-hand side (lower values), the distribution is said to be positively skewed. This means that most of the data points are concentrated on the left side, and there are some extreme values on the right side.
Negative Skewness – If the tail on the left-hand side (lower values) of the distribution is longer or fatter than the right-hand side (higher values), the distribution is said to be negatively skewed. In this case, most data points are concentrated on the right side, with some extreme values on the left side.
Kurtosis
Kurtosis is a measure of the peakedness of a probability distribution or data set. It tells us whether the data has heavy tails or is more peaked compared to a normal distribution.
Kurtosis can be positive or negative:
Positive Kurtosis – If a distribution has positive kurtosis, it means that it has heavier tails and is more peaked at the centre compared to a normal distribution. This indicates the presence of more extreme values or outliers in the data.
Negative Kurtosis – A distribution with negative kurtosis has lighter tails and is flatter at the centre compared to a normal distribution. This suggests that the data has fewer extreme values than a normal distribution.
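A small pandas/SciPy sketch computing these central-tendency, spread, and shape statistics on a made-up sample.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

data = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 15])  # hypothetical sample with a long right tail

print("mean:", data.mean(), "| median:", data.median(), "| mode:", data.mode().tolist())
print("variance:", round(data.var(), 2), "| std dev:", round(data.std(), 2))
print("skewness:", round(skew(data), 2))               # positive here: tail on the right
print("kurtosis (excess):", round(kurtosis(data), 2))  # > 0 means heavier tails than normal
```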
Heteroscedasticity –
Heteroscedasticity describes a situation where the variability of the errors (residuals) in a regression model varies across different levels of the independent variables. In simpler terms, it means that the spread or dispersion of the data points around the regression line is not constant across the entire range of predictor values.
Linear Least Squares model –
The Linear Least Squares model, often referred to as Linear Regression, is a fundamental statistical and machine learning technique used for modelling the relationship between a dependent variable and one or more independent variables.
The primary goal of this model is to find the best-fitting linear equation that minimizes the sum of the squared differences (residuals) between the observed data points and the predictions made by the model. In other words, it finds the line (or hyperplane in higher dimensions) that best represents the data.