September 29, 2023

The t-test is a parametric test of difference, meaning that it makes the same assumptions about your data as other parametric tests: it assumes your data are independent, normally distributed, and have a similar amount of variance within each group being compared. A t-test helps determine whether the difference in means between two groups reflects a real effect or occurred randomly.
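A minimal sketch of a two-sample t-test in Python. The two groups here are synthetic and purely illustrative (made up means, spreads, and sizes); the point is that both groups share the same variance, matching the equal-variance assumption above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two hypothetical groups with equal spread but shifted means.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=12.0, scale=2.0, size=50)

# Two-sample t-test; equal_var=True matches the similar-variance assumption.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A small p-value here says the gap between the two sample means is unlikely to be random noise.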

The Monte Carlo permutation test is a technique used to determine whether the observed difference between two groups is real or has occurred by chance. It is especially helpful when dealing with complex data.
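A sketch of the permutation idea on synthetic data (the groups and the 5,000-shuffle count are illustrative choices, not anything from the course data): pool the two groups, shuffle repeatedly, and ask how often a random re-split produces a difference as large as the one actually observed.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical groups with a genuine difference in means.
group_a = rng.normal(10.0, 2.0, 40)
group_b = rng.normal(12.0, 2.0, 40)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

# Shuffle the pooled values and re-split to simulate the difference in
# means we would see under pure chance.
n_perm = 5000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[40:].mean() - pooled[:40].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

Because it only relies on shuffling, this test makes no normality assumption, which is why it is handy for messy data.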

September 27, 2023

We carry out 5-fold cross-validation on the given data, which consists of obesity, inactivity, and diabetes values; there are 354 data points containing all 3 variables. With increasing model complexity, the training error often decreases and tends to underestimate the test error. We randomly divide the labelled data, in which no repeats are present, into 5 roughly equal groups of 71, 71, 71, 71, and 70 points, which we then use for cross-validation to estimate the test error: for each of the 5 train/test splits we fit polynomial models on the training data and compute the mean squared error on the held-out test data.
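The 71/71/71/71/70 procedure can be sketched as follows. The x/y data below are synthetic stand-ins (the real analysis would use the %inactivity and %diabetes columns), but the fold sizes match the 354-point split described above:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in for the 354-point data set.
n = 354
x = rng.uniform(10, 35, n)
y = 0.3 * x + rng.normal(0, 1.0, n)

# Shuffle once, then split into 5 roughly equal folds: 71, 71, 71, 71, 70.
idx = rng.permutation(n)
folds = np.array_split(idx, 5)

for degree in (1, 2, 3):
    mses = []
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        # Fit the polynomial on the training folds only.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        mses.append(np.mean((y[test_idx] - pred) ** 2))
    print(f"degree {degree}: CV MSE = {np.mean(mses):.3f}")
```

Comparing the averaged test MSE across polynomial degrees is what lets us pick a model complexity without trusting the optimistic training error.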

September 25, 2023

K-fold Cross-Validation is a widely used approach for estimating test error. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K-1 parts, and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, …, K, and then the results are combined. We cannot apply cross-validation in step 2 directly without applying it in step 1, because that would ignore the fact that in step 1 the procedure has already seen the labels of the training data and made use of them. This is a form of training and must be included in the validation process.
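A sketch of why step 1 must sit inside the cross-validation loop. On pure-noise data (nothing truly predicts y), screening for the "best" features on the full data set before cross-validating lets the test folds leak into the feature choice, while screening within each training fold does not. The sizes, feature counts, and helper names below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # pure noise: no feature is truly predictive

def cv_mse(feature_idx, k=5):
    """5-fold CV MSE of least squares on a fixed set of feature columns."""
    folds = np.array_split(rng.permutation(n), k)
    mses = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        coef, *_ = np.linalg.lstsq(X[np.ix_(train, feature_idx)],
                                   y[train], rcond=None)
        pred = X[np.ix_(f, feature_idx)] @ coef
        mses.append(np.mean((y[f] - pred) ** 2))
    return np.mean(mses)

# Wrong: step 1 (screen the 10 most-correlated features) uses ALL the data,
# including the future test folds, before step 2 cross-validates the fit.
corr_all = np.abs(X.T @ (y - y.mean()))
wrong_mse = cv_mse(np.argsort(corr_all)[-10:])

# Right: redo the screening inside each fold, using training labels only.
def right_way_mse(k=5):
    folds = np.array_split(rng.permutation(n), k)
    mses = []
    for f in folds:
        train = np.setdiff1d(np.arange(n), f)
        corr = np.abs(X[train].T @ (y[train] - y[train].mean()))
        keep = np.argsort(corr)[-10:]
        coef, *_ = np.linalg.lstsq(X[np.ix_(train, keep)],
                                   y[train], rcond=None)
        pred = X[np.ix_(f, keep)] @ coef
        mses.append(np.mean((y[f] - pred) ** 2))
    return np.mean(mses)

right_mse = right_way_mse()
print(f"screen-then-CV MSE = {wrong_mse:.3f}, CV-with-screening MSE = {right_mse:.3f}")
```

On noise data the honest error should hover around the noise variance, while the leaky version typically looks deceptively good.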

September 22, 2023

In the given CDC 2018 diabetes data, we discovered that how your weight (BMI) affects your chances of getting diabetes depends on your age. It's as if BMI plays a different role for younger and older people. Instead of fitting straight lines, we may have to fit curves when modelling the data, and we use techniques such as polynomial regression to handle those curves. When the relationship between factors like your weight and diabetes isn't a straight line, fitting curves makes the pattern simpler to capture. We use step functions to catch quick jumps, like when a significant increase in exercise suddenly has a big impact on risk.
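A sketch of the step-function idea. The data below are invented (a hypothetical risk score that jumps at age 45, a number chosen purely for illustration, not from the CDC data): cut the predictor into bins and fit a constant within each bin, so the fit can capture a sudden jump that no straight line can.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical ages and a risk score that jumps at age 45.
age = rng.uniform(20, 80, 200)
risk = np.where(age < 45, 5.0, 9.0) + rng.normal(0, 0.5, 200)

# Step function: bin the ages, then fit a constant (the mean) in each bin.
bins = np.array([20, 45, 80])
bin_id = np.digitize(age, bins[1:-1])  # 0 for age < 45, 1 for age >= 45
step_fit = np.array([risk[bin_id == b].mean() for b in (0, 1)])
print(f"fitted levels: {step_fit.round(2)}")
```

The fitted levels recover the two plateaus, and the gap between them is exactly the kind of sudden jump a single straight line would smear out.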

September 20, 2023

Variance measures how spread out or dispersed the values in a dataset are. A smaller variance indicates that the data points are closer to the mean and less dispersed, while a larger variance suggests that the data points are more spread out from the mean. A t-test is a way to figure out if two sets of data are different from each other in a meaningful way, taking into account the variability within each group and the size of the samples. In a normality test, the p-value indicates whether the data deviate from a normal distribution: if it's below 0.05, the data might not be normal; if it's above 0.05, the data are consistent with normality. A pre-molt histogram shows the distribution of data before an event. A post-molt histogram shows data after the event to check for changes. A Monte Carlo permutation test involves repeatedly shuffling the data to simulate random chance outcomes.
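A sketch of the normality p-value idea using the Shapiro-Wilk test (one common choice; the entry above doesn't name a specific test, and the two samples below are synthetic): a genuinely normal sample should usually pass, while a skewed sample should fail with a small p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_sample = rng.normal(0, 1, 200)
skewed_sample = rng.exponential(1.0, 200)  # clearly non-normal

# Shapiro-Wilk: a small p-value is evidence AGAINST normality.
_, p_normal = stats.shapiro(normal_sample)
_, p_skewed = stats.shapiro(skewed_sample)
print(f"normal sample p = {p_normal:.4f}, skewed sample p = {p_skewed:.4g}")
```

This is the sense in which "p below 0.05 means the data might not be normal" applies: it is the p-value of a normality test, not of the t-test itself.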

September 18, 2023

A linear regression model with more than one predictor variable is known as multiple linear regression. For Galton, "regression" referred only to the tendency of extreme data values to "revert" to the overall mean value. Predictor variables are also known as factors; %inactivity and %obesity are the factors (predictor variables) for %diabetes. A high R-squared value indicates how well the model fits the observed data. Cross-validation is a technique to estimate prediction error and to check for overfitting.
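A sketch of multiple linear regression with the two factors named above. The numbers below are synthetic stand-ins for the county-level percentages (the coefficients and noise level are invented), fit by ordinary least squares with R-squared computed from the residuals:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 354
# Synthetic stand-ins for the two factors and the response.
inactivity = rng.uniform(10, 35, n)
obesity = rng.uniform(20, 45, n)
diabetes = 1.5 + 0.15 * inactivity + 0.10 * obesity + rng.normal(0, 0.6, n)

# Design matrix with an intercept column; solve by least squares.
X = np.column_stack([np.ones(n), inactivity, obesity])
coef, *_ = np.linalg.lstsq(X, diabetes, rcond=None)

pred = X @ coef
ss_res = np.sum((diabetes - pred) ** 2)
ss_tot = np.sum((diabetes - diabetes.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"coefficients = {coef.round(3)}, R^2 = {r_squared:.3f}")
```

R-squared is the fraction of the variance in %diabetes that the two factors explain together.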

September 15, 2023

The linear model's significance in data science projects must be emphasized. Measuring distance parallel to the Y-axis in linear models simplifies the calculations. Transformations such as log or exponential are often advised when fitting linear models.
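A sketch of why a log transformation helps: data that grow exponentially (the growth curve below is invented for illustration) become a straight line after taking logs, so an ordinary linear fit recovers the underlying parameters.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(1, 10, 100)
# Exponential growth with multiplicative noise: y = a * exp(b*x) * noise.
y = 2.0 * np.exp(0.4 * x) * rng.lognormal(0.0, 0.1, 100)

# Taking logs linearizes the relationship: log y = log a + b*x,
# which a plain least-squares line can fit.
b, log_a = np.polyfit(x, np.log(y), 1)
print(f"estimated b = {b:.3f}, estimated a = {np.exp(log_a):.3f}")
```

Fitting a straight line to y directly would badly miss the curve; the transformed fit recovers a close to 2 and b close to 0.4.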

September 13, 2023

Checking for heteroscedasticity is important to ensure that the assumptions of the model are met and that the results are valid. The Breusch-Pagan test is used to test for heteroscedasticity analytically. The Breusch–Pagan test was developed in 1979 by Trevor Breusch and Adrian Pagan, and there are a couple of variants of it. It uses the residuals from the original linear regression of %diabetes against %inactivity: regress the squared residuals on the predictor, calculate that auxiliary regression's R-squared, and then calculate the p-value. A p-value is a way to assess whether the results of an experiment or study are statistically significant or whether they could have occurred by chance. Here the test statistic is n times the auxiliary R-squared, where n = 1370 is the number of %inactivity data points, and the p-value comes from a chi-squared distribution with degrees of freedom equal to the number of predictors. Typically, when the p-value is very small, say below 0.05, it is strong evidence against the null hypothesis; in such cases we reject the null hypothesis that the original linear model is homoscedastic.
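A sketch of one simple variant of the Breusch-Pagan test on synthetic heteroscedastic data (the real analysis would use the %diabetes against %inactivity residuals; the data, sample size, and noise pattern below are all invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 300
x = rng.uniform(5, 30, n)
# Heteroscedastic data: the noise spread grows with x.
y = 4.0 + 0.2 * x + rng.normal(0, 0.05 * x, n)

# Step 1: ordinary least squares of y on x; keep the residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Step 2: auxiliary regression of squared residuals on x, and its R-squared.
aux_slope, aux_int = np.polyfit(x, resid ** 2, 1)
aux_pred = aux_int + aux_slope * x
ss_res = np.sum((resid ** 2 - aux_pred) ** 2)
ss_tot = np.sum((resid ** 2 - np.mean(resid ** 2)) ** 2)
r2_aux = 1 - ss_res / ss_tot

# Breusch-Pagan LM statistic: n * R^2, chi-squared with 1 df
# (one predictor in the auxiliary regression).
lm = n * r2_aux
p_value = stats.chi2.sf(lm, df=1)
print(f"LM = {lm:.2f}, p = {p_value:.4g}")
```

Because the squared residuals clearly trend with x, the statistic is large and the p-value is small, so we reject homoscedasticity, exactly the conclusion described above.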

September 11, 2023

The Federal Information Processing Standards (FIPS) codes common to all data sets yield 354 rows of data containing information on all 3 variables: %obesity, %inactivity, and %diabetes. There are relatively large numbers of data points for both diabetes and inactivity. The given %diabetes data has a kurtosis of approximately 4, slightly higher than the value of 3 for a normal distribution, and its quantile-quantile plot reveals a significant departure from normality; the given %inactivity data has a kurtosis of about 2, somewhat lower than the value of 3 for a normal distribution. Kurtosis is critical. The linear least squares model is a technique used to identify the line that best illustrates the relationship between the variables, by minimizing the sum of the squared differences between the predicted and observed values of the data points. In any linear model it is very important to examine the residuals, which represent the error, or unexplained variation, in the data. When the residuals are examined, the kurtosis value of 4.07, higher than 3, together with the quantile plot, indicates a deviation from normality for the residuals. This can create issues in testing for heteroscedasticity. Checking heteroscedasticity is important to ensure that the assumptions of the model are met and that the results are valid. The Breusch-Pagan test is used to test for heteroscedasticity analytically. When we plot the residuals against the predicted values from the linear model, the fanning out of the residuals as the fitted values get large is an indicator that the linear model is not reliable, hence heteroscedastic.
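A sketch of the kurtosis comparison described above, on synthetic samples (a normal sample versus a heavy-tailed t-distributed one; the distributions and sizes are illustrative, not the course data). Note that scipy reports excess kurtosis by default, so `fisher=False` is needed to get the scale on which a normal distribution scores 3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
normal_sample = rng.normal(0, 1, 5000)
heavy_tailed = rng.standard_t(df=5, size=5000)  # fatter tails than normal

# fisher=False gives Pearson kurtosis, which is 3 for a normal distribution.
k_normal = stats.kurtosis(normal_sample, fisher=False)
k_heavy = stats.kurtosis(heavy_tailed, fisher=False)
print(f"normal kurtosis = {k_normal:.2f}, heavy-tailed kurtosis = {k_heavy:.2f}")
```

A kurtosis above 3, like the 4.07 seen for the residuals, signals tails heavier than normal, which is exactly what the quantile-quantile plot picks up as a departure from normality.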