December 8, 2023

The ARIMA (Autoregressive Integrated Moving Average) model is a popular statistical method for time series forecasting that captures the dynamics of the series through three main parameters: AR (p), I (d), and MA (q).
AR (Autoregression): Refers to the use of previous values in the time series to predict future values.
I (Integrated): Represents the differencing of raw observations to make the time series stationary, which means that the series has constant mean and variance over time.
MA (Moving Average): Incorporates the dependency between an observation and a residual error from a moving average model applied to lagged observations.
The lines in each graph represent:
Blue Line (Actual): The actual observed values from the dataset.
Orange Line (Fitted): The values predicted by the ARIMA model.
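As a rough illustration, a fitted-vs-actual plot like the ones described here could be produced with statsmodels. The file name economic_indicators.csv, the column name logan_passengers, and the (1,1,1)(1,1,1,12) orders below are assumptions for the sketch, not the exact settings behind the figures.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
series = df["logan_passengers"]

# Seasonal ARIMA: AR order p=1, differencing d=1, MA order q=1, monthly seasonality s=12
fit = ARIMA(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()

plt.plot(series, color="blue", label="Actual")              # blue line: observed values
plt.plot(fit.fittedvalues, color="orange", label="Fitted")  # orange line: in-sample predictions
plt.legend()
plt.title("Logan Passengers: ARIMA actual vs fitted")
plt.show()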

Logan Passengers: The ARIMA model appears to track the seasonal pattern and general trend of the passenger data quite closely.

Logan International Flights: The model captures the seasonality and fluctuations in the number of flights well.


Hotel Occupancy Rate: The ARIMA model follows the actual occupancy rates, including the seasonal peaks and troughs.

Hotel Average Daily Rate: The model fits the average daily rate with some accuracy, again reflecting the seasonality in the data.

Total Jobs: The ARIMA model fits the data tightly, suggesting that the model is capturing the underlying trend effectively.

Unemployment Rate: The model seems to follow the actual rate closely, including the downward trend over time.

Labor Force Participation Rate: The ARIMA model provides a reasonable fit to the data, capturing the stability of the participation rate over time.

December 6, 2023

The Partial Autocorrelation Function (PACF) plots show the partial correlation of each time series with its own lagged values, controlling for the values of the time series at all shorter lags. This is helpful in identifying the order of the autoregressive (AR) part of an ARIMA model. Here’s how to interpret these plots:

Logan Passengers: The PACF plot for ‘Logan Passengers’ might show significant partial autocorrelations at one or more lags. Significant spikes (those that cross the blue confidence interval) suggest that those lags have a predictive relationship with the current value, after accounting for the relationships at all shorter lags. If such spikes occur at the first few lags and then cut off, it indicates an AR process of that order.

Logan Intl Flights: Like ‘Logan Passengers’, look for significant spikes in the early lags. The number of significant lags can indicate the order of an AR process for ‘Logan Intl Flights’. If there are no significant spikes or they are sporadic, it might suggest that an AR process is not appropriate.

Hotel Occupancy Rate: If there are significant spikes at fixed intervals, it may suggest seasonality in the data. Otherwise, the number and position of significant spikes can help determine the order of the AR process.

Labor Force Participation Rate: This PACF plot would be analyzed in the same manner, identifying the number of significant lags to determine the potential order of an AR process.

Hotel Average Daily Rate: If significant partial autocorrelations are present, they indicate the potential order of the AR process. If they decay gradually, it might suggest a mixed ARMA process.

Total Jobs: Look for the point at which the partial autocorrelations become insignificant. This will give you the suggested order of the AR process for the ‘Total Jobs’ series.

Unemployment Rate: As with the others, the presence and position of significant partial autocorrelations will inform the choice of AR order for modeling the ‘Unemployment Rate’.
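As a minimal sketch of how such a PACF plot can be drawn (assuming the same hypothetical economic_indicators.csv file and column names used elsewhere in these notes):

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")

# Partial autocorrelations for the first 24 monthly lags; the shaded band is the
# approximate 95% confidence interval, so spikes outside it are the "significant" ones.
plot_pacf(df["logan_passengers"], lags=24, method="ywm")
plt.title("PACF: Logan Passengers")
plt.show()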

December 4, 2023

These are Autocorrelation Function (ACF) plots for the various time series. The plots show the correlation of each time series with its own past values at different lags (time intervals). The blue shaded area in each plot represents the confidence interval, typically set at 95%. Correlation values outside of this area are considered statistically significant.

Here’s a brief explanation for each graph:

Logan Passengers: The ACF plot for ‘Logan Passengers’ shows that there are a few lags where the correlation is significant. This suggests that past values of the series have some correlation with future values, hinting at a potential AR process.

Logan Intl Flights: The ‘Logan Intl Flights’ ACF plot indicates significant autocorrelation at a few initial lags. This might suggest an autoregressive component in the time series, which could be used in model identification.

Hotel Occupancy Rate: This plot shows several significant spikes, suggesting a strong seasonal pattern or an AR component that repeats at regular intervals.

Labor Force Participation Rate: The ACF for ‘Labor Force Participation Rate’ shows fewer significant correlations, indicating that the series might not be strongly dependent on its past values, or it might be a more complex model that does not fit neatly into an AR or MA process.

Hotel Average Daily Rate: The plot displays very few significant correlations at specific lags, which may suggest that the series has a less pronounced AR structure.

Total Jobs: Significant correlations at a few initial lags are visible, which could indicate an AR process at work. The data might be influenced by its values in the near past.

Unemployment Rate: The ACF plot shows almost no significant autocorrelation at any lag, suggesting that past values do not have a strong linear relationship with future values.
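Beyond eyeballing the plots, the significant lags can be listed programmatically. This sketch assumes the same hypothetical file and column names; acf() with alpha=0.05 returns the autocorrelations together with their 95% confidence bounds.

import pandas as pd
from statsmodels.tsa.stattools import acf

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")

acf_vals, conf_int = acf(df["unemp_rate"], nlags=24, alpha=0.05)

# A lag is treated as significant when zero falls outside its confidence interval.
significant = [lag for lag in range(1, len(acf_vals))
               if conf_int[lag, 0] > 0 or conf_int[lag, 1] < 0]
print("Significant ACF lags for unemp_rate:", significant)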

December 1, 2023

These are time series plots of the differenced economic indicators, each annotated with a p-value indicating the significance of a statistical test, likely a unit root test such as the Augmented Dickey-Fuller (ADF) test. Here’s an analysis of what each plot represents:

Differenced logan_passengers: This plot shows the changes in the number of passengers over time after differencing the data (likely to achieve stationarity). The p-value suggests that the differenced series is stationary (p < 0.05).

Differenced logan_intl_flights: Like the passengers’ plot, this shows the changes in the number of international flights. The p-value is above the common threshold of 0.05, suggesting that the series may not be stationary.

Differenced hotel_occup_rate: This graph displays the changes in the hotel occupancy rate over time. The p-value is 0.0000, which is highly significant and indicates stationarity of the differenced series.

Differenced hotel_avg_daily_rate: Shows the changes in the average daily rate for hotels. The p-value again indicates that the differenced series is stationary.

Differenced total_jobs: This represents the changes in the total number of jobs. The p-value is not below the 0.05 threshold, suggesting non-stationarity.

Differenced unemp_rate: The changes in the unemployment rate over time are plotted here. The p-value is greater than 0.05, suggesting that the series may not be stationary.

Differenced labor_force_part_rate: Shows the changes in the labor force participation rate. The p-value is close to the threshold, which could suggest marginal stationarity depending on the specific significance level you are using.

In each plot, the time series data have been differenced, which is a common technique to remove trends and seasonal patterns and achieve stationarity in time series analysis. Stationarity is an important assumption in many time series models, and the ADF test is often used to test for it. A low p-value (typically < 0.05) in the ADF test suggests that the null hypothesis of the presence of a unit root can be rejected, implying stationarity.
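A sketch of how the differenced series and their ADF p-values could be produced, assuming the hypothetical file and column names used in these notes:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")

cols = ["logan_passengers", "logan_intl_flights", "hotel_occup_rate",
        "hotel_avg_daily_rate", "total_jobs", "unemp_rate", "labor_force_part_rate"]
for col in cols:
    diffed = df[col].diff().dropna()                 # first difference removes the trend
    stat, pvalue = adfuller(diffed, autolag="AIC")[:2]
    verdict = "stationary" if pvalue < 0.05 else "possibly non-stationary"
    print(f"Differenced {col}: ADF p-value = {pvalue:.4f} ({verdict})")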

November 29, 2023

The Augmented Dickey-Fuller (ADF) test evaluates the null hypothesis that a unit root is present in a time series. Each plot is labeled with an “ADF Statistic” and a “p-value,” which are used to determine whether a time series is stationary.

Here is the analysis:

ADF Test: Total Jobs
The plot shows the time series data for “total_jobs.”
The ADF statistic is positive, and the p-value is very high (0.9475), indicating strong evidence that the series is non-stationary.

ADF Test: Unemp Rate
This is the time series data for “unemp_rate.”
The ADF statistic is negative, but the p-value is not below the common threshold of 0.05 (0.4789), suggesting the series is likely non-stationary.

ADF Test: Logan Passengers
This plot represents “logan_passengers” over time.
The ADF statistic is positive, and the p-value is extremely high (0.9853), indicating that the series is non-stationary.

ADF Test: Logan Intl Flights
The time series data for “logan_intl_flights” is shown.
The ADF statistic is negative, and the p-value is 0.2306, which is above the 0.05 threshold, suggesting non-stationarity.

ADF Test: Hotel Occup Rate
The plot displays the “hotel_occup_rate” time series.
The ADF statistic is negative, with a p-value of 0.4359, again indicating non-stationarity as the p-value is above 0.05.

ADF Test: Hotel Avg Daily Rate
This plot shows the “hotel_avg_daily_rate” time series.
The ADF statistic is negative, and the p-value is very low (0.0058), suggesting that the series is stationary.

ADF Test: Labor Force Participation Rate
This plot shows the “Labor_Force_Part_Rate” time series.
The ADF statistic is positive, with a p-value of 0.9691. With a p-value well above the common threshold of 0.05, the test suggests that the series is non-stationary.

For the ADF test, a p-value below a threshold (commonly 0.05) indicates stationarity, meaning there is no unit root present in the time series. A non-stationary time series is characterized by a changing mean or variance over time, which can be problematic for many types of time series analysis, including forecasting.
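For a single raw (undifferenced) series, the test boils down to a few lines; this is a sketch under the same hypothetical file and column-name assumptions.

import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")

stat, pvalue = adfuller(df["total_jobs"], autolag="AIC")[:2]
print(f"ADF statistic = {stat:.3f}, p-value = {pvalue:.4f}")
# Null hypothesis: a unit root is present (the series is non-stationary).
# Only a p-value below 0.05 would lead us to reject it and call the series stationary.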

November 27, 2023

The joint kernel density estimate (KDE) plots illustrate the relationship between different economic indicators and the total number of jobs, using data from the provided dataset. Here’s an analysis of each plot:

Total Jobs vs Hotel Average Daily Rate:
The plot suggests a concentration of points where the average daily hotel rate is around $250, with the highest job numbers.
This may indicate that when hotel rates are at a moderate level, it is correlated with higher employment, possibly due to balanced tourism or business travel activities.

Total Jobs vs Hotel Occupancy Rate:
The highest density is observed at occupancy rates between 0.7 and 0.9, which could suggest a positive association with total jobs.
This pattern implies that higher hotel occupancy rates, potentially indicating higher tourist or business activity, might correspond with higher employment levels.

Total Jobs vs Unemployment Rate:
The density is elongated and negatively sloped, indicating an inverse relationship between the unemployment rate and total jobs, which is expected.
As the unemployment rate decreases, the total number of jobs tends to increase.

Total Jobs vs Labor Force Participation Rate:
The plot shows a slight positive trend, with higher job numbers corresponding to a labor force participation rate mainly between 0.63 and 0.67.
This could imply that higher labor force participation goes hand in hand with a stronger job market.

Total Jobs vs Logan International Flights:
The density suggests a positive relationship, with a greater number of jobs associated with an increased number of international flights.
This may reflect the impact of international travel on local employment, particularly in sectors linked to travel, tourism, and possibly international business.

Total Jobs vs Logan Passengers:
Like international flights, there is a positive correlation with the number of passengers.
The highest density of job numbers coincides with passenger numbers around 3 million, indicating that air travel volume may positively influence employment figures.
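One way such joint KDE plots can be drawn is with seaborn’s jointplot; the file and column names below are assumptions for the sketch.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("economic_indicators.csv")

# Bivariate kernel density estimate of total jobs against one indicator;
# repeat with the other columns for the remaining panels.
sns.jointplot(data=df, x="unemp_rate", y="total_jobs", kind="kde", fill=True)
plt.show()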

November 24, 2023

Total Jobs vs Logan Passengers:
There is a positive relationship between the number of passengers at Logan Airport and the total number of jobs. The R-Squared value is approximately 0.729, suggesting that about 72.9% of the variability in total jobs can be explained by the number of Logan passengers. The p-value is extremely low (approximately 3.57×10⁻¹⁵), indicating a statistically significant relationship.

Total Jobs vs Logan International Flights:
Similarly, the number of international flights has a positive correlation with the total number of jobs. The R-Squared value is 0.764, meaning that approximately 76.4% of the variability in total jobs is accounted for by the number of international flights. The p-value is very small (around 3.04×10⁻¹⁷), which implies a statistically significant relationship.

Total Jobs vs Hotel Occupancy Rate:
The relationship between hotel occupancy rates and total jobs is weaker compared to the previous two variables. The R-Squared value is about 0.142, indicating that only 14.2% of the variability in total jobs is explained by the hotel occupancy rate. The p-value is approximately 0.197, which is above the typical significance level of 0.05, suggesting that the relationship might not be statistically significant.

Total Jobs vs Hotel Average Daily Rate:
There is a moderate positive relationship between the average daily rate of hotels and total jobs. The R-Squared value is 0.313, which means that about 31.3% of the variability in total jobs can be explained by the hotel average daily rate. The p-value is approximately 0.0038, indicating a statistically significant relationship at common significance levels.

Total Jobs vs Unemployment Rate:
There is a strong negative relationship between the unemployment rate and the total number of jobs, which is intuitive as higher unemployment would typically be associated with fewer jobs. The R-Squared value is about 0.872, suggesting that 87.2% of the variability in total jobs can be explained by the unemployment rate. The p-value is extremely low (around 4.10×10⁻²⁷), indicating a very strong statistically significant relationship.
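The R-Squared and p-values quoted above are the kind of output a simple per-predictor regression produces; a sketch with scipy, assuming the same hypothetical column names:

import pandas as pd
from scipy import stats

df = pd.read_csv("economic_indicators.csv")

predictors = ["logan_passengers", "logan_intl_flights", "hotel_occup_rate",
              "hotel_avg_daily_rate", "unemp_rate"]
for col in predictors:
    res = stats.linregress(df[col], df["total_jobs"])          # simple linear regression
    print(f"total_jobs ~ {col}: R^2 = {res.rvalue**2:.3f}, p-value = {res.pvalue:.2e}")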

November 22, 2023

R-squared (0.9564): This indicates a very high proportion of variance in the dependent variable (total jobs) is predictable from the independent variables in the model.
Adjusted R-squared (0.9213): This is a modified version of R-squared adjusted for the number of predictors in the model, still indicating a good fit.
MAE (Mean Absolute Error): The average absolute error of the predictions is 3889 jobs.
MSE (Mean Squared Error): The average squared difference between the estimated values and the actual value is 2,229,292.5, a measure that gives higher weight to larger errors.
RMSE (Root Mean Squared Error): The square root of MSE, which is 4708 jobs, gives an idea of the magnitude of the errors in the same units as the dependent variable (total jobs).

R-squared (0.9564): This value is very high, suggesting the model explains a large proportion of the variance in the validation dataset.
Adjusted R-squared (0.9213): This is also high, indicating that the number of predictors in the model is appropriate for the data and that the model fits the validation data well.
MAE (Mean Absolute Error) (3888.99): On average, the model’s predictions are off by approximately 3889 jobs from the actual values.
MSE (Mean Squared Error) (2,229,292.5): This is relatively high, influenced by the squared nature of the metric which gives more weight to larger errors.
RMSE (Root Mean Squared Error) (4708.31): This is the square root of the MSE and provides an error term in the same units as the predicted variable (total jobs). This value suggests that typical predictions are within approximately 4708 jobs of the actual values.
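These metrics can be computed directly from the validation predictions. In this sketch, y_true, y_pred, and the predictor count p are placeholders for whatever the fitted model produced; it is a general recipe, not the exact code behind the numbers above.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_metrics(y_true, y_pred, p):
    """R-squared, adjusted R-squared, MAE, MSE, and RMSE for a set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalise for the number of predictors p
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"R-squared: {r2:.4f}  Adjusted R-squared: {adj_r2:.4f}")
    print(f"MAE: {mae:.2f}  MSE: {mse:.1f}  RMSE: {rmse:.2f}")

# e.g. report_metrics(y_validation, model.predict(X_validation), p=number_of_predictors)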

November 20, 2023

Hotel Average Daily Rate: Shows a distribution with a clear peak, which may indicate a common average daily rate around which hotel prices are centered. The spread of the plot might suggest variations in pricing, which could reflect different hotel categories or seasonal pricing strategies.

Hotel Occupancy Rate: Has a unimodal distribution, perhaps indicating that most hotels maintain a consistent occupancy rate, with fewer occurrences of very low or very high occupancy. This could reflect a stable demand for accommodation in Boston.

Labor Force Participation Rate: Shows a tight distribution, indicating that the participation rate does not fluctuate widely and remains relatively stable over time.

Logan International Flights: Displays the distribution of the number of international flights at Logan Airport. A unimodal, possibly slightly skewed distribution would suggest that there’s a common range of flight numbers, with occasional periods of increased or decreased international traffic.

Logan Passengers: Shows a distribution potentially skewed to one side, indicating variability in passenger numbers. Peaks could correspond to high-travel seasons or specific events that attract more travelers.

Total Jobs: Appears to illustrate the distribution of total job numbers in Boston. The distribution might be relatively broad, indicating variability in employment numbers, which could be influenced by economic cycles, job market health, and seasonal employment trends.

Unemployment Rate: Seems to have a pronounced peak, suggesting the most common unemployment rate that the city experiences. A narrower peak could indicate a relatively stable unemployment rate over the period analyzed.

November 17, 2023

Logan Passengers: Shows the frequency distribution of the number of passengers traveling through Logan Airport. The distribution seems to be skewed to the right, indicating that there are months with exceptionally high passenger numbers, possibly during peak travel seasons or special events.

Logan Intl Flights: The histogram for international flights also appears to be right-skewed, suggesting that while there is a consistent average number of flights, there are periods with significantly higher international traffic.

Hotel Occupancy Rate: Shows a potential left-skewed distribution, indicating that there are fewer instances of low occupancy rates and a tendency toward higher occupancy in most months.

Hotel Avg Daily Rate: Exhibits a somewhat uniform distribution with several peaks, suggesting that there are common price points at which hotels set their daily rates, with fluctuations around these points.

Total Jobs: Looks to be normally distributed with a slight right skew, implying that most months have a consistent number of jobs, with occasional peaks possibly due to seasonal employment or economic growth.

Unemployment Rate: Shows a right-skewed distribution, indicating that lower unemployment rates are more common, with fewer occurrences of higher rates.

Labor Force Participation Rate: Appears somewhat normally distributed, suggesting that the labor force participation rate in Boston remains relatively stable over time.

November 15, 2023

Understanding the general pattern in order to determine Boston’s tourism peak times:

The monthly passenger data at Logan Airport paints a vivid picture of tourism trends in Boston. Throughout the year, we can see distinct peaks and valleys in the number of passengers, which correspond to the highs and lows of the tourism season. This cyclical pattern is a key indicator of when the city experiences its highest influx of visitors.

Moreover, by looking at the graph year-over-year, we can observe whether tourism in Boston is growing, remaining stable, or facing a decline. Such insights are crucial for stakeholders in the tourism industry, as they provide a clear view of the most popular times for tourists in the city. This information can be invaluable for planning purposes, whether it’s for staffing needs, marketing campaigns, or resource management. In essence, the passenger numbers at Logan Airport offer a reliable barometer for understanding and anticipating the dynamics of tourism in Boston.

November 13, 2023

I began examining information from Analyze Boston, the open data repository for the City of Boston. This is a legacy dataset of economic indicators from the Boston Planning & Development Agency (BPDA), which is charged with planning and guiding inclusive growth in the City of Boston; the indicators were recorded monthly between January 2013 and December 2019. A wide range of economic data on employment, housing, travel, and real estate development is gathered and analyzed by the BPDA. I was able to remove all null values.
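A sketch of that initial cleaning step, assuming the export is a CSV file (the file name below is a placeholder):

import pandas as pd

df = pd.read_csv("economic-indicators.csv")   # monthly indicators from Analyze Boston
print(df.isna().sum())                        # how many null values each column has
df = df.dropna()                              # drop the rows containing null values
print(df.shape)                               # rows and columns that remain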

November 10, 2023

The decision tree divides the data into branches that help reveal patterns and relationships. It can show whether individuals of a particular race in a particular age group are more or less likely to be involved in police shootings. The decision tree also lets us see how different types of police shootings have occurred. Using a decision tree provides a clear and descriptive way to understand the complex relationships in the given data. It will also help us identify specific situations where police shootings are more common for certain age groups and ethnic groups. This can be a powerful tool for uncovering biases or trends that may not be immediately obvious from the raw data alone.
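A minimal sketch of fitting such a tree with scikit-learn. The file name and the column names (age, race, manner_of_death) are assumptions about the shootings dataset, not a description of the actual code used.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

df = pd.read_csv("police_shootings.csv").dropna(subset=["age", "race", "manner_of_death"])
X = pd.get_dummies(df[["age", "race"]], columns=["race"])   # one-hot encode race
y = df["manner_of_death"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
plot_tree(tree, feature_names=list(X.columns),
          class_names=[str(c) for c in tree.classes_], filled=True)
plt.show()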

October 27, 2023

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): This algorithm forms clusters based on the density of data points. It groups points that are closely packed together and marks points in low-density regions as outliers. DBSCAN can find clusters of any shape and is good at separating high-density clusters from noise in the data.
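A sketch of DBSCAN on this kind of data, assuming hypothetical age and location columns. The eps and min_samples parameters control what counts as a dense neighbourhood, and a label of -1 marks noise.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

df = pd.read_csv("police_shootings.csv").dropna(subset=["age", "longitude", "latitude"])
X = StandardScaler().fit_transform(df[["age", "longitude", "latitude"]])

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
print(pd.Series(labels).value_counts())   # cluster sizes; -1 is the noise label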

October 25, 2023

K-means: This method groups data into ‘k’ clusters by minimizing the distance between data points and the center of their assigned cluster. The ‘means’ in the name refers to averaging the data points to find the center of each cluster. It works well for spherical clusters.

K-medoids: Similar to K-means, but instead of using the mean, it uses actual data points as the center of the cluster, known as medoids. This method is more robust to noise and outliers compared to K-means because medoids are less influenced by extreme values.
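A short K-means sketch under the same hypothetical column-name assumptions; the K-medoids variant lives in the optional scikit-learn-extra package and is shown only as a comment.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("police_shootings.csv").dropna(subset=["age", "longitude", "latitude"])
X = StandardScaler().fit_transform(df[["age", "longitude", "latitude"]])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", pd.Series(kmeans.labels_).value_counts().to_dict())

# K-medoids uses actual data points as cluster centres (medoids):
#   from sklearn_extra.cluster import KMedoids
#   kmedoids = KMedoids(n_clusters=4, random_state=0).fit(X)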

October 23, 2023

Effect Size: Using effect size is like measuring the strength of an observed pattern. It tells us not only that there is a difference, but how big that difference is. Using effect sizes for age differences between racial groups in police shootings will not only tell us that there are significant differences, but also allow us to understand how large those differences are in practical terms. This helps us understand the true importance of the findings, rather than simply recognizing that the differences are not due to chance.
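One common effect-size measure for a difference in group means is Cohen’s d; a small sketch (the group arrays in the usage comment are placeholders):

import numpy as np

def cohens_d(a, b):
    """Standardised difference in means between two groups."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# e.g. d = cohens_d(ages_group_1, ages_group_2); |d| near 0.2 is small, 0.5 medium, 0.8 large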

October 16, 2023

Statistical Significance: It helps us determine whether findings from the data, such as differences in the ages of different groups of people killed by police, reflect a real pattern or could have arisen by chance. If something is statistically significant, it means we can be fairly confident that the patterns we see in the data (such as one group being younger or older than another) are real and not just random noise.

October 13, 2023

If we want to see how age and race might affect certain outcomes, such as police contacts, it is inefficient to run a bunch of separate sub-tests for each category. ANOVA is a technique that allows us to look at all of these factors at once. It is a single, simplified test that can tell us whether age, race, or a mixture of both has an effect, giving us a clear answer.
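One way to realise this (as a sketch, with hypothetical file and column names) is a one-way ANOVA of age across race groups via statsmodels; additional factors can be added to the formula in the same way.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("police_shootings.csv").dropna(subset=["age", "race"])

model = smf.ols("age ~ C(race)", data=df).fit()   # ANOVA as a linear model with a categorical factor
print(sm.stats.anova_lm(model, typ=2))            # F statistic and p-value for the race factor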

November 8, 2023

Means: This refers to the average age of people in each racial group who were killed by police. We compare these averages to see if there is a significant difference between racial groups; we might find that one group tends to be older or younger on average when they are killed by police. Knowing the average ages can help us understand broader patterns and possibly identify factors that might contribute to these tragic events.

November 6, 2023

Variance tells us how spread out the ages are in each racial group of people killed by police. If everyone were the same age, the variance would be zero because there is no spread. But in real life ages vary, so we get a number that tells us how much they differ from the average age. Here, different racial groups have different variances in ages. This is important because when we use methods like ANOVA (Analysis of Variance), we assume that these variances are roughly the same across groups. That is not the case here, which means we might need to use statistical methods that do not make this assumption to understand the data accurately.
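The group variances, and a formal check of the equal-variance assumption (Levene’s test), can be computed as follows; the file and column names are assumptions.

import pandas as pd
from scipy import stats

df = pd.read_csv("police_shootings.csv").dropna(subset=["age", "race"])
print(df.groupby("race")["age"].var())            # sample variance of age within each group

# Levene's test: null hypothesis is that all groups share the same variance.
groups = [g["age"].to_numpy() for _, g in df.groupby("race")]
stat, pvalue = stats.levene(*groups)
print(f"Levene statistic = {stat:.3f}, p-value = {pvalue:.4f}")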

November 3, 2023

ANOVA and t-tests are used to obtain p-values, which are then used to assess the chance of observing a given difference in means. From a Bayesian standpoint, this so-called frequentist method of judging a large difference in means is considered fundamentally flawed, particularly in cases where it is already evident that the null hypothesis is false. However, a Bayesian approach to inference for the age and race data appears to be impractical here, mainly because, although the data exist, we do not know of any prior distributions, have any idea what those priors might look like, or have any idea in what sense the age and race data are evidence of change.

November 1, 2023

The median age of the Hispanic group is 32 years, meaning that half of the individuals are younger and half are older. The mean age of 33.59 years is somewhat above the median, suggesting some skewness in the age distribution. The spread of ages in this group is indicated by the standard deviation of 10.74 years. The skewness of 0.803 indicates a moderate right skew, with a greater proportion of younger than older people. The kurtosis of 3.725, which is higher than that of a normal distribution, indicates that the age distribution for Hispanics has a sharper peak and heavier tails.

The Native American group shows a similar age distribution, with a mean age of about 32 years. The spread of ages is indicated by a standard deviation of 8.949 years. A skewness value of 0.565 indicates a mild right skew, less pronounced than in the Hispanic distribution. The kurtosis value of 2.883 suggests a somewhat flatter peak and thinner tails than the normal distribution.

October 30, 2023

I imported the age and race data, keeping only records that have values for both variables. For the Asian population, the median age is 35 years. The mean age is about 35.96 years, with a standard deviation of 11.5921 years (variance 134.377). The skewness of the Asian age distribution, which measures its asymmetry, is 0.327765, and the kurtosis, which describes the sharpness of the peak and the heaviness of the tails, is 2.35263. For the Black population, the median age is 31 years, and the mean is slightly higher at 32.9281 years, with a standard deviation of 11.3886 years (variance 129.701). With a skewness of 0.962894, the age distribution of the Black population is more strongly right-skewed, indicating greater asymmetry. Compared to the Asian population, the kurtosis of 3.81164 indicates a sharper peak and heavier tails.
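A sketch of how these per-group summaries could be tabulated with pandas (file and column names assumed; pandas uses bias-corrected skewness and excess kurtosis, so the numbers may differ slightly from those quoted above):

import pandas as pd

df = pd.read_csv("police_shootings.csv").dropna(subset=["age", "race"])

summary = df.groupby("race")["age"].agg(["count", "median", "mean", "std", "var", "skew"])
summary["kurtosis"] = df.groupby("race")["age"].apply(pd.Series.kurt) + 3   # convert excess kurtosis to ordinary kurtosis
print(summary)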

October 20, 2023

The ages of individuals vary from 13 to 88 years. The median age is 31.7 years, slightly below the mean age of 32.7 years. The standard deviation, which measures the spread around the mean, is about 11.4 years. Since the skewness value is about 0.99, the data appear to be moderately skewed to the right. The kurtosis value of 3.91 indicates a relatively peaked distribution with heavier tails, deviating from the kurtosis of 3 for a normal distribution. This could point to outliers or a concentration of cases at certain ages.

October 18, 2023

Looking at the statistics of the age distribution after removing the records that do not include age values: because the mean and median differ, the distribution of ages is noticeably skewed to the right; in fact, the skewness is about 0.73. The kurtosis is close to 3, indicating that there are no significant fat tails or peaks around the mean. The percentage of the right tail of the age distribution that lies more than 2 standard deviations from the mean is somewhat greater than what we would get for a standard normal distribution.

October 11, 2023

Several separate tests comparing groups based on factors such as age and race can be repetitive and time-consuming. Instead, the ANOVA test is a much better method. This test allows us to compare all groups at once, making analysis easier and faster, and helps us to understand whether and how these factors are related.

October 6, 2023

The data presented in the residual plots for diabetes and inactivity provide valuable insight into the characteristics of the residuals for this sample. Using these data, we can collectively assess the linear regression model. The residuals show a moderate spread, are roughly symmetric about the center, and have few outliers. While these characteristics are important for understanding model performance, they do not directly reflect the strength or uncertainty of the relationship between diabetes and inactivity, which would require examining the regression coefficients and R-squared values.

October 4, 2023

The bootstrap is a versatile and effective statistical tool that may be used to calculate the level of uncertainty surrounding a certain estimate or statistical learning technique. It can offer a confidence range for a coefficient or an estimate of the standard error of that coefficient. The bootstrap method allows the computer to recreate the process of getting fresh data sets, allowing us to estimate our estimate’s variability without having to create more samples.
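A sketch of a bootstrap for a simple regression slope, resampling rows with replacement; the column names in the usage comment are placeholders.

import numpy as np
from scipy import stats

def bootstrap_slope(x, y, n_boot=2000, seed=0):
    """Bootstrap standard error and 95% confidence interval for a regression slope."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(x), len(x))       # resample observations with replacement
        slopes[i] = stats.linregress(x[idx], y[idx]).slope
    return slopes.std(ddof=1), np.percentile(slopes, [2.5, 97.5])

# e.g. se, ci = bootstrap_slope(df["%inactivity"], df["%diabetes"])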

October 2, 2023

Bootstrapping creates many different versions of the dataset, which allows us to see how different subsets of the data can affect the result. Bootstrapping also helps us get a more accurate understanding of the overall diabetes population, and it can help us calculate more accurate confidence intervals. Bootstrapping will also help us assess how the model performs on data it has not seen.

September 29, 2023

The t-test is a parametric test of difference, meaning that it makes the same assumptions about your data as other parametric tests. The t-test assumes your data are independent, normally distributed, and have a similar amount of variance within each group being compared. A t-test helps determine whether the difference in means between two groups is due to a real effect or has occurred randomly.

The Monte Carlo permutation test is a technique used to determine whether the observed difference between two groups is real or has occurred by chance. It is very helpful when dealing with complex data.
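A sketch of such a permutation test for a difference in group means (the group arrays are placeholders, e.g. pre-molt and post-molt measurements or two age groups):

import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                                    # shuffle the group labels
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm                          # observed difference, p-value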

September 27, 2023

When we carry out 5-fold cross-validation on the given data, which consists of obesity, inactivity, and diabetes values, there are 354 data points containing all 3 variables. With increasing model complexity, the training error often decreases and tends to underestimate the test error. We randomly divide the data into 5 roughly equal groups of 71, 71, 71, 71, and 70 points, with no repeats, which we then use for cross-validation to estimate the test error. Using the 5 training and test splits, we build polynomial models on the training data and compute the mean squared error on the test data.
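A sketch of this procedure with scikit-learn, modelling %diabetes from %inactivity with polynomial features; the file and column names are assumptions.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("cdc_2018.csv").dropna(subset=["%inactivity", "%diabetes"])
X, y = df[["%inactivity"]].to_numpy(), df["%diabetes"].to_numpy()

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # 354 points -> folds of 71/71/71/71/70
for degree in [1, 2, 3, 4]:
    fold_mse = []
    for train_idx, test_idx in kf.split(X):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    print(f"degree {degree}: mean test MSE = {np.mean(fold_mse):.3f}")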

September 25, 2023

K-fold Cross-Validation is a widely used approach for estimating test error. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K-1 parts, and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, …, K, and then the results are combined. We cannot apply cross-validation in step 2 directly without applying it in step 1, because that would ignore the fact that in step 1 the procedure has already seen the labels of the training data and made use of them. This is a form of training and must be included in the validation process.

September 22, 2023

In the given CDC 2018 diabetes data, we discovered that how your weight (BMI) affects your chances of getting diabetes depends on your age. It’s kind of like BMI has a different role for younger and older people. Instead of using straight lines, we might have to use curvy lines when looking at the data. We use special math to handle these curvy lines: when the relationship between factors like your weight and diabetes isn’t a straight line, we need to use math that fits the curves to make it simpler to understand. We use step functions to catch quick jumps, like when a significant increase in exercise suddenly has a big impact on risk.

September 20, 2023

Variance is a measurement of how spread out or dispersed the values in a dataset are. A smaller variance indicates that the data points are closer to the mean and less dispersed, while a larger variance suggests that the data points are more spread out from the mean. A t-test is a way to figure out if two sets of data are different from each other in a meaningful way, taking into account the variability within each group and the size of the samples. In a normality test, the p-value shows whether the data look normal: if it’s below 0.05, the data might not be normal; if it’s above 0.05, the data are close to normal. A pre-molt histogram shows the distribution of data before an event, and a post-molt histogram shows the data after the event to check for changes. A Monte Carlo permutation test involves repeatedly shuffling the data to simulate outcomes under random chance.

September 18, 2023

A linear regression model with more than one predictor variable is known as multiple linear regression. For Galton, “regression” referred only to the tendency of extreme data values to “revert” to the overall mean value. Predictor variables are also known as factors; %inactivity and %obesity are the factors (predictor variables) for %diabetes. A high R-squared value indicates how well the model fits the observed data. Cross-validation is a technique for estimating error and checking for overfitting.
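A sketch of the multiple regression of %diabetes on %inactivity and %obesity with statsmodels; the file name is a placeholder, and patsy’s Q() is used only because the column names contain ‘%’.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("cdc_2018.csv").dropna(subset=["%diabetes", "%inactivity", "%obesity"])

model = smf.ols("Q('%diabetes') ~ Q('%inactivity') + Q('%obesity')", data=df).fit()
print(model.summary())   # coefficients for both factors and the model's R-squared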

September 15, 2023

The importance of the linear model in data science projects must be emphasized. Measuring distance parallel to the Y-axis in linear models simplifies the calculations. Transformations such as log or exponential are often advised when fitting linear models.

September 13, 2023

Heteroscedasticity matters because we want to ensure that the assumptions of the model are met and that the results are valid. The Breusch-Pagan test, developed in 1979 by Trevor Breusch and Adrian Pagan, is used to test for heteroscedasticity analytically; there are a couple of variants of the test. It uses the residuals from the original linear regression of %diabetes against %inactivity: regress the squared residuals on the predictor, calculate Pearson’s R-squared for that auxiliary regression, and then calculate the p-value. A p-value is a way to assess whether the results of an experiment or study are statistically significant or whether they could have occurred by chance. Here the test statistic is roughly n times the auxiliary R-squared, where n = 1370 is the number of %inactivity data points, and it is compared to a chi-squared distribution with one degree of freedom. Typically, when the p-value is very small, such as below 0.05, there is strong evidence against the null hypothesis. In such cases we reject the null hypothesis that the original linear model is homoscedastic.
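A sketch of running the Breusch-Pagan test on the residuals of that regression with statsmodels (file and column names assumed):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("cdc_2018.csv").dropna(subset=["%diabetes", "%inactivity"])
X = sm.add_constant(df["%inactivity"])
model = sm.OLS(df["%diabetes"], X).fit()

# Regress the squared residuals on the predictors; under homoscedasticity the
# LM statistic follows a chi-squared distribution with one degree of freedom here.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"LM statistic = {lm_stat:.3f}, p-value = {lm_pvalue:.4f}")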

September 11, 2023

The Federal Information Processing Standards (FIPS) codes common to all data sets yield 354 rows of data that contain information on all 3 variables: %obesity, %inactivity, and %diabetes. There is a relatively large number of data points for both diabetes and inactivity. The given %diabetes data have a kurtosis of approximately 4, slightly higher than the value of 3 for a normal distribution, and a quantile-quantile plot reveals a significant departure from normality; the given %inactivity data have a kurtosis of about 2, somewhat lower than the value of 3 for a normal distribution. Kurtosis is critical here. The linear least squares model is a technique used to identify the line that best illustrates the relationship between the variables by minimizing the sum of the squared differences between the predicted and observed values of the data points. In any linear model it is very important to examine the residuals, which represent the error or unexplained variation in the data. When the residuals are examined, the kurtosis of 4.07 (higher than 3) and the quantile plot indicate a deviation from normality for the residuals. This can create issues when testing for heteroscedasticity. Heteroscedasticity matters because we want to ensure that the assumptions of the model are met and that the results are valid; the Breusch-Pagan test is used to test for it analytically. When we plot the residuals versus the predicted values from the linear model, the fanning out of the residuals as the fitted values get larger is an indicator that the linear model is not reliable, i.e., it is heteroscedastic.