
 

Data Science Applied to Health Insurance Data

All the work related to the following investigations can be found in the GitHub Repository.

Intro

In this project, I aimed to showcase the power of data science in solving real-world problems in the insurance industry. Utilizing a dataset sourced from Kaggle, I delved into detailed information about insurance customers, including their age, sex, body mass index (BMI), number of children, smoking status, and region. By analyzing these data points, we gain valuable insights into customer behavior and the factors that contribute to their insurance charges. This knowledge can be used to create insurance plans and marketing campaigns that target specific demographics, ultimately increasing affordability and accessibility for all. The dataset also presents an opportunity to explore deeper questions and build prediction models that estimate charges for various situations, further enhancing the effectiveness of insurance solutions. Overall, this project demonstrates the potential of data science as a tool for finding data-driven solutions in the insurance industry. The ultimate goal is to find and train the best model for predicting charges on unseen data.

In this article, we implement various regression models with different scaling methods on the data and compare the results. This approach allows us to evaluate the performance of the different models and determine the most appropriate one for our specific use case. Comparing the scaling methods also shows which approach to data preparation works best, which can greatly impact the accuracy of the model. Through this process, we aim to optimize the performance of our model and gain a deeper understanding of the data.

Exploratory data analysis

The foundation of any successful data science project is a thorough investigation of the data at the outset. This initial exploration allows for a better understanding of the data source as well as the business insights it holds. By gaining knowledge about the nature of the business, the overall data mining process, and the expectations for the data, we can set a clear direction for the project and ensure that it aligns with the overall goals and objectives. In short, data investigation is a vital step in any data science project, providing valuable insights and direction for the project's success.

At the start of the project, I investigated the data to identify any potential issues or inconsistencies. This included checking for null values and addressing any problems found in the data. It is also important to become familiar with the data types and categorize them accordingly, for example distinguishing between categorical and numerical values. This step not only helps to identify potential problems in the data but also enables a better understanding of it, laying the foundation for accurate and effective analysis.

I started the data manipulation in Python with the necessary libraries, as follows:

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt
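
For completeness, the DataFrame df used throughout can be loaded from the Kaggle CSV as sketched below; the file name insurance.csv is an assumption based on the usual name of the Kaggle medical cost dataset and should be adjusted to match your local copy.

<code>
# Assumed file name for the Kaggle medical cost dataset; adjust to your local copy
df = pd.read_csv("insurance.csv")
df.head()
</code>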

Here is a brief description of the data situation.

<code>

df.info()

df.describe()

</code>

 

[Figure: output of df.info() and df.describe()]

The dataset looks clean and contains no null values. However, based on these observations, I decided to separate the categorical and numeric columns for better analysis in the next steps, as follows:

<code>

categorical = [column for column in df.columns if df[column].dtype == "O"]

numeric = [column for column in df.columns if column not in categorical]

</code>

 

 

[Figure: histograms of the numeric columns]

Based on the results of this analysis, the body mass index (BMI) is approximately normally distributed, and no outliers were identified in any of the numeric columns. The charges column is right-skewed, as previously suggested by comparing the mean and median of the charges column reported by the describe method. These findings provide valuable insight into the distribution of the data. For further modeling, these values need to be scaled, which is discussed below.
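
As a quick check of that skewness, the mean and median of the charges column can be compared and the skewness computed directly; a minimal sketch with pandas:

<code>
# A right-skewed column has mean > median and a positive skewness value
print(df['charges'].mean(), df['charges'].median())
print(df['charges'].skew())
</code>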

 

Feature selection

In continuation of our analysis, further exploration will be conducted on the data to identify the features that have the most impact on the target value, in this case, the charges. This step is critical in understanding the underlying relationships between the features and target variable, allowing us to make data-driven decisions and optimize the performance of our model. By applying advanced analytical techniques, we aim to gain a deeper understanding of the data and improve the accuracy of our predictions.
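
One straightforward way to explore these relationships is to plot charges against each numeric feature; the sketch below uses seaborn and colors the points by smoking status (an assumption about how the plots shown here might be produced).

<code>
# Scatter plots of charges against age and BMI, colored by smoking status
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.scatterplot(data=df, x='age', y='charges', hue='smoker', ax=axes[0])
sns.scatterplot(data=df, x='bmi', y='charges', hue='smoker', ax=axes[1])
plt.show()
</code>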

[Figure: scatter plots of charges against age and BMI]

Based on the results of this analysis, an approximately linear relationship between BMI and charges, as well as between age and charges, can be observed. This finding provides a deeper understanding of the relationships between these variables and informs further steps to optimize the performance of the model. The linear behavior identified here is useful for developing accurate predictions and understanding the factors that contribute to the charges.

 

Categorical values

To investigate potential imbalance in the categorical variables, an analysis is performed to show the ratio of value counts for each categorical column. This gives a comprehensive picture of the distribution of values within the dataset and can inform further actions to address any imbalance issues, helping to ensure the integrity and accuracy of the data.
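
A minimal sketch of computing these ratios with pandas, using the categorical column list defined earlier:

<code>
# Share of each category within every categorical column
for column in categorical:
    print(df[column].value_counts(normalize=True), "\n")
</code>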

The results are as follows:

[Figure: bar charts of value counts for each categorical column]

 

A comprehensive distribution analysis is performed on numeric columns to thoroughly examine the data for any outliers or anomalies. This method allows for a detailed understanding of the distribution of values within the dataset and can inform further actions to address any identified issues. By conducting this analysis, we are able to gain valuable insights and make data-driven decisions to ensure the integrity and accuracy of the data.

 

When analyzing the distribution of data with categorical values, box plots are a valuable tool to use. These plots provide a clear representation of the distribution of values within a dataset, allowing for the identification of outliers and patterns. They are also useful for comparing the distribution of multiple groups or categories, making it easier to detect any significant differences. By utilizing box plots in our analysis, we can gain a deeper understanding of the data and make data-driven decisions to optimize the performance of our model.
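
A minimal seaborn sketch of such box plots, grouping charges by each categorical column:

<code>
# Box plots of charges grouped by each categorical column
for column in categorical:
    sns.boxplot(data=df, x=column, y='charges')
    plt.show()
</code>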

[Figure: box plots of charges grouped by the categorical columns]

Based on the analysis, it has been determined that there is a high variance in the charges above 40000. As a result, further analysis will be conducted specifically on that subset of data to gain a deeper understanding of the distribution and any potential outliers. This approach allows for a more focused analysis and can inform actions to optimize the performance of our model. By isolating and analyzing this specific subset of data, we can gain valuable insights into the underlying relationships and patterns.
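
The high-charge subset can be isolated with a simple filter; a sketch assuming the 40,000 threshold mentioned above:

<code>
# Isolate the high-charge records for a closer look
high_charges = df[df['charges'] > 40000]
print(high_charges.describe())
sns.boxplot(data=high_charges, x='smoker', y='charges')
plt.show()
</code>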

[Figure: box plots for the subset of records with charges above 40,000]

 

 

Dummy variables

In order to effectively analyze and train our data, it is necessary to convert our categorical variables to numerical values. The following code snippet is utilized for this purpose, allowing for more accurate and efficient analysis and modeling. This step is crucial in preparing the data for machine learning algorithms, which typically work with numerical data, and enables a better understanding of the underlying relationships between the features and target variable.

df.replace({"sex":{"male":1,"female":2}, "smoker":{"yes":1,"no":0}},inplace=True)

df = pd.concat([df, pd.get_dummies(df['region'])], axis=1).drop(['region'],axis=1)

df = pd.concat([df, pd.get_dummies(df['children'])], axis=1).drop(['children'],axis=1)

 

In addition to the previous analysis, a correlation heat map has been generated to provide an overview of the relationships between all of the resulting variables. This visualization technique is useful for identifying patterns and correlations in the data and can inform further actions to optimize the performance of our model. The correlation heat map provides a clear and intuitive representation of the data, allowing for quick and easy identification of the most relevant features and relationships.
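
A minimal sketch of generating such a heat map with seaborn after the encoding above:

<code>
# Correlation matrix of the encoded DataFrame, visualized as a heat map
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
</code>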

[Figure: correlation heat map of all variables]

 

Scaling:

The objective of this approach is to compare the performance of our models when using standard scaler and min-max scaler. By implementing these two different scaling methods, we aim to gain a deeper understanding of how the scaling affects the performance of the model and identify the most appropriate method for our specific use case. By investigating the differences in performance, we can make data-driven decisions to optimize the accuracy of our model and ensure the integrity of the analysis.

In this analysis, we will be utilizing the sklearn library in Python for scaling the data. This library provides a wide range of tools for data pre-processing, including the Standard Scaler method. By using this library, we can easily and efficiently apply the scaling method to our data, allowing for a more accurate and robust analysis. The use of industry standard libraries such as sklearn ensures that the results are reliable and widely accepted in the field.

 

from sklearn.model_selection import train_test_split

from sklearn import metrics

from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

 

# Scaling age and bmi with StandardScaler
s = StandardScaler()
s_df = s.fit_transform(df[['bmi','age']])
s_df = pd.DataFrame(s_df, columns=df[['bmi','age']].columns.values)
df1 = df.drop(columns=['age','bmi'])
df1 = pd.concat([df1, s_df], axis=1)

# Scaling age and bmi with MinMaxScaler
mm = MinMaxScaler()
mm_df = mm.fit_transform(df[['bmi','age']])
mm_df = pd.DataFrame(mm_df, columns=df[['bmi','age']].columns.values)
df2 = df.drop(columns=['age','bmi'])
df2 = pd.concat([df2, mm_df], axis=1)

 

Next, the sklearn library is used to split the data into training and testing sets for validation purposes. This approach is widely accepted in the field of machine learning and allows for a more accurate assessment of the model's performance. Given the limited size of our dataset, with only 3,000 records, a simple split of the data into training and testing sets with a 20% test ratio suffices in this case. This ensures the integrity and robustness of the analysis while following best practices in the field.

 

# Train/test split on the MinMax-scaled data (df2); the standard-scaled df1 is split the same way
X = df2.drop(['charges'], axis=1)
y = df2['charges']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)

 

Prediction models

This problem falls under the category of supervised learning, specifically regression. In the initial phase, several models are selected and compared in terms of performance: LinearRegression, DecisionTreeRegressor, and SGDRegressor. The R2 score and mean squared error (MSE) metrics from sklearn are used to evaluate each model's performance.
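
A hedged sketch of how the three models can be fitted and scored on a held-out test set (shown here on the MinMax-scaled split X_train2/X_test2 defined above; the standard-scaled split is handled the same way):

<code>
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "SGDRegressor": SGDRegressor(random_state=42),
}

# Fit each model on the training split and score it on the test split
for name, model in models.items():
    model.fit(X_train2, y_train2)
    predictions = model.predict(X_test2)
    print(name,
          "R2:", metrics.r2_score(y_test2, predictions),
          "MSE:", metrics.mean_squared_error(y_test2, predictions))
</code>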

 

All the models are trained on the training data twice: once with the MinMax-scaled features and once with the standard-scaled features. The R2 score and MSE are then calculated on the predictions for the test data, as follows:

[Figure: R2 score and MSE for each model under both scaling methods]

 

Ensemble method (Voting regression):

The voting regressor, an ensemble method, is chosen to increase the accuracy of the prediction. It is a simple yet powerful method that improves the accuracy of a regression model by combining the predictions of multiple models, and it can also reduce the variance of a single model by combining models with different architectures. GradientBoostingRegressor, RandomForestRegressor, and LinearRegression are combined in our VotingRegressor as follows:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor

# Base models for the ensemble
reg1 = GradientBoostingRegressor(random_state=1)
reg2 = RandomForestRegressor(random_state=1)
reg3 = LinearRegression()

# Voting regressor averaging the predictions of the three base models
ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])
ereg = ereg.fit(X_train, y_train)

 

A significant increase in R2 score and decrease in MSE when using the standard-scaled data demonstrates the advantage of combining the VotingRegressor with the StandardScaler for this particular problem, as shown below.

[Figure: VotingRegressor results and performance comparison charts for both scaling methods]

Visualizing the performance of all models under both the standard-scaled and MinMax-scaled data indicates that the best results are obtained with the VotingRegressor on the standard-scaled data.

 

Residual calculation:

In a regression model, the residual is the difference between the observed value of the dependent variable (y) and the predicted value of the dependent variable (y hat) given a set of independent variables (x). The residual is calculated by subtracting the predicted value of y (y hat) from the actual value of y (y) for each observation in the dataset.

 

Residuals are used to evaluate the performance of a regression model. A good regression model should have residuals that are randomly distributed around zero, with no patterns or trends. If the residuals exhibit patterns or trends, it indicates that the model is not capturing all of the information in the data, and the model may need to be modified or refined.
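
A minimal sketch of computing and plotting residuals for one fitted model (the voting regressor above; X_test and y_test are assumed to be the test half of the split that produced X_train and y_train, and any of the other models can be substituted):

<code>
# Residuals: observed charges minus predicted charges on the test set
predictions = ereg.predict(X_test)
residuals = y_test - predictions

plt.scatter(predictions, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted charges")
plt.ylabel("Residual (y - y_hat)")
plt.show()
</code>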

The residuals for all the models used in this problem are visualized as follows:

[Figure: residual scatter plots for each of the models]

 

Conclusion:

Based on the analysis conducted on the four models mentioned above and considering the limited amount of data available for this specific problem, it is recommended to implement a Voting Regressor approach for constructing a predictor. To optimize the performance of the model, it is suggested to consider the following:

 

- Utilizing cross-validation techniques to evaluate the model's performance and prevent overfitting (a short sketch follows this list).

- Fine-tuning the hyperparameters of the individual models through techniques such as grid search or random search to optimize their performance.

- Incorporating additional relevant data and features, if available, to improve the model's ability to capture the underlying relationships.

- Regularly evaluating and monitoring the model's performance on new data to ensure its continued effectiveness.
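
As a concrete starting point for the first two suggestions, here is a hedged sketch of cross-validating the voting regressor and grid-searching one of its base models; the parameter grid is an illustrative assumption, not a tuned result.

<code>
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# 5-fold cross-validated R2 for the voting regressor
cv_scores = cross_val_score(ereg, X, y, cv=5, scoring='r2')
print("Mean CV R2:", cv_scores.mean())

# Illustrative grid search over a few GradientBoostingRegressor hyperparameters
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
grid = GridSearchCV(GradientBoostingRegressor(random_state=1),
                    param_grid, cv=5, scoring='r2')
grid.fit(X_train2, y_train2)
print("Best parameters:", grid.best_params_)
</code>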