Understanding Life Expectancy

Brendan Hoss
6 min read · Sep 25, 2020

A mystery as old as time: where did it begin, and where will it end? No one knows for certain what the future holds, but a statistical approach helps us understand what helps someone live longer and what hurts their chances. Everyone understands that life is precious, cruel, joyful, and scary. What no one can predict is the unfortunate day when a life changes. Every day you’re tasked with avoiding critical harm in an ever-evolving world.

The introduction of production automobiles showed clearly what new technology could bring to the table. As stated here, “Between 1913 and 2018, the number of motor-vehicle deaths in the United States (which include all types of motor vehicles, including passenger cars, trucks, buses, and motorcycles) increased 838%, from 4,200 deaths in 1913 to 39,404 in 2018. However, the role cars play in daily life is vastly different now than when tracking began.” Although the automobile was one of mankind’s greatest inventions, it’s clear that every success comes with some sort of consequence.

To take a closer look at this issue, I pulled a data set provided by the WHO (World Health Organization) that covers life expectancy in 172 different countries. It includes contributing factors such as whether the country is developed or developing, alcohol consumption per capita, HIV/AIDS rate, obesity, and a handful of other health and lifestyle variables. To begin, I took the data set and created a correlation matrix showing the ten variables with the strongest relationship to my target, life expectancy.
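The "top ten correlated variables" step can be sketched roughly as follows. This is a minimal illustration, not the code behind the original heatmap: the helper name `top_correlated` is hypothetical, the data is a small synthetic table invented for the example, and the column names only mimic the style of the WHO data set.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target, k=10):
    """Return the k numeric columns most correlated with `target`.

    Columns are ranked by absolute Pearson correlation, so strong
    negative relationships rank as highly as strong positive ones.
    The returned values keep their sign.
    """
    numeric = df.select_dtypes(include="number")  # skip text columns
    corr = numeric.corr()[target].drop(target)    # correlation with target only
    strongest = corr.abs().sort_values(ascending=False).index[:k]
    return corr.loc[strongest]


# Synthetic stand-in data (NOT the WHO data set; relationships are invented).
rng = np.random.default_rng(0)
n = 200
schooling = rng.uniform(0, 20, n)
adult_mortality = rng.uniform(50, 500, n)
toy = pd.DataFrame({
    "Status": rng.choice(["Developed", "Developing"], n),
    "Schooling": schooling,
    "Adult Mortality": adult_mortality,
    "Unrelated": rng.normal(0, 1, n),
    "Life expectancy": 50 + 1.5 * schooling - 0.03 * adult_mortality
                       + rng.normal(0, 2, n),
})

top = top_correlated(toy, "Life expectancy", k=2)
print(top)
```

Ranking by absolute value is a design choice: it keeps negatively correlated variables in the running instead of pushing them to the bottom of the list.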

This heatmap gives us insight into which factors matter most to our model for predicting life expectancy. Income composition of resources, which helps describe human development in an area, is a leading contender. Schooling also plays a major role in the prediction. Since these two factors are so strongly related to life expectancy, I decided to take a closer look at them.

This graph shows a strong positive correlation between schooling and life expectancy. Some of the countries along the baseline of the X axis are developing countries, mostly in Africa, that didn't report schooling information to the WHO. On the opposite end, the highest data point is New Zealand, with an average student spending 20 years in school and a life expectancy of 89. Right below this data point we have the likes of Italy, Germany, Spain, Norway, Portugal, Sweden, France, and Belgium. These nations topped out our dataset, all with a life expectancy of roughly 89. I believe the relationship holds because, as you learn more and develop your understanding of the world and life around you, you become more knowledgeable about things like diet and drug use.

Secondly, I looked at income composition of resources. This variable tells us how productively a country uses its resources, whether for new drug research or hospitals.

As above, there is a strong positive correlation between how much countries invest in making themselves more productive in day-to-day activity and life expectancy. As before, the same countries that excelled in schooling also excel in income composition of resources. Reviewing the data further, I found a negative correlation between adult mortality rate and life expectancy.

The adult mortality rate is the likelihood that someone dies between the ages of 15 and 60. You might think a negatively correlated variable would hurt the prediction, but it's actually useful: a strong negative correlation tells the model just as much as a strong positive one.

Predictive Machine Learning Models

After my initial exploration of the data, I set out to create a model to predict life expectancy. Knowing this was a regression problem, I started by creating a baseline, which stood at only .014%, so I hoped to improve on it drastically. I then fit a basic ridge regression model to my training data. To measure the improvements, I took the mean squared error and R² scores for each model. I chose the mean squared error because it gives the average squared difference between the estimated value and the actual value, which shows how big the discrepancy is. Along with that metric I used the R² score, which shows how close the data is to the fitted regression line or, put more simply, the proportion of variance in the dependent variable that is explained by the independent variables.

After my initial ridge model I was pleased to see the mean squared error at 40.43 and the R² score coming in at 62.40%, but I knew I could do better. I started exploring my data again and looking at it in new ways. I created a partial dependence plot to get a better look at what’s driving the model's predictions, focusing again on my two most highly correlated features, schooling and income composition of resources.
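To make the two metrics concrete, here is a minimal worked sketch using NumPy. The numbers are invented for illustration and are not from the WHO data; the mean-prediction baseline is one common choice of baseline and is an assumption here, since the post does not say how its baseline was built.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the target's variance the predictions explain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Invented life expectancies for four hypothetical countries.
y_true = np.array([60.0, 70.0, 80.0, 90.0])

# Assumed baseline: always predict the mean of the observed values.
baseline_pred = np.full_like(y_true, y_true.mean())

# Invented predictions from some fitted model.
model_pred = np.array([62.0, 69.0, 81.0, 88.0])

baseline_mse = mean_squared_error(y_true, baseline_pred)  # 125.0
baseline_r2 = r2_score(y_true, baseline_pred)             # 0.0
model_mse = mean_squared_error(y_true, model_pred)        # 2.5
model_r2 = r2_score(y_true, model_pred)                   # 0.98
```

A mean-prediction baseline scores an R² of exactly 0 on the data it was computed from, which is why R² is often read as "improvement over always guessing the average."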

I was disappointed to see that there wasn’t much to take away from this new view of how each feature related to the model's predictions. I then decided to create a pipeline built around XGBRegressor. I quickly learned that this model would be substantially better than the plain ridge regression model because it takes a gradient boosting approach. It came in with a mean squared error of only 5.89 and an R² score of 94.52%, so the impact of the gradient booster was very clear.

But I still felt I could squeeze a little more out of this. I spent time fine-tuning the parameters of my model to get any advantage possible. What I learned was that my data had outliers, something these gradient boosters don't handle as well as a RandomForestRegressor can. After switching to the new model, I again fit it to my training data. My scores improved again, with the mean squared error dropping to a mere 2.19 and the R² score coming in at a whopping 97.95%. I was excited to see such a strong score, which shows that choosing the proper machine learning algorithm dramatically impacts your outcomes.

I tried my best to avoid leakage by initially removing adult mortality, as I thought it was too close a predictor of life expectancy, but I learned that it accounted for only roughly .12% of my feature importance. The main contributing factors to the outcome were HIV/AIDS, income composition of resources, schooling, and then adult mortality. Understanding the data shows that to predict a country's life expectancy we must first look at a few major contributing factors, such as the four listed above. From there we can start to gauge where the projected number should be.
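The model comparison and feature-importance steps described above can be sketched with scikit-learn pipelines. This is an illustration under stated assumptions, not the code behind the reported scores: the data is synthetic with an invented step-shaped relationship, the feature names `driver`, `minor`, and `noise` are hypothetical, the median imputation and scaling steps are assumptions, and a RandomForestRegressor stands in for the tree-based models (XGBRegressor from the xgboost package follows the same fit/predict interface and could be placed in the same pipeline slot).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (NOT the WHO data set): one feature drives the target
# through a step, one contributes a little linearly, one is pure noise.
rng = np.random.default_rng(42)
n = 400
X = rng.uniform(0, 1, size=(n, 3))
y = 20.0 * (X[:, 0] > 0.5) + 2.0 * X[:, 1] + rng.normal(0, 0.5, n)
feature_names = ["driver", "minor", "noise"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed preprocessing: median imputation for both, scaling for ridge only.
ridge = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), Ridge(alpha=1.0)
)
forest = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=100, random_state=0),
)

# Fit each pipeline on the training split and score it on held-out data.
scores = {}
for name, model in [("ridge", ridge), ("forest", forest)]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (mean_squared_error(y_test, pred), r2_score(y_test, pred))

# Impurity-based importances from the fitted forest; they sum to 1.
fitted_forest = forest.named_steps["randomforestregressor"]
importances = dict(zip(feature_names, fitted_forest.feature_importances_))

for name, (mse, r2) in scores.items():
    print(f"{name}: MSE={mse:.2f}, R2={r2:.3f}")
print(importances)
```

Scoring on a held-out split rather than the training data is what makes the comparison between models meaningful, and checking the importances afterward is one way to see whether a single suspect feature dominates the model.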
