Robina Shaheen

Certificate Student 19-20


Certificate Program


Capstone Project

Wildfires in California: Causes and Consequences

By Robina Shaheen

Rising carbon dioxide in the atmosphere is contributing to a steady increase in global temperatures. Over the last two decades, humanity has observed record-breaking extreme weather events. A comparison with historical data indicates that many of these events now occur with higher frequency, larger magnitude, and longer duration, and that their timing has also changed significantly around the world (Seneviratne et al., 2018). Wildfires are no exception, as seen in the devastating blazes of recent years across the globe (the Amazon forest in 2019, California in 2017-18, and Australia in 2019-2020). In the western U.S., wildfires are projected to increase in both frequency and intensity as the planet warms (Abatzoglou and Williams, 2016). The state of California, with ~40 million residents and ~4.5 million homes and properties, has experienced the most devastating economic, ecological, and health consequences during and after wildfires (Baylis and Boomhower, 2019; Liao and Kousky, 2020).

During the 2014 wildfires in San Diego, I volunteered to help people in the emergency shelters. Watching the devastating effects of wildfires on the emotional and physical health of children and adults had a profound impact on me and moved me to search for potential markers and predictors of these events in nature. This analysis is not only an academic endeavor but also a humble beginning toward understanding the complex interactions between the atmosphere and the biosphere, and toward planning a better future for those who suffered the most during these catastrophic wildfires.

The goal of this study is to understand the weather conditions that can trigger wildfires, the consequences of wildfires for atmospheric chemistry, and how we can predict these events and plan to mitigate their disastrous consequences. I have laid out this study in multiple sections:

  1. Understanding weather patterns using advanced machine learning tools to identify markers that can be used to forecast future events.

  2. Understanding changes in the chemical composition of the atmosphere and identifying markers that can trigger respiratory stress and cardiovascular diseases.

Picture of an unprecedented wildfire near Camp Pendleton.

Pulgas Fire 2014, San Diego, CA (source: Wikimedia), May 2014 wildfires.

wildfire glowing behind city lights


  1. Import packages and modules

  2. Import datetime conversion tools between pandas and matplotlib for time series analysis

  3. Download air quality data from the EPA website

  4. Set working directory to "earth-analytics"

  5. Define paths to download data files from data folder 'sd_fires_2014'

  6. Import data into dataframes using appropriate functions (date parser, indexing, removal of missing values)

    • Weather data Jan-Dec. 2014

    • Atmospheric gases and particulate matter data Jan-Dec. 2014

    • Annual precipitation (2007-2020) San Diego, CA.

  7. View nature and type of data

  8. Use scikit-learn multivariate analysis to predict ozone levels.

  9. Plot data to view any anomalies in data and identify the source of the anomaly.

  10. Discuss plots and conclusions.
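The import steps listed above (date parsing, indexing, removal of missing values) can be sketched roughly as follows. This is a minimal illustration, not the notebook's original code: the in-memory CSV stands in for a file in the 'sd_fires_2014' folder, and the column names, the values, and the helper `load_timeseries` are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file in the 'sd_fires_2014' data folder.
# Column names and values are made up for illustration only.
csv_text = """date,temp_f,rel_humidity,wind_mph
2014-05-12,78.0,35,6.0
2014-05-13,91.5,12,18.5
2014-05-14,97.0,,22.0
2014-05-15,95.2,9,15.5
"""


def load_timeseries(source, date_col="date"):
    """Read a CSV, parse the date column, index by date, drop incomplete rows."""
    df = pd.read_csv(source, parse_dates=[date_col], index_col=date_col)
    return df.dropna().sort_index()


weather = load_timeseries(io.StringIO(csv_text))

# View the nature and type of the data.
print(weather.dtypes)
print(weather)
```

With a real file, the same helper would be called with the file path instead of the `io.StringIO` object.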


Data exploration and analysis

The EPA provides data for the entire state. The data set is large and often slows processing, so it is important to select only the data required to check air quality and weather conditions. I have selected ozone, oxides of nitrogen and carbon monoxide, which are produced during wildfires. Additionally, black carbon and particulate matter, which are dangerous to inhale, are emitted during wildfires. These datasets will allow me to conduct a preliminary analysis of the effects of wildfires on air quality in San Diego County.
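Selecting only the required rows from a statewide table might look like the sketch below. The column names and values in the in-memory CSV are assumptions made for illustration; the actual EPA files may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical statewide extract; column names and values are illustrative.
csv_text = """County Name,Parameter Name,Date Local,Arithmetic Mean
San Diego,Ozone,2014-05-14,0.052
Los Angeles,Ozone,2014-05-14,0.047
San Diego,Carbon monoxide,2014-05-14,0.61
San Diego,Sulfur dioxide,2014-05-14,0.4
"""

# Pollutants of interest for this example.
wanted = ["Ozone", "Carbon monoxide"]

state = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date Local"])

# Keep only San Diego rows for the selected pollutants.
mask = (state["County Name"] == "San Diego") & state["Parameter Name"].isin(wanted)
san_diego = state.loc[mask].reset_index(drop=True)

print(san_diego)
```

Filtering early keeps the working dataframe small, which is the point made above about processing time.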

scatter matrix function to view relationships between various components of weather system

The figure above shows a Gaussian distribution of temperature and pressure, whereas the relative humidity data are skewed towards lower humidity levels. The wind pattern shows two distinct populations. Both relative humidity and wind pattern indicate extreme weather changes during Santa Ana events. The correlation matrix graph shows an inverse correlation between temperature and relative humidity. It is easy to see how extremely low relative humidity (<30%) can damage shrubs and other vegetation. The inverse relationship between pressure, temperature and winds is obvious from these graphs. The time series data have already shown that a high-resolution record of temperature, relative humidity and wind can serve as a useful indicator of wildfire season.
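The correlation matrix discussed above can be computed with pandas as in the sketch below. The data are synthetic: the inverse relation between temperature and relative humidity is built into the generated values, and the column names are hypothetical, so the numbers only illustrate the method and are not the San Diego results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365

# Synthetic daily values; the temperature/humidity link is constructed on purpose.
temp = rng.normal(20, 5, n)
rel_humidity = 80 - 2.0 * temp + rng.normal(0, 5, n)
pressure = rng.normal(1013, 4, n)

weather = pd.DataFrame(
    {"temp_c": temp, "rel_humidity": rel_humidity, "pressure_hpa": pressure}
)

# Pairwise Pearson correlation between all columns.
corr = weather.corr()
print(corr.round(2))

# Count days below the 30% relative humidity level mentioned in the text.
low_rh_days = int((weather["rel_humidity"] < 30).sum())
print("days with RH < 30%:", low_rh_days)

# With matplotlib installed, pd.plotting.scatter_matrix(weather, diagonal="hist")
# draws the pairwise scatter plots with histograms on the diagonal.
```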

Autocorrelation plot and lag plot

Autocorrelation plots are often used for checking randomness in time series. This is done by computing autocorrelations for data values at varying time lags. If a time series is random, such autocorrelations should be near zero for any and all time-lag separations. If a time series is non-random, then one or more of the autocorrelations will be significantly non-zero. The horizontal lines displayed in the plot correspond to the 95% and 99% confidence bands; the dashed line is the 99% confidence band.
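The randomness check described above can be sketched numerically. The example uses two synthetic series (one seasonal, one pure noise) and the helper `autocorrelations`, which is hypothetical. It approximates the confidence bands as z/sqrt(N), with z = 1.96 for 95% and z = 2.576 for 99%. `Series.autocorr` computes a Pearson correlation between the series and its shifted copy, which may differ slightly from the estimator used by a plotting function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 365
days = np.arange(n)

# Synthetic series: a seasonal signal with noise, and pure noise.
seasonal = pd.Series(20 + 8 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, n))
noise = pd.Series(rng.normal(0, 1, n))


def autocorrelations(series, max_lag):
    """Autocorrelation of a series at lags 1..max_lag."""
    return pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, max_lag + 1)})


# Approximate confidence bands for a random series of length n.
band95 = 1.96 / np.sqrt(n)
band99 = 2.576 / np.sqrt(n)

acf_seasonal = autocorrelations(seasonal, 30)
acf_noise = autocorrelations(noise, 30)

print("99% band: +/-", round(band99, 3))
print("seasonal lags outside 99% band:", int((acf_seasonal.abs() > band99).sum()))
print("noise lags outside 99% band:", int((acf_noise.abs() > band99).sum()))
```

A series whose autocorrelations stay far outside the bands at many lags, like the seasonal one here, is non-random in the sense used above.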

  1. The autocorrelation plot shows that the weather parameters are strongly related to each other.

  2. A lag plot is a special type of scatter plot in which the two variables (X, Y) are the same time series separated by a fixed lag, e.g. y(t) against y(t+1). A lag plot is used to check for:

    • Outliers (data points with extremely high or low values).

    • Randomness (data without a pattern).

    • Serial correlation (where error terms in a time series transfer from one period to another).

    • Seasonality (periodic fluctuations in time series data that happen at regular periods).

A lag plot of the San Diego weather data shows two distinct populations: anomalous values during wildfires in the upper right corner, and normal values that are linearly correlated. The lag plot helps us choose an appropriate model for machine learning.
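The two populations in a lag plot can be separated numerically, as in the sketch below. It uses a synthetic temperature series with an artificial six-day spike standing in for a wildfire episode. The threshold rule (median plus three interquartile ranges) is an assumption chosen for this illustration, not a method taken from the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
t = np.arange(200)

# Synthetic smooth series with small noise, plus a six-day spike.
values = 20 + 3 * np.sin(2 * np.pi * t / 100) + rng.normal(0, 0.3, 200)
values[130:136] += 15
temp = pd.Series(values)

# Lag-1 pairs: each value against the value one step later.
pairs = pd.DataFrame({"y_t": temp, "y_t_plus_1": temp.shift(-1)}).dropna()

# Illustrative outlier threshold: median + 3 * interquartile range.
iqr = temp.quantile(0.75) - temp.quantile(0.25)
threshold = temp.median() + 3 * iqr

# Points in the upper right corner of the lag plot.
upper_right = (pairs["y_t"] > threshold) & (pairs["y_t_plus_1"] > threshold)
# Points where both coordinates are in the normal range.
normal = (pairs["y_t"] <= threshold) & (pairs["y_t_plus_1"] <= threshold)

normal_corr = pairs.loc[normal, "y_t"].corr(pairs.loc[normal, "y_t_plus_1"])

print("pairs in upper right corner:", int(upper_right.sum()))
print("lag-1 correlation of normal pairs:", round(normal_corr, 3))

# With matplotlib installed, pd.plotting.lag_plot(temp, lag=1) draws the same pairs.
```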

autocorrelation by lag plot


scatter plot of y and y+1

The Chemical Composition of the Atmosphere

The atmosphere is composed of gases and aerosols (liquid and solid particles ranging in size from a few nanometers to micrometers). In the cells below we explore machine learning models and statistics to understand the relations between various parameters. We will use scikit-learn multivariate analytical tools to predict ozone concentrations. Ozone is a strong oxidant and a very irritating gas. It can enter the respiratory system and damage its thin linings by producing free radicals.


Scikit-learn is a set of simple and efficient tools for predictive data analysis.

It is designed for machine learning in Python and built on NumPy, SciPy, and matplotlib.

In this notebook I have employed scikit-learn to understand the relationship between ozone and its precursors.

Ozone formation depends on the presence of oxides of nitrogen, carbon monoxide and volatile organic compounds (VOCs).

However, VOCs are not measured at all the stations, and a few stations have values measured for only half of the year. These missing values cannot be predicted or filled in because of the spatial variability of VOC sources.

Ozone vs Nitric Oxide scatter plot

scatter plot of ozone vs pm 2.5

Our next step is to divide the data into “attributes” and “labels”. Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted. In our dataset, we have only two columns. We want to predict the oxides of nitrogen from the recorded carbon monoxide (CO).

Next, we assign 80% of the data to the training set and 20% to the test set. The test_size parameter is where we actually specify the proportion of the test set.
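The split and a multiple linear regression fit might look like the sketch below, which uses scikit-learn's `train_test_split` and `LinearRegression`. It is not the notebook's original cell. The data are synthetic, the column names are hypothetical, and the choice of ozone as the label with three attributes is an assumption based on the stated aim of predicting ozone.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 300

# Synthetic attributes (column names are hypothetical).
data = pd.DataFrame(
    {
        "no2_ppb": rng.uniform(5, 40, n),
        "co_ppm": rng.uniform(0.1, 1.0, n),
        "temp_c": rng.uniform(10, 35, n),
    }
)
# Synthetic label built from the attributes plus noise, for illustration only.
data["ozone_ppb"] = (
    10
    + 0.8 * data["temp_c"]
    - 0.3 * data["no2_ppb"]
    + 5.0 * data["co_ppm"]
    + rng.normal(0, 3, n)
)

X = data[["no2_ppb", "co_ppm", "temp_c"]]  # attributes
y = data["ozone_ppb"]                      # label

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Side-by-side comparison of actual and predicted values.
comparison = pd.DataFrame({"actual": y_test, "predicted": y_pred})
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))

print(comparison.head())
print("RMSE:", round(rmse, 2))
```

Fixing `random_state` makes the split reproducible, which helps when comparing model variants.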

comparison of actual and predicted ozone values bar chart


  1. The ozone prediction from the model is (6.51/30.69) × 100 ≈ 21% less than the actual value, which is still a reasonable model fit.

  2. The bar graph with predicted values shows a huge deviation on May 01, 2014, which is the wildfire day at its peak.

  3. Taking the wildfire days out will improve the relationship between the normal parameters and hence the MLR coefficients.
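The percentage in item 1 above can be reproduced as a relative error. The sketch below reads 6.51 and 30.69 as a root mean squared error and a mean observed ozone value. That reading is an assumption, since the text does not say what the two numbers are. The helper `relative_rmse_percent` and the three-point example are hypothetical.

```python
import numpy as np


def relative_rmse_percent(actual, predicted):
    """Root mean squared error as a percentage of the mean observed value."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return 100.0 * rmse / actual.mean()


# The two figures quoted in the text, combined as a plain ratio.
quoted_percent = 100.0 * 6.51 / 30.69
print(round(quoted_percent, 1))  # prints 21.2

# Small made-up example of the same calculation from raw values.
example = relative_rmse_percent([30.0, 32.0, 28.0], [27.0, 35.0, 28.0])
print(round(example, 2))
```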

  1. Ozone formation depends on temperature and pressure, and its destruction on reaction with the OH radical, photolysis and collisions with air molecules. The data for VOCs, one of the important sources of ozone formation, are incomplete; including them could bring the predicted values significantly closer to the observed ones.

  2. A comparison with the rain in previous years tells an interesting story: a good rainy season in 2011-12 resulted in huge biomass accumulation in San Diego.

  3. Intense drought in 2012-2013 caused the biomass to turn into dry fuel for fire ignition.

  4. In future studies, it would be valuable to compare rain patterns with wildfires in California.

Future Outlook

There are many exciting avenues for expanding this work, such as including more wildfire events and comparing them with weather and atmospheric conditions to test our model.

Preliminary Analysis of Rain Patterns in San Diego.

Figure. Cumulative rain and number of rainy days in San Diego, California, USA.

bar plot showing the cumulative rain and number of rainy days in San Diego California, USA

The figure above shows that precipitation in 2010 was higher than normal and resulted in rapid growth of shrubs, grasses and other vegetation, which later served as tinder for fires in early 2014. Tinder is easily combustible material (dried grass and shrubs) that is used to ignite fires. The black line indicates an exponential decrease in precipitation after the heavy rains in 2010. A further analysis of the rain pattern in 2020 indicated 110 mm of rain in Dec. 2020, which is almost half of the annual rain in San Diego. There was no rain in July-Aug. 2020 after deadly wildfires, which had even worse consequences for the health of locals. As the climate changes, populations are suffering not only from the immediate impact of wildfires; the lingering effects of fine particulate matter and toxic gases are even worse, especially for children and individuals suffering from respiratory diseases (Maxmen, 2019).
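Cumulative rain and the number of rainy days per year can be derived from a daily precipitation record as in the sketch below. The dates, the amounts, and the column name `precip_mm` are made up for illustration and are not the San Diego measurements.

```python
import pandas as pd

# Made-up daily precipitation records (mm); not the San Diego measurements.
daily = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2010-01-18", "2010-02-06", "2010-07-04", "2010-12-21",
                "2013-01-25", "2013-03-08", "2013-08-01",
            ]
        ),
        "precip_mm": [30.0, 12.5, 0.0, 45.0, 8.0, 5.5, 0.0],
    }
).set_index("date")

# A rainy day is any day with measurable precipitation.
rainy = daily["precip_mm"] > 0

# Aggregate by calendar year.
summary = pd.DataFrame(
    {
        "cumulative_rain_mm": daily["precip_mm"].groupby(daily.index.year).sum(),
        "rainy_days": rainy.groupby(rainy.index.year).sum(),
    }
)

print(summary)
```

The resulting table has one row per year and can be passed directly to a bar plot.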


Abatzoglou, J. T., and Williams, A. P., Proceedings of the National Academy of Sciences, Oct 2016, 113 (42) 11770-11775; doi: 10.1073/pnas.1607171113

Seneviratne, S. et al., Philos Trans A Math Phys Eng Sci. 2018 May 13; 376(2119): 20160450; doi: 10.1098/rsta.2016.0450

Baylis, P., and Boomhower, J., Moral hazard, wildfires, and the economic incidence of natural disasters, 2019.

Liao, Y., and Kousky, C., The Fiscal Impacts of Wildfires on California Municipalities (May 27, 2020).

Maxmen, A., California biologists are using wildfires to assess health risks of smoke. Nature 575, 15-16 (2019); doi: 10.1038/d41586-019-03345-2