Anna Pavlenko

Certificate Student 19-20


Certificate Program


Capstone Project

NO2 Prediction using Machine Learning Analyses in Google EE

By Anna Pavlenko


Nitrogen Dioxide (NO2) air pollution.

The World Health Organization estimates that air pollution kills 4.2 million people every year.

The main effect of breathing in raised levels of NO2 is the increased likelihood of respiratory problems. NO2 inflames the lining of the lungs, and it can reduce immunity to lung infections.

There are also reported connections between the level of NO2 pollution in the atmosphere and respiratory diseases, including more severe and more deadly cases after exposure to viruses.

Sources of NO2:

Rapid population growth and fast urbanization increase emissions from:

- Industrial facilities
- Fossil fuels (coal, oil and gas)
- Increased transportation: 80 %

NO2 air pollution affects population health and global warming.


Fig_1. Pollution in Industrial cities


workflow diagram

Fig_2. Workflow of this project

Study Area / Input Data:

Study area for the research project: Los Angeles, CA. Data: Landsat 8 image collection for 2014-2019 and Sentinel-5P (TROPOMI) for 2018-2019.

LA from sentinel

Fig_3. Los Angeles, CA / Landsat 8 / Sentinel 5-P (TROPOMI)


ML Regression Analysis Used in This Project

The machine learning toolbox includes several linear and non-linear supervised learners, predicting either numeric outputs (regressors) or nominal outputs (classifiers).

Classification Workflow

  1. Build

  2. Train

  3. Apply

  4. Assessment

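Classification Workflow (code)

// Sample the image to build the training data.
var training = image.sample(region, scale);
// Train a random forest; 10 trees and the 'class' label property are placeholders.
var classifier = ee.Classifier.smileRandomForest(10).train(training, 'class');
// Apply the trained classifier to the image.
var result = image.classify(classifier);
// Switch the output mode to get continuous (regression) output.
var predictor = classifier.setOutputMode('REGRESSION');
// Assess accuracy on the training data.
var confusionMatrix = classifier.confusionMatrix();
var accuracy = confusionMatrix.accuracy();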


Classification and regression trees.

Regression learners: Random Forest (random decision forest) and SVM (support vector machine)

Classifier Output Mode


Classification - Discrete input/output classes

Regression - Continuous valued output

Probability - binary classifiers only

Classifiers that support Regression: Random Forest, SVM

Classifiers that support Probability: CART, NaiveBayes, IKPamir*, Pegasos, SVM, Perceptron
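The three output modes can be illustrated with a small plain JavaScript sketch, assuming a forest whose trees each return a vote or a value. This reflects how random forests are commonly described, not necessarily how Earth Engine computes each mode internally; the function names and the toy tree outputs are made up for illustration.

```javascript
// Classification: the most frequent class label among the tree votes.
function majorityVote(votes) {
  const counts = new Map();
  for (const v of votes) counts.set(v, (counts.get(v) || 0) + 1);
  let best = null;
  let bestCount = -1;
  for (const [label, count] of counts) {
    if (count > bestCount) {
      best = label;
      bestCount = count;
    }
  }
  return best;
}

// Regression: the mean of the continuous tree outputs.
function meanOutput(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Probability (binary case): the share of trees voting for class 1.
function probabilityOfClassOne(votes) {
  return votes.filter((v) => v === 1).length / votes.length;
}

// Hypothetical outputs of four trees (votes) and three trees (values).
const treeVotes = [1, 0, 1, 1];
const treeValues = [2.0, 3.0, 4.0];
const outClass = majorityVote(treeVotes);                // 1
const outValue = meanOutput(treeValues);                 // 3
const outProbability = probabilityOfClassOne(treeVotes); // 0.75
```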

Methods in this project

Random Sampling

The training dataset was created from random points. The points were collected within areas selected to represent 9 different levels of NO2 in the 2019 TROPOMI satellite imagery.
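The sampling idea can be sketched in plain JavaScript, outside Earth Engine: draw the same number of random points from each NO2 level so that every level is represented in the training set. The function names, the seeded generator, the 5 points per level, and the toy pixel list are assumptions for illustration, not the project's actual code.

```javascript
// Small seeded pseudo-random generator so the sketch gives repeatable results.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Group pixels by NO2 level and draw up to `perLevel` random points from each group.
function stratifiedSample(pixels, perLevel, seed) {
  const rng = makeRng(seed);
  const byLevel = new Map();
  for (const p of pixels) {
    if (!byLevel.has(p.level)) byLevel.set(p.level, []);
    byLevel.get(p.level).push(p);
  }
  const sample = [];
  for (const group of byLevel.values()) {
    // Partial Fisher-Yates shuffle on a copy; keep the first `n` shuffled points.
    const pool = group.slice();
    const n = Math.min(perLevel, pool.length);
    for (let i = 0; i < n; i++) {
      const j = i + Math.floor(rng() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      sample.push(pool[i]);
    }
  }
  return sample;
}

// Toy data: 90 pixels, 10 in each of 9 NO2 levels (values are made up).
const pixels = [];
for (let i = 0; i < 90; i++) {
  pixels.push({ id: i, level: (i % 9) + 1 });
}
const trainingPoints = stratifiedSample(pixels, 5, 42);
```

Sampling per level, rather than over the whole image at once, keeps rare NO2 levels from being underrepresented in the training data.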


Fig_4. TROPOMI imagery 2019

tropomi points

Fig_5. Random points collection from TROPOMI 2019


Supervised Classification

The Random Forest method was used in this project.



Fig_6. Random Forest Classification 2019

We can assess the accuracy of the trained classifier using a confusionMatrix.
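As a minimal illustration of what the accuracy figure means, the sketch below computes overall accuracy from a confusion matrix in plain JavaScript: the correctly classified samples on the diagonal divided by all samples. The function name and the 3-class toy matrix are assumptions for illustration, with rows taken as actual classes and columns as predicted classes.

```javascript
// Overall accuracy = correctly classified samples (diagonal) / all samples.
function overallAccuracy(matrix) {
  let correct = 0;
  let total = 0;
  for (let i = 0; i < matrix.length; i++) {
    for (let j = 0; j < matrix[i].length; j++) {
      total += matrix[i][j];
      if (i === j) correct += matrix[i][j];
    }
  }
  return total === 0 ? NaN : correct / total;
}

// Toy 3-class matrix with made-up counts: rows = actual, columns = predicted.
const toyMatrix = [
  [8, 1, 1],
  [2, 7, 1],
  [0, 2, 8],
];
const toyAccuracy = overallAccuracy(toyMatrix); // 23 correct out of 30 samples
```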

error matrix

valid matrix


Two ways to predict continuous values: across space and over time

Regression: predicting continuous output values across space

Contexts of Linear Regression in GEE:

across space diagram

Fig_7. Regression Across Space 2018


Fig_8. NDVI and Regression 2018
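For reference, NDVI is the normalized difference between near-infrared and red reflectance, (NIR - Red) / (NIR + Red); for Landsat 8 these are bands B5 and B4. A minimal per-pixel sketch in plain JavaScript follows. The function name, the sample reflectance values, and the choice to return 0 when both bands are 0 are assumptions for illustration.

```javascript
// NDVI for one pixel from near-infrared and red reflectance.
// Returning 0 when both bands are 0 is a choice made here to avoid dividing by zero.
function ndvi(nir, red) {
  const sum = nir + red;
  return sum === 0 ? 0 : (nir - red) / sum;
}

// Made-up reflectance values: dense vegetation reflects strongly in NIR.
const vegetationNdvi = ndvi(0.5, 0.1);
// Equal reflectance in both bands gives an NDVI of 0.
const bareNdvi = ndvi(0.2, 0.2);
```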


Linear Regression 2018


chart for 2018

Fig_9. Regression 2018
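The fit behind a regression chart like this can be sketched as ordinary least squares with one predictor, written in plain JavaScript. This is an assumption about the general technique, not the project's actual Earth Engine code; the function name and the toy data, which lie exactly on the line y = 5 - 4x, are made up.

```javascript
// Ordinary least squares fit of y = intercept + slope * x.
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, v) => sum + v, 0) / n;
  const meanY = ys.reduce((sum, v) => sum + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) * (xs[i] - meanX);
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  return { slope: slope, intercept: intercept };
}

// Toy data on the exact line y = 5 - 4x (values are made up).
const predictorValues = [0.1, 0.2, 0.3, 0.4, 0.5];
const responseValues = predictorValues.map((x) => 5 - 4 * x);
const fittedLine = fitLine(predictorValues, responseValues);
```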

ndvi and regression chart

Fig_10. NDVI and Regression 2018


Model Accuracy:

For 2018, both actual TROPOMI data and NO2 values predicted by the predictor trained on 2019 data are available. Comparing the two makes it possible to evaluate the quality of the predictor.

difference raster plot

Fig_11. Difference Raster 2018: difference between the actual and the predicted NO2 values
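A difference raster of this kind is a per-pixel subtraction of the predicted values from the actual values. The plain JavaScript sketch below shows the idea on a 2 x 2 toy grid and summarizes it with the root mean square error (RMSE). RMSE is a summary metric added here for illustration, not one reported in the project, and the function names and grid values are made up.

```javascript
// Per-pixel difference: actual minus predicted, for two grids of equal size.
function differenceRaster(actual, predicted) {
  return actual.map((row, r) => row.map((value, c) => value - predicted[r][c]));
}

// Root mean square error over all pixels of a difference grid.
function rmse(diff) {
  let sumSquares = 0;
  let count = 0;
  for (const row of diff) {
    for (const d of row) {
      sumSquares += d * d;
      count += 1;
    }
  }
  return Math.sqrt(sumSquares / count);
}

// Toy 2 x 2 grids with made-up NO2 values.
const actualGrid = [[4, 6], [8, 10]];
const predictedGrid = [[3, 6], [9, 12]];
const diffGrid = differenceRaster(actualGrid, predictedGrid); // [[1, 0], [-1, -2]]
const gridRmse = rmse(diffGrid); // sqrt(6 / 4)
```

Positive pixels in the difference grid mark places where the predictor underestimates NO2, negative pixels where it overestimates.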


Predictions for 2018-2015:

predictions for next four years

Fig_12. NO2 predictions for 2018-2015

Predicting continuous output values over time (Random Forest regression):

over time

Fig_13. Context of Linear Regression in GEE - over Time

chart over time

Fig_14. Predicted NO2 level Over Time (2018 - 2014)

power plants chart

Fig_15. Power Plants in GEE

point prediction for power plants

Fig_16. Power Plants Predictions in GEE


The goal of creating NO2 data for past years (2018, 2017, 2016, 2015) from the 2019 data (TROPOMI and Landsat 8) was reached.


Summary / Conclusions:

The accuracy of the regression / prediction model could be improved with an ensemble model, which combines predictions from multiple machine learning algorithms, or several predictions from the same algorithm, to make more accurate predictions.
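The simplest form of such an ensemble averages the predictions of several models for the same input. The plain JavaScript sketch below assumes each model is a function from a feature object to a predicted NO2 value. The three toy models, their coefficients, and the feature name are hypothetical and stand in for trained regressors.

```javascript
// Average the predictions of several models for the same input features.
function ensemblePredict(models, features) {
  const predictions = models.map((model) => model(features));
  return predictions.reduce((sum, p) => sum + p, 0) / predictions.length;
}

// Hypothetical stand-ins for three trained regressors (coefficients are made up).
const toyModels = [
  (f) => 10 - 4 * f.ndvi,
  (f) => 9 - 2 * f.ndvi,
  (f) => 11 - 6 * f.ndvi,
];
const ensembleValue = ensemblePredict(toyModels, { ndvi: 0.5 }); // (8 + 8 + 8) / 3
```

Averaging is only one option; weighted averages or stacking are other common ways to combine models.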

