Tutorial 1 - ICEDEG 2023

 

Overseeing government with AI: automated ranking and filtering of legal notices in the government gazette 

Tutorial Outline

  • Introduction to the goal of the tutorial: to rank notices published on online government gazettes according to their relevance as a way of improving the efficiency and responsiveness of government oversight tasks.
  • Machine Learning (ML) basics: how patterns in data are learned, bias and variance of models, and the importance of splitting the dataset. Discuss how each data split (training, validation, and test) is used.
  • Data annotation: recommendations on how to label documents in terms of relevance and on how to store the data.
  • Google Colab setup: how to set up Google Colab to use GPUs and load personal packages.
  • Load sample data: load into Colab an annotated corpus of notices from the Brazilian Government Gazette as a sample dataset for the tutorial.
  • Split the data: discuss data splitting strategies and use them to build training, validation, and test sets.
  • Setting a metric: present a few metrics used in document ranking and choose one for the tutorial.
  • Define a model performance baseline: compute the metric associated with random ranking to check that the ML model is working.
  • Process texts into vectors: discuss how texts are represented in ML models (vectorization), introduce the n-gram bag-of-words representation, and apply it to the sample dataset.
  • Introduction to regularized linear models: discuss how the Ridge regression model works (a model already known to perform very well on this task).
  • Create a Pipeline: create a complete pipeline, including text processing and the ML model.
  • Fit and fine-tune the pipeline: use grid search and random search to adjust the model’s hyperparameters (and the text processing parameters) to optimize its performance on the validation set. Check the best model’s metric on the validation set.
  • Introduction to transformers: compare the “sequence of word vectors” with the bag-of-words text representation, and discuss the advantages of the attention mechanism, transfer learning, and self-supervised learning.
  • Load a pre-trained BERT model: use Hugging Face to load a pre-trained state-of-the-art Natural Language Processing Model, along with its tokenizer, and set it to a regression task.
  • Tokenize the data: use the tokenizer to process the corpus and convert it into a TensorFlow dataset.
  • Fit the final layer: freeze the first layers of the model and fit the last layer to the annotated corpus.
  • Fit the whole model: unfreeze all the layers and fit the model to the data again, using early stopping and the validation set to avoid overfitting.
  • Test and save the model: apply the fine-tuned model to the test set, measure the metric and save the model. Compare the performance with the ML model.
  • Load and test the model: double-check that the saved model reaches the same performance on the test set as before, and also test it on new data.
  • Set a relevance cutoff: run an error analysis to define the predicted score cutoff used to select the most relevant notices for human scrutiny.
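As an illustration of the "Setting a metric" and "Define a model performance baseline" steps, here is a minimal sketch in plain Python. It assumes binary relevance labels and uses average precision as the ranking metric; both are assumptions made for this example, since the outline does not say which metric the tutorial adopts. The function names and the toy labels are hypothetical.

```python
import random


def average_precision(relevance_in_ranked_order):
    """Average precision of a ranking, given binary relevance labels
    (1 = relevant) listed from the top-ranked notice to the bottom one."""
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance_in_ranked_order, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def random_ranking_baseline(labels, n_trials=1000, seed=0):
    """Mean average precision over many random orderings of the labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        total += average_precision(shuffled)
    return total / n_trials


# Hypothetical validation labels: 5 relevant notices out of 50.
labels = [1] * 5 + [0] * 45
perfect = average_precision(sorted(labels, reverse=True))
baseline = random_ranking_baseline(labels)
print(f"perfect ranking: {perfect:.3f}")
print(f"random baseline: {baseline:.3f}")
```

A trained model is only useful if its score on the validation set sits clearly above the random baseline; a score close to the baseline usually signals a problem in the data or in the pipeline.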
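The n-gram bag-of-words representation mentioned in the "Process texts into vectors" step can be sketched without any external package. The code below is only a didactic illustration: the tokenizer, the function names, and the two sample sentences are assumptions made for this example, and in practice a library vectorizer (for instance from sklearn) would likely be used instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(text, max_n=2):
    """Count all word n-grams of length 1..max_n found in the text."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def vectorize(texts, max_n=2):
    """Map each text to a count vector over a shared, sorted n-gram vocabulary."""
    all_counts = [ngram_counts(t, max_n) for t in texts]
    vocabulary = sorted(set().union(*all_counts))
    vectors = [[c[term] for term in vocabulary] for c in all_counts]
    return vocabulary, vectors


# Two made-up notice titles, used only to show the shape of the output.
texts = ["Minister appointed to the council", "Council budget approved"]
vocabulary, vectors = vectorize(texts)
print(len(vocabulary), "n-grams in the vocabulary")
print(vectors[0])
```

Each text becomes a fixed-length vector of counts, one entry per n-gram in the vocabulary, which is the kind of input a linear model such as Ridge regression can consume.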
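One possible way to carry out the "Set a relevance cutoff" step is to pick the highest predicted score that still recovers a chosen fraction of the relevant notices in a labeled set. The sketch below assumes binary relevance labels and a recall target; the function names, the target value, and the toy scores are hypothetical, and the tutorial's own error analysis may follow a different criterion.

```python
def recall_at_cutoff(scores, labels, cutoff):
    """Fraction of relevant notices whose predicted score is >= cutoff."""
    relevant = sum(labels)
    if relevant == 0:
        return 0.0
    selected = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    return selected / relevant


def choose_cutoff(scores, labels, target_recall=0.9):
    """Highest observed score that, used as a cutoff, reaches the target recall."""
    for cutoff in sorted(set(scores), reverse=True):
        if recall_at_cutoff(scores, labels, cutoff) >= target_recall:
            return cutoff
    # No cutoff reaches the target (e.g. no relevant notices): select everything.
    return min(scores)


# Hypothetical predicted scores and true labels (1 = relevant).
scores = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0]

cutoff = choose_cutoff(scores, labels, target_recall=0.75)
n_selected = sum(1 for s in scores if s >= cutoff)
print(f"cutoff = {cutoff}, notices sent for human review = {n_selected}")
```

Lowering the cutoff recovers more relevant notices at the cost of a longer reading list, so the target recall is a choice that depends on how much human review time is available.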

Target audience and prerequisite knowledge

This tutorial targets people from government, academia, civil society organizations, and enterprises who work on government oversight and monitoring at any level, are familiar with programming, and have access to an online version of a government gazette. The only prerequisite is knowledge of the Python programming language. Helpful but not required: familiarity with Jupyter notebooks; the Python packages sklearn, pandas, and TensorFlow; and machine learning.

Importance for the ICEDEG community

This tutorial is based on a successful project that has been running in Brazil since 2020. The use of Machine Learning (ML) and Artificial Intelligence (AI) to monitor the federal government gazette has enabled an efficient and quick government oversight tool used in Government-to-Government and Government-to-Citizen contexts: an open daily bulletin with the gazette's most relevant acts. The bulletin has been used by over 1,200 people from the government, academia, civil society, and private sectors. The system is relatively simple and can be applied in many countries and at different levels of government.

Tutorial Instructor

Henrique S. Xavier is a physicist with 10+ years of experience in programming, mathematical modeling, statistics, and data analysis. Besides his academic involvement in three different universities worldwide, he has also worked in the private, government, and third sectors as a data scientist/analyst, teacher, and freelancer in data analysis of non-physics topics for news organizations.

 

 

 

 

Contacts

For general information about the conference/tutorials/sponsorship, including registration, please contact us at:

  • info (@) edem-egov.org
  • +41 26 300 83 55
  • +41 26 300 97 26
  • Boulevard de Pérolles 90, 1700 Fribourg, Switzerland