
Temporal and Geo-referenced Traffic Management with Python+Streamlit

Applying modern tools to visualize time and spatial data in a dashboard

João Pedro
10 min read · Jan 29, 2023
Photo by Robin Pierre on Unsplash

Introduction

Learning Data Science has never been so easy. Even though it's a relatively "new" area (or just statistics with a cool name), there are already A LOT of books, courses, and videos teaching us practically everything we need. This, combined with the excellent Python packages developed by the community, makes data science a breeze.

But not everything is so easy.

A common pain point in developing with Python is that many packages exist for scientific/statistical/numerical data manipulation, graph plotting, and so on, but not many for deploying things into production (serving ML predictions, distributing graphs).

Recently, new tools have arrived to narrow this gap between data science and the operational side of deploying and managing production applications (and, as with everything in Python, with little to no effort).

This post will detail how it's possible to deploy a dashboard on a web page using Streamlit, with a practical example using geo-referenced traffic data.

The problem

In my posts, I always try to bring some idea of 'public utility', usually working with open data from the Brazilian Gov., and today is no different.

One of the major problems in modern cities is traffic.

I used to play Cities Skylines a lot and, several times, I just dropped my city because of traffic — I just could not make it better. So, if it is hard in a video game, imagine it in real life.

The data used in this post are the readings from traffic sensors in the city of Belo Horizonte (BH), the capital of Minas Gerais (Brazil), that I already used in my post about Spark Structured Streaming and Kafka.

In that post, I talk about how to build a streaming infrastructure to receive and process these readings to feed a supposed dashboard/app that shows traffic in real-time.

Today we’re doing something like that.

We’ll build the dashboard but without the ‘real-time’ part. The user should be able to visualize the traffic volume on a map and travel through time by selecting a time window (like 20-JAN to 10-FEB between 10:30–12:30).

Here is a spoiler of how the final app will look.

End result spoiler. Image by Author.

This post will focus on how it's possible to easily build and deploy an app like this with Streamlit.

The data

I've already talked a little about the data previously. It's a HUGE dataset in its original form (many GB of JSON), too much to process using just plain Python + pandas.

So, in the GitHub project, there are some pyspark scripts that I used to preprocess and summarize the data. The docker-compose file is already configured with a small standalone Spark cluster to run these scripts (learn more about how to execute them in this post).

The preprocessing aggregates the readings using 15min time windows, counting the number of vehicles of each class detected by each sensor.

The final result is a table with the following columns:

  • LATITUDE/LONGITUDE — Sensor coordinates
  • MIN_TIME — Start time of the window (Timestamp)
  • CLASS — Vehicle class: CAR, MOTORCYCLE, TRUCK/BUS, UNDEFINED
  • DATA HORA — Hour of MIN_TIME
  • MONTH — Month of MIN_TIME
  • COUNT — Number of vehicles detected

The table is stored in a parquet file partitioned by MONTH and CLASS.
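To make the windowing concrete, here is a minimal sketch of the same aggregation in plain pandas instead of pyspark (the sample readings below are made up; the column names follow the table above):

```python
import pandas as pd

# Hypothetical raw readings: one row per vehicle detected by a sensor
readings = pd.DataFrame({
    "LATITUDE": [-19.92, -19.92, -19.92, -19.93],
    "LONGITUDE": [-43.94, -43.94, -43.94, -43.95],
    "CLASS": ["AUTOMOVEL", "AUTOMOVEL", "MOTO", "AUTOMOVEL"],
    "TIME": pd.to_datetime([
        "2022-01-01 10:02", "2022-01-01 10:07",
        "2022-01-01 10:20", "2022-01-01 10:05",
    ]),
})

# Count vehicles per sensor, class and 15-minute window
summary = (
    readings
    .groupby([
        "LATITUDE", "LONGITUDE", "CLASS",
        pd.Grouper(key="TIME", freq="15min"),
    ])
    .size()
    .reset_index(name="COUNT")
    .rename(columns={"TIME": "MIN_TIME"})
)
summary["MONTH"] = summary["MIN_TIME"].dt.month
```

For instance, the two readings at 10:02 and 10:07 from the same sensor fall into the same 10:00 window and are summed into a single row with COUNT = 2.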

Data used in the project, partitioned by MONTH and CLASS. AUTOMOVEL=CAR, CAMINHAO_ONIBUS=TRUCK/BUS, MOTO=MOTORCYCLE, INDEFINIDO=UNDEFINED

The Implementation

Setting up the environment

All you need is docker and docker-compose.
All the code is available in this GitHub repository.

docker-compose.yaml file:

version: '3'
services:
  pystreamlit:
    build:
      context: .
    volumes:
      - ./app:/app
      - ./data:/data
    ports:
      - 8501:8501
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
    volumes:
      - ./data:/data
      - ./jobs:/jobs
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_EXECUTOR_MEMORY=4G
      - SPARK_WORKER_CORES=4
    volumes:
      - ./data:/data
      - ./jobs:/jobs

To start the environment, just run

docker-compose up

And open your browser at localhost:8501

If you want to start just the Streamlit service:

docker-compose up pystreamlit

Streamlit in a nutshell

Streamlit is a Python package for developing web data applications.

It abstracts all the frontend/web dev plumbing needed to deploy a web application behind high-level components, making the development simple and fast.

Because of this, in the end, the scripts are just plain python code, with no need to configure HTML files or other frontend components. So, if you are a Data Scientist wanting to share your reports or ML models in a nice visual way, that’s the package for you.

Creating a Streamlit App

A Streamlit app is just a Python script.

To start, just create a main.py (or any other name) and import Streamlit.

import streamlit as st

From now on, it is just a game of adding pieces to a blank canvas. For example, the script below adds a title to the main page and a subtitle on the sidebar.

if __name__ == "__main__":

    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"

    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)

To execute, just run

streamlit run main.py

And open your browser at localhost:8501

Adding widgets

Widgets are interactive UI elements that return a value inside our application. We’ll be using the widgets to build the filters for our dashboard.

Basically, if you want to add a widget to the page, all that's needed is to call the respective method on the imported st object.

Let’s add a selection menu where the user can choose the vehicle classes to plot.

import streamlit as st

def widget_vehicle_class():
    vehicle_type = st.sidebar.multiselect(
        "Vehicle Class",
        ["BUS/TRUCK", "CAR", "MOTORCYCLE", "UNDEFINED"],  # Possible values
        ["BUS/TRUCK", "CAR", "MOTORCYCLE", "UNDEFINED"]   # Default values
    )

    return vehicle_type

if __name__ == "__main__":

    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"

    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)

    vehicle_classes = widget_vehicle_class()

The vehicle_type variable stores the returned value of the multiselect widget.

After refreshing the browser:

Every time we add a new element to the UI it is placed below the previous elements.

And it doesn't get any harder than this.

Our application also needs to receive two dates representing the time interval considered (start date and end date). We could just add two date input widgets in the same way we did above with the multiselect object, but it will be much more visually appealing if they're side by side.

It’s very easy to achieve this by using columns, as shown below:

import pandas as pd

MIN_DATE = pd.to_datetime("2022-01-01")
MAX_DATE = pd.to_datetime("2022-02-28")

def widget_dates_range():
    lateral_columns = st.sidebar.columns([1, 1])
    min_date = lateral_columns[0].date_input(
        "From", MIN_DATE,
        min_value=MIN_DATE,
        max_value=MAX_DATE
    )
    max_date = lateral_columns[1].date_input(
        "To", MAX_DATE,
        min_value=MIN_DATE,
        max_value=MAX_DATE
    )

    return min_date, max_date

The columns() method accepts a list where each value represents (proportionally) the width of a column, and returns a list of columns. The column objects have the same methods as the st object, so to add a widget to one, just call the respective method on it.

As the two columns in the code above were created with the same proportions, they have the same final width. If you need different-sized columns, just use different proportions, as in the code below:

from datetime import datetime

def widget_hour_range():

    # slider with the hour and minute
    # (pd.datetime was removed from pandas, so use datetime directly)
    columns = st.sidebar.columns([1, 20, 1])
    columns[0].write(":city_sunset:")
    min_hour, max_hour = columns[1].slider(
        "Hour range",
        datetime(2019, 1, 1, 0, 0),
        datetime(2019, 1, 1, 23, 59),
        (datetime(2019, 1, 1, 0, 0), datetime(2019, 1, 1, 23, 59)),
        format="HH:mm",
        label_visibility="collapsed"
    )
    columns[2].write(":night_with_stars:")

    return min_hour, max_hour

The results:

All the buttons above were set to collect values to filter our data.

The returned values (min_date, max_date, min_hour, max_hour, vehicle_type) refresh every time we interact with their respective widgets and the page is 'recreated' (i.e., the main.py script is re-executed).

if __name__ == "__main__":

    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"

    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)
    st.markdown("")

    # Adding widgets
    # on the sidebar
    vehicle_classes = widget_vehicle_class()
    min_date, max_date = widget_dates_range()
    min_hour, max_hour = widget_hour_range()

The next step is to read the data, use these values to filter it, and plot the graph.

Plotting the graphs

Plotting graphs with Streamlit is somewhat of an outsourced task. The package has its own plotting methods (line plot, bar plot, maps, etc.) but they are too simple.

Its power resides in its ability to display graphs from other packages. This is an interesting way of doing things: don't reinvent the wheel; leave the heavy lifting to the libraries that already specialize in plotting.

And this is also very convenient for us programmers, because we don't have to learn a new visualization tool and can easily embed our previous work inside a web page. Let's see how easy this is:

import streamlit as st

# Import visualization tools
# and (Geo)Pandas
import geopandas as gpd
import contextily as cx

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

import matplotlib.patheffects as pe

This project uses the Geopandas package to load, process, and plot the georeferenced data. The pyarrow package is used to read the parquet file more efficiently (more on that later). Pandas and numpy are old data science friends, and contextily is an auxiliary package to Geopandas used to plot real-world information on the map (streets, cities, rivers, etc.).

As I don't want to turn this post into a gigantic matplotlib/geopandas tutorial, I won't detail how the graphs are made. But I'll explain, at a high level, some of the decisions taken.

First, the data is loaded using the filters previously collected by the widgets.

VEHICLE_CLASSES_TRANSLATE = {
    "BUS/TRUCK": "CAMINHAO_ONIBUS", "CAR": "AUTOMOVEL",
    "MOTORCYCLE": "MOTO", "UNDEFINED": "INDEFINIDO"
}

@st.cache
def read_traffic_count_data(
    min_date, max_date, min_hour, max_hour, vehicle_classes
):
    # translate the vehicle classes
    vehicle_classes = [
        VEHICLE_CLASSES_TRANSLATE[vehicle_class]
        for vehicle_class in vehicle_classes
    ]

    # Create all month numbers between min_date and max_date
    month_numbers = np.arange(
        min_date.month, max_date.month + 1
    ).tolist()

    # Use pyarrow to open the parquet file
    df_traffic = pq.ParquetDataset(
        "/data/vehicles_count.parquet",
        filters=[
            ("MONTH", "in", month_numbers),
            ("CLASS", "in", vehicle_classes),
        ]
    ).read_pandas().to_pandas()

    # filter by date
    df_traffic = df_traffic.query(
        "MIN_TIME >= @min_date and MIN_TIME <= @max_date"
    )

    # filter by hour
    df_traffic['HOUR'] = df_traffic['MIN_TIME'].dt.hour
    df_traffic = df_traffic.query(
        "HOUR >= @min_hour.hour and HOUR <= @max_hour.hour"
    )

    return df_traffic

There are two important things to learn from the code snippet above.

First, as the Parquet file is partitioned by MONTH and CLASS, it's possible to read only the partitions that are going to be used. For example, if the user sets vehicle_classes = ['CAR'], there is no need to read the partitions for MOTORCYCLE, BUS/TRUCK, and UNDEFINED.

This is very important because Streamlit will re-run the whole Python script every time someone accesses or interacts with the page, and this helps us speed up data loading from disk (a very costly operation).

Secondly, there is a @st.cache decorator on the function. This tells Streamlit to cache the results of this function and try to reuse them later. This is useful for big, costly operations that may slow down the user experience on the page (more on this link).
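Under the hood, this is memoization keyed on the function's arguments, the same idea as the standard library's functools.lru_cache (st.cache adds more on top, such as surviving reruns and tracking mutations). A tiny sketch of the concept:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive_load(month: int) -> int:
    # stands in for the costly disk read
    calls["count"] += 1
    return month * 100

expensive_load(1)
expensive_load(1)  # cache hit: the body does not run again
expensive_load(2)
# calls["count"] == 2
```

Same inputs, cached output; only new argument combinations trigger the expensive work.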

With the data loaded, all that remains is to manipulate it and plot the graphs. Again, I won't detail how this process is done (the code isn't hard, just long), but you can always refer to the source on GitHub. Here is our final main:

if __name__ == "__main__":

    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"

    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)
    st.markdown("")

    # Adding widgets
    # on the sidebar
    vehicle_classes = widget_vehicle_class()
    min_date, max_date = widget_dates_range()
    min_hour, max_hour = widget_hour_range()
    text_sidebar_about()

    if len(vehicle_classes) == 0:
        st.stop()

    # Main app
    # Read data
    df_traffic = read_traffic_count_data(
        min_date, max_date,
        min_hour, max_hour,
        vehicle_classes
    ).copy()
    df_traffic_geogrouped = group_traffic_count_data(df_traffic).copy()
    df_traffic_classgrouped = group_traffic_count_data_by_class(
        df_traffic
    ).copy()

    total_count = df_traffic_classgrouped["COUNT"].sum()

    # Title with total count
    st.header(
        f"{total_count:,} ".replace(',', ' ')
        + "vehicles detected"
    )

    # Plot georeferenced data
    ax = read_map_data(df_traffic_geogrouped)
    ax = plot_count_data(df_traffic_geogrouped, ax)
    st.pyplot(ax.figure)

    # Plot class grouped data
    plot_class_counts_data(df_traffic_classgrouped, vehicle_classes)
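One small detail in the header above: Python's format spec only offers the comma (or underscore) as a thousands separator, so the code swaps the commas for spaces afterwards. In isolation:

```python
# Format a count with spaces as thousands separators
total_count = 1234567
label = f"{total_count:,} ".replace(",", " ") + "vehicles detected"
# label == "1 234 567 vehicles detected"
```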

To plot a graph on Streamlit, just use the respective function for the library you're using. In this case, Geopandas uses matplotlib (pyplot) internally, so it is just a matter of calling st.pyplot() and passing the figure:

# Do everything you need using ax (pyplot object)
# blah blah blah
# when ready give the result to st.pyplot

# Plot georeferenced data
ax = read_map_data(df_traffic_geogrouped)
ax = plot_count_data(df_traffic_geogrouped, ax)
st.pyplot(ax.figure)

The results

Playing with the filters

Conclusion

When we are learning about Data Science, whether in books or online courses, we are taught to do everything inside Jupyter Notebooks. They’re the perfect match between coding, documentation, and visualization. But, in my opinion, even though they are not raw scripts, they’re still very technical documents, and not the ideal way of showing our work to customers and non-IT people.

But we also don’t want to learn a bunch of front-end stuff just to present a simple report, and that’s where Streamlit comes in.

Whether you're looking to better present your data science report, share a demo of a Machine Learning model, or create an interactive dashboard, Streamlit can help you by providing an extremely easy way of building web data apps. It provides a set of minimalist commands and functions to create a simple UI and wrap your raw Data Science work in a presentable shell.

My idea with this post was to present a little of this package and show how it can be used in a 'real scenario' to create an interactive dashboard.

As always, this was just a small post, and I’ll leave some references below if you want to explore deeper.

Hope I’ve helped somehow, thank you for reading! :)

References

All the code is available in this GitHub repository.
Data used — Contagens Volumétricas de Radares, Open data, Brazilian Gov.

[1] Chambers, B., & Zaharia, M. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media, Inc.
[2] Streamlit official Documentation. https://docs.streamlit.io/
[3] Geopandas official Documentation. https://geopandas.org/en/stable/
[4] Streamlit official Documentation — Optimize performance with st.cache.


João Pedro

Bachelor of IT at UFRN. Graduate of BI at UFRN — IMD. Strongly interested in Machine Learning, Data Science and Data Engineering.