Temporal and Geo-referenced Traffic Management with Python+Streamlit
Applying modern tools to visualize time and spatial data in a dashboard
Introduction
Learning Data Science has never been so easy. Even though it is a relatively "new" area (or just statistics with a cool name), there are already A LOT of books, courses, and videos teaching us practically everything we need. This, combined with the excellent Python packages developed by the community, makes data science a breeze.
But not everything is so easy.
A common pain when developing with Python is that many packages exist for scientific/statistical/numerical data manipulation, graph plotting, and so on, but not many for deploying things into production (serving ML predictions, distributing graphs).
Recently, new tools have arrived to narrow this gap between data science and the operational side of deploying and managing production applications (and, as with everything in Python, with little to no effort).
This post will detail how it's possible to deploy a dashboard on a web page using Streamlit, with a practical example using geo-referenced traffic data.
The problem
In my posts, I always try to bring some idea of 'public utility', usually working with open data from the Brazilian government, and today is no different.
One of the major problems in modern cities is traffic.
I used to play Cities Skylines a lot and, several times, I just dropped my city because of traffic — I just could not make it better. So, if it is hard in a video game, imagine it in real life.
The data used in this post are the readings from traffic sensors in the city of Belo Horizonte (BH), the capital of Minas Gerais (Brazil), that I already used in my post about Spark Structured Streaming and Kafka.
In that post, I talk about how to build a streaming infrastructure to receive and process these readings to feed a supposed dashboard/app that shows traffic in real-time.
Today we’re doing something like that.
We’ll build the dashboard but without the ‘real-time’ part. The user should be able to visualize the traffic volume on a map and travel through time by selecting a time window (like 20-JAN to 10-FEB between 10:30–12:30).
Here is a spoiler of how the final app will look.
This post will focus on how it’s possible to easily build and deploy an app like this with streamlit.
The data
I’ve already talked a little about the data previously. It’s a HUGE dataset in its original form (many GB of JSON), too much to process using just plain Python + pandas.
So, in the GitHub project, there are some pyspark scripts that I used to preprocess and summarize the data. The docker-compose file is already configured with a small standalone Spark cluster to run these scripts (learn more on how to execute them in this post).
The preprocessing aggregates the readings using 15min time windows, counting the number of vehicles of each class detected by each sensor.
The final result is a table with the following columns:
- LATITUDE/LONGITUDE — Sensor coordinates
- MIN_TIME — Start time of the window (Timestamp)
- CLASS — Vehicle class: CAR, MOTORCYCLE, TRUCK/BUS, UNDEFINED
- DATA HORA — Hour of MIN_TIME
- MONTH — Month of MIN_TIME
- COUNT — Number of vehicles detected
The table is stored in a parquet file partitioned by MONTH and CLASS.
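To make the windowing concrete, here is a minimal pandas sketch of the same idea, using a handful of made-up readings (the real preprocessing is done by the pyspark jobs in the repository, on the full dataset): floor each reading's timestamp to its 15-minute window, then count vehicles per window, class, and sensor.

```python
import pandas as pd

# Hypothetical miniature of the raw readings: one row per detected vehicle.
readings = pd.DataFrame({
    "TIME": pd.to_datetime([
        "2022-01-01 10:02", "2022-01-01 10:07",
        "2022-01-01 10:20", "2022-01-01 10:31",
    ]),
    "CLASS": ["AUTOMOVEL", "AUTOMOVEL", "MOTO", "AUTOMOVEL"],
    "LATITUDE": [-19.9] * 4,
    "LONGITUDE": [-43.9] * 4,
})

# Floor each timestamp to the start of its 15-minute window.
readings["MIN_TIME"] = readings["TIME"].dt.floor("15min")

# Count vehicles per window, class, and sensor position.
counts = (
    readings
    .groupby(["MIN_TIME", "CLASS", "LATITUDE", "LONGITUDE"])
    .size()
    .reset_index(name="COUNT")
)

# Derived column used for partitioning, as in the post's final table.
counts["MONTH"] = counts["MIN_TIME"].dt.month

# The real table would then be written partitioned, e.g.:
# counts.to_parquet("vehicles_count.parquet", partition_cols=["MONTH", "CLASS"])
```

With the four readings above, the two 10:02/10:07 cars collapse into one row with COUNT = 2 for the 10:00 window, and the other readings fall into the 10:15 and 10:30 windows.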
The Implementation
Setting up the environment
All you need is docker and docker-compose.
All the code is available in this GitHub repository.
docker-compose.yaml file:
version: '3'
services:
  pystreamlit:
    build:
      context: .
    volumes:
      - ./app:/app
      - ./data:/data
    ports:
      - 8501:8501
  spark:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
    volumes:
      - ./data:/data
      - ./jobs:/jobs
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_EXECUTOR_MEMORY=4G
      - SPARK_WORKER_CORES=4
    volumes:
      - ./data:/data
      - ./jobs:/jobs
To start the environment, just run
docker-compose up
And open your browser at localhost:8501
If you want to start just the Streamlit service:
docker-compose up pystreamlit
Streamlit in a nutshell
Streamlit is a Python package for developing web data applications.
It abstracts all the frontend/web dev plumbing needed to deploy a web application behind high-level components, making the development simple and fast.
Because of this, in the end, the scripts are just plain python code, with no need to configure HTML files or other frontend components. So, if you are a Data Scientist wanting to share your reports or ML models in a nice visual way, that’s the package for you.
Creating a Streamlit App
A Streamlit app is just a Python script.
To start, just create a main.py (or any other name) and import Streamlit.
import streamlit as st
From now on, it is just a game of adding pieces to a blank canvas. For example, the script below adds a title to the main page and a subtitle on the sidebar.
if __name__ == "__main__":
    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"
    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)
To execute, just run
streamlit run main.py
And open your browser at localhost:8501
Adding widgets
Widgets are interactive UI elements that return a value inside our application. We’ll be using the widgets to build the filters for our dashboard.
Basically, if you want to add a widget to the page, all that’s needed is to call the respective method on the st object imported.
Let’s add a selection menu where the user can choose the vehicle classes to plot.
import streamlit as st

def widget_vehicle_class():
    vehicle_type = st.sidebar.multiselect(
        "Vehicle Class",
        ["BUS/TRUCK", "CAR", "MOTORCYCLE", "UNDEFINED"],  # Possible values
        ["BUS/TRUCK", "CAR", "MOTORCYCLE", "UNDEFINED"]   # Default values
    )
    return vehicle_type

if __name__ == "__main__":
    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"
    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)

    vehicle_classes = widget_vehicle_class()
The vehicle_type variable stores the returned value of the multiselect widget.
After refreshing the browser:
Every time we add a new element to the UI it is placed below the previous elements.
And it doesn’t get any harder than this.
Our application also needs to receive two dates representing the time interval considered (start date and end date). We could just add two date input widgets in the same way we did above with the multiselect object, but it will be much more visually appealing if they’re side by side.
It’s very easy to achieve this by using columns, as shown below:
MIN_DATE = pd.to_datetime("2022-01-01")
MAX_DATE = pd.to_datetime("2022-02-28")

def widget_dates_range():
    lateral_columns = st.sidebar.columns([1, 1])
    min_date = lateral_columns[0].date_input(
        "From", MIN_DATE,
        min_value=MIN_DATE,
        max_value=MAX_DATE
    )
    max_date = lateral_columns[1].date_input(
        "To", MAX_DATE,
        min_value=MIN_DATE,
        max_value=MAX_DATE
    )
    return min_date, max_date
The columns() method accepts a list where each value represents (proportionally) the length of the column and returns a column list. The column object has the same methods as the st object so, to add a widget to one, just call the respective method.
As the two columns in the code above were created with the same proportions they have the same final length. If you need different-sized columns, just use different proportions, as in the code below:
from datetime import datetime

def widget_hour_range():
    # Slider with the hour and minute
    columns = st.sidebar.columns([1, 20, 1])
    columns[0].write(":city_sunset:")
    min_hour, max_hour = columns[1].slider(
        "",
        datetime(2019, 1, 1, 0, 0),
        datetime(2019, 1, 1, 23, 59),
        (datetime(2019, 1, 1, 0, 0), datetime(2019, 1, 1, 23, 59)),
        format="HH:mm",
        label_visibility="collapsed"
    )
    columns[2].write(":night_with_stars:")
    return min_hour, max_hour
The results:
All the buttons above were set to collect values to filter our data.
The returned values (min_date, max_date, min_hour, max_hour, vehicle_type) refresh every time we interact with their respective widgets and the page is ‘recreated’ (i.e. the main.py script is re-executed).
if __name__ == "__main__":
    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"
    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)
    st.markdown("")

    # Adding widgets
    # on the sidebar
    vehicle_classes = widget_vehicle_class()
    min_date, max_date = widget_dates_range()
    min_hour, max_hour = widget_hour_range()
The next step is to read the data, use these values to filter it, and plot the graph.
Plotting the graphs
Plotting graphs with Streamlit is somewhat of an outsourced task. The package has its own plotting methods (line plot, bar plot, maps, etc.) but they are too simple.
Its power resides in the ability to display graphs from other packages. This is an interesting way of doing things: don’t reinvent the wheel, leave the heavy lifting to the libraries that already have experience with plotting.
And this is also great for us programmers, because we don’t have to learn a new visualization tool and can easily embed our previous work inside a web page. Let’s see how easy this is:
import streamlit as st
# Import visualization tools
# and (Geo)Pandas
import geopandas as gpd
import contextily as cx
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import matplotlib.patheffects as pe
This project uses the Geopandas package to load, process, and plot the georeferenced data. The pyarrow package is used to read the parquet file more efficiently (more on that later). Pandas and numpy are old data science friends, and ‘contextily’ is an auxiliary package to Geopandas used to plot real-world information on the map (streets, cities, rivers, etc).
As I don’t want to turn this post into a gigantic matplotlib/geopandas tutorial, I won’t detail how the graphs are made. But I’ll explain some of the decisions taken at a high level.
First, the data is loaded using the filters previously collected by the widgets.
VEHICLE_CLASSES_TRANSLATE = {
    "BUS/TRUCK": "CAMINHAO_ONIBUS", "CAR": "AUTOMOVEL",
    "MOTORCYCLE": "MOTO", "UNDEFINED": "INDEFINIDO"
}

@st.cache
def read_traffic_count_data(
    min_date, max_date, min_hour, max_hour, vehicle_classes
):
    # Translate the vehicle classes
    vehicle_classes = [
        VEHICLE_CLASSES_TRANSLATE[vehicle_class]
        for vehicle_class in vehicle_classes
    ]

    # Create all month numbers between min_date and max_date
    month_numbers = np.arange(
        min_date.month, max_date.month + 1
    ).tolist()

    # Use pyarrow to open the parquet file
    df_traffic = pq.ParquetDataset(
        "/data/vehicles_count.parquet",
        filters=[
            ("MONTH", "in", month_numbers),
            ("CLASS", "in", vehicle_classes),
        ]
    ).read_pandas().to_pandas()

    # Filter by date
    df_traffic = df_traffic.query(
        "MIN_TIME >= @min_date and MIN_TIME <= @max_date"
    )

    # Filter by hour
    df_traffic['HOUR'] = df_traffic['MIN_TIME'].dt.hour
    df_traffic = df_traffic.query(
        "HOUR >= @min_hour.hour and HOUR <= @max_hour.hour"
    )
    return df_traffic
There are two important things to learn from the code snippet above.
First, as the Parquet is partitioned by MONTH and CLASS, it’s possible to read only the partitions that are going to be used. For example, if the user sets vehicle_class = [‘CAR’], there is no need to read the partitions for MOTORCYCLE, BUS/TRUCK, and UNDEFINED.
This is very important because Streamlit re-runs the whole Python script every time someone accesses the page, so this helps us speed up data loading from disk (a very costly operation).
Secondly, there is a @st.cache annotation on the function. This tells Streamlit to cache the function’s results and try to reuse them later. This is useful for big, costly operations that may slow down the user experience on the page (more on this link).
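Broadly speaking, st.cache works like memoization keyed on the function’s arguments: a repeated call with the same filters is served from the cache instead of hitting the disk again. As a plain-Python analogy (not Streamlit’s actual implementation), functools.lru_cache shows the idea:

```python
from functools import lru_cache

# Counter to make recomputations visible.
CALLS = {"count": 0}

@lru_cache(maxsize=None)
def expensive_load(month: int, vehicle_class: str) -> str:
    # Stand-in for a costly disk read; the counter shows how often it really runs.
    CALLS["count"] += 1
    return f"data for month={month}, class={vehicle_class}"

expensive_load(1, "CAR")   # computed
expensive_load(1, "CAR")   # same arguments: served from the cache
expensive_load(2, "CAR")   # new arguments: computed again
```

After the three calls above, the function body has only executed twice, which is exactly the saving st.cache gives us across script re-runs.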
With the data loaded, all that remains is to manipulate it and plot the graphs. Again, I won’t detail how this process is done (the code isn’t hard, just big), but you can always refer to the source on GitHub. Here is our final main:
if __name__ == "__main__":
    # App header
    TITLE = "How, when and where people move in the roads of Belo Horizonte?"
    SUBTITLE = "Georeferenced and temporal analysis of traffic in the capital of Minas Gerais"
    st.title(TITLE)
    st.sidebar.markdown("## " + SUBTITLE)
    st.markdown("")

    # Adding widgets
    # on the sidebar
    vehicle_classes = widget_vehicle_class()
    min_date, max_date = widget_dates_range()
    min_hour, max_hour = widget_hour_range()
    text_sidebar_about()

    if len(vehicle_classes) == 0:
        st.stop()

    # Main app
    # Read data
    df_traffic = read_traffic_count_data(
        min_date, max_date,
        min_hour, max_hour,
        vehicle_classes
    ).copy()
    df_traffic_geogrouped = group_traffic_count_data(df_traffic).copy()
    df_traffic_classgrouped = group_traffic_count_data_by_class(
        df_traffic).copy()
    total_count = df_traffic_classgrouped["COUNT"].sum()

    # Title with total count
    st.header(
        f"{total_count:,} ".replace(',', ' ')
        + "vehicles detected"
    )

    # Plot georeferenced data
    ax = read_map_data(df_traffic_geogrouped)
    ax = plot_count_data(df_traffic_geogrouped, ax)
    st.pyplot(ax.figure)

    # Plot class grouped data
    plot_class_counts_data(df_traffic_classgrouped, vehicle_classes)
To plot a graph in Streamlit, just use the respective function of the library you’re using. In this case, Geopandas uses matplotlib (pyplot) internally, so it’s just a matter of calling st.pyplot() and passing the figure:
# Do everything you need using ax (pyplot object)
# blah blah blah
# when ready give the result to st.pyplot
# Plot georeferenced data
ax = read_map_data(df_traffic_geogrouped)
ax = plot_count_data(df_traffic_geogrouped, ax)
st.pyplot(ax.figure)
The results
Playing with the filters
Conclusion
When we are learning about Data Science, whether in books or online courses, we are taught to do everything inside Jupyter Notebooks. They’re the perfect match between coding, documentation, and visualization. But, in my opinion, even though they are not raw scripts, they’re still very technical documents, and not the ideal way of showing our work to customers and non-IT people.
But we also don’t want to learn a bunch of front-end stuff just to present a simple report, and that’s where Streamlit comes in.
Whether you’re looking to better present your data science report, make available a demo of a Machine Learning model, or create an interactive dashboard, Streamlit can help you by providing an extremely easy way of building web data apps. It provides a set of minimalist commands and functions to create a simple UI and wrap your raw Data Science work in a presentable shell.
My idea with this post is to present a little about this package, and how it can be used in a ‘real scenario’ to create an interactive Dashboard.
As always, this was just a small post, and I’ll leave some references below if you want to explore deeper.
Hope I’ve helped somehow, thank you for reading! :)
References
All the code is available in this GitHub repository.
Data used — Contagens Volumétricas de Radares, Open data, Brazilian Gov.
[1] Chambers, B., & Zaharia, M. (2018). Spark: The definitive guide: Big data processing made simple. “ O’Reilly Media, Inc.”.
[2] Streamlit official Documentation. https://docs.streamlit.io/
[3] Geopandas official Documentation. https://geopandas.org/en/stable/
[4] Streamlit official Documentation — Optimize performance with st.cache.