João Pedro in Towards Data Science · My First Billion (of Rows) in DuckDB · First Impressions of DuckDB handling 450Gb in a real project · 12 min read · May 1, 2024
João Pedro in Towards Data Science · Anatomy of Windows Functions · Theory and practice of an underappreciated SQL operation · 12 min read · 5 days ago
João Pedro in Towards Data Science · Automatically Detecting Label Errors in Datasets with CleanLab · A Tale of AI and wrongly-classified Brazilian Federal Laws · 10 min read · Jul 22, 2023
João Pedro in Towards Data Science · Automatically Managing Data Pipeline Infrastructures With Terraform · I know the manual work you did last summer · 15 min read · May 2, 2023
João Pedro in Towards Data Science · Data Pipeline with Airflow and AWS Tools (S3, Lambda & Glue) · Learning a little about these tools and how to integrate them · 17 min read · Apr 6, 2023
João Pedro in Towards Data Science · Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query · On-premise and cloud working together to deliver a data product · 10 min read · Mar 6, 2023
João Pedro in Towards Data Science · Hands-On Introduction to Delta Lake with (py)Spark · Concepts, theory, and functionalities of this modern data storage framework · 10 min read · Feb 16, 2023
João Pedro · Temporal and Geo-referenced Traffic Management with Python+Streamlit · Applying modern tools to visualize time and spatial data in a dashboard · 10 min read · Jan 29, 2023
João Pedro in Towards Data Science · First Steps in Machine Learning with Apache Spark · Basic concepts and topics of Spark MLlib package · 11 min read · Jan 4, 2023
João Pedro in Towards Data Science · A Fast Look at Spark Structured Streaming + Kafka · Learning the basics of how to use this powerful duo for stream-processing tasks · 11 min read · Nov 5, 2022