Hello, and welcome to the data engineering podcast.
My name is Nahid, and I am a data engineer at Dolead. Today I will talk about the data environment at Dolead.
First, I will introduce you to the challenges we have at Dolead:
- Data reporting
- Data dashboarding
- Data exploration
- Data intelligence
Then, I will talk about the technical stack we chose to tackle these challenges:
- Google Cloud
- BigQuery
- DBT
- Airbyte
- Airflow
And finally, I will explain the different challenges we are currently working on.
Data environment
I will start with our data environment and the motivation behind our data-driven approach.
Dolead is a digital advertising company and, as such, we receive hundreds of gigabytes of data every day from different sources across all units of the company.
These sources include, but are not limited to:
- Social network data that we synchronize from Meta, Google, TikTok or Bing
- Ad Network data from Outbrain or Taboola
- User behavior tracking on our Landing Pages
- Online forms and surveys
- Advertiser data such as the number of sales stemming from our ads
All of these sources create a complex data environment that I have to synthesize and organize so that our users and stakeholders can make the most of it in a self-service mode.
Data challenges and use cases
- Data dashboarding is the act of providing tailored dashboards and visualizations for day-to-day monitoring. Data quality, data freshness and data transformations are important. Data freshness is usually at the day level, or even at the hour level for some data.
- Data reporting is about periodic data exports and reports for meetings and milestones. Quality and transformations are crucial.
- Data exploration: periodic data exploration by the technical teams and data scientists. Data quality and data freshness are required.
- Data intelligence: automation and machine learning algorithms. Data quality is paramount for good model training.
Technical stack
- Google Cloud
- BigQuery
- Serverless SQL data warehouse
- Scale up and scale down resources as needed
- Pay-as-you-go model, billed on the amount of data processed by your SQL queries and transformations
- Looker
- Reporting & dashboarding software that is connected to BigQuery for displaying and visualizing the data hosted on BigQuery
- Cloud Storage
- Raw data storage facility that we use to store files, objects and media, or as a staging area before loading data into BigQuery
- Airbyte
- Extract and Load software that is available both as a cloud service and as open-source software
- You can write your data pipelines as Python code and host them on an Airbyte server that runs your code
- You can use off-the-shelf data pipelines written by the community for standard Extract and Load needs
- Not suited for transformations as it runs on a single server; it's better to let the transformations happen in your data warehouse, such as BigQuery
- DBT
- SQL transformation tool that gives you the ability to write your data transformations as SQL code and runs them in your data warehouse
- Lets you write macros, meaning reusable SQL snippets, to package and optimize your code
- It also comes with data testing facilities
- Airflow (orchestrator)
- Airflow acts like a project manager for your data tasks, making sure everything runs smoothly, on schedule, and in the correct order, much like a recipe whose steps must be followed in sequence to prepare a dish.
- Airflow allows you to program workflows using Python scripts. These workflows are designed as "Directed Acyclic Graphs" (DAGs). In simple terms, a DAG is just a way of organizing tasks where each task has a specific order and none of the steps loop back on themselves (that's what acyclic means).
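To make this more concrete, here is a minimal sketch of what one of our orchestration DAGs could look like. The DAG name, dbt project path and the Airbyte trigger are hypothetical placeholders used only for illustration, not our production code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def trigger_airbyte_sync():
    # Placeholder: in practice this would call the Airbyte API (or an
    # Airbyte provider operator) to start the Extract & Load sync.
    print("Triggering Airbyte connection sync...")


with DAG(
    dag_id="daily_marketing_pipeline",  # hypothetical name
    schedule="@daily",                  # Airflow 2.4+ syntax
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Step 1: Extract & Load the raw data into BigQuery with Airbyte.
    extract_and_load = PythonOperator(
        task_id="airbyte_sync",
        python_callable=trigger_airbyte_sync,
    )

    # Step 2: run the dbt SQL transformations inside BigQuery.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt_project && dbt run",  # hypothetical path
    )

    # The ">>" operator defines the order of the graph: E&L first, then transform.
    extract_and_load >> dbt_run
```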
Challenges

Scalability of the data warehouse

As I mentioned earlier, BigQuery offers virtually endless scalability. But in reality, its pay-as-you-go model makes it more and more expensive to process larger amounts of data. Moreover, understanding and predicting costs can be complicated due to the variable pricing model. This means that we still need to optimize our storage, our processes and our code to be able to scale up without exceeding our budgets. We are constantly challenging, reviewing and refactoring our code to meet these expectations.
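One practical way to keep an eye on this is to estimate what a query would cost before running it. BigQuery supports dry-run queries that report how many bytes a query would scan without executing it. The sketch below shows the idea; the project, table and columns are hypothetical, and the on-demand rate is only illustrative, so check current BigQuery pricing for real numbers.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT campaign_id, SUM(cost) AS total_cost   -- hypothetical table and columns
    FROM `my-project.marketing.ad_spend`
    WHERE event_date >= '2024-01-01'
    GROUP BY campaign_id
"""

# A dry run validates the query and returns the bytes it would scan,
# without executing it (and therefore without incurring query costs).
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

scanned_gb = job.total_bytes_processed / 1024**3
# Illustrative on-demand rate per TiB; not an official price.
estimated_cost = job.total_bytes_processed / 1024**4 * 6.25

print(f"Query would scan {scanned_gb:.2f} GB (~${estimated_cost:.4f} on demand)")
```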
Data quality

The second challenge is data quality. Data quality refers to the condition of data based on factors like accuracy, completeness, reliability, and relevance. High-quality data should be correct, up-to-date, comprehensive, and applicable to the needs of the user or application. Poor data quality can lead to erroneous conclusions and inefficient processes, which can have severe implications in business settings, ranging from minor inefficiencies to major blunders in strategic decisions.

The three main challenges when it comes to data quality are the following:
- Volume of data: The sheer amount of data generated today makes manual checks impractical.
- Variety of sources: Data coming from different sources may have different formats or standards.
- Velocity: The speed at which data is generated requires automated tools to manage and maintain quality.
Our approach to these challenges relies on:
- Continuous Monitoring: Regularly checking data quality metrics to catch issues early.
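As an illustration of what continuous monitoring can look like in practice, here is a small sketch of a freshness check that could run on a schedule against BigQuery. The dataset, table and threshold are hypothetical; in a real setup a check like this would typically live in an Airflow DAG or a dbt test rather than a standalone script.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table: the latest tracked event should never lag more than 2 hours.
FRESHNESS_SQL = """
    SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_ts), HOUR) AS lag_hours
    FROM `my-project.tracking.landing_page_events`
"""

row = next(iter(client.query(FRESHNESS_SQL).result()))

if row.lag_hours is None or row.lag_hours > 2:
    # In production this would alert the team (Slack, PagerDuty, ...) instead.
    raise RuntimeError(f"Data freshness check failed: lag is {row.lag_hours} hours")

print(f"Freshness OK: latest event is {row.lag_hours} hour(s) old")
```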