I passed my GCP Professional Data Engineer exam in Oct 2020. A few people have reached out to ask for advice, so I’m sharing my experience here in the hope that it helps you nail the exam! Good luck 🙂
Before diving into the exam prep…
First, I’m going to briefly touch on my background, profession, and motivation for taking the GCP data engineer exam. I currently work at Deloitte in the enterprise technology consulting space and have an academic background in data science. Prior to studying for this exam, I had no experience with any cloud platform. I had some idea of what data engineering is, and I had taken a distributed databases unit that introduced me to implementing data pipelines with Apache Kafka, MongoDB, and PySpark, and to streaming visualisation with Python.
My manager at work suggested that I look into the GCP data engineer exam. I thought it sounded interesting, and I love learning new things, so why not! What I didn’t know at the time was that this journey would help me find my passion for data engineering! Since then, I’ve been going to various meetups in Melbourne and upskilling myself, but hey, that’s a different story 🙂
Let’s have a look at how to prepare for it
“Do I need the certification?”
“Why am I taking this exam?”
“What knowledge/experience do I have that can be transferred across?”
“When am I taking this exam? How much time am I willing to commit to study?”
“What does the GCP Professional Data Engineer exam cover?”
First things first: ask yourself these important questions, because they help you plan your study schedule and set your objectives. As you can see in the official exam guide, there are a lot of things to learn: batch processing, streaming, Dataproc, Apache Beam, BigQuery, ML/AI, and more. Setting goals allows you to plan and prioritise.
I used a range of different resources: online courses, official documentation, blogs, and Medium articles. As I intend to work as a data engineer professionally, I also looked for real-world use cases to understand how implementing a data pipeline solves business problems or creates value.
Out of all the resources, gcp-examquestions.com and the Google documentation helped me the most. I’d say the Google documentation was my favorite because it explains the design, concepts, and best practices of each product clearly.
- Udemy (paid < $15, prices depend on the discounts)
- Google Official Documentation
Highly recommend it! It was like a massive playground where I let my curiosity take me to whatever I didn’t understand or found interesting. This helped me gain in-depth knowledge about each product. The real questions in the Guarantee Part on gcp-examquestions.com are also very useful and can help you pass the exam on your first attempt.
There are some awesome articles about how to pass the exam here on Medium. I recommend using them as a quick overview. As products and features have changed over time, the official documentation is your best friend for the latest information.
5. Search for products on Google — Learn from real world use cases and practical tips.
6. Google Cloud Blog
7. GitHub: search for a product name (Dataflow, Bigtable, etc.) to see implementations and code
8. Google ML Crash Course — A refresher on ML concepts such as overfitting, variance, bias, etc.
1. Compare and contrast — Why/when would you use one over the other?
- Cloud Storage vs. BigQuery Storage
- Batch processing vs. Stream processing
- AutoML vs. ML API
2. Hands-on practice
- Cloud Shell commands
- Build a real-time Pub/Sub streaming pipeline for on-street parking in the City of Melbourne. I picked on-street parking because the data is open and easy to access; you can use anything, or even simulate your own streaming data!
- Write/execute Python code for Dataproc
- Qwiklabs — I only did labs when I needed to see tangible results or the flow of execution.
3. Take practice exams, and then take some more.
- gcp-examquestions.com: the Guarantee Part is really helpful because it contains actual exam questions, which can help you pass the exam.
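If you want to try the "simulate your own streaming data" idea from the hands-on list above, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the field names (`bay_id`, `status`, `event_time`), the status values, and the helper names `simulate_parking_events` and `publish_all` are hypothetical and not taken from the City of Melbourne dataset or from my own pipeline. The `publish` argument is just a callable, so you can print messages locally first and later swap in a real publisher.

```python
import json
import random
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Iterator


def simulate_parking_events(n_events: int, n_bays: int = 5, seed: int = 42) -> Iterator[dict]:
    """Yield fake parking-bay sensor readings, one every 30 simulated seconds.

    Field names and status values are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    start = datetime(2020, 10, 1, 9, 0, tzinfo=timezone.utc)
    for i in range(n_events):
        yield {
            "bay_id": f"bay-{rng.randrange(n_bays)}",
            "status": rng.choice(["Present", "Unoccupied"]),
            "event_time": (start + timedelta(seconds=30 * i)).isoformat(),
        }


def publish_all(events: Iterable[dict], publish: Callable[[bytes], None]) -> int:
    """Serialise each event as JSON bytes, hand it to `publish`, return the count."""
    count = 0
    for event in events:
        publish(json.dumps(event).encode("utf-8"))
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in publisher: print each message instead of sending it anywhere.
    # With the google-cloud-pubsub client you would instead pass a function
    # that calls publisher.publish(topic_path, data=...) on a PublisherClient.
    sent = publish_all(
        simulate_parking_events(5),
        lambda data: print(data.decode("utf-8")),
    )
    print(f"published {sent} messages")
```

Keeping the generator separate from the publisher means you can test the message format offline before spending any time on project, topic, or credential setup.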
Topics to study
Now we’re moving on to the specific topics: storage, processing, machine learning, security, monitoring, real-time messaging, workflow management, and others. The key takeaway here is that the exam tests your ability to design a data pipeline based on business requirements, and one size doesn’t fit all. For example, Bigtable is a highly scalable storage solution. But if you’re asked to design a pipeline that supports transactional data and latency isn’t a concern, is it the best option? What might be the best option, and why? Hence, it’s important to know when to use what and why.
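One way to practise this "when to use what" thinking is to write your own rules down as code and argue with them. The sketch below is only a rough first-pass heuristic, not an official decision tree and not how the exam scores anything. The `Workload` fields and the `suggest_storage` function are hypothetical names, and Cloud SQL and Cloud Spanner are products I'm adding here for contrast; real designs also weigh cost, data volume, schema, and access patterns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Simplified, hypothetical description of a storage requirement."""
    unstructured: bool = False             # files, images, raw dumps
    transactional: bool = False            # needs relational transactions
    global_scale: bool = False             # must scale horizontally across regions
    analytical: bool = False               # large SQL scans and aggregations
    low_latency_key_lookups: bool = False  # high-throughput reads/writes by key


def suggest_storage(w: Workload) -> str:
    """Return a first-guess GCP storage product for the workload."""
    if w.unstructured:
        return "Cloud Storage"
    if w.transactional:
        # Bigtable is not a fit for relational transactions.
        return "Cloud Spanner" if w.global_scale else "Cloud SQL"
    if w.analytical:
        return "BigQuery"
    if w.low_latency_key_lookups:
        return "Bigtable"
    return "needs more requirements"


if __name__ == "__main__":
    # The scenario from the paragraph above: transactional data, latency not a concern.
    print(suggest_storage(Workload(transactional=True)))
```

The value is less in the function itself than in noticing which requirement flips the answer, which is how many exam scenarios are framed.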