This exam is exhaustive and covers almost all GCP services and concepts. It tests in-depth understanding of GCP products and services, including but not limited to: identity and access management; designing for security and compliance; selecting appropriate, cost-effective storage technologies; designing scalable data pipelines; building a highly scalable, cost-effective cloud data warehouse with data analytics capability; and training ML models in a distributed environment, all while keeping Google-recommended best practices in mind.
The official guideline from Google:
https://cloud.google.com/certification/guides/data-engineer-2/
So which GCP services do you need to understand in depth to pass this exam?
Preparation
Well, it’s a long list. Here you go:
- Understand the effective use of managed services, how to optimize storage cost and performance, and data life cycle management.
- Understand the various Google data storage technologies (Cloud SQL, Cloud Spanner, BigQuery, Cloud Datastore, Cloud Bigtable, Cloud Storage).
- Understand streaming input, batch input, data sources, data processing, event storage, rules execution and outbound results.
- Understand how to build data pipelines for batch and streaming data (Python, Java, Go, Apache Beam & Dataflow, Dataprep).
- Understand how to work with data on distributed systems (Hadoop, Compute Engine, Dataproc).
- Understand how to scale data flows and leverage real-time data streams (Dataflow, Pub/Sub).
- Securing your data projects (Managing access rights with IAM and VPC networks).
- Cloud Pub/Sub, Cloud Dataflow & Cloud Functions >> Understand data processing systems and architecture options (e.g., message brokers, message queues, event-driven messaging and functions, middleware, service-oriented architecture, serverless functions).
- Architectures for complex event processing.
- What is Cloud Pub/Sub?
- Cloud Dataflow
- Cloud Functions
- Designing data pipelines — Batch and streaming data (e.g., Cloud Dataflow, Cloud Dataproc, Apache Beam, Apache Spark and Hadoop ecosystem, Cloud Pub/Sub, Apache Kafka)
- Cloud Dataflow is a fully managed service for stream and batch data processing based on Apache Beam.
- Understand Dataflow terminology and architecture (elements, PCollections, transforms, side inputs, windowing & triggers).
- Understand how Dataflow handles out-of-order event data (windowing).
- Cloud Dataproc is a fully managed cloud service for running Apache Spark and Apache Hadoop clusters.
- Understand the difference between Cloud Dataproc and Cloud Dataflow. They can both be used for data processing and there is definite overlap in their batch and streaming capabilities.
- Data warehousing and data processing.
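To make the windowing bullet above more concrete, here is a minimal pure-Python sketch of fixed event-time windows. It is not Apache Beam or Dataflow code; the function name `assign_fixed_windows` and the sample events are hypothetical and only illustrate the idea that an element is grouped by when it happened, not by when it arrived.

```python
from collections import defaultdict


def assign_fixed_windows(events, window_size_s):
    """Group (event_time_s, value) pairs into fixed windows keyed by window start.

    Hypothetical helper for illustration; a real pipeline would use the
    windowing primitives of its processing framework instead.
    """
    windows = defaultdict(list)
    for event_time_s, value in events:
        # The window is chosen from the event timestamp, not the arrival order.
        window_start = (event_time_s // window_size_s) * window_size_s
        windows[window_start].append(value)
    return dict(windows)


# Events in arrival order; "b" (event time 42) arrives after "c" (event time 65).
events = [(5, "a"), (65, "c"), (42, "b"), (70, "d")]
windows = assign_fixed_windows(events, 60)
print(windows)  # {0: ['a', 'b'], 60: ['c', 'd']}
```

The late element "b" still lands in the first window. A real streaming system also has to decide when a window is complete enough to emit results, which is where watermarks and triggers come in; this sketch leaves that out.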
BigQuery (BQ) is a major component of the Data Engineer exam, so get to know it in as much depth as possible.
Get familiar with datasets, schemas and tables (including partitioned and clustered tables).
Understand BigQuery authorized views: an authorized view lets you share query results with particular users and groups without giving them access to the underlying tables.
Understand BigQuery IAM roles and permissions. The three types of resources available in BigQuery are organizations, projects, and datasets. In the IAM policy hierarchy, datasets are child resources of projects. Tables and views are child resources of datasets; they inherit permissions from their parent dataset.
Understand how to use external data sources, load data into BigQuery (batch loads & streaming), run and manage jobs, use the BigQuery Data Transfer Service, estimate storage and query costs, and monitor BigQuery using Stackdriver.
TIP: Know some basic SQL query syntax. It is especially helpful for the BigQuery-related questions.
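As a refresher on the kind of basic SQL the tip refers to, here is a small sketch that runs a query with filtering, grouping and ordering. It uses Python's built-in `sqlite3` module as a local stand-in, which is an assumption made for convenience: BigQuery's SQL dialect differs in places, and the `orders` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)],
)

# WHERE filters rows, GROUP BY aggregates per customer,
# HAVING filters the aggregated groups, ORDER BY sorts the result.
query = """
    SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total
    FROM orders
    WHERE amount > 1
    GROUP BY customer
    HAVING SUM(amount) > 6
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('alice', 2, 17.5)]
conn.close()
```

The difference between `WHERE` (applied before aggregation) and `HAVING` (applied after) is worth being comfortable with before the exam.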
Understand Cloud Spanner schema design, data types, secondary indexes and DML best practices.
Understand how to prevent hotspots: swapping the order of keys, hashing the unique key and spreading the writes across logical shards, interleaved tables, and using a universally unique identifier (UUID).
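The hotspot techniques above can be sketched in plain Python. This is only an illustration of the key-design idea, not Cloud Spanner client code: `NUM_SHARDS`, `shard_for`, `sharded_key` and `uuid_key` are hypothetical names, and the shard count of 8 is an arbitrary assumption.

```python
import hashlib
import uuid

NUM_SHARDS = 8  # hypothetical number of logical shards


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a logical shard with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def sharded_key(timestamp_s: int, num_shards: int = NUM_SHARDS):
    """Prefix a monotonically increasing timestamp with a shard id.

    A key that starts with the raw timestamp sends all new writes to the
    end of the key space; leading with the shard id spreads them out.
    """
    return (shard_for(str(timestamp_s), num_shards), timestamp_s)


def uuid_key() -> str:
    """Random (version 4) UUID, which has no sequential prefix."""
    return str(uuid.uuid4())


# 100 consecutive timestamps end up scattered across the logical shards.
keys = [sharded_key(ts) for ts in range(1_700_000_000, 1_700_000_100)]
shards_used = {shard for shard, _ in keys}
print(sorted(shards_used))
print(uuid_key())
```

Swapping the key order follows the same reasoning: a key such as (user id, timestamp) distributes writes by user, whereas (timestamp, user id) concentrates them on the newest timestamp.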
Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
Develop a good understanding of data preparation with Dataprep and how it natively integrates with other GCP services like Cloud Storage, Google BigQuery, Cloud Dataflow and Cloud ML Engine.
You will need a basic understanding of machine learning models, including how to measure, monitor and troubleshoot them, appropriate ML training and serving infrastructure, machine learning terminology (e.g., features, labels, models, regression, classification, recommendation, supervised and unsupervised learning, evaluation metrics) and common sources of error (underfitting, overfitting).
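As a quick illustration of the evaluation metrics mentioned above, here is a minimal sketch that computes precision, recall and accuracy for a binary classifier from scratch. The function names and the label lists are made up for the example; in practice you would normally take these numbers from your ML library or service.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred):
    """Compute the three metrics, returning 0.0 where a denominator is zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical labels (y_true) and model predictions (y_pred).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
print(precision, recall, accuracy)
```

Comparing such metrics on training data and on held-out validation data is the usual way to spot the two error sources named above: a model that scores well on training data but much worse on validation data is likely overfitting, while one that scores poorly on both is likely underfitting.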
Understand how to work with Cloud Datalab Jupyter notebooks.
Understand the various pre-built ML models that Google offers as a service, the ML APIs (e.g., Vision API, Speech API), customizing ML APIs (e.g., AutoML Vision, AutoML text) and conversational experiences (e.g., Dialogflow).
Understand the different Data Transfer Services and related use cases.
Storage Transfer Service >> Transfer data into Google Cloud Storage from other cloud providers, from online sources, or between Cloud Storage buckets.
Online Transfer >> Use your network to move data to Google Cloud Storage.
Transfer Appliance >> Securely capture, ship, and upload your data to Google Cloud Storage using the Transfer Appliance 100 TB or 480 TB models.
BigQuery Data Transfer Service >> Schedule and automate data transfers from your SaaS applications to Google BigQuery.
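Choosing between a network transfer and a Transfer Appliance usually comes down to simple arithmetic on data size and available bandwidth. The sketch below is a rough estimate under stated assumptions: decimal terabytes (10^12 bytes), a link fully dedicated to the transfer unless `utilization` says otherwise, and no protocol overhead. The function name `transfer_days` is hypothetical.

```python
def transfer_days(data_tb: float, bandwidth_mbps: float, utilization: float = 1.0) -> float:
    """Estimate the days needed to move data_tb terabytes over a network link.

    Assumes 1 TB = 10**12 bytes and ignores protocol overhead and retries.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (bandwidth_mbps * 1e6 * utilization)
    return seconds / 86_400  # seconds per day


# 100 TB (the size of the smaller appliance model) over two link speeds.
days_at_1gbps = transfer_days(100, 1000)
days_at_100mbps = transfer_days(100, 100)
print(round(days_at_1gbps, 1), round(days_at_100mbps, 1))  # 9.3 92.6
```

With these assumptions, 100 TB takes roughly nine days on a dedicated 1 Gbps link and about three months at 100 Mbps, which is the kind of gap that makes shipping an appliance worth considering.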
IAM roles and permissions are different for each GCP service. I recommend memorizing which roles and permissions are required for each service in a given scenario.
Understand how to grant permissions based on the principle of least privilege, and at what level. For example, the Dataflow Developer role can create and manage jobs but cannot view the underlying data.
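One way to practice least-privilege thinking is to look at an IAM policy and ask which bindings are broader than they need to be. The sketch below works on a plain Python dictionary shaped like an IAM policy (a list of bindings, each with a role and its members). The helper name `find_broad_bindings`, the example members and the decision to flag the basic roles are assumptions made for illustration, not an official audit tool.

```python
# Basic (formerly "primitive") roles grant wide access across a whole project.
BROAD_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_broad_bindings(policy):
    """Return (member, role) pairs that use a basic role instead of a narrower one."""
    findings = []
    for binding in policy.get("bindings", []):
        if binding["role"] in BROAD_ROLES:
            for member in binding.get("members", []):
                findings.append((member, binding["role"]))
    return findings


# Hypothetical policy: one broad grant and one narrowly scoped grant.
policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:analyst@example.com"]},
        {"role": "roles/bigquery.dataViewer", "members": ["group:bi-team@example.com"]},
    ]
}
findings = find_broad_bindings(policy)
print(findings)  # [('user:analyst@example.com', 'roles/editor')]
```

In the example, the analyst probably only needs read access to specific datasets, so a predefined role granted at the dataset level would be a better fit than Editor on the whole project.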
Coursera
The Preparing for the Google Cloud Professional Data Engineer Exam course on Coursera follows the exam guide outline and will help you build your knowledge and skills. It contains very useful exam preparation tips and also has a practice exam that simulates the actual exam-taking experience.
https://www.coursera.org/learn/preparing-cloud-professional-data-engineer-exam
Exam Dumps
This site is the best resource I found. It had actual exam questions, and I used it for my preparation.
https://www.gcp-examquestions.com
So now you have completed, or are getting close to completing, the training courses. What next?
As you near completion of the training, register for the exam. Now that you have a deadline, complete your training and go over what you have learnt. Go through the practice exams multiple times until your scores improve.
Do understand that the training courses may not cover some of the questions that appear on the actual exam. When you complete the exam you’ll only receive a pass or fail result. Once you have passed, you will get an email with a redemption code for an exclusive Google Cloud Certified store, which has a lot of swag such as t-shirts, backpacks and hoodies. You will also get an invite to join the Google Cloud Certified private community.