Google Professional Data Engineer Online Test

Professional Data Engineer on Google Cloud Platform

What students need to know about the Professional Data Engineer exam

  • Total 184 Questions & Answers

Question 1

Your neural network model is taking days to train. You want to increase the training speed. What can you do?

  • A. Subsample your test dataset.
  • B. Subsample your training dataset.
  • C. Increase the number of input features to your model.
  • D. Increase the number of layers in your neural network.
Answer:

B

Explanation:
Reference: https://towardsdatascience.com/how-to-increase-the-accuracy-of-a-neural-network-9f5d1c6f407d
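
A minimal sketch of option B, using NumPy with synthetic stand-in arrays (the shapes and the 25% sampling rate are assumptions, and the actual `fit` call depends on your framework): training on a random subsample of the training set touches far less data per epoch, which is what shortens training time.

```python
import numpy as np

# Synthetic stand-ins for a large training set; in practice X_train / y_train
# come from your own data pipeline.
rng = np.random.default_rng(seed=42)
X_train = rng.normal(size=(1_000_000, 64)).astype(np.float32)
y_train = rng.integers(0, 2, size=1_000_000)

# Keep a random 25% of the examples so each epoch touches far less data.
subset = rng.choice(len(X_train), size=len(X_train) // 4, replace=False)
X_small, y_small = X_train[subset], y_train[subset]

print(X_small.shape, y_small.shape)  # (250000, 64) (250000,)
# model.fit(X_small, y_small)  # the actual fit call depends on your framework
```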

Question 2

Your company is using WILDCARD tables to query data across multiple tables with similar names. The SQL statement is
currently failing with an error.

Which table name will make the SQL statement work correctly?

  • A. 'bigquery-public-data.noaa_gsod.gsod'
  • B. bigquery-public-data.noaa_gsod.gsod*
  • C. 'bigquery-public-data.noaa_gsod.gsod'*
  • D. `bigquery-public-data.noaa_gsod.gsod*`
Answer:

D

Explanation:
Reference: https://cloud.google.com/bigquery/docs/wildcard-tables
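
For context, a hedged sketch of a wildcard-table query run through the google-cloud-bigquery Python client; the key point is that the table name, including the trailing `*`, is wrapped in backticks. The column and `_TABLE_SUFFIX` filter follow the public NOAA GSOD dataset named in the options.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The wildcard table name must be wrapped in backticks, with the * inside them.
sql = """
    SELECT max, _TABLE_SUFFIX AS table_year
    FROM `bigquery-public-data.noaa_gsod.gsod*`
    WHERE _TABLE_SUFFIX BETWEEN '1940' AND '1944'
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row["table_year"], row["max"])
```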

Question 3

You work for a shipping company that has distribution centers where packages move on delivery lines to route them
properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in
transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real
time while the packages are in transit. Which solution should you choose?

  • A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
  • B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
  • C. Use the Cloud Vision API to detect damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
  • D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
Answer:

B
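
A hedged sketch of option B, assuming an AutoML Vision classification model has already been trained on your image corpus and deployed; the project ID, model ID, file name, and label names are placeholders. The package-tracking application would call something like this per camera frame and flag low-confidence or "damaged" results for human review.

```python
from google.cloud import automl_v1

# Placeholder identifiers; substitute your own project, region, and model ID.
PROJECT_ID = "my-project"
MODEL_ID = "ICN1234567890"

prediction_client = automl_v1.PredictionServiceClient()
model_name = automl_v1.AutoMlClient.model_path(PROJECT_ID, "us-central1", MODEL_ID)

with open("package_frame.jpg", "rb") as f:
    payload = automl_v1.ExamplePayload(image=automl_v1.Image(image_bytes=f.read()))

# Returns label predictions (e.g. "damaged" vs. "intact") with confidence scores,
# which the tracking application can use to flag frames for human review.
response = prediction_client.predict(
    name=model_name, payload=payload, params={"score_threshold": "0.6"}
)
for result in response.payload:
    print(result.display_name, result.classification.score)
```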

Question 4

You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a
separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard
functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover
long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?

  • A. Convert all daily log tables into date-partitioned tables
  • B. Convert the sharded tables into a single partitioned table
  • C. Enable query caching so you can cache data from previous months
  • D. Create separate views to cover each month, and query from these views
Answer:

B
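
A hedged sketch of the consolidation in option B, assuming hypothetical `my_project.game_logs` names: a single CREATE TABLE ... AS SELECT statement folds the sharded LOGS_yyyymmdd tables into one date-partitioned table, so long date ranges no longer touch more than 1,000 tables.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project/dataset names; LOGS_* matches the sharded LOGS_yyyymmdd tables.
sql = """
    CREATE TABLE `my_project.game_logs.logs_partitioned`
    PARTITION BY log_date
    AS
    SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS log_date, *
    FROM `my_project.game_logs.LOGS_*`
"""
client.query(sql).result()
# Long-range reports now query one partitioned table instead of >1,000 shards.
```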

Question 5

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud.
Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about
a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?

  • A. Assign global unique identifiers (GUID) to each data entry.
  • B. Compute the hash value of each data entry, and compare it with all historical data.
  • C. Store each data entry as the primary key in a separate database and apply an index.
  • D. Maintain a database table to store the hash value and other metadata for each data entry.
Answer:

A
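
A minimal, in-memory sketch of option A with hypothetical field names: the GUID is assigned once, when the entry is first created, so any retransmission carries the same identifier and can be dropped with a simple membership check instead of comparing payloads or hashes against historical data.

```python
import json
import uuid

def create_entry(payload: dict) -> dict:
    """Assign the GUID once, when the entry is created; retransmissions of the
    same entry reuse the stored GUID."""
    return {"guid": str(uuid.uuid4()), **payload}

def dedupe(messages):
    """Drop retransmitted copies by GUID membership (in-memory sketch)."""
    seen = set()
    for raw in messages:
        record = json.loads(raw)
        if record["guid"] not in seen:
            seen.add(record["guid"])
            yield record

# Example: the same entry transmitted twice is emitted only once.
entry = create_entry({"sku": "A-100", "qty": 7, "ts": "2024-01-01T00:00:00Z"})
wire = [json.dumps(entry), json.dumps(entry)]
print(list(dedupe(wire)))  # one record
```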

Question 6

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery.
The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to
BigQuery for analysis. Which job type and transforms should this pipeline use?

  • A. Batch job, PubSubIO, side-inputs
  • B. Streaming job, PubSubIO, JdbcIO, side-outputs
  • C. Streaming job, PubSubIO, BigQueryIO, side-inputs
  • D. Streaming job, PubSubIO, BigQueryIO, side-outputs
Answer:

C
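
A hedged Apache Beam (Python SDK) sketch of option C, with hypothetical subscription, query, table, and field names: a streaming job reads from Pub/Sub via PubSubIO, the small BigQuery reference data is passed as a side input (AsDict), and enriched rows are written back with BigQueryIO.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resource names; substitute your own subscription, query, and table.
SUBSCRIPTION = "projects/my-project/subscriptions/events-sub"
OUTPUT_TABLE = "my-project:analytics.enriched_events"

options = PipelineOptions(streaming=True)  # a real run also needs runner/temp_location options

with beam.Pipeline(options=options) as p:
    # Small, static reference data read from BigQuery and passed as a side input.
    reference = (
        p
        | "ReadReference" >> beam.io.ReadFromBigQuery(
            query="SELECT sku, category FROM `my-project.catalog.products`",
            use_standard_sql=True)
        | "ToKV" >> beam.Map(lambda row: (row["sku"], row["category"]))
    )

    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Enrich" >> beam.Map(
            lambda event, ref: {**event, "category": ref.get(event["sku"], "unknown")},
            ref=beam.pvalue.AsDict(reference))
        | "WriteEnriched" >> beam.io.WriteToBigQuery(
            OUTPUT_TABLE,
            schema="sku:STRING,quantity:INTEGER,category:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```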

Question 7

You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work
in progress on your clusters. What should you do?

  • A. Increase the cluster size with more non-preemptible workers.
  • B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
  • C. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
  • D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
Answer:

D

Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex
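
A hedged sketch using the Dataproc Python client, with placeholder project, region, and cluster names: the update grows the preemptible secondary worker pool, and the graceful decommission timeout means that when workers are later removed, in-progress work is allowed to finish rather than being lost.

```python
from google.cloud import dataproc_v1

REGION = "us-central1"        # placeholder
PROJECT_ID = "my-project"     # placeholder
CLUSTER_NAME = "etl-cluster"  # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Add preemptible secondary workers; the graceful decommission timeout applies
# when workers are removed, letting running work drain first.
operation = client.update_cluster(
    request={
        "project_id": PROJECT_ID,
        "region": REGION,
        "cluster_name": CLUSTER_NAME,
        "cluster": {"config": {"secondary_worker_config": {"num_instances": 10}}},
        "update_mask": {"paths": ["config.secondary_worker_config.num_instances"]},
        "graceful_decommission_timeout": {"seconds": 3600},
    }
)
operation.result()  # blocks until the resize completes
```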

Question 8

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting
messages from IoT devices globally. Because large parts of the globe have poor internet connectivity, messages sometimes
batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage
and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?

  • A. Edge TPUs as sensor devices for storing and transmitting the messages.
  • B. Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
  • C. An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.
  • D. A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.
Answer:

C
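
A hedged sketch of the gateway-to-Pub/Sub leg of option C, with placeholder project and topic names: devices (or an IoT gateway acting on their behalf) publish to a Pub/Sub topic, which absorbs the bursty edge traffic, and a Dataflow job like the one sketched under Question 6 reads and processes the messages downstream.

```python
import json
from google.cloud import pubsub_v1

# Placeholder project and topic names.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-telemetry")

def publish_reading(device_id: str, reading: dict) -> None:
    """Publish one device message; Pub/Sub absorbs spikes when batched
    edge traffic arrives all at once."""
    data = json.dumps(reading).encode("utf-8")
    future = publisher.publish(topic_path, data=data, device_id=device_id)
    future.result()  # block until Pub/Sub acknowledges the message

publish_reading("sensor-042", {"temp_c": 21.5, "ts": "2024-01-01T00:00:00Z"})
```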

Question 9

You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to conduct data
transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run
time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How
should you build the pipeline on Google Cloud while meeting speed and processing requirements?

  • A. Convert your PySpark commands into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
  • B. Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
  • C. Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
  • D. Use Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery.
Answer:

C
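
A hedged sketch of option C with placeholder bucket, dataset, and table names: load the raw Cloud Storage files into BigQuery, then express the former PySpark transformations as a BigQuery SQL query whose results land in a new table.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder bucket, dataset, and table names.
RAW_URI = "gs://my-bucket/raw/orders/*.csv"
RAW_TABLE = "my-project.analytics.orders_raw"
CLEAN_TABLE = "my-project.analytics.orders_clean"

# 1. Ingest the raw files from Cloud Storage into BigQuery.
load_job = client.load_table_from_uri(
    RAW_URI,
    RAW_TABLE,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()

# 2. Express the former PySpark transformations as SQL, writing to a new table.
transform_sql = f"""
    SELECT order_id, customer_id, SUM(amount) AS total_amount
    FROM `{RAW_TABLE}`
    GROUP BY order_id, customer_id
"""
query_job = client.query(
    transform_sql,
    job_config=bigquery.QueryJobConfig(
        destination=CLEAN_TABLE, write_disposition="WRITE_TRUNCATE"
    ),
)
query_job.result()
```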

Question 10

You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps.
You have the following requirements:
You will batch-load the posts once per day and run them through the Cloud Natural Language API.

You will extract topics and sentiment from the posts.

You must store the raw posts for archiving and reprocessing.

You will create dashboards to be shared with people both inside and outside your organization.

You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for
historical archiving. What should you do?

  • A. Store the social media posts and the data extracted from the API in BigQuery.
  • B. Store the social media posts and the data extracted from the API in Cloud SQL.
  • C. Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.
  • D. Feed the social media posts into the API directly from the source, and write the extracted data from the API into BigQuery.
Answer:

C
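
A hedged, per-post sketch of option C with placeholder bucket, table, and file names (a production pipeline would batch-load once per day rather than call this per post): the raw post is archived in Cloud Storage for reprocessing, the Natural Language API extracts sentiment, and the extracted fields go to BigQuery for dashboards.

```python
from google.cloud import bigquery, language_v1, storage

# Placeholder bucket and table names.
BUCKET = "my-raw-posts"
TABLE = "my-project.social.post_insights"

storage_client = storage.Client()
bq_client = bigquery.Client()
nl_client = language_v1.LanguageServiceClient()

def process_post(post_id: str, text: str) -> None:
    # 1. Archive the raw post in Cloud Storage for historical reprocessing.
    blob = storage_client.bucket(BUCKET).blob(f"posts/{post_id}.txt")
    blob.upload_from_string(text)

    # 2. Extract sentiment (and, similarly, entities/topics) with the Natural Language API.
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    sentiment = nl_client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment

    # 3. Write the extracted signal to BigQuery for dashboards.
    errors = bq_client.insert_rows_json(
        TABLE,
        [{"post_id": post_id, "score": sentiment.score, "magnitude": sentiment.magnitude}],
    )
    if errors:
        raise RuntimeError(errors)

process_post("post-0001", "The new update is fantastic, delivery was quick!")
```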
