Databricks Certified Associate Developer for Apache Spark 3.0 시험 - Databricks실제시험문제와 답 - 180문항

Question No : 1

Which of the following statements about DAGs is correct?

A.DAGs help direct how Spark executors process tasks, but are a limitation to the proper execution of a query when an executor fails.
B.DAG stands for "Directing Acyclic Graph".
C.Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
D.In contrast to transformations, DAGs are never lazily executed.
E.DAGs can be decomposed into tasks that are executed in parallel.

정답:
Explanation:
DAG stands for "Directing Acyclic Graph".
No, DAG stands for "Directed Acyclic Graph".
Spark strategically hides DAGs from developers, since the high degree of automation in Spark means that developers never need to consider DAG layouts.
No, quite the opposite. You can access DAGs through the Spark UI and they can be of great help when optimizing queries manually.
In contrast to transformations, DAGs are never lazily executed.
DAGs represent the execution plan in Spark and as such are lazily executed when the driver requests the data processed in the DAG.

Question No : 2

Which of the following is a characteristic of the cluster manager?

A.Each cluster manager works on a single partition of data.
B.The cluster manager receives input from the driver through the SparkContext.
C.The cluster manager does not exist in standalone mode.
D.The cluster manager transforms jobs into DAGs.
E.In client mode, the cluster manager runs on the edge node.

정답:
Explanation:
The cluster manager receives input from the driver through the SparkContext. Correct. In order for the driver to contact the cluster manager, the driver launches a SparkContext. The driver then asks the cluster manager for resources to launch executors. In client mode, the cluster manager runs on the edge node.
No. In client mode, the cluster manager is independent of the edge node and runs in the cluster.
The cluster manager does not exist in standalone mode.
Wrong, the cluster manager exists even in standalone mode. Remember, standalone mode is an easy means to deploy Spark across a whole cluster, with some limitations. For example, in standalone mode, no other frameworks can run in parallel with Spark. The cluster manager is part of Spark in standalone deployments however and helps launch and maintain resources across the cluster.
The cluster manager transforms jobs into DAGs.
No, transforming jobs into DAGs is the task of the Spark driver.
Each cluster manager works on a single partition of data.
No. Cluster managers do not work on partitions directly. Their job is to coordinate cluster resources so that they can be requested by and allocated to Spark drivers. More info: Introduction to Core Spark Concepts • BigData

Question No : 3

dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))
E. 1.dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

정답: C
Explanation:
This QUESTION NO: is tricky. Two things are important to know here:
First, the syntax for createDataFrame: Here you need a list of tuples, like so: [(1,), (2,)]. To define a tuple in Python, if you just have a single item in it, it is important to put a comma after the item so
that Python interprets it as a tuple and not just a normal parenthesis.
Second, you should understand the to_timestamp syntax. You can find out more about it in the documentation linked below.
For good measure, let's examine in detail why the incorrect options are wrong: dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
This code snippet does everything the QUESTION NO: asks for C except that the data type of the date column is a string and not a timestamp. When no schema is specified, Spark sets the string
data type as default.
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))
In the first row of this command, Spark throws the following error: TypeError: Can not infer schema for type: <class 'str'>. This is because Spark expects to find row information, but instead finds
strings. This is why you need to specify the data as tuples. Fortunately, the Spark
documentation (linked below) shows a number of examples for creating DataFrames that
should help you get on
the right track here.
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
The issue with this answer is that the operator withColumnRenamed is used. This operator simply renames a column, but it has no power to modify its actual content. This is why withColumn should
be used instead. In addition, the date format yyyy-MM-dd HH:mm:ss does not reflect the format of the actual timestamp: "23/01/2022 11:28:12".
dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"]) dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))
Here, withColumnRenamed is used instead of withColumn (see above). In addition, the rows are not expressed correctly C they should be written as tuples, using parentheses. Finally, even the date
format is off here (see above).
More info: pyspark.sql.functions.to_timestamp ― PySpark 3.1.2 documentation and pyspark.sql.SparkSession.createDataFrame ― PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 2, 38.(Databricks import instructions)

Question No : 4

col(["transactionId", "predError", "value", "f"])

정답: C
Explanation:
Correct code block:
transactionsDf.select(["transactionId", "predError", "value", "f"])
The DataFrame.select returns specific columns from the DataFrame and accepts a list as its only argument. Thus, this is the correct choice here. The option using col(["transactionId", "predError",
"value", "f"]) is invalid, since inside col(), one can only pass a single column name, not a
list. Likewise, all columns being specified in a single string like "transactionId, predError,
value, f" is not valid
syntax.
filter and where filter rows based on conditions, they do not control which columns to return.
Static notebook | Dynamic notebook: See test 2,

Question No : 5

Which of the following describes a way for resizing a DataFrame from 16 to 8 partitions in the most efficient way?

A.Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
B.Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
C.Use a narrow transformation to reduce the number of partitions.
D.Use a wide transformation to reduce the number of partitions. Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.

정답:
Explanation:
Use a narrow transformation to reduce the number of partitions.
Correct! DataFrame.coalesce(n) is a narrow transformation, and in fact the most efficient way to resize the DataFrame of all options listed. One would run DataFrame.coalesce(8) to resize the
DataFrame.
Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
Wrong. The coalesce operation avoids a full shuffle, but will shuffle data if needed. This answer is incorrect because it says "fully shuffle" C this is something the coalesce operation will not do. As a general rule, it will reduce the number of partitions with the very least movement of data possible. More info: distributed computing - Spark - repartition() vs coalesce() - Stack Overflow Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
Incorrect, since the num_partitions parameter needs to be an integer number defining the exact number of partitions desired after the operation. More info:
pyspark.sql.DataFrame.coalesce ―
PySpark 3.1.2 documentation
Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
No. The repartition operation will fully shuffle the DataFrame. This is not the most efficient
way of reducing the number of partitions of all listed options.
Use a wide transformation to reduce the number of partitions.
No. While possible via the DataFrame.repartition(n) command, the resulting full shuffle is not the most efficient way of reducing the number of partitions.

Question No : 6

spark.sql ("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")
F. transactionsDf.filter(col(transactionId).isin([3,4,6]))

정답: D
Explanation:
Output of correct code block:
+---------+-----+
|predError|value|
+---------+-----+
| 6| 7|
| null| null|
| 3| 2|
+---------+-----+
This is not an easy QUESTION NO: to solve. You need to know that % stands for the module operator in Python. % 2 will return true for every second row. The statement using spark.sql gets it
almost right (the modulo operator exists in SQL as well), but % 2 = 2 will never yield true, since modulo 2 is either 0 or 1.
Other answers are wrong since they are missing quotes around the column names and/or use filter or select incorrectly.
If you have any doubts about SparkSQL and answer options 3 and 4 in this question, check out the notebook I created as a response to a related student question.
Static notebook | Dynamic notebook: See test 1,53.(Databricks import instructions)

Question No : 7

Which is the highest level in Spark's execution hierarchy?

A.Task
B.Executor
C.Slot
D.Job
E.Stage

정답:
Explanation:

Question No : 8

+------+-----------------------------+-------------------+

A.itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))
B.itemsDf.withColumn('attributes', sort_array(desc('attributes')))
C.itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))
D.itemsDf.withColumn("attributes", sort_array("attributes", asc=False))
E.itemsDf.select(sort_array("attributes"))

정답:
Explanation:
Output of correct code block:
+------+-----------------------------+-------------------+
|itemId|attributes |supplier |
+------+-----------------------------+-------------------+
|1 |[winter, cozy, blue] |Sports Company Inc.|
|2 |[summer, red, fresh, cooling]|YetiX |
|3 |[travel, summer, green] |Sports Company Inc.|
+------+-----------------------------+-------------------+
It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity about sort_array has to be considered: The sort direction is given by the second argument, not by the desc method. Luckily, this is documented in the documentation (link below). Also, for solving this QUESTION NO: you need to understand the difference between sort and sort_array. With sort, you cannot sort values in arrays. Also, sort is a method of DataFrame, while sort_array is a method of pyspark.sql.functions.
More info: pyspark.sql.functions.sort_array ― PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2,32.(Databricks import instructions)

Question No : 9

Which of the following describes characteristics of the Dataset API?

A.The Dataset API does not support unstructured data.
B.In Python, the Dataset API mainly resembles Pandas' DataFrame AP
C.In Python, the Dataset API's schema is constructed via type hints.
D.The Dataset API is available in Scala, but it is not available in Python.
E.The Dataset API does not provide compile-time type safety.

정답:
Explanation:
The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In
Python, you use the DataFrame API, which is based on the Dataset API.
The Dataset API does not provide compile-time type safety.
No C in fact, depending on the use case, the type safety that the Dataset API provides is an advantage.
The Dataset API does not support unstructured data.
Wrong, the Dataset API supports structured and unstructured data.
In Python, the Dataset API's schema is constructed via type hints.
No, this is not applicable since the Dataset API is not available in Python.
In Python, the Dataset API mainly resembles Pandas' DataFrame API.
The Dataset API does not exist in Python, only in Scala and Java.

Question No : 10

Which of the following are valid execution modes?

A.Kubernetes, Local, Client
B.Client, Cluster, Local
C.Server, Standalone, Client
D.Cluster, Server, Local
E.Standalone, Client, Cluster

정답:
Explanation:
This is a tricky QUESTION NO: to get right, since it is easy to confuse execution modes and deployment modes. Even in literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: Client, cluster, and local execution modes. Execution modes do not refer to specific frameworks, but to where infrastructure is
located with respect to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM
(Java Virtual Machine) in a single computer which then also includes the driver. Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN,
Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are the valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though. Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes. More info: Apache Spark Internals - Learning Journal

Question No : 11

Which of the following statements about lazy evaluation is incorrect?

A.Predicate pushdown is a feature resulting from lazy evaluation.
B.Execution is triggered by transformations.
C.Spark will fail a job only during execution, but not during definition.
D.Accumulators do not change the lazy evaluation model of Spark.
E.Lineages allow Spark to coalesce transformations into stages

정답:
Explanation:
Execution is triggered by transformations.
Correct. Execution is triggered by actions only, not by transformations.
Lineages allow Spark to coalesce transformations into stages.
Incorrect. In Spark, lineage means a recording of transformations. This lineage enables lazy evaluation in Spark.
Predicate pushdown is a feature resulting from lazy evaluation.
Wrong. Predicate pushdown means that, for example, Spark will execute filters as early in the process as possible so that it deals with the least possible amount of data in subsequent transformations, resulting in a performance improvements.
Accumulators do not change the lazy evaluation model of Spark.
Incorrect. In Spark, accumulators are only updated when the query that refers to the is actually executed. In other words, they are not updated if the query is not (yet) executed due to lazy evaluation.
Spark will fail a job only during execution, but not during definition.
Wrong. During definition, due to lazy evaluation, the job is not executed and thus certain errors, for example reading from a non-existing file, cannot be caught. To be caught, the job needs to be executed, for example through an action.

Question No : 12

The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionsId of DataFrame transactionsDf.
Find the error.
Code block:
itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

A.The join statement is incomplete.
B.The union method should be used instead of join.
C.The join method is inappropriate.
D.The merge method should be used instead of join.
E.The join expression is malformed.

정답:
Explanation:
Correct code block:
itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.transactionId)
The join statement is incomplete.
Correct! If you look at the documentation of DataFrame.join() (linked below), you see that the very first argument of join should be the DataFrame that should be joined with. This first argument is
missing in the code block.
The join method is inappropriate.
No. By default, DataFrame.join() uses an inner join. This method is appropriate for the scenario described in the question.
The join expression is malformed.
Incorrect. The join expression itemsDf.itemId==transactionsDf.transactionId is correct syntax.
The merge method should be used instead of join.
False. There is no DataFrame.merge() method in PySpark.
The union method should be used instead of join.
Wrong. DataFrame.union() merges rows, but not columns as requested in the question.
More info: pyspark.sql.DataFrame.join ― PySpark 3.1.2 documentation, pyspark.sql.DataFrame.union ― PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, 44.(Databricks import instructions)

Question No : 13

Which of the following is a problem with using accumulators?

A.Only unnamed accumulators can be inspected in the Spark U
B.Only numeric values can be used in accumulators.
C.Accumulator values can only be read by the driver, but not by executors.
D.Accumulators do not obey lazy evaluation.
E.Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.

정답:
Explanation:
Accumulator values can only be read by the driver, but not by executors.
Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good way to do that.
Only numeric values can be used in accumulators.
No. While pySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).
Accumulators do not obey lazy evaluation.
Incorrect C accumulators do obey lazy evaluation. This has implications in practice: When an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.
Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
Wrong. A concern with accumulators is in fact that under certain conditions they can run for each task more than once. For example, if a hardware failure occurs during a task after an accumulator variable has been increased but before a task has finished and Spark launches the task on a different worker in response to the failure, already executed accumulator variable increases will be repeated.
Only unnamed accumulators can be inspected in the Spark UI.
No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide - Spark 3.1.2 Documentation, pyspark.Accumulator ― PySpark 3.1.2
documentation, and pyspark.AccumulatorParam ― PySpark 3.1.2 documentation

Question No : 14

5

정답: D
Explanation:
The correct code block is:
transactionsDf.filter(col("storeId")==25).take(5)
Any of the options with collect will not work because collect does not take any arguments, and in both cases the argument 5 is given.
The option with toLocalIterator will not work because the only argument to toLocalIterator is prefetchPartitions which is a boolean, so passing 5 here does not make sense.
The option using head will not work because the expression passed to select is not proper syntax. It would work if the expression would be col("storeId")==25.
Static notebook | Dynamic notebook: See test 1,

Question No : 15

Which of the following describes tasks?

A.A task is a command sent from the driver to the executors in response to a transformation.
B.Tasks transform jobs into DAGs.
C.A task is a collection of slots.
D.A task is a collection of rows.
E.Tasks get assigned to the executors by the driver.

정답:
Explanation:
Tasks get assigned to the executors by the driver.
Correct! Or, in other words: Executors take the tasks that they were assigned to by the driver, run them over partitions, and report the their outcomes back to the driver. Tasks transform jobs into DAGs.
No, this statement disrespects the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more
tasks.
A task is a collection of rows.
Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.
A task is a command sent from the driver to the executors in response to a transformation. Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. So, the Spark driver would send tasks to executors
only in response to actions.
A task is a collection of slots.
No. Executors have one or more slots to process tasks and each slot can be assigned a task.

Databricks Databricks Certified Associate Developer for Apache Spark 3.0 시험