DY0-001 Exam - CompTIA Actual Exam Questions and Answers - 85 Questions

Question No : 1

A data scientist built several models that perform about the same but vary in the number of features.
Which of the following models should the data scientist recommend for production according to Occam's razor?

A.The model with the fewest features and highest performance
B.The model with the fewest features and the lowest performance
C.The model with the most features and the lowest performance
D.The model with the most features and the highest performance

Answer: A
Explanation:
According to Occam’s razor, when models perform equivalently, you choose the simplest one - in this case, the model that achieves the needed performance with the fewest features.
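
The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the candidate models, their feature counts, their scores, and the `tolerance` threshold are all made-up assumptions, not part of the question.

```python
# Hypothetical candidates: (name, number of features, validation accuracy).
candidates = [
    ("model_a", 5, 0.91),
    ("model_b", 12, 0.91),
    ("model_c", 30, 0.92),
]


def pick_simplest(models, tolerance=0.02):
    """Return the model with the fewest features among those whose score
    is within `tolerance` of the best score (an assumed notion of
    'performs about the same')."""
    best_score = max(score for _, _, score in models)
    close_enough = [m for m in models if best_score - m[2] <= tolerance]
    return min(close_enough, key=lambda m: m[1])


chosen = pick_simplest(candidates)
print(chosen[0])  # prints "model_a"
```

All three hypothetical models fall within the tolerance, so the one with the fewest features is chosen.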

Question No : 2

Which of the following JOINS would generate the largest amount of data?

A.RIGHT JOIN
B.LEFT JOIN
C.CROSS JOIN
D.INNER JOIN

Answer: C
Explanation:
A CROSS JOIN produces the Cartesian product of the two tables (every row from the first paired with every row from the second), yielding far more rows than any of the other join types.
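
To make the row counts concrete, here is a minimal sketch that emulates the four join types in plain Python rather than SQL. The two toy tables are invented for illustration, and the comparison assumes both tables are non-empty.

```python
from itertools import product

# Hypothetical toy tables, invented for illustration.
customers = [(1, "Ana"), (2, "Ben"), (3, "Cho")]   # (customer_id, name)
orders = [(101, 1), (102, 1), (103, 3), (104, 9)]  # (order_id, customer_id)

# CROSS JOIN: every customer paired with every order, no join condition.
cross = list(product(customers, orders))

# INNER JOIN: only pairs whose customer_id values match.
inner = [(c, o) for c, o in cross if c[0] == o[1]]

# LEFT JOIN: inner rows plus customers that have no matching order.
matched_customers = {c for c, _ in inner}
left = inner + [(c, None) for c in customers if c not in matched_customers]

# RIGHT JOIN: inner rows plus orders that have no matching customer.
matched_orders = {o for _, o in inner}
right = inner + [(None, o) for o in orders if o not in matched_orders]

print(len(cross), len(inner), len(left), len(right))  # prints "12 3 4 4"
```

With 3 customers and 4 orders, the cross join yields 3 x 4 = 12 rows, while the conditional joins yield at most 4.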

Question No : 3

A computer vision model is trained to identify cats on a training set that is composed of both cat and dog images. The model predicts a picture of a cat is a dog.
Which of the following describes this error?

A.Error due to reality
B.False positive error
C.Sampling error
D.Type II error

Answer: D
Explanation:
Classifying an actual cat (positive instance) as a dog (negative prediction) is a false negative, which corresponds to a Type II error.
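
The mapping between prediction outcomes and error types can be sketched as a small confusion-matrix tally. The labels below are invented sample data, and "cat" is treated as the positive class, as in the question.

```python
# Hypothetical labels, invented for illustration; "cat" is the positive class.
actual = ["cat", "cat", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "cat", "cat"]
positive = "cat"

counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for a, p in zip(actual, predicted):
    if a == positive and p == positive:
        counts["TP"] += 1  # cat correctly identified
    elif a != positive and p == positive:
        counts["FP"] += 1  # dog called a cat: false positive (Type I error)
    elif a == positive and p != positive:
        counts["FN"] += 1  # cat called a dog: false negative (Type II error)
    else:
        counts["TN"] += 1  # dog correctly rejected

print(counts)  # prints {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```

The second sample (an actual cat predicted as a dog) is the case described in the question and lands in the `FN` bucket.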

Question No : 4

In a modeling project, people evaluate phrases and provide reactions as the target variable for the model.
Which of the following best describes what this model is doing?

A.Sentiment analysis
B.Named-entity recognition
C.TF-IDF vectorization
D.Part-of-speech tagging

Answer: A
Explanation:
The model predicts people’s reactions (e.g., positive, negative, neutral) to given phrases, which is the core of sentiment analysis.

Question No : 5

Which of the following techniques enables automation and iteration of code releases?

A.Virtualization
B.Markdown
C.Code isolation
D.CI/CD

Answer: D
Explanation:
Continuous Integration/Continuous Deployment pipelines automate the building, testing, and delivery of code, enabling rapid, repeatable, and iterative releases with minimal manual intervention.

Question No : 6

Which of the following does k represent in the k-means model?

A.Number of model tests
B.Number of data splits
C.Number of clusters
D.Distance between features

Answer: C
Explanation:
In k-means clustering, the parameter k directly defines how many clusters the algorithm will partition the data into.
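
The role of k can be seen in a minimal one-dimensional sketch of the standard assign-then-update loop (Lloyd's algorithm). The data points, the fixed iteration count, and the simple deterministic initialization are assumptions made for illustration; real libraries use more careful initialization and a convergence check.

```python
def kmeans_1d(points, k, iterations=10):
    """Partition 1-D points into k clusters. Requires 1 <= k <= len(points)."""
    # Assumed initialization: pick k evenly spaced values from the sorted data.
    ordered = sorted(points)
    step = len(ordered) // k
    centroids = [ordered[i * step + step // 2] for i in range(k)]

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data with three visible groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0]
centroids, clusters = kmeans_1d(data, k=3)
print(len(clusters))  # prints 3, one cluster per unit of k
```

Changing `k` changes only how many centroids and clusters the function returns; the algorithm itself is unchanged.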

Question No : 7

An analyst wants to show how the component pieces of a company's business units contribute to the company's overall revenue.
Which of the following should the analyst use to best demonstrate this breakdown?

A.Box-and-whisker chart
B.Sankey diagram
C.Scatter plot matrix
D.Residual chart

Answer: B
Explanation:
A Sankey diagram visualizes flows from individual business units into the total, with the width of each flow proportional to its revenue contribution, making it ideal for showing how each component feeds the overall total.

Question No : 8

A data scientist is building a forecasting model for the price of copper. The only input in this model is the daily price of copper for the last ten years.
Which of the following forecasting techniques is the most appropriate for the data scientist to use?

A.Autoregressive
B.Moving average
C.Dynamic time warping
D.Relative strength

Answer: A
Explanation:
An autoregressive model uses past values of the series itself (here, historical daily copper prices) as predictors for future values, making it the most suitable technique when only the time-series history is available.
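
As a rough illustration of the idea, the sketch below fits a first-order autoregressive model, y_t = c + phi * y_(t-1), by ordinary least squares on lagged values. The series is synthetic (generated with c = 2 and phi = 0.5), not copper prices, and the choice of a single lag is an assumption made to keep the example short.

```python
def fit_ar1(series):
    """Fit y_t = c + phi * y_(t-1) by least squares; return (c, phi)."""
    x = series[:-1]  # lagged values y_(t-1)
    y = series[1:]   # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


# Synthetic series built from y_t = 2 + 0.5 * y_(t-1), starting at 10.
series = [10.0, 7.0, 5.5, 4.75, 4.375, 4.1875]
c, phi = fit_ar1(series)

# One-step-ahead forecast uses only the series' own last value.
forecast = c + phi * series[-1]
print(round(c, 6), round(phi, 6), round(forecast, 6))
```

The only input to the fit and to the forecast is the history of the series itself, which is what distinguishes an autoregressive approach.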

Question No : 9

Which of the following distance metrics for KNN is best described as a straight line?

A.Radial
B.Euclidean
C.Cosine
D.Manhattan

Answer: B
Explanation:
Euclidean distance measures the straight-line distance between two points in space, matching the geometric “as-the-crow-flies” notion of distance.
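
A short sketch contrasting the straight-line (Euclidean) distance with the grid-based Manhattan distance; the two points are arbitrary values chosen for illustration.

```python
import math

# Arbitrary example points.
a = (0.0, 0.0)
b = (3.0, 4.0)

# Euclidean: length of the straight line between the points.
euclidean = math.dist(a, b)

# Manhattan: sum of absolute coordinate differences (axis-aligned path).
manhattan = sum(abs(p - q) for p, q in zip(a, b))

print(euclidean, manhattan)  # prints "5.0 7.0"
```

The Euclidean value is the hypotenuse of the 3-4-5 triangle, while the Manhattan value is the sum of its two legs.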

Question No : 10

A data scientist needs to analyze a company's chemical businesses and is using the master database of the conglomerate company. Nothing in the data differentiates the data observations for the different businesses.
Which of the following is the most efficient way to identify the chemical businesses' observations?

A.Ingest the data from all of the hard drives and perform exploratory data analysis to identify which business is responsible for chemical operations.
B.Perform analysis on all of the data and create a summary report on the results relevant to chemical operations.
C.Consult with the business team to identify which sites are responsible for chemical operations and ingest only the relevant data for analysis.
D.Ingest data from the hard drive containing the most data and present sample results on the chemical operations.

Answer: C
Explanation:
Engaging the business team leverages domain expertise to pinpoint which records pertain to chemical operations, allowing you to extract and analyze just the relevant subset. This avoids the time and resource waste of ingesting and sifting through unrelated data.

Question No : 11

A statistician notices gaps in data associated with age-related illnesses and wants to further aggregate these observations.
Which of the following is the best technique to achieve this goal?

A.Label encoding
B.Linearization
C.Binning
D.Imputing

Answer: C
Explanation:
Binning groups continuous age values into discrete intervals (e.g., age ranges), filling gaps by aggregating observations into broader categories. This directly addresses uneven or sparse age data by creating consistent age groups.
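
Binning can be sketched as follows. The ages and the 20-year bin width are hypothetical values chosen for illustration.

```python
from collections import Counter

# Hypothetical ages with gaps between observed values.
ages = [23, 37, 41, 45, 58, 62, 64, 79, 83]


def age_bin(age, width=20):
    """Map an age to a fixed-width interval label such as '40-59'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Aggregate observations by interval instead of by exact age.
counts = Counter(age_bin(a) for a in ages)
print(dict(counts))
# prints {'20-39': 2, '40-59': 3, '60-79': 3, '80-99': 1}
```

Sparse individual ages collapse into a handful of groups, each with enough observations to summarize.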

Question No : 12

Which of the following best describes the minimization of the residual term in a ridge linear regression?

A.|e|
B.e
C.e²
D.0

Answer: C
Explanation:
Ridge regression extends ordinary least squares by adding an L2 penalty on the coefficients, but it still minimizes the sum of squared residuals (e²) as its loss term.
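
The two parts of the ridge objective can be written out for a single-feature model as a sketch. The data, the coefficient values, and the penalty strength `alpha` are made-up assumptions; as is conventional, the intercept is left unpenalized.

```python
def ridge_objective(xs, ys, intercept, slope, alpha):
    """Sum of squared residuals plus an L2 penalty on the slope."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)  # the e² term that is minimized
    penalty = alpha * slope ** 2          # L2 penalty on the coefficient
    return sse + penalty


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(ridge_objective(xs, ys, intercept=0.0, slope=2.0, alpha=0.5))  # prints 2.0
print(ridge_objective(xs, ys, intercept=0.0, slope=1.5, alpha=0.5))  # prints 4.625
```

At the perfect fit (slope 2.0) the squared-residual term is zero and only the penalty remains; a smaller slope lowers the penalty but raises the squared residuals.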

Question No : 13

A data scientist has constructed a model that meets the minimum performance requirements specified in the proposal for a prediction project. The data scientist thinks the model's accuracy should be improved, but the proposed deadline is approaching.
Which of the following actions should the data scientist take first?

A.Continue collecting data.
B.Request additional funding.
C.Consult the key project stakeholder.
D.Test additional model specifications.

Answer: C
Explanation:
Since the model already meets the agreed-upon requirements and the deadline is near, the first step is to confirm with the stakeholder whether pursuing further accuracy gains is worth the additional time and resources. This ensures you align with business priorities before collecting more data, requesting funding, or tweaking the model further.

Question No : 14

Which of the following distribution methods or models can most effectively represent the actual arrival times of a bus that runs on an hourly schedule?

A.Binomial
B.Exponential
C.Normal
D.Poisson

Answer: C
Explanation:
Scheduled buses tend to arrive around a fixed time with random delays that cluster symmetrically around the hour. A normal distribution effectively models those continuous, bell-shaped deviations from the exact schedule.
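
A minimal sketch of this model using Python's standard library. The mean of 0 minutes (on schedule) and the standard deviation of 3 minutes are assumed values for illustration, not figures from the question.

```python
from statistics import NormalDist

# Assumed model: arrival deviation from schedule, in minutes.
# mu=0 means on time on average; sigma=3 is an invented spread.
deviation = NormalDist(mu=0.0, sigma=3.0)

# Probability the bus arrives within 3 minutes of the scheduled time.
within_3_min = deviation.cdf(3.0) - deviation.cdf(-3.0)

# Probability the bus is more than 5 minutes late.
late_over_5 = 1.0 - deviation.cdf(5.0)

print(round(within_3_min, 4), round(late_over_5, 4))
```

Under these assumptions roughly two thirds of arrivals fall within one standard deviation of the schedule, and early and late deviations of the same size are equally likely.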

Question No : 15

During EDA, a data scientist wants to look for patterns, such as linearity, in the data.
Which of the following plots should the data scientist use?

A.Violin
B.Box-and-whisker
C.Scatter
D.Q-Q

Answer: C
Explanation:
Scatter plots display pairs of numeric values on two axes, letting you visually assess relationships and patterns, such as linear trends, between variables.
