
Amazon MLA-C01 Exam

AWS Certified Machine Learning Engineer - Associate Online Practice

Last updated: June 06, 2025

You can work through these online practice questions to gauge how well you know the Amazon MLA-C01 exam material before deciding whether to register for the exam.

To pass the exam and cut your preparation time by about 35%, choose the MLA-C01 dumps (the latest real exam questions), which currently include 125 up-to-date questions and answers.


Question No : 1


A company uses a generative model to analyze animal images in the training dataset to record variables like different ear shapes, eye shapes, tail features, and skin patterns.
Which of the following tasks can the generative model perform?

Answer:
Explanation:
Correct option:
The model can recreate new animal images that were not in the training dataset
Generative artificial intelligence (generative AI) is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. AI technologies attempt to mimic human intelligence in nontraditional computing tasks like image recognition, natural language processing (NLP), and translation.
Generative models can analyze animal images to record variables like different ear shapes, eye shapes, tail features, and skin patterns. They learn features and their relations to understand what different animals look like in general. They can then recreate new animal images that were not in the training set.
via - https://aws.amazon.com/what-is/generative-ai/
Incorrect options:
The model can classify a single species of animals such as cats
The model can classify multiple species of animals such as cats, dogs, etc
Traditional machine learning models were discriminative, focused on classifying data points. They attempted to determine the relationship between known and unknown factors. For example, they look at images (known data such as pixel arrangement, line, color, and shape) and map them to words (the unknown factor). Only discriminative models act as single-class or multi-class classifiers, so both of these options are incorrect.
The model can identify any image from the training dataset - This option has been added as a distractor. A generative model is not an image-matching algorithm. It cannot identify an image from the training dataset.
Reference: https://aws.amazon.com/what-is/generative-ai/

Question No : 2


What is a key difference in feature engineering tasks for structured data compared to unstructured data in the context of machine learning?

Answer:
Explanation:
Correct option:
Feature engineering for structured data often involves tasks such as normalization and handling missing values, while for unstructured data, it involves tasks such as tokenization and vectorization
Feature engineering for structured data typically includes tasks like normalization, handling missing values, and encoding categorical variables. For unstructured data, such as text or images, feature engineering involves different tasks like tokenization (breaking down text into tokens), vectorization (converting text or images into numerical vectors), and extracting features that can represent the content meaningfully.
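For illustration only, here is a minimal scikit-learn sketch of the contrast; the column names ("age", "income", "review_text") are hypothetical. Structured columns get imputation and normalization, while free text gets tokenized and vectorized:

```python
# Minimal sketch (not from the exam) contrasting structured vs. unstructured feature engineering.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "age": [34, None, 52],                      # structured: numeric column with a missing value
    "income": [42000.0, 58000.0, None],
    "review_text": ["great product", "too slow", "great value, fast shipping"],
})

# Structured data: impute missing values, then normalize.
numeric = df[["age", "income"]]
numeric_clean = SimpleImputer(strategy="median").fit_transform(numeric)
numeric_scaled = StandardScaler().fit_transform(numeric_clean)

# Unstructured data (text): tokenize and vectorize into numeric features.
text_features = TfidfVectorizer().fit_transform(df["review_text"])

print(numeric_scaled.shape, text_features.shape)
```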
Incorrect options:
Feature engineering for structured data focuses on image recognition, whereas for unstructured data, it focuses on numerical data analysis - Structured data can include numerical and categorical data, while unstructured data includes text, images, audio, etc. The focus is not limited to image recognition or numerical data analysis.
Feature engineering for structured data is not necessary as the data is already in a usable format, whereas for unstructured data, extensive preprocessing is always required - Feature engineering is important for both structured and unstructured data. While structured data may require less preprocessing, tasks like normalization and handling missing values are still crucial. Unstructured data typically requires more extensive preprocessing.
Feature engineering tasks for structured data and unstructured data are identical and do not vary based on data type - Feature engineering tasks vary significantly between structured and unstructured data due to the inherent differences in data types and the requirements for preprocessing each type.
Reference: https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/feature-engineering.html

Question No : 3


You are a data scientist at an insurance company that uses a machine learning model to assess the risk of potential clients and set insurance premiums accordingly. The model was trained on data from the past few years, but recently, the company has expanded its services to new regions with different demographic characteristics. You are concerned that these changes in the data distribution might affect the model's performance and lead to biased or inaccurate predictions. To address this, you decide to use Amazon SageMaker Clarify to monitor and detect any significant shifts in data distribution that could impact the model.
Which of the following actions is the MOST EFFECTIVE for detecting changes in data distribution using SageMaker Clarify and mitigating their impact on model performance?

Answer:
Explanation:
Correct option:
Set up a continuous monitoring job with SageMaker Clarify to track changes in feature distribution over time and alert you when a significant feature attribution drift is detected, allowing you to investigate and potentially retrain the model
A drift in the distribution of live data for models in production can result in a corresponding drift in the feature attribution values, just as it could cause a drift in bias when monitoring bias metrics. Amazon SageMaker Clarify feature attribution monitoring helps data scientists and ML engineers monitor predictions for feature attribution drift on a regular basis.
Continuous monitoring with SageMaker Clarify is the most effective approach for detecting changes in data distribution. By tracking feature distributions over time, you can identify when a significant shift occurs, investigate its impact on model performance, and decide if retraining is necessary. This proactive approach helps ensure that your model remains accurate and fair as the underlying data evolves.
via - https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-monitor-feature-attribution-drift.html
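SageMaker Clarify provides this as a managed monitoring schedule. The sketch below is not the Clarify API; it only illustrates the underlying idea of distribution-drift detection (comparing a live feature's distribution against the training baseline, here with a two-sample Kolmogorov-Smirnov test on a hypothetical income feature):

```python
# Minimal sketch of the idea behind distribution-drift detection (not the Clarify API).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline_income = rng.normal(loc=50_000, scale=10_000, size=5_000)   # training-time distribution
live_income = rng.normal(loc=62_000, scale=12_000, size=1_000)       # new-region traffic (shifted)

stat, p_value = ks_2samp(baseline_income, live_income)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={stat:.3f}); investigate and consider retraining.")
```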
Incorrect options:
Use SageMaker Clarify’s bias detection capabilities to analyze the model’s output and identify any disparities between different demographic groups, retraining the model only if significant bias is detected - While SageMaker Clarify’s bias detection is useful, focusing solely on bias in the model’s output doesn’t address the broader issue of shifts in feature distribution that can impact overall model performance. Continuous monitoring is needed to detect such changes proactively.
Implement a random sampling process to manually review a subset of incoming data each month, comparing it with the original training data to check for distribution changes - Manual reviews of data can be labor-intensive, error-prone, and may not catch distribution changes in a timely manner. Automated monitoring with SageMaker Clarify is more efficient and reliable.
Use SageMaker Clarify to perform a one-time bias analysis during model training, ensuring that the model is initially fair and accurate, and manually monitor future data distribution changes - A one-time bias analysis during training helps ensure initial fairness, but it doesn’t address ongoing changes in data distribution after the model is deployed. Continuous monitoring is necessary to maintain model performance over time.
Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-monitor-feature-attribution-drift.html

Question No : 4


You are an ML Engineer at a financial services company tasked with deploying a machine learning model for real-time fraud detection in production. The model requires low-latency inference to ensure that fraudulent transactions are flagged immediately. However, you also need to conduct extensive testing and experimentation in a separate environment to fine-tune the model and validate its performance before deploying it. You must provision compute resources that are appropriate for both environments, balancing performance, cost, and the specific needs of testing and production.
Which of the following strategies should you implement to effectively provision compute resources for both the production environment and the test environment using Amazon SageMaker, considering the different requirements for each environment? (Select two)

Answer:
Explanation:
Correct options:
Use CPU-based instances in the test environment to save on costs during experimentation
For the test environment, CPU-based instances can be used to run experiments and validate the model, which helps reduce costs without compromising the ability to test different configurations and models.
Leverage AWS Inferentia accelerators in the production environment to meet high throughput and low latency requirements
AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications. The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances that deliver higher throughput and lower latency than comparable Amazon EC2 instances. They also offer up to 70% lower cost per inference than comparable Amazon EC2 instances. Therefore, you can meet performance requirements in the production environment while optimizing costs.
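For illustration, a minimal sketch with the SageMaker Python SDK showing how the two environments could provision different instance types. The `model` object, endpoint names, and instance sizes are assumptions, and deploying to inf1 generally requires a Neuron-compiled model:

```python
# Illustrative sketch: cheaper CPU instance for the test endpoint, Inferentia (inf1)
# instance for the low-latency production endpoint.
# `model` is assumed to be an already-built sagemaker.model.Model; names are hypothetical.
test_predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",        # CPU instance keeps experimentation costs down
    endpoint_name="fraud-model-test",
)

prod_predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.inf1.xlarge",     # AWS Inferentia for high-throughput, low-latency inference
    endpoint_name="fraud-model-prod",   # assumes the model artifact was compiled for Neuron
)
```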
Incorrect options:
Use GPU-based instances in both production and test environments to ensure that the model inference and testing are both performed at maximum speed, regardless of cost - While using GPU-based instances in both environments ensures high performance, it is not cost-effective. The test environment does not typically require the same level of performance as production, making GPU instances unnecessary and expensive.
Provision CPU-based instances in both production and test environments to reduce costs, as CPU instances are generally cheaper than GPU instances - Provisioning only CPU-based instances in both environments might save costs but would likely fail to meet the low-latency requirements in production. Inference times could be unacceptably slow, which is critical for real-time fraud detection.
Provision identical instances in both production and test environments to ensure consistent performance between the two, eliminating the risk of discrepancies during deployment - Although using identical instances in both environments ensures consistency, it is not cost-efficient. The test environment does not need to replicate the full performance of the production environment, so using less powerful and less expensive instances is more appropriate for testing purposes.
Reference: https://aws.amazon.com/machine-learning/inferentia/

Question No : 5


You are a DevOps engineer at a tech company that is building a scalable microservices-based application. The application is composed of several containerized services, each responsible for different parts of the application, such as user authentication, data processing, and recommendation systems. The company wants to standardize and automate the deployment and management of its infrastructure using Infrastructure as Code (IaC). You need to choose between AWS CloudFormation and AWS Cloud Development Kit (CDK) for defining the infrastructure. Additionally, you must decide on the appropriate AWS container service to manage and deploy these microservices efficiently.
Given the requirements, which combination of IaC option and container service is MOST SUITABLE for this scenario, and why?

Answer:
Explanation:
Correct option:
Use AWS CDK for infrastructure as code, allowing you to define the infrastructure in a high-level programming language, and deploy the containerized microservices using Amazon EKS (Elastic Kubernetes Service) for advanced orchestration and scalability
AWS CDK offers the flexibility of using high-level programming languages (e.g., Python, JavaScript) to define infrastructure, making it easier to manage complex infrastructure setups programmatically. via - https://docs.aws.amazon.com/cdk/v2/guide/home.html
Amazon EKS is designed for running containerized microservices with Kubernetes, providing advanced orchestration, scalability, and integration with CI/CD pipelines. This combination is ideal for a microservices-based application with complex deployment and scaling needs.
via - https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html
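As a rough illustration of defining this infrastructure in a programming language, here is a minimal AWS CDK v2 Python sketch that declares an EKS cluster. The Kubernetes version and capacity are assumptions, and recent CDK releases may additionally require a kubectl_layer property, so treat this as a sketch rather than a drop-in template:

```python
# Minimal AWS CDK (v2, Python) sketch: the microservices cluster is defined in code.
from aws_cdk import App, Stack
from aws_cdk import aws_eks as eks
from constructs import Construct

class MicroservicesStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        eks.Cluster(
            self, "MicroservicesCluster",
            version=eks.KubernetesVersion.V1_27,  # assumed version; pick what your team supports
            default_capacity=2,                   # assumed: two managed worker nodes
        )

app = App()
MicroservicesStack(app, "MicroservicesStack")
app.synth()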
Incorrect options:
Use AWS CloudFormation to define and deploy the infrastructure as code, and Amazon ECR (Elastic Container Registry) with Fargate for running the containerized microservices without needing to manage the underlying servers - AWS CloudFormation is powerful for defining infrastructure using JSON or YAML. However, Amazon ECR is an AWS managed container image registry service and it cannot be used to manage and run containers.
Use AWS CloudFormation with YAML templates for infrastructure automation and deploy the containerized microservices using Amazon Lightsail Containers to simplify management and reduce costs - AWS CloudFormation with YAML templates is suitable for traditional IaC, but Amazon Lightsail Containers is better for simple, low-cost container deployments. It may lack the scalability and orchestration features required for a complex microservices architecture.
Use AWS CDK with Amazon ECS on EC2 instances to combine the flexibility of programming languages with direct control over the underlying server infrastructure for the microservices - AWS CDK combined with Amazon ECS on EC2 gives more control over the underlying infrastructure but adds complexity in managing the servers. For a microservices-based application, this might introduce unnecessary overhead compared to using managed services like Fargate or EKS.
References:
https://docs.aws.amazon.com/cdk/v2/guide/home.html
https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html

Question No : 6


Your data science team is working on developing a machine learning model to predict customer churn. The dataset that you are using contains hundreds of features, but you suspect that not all of these features are equally important for the model's accuracy. To improve the model's performance and reduce its complexity, the team wants to focus on selecting only the most relevant features that contribute significantly to minimizing the model's error rate.
Which feature engineering process should your team apply to select a subset of features that are the most relevant towards minimizing the error rate of the trained model?

Answer:
Explanation:
Correct option:
Feature selection
Feature selection is the process of selecting a subset of extracted features. This is the subset that is relevant and contributes to minimizing the error rate of a trained model. Feature importance score and correlation matrix can be factors in selecting the most relevant features for model training.
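For illustration, a minimal scikit-learn sketch of feature selection on synthetic data: a relevance score (here mutual information) is used to keep only the top features:

```python
# Minimal sketch of feature selection: keep only the features that score highest
# on a relevance metric. The dataset is synthetic/hypothetical.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=1_000, n_features=100, n_informative=10, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)   # (1000, 100) -> (1000, 10)
```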
Incorrect options:
Feature creation - Feature creation refers to the creation of new features from existing data to help with better predictions. Examples of feature creation include: one-hot-encoding, binning, splitting, and calculated features.
Feature transformation - Feature transformation and imputation include steps for replacing missing features or features that are not valid. Some techniques include: forming Cartesian products of features, non-linear transformations (such as binning numeric variables into categories), and creating domain-specific features.
Feature extraction - Feature extraction involves reducing the amount of data to be processed using dimensionality reduction techniques. These techniques include: Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA).
Reference: https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/feature-engineering.html

Question No : 7


You are a machine learning engineer at a fintech company that has developed several models for various use cases, including fraud detection, credit scoring, and personalized marketing. Each model has different performance and deployment requirements. The fraud detection model requires real-time predictions with low latency and needs to scale quickly based on incoming transaction volumes. The credit scoring model is computationally intensive but can tolerate batch processing with slightly higher latency. The personalized marketing model needs to be triggered by events and doesn’t require constant availability.
Given these varying requirements, which deployment target is the MOST SUITABLE for each model?

Answer:
Explanation:
Correct option:
Deploy the fraud detection model using SageMaker endpoints for low-latency, real-time predictions, deploy the credit scoring model on Amazon ECS for batch processing, and deploy the personalized marketing model using AWS Lambda for event-driven execution
Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling. SageMaker endpoints are optimized for low-latency, real-time predictions, making them ideal for the fraud detection model.
Amazon ECS provides a service scheduler for long-running tasks and applications. It also provides the ability to run standalone tasks or scheduled tasks for batch jobs or single run tasks. You can specify the task placement strategies and constraints for running tasks that best meet your needs. Amazon ECS is well-suited for batch processing tasks, making it a good choice for the credit scoring model.
AWS Lambda is ideal for the event-driven nature of the personalized marketing model, allowing it to scale on-demand with minimal cost.
via - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html
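For illustration, a minimal sketch of how a client (or a Lambda function) could call the low-latency fraud-detection model hosted on a SageMaker real-time endpoint. The endpoint name and payload format are hypothetical:

```python
# Illustrative sketch: invoking a SageMaker real-time endpoint for fraud scoring.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def score_transaction(transaction: dict) -> dict:
    response = runtime.invoke_endpoint(
        EndpointName="fraud-detection-endpoint",   # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps(transaction),
    )
    return json.loads(response["Body"].read())

print(score_transaction({"amount": 1250.0, "merchant_id": "m-123", "country": "US"}))
```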
Incorrect options:
Deploy the fraud detection model using AWS Lambda for serverless, on-demand execution, deploy the credit scoring model on Amazon EKS for scalable batch processing, and deploy the personalized marketing model on SageMaker endpoints to handle event-driven inference - AWS Lambda is serverless and ideal for event-driven tasks, but it may not provide the low-latency, real-time performance required for fraud detection. SageMaker endpoints are better suited for this use case. The credit scoring model is better suited for ECS, where batch processing can be efficiently managed, while personalized marketing is a good fit for AWS Lambda.
Deploy all three models on a single Amazon EKS cluster to take advantage of Kubernetes orchestration, ensuring consistent management and scaling across different use cases - Deploying all models on a single Amazon EKS cluster could be overkill and lead to unnecessary complexity. While Kubernetes provides powerful orchestration, it might be excessive for simple, event-driven or batch workloads.
Deploy the fraud detection model on Amazon ECS for auto-scaling based on demand, deploy the credit scoring model using SageMaker endpoints for real-time scoring, and deploy the personalized marketing model on Amazon EKS for event-driven processing - While Amazon ECS can handle auto-scaling, it is not as optimized for real-time, low-latency predictions as SageMaker endpoints. Additionally, using SageMaker endpoints for the credit scoring model does not align well with batch processing needs. The personalized marketing model is better suited to AWS Lambda rather than Amazon EKS, which is more complex and designed for containerized applications with continuous workloads.
References:
https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html

Question No : 8


A company has recently migrated to AWS Cloud and it wants to optimize the hardware used for its AI workflows.
Which of the following would you suggest?

Answer:
Explanation:
Correct option:
Leverage AWS Trainium for high-performance, cost-effective Deep Learning training. Leverage AWS Inferentia for the deep learning (DL) and generative AI inference applications
AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications. The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.
AWS Trainium is the machine learning (ML) chip that AWS purpose-built for deep learning (DL) training of 100B+ parameter models. Each Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instance deploys up to 16 Trainium accelerators to deliver a high-performance, low-cost solution for DL training in the cloud.
Incorrect options:
Leverage either AWS Trainium or AWS Inferentia for the deep learning (DL) and generative AI inference applications
Leverage either AWS Trainium or AWS Inferentia for high-performance, cost-effective Deep Learning training
Leverage AWS Inferentia for high-performance, cost-effective Deep Learning training. Leverage AWS Trainium for the deep learning (DL) and generative AI inference applications
These three options contradict the explanation provided above, so these options are incorrect.
References:
https://aws.amazon.com/machine-learning/inferentia/
https://aws.amazon.com/machine-learning/trainium/

Question No : 9


How would you differentiate between K-Means and K-Nearest Neighbors (KNN) algorithms in machine learning?

Answer:
Explanation:
Correct option:
K-Means is an unsupervised learning algorithm used for clustering data points into groups, while KNN is a supervised learning algorithm used for classifying data points based on their proximity to labeled examples
K-Means is an unsupervised learning algorithm used to partition a dataset into distinct clusters by minimizing the variance within each cluster. KNN, on the other hand, is a supervised learning algorithm that classifies new data points based on the majority class among its k-nearest neighbors in the training data.
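For illustration, a minimal scikit-learn sketch of the contrast on synthetic data: K-Means never sees the labels, while KNN is trained on labeled examples:

```python
# Minimal sketch: KMeans clusters unlabeled data (unsupervised), while
# KNeighborsClassifier needs labels to classify new points (supervised).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: no labels are used; it discovers 3 clusters on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", kmeans.labels_[:5])

# KNN: trained on labeled examples, then classifies a new point by its neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("predicted class:", knn.predict([[0.0, 0.0]]))
```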
Incorrect options:
K-Means is a supervised learning algorithm used for classification, while KNN is an unsupervised learning algorithm used for clustering - K-Means is an unsupervised learning algorithm, and KNN is a supervised learning algorithm.
K-Means requires labeled data to form clusters, whereas KNN does not use labeled data for making predictions - K-Means does not require labeled data; it is used for clustering. KNN, however, requires labeled data for classification.
K-Means is primarily used for regression tasks, while KNN is used for reducing the dimensionality of data - K-Means is not used for regression tasks, and KNN is not primarily used for dimensionality reduction. KNN is used for classification and regression tasks based on proximity to neighbors.
References:
https://aws.amazon.com/blogs/machine-learning/k-means-clustering-with-amazon-sagemaker/
https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html

Question No : 10


You are a data scientist at a financial services company tasked with deploying a lightweight machine learning model that predicts creditworthiness based on a customer’s transaction history. The model needs to provide real-time predictions with minimal latency, and the traffic pattern is unpredictable, with occasional spikes during business hours. The company is cost-conscious and prefers a serverless architecture to minimize infrastructure management overhead.
Which approach is the MOST SUITABLE for deploying this solution, and why?

Answer:
Explanation:
Correct option:
Deploy the model directly within AWS Lambda as a function, and expose it through an API Gateway endpoint, allowing the function to scale automatically with traffic and provide real-time predictions
Deploying the model within AWS Lambda as a function and exposing it through an API Gateway endpoint is ideal for lightweight, serverless, real-time inference. Lambda’s automatic scaling and pay-per-use model align well with unpredictable traffic patterns and the need for cost efficiency.
via - https://aws.amazon.com/blogs/compute/deploying-machine-learning-models-with-serverless-templates/
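For illustration, a minimal sketch of such a Lambda function behind an API Gateway proxy integration. The model file name, feature names, and model type are hypothetical; the only assumption is a lightweight model object exposing predict_proba:

```python
# Minimal sketch of a Lambda handler serving a lightweight model loaded at cold start.
import json
import joblib

model = joblib.load("creditworthiness_model.joblib")  # hypothetical model file, loaded once per container

def lambda_handler(event, context):
    body = json.loads(event["body"])                   # API Gateway proxy integration passes the body as a string
    features = [[body["income"], body["avg_monthly_spend"], body["num_late_payments"]]]
    score = float(model.predict_proba(features)[0][1])
    return {
        "statusCode": 200,
        "body": json.dumps({"creditworthiness_score": score}),
    }
```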
Incorrect options:
Deploy the model as a SageMaker endpoint for real-time inference, and configure AWS Lambda to preprocess incoming requests before sending them to the SageMaker endpoint for prediction - While deploying the model as a SageMaker endpoint is suitable for more complex models requiring managed infrastructure, this approach might be overkill for a lightweight model, especially if you want to minimize costs and management overhead. Lambda functions can serve the model directly in a more cost-effective manner.
Use an Amazon EC2 instance to host the model, with AWS Lambda functions handling the communication between the API Gateway and the EC2 instance for prediction requests - Hosting the model on an Amazon EC2 instance with Lambda managing communication adds unnecessary complexity and overhead. EC2-based deployments require more management and may not be as cost-effective for serverless and real-time use cases.
Deploy the model using Amazon ECS (Elastic Container Service) and configure an AWS Lambda to trigger the ECS service on-demand, ensuring that the model is only running during peak traffic periods - Using Amazon ECS triggered by AWS Lambda adds complexity and may not provide the same level of real-time responsiveness as directly deploying the model in Lambda.
Reference: https://aws.amazon.com/blogs/compute/deploying-machine-learning-models-with-serverless-templates/

Question No : 11


You are a data scientist working on a deep learning model to classify medical images for disease detection. The model initially shows high accuracy on the training data but performs poorly on the validation set, indicating signs of overfitting. The dataset is limited in size, and the model is complex, with many parameters. To improve generalization and reduce overfitting, you need to implement appropriate techniques while balancing model complexity and performance.
Given these challenges, which combination of techniques is the MOST LIKELY to help prevent overfitting and improve the model’s performance on unseen data?

Answer:
Explanation:
Correct option:
Combine data augmentation to increase the diversity of the training data with early stopping to prevent overfitting, and use ensembling to average predictions from multiple models
via - https://aws.amazon.com/what-is/overfitting/
This option combines data augmentation to artificially expand the training dataset, which is crucial when data is limited, with early stopping to prevent the model from overtraining. Additionally, ensembling helps improve generalization by averaging predictions from multiple models, reducing the likelihood that overfitting in any single model will dominate the final prediction. This combination addresses both data limitations and model overfitting effectively.
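For illustration, a minimal Keras sketch of two of these techniques, image data augmentation plus early stopping on the validation loss. The architecture is a placeholder, and train_ds / val_ds in the commented fit call are hypothetical datasets:

```python
# Minimal sketch: data augmentation layers plus an EarlyStopping callback.
import tensorflow as tf
from tensorflow.keras import layers, models

augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

model = models.Sequential([
    augmentation,                                   # augmentation is applied only during training
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```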
Incorrect options:
Apply early stopping to halt training when the validation loss stops improving, and use dropout as a regularization technique to prevent the model from becoming too reliant on specific neurons - Dropout is a form of regularization used in neural networks that reduces overfitting by trimming codependent neurons. Early stopping and dropout are effective techniques for preventing overfitting, particularly in deep learning models. However, while they can help, they may not be sufficient alone, especially when dealing with limited data. Combining these techniques with others, such as data augmentation or ensembling, would provide a more robust solution.
Use ensembling by combining multiple versions of the same model trained with different random seeds, and apply data augmentation to artificially increase the size of the dataset - Ensembling and data augmentation are powerful techniques, but ensembling by combining multiple versions of the same model trained with different random seeds might not provide significant diversity in predictions. A combination of diverse models or more comprehensive techniques might be more effective.
Prune the model by removing less important layers and nodes, and use L2 regularization to reduce the magnitude of the model’s weights, preventing overfitting - Regularization helps prevent linear models from overfitting training data examples by penalizing extreme weight values. L1 regularization reduces the number of features used in the model by pushing the weight of features that would otherwise have very small weights to zero. L1 regularization produces sparse models and reduces the amount of noise in the model. L2 regularization results in smaller overall weight values, which stabilizes the weights when there is high correlation between the features. Pruning and L2 regularization are useful for reducing model complexity and preventing overfitting. However, pruning can sometimes lead to underfitting if not done carefully, and using these techniques alone might not fully address the overfitting issue, especially with limited data.
References:
https://aws.amazon.com/what-is/overfitting/
https://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html
https://docs.aws.amazon.com/machine-learning/latest/dg/training-parameters.html

Question No : 12


You are working as a data scientist at a financial services company tasked with developing a credit risk prediction model. After experimenting with several models, including logistic regression, decision trees, and support vector machines, you find that none of the models individually achieves the desired level of accuracy and robustness. Your goal is to improve overall model performance by combining these models in a way that leverages their strengths while minimizing their weaknesses.
Given the scenario, which of the following approaches is the MOST LIKELY to improve the model’s performance?

Answer:
Explanation:
Correct option:
Apply stacking, where the predictions from logistic regression, decision trees, and support vector machines are used as inputs to a meta-model, such as a random forest, to make the final prediction
via - https://aws.amazon.com/blogs/machine-learning/efficiently-train-tune-and-deploy-custom-ensembles-using-amazon-sagemaker/
In bagging, data scientists improve the accuracy of weak learners by training several of them at once on multiple datasets. In contrast, boosting trains weak learners one after another.
Stacking involves training a meta-model on the predictions of several base models. This approach can significantly improve performance because the meta-model learns to leverage the strengths of each base model while compensating for their weaknesses.
For the given use case, leveraging a meta-model like a random forest can help capture the relationships between the predictions of logistic regression, decision trees, and support vector machines.
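For illustration, a minimal scikit-learn sketch of this kind of stack on synthetic data: the three base models feed their out-of-fold predictions into a random forest meta-model:

```python
# Minimal sketch of stacking: three heterogeneous base models + a random forest meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC(probability=True)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=200),
    cv=5,                      # out-of-fold predictions train the meta-model
)
print(cross_val_score(stack, X, y, cv=3, scoring="roc_auc").mean())
```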
Incorrect options:
Use a simple voting ensemble, where the final prediction is based on the majority vote from the logistic regression, decision tree, and support vector machine models - A voting ensemble is a straightforward way to combine models, and it can improve performance. However, it typically does not capture the complex interactions between models as effectively as stacking.
Implement boosting by training sequentially different types of models - logistic regression, decision trees, and support vector machines - where each new model corrects the errors of the previous ones - Boosting is a powerful technique for improving model performance by training models sequentially, where each model focuses on correcting the errors of the previous one. However, it typically involves the same base model, such as decision trees (e.g., XGBoost), rather than combining different types of models.
Use bagging, where different types of models - logistic regression, decision trees, and support vector machines - are trained on different subsets of the data, and their predictions are averaged to produce the final result - Bagging, like boosting, is effective for reducing variance and improving the stability of models, particularly for high-variance models like decision trees. However, it usually involves training multiple instances of the same model type (e.g., decision trees in random forests) rather than combining different types of models.
References:
https://aws.amazon.com/blogs/machine-learning/efficiently-train-tune-and-deploy-custom-ensembles-using-amazon-sagemaker/
https://aws.amazon.com/what-is/boosting/

Question No : 13


You are a data scientist working on a binary classification model to predict whether customers will default on their loans. The dataset is highly imbalanced, with only 10% of the customers having defaulted in the past. After training the model, you need to evaluate its performance to ensure it effectively distinguishes between defaulters and non-defaulters. Given the class imbalance, accuracy alone is not sufficient to assess the model’s performance. Instead, you decide to use the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) to evaluate the model.
Which of the following interpretations of the ROC and AUC metrics is MOST ACCURATE for assessing the model’s performance?

Answer:
Explanation:
Correct option:
An AUC close to 1.0 indicates that the model has excellent discriminatory power, effectively distinguishing between defaulters and non-defaulters
Area Under the (Receiver Operating Characteristic) Curve (AUC) represents an industry-standard accuracy metric for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold.
The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality ('0's are predicted as '1's and vice versa). The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting.
via - https://aws.amazon.com/blogs/machine-learning/is-your-model-good-a-deep-dive-into-amazon-sagemaker-canvas-advanced-metrics/
An AUC close to 1.0 signifies that the model has excellent discriminatory power, meaning it can effectively distinguish between the positive class (defaulters) and the negative class (non-defaulters) across all thresholds. This is desirable in a classification task, especially in scenarios with class imbalance.
via - https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html
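For illustration, a minimal scikit-learn sketch computing the ROC curve and AUC for an imbalanced binary classifier; the data is synthetic with roughly 10% positives to mirror the scenario:

```python
# Minimal sketch: ROC curve and AUC on an imbalanced synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)          # TPR vs. FPR at every threshold
print("AUC:", roc_auc_score(y_test, scores))               # close to 1.0 => strong discriminatory power
```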
Incorrect options:
A ROC curve that is close to the diagonal line (AUC ~ 0.5) indicates that the model performs well across all thresholds - A ROC curve close to the diagonal line (AUC ~ 0.5) indicates that the model has no discriminatory power and is performing no better than random guessing. This suggests poor model performance, not that the model performs well across all thresholds.
A ROC curve that is closer to the top-left corner of the plot (AUC ~ 1) shows that the model is overfitting, and its predictions are too optimistic - A ROC curve closer to the top-left corner of the plot (AUC closer to 1.0) indicates strong model performance, not overfitting. Overfitting is typically identified by other indicators, such as a large gap between training and validation performance, not by the shape of the ROC curve alone.
An AUC close to 0 indicates that the model is highly accurate, correctly classifying almost all instances of defaulters and non-defaulters - An AUC close to 0 is problematic, as it indicates that the model is consistently making incorrect predictions (i.e., it classifies negatives as positives and vice versa). A high AUC (close to 1) is what signifies strong model performance.
References:
https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html https://aws.amazon.com/blogs/machine-learning/creating-high-quality-machine-learning-models-for-financial-services-using-amazon-sagemaker-autopilot/ https://aws.amazon.com/blogs/machine-learning/is-your-model-good-a-deep-dive-into-amazon-sagemaker-canvas-advanced-metrics/

Question No : 14


You are a data scientist at a financial institution tasked with building a model to detect fraudulent transactions. The dataset is highly imbalanced, with only a small percentage of transactions being fraudulent. After experimenting with several models, you decide to implement a boosting technique to improve the model’s accuracy, particularly on the minority class. You are considering different types of boosting, including Adaptive Boosting (AdaBoost), Gradient Boosting, and Extreme Gradient Boosting (XGBoost).
Given the problem context and the need to effectively handle class imbalance, which boosting technique is MOST SUITABLE for this scenario?

Answer:
Explanation:
Correct option:
Apply Extreme Gradient Boosting (XGBoost) for its ability to handle imbalanced datasets effectively through regularization, weighted classes, and optimized computational efficiency
The XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:
Its robust handling of a variety of data types, relationships, and distributions.
The variety of hyperparameters that you can fine-tune.
XGBoost is an extension of Gradient Boosting that includes additional features such as regularization, handling of missing values, and support for weighted classes, making it particularly well-suited for imbalanced datasets like fraud detection. It also offers significant computational efficiency, which is beneficial when working with large datasets.
via - https://aws.amazon.com/what-is/boosting/
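For illustration, a minimal XGBoost sketch on a synthetic imbalanced dataset, using regularization and scale_pos_weight to up-weight the rare fraud class; the hyperparameter values are assumptions:

```python
# Minimal sketch: XGBoost on an imbalanced fraud-style dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

ratio = (y_train == 0).sum() / (y_train == 1).sum()   # negatives / positives
model = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.1,
    scale_pos_weight=ratio,     # weight the minority (fraud) class
    reg_lambda=1.0,             # L2 regularization
    eval_metric="aucpr",        # precision-recall AUC suits imbalanced data
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```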
Incorrect options:
Use Adaptive Boosting (AdaBoost) to focus on correcting the errors of weak classifiers, giving more weight to incorrectly classified instances during each iteration - AdaBoost works by focusing on correcting the errors of weak classifiers, assigning more weight to misclassified instances in each iteration. However, it may struggle with noisy data and extreme class imbalance, as it can overemphasize hard-to-classify instances.
Implement Gradient Boosting to sequentially train weak learners, using the gradient of the loss function to improve performance on the minority class - Gradient Boosting is a powerful technique that uses the gradient of the loss function to improve the model iteratively. While it can be adapted to handle class imbalance, it does not inherently provide the same level of flexibility and computational optimization as XGBoost for this specific problem.
Use Gradient Boosting and manually adjust the learning rate and class weights to improve performance on the minority class, avoiding the complexities of XGBoost - While manually adjusting the learning rate and class weights in Gradient Boosting can help, XGBoost already provides built-in mechanisms to handle these challenges more effectively, including advanced regularization techniques and hyperparameter optimization.
References:
https://aws.amazon.com/what-is/boosting/
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
https://aws.amazon.com/blogs/gametech/fraud-detection-for-games-using-machine-learning/
https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Build_a_fraud_detection_system_with_Amazon_SageMaker_AIM359-R1.pdf

Question No : 15


A company specializes in providing personalized product recommendations for e-commerce platforms. You’ve been tasked with developing a solution that can quickly generate high-quality product descriptions, tailor marketing copy based on customer preferences, and analyze customer reviews to identify trends in sentiment. Given the scale of data and the need for flexibility in choosing foundational models, you decide to use an AI service that can integrate seamlessly with your existing AWS infrastructure while also offering managed foundational models from third-party providers.
Which AWS service would best meet your requirements?

Answer:
Explanation:
Correct option:
Amazon Bedrock
Amazon Bedrock is the correct choice for the given use case. It is designed to help businesses build and scale generative AI applications quickly and efficiently. Bedrock offers access to a range of pre-trained foundational models from Amazon and third-party providers like AI21 Labs, Anthropic, and Stability AI. This makes it ideal for tasks such as generating product descriptions, creating marketing copy, and performing sentiment analysis on customer reviews. Bedrock allows users to easily integrate these AI capabilities into their applications without managing the underlying infrastructure, making it a perfect fit for your business needs.
via - https://aws.amazon.com/bedrock/faqs/
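For illustration, a minimal boto3 sketch calling a Bedrock-hosted foundation model to draft a product description. The model ID and request-body schema are assumptions; each provider's model on Bedrock defines its own payload format, so verify against the Bedrock documentation for the model you choose:

```python
# Illustrative sketch: invoking a foundation model through the Bedrock runtime.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

prompt = "Write a two-sentence product description for a waterproof hiking backpack."
response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # assumed model ID
    contentType="application/json",
    accept="application/json",
    body=json.dumps({                                   # assumed Anthropic-on-Bedrock payload schema
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
print(json.loads(response["body"].read()))
```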
Incorrect options:
Amazon Rekognition - Amazon Rekognition is primarily used for image and video analysis, such as detecting objects, text, and activities. It is not designed for generating text or analyzing sentiment based on large datasets, so it would not meet the requirements in this scenario.
Amazon SageMaker - While Amazon SageMaker is a powerful service for building, training, and deploying machine learning models, it requires more manual setup and expertise compared to Amazon Bedrock. SageMaker would be a more appropriate choice if you needed custom models rather than leveraging pre-trained foundational models with generative AI capabilities.
Amazon Personalize - Amazon Personalize is a fully managed machine learning service that uses your data to generate item recommendations for your users. It can also generate user segments based on the users' affinity for certain items or item metadata. It also lacks the flexibility provided by Bedrock in choosing from various foundational models.
via - https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html
References:
https://aws.amazon.com/bedrock/
https://aws.amazon.com/bedrock/faqs/
https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html
