Which steps can the data engineer take to accomplish this with the LEAST operational effort?
(Choose two.)
Use SageMaker Data Wrangler to refit and transform the dataset by applying one-hot encoding to categorical variables.
Use SageMaker Data Wrangler diagnostic visualization. Use principal component analysis (PCA) and singular value decomposition (SVD) to calculate singular values.
Use the SageMaker Data Wrangler Quick Model visualization to quickly evaluate the dataset and to produce importance scores for each feature.
Use the SageMaker Data Wrangler Min Max Scaler transform to normalize the data.
Use SageMaker Data Wrangler diagnostic visualization. Use least absolute shrinkage and selection operator (LASSO) to plot coefficient values from a LASSO model that is trained on the dataset.
Explanations:
One-hot encoding is a method for transforming categorical variables into a numerical format, which helps with model training but does not directly address multicollinearity. This step focuses on data preparation rather than on diagnosing or resolving multicollinearity.
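For reference, this is what one-hot encoding does to a categorical column. The sketch below uses pandas outside Data Wrangler; the column names and values are purely illustrative.

```python
# Minimal sketch of one-hot encoding with pandas (hypothetical column names).
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "west", "north"],  # categorical feature
    "sales": [120, 340, 200, 95],                 # numeric feature
})

# get_dummies expands "region" into one binary column per category.
encoded = pd.get_dummies(df, columns=["region"])
print(encoded)
```

The transformation only changes how categories are represented; it does not reveal or remove correlation between features.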
Using diagnostic visualizations in SageMaker Data Wrangler with PCA and SVD allows the data engineer to analyze the dataset for multicollinearity by examining the singular values and the variance explained by each principal component: singular values close to zero indicate a near-linear dependence among features, which is the signature of multicollinearity.
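The idea can be reproduced outside Data Wrangler. The following sketch, using scikit-learn on synthetic data, shows how a nearly collinear feature produces a near-zero singular value and a component that explains almost no variance.

```python
# Minimal sketch of using PCA/SVD to spot multicollinearity (synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2 * x1 + 0.01 * rng.normal(size=200)  # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

pca = PCA().fit(X)
print("singular values:", pca.singular_values_)
print("explained variance ratio:", pca.explained_variance_ratio_)
# A singular value close to zero (and a component explaining ~0% of the
# variance) signals a linearly dependent, i.e. multicollinear, feature set.
```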
The Quick Model visualization in SageMaker Data Wrangler provides a high-level overview of the dataset and an importance score for each feature, but it does not specifically diagnose or resolve multicollinearity. While it offers insight into each feature's impact on the target, it does not analyze relationships between the features themselves.
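As a rough stand-in for what a quick-model importance score looks like (this is not the Data Wrangler implementation, just an analogous scikit-learn sketch on synthetic data), a small tree ensemble can be fit and its per-feature importances read off:

```python
# Illustrative "quick model" feature importances with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(dict(zip(["f0", "f1", "f2"], model.feature_importances_)))
# Importance scores rank features by predictive contribution; they do not
# tell you whether two features are collinear with each other.
```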
The Min Max Scaler transform normalizes the data by scaling feature values to a specific range, which helps with model convergence but does not directly assess or address multicollinearity issues in the dataset.
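A minimal scikit-learn sketch of min-max normalization (the values are illustrative) makes the point concrete: scaling changes each feature's range but leaves the correlations between features, and therefore any multicollinearity, unchanged.

```python
# Minimal sketch of min-max scaling with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

scaled = MinMaxScaler().fit_transform(X)  # each column rescaled to [0, 1]
print(scaled)
```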
The LASSO method is useful for identifying and addressing multicollinearity because it applies L1 regularization, which can shrink the coefficients of redundant, correlated features to exactly zero. By plotting the coefficient values from a LASSO model trained on the dataset, the data engineer can see which features carry little independent information and therefore contribute to multicollinearity.
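A minimal sketch of this behavior, using scikit-learn on synthetic data with hypothetical feature names, shows one of two strongly correlated features being driven to a zero coefficient:

```python
# Minimal sketch of inspecting LASSO coefficients on correlated features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + 0.05 * rng.normal(size=300)  # strongly correlated with x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])
y = 4 * x1 + 2 * x3 + rng.normal(scale=0.1, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
print(dict(zip(["x1", "x2", "x3"], lasso.coef_)))
# With L1 regularization, one of the two correlated features (x1 or x2)
# typically ends up with a zero coefficient, exposing the redundancy.
```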