Which feature engineering technique should the Data Scientist use to meet the objectives?
Run a correlation analysis on all features and remove highly correlated features
Normalize all numerical values to be between 0 and 1
Use an autoencoder or principal component analysis (PCA) to replace original features with new features
Cluster raw data using k-means and use sample data from each cluster to build a new dataset
Explanations:
While removing highly correlated features can reduce redundancy, it does not perform true dimensionality reduction: many features may remain, which can still lead to slow training and potential overfitting.
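To make the trade-off concrete, here is a minimal sketch of correlation-based feature pruning using pandas; the data, the 0.95 threshold, and the column names are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

# Illustrative data: "b" is a near-duplicate of "a"; "c" is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Drop one feature from every pair whose |correlation| exceeds a threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)                 # ['b']
print(list(reduced.columns))   # ['a', 'c']
```

Note that only one of three columns is removed: the surviving features are the original ones, so the dimensionality drops only as far as the pairwise correlations allow.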
Normalizing numerical values puts features on a common scale but does not reduce the number of features or the dimensionality of the dataset. It addresses feature scale, not feature count or correlation.
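A quick sketch with scikit-learn's `MinMaxScaler` (the input matrix here is an illustrative assumption) shows why: the values change, but the shape of the dataset does not.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

# Each column is independently mapped onto [0, 1]
scaled = MinMaxScaler().fit_transform(X)
print(scaled.shape == X.shape)  # True: same number of features as before
```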
Using an autoencoder or PCA effectively reduces the dimensionality of the dataset while preserving most of the significant information. These methods replace the original features with a smaller set of new features that summarize the data, which can speed up training and mitigate overfitting. This makes it the correct choice.
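As a minimal PCA sketch with scikit-learn: the synthetic data below (200 samples, 50 features generated from a few latent factors) is an assumption for illustration, and `n_components=0.95` asks PCA to keep just enough components to explain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 50 correlated features built from 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + rng.normal(scale=0.1, size=(200, 50))

# Retain components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # far fewer features than 50
```

The new columns are linear combinations of the originals rather than a subset of them, which is exactly what "replace original features with new features" means in the answer option.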
Clustering the raw data with k-means and sampling from each cluster reduces the number of rows, not the number of features, and does nothing to decorrelate them. It may also discard important information and yield a dataset that is not representative enough to build a reliable model.
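A brief sketch makes the distinction visible; the data shape, cluster count, and per-cluster sample size below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # 1000 rows, 20 features

# Cluster the rows, then draw 10 rows from each of the 5 clusters
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
sample_idx = np.concatenate([
    rng.choice(np.where(km.labels_ == c)[0], size=10, replace=False)
    for c in range(5)
])
subset = X[sample_idx]
print(subset.shape)  # (50, 20): fewer rows, but still all 20 features
```

The sampled subset still has every original column, confirming that this technique downsamples observations rather than reducing dimensionality.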