What is the MOST effective way to encode this categorical feature into a numeric feature?
Spell check the column. Use Amazon SageMaker one-hot encoding on the column to transform a categorical feature to a numerical feature.
Fix the spelling in the column by using char-RNN. Use Amazon SageMaker Data Wrangler one-hot encoding to transform a categorical feature to a numerical feature.
Use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers.
Use Amazon SageMaker Data Wrangler ordinal encoding on the column to encode categories into an integer between 0 and the total number of categories in the column.
Explanations:
Spell checking the column may improve data quality, but one-hot encoding still creates one column per distinct value, which results in a high-dimensional, sparse feature space with redundancy. This method does not handle the high cardinality effectively.
Although using char-RNN for spelling correction could help, applying one-hot encoding afterward still creates one column per distinct value, which leads to a high-dimensional, sparse feature space. This does not efficiently handle the high cardinality and redundancy of medication names.
Similarity encoding in Amazon SageMaker Data Wrangler is the most effective option. It creates embeddings for categorical columns and is intended for columns with many categories or noisy values. Similar strings, such as a medication name and its misspelling, receive similar vectors, which reduces redundancy and yields a more compact and meaningful numerical representation.
Ordinal encoding assigns an integer to each category but does not capture relationships between the medications. The integers imply an order and a distance that the medication names do not have, so on high-cardinality categorical data a model may learn spurious relationships from arbitrary numerical assignments.
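The difference between one-hot and similarity encoding can be illustrated with a minimal sketch. This is not Data Wrangler's implementation. It assumes a simple character 3-gram Jaccard similarity against a small list of prototype names, and the medication names, the prototype list, and all function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f" {text.strip().lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


def one_hot_encode(values):
    """Encode each value as a 0/1 vector with one column per distinct value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical medication names; the second entry is a misspelling of the first.
medications = ["ibuprofen", "ibuprofin", "metformin"]
# Hypothetical prototype categories chosen for this illustration.
prototypes = ["ibuprofen", "metformin"]

one_hot = one_hot_encode(medications)
similarity = similarity_encode(medications, prototypes)

for name, oh_row, sim_row in zip(medications, one_hot, similarity):
    print(name, oh_row, [round(s, 2) for s in sim_row])
```

In this sketch, one-hot encoding gives the misspelled name its own column, so the vector grows with every spelling variant and the misspelling looks unrelated to the correct name. The similarity-encoded vector keeps one column per prototype, and the misspelled name scores much higher against "ibuprofen" than against "metformin".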