Which solution will meet these requirements with the MOST operational efficiency?

A company stores its documents in Amazon S3 with no predefined product categories. A data scientist needs to build a machine learning model to categorize the documents for all the company’s products.

Which solution will meet these requirements with the MOST operational efficiency?

Build a custom clustering model. Create a Dockerfile and build a Docker image. Register the Docker image in Amazon Elastic Container Registry (Amazon ECR). Use the custom image in Amazon SageMaker to generate a trained model.

Tokenize the data and transform the data into tabular data. Train an Amazon SageMaker k-means model to generate the product categories.

Train an Amazon SageMaker Neural Topic Model (NTM) model to generate the product categories.

Train an Amazon SageMaker BlazingText model to generate the product categories.

Explanations:

A custom clustering model could work, but it adds operational overhead: the data scientist must write a Dockerfile, build the image, register it in Amazon ECR, and maintain it for use in SageMaker. That is less efficient than using a built-in algorithm designed for text.

K-means can cluster documents only after they are tokenized and converted into numerical, tabular features. It groups points by distance in that feature space rather than modeling topics, so it may miss the nuances of document content, and the extra preprocessing makes it less operationally efficient for this task.
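To make the preprocessing step concrete, here is a minimal sketch of turning raw documents into a tabular term-count matrix, the kind of numerical input a clustering model would need. The tokenizer, the sample documents, and the function names are illustrative assumptions, not part of the original question.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_count_matrix(documents):
    """Return (vocabulary, rows); each row holds per-term counts for one document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Sorted vocabulary gives every document the same, stable column order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical sample documents, for illustration only.
docs = ["Router setup guide", "Router warranty terms", "Blender warranty terms"]
vocabulary, matrix = to_count_matrix(docs)
```

Each row of `matrix` is one document and each column is one vocabulary term, which is the tabular shape the option describes.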

The Neural Topic Model (NTM) is a built-in SageMaker algorithm for unsupervised topic modeling. It learns latent topics from unlabeled text, and those topics can serve as product categories, which makes it an efficient choice when no predefined categories exist.
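As a rough illustration of how learned topics can become categories: a topic model yields, for each document, a vector of weights over the latent topics, and the highest-weighted topic can stand in as that document's category. The sketch below assumes the per-document weights are already available as plain lists. The file names, weight values, and function names are hypothetical and are not actual model output.

```python
def dominant_topic(topic_weights):
    """Return the index of the highest-weighted topic for one document."""
    return max(range(len(topic_weights)), key=lambda i: topic_weights[i])


def assign_categories(weights_by_doc):
    """Map each document name to the index of its dominant topic."""
    return {doc: dominant_topic(w) for doc, w in weights_by_doc.items()}


# Hypothetical topic weights for two documents over three latent topics.
weights_by_doc = {
    "manual.txt": [0.70, 0.20, 0.10],
    "warranty.txt": [0.05, 0.15, 0.80],
}
categories = assign_categories(weights_by_doc)
```

Taking the single dominant topic is only one possible design choice; a document with several comparably weighted topics could reasonably be assigned to more than one category.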

BlazingText supports text classification, but that mode is supervised and requires labeled training data. Because no predefined categories exist here, there are no labels to train on, so BlazingText is a poorer fit than the unsupervised NTM for generating categories directly from unlabeled documents.
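The dependence on labels shows up in the training data itself: BlazingText's supervised mode expects each training line to start with a `__label__`-prefixed label followed by the document's tokens. The following is a minimal sketch of preparing such a line. The helper name, the simple whitespace tokenization, and the sample label are assumptions made for illustration.

```python
def to_blazingtext_line(label, text):
    """Format one labeled document as a supervised training line."""
    if not label:
        # Without predefined categories there is nothing to put here.
        raise ValueError("supervised training needs a label for every document")
    tokens = text.lower().split()
    return f"__label__{label} " + " ".join(tokens)


# Hypothetical labeled example; the scenario in the question has no such labels.
line = to_blazingtext_line("networking", "Router setup guide")
```

In the scenario above no such label exists for any document, so this training file could not be built without first labeling the data.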
