Blog post about this project: https://alexandersimes.com/unsupervised/machine/learning/nlp/sagemaker/2019/09/01/got.html

Preprocess the EPUB files into a vocabulary and page documents with:

python3 epub_to_pages.py --booksInputDir books --pagesOutputDir pages --vocabOutputFile vocab.pkl
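
For readers unfamiliar with the vocabulary-plus-documents format that topic models such as LDA consume, here is a minimal sketch of the general idea. It is not the code in epub_to_pages.py: the helper names (tokenize, build_vocab, page_to_counts), the tokenization rule, and the layout of the pickled vocabulary are all assumptions made for illustration, and the real script may differ on each point.

```python
import pickle
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(pages, min_count=1):
    """Map each word seen at least min_count times to an integer index."""
    counts = Counter(tok for page in pages for tok in tokenize(page))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}


def page_to_counts(page, vocab):
    """Turn one page into a bag-of-words count vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tok in tokenize(page):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


# Hypothetical two-page corpus standing in for the extracted book pages.
pages = ["Winter is coming.", "The winter is long and the night is dark."]
vocab = build_vocab(pages)
vectors = [page_to_counts(p, vocab) for p in pages]

# Assumed serialization: the vocabulary pickled as a word-to-index dict.
vocab_bytes = pickle.dumps(vocab)
```

Each page becomes a fixed-length vector of word counts, so every page shares the same feature space regardless of its length.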

Then train the model on that data, replacing the S3 bucket, prefix, and AWS role with your own:

python3 train_lda.py \
    --pageInputDir pages \
    --vocabFile vocab.pkl \
    --s3Bucket alex9311-sagemaker \
    --s3Prefix LDA-testing \
    --awsRole aws-sagemaker-execution-role

Then use the trained model for inference with:

python3 infer_lda.py \
    --pageInputDir pages \
    --vocabFile vocab.pkl \
    --trainingJobName your-lda-training-job-name