Getting Smarter with AI/ML on AWS

  • Model:

    A mathematical representation of the learning process: a trained algorithm that identifies patterns in data (without explicit rules), is produced by the ML training process, and is applied to new data -> prediction

  • Dataset:

    The data collected and used to train, evaluate, and test the models. It is gathered from many sources, then transformed and pre-processed for use in machine learning training. Types of data include a corpus (such as newspaper articles)

  • Data Labeling:

    The target attribute of the data; it identifies the value of the dataset relevant to the machine learning analysis. The label is either provided (supervised learning) or derived (unsupervised learning); a trained model predicts labels on new datasets. Labeled data is also called ground truth data

  • Datapoint:

    The coordinate pairing of a feature (x-axis) and a label (y-axis), used to plot a point on the grid as (x, f(x))

  • Algorithm:

    Defines the structure of an ML model. A set of rules or processes an AI system uses to conduct tasks—most often to discover new data insights and patterns, or to predict output values from a given set of input variables. Algorithms enable machine learning (ML) to learn, and use parameters to tune the results.

  • Framework:

    Interfaces that allow data scientists and developers to build and deploy machine learning models faster and more easily

  • Features (embedding):

    The parts of the dataset that are important in determining the accuracy of the outcome; stored in feature columns, with feature variation shown on the x-axis. Examples: product color, month of the year, stock market value. Features are either [ Categorical := quality | Continuous := quantity ], are also referred to as embeddings, and the number of features in an observation is its dimension

  • Observations:

    The rows of the dataset; each consists of the values of every feature

  • Prediction (Target value, or label, Estimate):

    The estimated value calculated by running the model against real data, stored in the prediction column (y-axis), e.g. sales outcome, member enrollments.

  • Context:

    Weighted features that determine the accuracy:

    ŷ = a₁x₁ + a₂x₂ + … + aₙxₙ

    Where:
    x: feature
    a: weight – importance

ML Algorithms

ML Pipeline- ML Algorithms
ML Algorithm Categories:
    1. Supervised:

      Known inputs/outputs are used to generalize to future outputs; the teacher shows the answer (labeled data, learning by example). Types: [ Classification [ binary mapping (T : F) | multi-class mapping (a, b, c, …, n) ] | Regression: continuous values, no discrete mapping ]
    2. Unsupervised:

      Unknown inputs/outputs; finds patterns and automatically creates labels, clustering data by labeling patterns in the data. A clustering algorithm groups data into different clusters based on similarities in features; anomaly detection identifies observations that deviate from those patterns.
    3. Reinforcement:

      Interacts with the environment and learns to take actions that maximize rewards, continuously improving from feedback on previous attempts (trial/error): a reward or penalty for each action taken by the agent (e.g. a driver)
    4. Deep Learning:

      Based on artificial neurons in an Artificial Neural Network (ANN). AI -> rules provided by a programmer; ML -> analyzes patterns; DL -> human-like: given basic rules, then an iterative, complex approach over many (hundreds of) layers that recognizes patterns more complex than classic ML. Each layer summarizes and feeds information to the next layer, ultimately producing the final output. Deep learning computer vision is benchmarked by ILSVRC (ImageNet Large Scale Visual Recognition Challenge)

ML Model

Program file python.py

# this is comment line
import boto3
#Variables
var = "String variable"
var2 = 'String "Variable"'
age=32
price=5.99
my_list = ["item1", "item2", "item3"]
dictionary = {"key1": "value1", "key2": "value2"}
# A function that prints hello world
def hello_world():
    print('hello world')
    return "Welcome !"

# This line calls (runs) the function
# Indentation defines the code level: 0 = main (module) level, and so on
greeting = hello_world()
print(var, var2, age, var2 + str(age), my_list[1], dictionary["key1"], greeting)

About DLAMI and Jupyter Notebook Setup

  • Identify DLAMI  & Launch Instance

    Deep Learning Amazon Machine Image,
    $ aws ec2 describe-images --region us-west-2 --owners amazon \
    --filters 'Name=name,Values=Deep Learning AMI (Ubuntu 18.04) Version ??.?' 'Name=state,Values=available' \
    --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
    ami-0a2b85e15b7c0ac34 <-- Ubuntu image
  • Configure and Install environments:

    Conda is an environment management tool that allows creating and using isolated environments for ML work
    $ sudo apt-get update -y      <-- (use yum on Amazon Linux based DLAMIs)
    $ conda env list              <-- lists all available environments
    $ source activate <env>       ex. tensorflow2_p310
    $ conda deactivate
    $ nvcc --version              <-- verify CUDA; install it if it doesn't exist
    
    $ jupyter notebook password <-- setup password for Jupyter Notebook  /home/ubuntu/.jupyter/jupyter_notebook_config.json
    $ mkdir ~/ssl && cd ~/ssl
    $ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout jupyter.key -out jupyter-cert.pem
    $ jupyter notebook --certfile=~/ssl/jupyter-cert.pem --keyfile ~/ssl/jupyter.key <-- Ctrl-C to shutdown, listen on localhost:8888
  • Connect to Jupyter Notebook Server Putty or SSH

AWS ML Tools

  • Jupyter Notebook: interactive notebook (document) repository and code runner
  • SageMaker: fully managed service covering the end-to-end ML pipeline
  • Bedrock: managed access to foundation models (FMs) for generative AI applications

ML Frameworks & Libraries

  • Pandas: Transforms data into rows and columns in a tabular format called “DataFrames”, which consist of Series (columns that can hold any data type) and indexes. Used to drop missing values
  • NumPy: Formatting multi-dimensional arrays
  • Scikit-Learn: Data mining and analysis; used to impute missing values [ mean | median ]
  • Matplotlib: Data Visualization
  • Seaborn: Data Visualization integrated with Pandas

Setup the environment

  • Download and install Python (Windows option)
  • Install AWS CLI v2
  • Install VSCode and python extensions AWS Toolkit
  • Virtual Environment
    $ python -m venv my_venv
    $ source my_venv/bin/activate          (Linux/macOS)
    C:\> my_venv\Scripts\activate.bat      (Windows cmd)
    PS> .\my_venv\Scripts\Activate.ps1     (PowerShell)
    deactivate <-- exit my_venv
    
  • VSCode terminal -> install python libraries
    $ sudo apt install python3-pip
    $ pip3 install [boto3|numpy|pandas]
    $ pip freeze > requirements.txt
    $ pip install -r requirements.txt
  • Configure AWS credentials
    $ aws configure

High level vs Low level API calls

import boto3
# low level client map 1:1 to API calls
# dynamodb = boto3.client("dynamodb")

# high level simplified 
dynamodb = boto3.resource('dynamodb')
labCustomers = dynamodb.Table('LabCustomers')
    
def lambda_handler(event, context):
    print(event)
    id = event["queryStringParameters"]["id"]
    firstname = event["queryStringParameters"]["firstname"]
    lastname = event["queryStringParameters"]["lastname"]
    ####
    # Challenge: put the customer id, firstname, last name into the dynamo db table
    ####
    # Your code goes here
    # low level client calls
    # response= dynamodb.put_item(
    #     TableName="LabCustomers",
    #     Item={ 
    #         "ID": {"S":id},
    #         "Firstname": {"S": firstname},
    #         "Lastname": {"S": lastname}
    #     })
    
    # high level resource call
    response = labCustomers.put_item(
        Item={
            "ID": id,
            "Firstname": firstname,
            "Lastname": lastname
        })
    return {"result": "Saved"}

Train the model commands

# Install Libraries (preinstalled on Sagemaker studio)
%pip install sagemaker boto3 pandas  [ s3fs | "s3fs<=0.4" ] <-- if older pandas
%reset -f <-- restart the kernel
# Import dependencies
import time

import boto3
import pandas as pd
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs
# Load data
df = pd.read_csv("s3://<bucket>/<prefix>/<object>.csv")
# Set up the TrainingInput objects

train_input = TrainingInput(train_path, content_type='text/csv')
validation_input = TrainingInput(validation_path, content_type='text/csv')
# Print data
df
# Configure the estimator
xgb_model = sagemaker.estimator.Estimator(
     image_uri = container,
     role = role,
     instance_count = 1,
     instance_type ='ml.m5.xlarge',
     output_path = output_path,
     sagemaker_session = sagemaker_session,
    rules=[
        Rule.sagemaker(rule_configs.create_xgboost_report())
    ])
# Configure Hyperparameters
xgb_model.set_hyperparameters(
    max_depth = 5,
    eta = 0.2,
    gamma = 4,
    min_child_weight = 6,
    subsample = 0.7,
    verbosity = 0,
    objective = 'binary:logistic',
    num_round = 800)
# Run the training job
xgb_model.fit(
    {
        "train": train_input,
        "validation": validation_input
    },
    wait=True) <-- Takes time
# Evaluate the model
%%capture
time.sleep(500) <-- in seconds; the report takes about 3-8 minutes to generate
rule_output_path = xgb_model.output_path + "/" + xgb_model.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive <-- Takes time wait until report is generated
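
The snippet above references several names (container, role, output_path, train_path, validation_path) that are assumed to be defined earlier in the notebook. A minimal sketch of how they might be set is shown below; the bucket and prefix are placeholders, not values from the original notes.

# Assumed setup for the names referenced above; <bucket> and the prefix are placeholders
import sagemaker
from sagemaker import image_uris

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()          # IAM role of the notebook/Studio user

# Built-in XGBoost container image for this region
container = image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")

bucket = "<bucket>"
prefix = "xgboost-demo"
train_path = f"s3://{bucket}/{prefix}/train/train.csv"
validation_path = f"s3://{bucket}/{prefix}/validation/validation.csv"
output_path = f"s3://{bucket}/{prefix}/output"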

AWS Machine Learning Phases

Problem Formulation

Articulating your business problem and converting it into an ML problem by:
  1. defining a business problem

    such as inaccurate demand prediction
  2. identifying the business goal or outcome

    such as keep inventory low,
  3. Develop quantitative metrics

    Used to measure success. Model performance metrics (e.g. accuracy), such as monthly unsold inventory below 15%, versus business goal metrics measured on the deployed model in the real world, which help identify inappropriate performance metrics
  4. Convert to ML Problem such as each product inventory matches the sales figures.
  5. Choose the ML model such as regression model
  6. Understand the data:

    Access to the data, the amount of data needed, the desired solution, and a centralized data repository.
  7. Domain Expert:

    Consult a domain expert when actual values are higher than (or deviate from) the ML model's observations
  8. Data Quality:

    Identify missing data figures
  9. Identify features and label data

    A feature is an attribute that can be used to help identify patterns and predict future answers, such as credit card (number, transaction date) or car (doors, style, color). Labeled data is data for which the answer is already known as a fact, such as credit card (fraud, not fraud) or car (black, sedan, 4 wheels)
ML Pipeline- Problem Formulation

Data Collection and Integration

Ensure your raw data is in one central, accessible place:

Data Collection

    • Data Lakes:

      A solid foundation and single source of truth: collect all types of data [structured, unstructured] from various sources [databases, file systems, disk storage, etc.], ideally stored in S3 buckets; retrieval can be optimized with EFS or FSx for Lustre

    • AWS Kinesis Data Stream:

      Max 1,000 shards; per shard 1 MB/s in and 2 MB/s out (per consumer with enhanced fan-out); stores data for up to 7 days

    • AWS Kinesis Firehose:

      GB per sec, Transform data with Lambda, pre-built blueprints for common sources, Pay $$$ per volume

    • Amazon Managed Service for Apache Flink:

      Managed service to process big data GB/s, streaming data in seconds, stateful processing, support for ML algorithms

    • AWS Glue:

      With a Glue crawler, creates a metadata repository (Data Catalog) and runs ETL jobs on demand or on a schedule; processes Scala and PySpark code and transforms data records into DataFrames. Labeling files are schema-aware CSV files in UTF-8 without BOM (Byte Order Mark) with two leading columns, labeling_set_id and label; unmatched labels are assigned a new unique label

    • Amazon EMR:

      Elastic MapReduce: cluster-centric workloads to process and analyze big data sets with Apache Spark or Hadoop. A cluster consists of a primary node (manages the cluster), core nodes (run tasks and store data in HDFS), and task nodes (run tasks only)

    • AWS Data Exchange:

      Subscribes to third-party data providers in the AWS Marketplace and processes their datasets.

    • SageMaker Processing:

      Processes data with PySpark and Scala (Spark) containers

    • Lake Formation:

      Uses AWS Glue crawlers to create analytic data lakes on S3 with fine-grained permissions similar to RDBMS permissions. Uses ML algorithms to clean and classify the data; the FindMatches transform links related records across datasets that lack a common id.

ML Pipeline Data Gathering

Data Preprocessing and Visualization

ML Pipeline - Data Preprocessing and Visualization
  • Data Cleaning:

    Eliminate inconsistent data (language, format, unit of scale)

Data Preprocessing

  • Descriptive Statistics

    Look into the dataset to discover imbalanced data; use the mean for symmetric data distributions and the median for asymmetric distributions; eliminate outlier data points (artificial or natural); drop missing data [ rows := not enough samples – overfitting | columns := missing features – underfitting ] or impute missing values with [ mean | median ]

    The median is the middle number in a list of numbers sorted ascending or descending and can be more descriptive of the data set than the average. Arrange the data points from smallest to largest: if the number of data points is odd, the median is the middle data point in the list; if the number of data points is even, the median is the average of the two middle data points. Also used to calculate quartiles


    Quartile position: Qᵢ = i(n + 1) / 4

    Where:
    i := quartile number (1, 2, 3)
    n := number of samples
    Qᵢ := the ordinal position of the i-th quartile in the sorted dataset
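
A quick check of the median and quartile definitions above using NumPy (a small made-up sample; exact quartile values depend on the interpolation method):

# Median and quartiles of a small sample with NumPy
import numpy as np

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]           # 9 samples (odd count)
sorted_data = np.sort(data)

median = np.median(data)                           # middle value of the sorted list
q1, q2, q3 = np.percentile(data, [25, 50, 75])     # quartiles (q2 equals the median)

print(sorted_data)                                 # [ 3  5  7  8 12 13 14 18 21]
print(median, q1, q3)                              # median is 12.0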

Data Visualization

Transform raw data into an understandable format and extract important features from the data
Descriptive data - Numeric and Categorical
Mean: the average of the values, x̄ = (Σ xᵢ) / n
Median: the middle value of the sorted data
Outliers:
Values an abnormal distance from the other values; they skew values away from the others and could indicate that the data belongs to another column and needs to be cleaned, but they can also add richness to the dataset and can make it harder to predict
Outliers example
    • Natural outliers: not the result of an artificial error; they reflect some truth in the data.
    • Artificial outliers: caused by artificial errors, such as data collection errors
Natural vs Artificial outliers
Missing Data

Data that is missing due to collection errors; it requires human effort (or computation) to fill in. Missing data causes problems in building relationships if the amount of missing data is large.

Use isnull() function in Pandas ex. df1.isnull().sum()

Drop row/columns use dropna()

Overfitting := the model performs well on the training dataset but poorly (does not generalize) on evaluation data

Underfitting
:= the model is too simple and loses information in the features

Missing Data Strategies (see the sketch after this list):
  • Do nothing: when missing values are marginal or the algorithm handles them
  • Drop missing rows/columns: when they are a small percentage of the data
  • Mean/Median/Average: numeric data only; not very accurate; column-level, ignores correlations
  • Most frequent/constant: introduces bias; ignores correlations
  • Forward/backward filling: use the succeeding or preceding values
  • Interpolation/extrapolation: from other observations within the range of a discrete set of known data points
  • Hot deck imputation: by respondent
  • Model based:
    • K-NN (K-Nearest Neighbors): impute the mean value, then measure the closest neighbors; sensitive to outliers, slow
    • Regression
    • Deep Learning (datawig): categorical data; uses a Deep Neural Network (DNN); slow, most accurate, handles non-correlated data
    • Multivariate Imputation by Chained Equations (MICE): imputes multiple values for missing data
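
A minimal sketch of the drop and mean/median/KNN imputation strategies above, using pandas and scikit-learn; the column names and values are made up:

# Drop, mean/median, and KNN imputation of missing values
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df1 = pd.DataFrame({"age": [25, np.nan, 47, 51], "price": [5.99, 7.50, np.nan, 3.25]})

print(df1.isnull().sum())                  # count missing values per column

dropped = df1.dropna()                     # drop rows that contain any missing value

mean_imputer = SimpleImputer(strategy="mean")       # or strategy="median"
df_mean = pd.DataFrame(mean_imputer.fit_transform(df1), columns=df1.columns)

knn_imputer = KNNImputer(n_neighbors=2)             # model based: impute from the nearest rows
df_knn = pd.DataFrame(knn_imputer.fit_transform(df1), columns=df1.columns)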

Data Visualization

  • Identify Patterns
  • Find corrupt data
  • Identify Outliers
  • Find imbalances in the data
Visualization of categorical data
Bar charts: show [under | over] representation of data
Tabulation: numerical data
Histograms: show peaks, skewness, normality, outliers, and the distribution of data
Density/box plots: show the distribution of a single feature; use the kernel density function kde()
  • Multivariate stats
    • Correlations
      High correlations between attributes lead to poor model performance
    • Contingency Table
      Organizes information. A contingency table gives an overview of how many observations you have of at least two different things at the same time. It can also be presented as a Venn diagram
    • Scatterplot
      A scatterplot matrix shows the relationships between two or more numeric features in the dataset; a scatterplot with identification shows the relationship between attributes in different colors
    • Correlation matrix
      A correlation matrix quantifies linear relationships: values near +1 or -1 convey strong (positive or negative) relationships and values near 0 convey weak ones. A correlation heat map identifies strong and weak relationships with color codes (GREEN, RED, YELLOW) and shades
Contingency Table
Contingency Venn Diagram
Scatter plot
Scatter plot matrix
Scatter plot with identification
Correlation matrix
Correlation heat map matrix

Exploratory Data Analysis:

  • KPI, such as Net Promoter Score (NPS), Customer Profitability Score (CPS), Conversion Rate
  • Clustering or grouping: distribution of data over defined intervals (not time)
    • Histograms: one variable value
    • Scatter charts: two variables related, size represent the value
Histogram Clustering
Scatter Charts
Pie Chart with Matplotlib
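
A minimal sketch of the visualizations above (histogram, scatter plot, density plot, correlation heat map) using Matplotlib and Seaborn; the DataFrame is a made-up dataset:

# Histogram, scatter plot, kde plot, and correlation heat map
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.normal(50, 10, 200),
                   "sales": rng.normal(200, 40, 200)})

df["price"].plot(kind="hist", bins=20, title="Histogram")            # distribution, skewness
plt.show()

df.plot(kind="scatter", x="price", y="sales", title="Scatter plot")  # relationship of two features
plt.show()

df["sales"].plot(kind="kde", title="Density plot")                   # kernel density estimate
plt.show()

sns.heatmap(df.corr(), annot=True, cmap="RdYlGn")                    # correlation heat map
plt.show()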

Model Evaluation

Ideally, model performance is evaluated according to bias and variance:

Model Fit

  • Underfit: performs poorly on training data (too simple / not enough data) <- add complexity, increase features, increase n-gram size, decrease regularization.
  • Overfit: performs well on training data and poorly on evaluation data; memorizes rather than generalizes <- Regularization: L1/L2 add a penalty term to the loss function; reduce features (e.g. PCA); decrease the number of numeric attributes; increase regularization
  • Balanced: neither overfit nor underfit
  • Imbalanced: Categorial data distribution inconsistent, use techniques:
    • Synthetic Minority Oversampling Technique SMOTE oversampling: creates new observations of the underrepresented class by interpolating between the minority class instances. with k-nearest minority class neighbor, data augmentation
    • Random oversampling: copies of some of the minority class observations (randomly)
    • Generative Adversarial Sampling GAS: Generates unique observations that more closely resemble the real minority observations without being so similar
    • Edited Near Neighbor Undersampling : remove observations from the majority class
    • Adaptive Synthetic Sampling (ADASYN): develop more synthetic samples for minority class samples that are more difficult to learn (close to decision boundary)
  • Skewed: regression data distribution inconsistent, use log or power transform on the right-scewed or the left-scewed data
Model Fit
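
A minimal sketch of SMOTE and random oversampling, assuming the third-party imbalanced-learn package (pip install imbalanced-learn) is available; the synthetic dataset is only for illustration:

# Oversampling an imbalanced binary dataset with SMOTE and random oversampling
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

# Synthetic, imbalanced binary dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

X_smote, y_smote = SMOTE(k_neighbors=5).fit_resample(X, y)     # interpolate new minority rows
X_rand, y_rand = RandomOverSampler().fit_resample(X, y)        # duplicate minority rows

print(Counter(y_smote), Counter(y_rand))                       # classes now balanced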

Model Bias and Variance

Bias is how far the predicted target values are from the actual target values.

  • Low Bias, Low Variance:

    The model is balanced, ideal spot where the model is performing well in training and test models, the target values are close to real

  • High Bias:

    If variance is also low, the model is underfitted: overly simple, performing poorly on both training and test data; increase features and dataset size and decrease regularization. If variance is also high, the model is unbalanced and needs to be re-evaluated

  • High Variance:

    If bias is low, the model is overfitted: overly complex, performing well on training data (values close to the real targets) but poorly on test data; reduce features and increase regularization

Bias and Model fit
  • Bias-variance tradeoff

    Identify the complexity of the training model by measuring the Bias vs Variance and tradeoff to settle the model.

  • Transfer Learning

    The model is trained on a large dataset, fine tuned to work on small task specific dataset. improves performance and reduces training time, common algorithms ResNet and V4.

Classifications Problems

Classification Problem Evaluation Metrics
The confusion matrix

can help classify why and how a model gets something wrong.

The Confusion Matrix

The following metrics can be derived from the confusion matrix:

  • Accuracy (score): the percentage of correct estimates. Less effective when there are lots of TNs; the dataset makes the negatives easy to detect, so the metric is unreliable for detecting the TPs
  • Precision: the percentage of predicted positives that are correct (with respect to false positives). Best when the cost of FP is high, e.g. email spam, investment forecasts, or credit card approvals; ignores the negatives
  • Recall (Sensitivity): the percentage of actual positives correctly predicted (with respect to false negatives). Used when the cost of FN is high, such as the results of medical cancer or tumor exams
  • Specificity: the percentage of actual negatives correctly predicted (with respect to false positives). Used when the cost of FP is high; similar to precision but takes the TNs into account, such as the results of eliminating voters or discontinuing a product
  • F1 score: quantifies precision and recall in one number (their harmonic mean); used for binary/multi-class classification, especially with class imbalance, when you want to preserve the balance between precision and recall; 1 means all predictions are accurate
Accuracy: all estimates are matched (equally weighted): (TP + TN) / total
Precision: all positive estimates are correct, none is false: TP / (TP + FP)
Recall: all positive estimates are correct, no false negative estimates: TP / (TP + FN)
Specificity: all negative estimates are correct: TN / (TN + FP)
F1 score: all positive estimates are matched (with the impact of all FP & FN): 2 · (Precision · Recall) / (Precision + Recall)
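
A sketch deriving the metrics above from a confusion matrix with scikit-learn; the label vectors are made up for illustration:

# Accuracy, precision, recall, specificity, and F1 from a confusion matrix
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = accuracy_score(y_true, y_pred)       # (TP + TN) / total
precision   = precision_score(y_true, y_pred)      # TP / (TP + FP)
recall      = recall_score(y_true, y_pred)         # TP / (TP + FN)  (sensitivity)
specificity = tn / (tn + fp)                       # TN / (TN + FP)
f1          = f1_score(y_true, y_pred)             # harmonic mean of precision and recall

print(tp, fp, tn, fn, accuracy, precision, recall, specificity, f1)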
AUC-ROC Curve

AUC: Area Under the Curve; a metric for supervised classification models (e.g. logistic regression) that measures the degree of separability and provides an aggregate measure of performance across all possible classification thresholds. Useful for data with mostly true negatives because it reduces their impact.

AUC-ROC Curve

AUC can be interpreted as the probability that the model ranks a random positive example more highly than a random negative example.

ROC: Receiver Operating Characteristic curve; the probability curve, plotting the true positive rate (recall) against the false positive rate at various thresholds.

AUC-ROC curve: shows what the true positive rate vs. false positive rate looks like at various thresholds. A threshold is the cut-off probability: above it a prediction is classified as positive, otherwise negative. A practical threshold is often chosen where the curve turns toward the FPR axis (the knee of the curve).

False Positive Rate (1 – specificity), plotted on the x-axis: FP / (FP + TN)


*When the dataset is imbalanced and skewed toward the TN side, the precision-recall (PR) curve is used instead

True Positive Rate (recall/sensitivity), plotted on the y-axis: TP / (TP + FN)

Ideally, the curve should be aligned mostly with the TPR (y-axis). In the email example, emails are ranked by the classifier's score (the percentage of emails classified correctly as spam vs. not spam); based on the business case, plot the actual emails on the chart and tune the model to produce the curve that meets the business needs
Model 1: improved model, AUC 0.893 <- the FPR stays low for the majority of points
Model 2: model at AUC 0.687 <- the FPR is only somewhat lower than the TPR
Model 3: model at AUC 0.500 <- the curve sits on the diagonal; the model's accuracy is 50/50

AUC-ROC curve 3 x models
AUC Calculation steps

Step 1: Capture all the positive predictions of the model (example 20 positive predictions both TP and FP)

AUC Datapoints

Step 2: Calculate TP and FP Rates using this table

  1. Calculate the hit and miss labels for FP and TP (0 or 1)
  2. Calculate the Accumulated FP and TP respectively
  3. Calculate the FPR and TPR ratio from Accumulated FP and TP
AUC Calculations Table
Step 3: Compute the AUC contribution for each data point row (n) using the formula; the sum of these computed contributions is the AUC, 76% in this example.

Step 4: Find the coordinates (FPR, TPR) and plot them on the grid

AUC Graph
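
A sketch of computing the ROC curve and AUC with scikit-learn; the scores are hypothetical classifier probabilities for the positive class:

# ROC curve and AUC for a binary classifier
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR vs FPR at each threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="AUC = 0.5 (random)")
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (recall)")
plt.legend()
plt.show()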

Regression Problems

Regression Problem Evaluation Metrics
MSE Mean Square Error

Commonly used to evaluate regression models with continuous variables (incremental data points): MSE = (1/n) Σ (yᵢ − ŷᵢ)²

R2 (R squared)

Explains the fraction of variance accounted for by the model: R² = 1 − SSE/SST. Adding more variables increases R², which can mask overfitting; used with regression on continuous variables; use adjusted R² to address the problem

SSE := Sum of Squared Errors

Adjusted R2

Adjusted R² = 1 − (1 − R²)(D − 1) / (D − V − 1)

The threshold for a good R squared depends on the type of business.

Where:
D: Number of data points
V: Number of variables
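
A sketch of computing MSE, RMSE, R-squared, and adjusted R-squared for a regression model; the target/prediction values and the variable count are made up:

# MSE, RMSE, R-squared, and adjusted R-squared
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 2.5, 4.1, 5.6, 3.3, 4.8]
y_pred = [2.8, 2.9, 4.0, 5.1, 3.6, 4.5]

mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_true, y_pred)

D, V = len(y_true), 2                                # D data points, V explanatory variables
adjusted_r2 = 1 - (1 - r2) * (D - 1) / (D - V - 1)   # penalizes adding variables

print(mse, rmse, r2, adjusted_r2)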

Rand Index

An unsupervised metric that measures the agreement between two clusterings.

Regularizations

A set of methods for reducing overfitting in machine learning models. L1 regularization is also known as lasso regression, and L2 regularization is also known as ridge regression. L1 regularization adds the absolute value of the coefficients as a penalty term. L2 regularization adds the squared magnitude of the coefficients as a penalty term.

Regularization
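
A sketch of L1 (lasso) and L2 (ridge) regularization with scikit-learn; the synthetic dataset and alpha values are only for illustration:

# Lasso (L1) and Ridge (L2) compared to plain linear regression
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

ols   = LinearRegression().fit(X, y)        # no penalty term
lasso = Lasso(alpha=1.0).fit(X, y)          # L1: absolute value of coefficients as penalty
ridge = Ridge(alpha=1.0).fit(X, y)          # L2: squared magnitude of coefficients as penalty

print(ols.coef_)
print(lasso.coef_)                          # L1 drives some coefficients to exactly 0
print(ridge.coef_)                          # L2 shrinks coefficients toward 0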

Variance

Statistical measure of how far the values of a single variable spread around their mean (the average of the squared deviations from the mean)

Covariance

Statistical measure of the linear relationship between two variables; reveals how two variables change together

Correlation

Statistical measure of how closely two variables are related to each other; uses the covariance to generate a unit-independent measurement

E(x) is the expected value of x on the graph

Pearson correlation coefficient (PCC)

Statistical measure of the linear correlation between two sets of data: the ratio between the covariance of the two variables and the product of their standard deviations; a commonly used threshold is ±0.5

Pearson Correlation Coefficient
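
A sketch of variance, covariance, and Pearson/Spearman correlation (Spearman is described next) with pandas; the two series are made up:

# Variance, covariance, Pearson and Spearman correlation
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([2, 4, 5, 4, 6, 7])

print(x.var())                          # variance: spread of a single variable
print(x.cov(y))                         # covariance: how two variables change together
print(x.corr(y, method="pearson"))      # unit-independent, in [-1, 1]
print(x.corr(y, method="spearman"))     # rank-based (nonparametric) version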

Spearman correlation coefficient (SCC)

Statistical measure, is the nonparametric version of the Pearson product-moment correlation. Spearman’s correlation coefficient, (ρ, also signified by rs) measures the strength and direction of association between two ranked variables.

Polychoric correlation coefficient (PCC)

Statistical measure, measures agreement between multiple variables for ordinal variables (sometimes called “ordered-category” data).

Ordinary Least Squares Regression (OLSR):

A generalized linear modeling technique used to estimate the unknown parameters of a linear regression model from one or more explanatory (independent) variables; the goal is to minimize the sum of the squared differences between the observed values and the values predicted from the explanatory variables.

Data Optimization Techniques

Local Outlier Factor (LOF):

Discover outliers data points before applying dataset to the algorithm

Least-Angle Regression (LARS):

Regression Technique that predicts a dependent variable using one or more independent variables.

Model Deployment

The integration of the model and its resources into a production environment so that it can be used to create predictions.

Layers:

  • Model Layer: The structure of data, the algorithm, and framework
  • Data Layer: Actual data(metrics) collected to train, evaluate, and test
  • Serving (App) Layer: the computing power that serves the model and acts as custodian of the data
  • Inference Layer: updates model and data to generate predictions
  • Monitoring: collects metrics from App Serving layer and inference layer to alert, trigger scaling, report errors, or anomalies (concept drift)
ML-Layers

Concepts & Components:

  • Concept Drift: how the model accuracy gradually degrades over time and requires retraining
  • HTTP Endpoint: The model hosting service by SageMaker
  • Training Processing Jobs: Python SDK (scikit-learn SKLearnProcessor |  Spark PySparkProcessor]
  • Target Leakage: The predictors include data that will not be available at the time you make predictions , such as time sensitive predictions, use R2 and AUC-ROC
ML Deployment model
ML Deployment layers
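
A sketch of hosting the trained estimator from the training section (xgb_model) on a SageMaker HTTPS endpoint and invoking it; the instance type and payload are examples, not values from the original notes:

# Deploy the trained estimator to an HTTPS endpoint and send one CSV observation
from sagemaker.serializers import CSVSerializer

predictor = xgb_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    serializer=CSVSerializer())              # send CSV rows, receive predictions

result = predictor.predict("34,0,1,150,0.7") # one observation as a CSV string
print(result)

predictor.delete_endpoint()                  # stop paying for the endpoint when done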

Deployment and Operational Techniques

  • Online Learning: train the model incrementally by feeding data observations individually or in mini-batches
  • Incremental Learning: use base model to extend with new data
  • Transfer Learning: start with off-the-shelf trained model and then apply closely aligned observations
  • Out-of-core Learning: train on huge datasets that do not fit in memory by loading them in chunks, continuously adding more data

Training models best practices

  1. Dataset: Clean and representative
  2. Data Splitting: use proper technique to avoid overfitting
  3. Models: Experiment with multiple models
  4. Monitor: Constant monitor the training models
  5. Evaluation: on realistic data
  6. Iterate and improve

AWS AI ML Stack

Amazon Rekognition

Analyzes video and images using deep learning for image classification, object detection, text in images, facial recognition, sentiment analysis (Natural Language Processing, NLP), and public safety.

Uses bounding boxes to capture objects, with algorithms such as SSD, R-CNN, Faster R-CNN, and YOLO

Uses semantic segmentation to assign each pixel to an object or exclude it; useful in medical imaging to identify cancer cells or tumor tissue.

Sources: 

  • Images: file or byte-coded
  • Video: file or Kinesis video stream, asynchronous labeling to SNS topic or Kinesis Data Stream.
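
A minimal boto3 sketch of Rekognition label detection on an image stored in S3; the bucket and key names are placeholders:

# Detect labels (objects/scenes) in an S3 image with Rekognition
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "<bucket>", "Name": "images/dog.jpg"}},
    MaxLabels=10,
    MinConfidence=80)

for label in response["Labels"]:
    # Each label includes a confidence score and bounding boxes for detected instances
    print(label["Name"], round(label["Confidence"], 1))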

Amazon Textract

Extract intelligence from documents such as financial reports, medical records, tax forms, submission applications, insurance claims, etc. Beyond simple Optical Character Recognition (OCR)

Textract allows the creation of document-processing workflows; it extracts content but does not perform analysis, semantics, or classification.

Use Cases:

  • Document search indexes
  • Natural Language Processing (NLP) cases extract words, lines, and tables
  • Input PDF, JPG, multiple sizes
  • Detect Document Hierarchy
  • Synchronous (file or byte array) or asynchronous (Start*** APIs) operation; returns key-value pairs, or an SNS topic notification (async)
  • Integrates with Amazon Augmented AI (A2I)

Amazon Augmented AI (A2I)

Get a secondary human review for low-confidence predictions or to audit model performance: define a work team, use UI templates (worker task templates) for instructions, and have a human perform the final inference. The work team can be [ public := Mechanical Turk | private := staff | third-party external vendor ]

Use Cases:

  • Integrates with Amazon Rekognition, Textract
  • Custom models

Amazon Transcribe

Converts human speech waveforms into text (Automatic Speech Recognition, ASR) using sequence-to-sequence (seq2seq) neural network models

Use Cases:

  • Stream [HTTP/2 | WebSockets] and Batch mode [S3]
  • Multi language support, Mixed language conversations (LanguageOptions), Auto detect
  • Job Queuing built-in
  • Custom vocabulary and filtering
  • Automatic content redaction, including Personally Identifiable Information (PII)
  • Speaker Identification
  • Amazon Transcribe Medical is medical domain specific ASR

Amazon Translate

Translates Text from various languages to another.

Use Cases:

  • Multi-language support with automatic source-language detection
  • Sync and Async (batch processing) APIs
  • Non-customizable built-in translation models; allows Custom Terminologies (selected languages)

Amazon Polly

Converts text to speech (TTS) from plain text or Speech Synthesis Markup Language (SSML) by stringing basic speech units (phonemes) into natural-sounding voice (vocoder), using deep learning neural TTS (NTTS) sequence-to-sequence models. The SSML <prosody> tag defines how to render voices using pitch, volume, and rate of speech.

Use Cases:

  • Chatbots voicing
  • Automated interactive Answering machines

Amazon Lex

Conversational, interactive chat interface with language support. It converts an utterance (what the user actually says) into an intent (the action the user wants performed), configurable with slots and slot types, with a fallback intent used for retraining.

Use Cases:

  • Chatbots
  • Customer support
  • Online Troubleshooting 
  • Customer event management (booking appointments)
  • AIOps

Amazon Kendra

Searches and mines unstructured data, accepting natural-language search queries and returning intelligent, factual answers. Uses deep learning algorithms; works by indexing and organizing documents from external data sources such as MS SharePoint, S3, Confluence, Salesforce, and ServiceNow

Use Cases:

  • Chatbots
  • Customer support
  • Online Troubleshooting 
  • Customer event management (booking appointments)
  • AIOps

Amazon Personalize

Personalizes the user experience with recommendations, relying on [ User Data, Item Data, Interaction Data ] to build insights. Methods include clustering, content-based filtering, and more recently collaborative filtering via matrix factorization, which decomposes a very large sparse matrix into smaller matrices to extract hidden (latent) vectors, without considering the user's purchase history.

A Recurrent Neural Network (RNN) can be retrained using customer purchase history (Amazon's HRNN-Metadata extension); Multi-Armed Bandits (MABs) use an exploration-exploitation trade-off.

Use Cases Solutions:

  • User Personalization Recipes [ user, popularity, HRNN-metadata ]
  • Ranking-Based Recipes
  • Related Items Recipes 

Amazon Forecast

Provides forecasts based on historic data, using Deep Learning algorithms, using algorithms such as

  • AutoRegressive Integrated Moving Average (ARIMA): based on time-series differencing; requires small datasets (< 100)
  • Prophet: fits a time series to the data by detecting trends at different time intervals; works despite data irregularities and missing data (seasonal data); Facebook's NeuralProphet adds a neural network.
  • Amazon DeepAR[+]: uses long short-term memory (LSTM) and probabilistic sampling techniques, splitting the time series into windows (context length) to predict the forecast horizon; forward looking.
  • Exponential smoothing (ETS): statistical algorithm; weighted average of prior values with exponentially decaying weights
  • Convolutional neural network quantile regression (CNN-QR): uses causal convolutional networks; doesn't require future data for the forecast horizon.
  • Non-parametric time series (NPTS): used for seasonal, sparse, or bursty data with a lot of intermittent values

Use Cases:

  • Allows addition of weather data
  • Data splitting is not possible, use backtesting on historic data (Ground Truth)
  • Measured by RMSE (which amplifies outliers), Weighted Absolute Percentage Error (WAPE), Mean Absolute Percentage Error (MAPE), or median forecasting
  • Quantile-based probabilistic forecasts, e.g. P10 := 10% of real values are less than the prediction; for overstocking use a higher P
  • Weighted Quantile Loss (wQL): penalizes under-forecasting versus over-forecasting differently, showing over- vs. under-predicting

Amazon Comprehend

Extracts insights from text documents: entities, key phrases, PII, language, sentiment [ Positive | Negative | mixed | neutral ], and syntax; can obfuscate PII information and classify documents with labels

Use Cases:

  • Train custom models
    • Custom document Classification: labeling
    • Custom Entity detection: new entities
    • Document topic modeling (LDA) : frequent words form topics
  • Character encoding UTF-8, < 5KB, 25 docs/s (Batch 250 docs/s, < 1MB, total size < 5 GB)
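
A minimal boto3 sketch of Comprehend sentiment and entity detection; the text is an arbitrary example:

# Detect sentiment and entities in a short text with Comprehend
import boto3

comprehend = boto3.client("comprehend")
text = "The delivery was late, but the support team resolved it quickly."

sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])   # POSITIVE | NEGATIVE | MIXED | NEUTRAL

entities = comprehend.detect_entities(Text=text, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], entity["Text"], entity["Score"])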

Amazon CodeGuru

Provides intelligent code recommendations.

Use Cases:

  • Reviewer:
    • Detects potential code security vulnerabilities and offers suggestions in accordance with AWS best practices
    • Amazon CodeGuru Security is a static application security testing (SAST)
    • Code source: [AWS CodeCommit, S3, GitHub, BitBucket]
    • Integrates with Application CI/CD
  • Profiler: Profile code on AWS Lambda

Amazon Bedrock

offers a choice of high-performing foundation models (FMs), uses techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Amazon Bedrock Studio is a new SSO-enabled web interface that provides the easiest way for developers across an organization to experiment with large language models (LLMs) and other foundation models (FMs), collaborate on projects, and iterate on generative AI applications.

Use Cases:

  • Rapid Generative AI application development with Bedrock Studio
  • Text Generation: Create new pieces of original content, such as blog posts, social media posts, and webpage copy.
  • Virtual Assistance: Build assistants that understand user requests, automatically break down tasks, engage in dialogue to collect information, and take actions to fulfill the request
  • Text/Image Search: Search and synthesize relevant information to answer questions and provide recommendations from a large corpus of text and image data.
  • Text Summarization: Get concise summaries of long documents such as articles, reports, research papers, technical documentation, and even books to quickly and effectively extract important information.
  • Image Generation: Quickly create realistic and visually appealing images for ad campaigns, websites, presentations, and more.

Amazon Q

Amazon Q generates code, tests, debugs, and has multistep planning and reasoning capabilities that can transform and implement new code generated from developer requests. Amazon Q also makes it easier for employees to get answers to questions across business data—such as company policies, product information, business results, code base, employees, and many other topics—by connecting to enterprise data repositories to summarize the data logically, analyze trends, and engage in dialogue about the data.

Use Cases:

  • Amazon Q Business: Answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems.
  • Amazon Q Developer: Assist coding, testing, and upgrading applications, to diagnosing errors, performing security scanning and fixes, and optimizing AWS resources. 
  • Amazon Quicksight: Analytics dashboards
  • Amazon Connect
  • AWS SupplyChain
  • Elasticsearch
  • Kibana
  • Amazon Fraud Detector: model type ONLINE_FRAUD_INSIGHTS

Model Training and Tuning

Model Training and Evaluation

Iterative process that can be performed many different times throughout this workflow, perform additional feature engineering and tune the model’s hyperparameters

ML Pipeline- ML Model
ML Algorithm Selection

ML Frameworks

XGBoost:

eXtreme Gradient Boosting, uses parallel tree boosting (also known as GBDT, GBM), The weak learners are regression trees, and each regression tree maps an input data point to one of its leaves that contains a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction. It’s called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.

ML Algorithms

  • Algorithms

    Use the algorithms suitable to the ML Problem, Amazon SageMaker built-in algorithms are pre-trained models in docker containers and highly customizable, it is also possible to develop a custom algorithm and use the pre-built docker images frameworks.

Supervised Learning Algorithms

CatBoost:

open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.

Handles missing values by setting them to 0.

LightGBM:

GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.

Linear Learner:

Linear models are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labeled examples (x, y). x is a high-dimensional vector and y is a numeric label. For binary classification problems, the label must be either 0 or 1. For multiclass classification problems, the labels must be from 0 to num_classes – 1. For regression problems, y is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector x to an approximation of the label y.

Linear regression:

predicting variable value based on another single variable, numeric value regression

Multivariate regression:

Similar to Linear regression but uses multiple variables to predict a variable value based on multiple variables, numeric value regression

Logistic Regression:

Predicting the probability of an outcome, event, or observation, Binary classification regression

KNN:

The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. Two methods of dimension reduction methods: random projection and the fast Johnson-Lindenstrauss transform. used to find items similarities for recommendation models.

XGBoost – eXtreme Gradient Boosting:

Implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. accept inferences in text/csv, text/libsvm, recordio-protobuf

Common XGBoost Hyperparameters:

num_class: The number of classes.

num_round: The number of rounds to run the training.

alpha: L1 regularization term on weights. Increasing this value makes models more conservative.

gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm is.

eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.

base_score: The initial prediction score of all instances, global bias.

subsample: Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting.

Naïve Bayes:

Use principles of probability to perform classification tasks for text categorization problems. use maximum likelihood method

Decision Tree:

Follows a tree-like model of decisions and their possible consequences

Random Forest:

Combines the output of multiple decision trees to reach a single result

Factorization Machines:

The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. Used to capture interactions between features within high dimensional sparse datasets. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

TabTransformer:

A novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. The contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.

Unsupervised Learning Algorithms

Principal Component Analysis (PCA):

Dimension reduction algorithm, The data is linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified.

K-means clustering:

Partitioning a dataset into a pre-defined number of clusters. useful for tabular data, attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups

Linear Discriminant Analysis (LDA):

Finds a linear combination of features that characterizes or separates two or more classes of objects or events. Also known as normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant

Latent Dirichlet Allocation (LDA):

Similar to the clustering algorithm K-means, groups words and documents into a predefined number of clusters (i.e. topics). These topics can then be used to organize and search through documents.

Text or Speech Data Algorithms

BlazingText:

provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.

Sequence to Sequence:

Seq2seq is a family of machine learning approaches used for natural language processing. Applications include language translation, image captioning, conversational models, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence.

Neural Topic Model (NTM):

Organize a corpus of documents into topics that contain word groupings based on their statistical distribution.

Object2Vec:

A Neural embedding algorithm generalizes Word2Vec. It can learn low-dimensional dense embeddings of high-dimensional objects, preserves the semantics of the relationship between pairs of objects, learn by compute nearest neighbors of objects

Text Classification – TensorFlow:

Supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format.

Image or Video Data Algorithms

Residual Neural Network (ResNet):

The weight layers learn residual functions with reference to the layer inputs. used for Transfer Learning.

Single Shot MultiBox Detector (SSD with VGG):

A single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.

Fully Convolutional Networks (FCN):

They employ solely locally connected layers, such as convolution, pooling and upsampling. Avoiding the use of dense layers means less parameters (making the networks faster to train). It also means an FCN can work for variable image sizes given all connections are local.

Fully Convolutional Networks:

Predict PS propensities based on the integration of residue-level and structure-level features.

DeepAR:

Forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future.

Semantic Segmentation:

A deep learning algorithm that associates a label or category with every pixel in an image.

    • Formatting Data

      CSV formatting: labels in the leftmost column, no headers. RecordIO format: protobuf serialization (Python)
    • Data Splitting

      To generalize a training model, split the data into (1) training data, validated against (2) evaluation data, and then applied against (3) testing data, to avoid overfitting (learning the training data too well). To avoid bias, randomize the data. The methods to split:
Overfitting
Split Data Set
    • Testing and Validation Techniques

      • Simple Hold-out validation:

        Split data (Training:80%(Features + Labels), Evaluation:10%, Testing:10%(Features + Predict Labels)

Simple Hold-out
      • Cross validation:

        Compare multiple models based on scoring, to estimate model performance on unseen data and select the hyperparameters by splitting the data into multiple folds (training, evaluation, and test)

Cross validation
      • K-fold cross validation:

        Randomly partition the data into K segments (folds); in each iteration one fold is held out as evaluation data and the model is trained on the remaining (k-1) folds; average the results over the K runs. Smaller K means more bias

k fold

Notes about splitting data:

  1. Data ordering: can lead to bias; Randomize data
  2. K-Size: smaller k size is more bias, use larger K size
  3. K-fold with shuffling: apply k-fold with shuffle each iteration
k size
      • Iterated k-Fold validation with shuffle

        Shuffle data between K-Folds, then average the results

Iterated K-Fold
  • Time Series Cross Validation

    Uses forward chaining to ensure the model is not overtrained and generalizes well.

  • Leave One Out Cross Validation (LOOCV)

    Used to estimate the performance of machine learning algorithms when they make predictions on data not used to train the model; uses k-fold with k equal to the number of observations, holding out a single observation for testing in each iteration.
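
A minimal sketch of hold-out splitting and k-fold cross validation with scikit-learn; the dataset and model choice are only for illustration:

# Simple hold-out split and k-fold cross validation
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Simple hold-out: 80% train, 20% held out (shuffled to avoid ordering bias)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                  # accuracy on unseen data

# K-fold cross validation with shuffling: average the score over k folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print(scores.mean(), scores.std())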

Train The Model

  • During training iterations, the algorithm calculates the parameters (or weights) based on the previous iteration so that the predicted values get as close as possible to the observed (true) values; the iterations continue until the error reaches the target or the defined number of iterations has been reached.
    The model converges if the error decreases over iterations (successful); otherwise the model needs to be re-evaluated.
    The computer calculates the parameters using the loss (objective) function and optimization techniques; the loss function measures the error (the deviation from the actual value) given a set of weights.
    • Minima := the point with the least amount of error, found by traversing the well (trough)
    • Global minima := the lowest trough
    • Gradient descent := technique used to find the minima by calculating the slope (gradient) of the curve; works on small datasets, is slow, requires the dataset to fit in memory, and can fail to reach the global minima; these limitations can be addressed with the stochastic gradient descent algorithm
    • Learning rate := a hyperparameter of the model, the size of the step (the delta between two weights); large steps cause overshoot (missing the trough/minima), small steps move slowly and may fail to reach the minima; optimized by reducing the step size as it gets closer to the minima
      epoch := one iteration over the training data to train the model
    • Hyperparameter := a parameter that is external to the model, set by a human, and cannot be estimated from the data, e.g. the learning rate; searching for the best hyperparameters is called optimization
    • Collinearity := when the prediction is tightly aligned with one parameter; use PCA and drop low variances
    • [Discrete | Continuous] objectives := whether the objective is a specific distinct value (classification) [F1, recall, precision] or a continuous value (regression) [cross-entropy, MSE, RMSE]
Model training
    • Loss Function and Optimization Techniques

      • RMSE: Root Mean Square Error

        The standard deviation of the differences between the predicted and observed values: RMSE = sqrt( (1/n) Σ (ŷᵢ − yᵢ)² ). Used for regression models

Loss Function - RMSE
      • Log Likelihood Loss (cross-entropy)

        Calculates the negative logarithm of the predicted probabilities: L = −[ y·log(p) + (1 − y)·log(1 − p) ]. Used for binary classifications

Log likelihood - cross entropy loss
Global Minima
The Loss Function
    • Gradient Descent Optimized Techniques (see the sketch after the figures below)

      • Gradient Descent (GD): needs to go through all the data once per update (epoch); slow to find the minima but takes fewer steps to reach it.
      • Stochastic Gradient Descent (SGD): updates the parameters for each data point (record); drawback: the updates oscillate in different directions
      • Mini-Batch Gradient Descent: uses a small batch size to update the parameters; less noisy updates than SGD
Gradient Descent
Approximating the Minima
Learning Rate
Learning Rate size Optimized
GD Variations
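
A toy gradient descent loop for simple linear regression (y ≈ w·x + b) with a squared-error loss; the data, learning rate, and epoch count are made up for illustration:

# Gradient descent on a squared-error loss for y = w*x + b
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)     # ground truth: w=3, b=5, plus noise

w, b = 0.0, 0.0
learning_rate = 0.01                          # step size; too large overshoots the minima
epochs = 1000                                 # one epoch = one pass over the data

for epoch in range(epochs):
    y_pred = w * x + b
    error = y_pred - y
    loss = np.mean(error ** 2)                # MSE loss (objective) function
    # Gradients (slope) of the loss with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w               # step toward the minima
    b -= learning_rate * grad_b

print(w, b, loss)                             # w and b approach 3 and 5 as the loss converges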

Feature Engineering

Feature Engineering:

The process of extracting more information from existing data in order to improve the model's prediction power and help the model learn faster. Requires domain knowledge of the data; for example, identifying the days of the week (Sat/Sun) where sales figures take a different pattern

Features:

The rank of an attribute that helps identify the object; can be numeric (such as age) or binary (such as car_windshield_present). In NLP (Natural Language Processing) the features are words; for structured data, use PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce dimensionality

Feature Distribution:

  • Binomial: binary outcome [ Y | N ]
  • Poisson: Count of events per period
  • Normal (gaussian): Data distributes evenly around mean. (bell shaped curve)
The Curse of Dimensionality

The main components of Feature Engineering:

  • Feature Extraction

    Create new features from existing dataset, reduce dimensionality, for example Extract features from the image to replace raw image pixels, in Natural Language Processing extract features from popular words, excluding articles and prepositions, and in Structured Data is the principal component analysis (PCA) unsupervised technique based on the interrelations between variables or t-distributed stochastic neighbor embedding (T-SNE)

Feature Extraction of Image
  • Embedding

    Convert high dimensional vectors into low-dimensional space to make it easier to do machine learning with large sparse vector inputs. Embeddings also capture the semantics of the underlying data by placing similar items closer in the low-dimensional space. This makes the features more effective in training downstream models.

Embedding Technique
  • Feature Selection

    Selecting a subset of features that are most useful to the problem.
    Filtering subset of features using scoring methods to rank and choose features such as correlation or importance

Feature Selection
Feature Filtering based on non importance
  • Feature Creation and Transformation

    Generating features from existing features, such as generating day, month, year from a date, numeric features

    • Logarithmic transformation: smooths the data distribution; works on positive, non-zero values only

Logarithmic Transformation
    • Square/cube root: reduces variance; square root works on positive values and zero only, cube root works on any value.

    • Binning := grouping continuous numbers into groups, such as age (child, adult, senior)
Binning
    • Scaling: convert values into a common scale, e.g. 0..1 (see the sketch after these lists)
      • Mean/Variance Optimization (MVO):

        Scales values to a mean of 0 and a standard deviation of 1; blends multiple attributes based on correlation, produces smaller values, and reduces the impact of outliers

      • MinMax:

        Scales values to 0..1; robust to very small standard deviations

      • MaxAbs:

        Divides all data in a feature by the maximum absolute value of that feature

      • Robust:

        Subtracts the median of the feature and divides by the difference between the 75th and 25th quartiles (the interquartile range)
        Minimizes the impact of large marginal outliers

    • Normalizer: rescales each observation (row) to unit norm, so its values fall around [0 .. 1].
  • Categorial Data Transformation:
    • Ordinal

      Categories ordered and related such as sizes [ S | M | L ] , distances [ near | far], mapped 1 to 1 ex. S = 1, M=5, L=10

    • Nominal

      Categories that are not ordered and not related, such as colors [ Red | Blue | Green ] or country [ CA | US | MX . . . ]
      One-hot encoding: each category gets its own column with values [ 0 | 1 ] for matched values; to reduce the number of columns, group values by similarity, e.g. territory for countries

    • Cartesian

      Takes categorical variables or text as input and produces new features that capture the interaction between the input variables.

    • N-Gram

      Takes a text variable as input and produces strings corresponding to sliding a window of (user-configurable) n words, generating outputs in the process. An n-gram is a sequence of n words: a 2-gram (bigram) is a two-word sequence such as "please turn", "turn your", or "your homework", and a 3-gram (trigram) is a three-word sequence such as "please turn your" or "turn your homework".

    • Orthogonal Sparse Bigram (OSB)

      Slides a window of size n over the text and outputs every pair of words that includes the first word in the window, e.g. "The quick brown fox" => {(the,quick) (the,brown) (the,fox)}

    • Bag of words

      NLP technique that tokenizes the input document text and outputs a statistical representation of the text, such as word-count histograms

    • Term Frequency-Inverse Document Frequency(tf-idf)

      Determines how important a word is in a document by weighting words by how common or rare they are across the documents (rather than just the raw count); words are then ordered by weight

Encoding Techniques

Used to encode categorical features into a numeric representation.

  • Ordinal: categories are ordered
  • Nominal: categories are unrelated and not ordered

Encoding Techniques

  • One-hot encoding: works for a limited number of categories, such as day of week, months, etc.
  • Target encoding with smoothing: replaces each category with the (smoothed) mean of the target for that category; suited to a large number of categories.
Ordinal and Nominal
Categorical to Numerical
One-Hot Encoding
Target Encoding - mean
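To tie the encoding techniques together, here is a minimal pandas sketch; the columns, the ordinal mapping, and the smoothing factor m are all hypothetical.

import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S"],
    "color": ["Red", "Blue", "Green", "Blue", "Red"],
    "sold": [0, 1, 1, 1, 0],
})

# Ordinal: ordered categories mapped one-to-one to numbers
df["size_ord"] = df["size"].map({"S": 1, "M": 5, "L": 10})

# Nominal: one-hot encoding, one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target encoding with smoothing toward the global mean of the target
global_mean = df["sold"].mean()
stats = df.groupby("color")["sold"].agg(["mean", "count"])
m = 2  # smoothing factor
df["color_te"] = df["color"].map(
    (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
)

print(pd.concat([df, one_hot], axis=1))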

Hyperparameter Tuning

Hyperparameter categories:

  • Model

    Help define the model architecture, such as the number of layers in a neural network. Examples: filter size, pooling, stride, padding

  • Optimizer

    How the model learns patterns in the data. Examples: gradient descent, stochastic gradient descent, momentum

  • Data

    Define attributes of the data itself, such as cropping or resizing images in the input data. Useful for small or homogeneous datasets

Search Hyperparameters

  • GridSearch

    Hyperparameter values are placed at equal distances on a grid, and the model is trained and scored for every hyperparameter combination. Thorough but inefficient.

  • RandomSearch

    Replaces the exhaustive enumeration of all combinations by selecting them randomly. Generalizes to continuous and mixed spaces and explores many more values than grid search could for continuous hyperparameters. Runs concurrent training jobs without impacting the performance of the search.

Hyperparameter Grid vs. Random Search
  • Bayesian Search

    The tool makes guesses about the best hyperparameter combinations, then uses regression to refine those guesses and pick the best combinations for the next round.
    Good for exploring unknown areas.

Bayesian Search
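For a concrete feel of grid vs. random search, here is a minimal scikit-learn sketch (an assumed tool; SageMaker's own tuner appears later in the SageMaker section). The model, parameter ranges, and dataset are placeholders; a Bayesian search would need an extra library such as scikit-optimize and is not shown.

from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grid search: train and score every combination on an evenly spaced grid
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X, y)

# Random search: sample combinations at random from a continuous range
rand = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          param_distributions={"C": uniform(0.01, 10)},
                          n_iter=10, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)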

AWS Machine Learning Devices

  • AWS DeepLens: deep-learning-enabled programmable video camera; use model quantization to speed up inference without compromising accuracy
  • AWS DeepRacer: autonomous race car and simulator for learning reinforcement learning
  • AWS DeepComposer: MIDI keyboard for generating music with Generative Adversarial Networks (GANs)
  • AWS Panorama: adds computer vision applications to existing IP camera setups

Amazon SageMaker Components

  • Notebooks:

    Uses Jupyter Notebooks, an open-source format for sharing notebook documents. Frameworks are prebuilt, optimized components for conducting ML work in the notebooks. A Jupyter notebook instance is a virtual machine that hosts notebooks in a specific Conda (open-source Anaconda) managed environment or SparkMagic. A notebook is a collection of documents, code, reports, and data visualizations in the hosting framework; a document is a collection of mixed cells [ code | Markdown | Raw ].
    Create notebook instances in the SageMaker console; each notebook uses a kernel (the supported framework)

  • Instance Types:

    SageMaker workloads: notebooks, training, and inference (applying predictions in production)

    • t family: ideal for Notebooks
    • m, r, c, p family: ideal for traditional ML training
    • g family: ideal for Deep Learning DL
    • Elastic Inference / Inferentia: ideal for inference
  • Lifecycle Configuration: shell scripts that run when a notebook instance is created or started (for example, to install libraries)

  • Data Visualization:

    Histograms, cross-correlations, and scatter plots.

Data Visualization
  • Model Selection:

    Algorithm options [ built-in algorithms | scripts on supported frameworks | subscribed from AWS Marketplace | bring your own ]; hyperparameter optimization and tuning (automatic)

  • Ground Truth:

    Uses a combination of a selected human workforce [ public := Amazon Mechanical Turk | staff | vendor ] and machine learning to create labeled datasets. Data types [ Image | Text | Video — files | frames ]; labeling techniques [ chaining labeling jobs | label verification and adjustment | batches | annotation consolidation (bounding box, semantic segmentation, named entity, output manifest, Mechanical Turk) ]; built-in tasks [ Bounding Box | Image Classification | Semantic Segmentation ]

  • Hyperparameter auto tuning:

    Amazon SageMaker finds the best version of a model by running many small training jobs on a dataset using the algorithm and ranges of hyperparameters. It uses Gaussian-process regression to pick the most promising hyperparameter values and Bayesian optimization to balance exploiting known-good regions and exploring new values. Works best when hyperparameter ranges are kept small.

  • Deployment:

    • One Prediction at a time
    • All predictions at once

Deployment in SageMaker

  • SageMaker is one option to run and deploy machine learning models and inference: a fully managed, end-to-end service that handles ML deployment, auto scaling, model hosting, and HTTPS endpoints.
  • Unmanaged ML deployment requires creating an AMI, launching compute instances (EC2) or containers (ECS or EKS), and configuring auto scaling.
  • ML on the edge: run models on edge devices with less power, such as security cameras.
  • AWS IoT Greengrass: for ML models built with SageMaker, DL AMIs, or DL containers.
  • Amazon SageMaker Neo: a compiler converts the model to a common format and a runtime optimized for the underlying hardware speeds up inference; supports processors from ARM, NVIDIA, Xilinx, and Texas Instruments.

Steps to deploy and host:

  1. Create the model

    CreateModel API: define the model artifact storage location and name the model for hosting and running batch inference jobs

  2. Create HTTPS Endpoint

    CreateEndpointConfig API: create the HTTPS endpoint configuration, including production variants for each model [ instance type, initial instance count, initial weight ]. A SageMaker Runtime VPC endpoint is used for invoke_endpoint API calls.

  3. Deploy an HTTPS endpoint

    CreateEndpoint API: specify the endpoint configuration, model name, and tags (a boto3 sketch of all three steps follows)
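A minimal boto3 sketch of the three steps; the model name, role ARN, image URI, S3 path, and endpoint names are placeholders, not real resources.

import boto3

sm = boto3.client("sagemaker")

# 1. Create the model (container image + trained artifacts in S3)
sm.create_model(
    ModelName="my-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
        "ModelDataUrl": "s3://my-bucket/model/model.tar.gz",
    },
)

# 2. Create the endpoint configuration (production variants)
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    }],
)

# 3. Deploy the HTTPS endpoint from the configuration
sm.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
    Tags=[{"Key": "project", "Value": "demo"}],
)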

Inference Options

Batch Transform:

offline inference on large datasets; does not require a persistent endpoint; can associate input records with inferences to support interpretation of results.

  • Real-Time: live predictions for sustained traffic; low latency and consistent performance; multi-model/multi-container; secure HTTPS endpoints; one data point at a time; used when immediate results are required (a real-time endpoint, which can be fed from Apache Spark Streaming)
  • Serverless: auto-scales and auto-provisions resources; cold starts; suits unpredictable traffic patterns; CPU only; used for simple, standard inference with intermittent traffic (a serverless endpoint)
  • Asynchronous: queue based; near real-time; payloads up to 1 GB; longer runtimes; used for less time-sensitive workloads with moderate payloads (an asynchronous endpoint)
  • Batch: large datasets; higher throughput; event- or schedule-based; input records can be associated with inferences to support interpretation of results; used for offline processing that is not time sensitive
Batch Predictions
Real-time Predictions
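For the real-time option, a minimal sketch of calling the endpoint one data point at a time through the SageMaker Runtime; the endpoint name and CSV payload are placeholders.

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    ContentType="text/csv",
    Body="5.1,3.5,1.4,0.2",  # one observation serialized as CSV
)
print(response["Body"].read().decode("utf-8"))  # the model's prediction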

Blue/Green Deployment

Two identical production environments; test traffic on the new deployment and revert in case of issues; use UpdateEndpoint() to switch

Blue Green Deployment

Canary and A/B Deployment

Deploy a new, identical environment for a small subset of users, e.g. ~20% (canary deployment) or a 50/50 split (A/B deployment). Use a traffic-shifting mode to test traffic on the new deployment; revert in case of issues, or gradually shift traffic to the new environment (or adjust the weights in a single step) until a single model processes all of the live traffic.

  • All at once: a single step shifts 100% of the traffic
  • Canary: two steps, c% then 100%
  • Linear: a fixed portion of traffic shifts at specific intervals, p% x n until 100% (a boto3 sketch follows)
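A hedged boto3 sketch of a blue/green update with canary traffic shifting; the endpoint, configuration, and alarm names are placeholders, and the DeploymentConfig structure follows my reading of the SageMaker UpdateEndpoint API.

import boto3

sm = boto3.client("sagemaker")

sm.update_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config-v2",  # the new ("green") configuration
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",              # or "ALL_AT_ONCE" / "LINEAR"
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 20},
                "WaitIntervalInSeconds": 600,  # bake time before shifting the rest
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "my-endpoint-error-alarm"}]  # revert on alarm
        },
    },
)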

Offline testing: uses historical data against an alpha endpoint.

Online testing with live data: uses an A/B testing model.

Deploy multiple variants of the model on the same HTTPS endpoint with one endpoint configuration that describes all variants of the model (a sketch follows).
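A minimal sketch of one endpoint configuration that hosts two model variants for A/B testing, with a 50/50 split coming from the initial variant weights; all names are placeholders.

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="ab-endpoint-config",
    ProductionVariants=[
        {"VariantName": "ModelA", "ModelName": "my-model-a",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.5},
        {"VariantName": "ModelB", "ModelName": "my-model-b",
         "InstanceType": "ml.m5.large", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.5},
    ],
)
# Traffic is split between variants in proportion to their weights; a specific
# variant can also be targeted with invoke_endpoint(..., TargetVariant="ModelB").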

Retrieval Augmented Generation (RAG)

Retrieve data from outside a foundation model and augment your prompts by adding the relevant retrieved data in context. QuickSight Feature.

Amazon SageMaker inference pipeline

Chains containers (e.g. preprocessing, prediction, post-processing) behind a single endpoint so that data is processed for inference in the same way it was processed during training.

Monitoring the deployment

Monitoring Categories

  • Business Goals and KPIs: used to understand how the model meets your business goals, User experience
  • Hosting infrastructure: Performance and cost of infrastructure (latency, utilization, scaling)
  • Model Performance: Detecting Drifts for divergence of data

Model Monitoring

  • Data Quality
  • Model Quality: metrics such as accuracy and F1 score
  • Bias and Drift
  • Feature attribution drift ( explainability )

Monitoring Tools integration (CloudWatch, EventBridge, and CloudTrail)

Monitoring SageMaker Deployments

Production readiness testing comparison strategies:

  • Shadow Testing: two models run side by side, sharing the same endpoint or on separate endpoints; the shadow variant receives a sampling percentage of traffic without responding to callers; capture key metrics with DataCapture and compare the logs
  • A/B Testing: production and candidate variants, multiple models single endpoint, CloudWatch monitoring

Data Security and Privacy

Tokenization

The process of replacing actual sensitive data elements with non-sensitive data elements that have no exploitable value for data security purposes. Security-sensitive applications use tokenization to replace sensitive data, such as personally identifiable information (PII) or protected health information (PHI), with tokens to reduce security risks.

VPC endpoints

Allows access to SageMaker resources without traversing the public internet

Concept Drift

The statistical properties of the target (and its relationship to the features) change over time, so a deployed model's predictions gradually lose accuracy and the model needs monitoring and periodic retraining.

Use Cases for Bias Detection

  • Compliance: Regulation, fairness opportunities
  • Internal reporting: Auditing
  • Operational excellence: Maintenance 
  • Customer service: Decisions regarding applications

Solutions:

  • Amazon SageMaker Clarify: covers the ML lifecycle (data prep, process and analyze, deploy, and monitor); provides model bias detection and model explainability
  • Monitoring on a daily/weekly/monthly schedule (cron jobs / EventBridge)
  • Built-in rules to detect data drift
  • Automate corrective actions for alerts
  • Periodic Retraining (Bias or change feature attributions)

Practical Data Science with Amazon SageMaker

Work with SageMaker

  1. Create a Domain:

    A domain includes an EFS volume, VPC settings, users, security policies, and a home directory for each user

  2. Create Users:

    Create a user, create or select a permission (execution) role, customize permissions, and configure SageMaker Studio settings

  3. Create a Notebook:

    From the user page or the Studio console, create a notebook and configure it: image (framework), e.g. Data Science 3; instance type; kernel, e.g. Python 3; permissions; network; Git

  4. Open Jupyter Notebook:

    The notebook editor contains a menu (File, Edit, View, Insert (cells), Cell ([ run | markdown | Raw ]), Kernel (execution engine), Widgets (notebook state), and Help (links to usage and frameworks)), a toolbar of shortcut icons, and an editor to edit cells.

SageMaker Studio

Web-based IDE for ML; a hub for all SageMaker tools (Jupyter notebooks, ML models, frameworks, data science preparation tools such as Data Wrangler, and others) used in the ML pipeline

SageMaker Auto Tuning
Sage Maker Tuning Hyperparameters

SageMaker Algorithms

  • built-in models are pre-trained, but can be customized
Amazon SageMaker Algorithms

Data Preprocessing SageMaker Data Wrangler

  • Select data from multiple data sources and semi-structured data formats.
  • Automatically verify data quality and detect abnormalities.
  • Understand data with visualization templates.
  • Transform data with 300+ built-in transformations plus the ability to author custom transformations.
  • Quickly estimate ML model accuracy and diagnose issues before models are deployed into production.
  • Automate ML data preparation workflows.
  • Use SageMaker Clarify to detect bias with Total Variation Distance (TVD), which measures the difference between distributions of distinct features across facets in the dataset, or Proportions of Outcomes (DPO), which measures how the proportions of outcomes differ between group memberships.
  • Custom Transforms: [ Python (Pandas), Python (PySpark), and SQL (PySpark SQL) ] (a Pandas sketch follows the figure)
SageMaker Data Wrangler
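A small example of a Custom Transform in Python (Pandas) mode; as far as I recall, the step exposes the current dataset as a DataFrame named df and expects the transformed frame back in df. The column names (order_date, sales) are hypothetical.

import pandas as pd  # typically already available in the transform environment

# Derive a weekend indicator and drop rows with a missing target
df["order_date"] = pd.to_datetime(df["order_date"])
df["is_weekend"] = df["order_date"].dt.dayofweek.isin([5, 6]).astype(int)
df = df.dropna(subset=["sales"])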

Training Jobs

  • Compute resources: ML compute instances (ML AMI enabled)
  • Data storage: S3 / EFS / FSx for Lustre
Training Job in Amazon SageMaker
SageMaker Hyperparameter Tuning

Booster Parameters

alpha: L1 regularization term on weights. Increasing this value makes models more conservative.

eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.

max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.

min_child_weight: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning.

num_round: The number of rounds (trees) used for boosting. Increasing the trees can increase the model accuracy but increases the risk of overfitting.

Learning Parameters

objective: Defines the loss function to be minimized. There are objectives specific to regression problems and to binary and multi-class classification problems, for example binary:logistic, binary:logitraw, and binary:hinge.

eval_metric: Tied to an objective and used for validation data, for example auc.
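A minimal sketch of setting these booster and learning parameters on SageMaker's built-in XGBoost algorithm via the Python SDK; the role ARN, bucket paths, and instance choice are placeholders.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb-output/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    eta=0.2,
    max_depth=5,
    min_child_weight=1,
    alpha=0.1,
    num_round=100,
)
# xgb.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})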

SageMaker Automatic Model Tuning (AMT)

Static hyperparameters: don’t change

target objective metric: to be optimized

Optimization options:

  • Grid Search
  • Random Search
  • Bayesian Search
  • Hyperband: for Neural Network based algorithms NN

HyperparameterTuner()

  • estimator: An estimator object that has been initialized with the required configuration. There does not need to be a training job associated with this instance. extendable to [ Transformer:= Batch | Predictor:= endpoint | MultiDataModel:= multi-models]
  • objective_metric_name: Name of the metric for evaluating training jobs.
  • hyperparameter_ranges: Dictionary of parameter ranges. These parameter ranges can be one of three types: Continuous, Integer, or Categorical. The keys of the dictionary are the names of the hyperparameter, and the values are the appropriate parameter range class to represent the range.
  • objective_type: The type of the objective metric for evaluating training jobs. This value can be either ‘Minimize’ or ‘Maximize’ (default: ‘Maximize’).
  • max_jobs: Maximum total number of training jobs to start for the hyperparameter tuning job. The default value is unspecified for the ‘Grid’ strategy, and the default value is 1 for all other strategies (default: None).
  • max_parallel_jobs: Maximum number of parallel training jobs to start (default: 1).
  • early_stopping_type: Specifies whether early stopping is enabled for the job. Can be either ‘Auto’ or ‘Off’ (default: ‘Off’). If set to ‘Off’, early stopping will not be attempted. If set to ‘Auto’, early stopping of some training jobs might happen, but is not guaranteed to.
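Putting those arguments together, a minimal HyperparameterTuner sketch for the built-in XGBoost algorithm; the estimator mirrors the hypothetical one in the Booster Parameters sketch above, and all names and S3 paths are placeholders.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
xgb = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/tuning-output/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)  # static hyperparameters

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",  # target objective metric
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.05, 0.5),
        "max_depth": IntegerParameter(3, 10),
        "min_child_weight": ContinuousParameter(1, 10),
    },
    strategy="Bayesian",                      # or "Random" / "Grid" / "Hyperband"
    max_jobs=20,
    max_parallel_jobs=2,
    early_stopping_type="Auto",
)
# tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/validation/"})
# tuner.best_training_job() returns the name of the winning training job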

Hyperparameter Tuning

  • f1: balances precision and recall, accounting for both FP and FN
  • map (mean average precision)
  • ndcg (Normalized Discounted Cumulative Gain)
  • rmse (Root Mean Square Error): amplifies large errors
  • mae (Mean Absolute Error): disregards the sign (+/-) of errors
  • ssd (Sum of the squared Distances): 
  • msd (Mean Squared Distances): 
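To ground a few of these metrics, a minimal scikit-learn sketch (an assumed library); the true and predicted values are made up.

from sklearn.metrics import f1_score, mean_absolute_error, mean_squared_error

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))  # balances false positives and false negatives

y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true_r, y_pred_r))        # MAE: ignores the sign of errors
print(mean_squared_error(y_true_r, y_pred_r) ** 0.5)  # RMSE: amplifies large errors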

Inferencing vs. training

Inferences vs Training

Inference requires autoscaling

Machine learning instance types by purpose

AWS Inferentia / Amazon Elastic Inference: low-cost inference acceleration with strong FP16 performance; up to 16 first-generation Inferentia accelerators per EC2 Inf1 instance and up to 12 Inferentia2 accelerators per EC2 Inf2 instance (up to 190 TFLOPS of FP16 per Inferentia2 chip)

Standalone GPU instances are suitable for model training but are typically oversized for inference.

Factors to determine the right instance:

  • Target Latency SLA
  • Constraints
  • Start small and size up as needed
  • Use Fp16 for lower latency and higher throughput
Machine Learning Instance Types
Instance Type per ML Job

MLOps

A combination of Machine Learning and Operations – combines people, technology, and processes to deliver collaborative ML solutions.

MLOps requires the integration of software development, operations, data engineering, and data science.

MLOps ensures the operability of ML workloads by automating and orchestrating the data and CI/CD pipelines. Tools include AWS Glue (ETL), SageMaker Pipelines, Step Functions, and Apache Airflow.

Use separate pipelines for training and deployments.

Deploying a model publishes a new model version to the staging environment pending approval; use CodePipeline to approve and deploy it to the production endpoint (a minimal pipeline sketch follows).
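A hedged, minimal SageMaker Pipelines sketch that registers a single training step in a pipeline definition; the estimator, S3 paths, names, and role ARN are placeholders, and the simple estimator/inputs form of TrainingStep is assumed (newer SDK versions prefer step_args).

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
xgb = Estimator(
    image_uri=image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/pipeline-output/",
    sagemaker_session=session,
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=xgb,
    inputs={"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv")},
)

pipeline = Pipeline(name="my-ml-pipeline", steps=[train_step])
# pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/SageMakerRole")  # create or update the definition
# pipeline.start()                                                          # run the pipeline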

MLOps Deployment Lifecycle
