Spark and Machine Learning at Scale

Spark and Machine Learning at Scale Courseware (WA3290)

In this comprehensive Spark and Machine Learning at Scale training course, students work with Spark, a powerful open-source big data processing engine, to build scalable machine learning solutions.

Benefits

This Spark and Machine Learning training teaches participants how to build, deploy, and maintain powerful data-driven solutions using Spark and its associated technologies. The course begins with an introduction to Spark, its architecture, and how it fits into the Hadoop and cloud-based ecosystems. Participants will learn to set up Spark environments using Databricks Cloud, AWS EMR clusters, and SageMaker Studio. In addition, students will learn about Spark's core functionalities, including RDDs, DataFrames, transformations, and actions.
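
For orientation, the core concepts named above (a Spark session, RDDs, DataFrames, transformations, and actions) look roughly like the following in PySpark. This is a minimal sketch, not course lab code; the application name, data, and column names are purely illustrative.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session -- illustrative configuration only
spark = SparkSession.builder.appName("intro-example").getOrCreate()

# A DataFrame built from in-memory rows (hypothetical data)
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# A transformation is lazy: nothing executes until an action is called
adults = people.filter(people.age >= 30)

# Actions trigger execution
adults.show()
print(adults.count())

# The same idea with the lower-level RDD API
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(lambda x: x * x).collect())
```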

Outline

  1. Introduction to Spark - Overview of Spark and its Architecture
    1. Big Data and the Analytics Process
    2. What is Big Data?
    3. Volume
    4. Velocity
    5. Variety
    6. Veracity
    7. Too large to fit into memory
    8. Too large to fit on the drive of a typical machine or on a single server
    9. Big Data and the Analytics Process
    10. Scaling and Distributed Computing
    11. How to Actually Scale?
    12. Bring the Data to the Compute
    13. Bring the Compute to the Data
    14. Introduction to the Spark Platform
    15. History of Spark and Hadoop
    16. Spark vs Hadoop MapReduce
    17. Supported Languages
    18. Pandas API on Spark
    19. Spark Architecture
    20. Spark Architecture: Cluster Manager
    21. Standalone cluster manager
    22. Apache Hadoop YARN
    23. Apache Mesos
    24. Spark Architecture: Driver Process
    25. Spark Architecture: Executor Process and Workers
    26. Spark Building Blocks
    27. Spark SQL and the Catalyst
  2. Introduction to Spark - Setting up a Spark Environment
    1. Set Up On-Premise Spark Environment
    2. Set Up On-Premise Spark Environment (Ubuntu 20.04)
    3. Set Up On-Premise Spark Environment (with Docker)
    4. Set Up Databricks Community Cloud and Compute Cluster
    5. Set Up EMR Cluster and Attach Notebook
  3. Basic Spark Operations and Transformations
    1. Spark Session and Context
    2. Spark Session and Context
    3. Loading Data
    4. Actions and Transformations
    5. More on Actions in Spark
    6. More on Transformations in Spark
    7. Persistence and Caching
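
The operations listed in this module revolve around lazy evaluation and caching. A minimal sketch, assuming the SparkSession `spark` from the earlier example is still running; the data is illustrative:

```python
from pyspark import StorageLevel

# Transformations (filter, map) only build a lineage graph -- nothing runs yet
nums = spark.sparkContext.parallelize(range(1000))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Persist the result so repeated actions do not recompute the whole lineage
evens_squared.persist(StorageLevel.MEMORY_ONLY)

# Actions force evaluation
print(evens_squared.count())   # first action: computes and caches
print(evens_squared.take(5))   # second action: served from the cache

# DataFrames cache the same way
df = spark.range(1_000_000).withColumnRenamed("id", "value")
df.cache()
print(df.where("value % 2 = 0").count())
```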
  4. Introduction to Spark SQL
    1. What is Spark SQL?
    2. What is Spark SQL?
    3. Uniform Data Access with Spark SQL
    4. Integration with cloud storage
    5. Using JDBC Sources
    6. Hive Integration
    7. What is a DataFrame?
    8. Creating a DataFrame in PySpark
    9. Commonly Used DataFrame Methods and Properties in PySpark
    10. Grouping and Aggregation in PySpark
    11. The "DataFrame to RDD" Bridge in PySpark
    12. The SQLContext Object
    13. Examples of Spark SQL / DataFrame (PySpark Example)
    14. Converting an RDD to a DataFrame Example
    15. Example of Reading / Writing a JSON File
    16. Performance, Scalability, and Fault-tolerance of Spark SQL
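
A short PySpark sketch of the Spark SQL topics above (temporary views, SQL queries, the DataFrame-to-RDD bridge, and JSON I/O). The data and the output path are placeholders, and the `spark` session from the earlier sketch is assumed:

```python
from pyspark.sql import Row

# RDD of Rows -> DataFrame (the "converting an RDD to a DataFrame" direction)
rows = spark.sparkContext.parallelize(
    [Row(city="Austin", temp=31), Row(city="Oslo", temp=12)]
)
weather = spark.createDataFrame(rows)

# Register a temporary view and query it with SQL
weather.createOrReplaceTempView("weather")
spark.sql("SELECT city FROM weather WHERE temp > 25").show()

# DataFrame -> RDD bridge
print(weather.rdd.map(lambda r: r.city).collect())

# Reading / writing a JSON file (path is a placeholder)
weather.write.mode("overwrite").json("/tmp/weather_json")
spark.read.json("/tmp/weather_json").printSchema()
```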
  5. Spark's ML Libraries - Introduction to Spark's ML Libraries
    1. Spark MLlib
    2. Algorithms
    3. Classification
    4. Binary Classification Examples
    5. Binary Classification Algorithms
    6. Multi-Class Classification Examples
    7. Multi-Class Classification Algorithms
    8. Multi-Label Classification Examples
    9. Multi-Label Classification Algorithms
    10. Imbalanced Classification Examples
    11. Imbalanced Classification
    12. Regression
    13. Linear Regression
    14. Simple Linear Regression
    15. Multiple Linear Regression
    16. Polynomial Regression
    17. Support Vector Regression
    18. Decision Tree Regression
    19. Random Forest Regression
    20. Feature Engineering
    21. TF-IDF
    22. TF-IDF - PySpark example
    23. Word2Vec
    24. Word2Vec - PySpark example
    25. Count Vectorizer
    26. Count Vectorizer - PySpark example
    27. Feature Transformers of Spark MLlib
    28. Tokenizer
    29. Tokenizer - PySpark example
    30. Stopwords Remover
    31. Stopwords Remover - PySpark example
    32. N-gram
    33. N-gram - PySpark example
    34. Binarizer
    35. Binarizer - PySpark example
    36. Principal Component Analysis
    37. What is PCA used for?
    38. Advantages of PCA
    39. Disadvantages of PCA
    40. PCA - PySpark example
    41. String Indexing
    42. String Indexing - PySpark example
    43. One Hot Encoder
    44. Why is One-Hot Encoding Used for Nominal Data?
    45. One-Hot Encoding - PySpark Example
    46. Bucketizer
    47. Bucketizer - PySpark example
    48. Standardization and Normalization
    49. Difference between Standardization and Normalization
    50. Scalers
    51. Standard Scaler
    52. Robust Scaler
    53. Min Max Scaler
    54. Max Abs Scaler
    55. Imputer
    56. Feature Selectors in Spark MLlib
    57. Vector Slicer
    58. Vector Slicer - PySpark example
    59. Chi-Squared selection
    60. Chi-Squared selection - PySpark example
    61. Univariate Feature Selector
    62. Variance Threshold Selector
    63. Locality Sensitive Hashing
    64. Locality Sensitive Hashing in Spark MLlib
    65. LSH Operations
    66. Locality Sensitive Hashing in Spark MLlib
    67. Bucketed Random Projection for Euclidean Distance
    68. MinHash for Jaccard Distance
    69. Pipeline
    70. Pipeline
    71. Transformer
    72. Estimator
    73. Persistence
    74. Introduction to Hyperparameter Tuning
    75. Hyperparameter tuning
    76. Hyperparameter tuning methods
    77. Random Search
    78. Grid Search
    79. Bayesian Optimization
    80. Hyperparameter Tuning with Spark
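
Many of the building blocks listed in this module (feature transformers, estimators, pipelines, and grid-search tuning) come together in a single Spark MLlib workflow. The sketch below is one hypothetical arrangement using a tiny made-up text dataset and the `spark` session from the earlier sketches; it is not the course's exact lab code:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Tiny labelled text dataset (purely illustrative)
train = spark.createDataFrame(
    [("spark is fast and scalable", 1.0),
     ("the cat sat on the mat", 0.0),
     ("distributed machine learning with spark", 1.0),
     ("rainy weather all week", 0.0),
     ("spark pipelines chain transformers", 1.0),
     ("the dog chased the ball", 0.0)],
    ["text", "label"],
)

# Feature transformers chained into a Pipeline with an estimator at the end
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, lr])

# Grid search over hyperparameters with cross-validation
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.maxIter, [20, 50])
        .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=2,
                    parallelism=2)   # fit candidate models in parallel
model = cv.fit(train)
print(model.avgMetrics)
```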
  6. Streaming and Graphs
    1. Stream Analytics
    2. Tools for Stream Analytics: Kafka
    3. Tools for Stream Analytics: Storm
    4. Tools for Stream Analytics: Flink
    5. Tools for Stream Analytics: Spark
    6. Timestamps in stream analytics
    7. Windowing Operations
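
Structured Streaming's event-time windowing can be sketched with the built-in `rate` test source, which emits `timestamp`/`value` rows. The window and watermark durations below are arbitrary choices for illustration, and the `spark` session from earlier is assumed:

```python
from pyspark.sql.functions import window, col

# The built-in "rate" source generates (timestamp, value) rows for testing
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 5)
          .load())

# Event-time windowing with a watermark to bound how late data may arrive
counts = (events
          .withWatermark("timestamp", "30 seconds")
          .groupBy(window(col("timestamp"), "10 seconds"))
          .count())

# Write windowed counts to the console, then stop after a short demo period
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination(30)
query.stop()
```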
  7. Deploying Spark ML Artifacts - Introduction to Deploying Spark ML Artifacts
    1. How Does the Spark System Work?
    2. What is Deployment?
    3. Spark Deployment Artifacts
    4. Packaging Spark (ML) for Production
    5. Deploy Spark ML to EMR
    6. Deploy Spark (ML) with SageMaker
    7. Serving and Updating Spark ML Models
    8. Model Versioning
    9. Model Versioning with AWS Model Registry
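
As a rough illustration of packaging a Spark ML artifact, a fitted pipeline can be saved and later reloaded for serving. The snippet assumes the hypothetical cross-validated `model` and `train` DataFrame from the Module 5 sketch, and the path is a placeholder (on EMR or SageMaker it would typically be an S3 URI):

```python
from pyspark.ml import PipelineModel

# Persist the best fitted pipeline as a versioned deployment artifact
model_path = "/tmp/spark-ml-model-v1"
model.bestModel.write().overwrite().save(model_path)

# Later, e.g. in a batch-scoring job, reload the artifact and serve it
loaded = PipelineModel.load(model_path)
loaded.transform(train.select("text")).select("text", "prediction").show()
```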
  8. Machine Learning at Scale - Introduction to Machine Learning at Scale
    1. Introduction to Scalability
    2. Common Reasons for Scaling Up ML Systems
    3. How to Avoid Scaling Infrastructure?
    4. Benefits of ML at Scale
    5. Challenges in ML Scalability
    6. Data Complexities - Challenges
    7. ML System Engineering - Challenges
    8. Integration Risks - Challenges
    9. Collaboration Issues - Challenges
  9. Machine Learning at Scale - Distributed Training of Machine Learning Models
    1. Introduction to Distributed Training
    2. Data Parallelism
    3. Steps of Data Parallelism
    4. Data Parallelism vs Random Forest
    5. Model Parallelism
    6. Frameworks for Implementing Distributed ML
    7. Introduction to Distributed Training vs Distributed Inference
    8. Introduction to Training
    9. Introduction to Inference
    10. Key components of Inference
    11. Inference Challenges
    12. Inference Challenges - Latency
    13. Inference Challenges - Interoperability
    14. Inference Challenges - Infrastructure Cost
    15. Training vs Inference
    16. Introduction to GPUs
    17. GPU
    18. GPU for Training
    19. GPU for Inference
    20. Inference - Hardware
    21. AWS Inferentia Chip
    22. AWS Inferentia Chip vs GPU
  10. Machine Learning at Scale - Hyperparameter Tuning and Model Selection at Scale
    1. Hyperparameter Tuning at Scale
    2. Hyperparameter Tuning Challenges
    3. Distributed Hyperparameter Tuning
    4. Bayesian Optimization
    5. Distributed Hyperparameter Tuning
    6. Spark Based Tools
    7. TensorFlowOnSpark
    8. Advantages of TensorFlowOnSpark
    9. BigDL
    10. Advantages of BigDL
    11. Horovod
    12. Advantages of Horovod
    13. H2O Sparkling Water
    14. Advantages of Sparkling Water over H2O

Required Prerequisites

  • Basic understanding of Python programming
  • Familiarity with data processing and analysis concepts
  • Familiarity with Python Pandas
  • Familiarity with basic machine learning concepts and algorithms is recommended