Data Engineering Bootcamp Training using Python and PySpark

Data Engineering Bootcamp Training using Python and PySpark Courseware (WA3020)

This hands-on Data Engineering Bootcamp teaches attendees the foundations of data engineering using Python and Spark SQL. Students learn how to build production-ready data-driven solutions and gain a comprehensive understanding of data engineering.

Benefits

  • Data Availability and Consistency
  • A/B Testing Data Engineering Tasks Project
  • Learning the Databricks Community Cloud Lab Environment
  • Python Variables
  • Dates and Times
  • The if, for, and try Constructs
  • Dictionaries
  • Sets, Tuples
  • Functions, Functional Programming
  • Understanding NumPy and pandas
  • PySpark

Outline

  1. Big Data Concepts and Systems Overview for Data Engineers
    1. Gartner's Definition of Big Data
    2. The Big Data Confluence Diagram
    3. A Practical Definition of Big Data
    4. Challenges Posed by Big Data
    5. The Traditional Client–Server Processing Pattern
    6. Enter Distributed Computing
    7. Data Physics
    8. Data Locality (Distributed Computing Economics)
    9. The CAP Theorem
    10. Mechanisms to Guarantee a Single CAP Property
    11. Eventual Consistency
    12. NoSQL Systems CAP Triangle
    13. Big Data Sharding
    14. Sharding Example
    15. Apache Hadoop
    16. Hadoop Ecosystem Projects
    17. Other Hadoop Ecosystem Projects
    18. Hadoop Design Principles
    19. Hadoop's Main Components
    20. Hadoop Simple Definition
    21. Hadoop Component Diagram
    22. HDFS
    23. Storing Raw Data in HDFS and Schema-on-Demand
    24. MapReduce Defined
    25. MapReduce Shared-Nothing Architecture
    26. MapReduce Phases
    27. The Map Phase
    28. The Reduce Phase
    29. Similarity with SQL Aggregation Operations
    30. Summary
  2. Defining Data Engineering
    1. Data is King
    2. Translating Data into Operational and Business Insights
    3. What is Data Engineering
    4. The Data-Related Roles
    5. The Data Science Skill Sets
    6. The Data Engineer Role
    7. Core Skills and Competencies
    8. An Example of a Data Product
    9. What is Data Wrangling (Munging)?
    10. The Data Exchange Interoperability Options
    11. Summary
  3. Data Processing Phases
    1. Typical Data Processing Pipeline
    2. Data Discovery Phase
    3. Data Harvesting Phase
    4. Data Priming Phase
    5. Exploratory Data Analysis
    6. Model Planning Phase
    7. Model Building Phase
    8. Communicating the Results
    9. Production Roll-out
    10. Data Logistics and Data Governance
    11. Data Processing Workflow Engines
    12. Apache Airflow
    13. Data Lineage and Provenance
    14. Apache NiFi
    15. Summary
  4. Python 3 Introduction
    1. What is Python?
    2. Python Documentation
    3. Where Can I Use Python?
    4. Which version of Python am I running?
    5. Running Python Programs
    6. Python Shell
    7. Dev Tools and REPLs
    8. IPython
    9. Jupyter
    10. Hands-On Exercise
    11. The Anaconda Python Distribution
    12. Summary
  5. Python Variables and Types
    1. Variables and Types
    2. More on Variables
    3. Assigning Multiple Values to Multiple Variables
    4. More on Types
    5. Variable Scopes
    6. The Layout of Python Programs
    7. Comments and Triple-Delimited String Literals
    8. Sample Python Code
    9. PEP8
    10. Getting Help on Python Objects
    11. Null (None)
    12. Strings
    13. Finding Index of a Substring
    14. String Splitting
    15. Raw String Literals
    16. String Formatting and Interpolation
    17. String Public Method Names
    18. The Boolean Type
    19. Boolean Operators
    20. Relational Operators
    21. Numbers
    22. "Easy Numbers"
    23. Looking Up the Runtime Type of a Variable
    24. Divisions
    25. Assignment-with-Operation
    26. Hands-On Exercise
    27. Dates and Times
    28. Hands-On Exercise
    29. Summary
  6. Control Statements and Data Collections
    1. Control Flow with the if-elif-else Triad
    2. An if-elif-else Example
    3. Conditional Expressions (a.k.a. Ternary Operator)
    4. The while-break-continue Triad
    5. The for Loop
    6. The range() Function
    7. Examples of Using range()
    8. The try-except-finally Construct
    9. Hands-On Exercise
    10. The assert Expression
    11. Lists
    12. Main List Methods
    13. List Comprehension
    14. Zipping Lists
    15. Enumerate
    16. Hands-On Exercise
    17. Dictionaries
    18. Working with Dictionaries
    19. Other Dictionary Methods
    20. Sets
    21. Set Methods
    22. Set Operations
    23. Set Operations Examples
    24. Finding Unique Elements in a List
    25. Common Collection Functions and Operators
    26. Hands-On Exercise
    27. Tuples
    28. Unpacking Tuples
    29. Hands-On Exercise
    30. Summary
  7. Functions and Modules
    1. Built-in Functions
    2. Functions
    3. The "Call by Sharing" Parameter Passing
    4. Global and Local Variable Scopes
    5. Default Parameters
    6. Named Parameters
    7. Dealing with Arbitrary Number of Parameters
    8. Keyword Function Parameters
    9. Hands-On Exercise
    10. What is Functional Programming (FP)?
    11. Concept: Pure Functions
    12. Concept: Recursion
    13. Concept: Higher-Order Functions
    14. Lambda Functions in Python
    15. Examples of Using Lambdas
    16. Lambdas in the Sorted Function
    17. Hands-On Exercise
    18. Python Modules
    19. Importing Modules
    20. Installing Modules
    21. Listing Methods in a Module
    22. Creating Your Own Modules
    23. Creating a Module's Entry Point
    24. Summary
  8. File I/O and Useful Modules
    1. Reading Command-Line Parameters
    2. Hands-On Exercise (N/A in DCC)
    3. Working with Files
    4. Reading and Writing Files
    5. Hands-On Exercise
    6. Hands-On Exercise
    7. Random Numbers
    8. Hands-On Exercise
    9. Regular Expressions
    10. The re Object Methods
    11. Using Regular Expressions Examples
    12. Hands-On Exercise
    13. Summary
  9. Practical Introduction to NumPy
    1. NumPy
    2. The First Take on NumPy Arrays
    3. The ndarray Data Structure
    4. Getting Help
    5. Understanding Axes
    6. Indexing Elements in a NumPy Array
    7. Understanding Types
    8. Re-Shaping
    9. Commonly Used Array Metrics
    10. Commonly Used Aggregate Functions
    11. Sorting Arrays
    12. Vectorization
    13. Vectorization Visually
    14. Broadcasting
    15. Broadcasting Visually
    16. Filtering
    17. Array Arithmetic Operations
    18. Reductions: Finding the Sum of Elements by Axis
    19. Array Slicing
    20. 2-D Array Slicing
    21. Slicing and Stepping Through
    22. The Linear Algebra Functions
    23. Summary
  10. Practical Introduction to pandas
    1. What is pandas?
    2. The Series Object
    3. Accessing Values and Indexes in Series
    4. Setting Up Your Own Index
    5. Using the Series Index as a Lookup Key
    6. Can I Pack a Python Dictionary into a Series?
    7. The DataFrame Object
    8. The DataFrame's Value Proposition
    9. Creating a pandas DataFrame
    10. Getting DataFrame Metrics
    11. Accessing DataFrame Columns
    12. Accessing DataFrame Rows
    13. Accessing DataFrame Cells
    14. Using iloc
    15. Using loc
    16. Examples of Using loc
    17. DataFrames are Mutable via Object Reference!
    18. The Axes
    19. Deleting Rows and Columns
    20. Adding a New Column to a DataFrame
    21. Appending / Concatenating DataFrame and Series Objects
    22. Example of Appending / Concatenating DataFrames
    23. Re-indexing Series and DataFrames
    24. Getting Descriptive Statistics of DataFrame Columns
    25. Navigating Rows and Columns For Data Reduction
    26. Getting Descriptive Statistics of DataFrames
    27. Applying a Function
    28. Sorting DataFrames
    29. Reading From CSV Files
    30. Writing to the System Clipboard
    31. Writing to a CSV File
    32. Fine-Tuning the Column Data Types
    33. Changing the Type of a Column
    34. What May Go Wrong with Type Conversion
    35. Summary
  11. Data Grouping and Aggregation with pandas
    1. Data Aggregation and Grouping
    2. Sample Data Set
    3. The pandas.core.groupby.SeriesGroupBy Object
    4. Grouping by Two or More Columns
    5. Emulating SQL's WHERE Clause
    6. The Pivot Tables
    7. Cross-Tabulation
    8. Summary
  12. Repairing and Normalizing Data
    1. Repairing and Normalizing Data
    2. Dealing with the Missing Data
    3. Sample Data Set
    4. Getting Info on Null Data
    5. Dropping a Column
    6. Interpolating Missing Data in pandas
    7. Replacing the Missing Values with the Mean Value
    8. Scaling (Normalizing) the Data
    9. Data Preprocessing with scikit-learn
    10. Scaling with the scale() Function
    11. The MinMaxScaler Object
    12. Summary
  13. Data Visualization in Python
    1. Data Visualization
    2. Data Visualization in Python
    3. Matplotlib
    4. Getting Started with matplotlib
    5. The matplotlib.pyplot.plot() Function
    6. The matplotlib.pyplot.bar() Function
    7. The matplotlib.pyplot.pie() Function
    8. The matplotlib.pyplot.subplot() Function
    9. A Subplot Example
    10. Figures
    11. Saving Figures to a File
    12. Seaborn
    13. Getting Started with seaborn
    14. Histograms and KDE
    15. Plotting Bivariate Distributions
    16. Scatter plots in seaborn
    17. Pair plots in seaborn
    18. Heatmaps
    19. A Seaborn Scatterplot with Varying Point Sizes and Hues
    20. ggplot
    21. Summary
  14. Python as a Cloud Scripting Language
    1. Python's Value
    2. Python on AWS
    3. AWS SDK For Python (boto3)
    4. What is Serverless Computing?
    5. How Functions Work
    6. The AWS Lambda Event Handler
    7. What is AWS Glue?
    8. PySpark on Glue - Sample Script
    9. Summary
  15. Introduction to Apache Spark
    1. What is Apache Spark
    2. The Spark Platform
    3. Spark vs Hadoop's MapReduce (MR)
    4. Common Spark Use Cases
    5. Languages Supported by Spark
    6. Running Spark on a Cluster
    7. The Spark Application Architecture
    8. The Driver Process
    9. The Executor and Worker Processes
    10. Spark Shell
    11. Jupyter Notebook Shell Environment
    12. Spark Applications
    13. The spark-submit Tool
    14. The spark-submit Tool Configuration
    15. Interfaces with Data Storage Systems
    16. The Resilient Distributed Dataset (RDD)
    17. Datasets and DataFrames
    18. Spark SQL, DataFrames, and Catalyst Optimizer
    19. Project Tungsten
    20. Spark Machine Learning Library
    21. Spark (Structured) Streaming
    22. GraphX
    23. Extending Spark Environment with Custom Modules and Files
    24. Spark 3
    25. Spark 3 Updates at a Glance
    26. Summary
  16. The Spark Shell
    1. The Spark Shell
    2. The Spark v2+ Command-Line Shells
    3. The Spark Shell UI
    4. Spark Shell Options
    5. Getting Help
    6. Jupyter Notebook Shell Environment
    7. Example of a Jupyter Notebook Web UI (Databricks Cloud)
    8. The Spark Context (sc) and Spark Session (spark)
    9. Creating a Spark Session Object in Spark Applications
    10. The Shell Spark Context Object (sc)
    11. The Shell Spark Session Object (spark)
    12. Loading Files
    13. Saving Files
    14. Summary
  17. Spark RDDs
    1. The Resilient Distributed Dataset (RDD)
    2. Ways to Create an RDD
    3. Supported Data Types
    4. RDD Operations
    5. RDDs are Immutable
    6. Spark Actions
    7. RDD Transformations
    8. Other RDD Operations
    9. Chaining RDD Operations
    10. RDD Lineage
    11. The Big Picture
    12. What May Go Wrong
    13. Miscellaneous Pair RDD Operations
    14. RDD Caching
    15. Summary
  18. Parallel Data Processing with Spark
    1. Running Spark on a Cluster
    2. Data Partitioning
    3. Data Partitioning Diagram
    4. Single Local File System RDD Partitioning
    5. Multiple File RDD Partitioning
    6. Special Cases for Small-sized Files
    7. Parallel Data Processing of Partitions
    8. Spark Application, Jobs, and Tasks
    9. Stages and Shuffles
    10. The "Big Picture"
    11. Summary
  19. Introduction to Spark SQL
    1. What is Spark SQL?
    2. Uniform Data Access with Spark SQL
    3. Using JDBC Sources
    4. Hive Integration
    5. What is a DataFrame?
    6. Creating a DataFrame in PySpark
    7. Creating a DataFrame in PySpark (Cont'd)
    8. Commonly Used DataFrame Methods and Properties in PySpark
    9. Commonly Used DataFrame Methods and Properties in PySpark (Cont'd)
    10. Grouping and Aggregation in PySpark
    11. The "DataFrame to RDD" Bridge in PySpark
    12. The SQLContext Object
    13. Examples of Spark SQL / DataFrame (PySpark Example)
    14. Converting an RDD to a DataFrame Example
    15. Example of Reading / Writing a JSON File
    16. Performance, Scalability, and Fault-tolerance of Spark SQL
    17. Summary
  20. Lab Exercises
    1. Data Availability and Consistency
    2. A/B Testing Data Engineering Tasks Project
    3. Learning the Databricks Community Cloud Lab Environment
    4. Python Variables
    5. Dates and Times
    6. The if, for, and try Constructs
    7. Understanding Lists
    8. Dictionaries
    9. Sets
    10. Tuples
    11. Functions
    12. Functional Programming
    13. File I/O
    14. Using HTTP and JSON
    15. Random Numbers
    16. Regular Expressions
    17. Understanding NumPy
    18. A NumPy Project
    19. Understanding pandas
    20. Data Grouping and Aggregation
    21. Repairing and Normalizing Data
    22. Data Visualization and EDA with pandas and seaborn
    23. Correlating Cause and Effect
    24. Learning PySpark Shell Environment
    25. Understanding Spark DataFrames
    26. Learning the PySpark DataFrame API
    27. Data Repair and Normalization in PySpark
    28. Working with Parquet File Format in PySpark and pandas

Required Prerequisites

Some working experience in any programming language; students will be introduced to programming in Python. A basic understanding of SQL and data processing concepts, including data grouping and aggregation, is also required.

License

Length: 5 days | $350.00 per copy

What is Included?
  • Student Manual
  • Extra Trainer Files
  • PowerPoint Presentation