Applied Data Science with Python

Applied Data Science with Python Courseware (WA2715)

The course includes a deep dive into Python for data science, analytics, and data visualization, as well as an introduction to Python for data engineering. Each chapter is reinforced with practical labs in which students apply their theoretical knowledge to real-world problems.

If you’re an analyst, developer, architect, or technical manager, you will likely need to use Python for data science, business analytics, and data logistics. This intensive 2-day course covers the core concepts of Python, both theoretical and practical, and shows how the language applies to these areas.

Benefits

  • Using Jupyter Notebook
  • Understanding Python
  • Understanding NumPy
  • Understanding pandas
  • Repairing and Normalizing Data
  • Data Visualization in Python
  • Data Splitting
  • The Random Forest Algorithm
  • The k-Means Algorithm

Outline

  1. Python for Data Science
    1. In-Class Discussion
    2. Python Data Science-Centric Libraries
    3. NumPy
    4. NumPy Arrays
    5. Select NumPy Operations
    6. SciPy
    7. pandas
    8. Creating a pandas DataFrame
    9. Fetching and Sorting Data
    10. Scikit-learn
    11. Matplotlib
    12. Seaborn
    13. Python Dev Tools and REPLs
    14. IPython
    15. Jupyter
    16. Jupyter Operation Modes
    17. Jupyter Common Commands
    18. Anaconda
    19. Summary
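
For a flavor of this chapter's material, here is a minimal, illustrative sketch of NumPy arrays and a pandas DataFrame; the column names and values are invented for illustration and are not taken from the course labs:

    import numpy as np
    import pandas as pd

    # A NumPy array and a few select operations
    a = np.array([3, 1, 4, 1, 5, 9])
    print(a.mean(), a.reshape(2, 3))

    # Creating a pandas DataFrame, then fetching and sorting data
    df = pd.DataFrame({"name": ["Ann", "Bob", "Cal"], "score": [88, 72, 95]})
    print(df.sort_values("score", ascending=False).head())
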
  2. Defining Data Science
    1. What is Data Science?
    2. Data Science, Machine Learning, AI?
    3. The Data-Related Roles
    4. The Data Science Ecosystem
    5. Tools of the Trade
    6. Who is a Data Scientist?
    7. Data Scientists at Work
    8. Examples of Data Science Projects
    9. An Example of a Data Product
    10. Applied Data Science at Google
    11. Data Science Gotchas
    12. Summary
  3. Data Processing Phases
    1. Typical Data Processing Pipeline
    2. Data Discovery Phase
    3. Data Harvesting Phase
    4. Data Priming Phase
    5. Exploratory Data Analysis
    6. Model Planning Phase
    7. Model Building Phase
    8. Communicating the Results
    9. Production Roll-out
    10. Data Logistics and Data Governance
    11. Data Processing Workflow Engines
    12. Apache Airflow
    13. Data Lineage and Provenance
    14. Apache NiFi
    15. Summary
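
For illustration only, a skeletal workflow of the kind discussed under data processing workflow engines, sketched with Apache Airflow (assuming Airflow 2.x); the DAG id, task names, and schedule are placeholders, not part of the course materials:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def harvest():
        print("harvesting data ...")   # stand-in for the Data Harvesting phase

    def prime():
        print("priming data ...")      # stand-in for the Data Priming phase

    with DAG(dag_id="example_pipeline",
             start_date=datetime(2024, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        harvest_task = PythonOperator(task_id="harvest", python_callable=harvest)
        prime_task = PythonOperator(task_id="prime", python_callable=prime)
        harvest_task >> prime_task     # harvest runs before prime
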
  4. Descriptive Statistics Computing Features in Python
    1. Descriptive Statistics
    2. Non-uniformity of a Probability Distribution
    3. Using NumPy for Calculating Descriptive Statistics Measures
    4. Finding Min and Max in NumPy
    5. Using pandas for Calculating Descriptive Statistics Measures
    6. Correlation
    7. Regression and Correlation
    8. Covariance
    9. Getting Pairwise Correlation and Covariance Measures
    10. Finding Min and Max in pandas DataFrame
    11. Summary
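
As an illustration of the measures listed above, a minimal sketch using NumPy and pandas; the sample numbers and column names are made up for the example:

    import numpy as np
    import pandas as pd

    x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])
    print(x.min(), x.max())            # finding min and max in NumPy
    print(x.mean(), x.std())           # central tendency and spread

    df = pd.DataFrame({"height": [150, 160, 170, 180],
                       "weight": [55, 62, 74, 81]})
    print(df.describe())               # per-column descriptive statistics
    print(df.corr())                   # pairwise correlation
    print(df.cov())                    # pairwise covariance
    print(df.min(), df.max())          # finding min and max in a pandas DataFrame
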
  5. Repairing and Normalizing Data
    1. Repairing and Normalizing Data
    2. Dealing with the Missing Data
    3. Sample Data Set
    4. Getting Info on Null Data
    5. Dropping a Column
    6. Interpolating Missing Data in pandas
    7. Replacing the Missing Values with the Mean Value
    8. Scaling (Normalizing) the Data
    9. Data Preprocessing with scikit-learn
    10. Scaling with the scale() Function
    11. The MinMaxScaler Object
    12. Summary
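
A minimal sketch of the repair and normalization steps named above, assuming pandas and scikit-learn; the toy DataFrame is invented for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, scale

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                       "b": [10.0, 20.0, np.nan, 40.0]})
    print(df.isnull().sum())                   # getting info on null data
    df["a"] = df["a"].interpolate()            # interpolating missing data
    df["b"] = df["b"].fillna(df["b"].mean())   # replacing missing values with the mean
    print(scale(df))                           # z-score scaling with the scale() function
    print(MinMaxScaler().fit_transform(df))    # scaling to the [0, 1] range
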
  6. Data Visualization in Python
    1. Data Visualization
    2. Data Visualization in Python
    3. Matplotlib
    4. Getting Started with matplotlib
    5. The matplotlib.pyplot.plot() Function
    6. The matplotlib.pyplot.bar() Function
    7. The matplotlib.pyplot.pie() Function
    8. Subplots
    9. Using the matplotlib.gridspec.GridSpec Object
    10. The matplotlib.pyplot.subplot() Function
    11. Figures
    12. Saving Figures to a File
    13. Seaborn
    14. Getting Started with seaborn
    15. Histograms and KDE
    16. Plotting Bivariate Distributions
    17. Scatter plots in seaborn
    18. Pair plots in seaborn
    19. Heatmaps
    20. ggplot
    21. Summary
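
To illustrate the plotting APIs covered in this chapter, a minimal matplotlib and seaborn sketch; the data and output file name are placeholders:

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    x = np.linspace(0, 10, 100)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))  # two subplots in one figure
    ax1.plot(x, np.sin(x))                                # a line plot
    ax2.bar(["a", "b", "c"], [3, 7, 5])                   # a bar chart
    fig.savefig("line_and_bar.png")                       # saving a figure to a file

    plt.figure()                                          # a new figure for seaborn
    sns.histplot(np.random.randn(500), kde=True)          # histogram with a KDE overlay
    plt.show()
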
  7. Data Science and ML Algorithms in scikit-learn
    1. In-Class Discussion
    2. Types of Machine Learning
    3. Terminology: Features and Observations
    4. Representing Observations
    5. Terminology: Labels
    6. Terminology: Continuous and Categorical Features
    7. Continuous Features
    8. Categorical Features
    9. Common Distance Metrics
    10. The Euclidean Distance
    11. What is a Model
    12. Supervised vs Unsupervised Machine Learning
    13. Supervised Machine Learning Algorithms
    14. Unsupervised Machine Learning Algorithms
    15. Choosing the Right Algorithm
    16. The scikit-learn Package
    17. scikit-learn Estimators, Models, and Predictors
    18. Model Evaluation
    19. The Error Rate
    20. Confusion Matrix
    21. The Binary Classification Confusion Matrix
    22. Multi-class Classification Confusion Matrix Example
    23. ROC Curve
    24. Example of an ROC Curve
    25. The AUC Metric
    26. Feature Engineering
    27. Scaling of the Features
    28. Feature Blending (Creating Synthetic Features)
    29. The 'One-Hot' Encoding Scheme
    30. Example of 'One-Hot' Encoding Scheme
    31. Bias-Variance (Underfitting vs Overfitting) Trade-off
    32. The Modeling Error Factors
    33. One Way to Visualize Bias and Variance
    34. Underfitting vs Overfitting Visualization
    35. Balancing Off the Bias-Variance Ratio
    36. Regularization in scikit-learn
    37. Regularization, Take Two
    38. Dimensionality Reduction
    39. PCA and Isomap
    40. The Advantages of Dimensionality Reduction
    41. The LIBSVM format
    42. Life-cycles of Machine Learning Development
    43. Data Splitting into Training and Test Datasets
    44. ML Model Tuning Visually
    45. Data Splitting in scikit-learn
    46. Cross-Validation Technique
    47. Hands-on Exercise
    48. Classification (Supervised ML) Examples
    49. Classifying with k-Nearest Neighbors
    50. k-Nearest Neighbors Algorithm
    51. k-Nearest Neighbors Algorithm
    52. Hands-on Exercise
    53. Regression Analysis
    54. Regression vs Correlation
    55. Regression vs Classification
    56. Simple Linear Regression Model
    57. Linear Regression Illustration
    58. Least-Squares Method (LSM)
    59. Gradient Descent Optimization
    60. Multiple Regression Analysis
    61. Evaluating Regression Model Accuracy
    62. The R2 Model Score
    63. The MSE Model Score
    64. Logistic Regression (Logit)
    65. Interpreting Logistic Regression Results
    66. Decision Trees
    67. Decision Tree Terminology
    68. Properties of Decision Trees
    69. Decision Tree Classification in the Context of Information Theory
    70. The Simplified Decision Tree Algorithm
    71. Using Decision Trees
    72. Random Forests
    73. Hands-On Exercise
    74. Hands-on Exercise
    75. Support Vector Machines (SVMs)
    76. Naive Bayes Classifier (SL)
    77. Naive Bayesian Probabilistic Model in a Nutshell
    78. Bayes Formula
    79. Classification of Documents with Naive Bayes
    80. Unsupervised Learning Type: Clustering
    81. Clustering Examples
    82. k-Means Clustering (UL)
    83. k-Means Clustering in a Nutshell
    84. k-Means Characteristics
    85. Global vs Local Minimum Explained
    86. Hands-On Exercise
    87. XGBoost
    88. Gradient Boosting
    89. Hands-On Exercise
    90. A Better Algorithm or More Data?
    91. Summary
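
For illustration, a compact scikit-learn sketch touching several of the topics above (data splitting, a random forest classifier, a confusion matrix, and k-means clustering); the Iris dataset and parameter values are chosen for the example, not prescribed by the course:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Data splitting into training and test datasets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Supervised learning: a random forest classifier
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
    pred = model.predict(X_test)
    print(accuracy_score(y_test, pred))        # model evaluation
    print(confusion_matrix(y_test, pred))      # multi-class confusion matrix

    # Unsupervised learning: k-means clustering
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.labels_[:10])                     # cluster assignments for the first 10 rows
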
  8. (Optional) Quick Introduction to Python for Data Engineers
    1. What is Python?
    2. Additional Documentation
    3. Which version of Python am I running?
    4. Python Dev Tools and REPLs
    5. IPython
    6. Jupyter
    7. Jupyter Operation Modes
    8. Jupyter Common Commands
    9. Anaconda
    10. Python Variables and Basic Syntax
    11. Variable Scopes
    12. PEP8
    13. The Python Programs
    14. Getting Help
    15. Variable Types
    16. Assigning Multiple Values to Multiple Variables
    17. Null (None)
    18. Strings
    19. Finding Index of a Substring
    20. String Splitting
    21. Triple-Delimited String Literals
    22. Raw String Literals
    23. String Formatting and Interpolation
    24. Boolean
    25. Boolean Operators
    26. Numbers
    27. Looking Up the Runtime Type of a Variable
    28. Divisions
    29. Assignment-with-Operation
    30. Comments
    31. Relational Operators
    32. The if-elif-else Triad
    33. An if-elif-else Example
    34. Conditional Expressions (a.k.a. Ternary Operator)
    35. The While-Break-Continue Triad
    36. The for Loop
    37. try-except-finally
    38. Lists
    39. Main List Methods
    40. Dictionaries
    41. Working with Dictionaries
    42. Sets
    43. Common Set Operations
    44. Set Operations Examples
    45. Finding Unique Elements in a List
    46. Enumerate
    47. Tuples
    48. Unpacking Tuples
    49. Functions
    50. Dealing with Arbitrary Number of Parameters
    51. Keyword Function Parameters
    52. The range Object
    53. Random Numbers
    54. Python Modules
    55. Importing Modules
    56. Installing Modules
    57. Listing Methods in a Module
    58. Creating Your Own Modules
    59. Creating a Runnable Application
    60. List Comprehension
    61. Zipping Lists
    62. Working with Files
    63. Reading and Writing Files
    64. Reading Command-Line Parameters
    65. Accessing Environment Variables
    66. What is Functional Programming (FP)?
    67. Terminology: Higher-Order Functions
    68. Lambda Functions in Python
    69. Example: Lambdas in the Sorted Function
    70. Other Examples of Using Lambdas
    71. Regular Expressions
    72. Using Regular Expressions Examples
    73. Python Data Science-Centric Libraries
    74. Summary
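
A small, self-contained sketch of the core Python constructs this optional chapter reviews (conditionals, loops, collections, comprehensions, functions, and lambdas); the sample values are arbitrary:

    scores = [88, 72, 95, 72]

    for s in scores:                                      # the for loop with if-elif-else
        if s >= 90:
            grade = "A"
        elif s >= 80:
            grade = "B"
        else:
            grade = "C"
        print(s, grade)

    print(set(scores))                                    # finding unique elements in a list
    print([s * s for s in scores])                        # list comprehension
    by_name = {"Ann": 88, "Bob": 72}                      # a dictionary
    print(sorted(by_name.items(), key=lambda kv: kv[1]))  # a lambda in the sorted() function

    def describe(*args, **kwargs):                        # arbitrary and keyword parameters
        return f"{len(args)} positional, {len(kwargs)} keyword"

    print(describe(1, 2, 3, name="test"))
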
  9. Lab Exercises
    1. Using Jupyter Notebook
    2. Understanding Python
    3. Understanding NumPy
    4. Understanding pandas
    5. Repairing and Normalizing Data
    6. Data Visualization in Python
    7. Data Visualization in Python Project
    8. Data Splitting
    9. The k-Nearest Neighbors Algorithm [OPTIONAL]
    10. The Random Forest Algorithm
    11. Spam Detection with Random Forest Project [OPTIONAL]
    12. The k-Means Algorithm
    13. Building Regression Models with XGBoost Library
    14. A Hand-Made Classifier Project [OPTIONAL]

Required Prerequisites

Participants should have a working knowledge of Python (or a programming background and the ability to pick up Python’s syntax quickly) and be familiar with core statistical concepts (variance, correlation, etc.).