# 1: Introduction to Machine Learning

In [None]:
# Use this cell if you run the notebook on Google Colab
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

In [None]:
!pip install mglearn

## Imports

In [None]:
import glob
import os
import re
import sys
from collections import Counter, defaultdict
# Make sure to change the directory to where you store the folder
sys.path.append("/content/drive/MyDrive/50603/code")
os.chdir('/content/drive/MyDrive/50603')

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import graphviz
import IPython
import mglearn
from IPython.display import HTML, Image, display
from plotting_functions import *
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from utils import *

plt.rcParams["font.size"] = 16
pd.set_option("display.max_colwidth", 200)

<br><br>

## Why machine learning (ML)? [[video](https://www.youtube.com/watch?v=-1hTcS5ZE4w&t=1s)]

Check out [the accompanying video](https://www.youtube.com/watch?v=-1hTcS5ZE4w&t=1s) on this material.

### Prevalence of ML

Let's look at some examples.

<!-- <img src="/content/drive/MyDrive/additional/img/ml-examples.png" height="1000" width="1000">
![](/content/drive/MyDrive/additional/img/ml-examples.png) -->


In [None]:
p = 'img/ml-examples.png'
display(Image(filename=p, width=800))

- Image sources
    - [Voice assistants](https://geeksfl.com/blog/best-voice-assistant/)
    - [Google News](https://news.google.com)    
    - [Recommendation systems](https://en.wikipedia.org/wiki/Recommender_system)
    - [Face Recognition source](https://startupleague.online/blog/3dss-tech-facial-recognition-technology/)
    - [Auto-completion](https://9to5google.com/2020/08/10/android-11-autofill-keyboard/)
    - [Stock market prediction](https://hbr.org/2019/12/what-machine-learning-will-mean-for-asset-managers)    
    - [Character recognition](https://en.wikipedia.org/wiki/Handwriting_recognition)    
    - [AlphaGo](https://deepmind.com/alphago-china)
    - [Self-driving cars](https://mc.ai/artificial-intelligence-in-self-driving-cars%E2%80%8A-%E2%80%8Ahow-far-have-we-gotten/)
    - [Drug discovery](https://www.nature.com/articles/d41586-018-05267-x)
    - [Cancer detection](https://venturebeat.com/2018/10/12/google-ai-claims-99-accuracy-in-metastatic-breast-cancer-detection/)

### Saving time and scaling products

- Imagine writing a program for **spam identification**, i.e., whether an email is spam or non-spam.
- *Traditional programming*
    - Come up with **rules** using human understanding of spam messages.
    - This is time-consuming, and it is hard to come up with a robust set of rules.
- *Machine learning*
    - Collect a large amount of **data of spam and non-spam** emails and let the machine learning algorithm figure out the rules.
- With machine learning, you're likely to
    - **Save time**
    - Customize and **scale** products
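
To make the contrast above concrete, here is a minimal sketch that uses only the Python standard library. The messages, the hand-written keyword set, and the "more spam than ham" learning rule are all hypothetical simplifications chosen for illustration; real spam filters are far more sophisticated.

```python
from collections import Counter

# Hypothetical toy messages; labels are "spam" or "ham" (non-spam).
train_messages = [
    ("win a free prize now", "spam"),
    ("free entry in a weekly competition", "spam"),
    ("claim your free cash prize", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("call me when you get home", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]


def rule_based_is_spam(message):
    """Traditional programming: a human writes the rules by hand."""
    hand_written_keywords = {"free", "prize", "winner"}
    return any(word in hand_written_keywords for word in message.lower().split())


def learn_spam_words(labelled_messages):
    """Toy 'learning': a word is a spam indicator if it occurs in more
    spam messages than ham messages in the labelled data."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in labelled_messages:
        counts = spam_counts if label == "spam" else ham_counts
        counts.update(set(text.lower().split()))
    return {word for word in spam_counts if spam_counts[word] > ham_counts[word]}


def learned_is_spam(message, spam_words):
    """Flag a message when more than half of its words are spam indicators."""
    words = message.lower().split()
    hits = sum(word in spam_words for word in words)
    return hits > len(words) / 2


spam_words = learn_spam_words(train_messages)

# The hand-written rules miss this message; the learned word set catches it.
print(rule_based_is_spam("claim your cash"))
print(learned_is_spam("claim your cash", spam_words))
```

Notice that improving the rule-based version means editing code, whereas improving the learned version means collecting more labelled examples.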

### (Supervised) machine learning: popular definition
<blockquote>
A field of study that gives computers the ability to learn without being explicitly programmed. <br> -- Arthur Samuel (1959)
</blockquote>

ML is a different way to think about problem solving.

<!-- ![](img/traditional-programming-vs-ML.png)
<img src="img/traditional-programming-vs-ML.png" height="700" width="700">  -->

In [None]:
p = 'img/traditional-programming-vs-ML.png'
display(Image(filename=p, width=800))

<br><br>

## Supervised machine learning

### Types of machine learning

Here are some typical learning problems.

- Supervised learning ([Gmail spam filtering](https://support.google.com/a/answer/2368132?hl=en))
    - Training a model from input data and its corresponding targets to predict targets for new examples.     
- Unsupervised learning ([Google News](https://news.google.com/))
    - Training a model to find patterns in a dataset, typically an unlabeled dataset.
- Reinforcement learning ([AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far))
    - A family of algorithms for finding suitable actions to take in a given situation in order to maximize a reward.

### What is supervised machine learning (ML)?

- Training data comprises a set of observations ($X$) and their corresponding targets ($y$).
- We wish to find a model function $f$ that relates $X$ to $y$.
- We use the model function to predict targets of new examples.
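
As a rough illustration of the three points above, the sketch below learns a model function $f$ from a tiny, made-up dataset and then applies it to a new example. It assumes a single numeric feature and a straight-line model fitted by ordinary least squares; both choices are ours, made only to keep the example short.

```python
# Hypothetical toy data: one feature per example (X) and a numeric target (y).
X = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # these happen to follow y = 2x + 1


def fit_line(xs, ys):
    """Learn the slope w and intercept b of f(x) = w * x + b by least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (t - y_mean) for x, t in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    w = covariance / variance
    b = y_mean - w * x_mean
    return w, b


# "Training": find the parameters that relate X to y.
w, b = fit_line(X, y)


def f(x):
    """The learned model function."""
    return w * x + b


# "Prediction": apply f to an example that was not in the training data.
print(f(10.0))  # 21.0
```

The models used in the examples that follow are more flexible than a straight line, but they follow the same fit-then-predict pattern.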

<!-- ![](img/sup-learning.png)
<img src="img/sup-learning.png" height="800" width="800">  -->


In [None]:
p = 'img/sup-learning.png'
display(Image(filename=p, width=800))

### Example 1: Predict whether a message is spam or not

Do not worry about the code and syntax for now.

#### Input features $X$ and target $y$

Download the SMS Spam Collection Dataset from [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

In [None]:
sms_df = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
train_df.head().style.set_properties(**{"text-align": "left"})

#### Training a supervised machine learning model with $X$ and $y$

In [None]:
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]

clf = Pipeline(
    [
        ("vect", CountVectorizer(max_features=5000)),
        ("clf", LogisticRegression(max_iter=5000)),
    ]
)
clf.fit(X_train, y_train);
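
The first step of the pipeline above, `CountVectorizer`, turns each text message into a vector of word counts so that a model can work with numbers. The sketch below is a simplified, standard-library approximation of that idea, not scikit-learn's implementation. The helper names `tokenize`, `build_vocabulary`, and `to_count_vector` and the toy messages are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy messages, not taken from the dataset.
messages = [
    "Free prize! Call now",
    "Call me later",
    "Free free tickets",
]


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(texts, max_features=None):
    """Keep the most frequent words across all texts, sorted alphabetically."""
    totals = Counter(token for text in texts for token in tokenize(text))
    most_common = [word for word, _ in totals.most_common(max_features)]
    return sorted(most_common)


def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(messages)
X_counts = [to_count_vector(text, vocabulary) for text in messages]

print(vocabulary)
for row in X_counts:
    print(row)
```

Each row has one entry per vocabulary word. In the pipeline, `max_features=5000` plays a role similar to the `max_features` argument here: it caps the vocabulary at the most frequent words.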

#### Predicting on unseen data using the trained model

In [None]:
pd.DataFrame(X_test[0:4]).style.set_properties(**{"text-align": "left"})

<br><br>

In [None]:
pred_dict = {
    "sms": X_test[0:4],
    "spam": y_test[0:4],  # actual spam
    "spam_predictions": clf.predict(X_test[0:4]),
}
pred_df = pd.DataFrame(pred_dict)
pred_df.style.set_properties(**{"text-align": "left"})

**We have accurately predicted labels for the unseen text messages above!**
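
Four correct predictions are encouraging, but they are not yet a measurement. One common way to summarize performance is accuracy: the fraction of examples whose predicted label matches the actual label. The sketch below computes it by hand on hypothetical labels (not the dataset's).

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label sequences must have the same length.")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels for five messages.
actual = ["ham", "spam", "ham", "ham", "spam"]
predicted = ["ham", "spam", "ham", "spam", "spam"]

print(accuracy(actual, predicted))  # 4 of 5 correct: 0.8
```

For scikit-learn classifiers such as the pipeline above, `clf.score(X_test, y_test)` reports the same quantity (mean accuracy) on the held-out test set.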

<br><br>

### Example 2: Predicting whether a patient has a liver disease or not

##### Input data

Suppose we are interested in predicting whether a patient has the disease or not. We are given some tabular data with inputs and outputs of liver patients, as shown below. The data contains a number of input features and a special column called "Target" which is the output we are interested in predicting.

Download the data from [here](https://www.kaggle.com/uciml/indian-liver-patient-records).


In [None]:
df = pd.read_csv("data/indian_liver_patient.csv")
df = df.drop(columns = ["Gender"])
df["Dataset"] = df["Dataset"].replace(1, "Disease")
df["Dataset"] = df["Dataset"].replace(2, "No Disease")
df.rename(columns={"Dataset": "Target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=4, random_state=42)
train_df.head()

##### Building a supervised machine learning model

Let's train a supervised machine learning model with the input and output above.

In [None]:
from lightgbm.sklearn import LGBMClassifier

X_train = train_df.drop(columns=["Target"])
y_train = train_df["Target"]
X_test = test_df.drop(columns=["Target"])
y_test = test_df["Target"]
model = LGBMClassifier(random_state=123)
model.fit(X_train, y_train);

##### Model predictions on unseen data

- Given the features of the new patients below, we'll use this model to predict whether they have liver disease or not.

In [None]:
pred_df = pd.DataFrame({"Predicted_target": model.predict(X_test).tolist()})

df_concat = pd.concat([pred_df, X_test.reset_index(drop=True)], axis=1)
df_concat

<br><br>

### Example 3: Predicting housing prices

Suppose we want to predict housing prices given a number of attributes associated with houses.

Download the data from [here](https://www.kaggle.com/harlfoxem/housesalesprediction).

In [None]:
df = pd.read_csv("data/kc_house_data.csv")
df = df.drop(columns = ["id", "date"])
df.rename(columns={"price": "target"}, inplace=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=4)
train_df.head()

In [None]:
# Build a regression model
import xgboost as xgb
from xgboost import XGBRegressor

X_train, y_train = train_df.drop(columns= ["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns= ["target"]), test_df["target"]

model = XGBRegressor()
model.fit(X_train, y_train);

In [None]:
# Predict on unseen examples using the built model
pred_df = pd.DataFrame(
    # {"Predicted target": model.predict(X_test[0:4]).tolist(), "Actual price": y_test[0:4].tolist()}
    {"Predicted_target": model.predict(X_test[0:4]).tolist()}
)
df_concat = pd.concat([pred_df, X_test[0:4].reset_index(drop=True)], axis=1)
df_concat
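
Because the target here is a continuous price rather than a category, a prediction is rarely exactly right; what matters is how far off it is. One simple way to quantify this is the mean absolute error, sketched below with hypothetical prices (not taken from the housing data).

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("Both sequences must have the same length.")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted prices for four houses.
actual_prices = [300000.0, 450000.0, 520000.0, 610000.0]
predicted_prices = [310000.0, 430000.0, 500000.0, 650000.0]

print(mean_absolute_error(actual_prices, predicted_prices))  # 22500.0
```

A value of 22500.0 means the predictions are off by 22,500 on average, in the same units as the target.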

<br><br>

### Example 4: Predicting the label of a given image

Suppose you want to predict the label of a given image using supervised machine learning. We are using a pre-trained model here to predict labels of new unseen images.

In [None]:
from PIL import Image as PILImage  # alias so IPython.display.Image is not shadowed

# Predict labels with associated probabilities for unseen images
images = glob.glob("data/test_images/*.*")
for image in images:
    img = PILImage.open(image)
    img.load()
    plt.imshow(img)
    plt.show()
    df = classify_image(img)
    print(df.to_string(index=False))
    print("--------------------------------------------------------------")

| |
|-|
|To summarize, supervised machine learning can be used on a variety of problems and different kinds of data.|

<br><br>

### 🤔 Questions

- How exactly are we "learning" whether a message is spam or ham?
- What do you mean by "learn without being explicitly programmed"? The code has to be somewhere ...
- Are we expected to get correct predictions for all possible messages? How does it predict the label for a message it has not seen before?  
- What if the model mis-labels an unseen example? For instance, what if the model incorrectly predicts a non-spam as a spam? What would be the consequences?
- How do we measure the success or failure of spam identification?
- If you want to use this model in the wild, how do you know how reliable it is?  
- Would it be useful to know how confident the model is about the predictions rather than just a yes or a no?

It's great to think about these questions right now. By the end of this course you'll know answers to many of these questions!  

<br><br>

### Jupyter notebooks

- This document is a [Jupyter notebook](https://jupyter.org/), with file extension `.ipynb`.
- Confusingly, "Jupyter notebook" is also the original application that opens `.ipynb` files.
- Jupyter notebooks contain a mix of code, code output, markdown-formatted text (including LaTeX equations), and more.
  - When you open a Jupyter notebook in one of these apps, the document is "live", meaning you can run the code.
  - For example:

In [None]:
1 + 1

In [None]:
x = [1, 2, 3]
x[0] = 9999
x

<br><br>

## Summary

- Machine learning is a different paradigm for problem solving.    
- Very often it reduces the time you spend programming and helps you customize and scale your products.
- In supervised learning we are given a set of observations ($X$) and their corresponding targets ($y$) and we wish to find a model function $f$ that relates $X$ to $y$.
- Let's have fun learning this material together!


## Credits:

This set of Jupyter notebooks is derived from [UBC CPSC 330 Applied Machine Learning](https://github.com/UBC-CS/cpsc330) developed by Varada Kolhatkar.