Basic Introduction to Scikit Learn

Abhishek Jaiswal
Published in Analytics Vidhya · 4 min read · Apr 27, 2021


Photo by Marius Masalar on Unsplash

Built on top of NumPy, SciPy, and Matplotlib, scikit-learn is a robust library for machine learning in Python.

Why Scikit-learn?

  • Provides efficient tools for machine learning.
  • Provides statistical models for classification, regression, and clustering.
  • Covers most machine-learning tasks and scales to most data problems.

Installation

Using pip

pip install -U scikit-learn

Using conda

conda install scikit-learn
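
To confirm that the installation worked, a quick check like the one below can help. It is only a minimal sketch: it imports the package and prints the installed version, which is also handy when comparing against the documentation you are reading.

```python
import sklearn

# Print the installed version string to confirm the import works.
print("scikit-learn version:", sklearn.__version__)
```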

Features

Scikit-learn focuses on modeling data rather than on loading, manipulating, and summarizing it. Some of the popular groups of models provided by sklearn are:

  1. Supervised Learning algorithms: Linear Regression, Support Vector Machine (SVM), Decision Tree, etc.
  2. Unsupervised Learning algorithms: clustering, factor analysis, Principal Component Analysis, unsupervised neural networks, etc.
  3. Cross Validation: to estimate how accurate supervised models are on unseen data.
  4. Dimensionality Reduction: for reducing the number of attributes in data for summarization, visualization, and feature selection.
  5. Ensemble methods: for combining the predictions of multiple supervised models.

And many more.
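
To make a few of these groups concrete, here is a small sketch that chains dimensionality reduction (PCA), an ensemble method (a random forest), and cross validation on the iris dataset. The specific choices (2 components, 100 trees, 5 folds) are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Dimensionality reduction (4 features -> 2 components) followed by an
# ensemble of decision trees. Both settings are illustrative choices.
model = make_pipeline(
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# 5-fold cross validation: one accuracy score per held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Wrapping the steps in a pipeline means the PCA is refitted on each training fold, so no information from the held-out fold leaks into the model.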

Cheat Sheet

Photo by www.datacamp.com

Dataset Loading

Input :

from sklearn.datasets import load_iris  # import the dataset loader

iris = load_iris()  # load the iris dataset into the variable iris
X = iris.data  # store the features in the variable X
y = iris.target  # store the labels in the variable y
feature_names = iris.feature_names  # names of the feature columns
target_names = iris.target_names  # names of the label classes
print("Feature names:", feature_names)  # print the feature names
print("Target names:", target_names)  # print the label names
print("\nFirst 5 rows of X:\n", X[:5])  # print the first 5 rows

Feature Names − The list of the names of all the features (inputs).

Target Names − The list of the names of all the labels (outputs).

Output :

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
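
If you only need the arrays, a shorter route is the `return_X_y=True` option of `load_iris`. The sketch below also counts the samples per class with NumPy, which is a quick sanity check worth running on any classification dataset.

```python
import numpy as np
from sklearn.datasets import load_iris

# return_X_y=True skips the container object and returns the two arrays directly.
X, y = load_iris(return_X_y=True)

print(X.shape)         # (150, 4): 150 samples, 4 features
print(y.shape)         # (150,): one label per sample
print(np.bincount(y))  # [50 50 50]: 50 samples in each of the 3 classes
```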

Splitting the dataset

Input :

from sklearn.model_selection import train_test_split  # import the function

# Use 70% of the data for training and 30% for testing.
# random_state fixes the shuffle so the same split is produced every time.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print(X_train.shape)  # shape of the training features
print(X_test.shape)  # shape of the test features
print(y_train.shape)  # shape of the training labels
print(y_test.shape)  # shape of the test labels

random_state − fixes the seed used for shuffling, which guarantees the split will always be the same.

Output :

(105, 4)  # rows, columns
(45, 4)  # rows, columns
(105,)  # 1-D array of 105 labels
(45,)  # 1-D array of 45 labels
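
To see what `random_state` buys you, the sketch below splits the data twice with the same seed and checks that the test sets match. It also shows `stratify=y`, an optional argument not used in this article, which keeps the class proportions equal in both parts of the split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state produce identical splits.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
print(np.array_equal(a_test, b_test))  # True

# stratify=y (an optional extra) keeps each class equally represented.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(np.bincount(y_test_strat))  # [15 15 15]
```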

Train the Model

Input :

from sklearn.neighbors import KNeighborsClassifier  # import the classifier
from sklearn import metrics  # import the metrics module

classifier_knn = KNeighborsClassifier(n_neighbors=3)  # use the 3 nearest neighbors
classifier_knn.fit(X_train, y_train)  # train on the training split
y_pred = classifier_knn.predict(X_test)  # predict labels for the test split

# Accuracy compares the actual labels (y_test) with the predicted labels (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Sample data for the model to make predictions on
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output :

Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']

KNeighborsClassifier: this classifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. The best choice of k depends on the data.
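
To show the idea behind the classifier, here is a from-scratch sketch in plain NumPy. It is not scikit-learn's implementation: the helper `knn_predict` and the toy data are made up for illustration, and it assumes Euclidean distance with a simple majority vote, which matches the library's default settings only in spirit.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))  # 0
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))  # 1
```

A small k follows the training data closely and is sensitive to noise, while a large k smooths the decision boundary, which is why k is usually tuned per dataset.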

A rough diagram for the above data

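The plot above is obtained with the following code:

import matplotlib.pyplot as plt

scores_list = []
k_range = range(1, 15)
for k in k_range:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    scores_list.append(metrics.accuracy_score(y_test, y_pred))

plt.plot(k_range, scores_list)  # test accuracy for each value of k
plt.xlabel("Value of k")
plt.ylabel("Test accuracy")
plt.show()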

Metrics: the metrics module implements functions that assess prediction error for specific purposes, such as the accuracy score used here.
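
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks that by hand on a tiny set of made-up labels, and also prints a confusion matrix, another function from the same module that shows which classes get mixed up.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 6 samples, one of which is misclassified.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

acc = metrics.accuracy_score(y_true, y_hat)
manual_acc = np.mean(y_true == y_hat)  # fraction of matching labels
print(acc)                             # 5 of 6 correct, about 0.833
print(np.isclose(acc, manual_acc))     # True

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_hat)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]
```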

Model Persistence/Saving the Model

Once the model is trained, it should be saved for future use so that we do not need to retrain it again and again. This is done with the dump and load functions of the joblib package.

Input :

import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')  # save the model to a file named iris_classifier_knn.joblib

Now the object can be reloaded from the file with the following code:

loaded_knn = joblib.load('iris_classifier_knn.joblib')  # returns the saved classifier
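
As a self-contained check that a saved model really behaves like the original, here is a minimal round-trip sketch. It trains a fresh classifier, writes it to a temporary directory (an assumption made so that the example leaves no files behind), reloads it, and compares the predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_classifier_knn.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give exactly the same predictions.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)  # True
```

Two things to keep in mind: joblib files are pickle-based, so only load files you trust, and a model is best reloaded with the same scikit-learn version that saved it.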

Now you are ready to start on basic projects with sklearn.
All the best ❤️
