Semi-Supervised Learning With Label Propagation

By Jason Brownlee on December 28, 2020 in Python Machine Learning

Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Unlike supervised learning algorithms, which can only learn from labeled training data, semi-supervised learning algorithms can also learn from unlabeled examples.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagate known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the label propagation algorithm for classification predictive modeling.

In this tutorial, you will discover how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

An intuition for how the label propagation semi-supervised learning algorithm works.
How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

Label Propagation Algorithm
Semi-Supervised Classification Dataset
Label Propagation for Semi-Supervised Learning

Label Propagation Algorithm

Label Propagation is a semi-supervised learning algorithm.

The algorithm was proposed in the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled “Learning From Labeled And Unlabeled Data With Label Propagation.”

The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then have soft labels, or label distributions, based on the labels or label distributions of the examples connected to them nearby in the graph.

Many semi-supervised learning algorithms rely on the geometry of the data induced by both labeled and unlabeled examples to improve on supervised methods that use only the labeled data. This geometry can be naturally represented by an empirical graph g = (V,E) where nodes V = {1,…,n} represent the training data and edges E represent similarities between them

— Page 193, Semi-Supervised Learning, 2006.
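
To make the graph idea concrete, here is a minimal sketch that computes pairwise Euclidean distances for a handful of toy points and turns them into edge weights with an RBF kernel. The toy array X_toy and the gamma value of 1.0 are assumptions chosen for illustration; they do not come from the technical report or the book.

```python
# sketch: build a fully connected similarity graph over a few 2-D points
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

# four toy points: two near the origin and two near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
# pairwise Euclidean distance between every pair of rows
D = euclidean_distances(X_toy)
# edge weights: exp(-gamma * squared distance); gamma is an assumed value
W = rbf_kernel(X_toy, gamma=1.0)
# summarize the distances and the edge weights
print(D.round(2))
print(W.round(3))
```

Nearby points receive a weight close to 1 and distant points a weight close to 0, so labels flow mostly between close neighbors.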

Propagation refers to the iterative way in which labels are assigned to nodes in the graph and spread along the edges of the graph to connected nodes.

This procedure is sometimes called label propagation, as it “propagates” labels from the labeled vertices (which are fixed) gradually through the edges to all the unlabeled vertices.

— Page 48, Introduction to Semi-Supervised Learning, 2009.

The process is repeated until the labels assigned to unlabeled examples stop changing (convergence), or until a maximum number of iterations is reached.

Starting with nodes 1, 2,…,l labeled with their known label (1 or −1) and nodes l + 1,…,n labeled with 0, each node starts to propagate its label to its neighbors, and the process is repeated until convergence.

— Page 194, Semi-Supervised Learning, 2006.
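
The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation or the exact formulation from the technical report. The function name propagate_labels, the tolerance, and the toy data are all assumptions made for the example.

```python
# sketch: a simplified label propagation loop on a toy graph
import numpy as np

def propagate_labels(W, y, n_classes, max_iter=1000, tol=1e-6):
    """Toy label propagation; y uses -1 to mark unlabeled rows."""
    y = np.asarray(y)
    labeled = y != -1
    # row-normalize the edge weights into transition probabilities
    T = W / W.sum(axis=1, keepdims=True)
    # one-hot distributions for labeled rows, zeros for unlabeled rows
    Y = np.zeros((len(y), n_classes))
    Y[labeled, y[labeled]] = 1.0
    Y_fixed = Y.copy()
    for _ in range(max_iter):
        Y_prev = Y
        # each node takes the weighted average of its neighbors' distributions
        Y = T @ Y
        # clamp: labeled rows keep their known labels
        Y[labeled] = Y_fixed[labeled]
        # stop once the distributions have stopped changing
        if np.abs(Y - Y_prev).sum() < tol:
            break
    return Y.argmax(axis=1)

# six points on a line forming two groups; only the first and last are labeled
x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
W = np.exp(-(x[:, None] - x[None, :]) ** 2)
y = np.array([0, -1, -1, -1, -1, 1])
labels = propagate_labels(W, y, n_classes=2)
print(labels)
```

In this toy case, the unlabeled points in each group inherit the label of the single labeled point in their own group, because the edge weights between the two groups are close to zero.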

Now that we are familiar with the Label Propagation algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification) and two input variables and 1,000 examples.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire hold out test dataset and evaluated using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.

Accuracy: 84.800
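
A single accuracy score depends on the particular split of the data. As a rough sketch of how we might average over several runs, the example below repeats the baseline with five different split seeds and reports the mean and standard deviation. The choice of five repeats and of seeds 1 to 5 is an assumption made for illustration.

```python
# sketch: repeat the baseline over several splits and average the accuracy
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
scores = []
for seed in range(1, 6):
    # split into train and test using a different seed each time
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=seed, stratify=y)
    # split train into labeled and unlabeled
    X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=seed, stratify=y_train)
    # fit model on labeled dataset only
    model = LogisticRegression()
    model.fit(X_train_lab, y_train_lab)
    # evaluate on the hold out test set
    scores.append(accuracy_score(y_test, model.predict(X_test)))
# summarize the distribution of scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
```

The same loop could be wrapped around any of the later examples so that the methods are compared on averages rather than on one split.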

Next, let’s explore how to apply the label propagation algorithm to the dataset.

Label Propagation for Semi-Supervised Learning

The Label Propagation algorithm is available in the scikit-learn Python machine learning library via the LabelPropagation class.

The model can be fit just like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.

...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

Importantly, the training dataset provided to the fit() function must include labeled examples that are integer encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “transduction_” attribute on the fit LabelPropagation model.

...
# get labels for the entire training dataset
tran_labels = model.transduction_
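
As a small illustration of the -1 convention and the “transduction_” attribute, the sketch below fits the model on six toy points where only one point in each group has a known label. The arrays X_toy and y_toy are assumptions made up for the example.

```python
# sketch: mark unlabeled rows with -1 and read back the estimated labels
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# two well-separated groups of points on a line
X_toy = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
# one known label per group, -1 for everything else
y_toy = np.array([0, -1, -1, 1, -1, -1])
# define and fit the model
model = LabelPropagation()
model.fit(X_toy, y_toy)
# the classes seen in the labeled rows (the -1 marker is not a class)
print(model.classes_)
# the estimated label for every row, labeled and unlabeled
print(model.transduction_)
```

The -1 marker is treated as “no label” rather than as a class, so it does not appear in classes_, and each unlabeled row receives one of the real class labels.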

Now that we are familiar with how to use the Label Propagation algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the labeled and unlabeled portions of the training dataset into a single array.

...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (meaning unlabeled), one for each row in the unlabeled portion of the training dataset.

...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
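
Because the inputs and the labels are concatenated separately, it is worth confirming that they still line up. The sketch below repeats the preparation steps, using numpy.full() as an alternative way to create the -1 values, and then counts the labeled and unlabeled rows. The sanity checks are an addition for illustration only.

```python
# sketch: build the mixed training set and sanity check its alignment
from numpy import concatenate, full
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input, labeled rows first
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data as an array of -1 values
nolabel = full(len(y_test_unlab), -1)
# recombine training dataset labels in the same order as the inputs
y_train_mixed = concatenate((y_train_lab, nolabel))
# confirm the shapes match and count each kind of row
print(X_train_mixed.shape, y_train_mixed.shape)
print('Labeled rows:', (y_train_mixed != -1).sum())
print('Unlabeled rows:', (y_train_mixed == -1).sum())
```

The order matters: the labeled rows must come first in both arrays, otherwise the -1 markers would be attached to the wrong inputs.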

We can now train the LabelPropagation model on the entire training dataset.

...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

# evaluate label propagation on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

In this case, we can see that the label propagation model achieves a classification accuracy of about 85.6 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.600

So far, so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:

...
# get labels for the entire training dataset
tran_labels = model.transduction_

We can then use these labels along with all of the input data to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label propagation for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for the entire training dataset
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

In this case, we can see that this hierarchical approach of a semi-supervised model followed by a supervised model achieves a classification accuracy of about 86.2 percent on the holdout dataset, even better than the semi-supervised model used alone, which achieved an accuracy of about 85.6 percent.

Accuracy: 86.200
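
The supervised model is only as good as the labels it is trained on. In this tutorial the “unlabeled” rows do have true labels that we simply held back, so we can check how accurate the propagated labels are. The sketch below is an extra diagnostic and assumes the same dataset and splits used above; this check would not be possible with genuinely unlabeled data.

```python
# sketch: measure the accuracy of the propagated labels on the held-back rows
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the mixed training dataset
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# fit the label propagation model
model = LabelPropagation()
model.fit(X_train_mixed, y_train_mixed)
# get labels for the entire training dataset
tran_labels = model.transduction_
# rows after the labeled portion are the ones we pretended were unlabeled
n_lab = len(y_train_lab)
unlab_acc = accuracy_score(y_test_unlab, tran_labels[n_lab:])
print('Propagated Label Accuracy: %.3f' % (unlab_acc*100))
```

If the propagated labels are poor, a supervised model fit on them is likely to inherit those errors, so this number is a useful companion to the holdout accuracy.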

Can you achieve better results by tuning the hyperparameters of the LabelPropagation model?
Let me know what you discover in the comments below.
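
As a starting point, the sketch below tries a handful of LabelPropagation configurations using the kernel, gamma, and n_neighbors arguments. The specific values in the list are assumptions picked for illustration, not recommendations. Note that choosing a configuration based on the test set gives an optimistic estimate; a separate validation set would be more rigorous.

```python
# sketch: compare a few LabelPropagation configurations on the same dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the mixed training dataset
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# candidate configurations (values are illustrative assumptions)
configs = [
    {'kernel': 'rbf', 'gamma': 5.0},
    {'kernel': 'rbf', 'gamma': 20.0},
    {'kernel': 'rbf', 'gamma': 50.0},
    {'kernel': 'knn', 'n_neighbors': 7},
    {'kernel': 'knn', 'n_neighbors': 15},
]
results = []
for params in configs:
    # define and fit the model with this configuration
    model = LabelPropagation(max_iter=1000, **params)
    model.fit(X_train_mixed, y_train_mixed)
    # evaluate on the hold out test set
    score = accuracy_score(y_test, model.predict(X_test))
    results.append((params, score))
    print('%s Accuracy: %.3f' % (params, score*100))
```

Some configurations may emit convergence or numerical warnings; if so, that is itself a hint that the setting is a poor fit for this dataset.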

Summary

In this tutorial, you discovered how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

An intuition for how the label propagation semi-supervised learning algorithm works.
How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

22 Responses to Semi-Supervised Learning With Label Propagation

Jack December 31, 2020 at 6:40 pm #

Thank you so much for your contribution Jason.
Love your blog and love to learn from you!
Keep up the great work. You are awesome!
best regards and happy new year

Reply
- Jason Brownlee January 1, 2021 at 5:23 am #
  
  You’re welcome, thank you for your support and kind words!
  
  Reply
Constantine January 2, 2021 at 12:02 am #

Hey Jason, and Happy 2021!

I found this article a good first sneak-peak at Semi-Supervised Learning, and it certainly is an exciting domain! I would like to ask, would you consider making a series on that, you know creating some of its algorithms from scratch, and then practically attacking datasets with the help of sklearn? I do have the two books on SSL that you’ve listed, but I would like some practical coding tutorials to put that rather dense theory to practice. In addition, what are your thoughts on Graph-Based (Deep) Machine Learning? I also have an interest in that, as I am motivated by my interest in computational biology which does use graphs quite a lot. Any thoughts on making a few posts on that as well?

Best regards!

Reply
- Sharon January 2, 2021 at 12:21 am #
  
  Model2 – seems like you made one classifier that classified unlabeled data and then use his prediction on the holdout set.. so why not making 2 logistic regressions?
  
  Reply
  - Jason Brownlee January 2, 2021 at 6:26 am #
    
    Model2 is fit on the labels assigned to the training dataset by the semi-supervised learning algorithm.
    
    Reply
- Jason Brownlee January 2, 2021 at 6:26 am #
  
  Thanks, same to you!
  
  Yes, I have a few more tutorials on the topic written and scheduled.
  
  Great suggestion, I’d like to dig deeper and code the algorithms from scratch. Thanks.
  
  Reply
Emanuel January 2, 2021 at 1:27 am #

Jason,
I must thank you for taking the time and effort to produce all those truly enjoyable articles.
Keep up the excellent work. Happy 2021!
Best regards

Reply
- Jason Brownlee January 2, 2021 at 6:27 am #
  
  You’re very welcome!
  
  Reply
Thad Wengert January 5, 2021 at 2:01 am #

Crystal clear exposition as usual, Jason.
Adding the classification model when you already have a model would seem to only increase errors. It feels like making the cake twice … though in the realm of real cakes, that’s what tiramisu is, and it’s certainly a good thing.
Theoretically, why would we expect a classification model (here, the logistic model) to improve accuracy?
Empirically, do you know of any literature where it is shown to?
Thanks!

Reply
- Jason Brownlee January 5, 2021 at 6:27 am #
  
  Thanks.
  
  Not sure I agree. The model fit on the transducted training set provides an alternate view on how to interpret the data.
  
  As always, evaluate and compare to alternatives and use whatever works best for your specific dataset+model.
  
  Reply
  - Thad Wengert February 4, 2021 at 7:36 am #
    
    Thanks for the answer.
    Tried this on a real dataset. Propagated ~ 10% labeled observations onto the rest. Propagated tags roughly 45% accurate.
    Added a CNN on the propagated tags. Its performance on propagated tags was very good – but not relevant of course. That’s the echo chamber ; the twice-baked cake.
    But the CNN’s performance on the original ~ 10% labeled observations was over 80% accurate.
    Wow. Smashing.
    
    Reply
    - Jason Brownlee February 4, 2021 at 9:35 am #
      
      Very cool, well done!
      
      Thanks for sharing your success.
      
      Reply
Sheetal January 8, 2021 at 5:06 pm #

How to do semi-supervised learning for regression?

Reply
- Jason Brownlee January 9, 2021 at 6:39 am #
  
  It is typically used for classification; nevertheless, there are algorithms for regression.
  
  Sorry I don’t have any examples, perhaps some of the resources in the “further reading” section will help.
  
  Reply
Sreedevi March 19, 2021 at 6:01 pm #

Thanks Jason for the crisp article on Label Propagation.

What are the other semi-supervised Learning techniques? The sci-kit learn link (that you provided) also mentions self training – which you seem to have used in the above tutorial (along with Label propagation) . Any others?

Reply
- Jason Brownlee March 20, 2021 at 5:17 am #
  
  You’re welcome.
  
  Good question, this might be a good place to start:
  https://en.wikipedia.org/wiki/Semi-supervised_learning
  
  Reply
Jonathan May 4, 2021 at 5:17 pm #

Hi Jason
Great article. In the sklearn documentation it does not say to use -1 to mark unlabeled data. Where did you get this info from? I tried some other random values, pos and neg, and -1 works best.
Also, what happens if coincidentally there are some -1s in the labeled targets (y_train_lab)?
Thanks.
Jon

Reply
- Jason Brownlee May 5, 2021 at 6:09 am #
  
  Thanks.
  
  From the documentation: (unlabeled points are marked as -1)
  
  On this page:
  https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html
  
  Reply
frozhen July 14, 2021 at 7:13 am #

Hi Jason,

Very nice illustration on semi-supervised learning. Question for you, do you have experience using semi-supervised learning for multi-label classification? I know scikit-multilearn is a great package for multi label classification but I cannot figure out how to combine it with sklearn’s self learning classifier. Would love to hear your opinions!

Reply
- Jason Brownlee July 15, 2021 at 5:21 am #
  
  Thanks!
  
  Not off hand, sorry.
  
  Reply
Alexandre August 10, 2022 at 10:11 am #

Great article Jason, thanks for sharing this !

As i was reading thru this approach i got to wonder:
Is there a diff between this and doing the following:
a. Build a normal classifier, apply to the unlabeled data
b. filter for the top 10% of highest precision (lets say predictions with precision >90%).
c. re-injecting the top precision results back into the dataset, thus extending the labeled dataset
d. re-classify again, i.e. back to a.

It feels its somewhat similar to what this approach you describe above is doing, or am i missing something ?

Interestingly enough, we should be able to test this, by comparing both approaches
Also with the one i describe we have available all the typical algols to use

Reply
- James Carmichael August 11, 2022 at 6:17 am #
  
  Hi Alexandre…You are very welcome! Your understanding and suggested procedure are correct.
  
  Reply
