Semi-Supervised Learning With Label Spreading

Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data.

This is unlike supervised learning algorithms, which can only learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagates known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the label spreading algorithm for classification predictive modeling.

In this tutorial, you will discover how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

  • An intuition for how the label spreading semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.


Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Label Spreading Algorithm
  2. Semi-Supervised Classification Dataset
  3. Label Spreading for Semi-Supervised Learning

Label Spreading Algorithm

Label Spreading is a semi-supervised learning algorithm.

The algorithm was introduced by Dengyong Zhou, et al. in their 2003 paper titled “Learning With Local And Global Consistency.”

The intuition for the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points in the same structure or manifold in the input space should have the same label.

The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label.

Learning With Local And Global Consistency, 2003.

Label spreading is inspired by a technique from experimental psychology called spreading activation networks.

This algorithm can be understood intuitively in terms of spreading activation networks from experimental psychology.

Learning With Local And Global Consistency, 2003.

Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space.
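More concretely, in the Zhou et al. formulation an affinity matrix W is built from pairwise distances (with a zero diagonal) and normalized symmetrically as S = D^(-1/2) W D^(-1/2), where D is the diagonal matrix of the row sums of W. A label matrix F is then updated iteratively as F(t+1) = alpha * S * F(t) + (1 - alpha) * Y, where Y encodes the known labels and alpha controls how much each point trusts the information received from its neighbors versus its own initial label.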

The approach is very similar to the label propagation algorithm for semi-supervised learning.

Another similar label propagation algorithm was given by Zhou et al.: at each step a node i receives a contribution from its neighbors j (weighted by the normalized weight of the edge (i,j)), and an additional small contribution given by its initial value

— Page 196, Semi-Supervised Learning, 2006.

Once the propagation converges, each unlabeled point is assigned the label of the class from which it received the most information.

Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process.

Learning With Local And Global Consistency, 2003.

Now that we are familiar with the label spreading algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples.

Next, we will split the dataset into train and test datasets with an equal 50-50 split (i.e. 500 rows in each).

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.
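The listing below is a minimal sketch of this preparation; the specific make_classification() arguments (such as random_state and the informative/redundant feature counts) and the variable names are assumptions.

# sketch: prepare the dataset for semi-supervised learning
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define a binary classification dataset with 1,000 examples and 2 input features
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into equal-sized train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split the training set again into labeled and "unlabeled" halves
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize the shape of each portion
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
print('Test Set:', X_test.shape, y_test.shape)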

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.
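The listing below is a sketch of this baseline; it repeats the data preparation above and assumes default LogisticRegression hyperparameters.

# sketch: baseline logistic regression fit on the labeled portion only
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# prepare the dataset as before
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# fit the model on the labeled training data only
model = LogisticRegression()
model.fit(X_train_lab, y_train_lab)
# evaluate on the holdout test set
yhat = model.predict(X_test)
print('Accuracy: %.3f' % (accuracy_score(y_test, yhat) * 100))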

Running the example fits the model on the labeled training dataset, evaluates it on the holdout test dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve a better accuracy than this.

Next, let’s explore how to apply the label spreading algorithm to the dataset.

Label Spreading for Semi-Supervised Learning

The label spreading algorithm is available in the scikit-learn Python machine learning library via the LabelSpreading class.

The model can be fit just like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.

Importantly, the training dataset provided to the fit() function must include labeled examples that are ordinal encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “transduction_” attribute on the LabelSpreading class.
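As a minimal illustration of this usage pattern (the toy data below is made up for the example):

# sketch: minimal LabelSpreading usage on toy data
from numpy import array
from sklearn.semi_supervised import LabelSpreading
X = array([[1.0, 2.0], [1.1, 2.1], [8.0, 9.0], [8.1, 9.1]])
y = array([0, -1, 1, -1])  # -1 marks the unlabeled examples
model = LabelSpreading()
model.fit(X, y)
# estimated labels for all rows, labeled and unlabeled
print(model.transduction_)
# predictions for new, unseen data
print(model.predict(array([[1.05, 2.05]])))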

Now that we are familiar with how to use the label spreading algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the labeled and unlabeled input portions of the training dataset into a single array.

We can then create a list of -1 values (marking the rows as unlabeled) for each row in the unlabeled portion of the training dataset.

This list can then be concatenated with the labels from the labeled portion of the training dataset so that the labels align with the combined input array for the training dataset.

We can now train the LabelSpreading model on the entire training dataset.

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below.
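The listing below is a sketch of this workflow; it repeats the data preparation from earlier and assumes the default LabelSpreading hyperparameters.

# sketch: evaluate label spreading on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import accuracy_score
# prepare the dataset as before
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# combine the labeled and unlabeled inputs into one training array
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# mark the unlabeled rows with -1 and combine with the known labels
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# define and fit the label spreading model on the entire training dataset
model = LabelSpreading()
model.fit(X_train_mixed, y_train_mixed)
# make predictions on the holdout test set and evaluate
yhat = model.predict(X_test)
print('Accuracy: %.3f' % (accuracy_score(y_test, yhat) * 100))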

Running the example fits the model on the entire training dataset, evaluates it on the holdout test dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label spreading model achieves a classification accuracy of about 85.4 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

So far so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that, after fitting, we can retrieve the estimated labels for the entire training dataset from the label spreading model via the transduction_ attribute.
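For example (the variable name tran_labels is illustrative):

# estimated labels for every row in the training dataset, labeled and unlabeled
tran_labels = model.transduction_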

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.
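The listing below sketches this hierarchical approach, again repeating the data preparation and assuming default hyperparameters for both models.

# sketch: fit a supervised model on the labels estimated by label spreading
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# prepare the dataset as before
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# build the mixed training set, marking the unlabeled rows with -1
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
nolabel = [-1 for _ in range(len(y_test_unlab))]
y_train_mixed = concatenate((y_train_lab, nolabel))
# fit the label spreading model and retrieve the estimated labels
model = LabelSpreading()
model.fit(X_train_mixed, y_train_mixed)
tran_labels = model.transduction_
# fit a logistic regression on the full training inputs with the inferred labels
model2 = LogisticRegression()
model2.fit(X_train_mixed, tran_labels)
# evaluate on the holdout test set
yhat = model2.predict(X_test)
print('Accuracy: %.3f' % (accuracy_score(y_test, yhat) * 100))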

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the same data using the inferred labels, evaluates it on the holdout test dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of semi-supervised model followed by supervised model achieves a classification accuracy of about 85.8 percent on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone, which achieved an accuracy of about 85.4 percent.

Can you achieve better results by tuning the hyperparameters of the LabelSpreading model?
Let me know what you discover in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Semi-Supervised Learning, 2006.

Papers

  • Learning With Local And Global Consistency, 2003.

APIs

  • sklearn.semi_supervised.LabelSpreading API.


Summary

In this tutorial, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

  • An intuition for how the label spreading semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


24 Responses to Semi-Supervised Learning With Label Spreading

  1. fabou, January 4, 2021 at 11:53 pm

    Hi Jason,

    nice post as usual.

    Is label spreading sensitive to imbalanced classes, spreading mainly the over-represented class?

    If yes, is there a way to cope with that, like giving more weight to nearby points of the minority class?

    • Jason Brownlee, January 5, 2021 at 6:23 am

      Thanks.

      I expect the technique is sensitive to imbalanced classes and would assume a balanced training set. Perhaps check the literature to confirm and check if there are ways of biasing the label spreading algorithm accordingly.

  2. marco, January 5, 2021 at 12:55 am

    Hello Jason,
    I've a couple of questions about PCA.
    Does PCA(n_components=2) mean that the result is a dataset with two columns?

    In the following example, will the classifier (logistic regression) use only two columns after StandardScaler?

    >>> pipe_lr = make_pipeline(StandardScaler(),
    PCA(n_components=2), LogisticRegression(random_state=1, solver='lbfgs'))

    Does PCA apply to classification and regression as well?
    Thanks,
    Marco

    • Jason Brownlee, January 5, 2021 at 6:25 am

      Yes it will select 2 components after transforming the dataset – e.g. the result is 2 columns.

      Yes, the transform can be used for classification and regression prediction tasks.

  3. marco, January 5, 2021 at 12:55 am

    Jason,
    is there a rule of thumb to choose a PCA vs. LDA? When to use each of them?
    Do you have an easy example that explains PCA and LDA?
    Thanks,
    Marco

    • Jason Brownlee, January 5, 2021 at 6:26 am

      No, careful experimentation and choose the technique that results in the best performance for your specific dataset.

  4. marco, January 7, 2021 at 7:47 pm

    Hello Jason,
    a question on PCA.
    Does it make sense to mix sklearn with Keras?
    I mean, first apply StandardScaler(), then PCA (e.g. n_components=2),
    and then use the output as an input for a neural network like an MLP?
    Thanks,
    Marco

  5. marco, January 7, 2021 at 7:48 pm

    Jason,
    how do you decide the number of components for PCA (e.g. n_components=2)?
    Is there a way?
    Might an analysis of feature importance help?
    Thanks,
    Marco

  6. Sheetal, January 8, 2021 at 5:49 am

    Very informative post for me since this is the first time I’m learning about semi-supervised learning.

  7. Ramesh Ravula, January 9, 2021 at 8:42 pm

    In label spreading for semi-supervised learning why are we using X_test_unlab as part of training?

    • Jason Brownlee, January 10, 2021 at 5:40 am

      Semi-supervised learning means we have a large amount of unlabelled data.

      A semi-supervised learning algorithm can make use of this unlabelled data.

  8. Walid, January 9, 2021 at 11:18 pm

    Tornado is a tool that helps you perform semi-supervised learning with no coding; more detail can be found here: https://www.linkedin.com/pulse/tornado-zero-coding-active-learning-ml-tool-walid-daboubi/
    Demo: https://www.youtube.com/watch?v=xcX-95iGKxY

  9. Michael Dunham, March 20, 2021 at 1:36 pm

    Hi there,

    I have a suspicion that the LabelSpreading() class is maybe not working like you think it is here. Label propagation techniques are inherently transductive, meaning they can only make predictions on unlabeled data they train with (i.e. the X_test_unlab from concatenate((X_train_lab, X_test_unlab)) in your initial case). Transductive methods don’t work like normal (inductive) classifiers and can’t make predictions for unseen data, i.e. X_test. So I find it strange that LabelSpreading() is still able to run here without error. In my mind, to make predictions for X_test, X_test would have needed to be included in model.train() via X_train_mixed = concatenate((X_train_lab, X_test_unlab, X_test)).

    I thought that perhaps the only reason your scenario executed without error was because your X_test is the exact same size as the data matrix that was used to train the algorithm, concatenate((X_train_lab, X_test_unlab)) – both are size (500, 2). But I ran your code using different sizes of the training & testing matrices, and it still somehow ran. To me, this is a strange phenomenon given my knowledge of how label propagation methods work. Thoughts?

    • Jason Brownlee, March 21, 2021 at 6:06 am

      Perhaps adjust the dataset sizes and see if it has an impact?

  10. Ben, July 15, 2021 at 11:31 pm

    Hi Jason,

    “The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label.”

    My labelled data-set has a different balance/proportion of classes to what I’d expect the unlabelled data-set to have. I don’t think this violates the above assumptions because nearby points are likely to share the same label. It’s just that the distribution of classes in both sets is different – I think this is fine, but I’d appreciate your thoughts.

    Many thanks,
    Ben

    • Jason Brownlee, July 16, 2021 at 5:26 am

      The imbalance might lead you to try oversampling methods or cost-sensitive methods if it becomes an issue.

  11. Vi, August 4, 2021 at 2:31 am

    Hi!

    What distance metric is used for the knn in LabelSpreading()? And can you change the distance metric?

    Thank you 🙂
    Vi

    • Jason Brownlee, August 4, 2021 at 5:16 am

      Not sure, probably the default for knn in sklearn. I recommend checking the docs.

  12. Gavin Smyth, April 27, 2022 at 7:02 pm

    Hi! How would you implement this with your own dataset, for example tabular data in a .csv file? Is there another post covering this?
