In this article, you will learn how to build an end-to-end sentiment analysis pipeline using Scikit-LLM and open-source large language models served through the Groq API.
Topics we will cover include:
- How Scikit-LLM bridges classical scikit-learn pipelines with modern large language model API calls.
- How to set up Scikit-LLM with a Groq backend and prepare the IMDB Movie Reviews dataset for inference.
- How to build, run, and evaluate a zero-shot sentiment classification pipeline using scikit-learn-compatible syntax.

Building an End-to-End Sentiment Analysis Pipeline with Scikit-LLM
Introduction
Traditional machine learning pipelines for predictive tasks like text classification usually rely on extracting structured, numerical features from raw text — for instance, TF-IDF frequencies or token embeddings — to feed into classical models such as logistic regression, ensembles, or support vector machines.
With the rise of large language models (LLMs), the rules of the game have somewhat changed: it is now possible to leverage zero-shot or few-shot reasoning on existing, pre-trained models for language tasks as part of a machine learning framework. Scikit-LLM is a Python library that addresses this: it bridges the gap between classical machine learning and modern LLM API calls. In this article, we will use Scikit-LLM alongside Groq backend models to build an end-to-end pipeline for sentiment analysis (a domain-specific form of text classification), achieving reasonably fast inference results with open-source models. From preprocessing to inference, we will use a large, realistically-sized dataset — the IMDB movie reviews dataset.
Prerequisites, Setup, and Obtaining the Dataset
To make the code shown in this tutorial work, you’ll need to have installed the Scikit-LLM library:
```bash
pip install scikit-llm
```
Once installed, the next step is to configure API credentials. In other words, we need to connect Scikit-LLM to an LLM API provider, in this case Groq. Register on Groq and generate an API key at https://console.groq.com/keys, then paste it into the code below:
```python
from skllm.config import SKLLMConfig

# 1. Point Scikit-LLM to Groq's OpenAI-compatible endpoint
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")

# 2. Set your free Groq API key
# Get yours at https://console.groq.com/keys
SKLLMConfig.set_openai_key("YOUR-API-KEY-GOES-HERE")
```
Scikit-LLM's set_gpt_url function overrides the base URL used for OpenAI-compatible requests. Here we point it at Groq's OpenAI-compatible endpoint, https://api.groq.com/openai/v1, so that models referenced with the custom_url:: prefix are requested from Groq rather than from OpenAI.
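Hard-coding a key in a script makes it easy to leak by accident. One common alternative, sketched below, is to read it from an environment variable. The variable name GROQ_API_KEY and the helper load_api_key are assumptions made for this illustration; they are not part of Scikit-LLM.

```python
import os


def load_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name you export in your own shell.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage with the configuration call from this tutorial (requires scikit-llm):
# from skllm.config import SKLLMConfig
# SKLLMConfig.set_openai_key(load_api_key())
```

Failing loudly when the variable is missing tends to be easier to debug than an authentication error returned later by the API.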
The next stage of the process is importing the IMDB Movie Reviews dataset — which has about 50K instances — and preparing it for the sentiment analysis pipeline we will build. Instances consist of a text review labeled with a sentiment, which can be positive or negative (this is a binary classification problem, solvable with models like logistic regression, for instance).
For convenience, we read the dataset from a publicly available GitHub repository version in CSV format:
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Fetching a large, realistic-sized dataset (IMDB Movie Reviews - 50,000 rows)
# We will read the data from a public raw CSV for convenience
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"

print("Downloading dataset...")
df = pd.read_csv(url)
print(f"Total dataset size: {df.shape[0]} rows")

# In a realistic LLM pipeline using a free-tier API, sending 50,000 requests
# will likely trigger quota limits. Thus, we will use 500 rows for demonstrating our pipeline execution.
# Feel free to use more data if you have paid API access.
df_sampled = df.sample(n=500, random_state=42)

# The IMDB dataset contains HTML tags and formatting noise: that's perfect for testing our cleaner
X = df_sampled["review"]
y = df_sampled["sentiment"]  # Labels are 'positive' or 'negative'

# Splitting into training (for initializing zero-shot labels) and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Note that although the full dataset is downloaded, we sample only 500 rows for demonstration purposes: every prediction is a separate API request, so a larger sample takes longer and may hit free-tier quota limits. You can freely change the sample size, n=500, to suit your needs.
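To get a feel for how sample size translates into waiting time, the back-of-the-envelope sketch below estimates a lower bound on run time from a requests-per-minute cap. It assumes one API request per review, and the figure of 30 requests per minute is purely hypothetical: check the limits that apply to your own Groq account. Because fitting a zero-shot classifier only registers labels, only the test split should generate requests.

```python
import math


def min_minutes(n_requests: int, requests_per_minute: int) -> int:
    """Lower bound on wall-clock minutes under a requests-per-minute cap."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return math.ceil(n_requests / requests_per_minute)


HYPOTHETICAL_RPM = 30  # assumption for illustration, not a documented Groq limit

sample_size = 500
test_size = 0.2
n_test = int(sample_size * test_size)  # 100 reviews are sent for prediction

print(min_minutes(n_test, HYPOTHETICAL_RPM))   # 4
print(min_minutes(50_000, HYPOTHETICAL_RPM))   # 1667
```

Under these assumed numbers, classifying the whole dataset would take more than a day of continuous requests, which is why the tutorial works with a small sample.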
Building the Sentiment Analysis Pipeline
Here comes the most interesting part of the process! A data science pipeline boils down to a series of preprocessing, cleaning, and data preparation steps followed by model setup or training, inference, and evaluation. For a predictive, text-based scenario like ours, preprocessing typically entails cleaning and normalizing the text. Scikit-learn provides an elegant class, FunctionTransformer, to define and encapsulate preprocessing steps based on a custom function:
```python
from sklearn.preprocessing import FunctionTransformer

def clean_text_data(texts):
    """Cleans raw text inputs by removing HTML tags and stripping whitespace."""
    series = pd.Series(texts).astype(str)
    # Remove HTML tags like <br />
    cleaned = series.str.replace(r'<[^>]+>', ' ', regex=True)
    # Remove extra spaces
    cleaned = cleaned.str.strip().str.replace(r'\s+', ' ', regex=True)
    return cleaned.tolist()

# Wrapping the cleaning function to enable its use inside a Pipeline object
text_cleaner = FunctionTransformer(clean_text_data)
```
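To see what the two regular expressions do, here is a small standard-library sketch that applies the same steps to a single string. The helper name clean_one and the sample review are made up for this illustration.

```python
import re


def clean_one(text: str) -> str:
    """Single-string version of the cleaning steps, using only the stdlib."""
    # Replace tags such as <br /> with a space so neighboring words stay separated
    no_tags = re.sub(r"<[^>]+>", " ", text)
    # Trim both ends, then collapse any run of whitespace into one space
    return re.sub(r"\s+", " ", no_tags.strip())


sample = "  Great film!<br /><br />Loved   it.  "
print(clean_one(sample))  # Great film! Loved it.
```

Replacing each tag with a space rather than an empty string matters: deleting the tags outright would glue "film!" and "Loved" together.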
Now we combine this preprocessing object with a model instance to create the Pipeline. Once defined, the pipeline orchestrates the whole process of preparing the data and passing it to the model at both the training and inference stages. Although we use the term “training”, no weight updates occur, because we are using a pre-trained model hosted by Groq for zero-shot classification. Fitting the model only involves passing it the classification labels to use.
```python
from sklearn.pipeline import Pipeline
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Define the end-to-end pipeline
sentiment_pipeline = Pipeline([
    ("cleaner", text_cleaner),
    # Updated to use Groq's active Llama 3.1 8B model
    ("llm_classifier", ZeroShotGPTClassifier(model="custom_url::llama-3.1-8b-instant"))
])

# Fit the pipeline
# Note: For Zero-Shot classification, fit() doesn't train the LLM.
# It simply registers the unique labels present in 'y_train' (positive, negative).
print("Fitting the pipeline...")
sentiment_pipeline.fit(X_train, y_train)
```
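To make the idea that fitting only registers labels more concrete, here is a toy stand-in written in plain Python. It is not Scikit-LLM's implementation: the class LabelOnlyClassifier and the function keyword_oracle are hypothetical, and the oracle simply takes the place of the LLM request so that the sketch runs offline.

```python
from typing import Callable, List, Sequence


class LabelOnlyClassifier:
    """Toy zero-shot-style classifier: fit() stores labels, predict() asks an oracle."""

    def __init__(self, ask: Callable[[str, List[str]], str]):
        self.ask = ask  # stands in for the LLM request

    def fit(self, X: Sequence[str], y: Sequence[str]) -> "LabelOnlyClassifier":
        # No weights are learned; X is ignored and only the label set is kept.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X: Sequence[str]) -> List[str]:
        # One oracle call per input, constrained to the registered labels.
        return [self.ask(text, self.classes_) for text in X]


def keyword_oracle(text: str, labels: List[str]) -> str:
    """Hypothetical placeholder for an LLM: a crude keyword rule."""
    guess = "positive" if "great" in text.lower() else "negative"
    return guess if guess in labels else labels[0]


clf = LabelOnlyClassifier(keyword_oracle).fit(
    ["first review", "second review"], ["positive", "negative"]
)
print(clf.classes_)                                     # ['negative', 'positive']
print(clf.predict(["A great movie", "Dull and slow"]))  # ['positive', 'negative']
```

The point of the sketch is the division of labor: fit() is cheap and local, while every call to predict() is where the per-review cost is paid.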
Once we have run the pipeline to “fit” the model, we use it once more for inference. Both steps use familiar scikit-learn syntax. Besides evaluating the model pipeline’s performance, we also display a few example predictions:
```python
from sklearn.metrics import classification_report

print(f"Running predictions on {len(X_test)} test samples...")
# Run predictions through the pipeline
predictions = sentiment_pipeline.predict(X_test)

# Evaluate the pipeline's performance on the realistic data
print("\n--- Classification Report ---")
print(classification_report(y_test, predictions))

# Display a few side-by-side examples
print("\n--- Sample Predictions ---")
for review, actual, predicted in zip(X_test[:3], y_test[:3], predictions[:3]):
    # Truncate review for display purposes
    short_review = review[:100] + "..."
    print(f"Review: {short_review}")
    print(f"Actual: {actual} | Predicted: {predicted}\n")
```
Here’s the detailed output — execution of the above code may take a few minutes to complete:
```
--- Classification Report ---
              precision    recall  f1-score   support

    negative       0.95      0.97      0.96        60
    positive       0.95      0.93      0.94        40

    accuracy                           0.95       100
   macro avg       0.95      0.95      0.95       100
weighted avg       0.95      0.95      0.95       100


--- Sample Predictions ---
Review: I saw mommy...well, she wasn't exactly kissing Santa Clause; he has his hand on her thigh and wicked...
Actual: negative | Predicted: negative

Review: This entry is certainly interesting for series fans (like myself), but yet it is mostly incomprehens...
Actual: negative | Predicted: negative

Review: Ingrid Bergman (Cleo Dulaine) has never been so beautiful. Gary Cooper as "Cleent" so perfectly cast...
Actual: positive | Predicted: positive
```
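The figures in the report can be reproduced by hand. Working backward from the rounded recall values and the supports (60 negative, 40 positive), the counts below are the ones consistent with the report; they are inferred here and are not printed by the original code. The sketch recomputes precision, recall, and F1 from them.

```python
# Confusion counts inferred from the report (an assumption, see the note above):
# 58 of 60 negatives and 37 of 40 positives were classified correctly.
neg_correct, neg_as_pos = 58, 2
pos_correct, pos_as_neg = 37, 3


def prf(correct: int, missed: int, false_alarms: int):
    """Return (precision, recall, f1) for one class."""
    precision = correct / (correct + false_alarms)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A negative review predicted as positive is a miss for "negative"
# and a false alarm for "positive", and vice versa.
neg = prf(neg_correct, missed=neg_as_pos, false_alarms=pos_as_neg)
pos = prf(pos_correct, missed=pos_as_neg, false_alarms=neg_as_pos)
accuracy = (neg_correct + pos_correct) / 100

for name, scores in (("negative", neg), ("positive", pos)):
    print(name, " ".join(f"{s:.2f}" for s in scores))
print(f"accuracy {accuracy:.2f}")
```

Rounded to two decimals, these values should line up with the per-class rows and the accuracy row of the report, which shows that the model made five mistakes in total on the 100 test reviews.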
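With 95% accuracy on the 100 held-out reviews, and precision and recall of at least 0.93 for both classes, our zero-shot pipeline is doing a solid job of classifying sentiment without any task-specific training. Well done!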
Wrapping Up
This article walked you through defining an end-to-end pipeline for sentiment classification using Scikit-LLM and freely available, pre-trained LLMs from API endpoints like Groq. This is a versatile approach to using classic scikit-learn syntax in novel, LLM-driven machine learning applications.





