Project Spotlight: Stack Exchange Clustering using Mahout with Konstantin Slisenko

This is a project spotlight with Konstantin Slisenko, a programmer and machine learning enthusiast.

Could you please introduce yourself?

My name is Konstantin Slisenko, and I’m from Belarus. I graduated from the Belarusian State University of Informatics and Radioelectronics, and I am currently studying for a master’s degree.

Konstantin Slisenko

I’m a Java developer at JazzTeam. I like learning new technologies, and I’m currently interested in big data and machine learning. I like to take part in conferences and meet interesting new people. I also like to travel and ride a bike.

What is your project called and what does it do?

My project clusters data from the stackoverflow.com website.

The goal is to group Stack Overflow questions and answers. Once they are grouped, you can see an overall picture of the Stack Overflow data, including the relationships between questions. This can help if you want to do market research or write an article (or even a book) about a specific problem.

Stackexchange clustering using Mahout Tags

I have ideas for improvements that would add more data to the overall picture, such as marking “hot” topics and taking user ratings into account. I’m also thinking about training a classifier. This could help when we get updated data and want to add it to the system.

How did you get started?

First of all, I became interested in Apache Hadoop. After writing some Hadoop programs, I started to study its infrastructure and learned about Apache Mahout.

I started to dig into it and work through some examples: prepare the data, run an algorithm, look at the output. One day I found material on Stack Overflow clustering by Frank Scholten. You can watch an interesting presentation of his. This topic is also mentioned in Mahout in Action.

I now use Frank’s code as a base and apply my own improvements and tuning. The data processing includes the following steps:

  1. The Stack Exchange source data is in XML format. Hadoop jobs are used to extract the text.
  2. Then I process the text with a custom Lucene analyzer: remove stop words, apply the Porter stemmer, etc.
  3. Then I vectorize the text using Mahout’s TF-IDF utilities.
  4. For clustering I currently use the K-Means algorithm from Mahout, but I want to try other algorithms in the future.
  5. After this I store the results in the graph database Neo4j and use HTML and JavaScript to visualize them.
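
To make the text-processing and vectorization steps more concrete, here is a minimal sketch in plain Python. It is not Konstantin's code and it does not use Lucene or Mahout: the stop-word list, the function names, and the three sample questions are all invented for illustration, stemming is left out, and the weighting (raw term count times log(N / document frequency)) is one common TF-IDF variant that may differ from the formula Mahout applies.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "how", "do", "i", "my", "with", "of"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return vectors

# Made-up sample questions, not taken from the Stack Exchange dump.
questions = [
    "How do I parse XML in Java?",
    "Java XML parser throws an exception",
    "How to center a div with CSS",
]
vectors = tf_idf(questions)
for question, vector in zip(questions, vectors):
    top = sorted(vector.items(), key=lambda kv: -kv[1])[:3]
    print(question, "->", top)
```

Terms that occur in only one question (such as "parse") receive a higher weight than terms shared across questions (such as "xml" and "java"), which is what lets a clustering algorithm tell documents apart.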

All visualizations are available here: Stackexchange clustering using Mahout.

What are some interesting discoveries you made?

The clustering quality depends on how you perform data preparation. During this step you must pay close attention to which stop words you remove.
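
A small illustration of why the stop-word list matters, again as a sketch rather than anything from the project itself: the two questions and the stop-word set below are invented, and cosine similarity over raw term counts stands in for whatever distance measure the real pipeline uses.

```python
import math
import re
from collections import Counter

def term_counts(text, stop_words=frozenset()):
    """Count lowercase word tokens, skipping any that are in stop_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical stop-word list and made-up questions, for illustration only.
STOP_WORDS = frozenset({"how", "do", "i", "a", "in", "the", "to"})

q1 = "How do I sort a list in Python"
q2 = "How do I center a div in CSS"

sim_raw = cosine(term_counts(q1), term_counts(q2))
sim_clean = cosine(term_counts(q1, STOP_WORDS), term_counts(q2, STOP_WORDS))
print(f"with stop words:    {sim_raw:.2f}")
print(f"without stop words: {sim_clean:.2f}")
```

With the stop words left in, the two unrelated questions share five of their eight terms and look quite similar. Once the stop words are removed they share nothing, so a clustering algorithm no longer has a reason to group them.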

Stack Exchange Clustering using Mahout by Konstantin Slisenko

The K-Means clustering algorithm requires you to set the number of clusters K in advance. I want K to be calculated dynamically. For this reason I plan to find another algorithm.
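
The sketch below shows where K enters the algorithm and one common heuristic for choosing it. It is a plain-Python version of Lloyd's algorithm on made-up 2-D points, not Mahout's implementation, and the "run several values of K and compare the within-cluster sum of squares" approach is only an assumption about how one might proceed, not the method Konstantin plans to use.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    # K has to be known here: it fixes how many initial centroids are drawn.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss

# Made-up toy data with three visible groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

for k in range(1, 6):
    _, wcss = k_means(points, k)
    print(f"K={k}  within-cluster sum of squares={wcss:.2f}")
```

The within-cluster sum of squares generally shrinks as K grows, so the smallest value is not the answer by itself. The usual "elbow" heuristic looks for the K after which the improvement levels off. Because the starting centroids are random, a single run can also settle in a poor local optimum, so in practice each K is often run with several seeds.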

What do you want to do next on the project?

  • Use the publication date of posts to determine which topics are “hot” now.
  • Try some other clustering algorithms and calculate the number of clusters dynamically.
  • Build a classifier based on the clustered data.
  • Add more kinds of visualizations.
  • Apply cluster evaluation to say which clusters are “good” and which are “bad”.
  • Add indexed search over the clustered data.
  • I’m thinking of contributing to Apache Mahout by providing a utility for visualizing clustered data.

Learn More

Thanks Konstantin.

Do you have a machine learning side project?

If you have an interesting machine learning side project and are interested in being profiled like Konstantin, please contact me.

7 Responses to Project Spotlight: Stack Exchange Clustering using Mahout with Konstantin Slisenko

  1. Konstantin Slisenko March 14, 2014 at 5:55 pm

    Good post, thank you Jason! This motivates me to investigate machine learning!

    • jasonb March 15, 2014 at 7:08 am

      It’s a great project Konstantin!

  2. vijay June 17, 2016 at 1:19 am

    Great article. I’m interested in seeing some code. Also, your example site is down. Can you update the link?

  3. Jesús Martínez March 12, 2018 at 12:00 pm

    It’s really great you provide a space for machine learning enthusiasts to share their projects. You have a big audience and that can certainly help in many ways! I like this initiative. Keep it up!

  4. Amelie June 18, 2019 at 11:43 pm

    Hello Mr. Brownlee,

    I work with digital samples (600 samples), each consisting of 5 features.
    I want to apply clustering to this data.

    My questions are:

    1. Is there a way to determine the ideal cluster number that fits the number of samples?

    2. I would like to know whether there is a method that allows me to get a multi-class clustering.

    Thank you

    • Jason Brownlee June 19, 2019 at 8:06 am

      Sorry, I don’t have tutorials on clustering. I hope to cover the topic in the future.
