I hadn't done any lo-fi text analysis for ages, and since we've been using Slack at my job for a little while now, it was time to get munging!
My MVP was simplistic, but surprisingly good. I didn't record its output at the time, but in summary it used the Jaccard index between channel memberships as a measure of similarity.
I then did a fairly standard thing of inverting this into a distance measure, and visualising this with a D3 Force Layout. So far, so good, without even touching any messages!
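For the curious, here is a minimal sketch of what that MVP step might look like. It is not my original code (which I didn't keep): the `memberships` dict is made-up toy data, the function names are hypothetical, and I'm assuming the "inversion" is simply `1 - similarity`, which is one common choice among several.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty channels: treat as having no similarity (an assumption).
        return 0.0
    return len(a & b) / len(union)


def channel_distances(memberships):
    """Map each unordered channel pair to a distance in [0, 1].

    Assumes distance = 1 - Jaccard similarity.
    """
    distances = {}
    for c1, c2 in combinations(sorted(memberships), 2):
        distances[(c1, c2)] = 1.0 - jaccard(memberships[c1], memberships[c2])
    return distances


# Hypothetical toy data: channel name -> set of member names.
memberships = {
    "general": {"ann", "bob", "cat", "dan"},
    "python": {"ann", "bob"},
    "random": {"cat", "dan", "eve"},
}

distances = channel_distances(memberships)
for pair, dist in distances.items():
    print(pair, round(dist, 2))
```

Each pair's distance can then be used as the target link length in a force layout, so channels with overlapping members settle close together.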
Enter: Gensim. This is a really neat little Python library that I discovered via a colleague at Skyscanner.
It focusses on doing a small number of jobs, but well. I've been playing about with its word2vec implementation in another project, but here I used TF-IDF + Latent Semantic Indexing to produce channel similarities:
The procedure is similar to my MVP, except I treat each channel as a document, consisting of a big bag of words from up to the last 1000 messages. TF-IDF is applied to boost the effect of unusual words, and LSI extracts topics. I then compute all channel-to-channel similarities, convert them to distances, and visualise the result as a force layout.
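To make the channel-as-document idea concrete, here is a rough standard-library sketch of the TF-IDF weighting and the similarity-to-distance conversion. It is not the Gensim pipeline: it skips the LSI step entirely, the `channels` data is invented, the helper names are hypothetical, and the log-based IDF and `1 - cosine` distance are assumptions on my part.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps channel name -> list of tokens; returns sparse TF-IDF vectors."""
    n_docs = len(docs)
    # Document frequency: in how many channels does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        # Terms found in every channel get weight 0; rare terms are boosted.
        vectors[name] = {
            term: count * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: each channel is one bag of words
# (in the real thing, built from up to the last 1000 messages).
channels = {
    "python": "python pandas bug deploy python".split(),
    "data": "pandas python chart bug".split(),
    "lunch": "pizza lunch today pizza".split(),
}

vectors = tfidf_vectors(channels)
distance = {
    (a, b): 1.0 - cosine(vectors[a], vectors[b])
    for a in channels
    for b in channels
    if a < b
}
for pair, dist in sorted(distance.items()):
    print(pair, round(dist, 3))
```

In Gensim the equivalent pieces are `corpora.Dictionary`, `models.TfidfModel`, `models.LsiModel` and `similarities.MatrixSimilarity`, with the LSI model sitting between the TF-IDF weighting and the similarity computation.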
You can see the video of the output above, but you can also play with a visualisation of Skyscanner, lightly obfuscated.
To run this, you literally only need Python and a Slack API_TOKEN. I would love to see visualisations of other people's Slack channels. Please share them, or ping me on Twitter, and I can host them on GitHub!