
Predicting Russian Trolls Using Reddit Comments


Using Machine Learning to Predict Russian Trolls

Code for those Interested

Introduction

Russia has long maintained a contentious relationship with countries in the West. Vladimir Putin, the Russian President, is widely regarded as a Russian nationalist who will do anything to advance the interests of his country (Marten 2018). As a result, many Western countries (the United States, England, etc.) are constantly suspicious of his motives. Over the past several years, this relationship has soured further due to accusations of Russian interference in the 2016 US Presidential election.

There has been much discussion and public outrage pertaining to Russian influence on social media giants such as Facebook and Twitter. Russian troll farms put a significant amount of effort into creating fake accounts and Facebook pages to influence American voters (BBC, 2018).

Although this problem has been identified and a variety of companies have taken action to shut down thousands of troll accounts, it does not mean the trouble has ended. There is nevertheless fear that Russia may try to use social media for other devious purposes (Washington Post).

Russian Flag: Source- wikipedia.com

Although Facebook and Twitter have been highlighted by the media as the premier venues of Russian influence, Reddit represents another company that had to deal with Russian trolls. In 2017, Reddit released a list of over 900 accounts they suspected to be controlled by Russian troll farms (Thomsen 2018). Although they were able to track these users, it does not mean the Russians have given up on using Reddit as a platform to influence people.

The goal of this project is to develop an algorithm that can identify potential Russian trolls using the commenting activity of a user.

Now, there are many groups that I think would be interested in using my work. Obviously, Reddit would be interested; they want to make sure their community is clear of trolls. Reddit users and American citizens would also have a stake. Finally, the US Government may express an interest as well, given its anger over Russian interference in the election.

Problem Definition and Data

Russian propaganda is not something we want on US platforms. The problem I would like to solve is keeping these accounts from infiltrating Reddit. Ideally, I would like to produce a complex classifier that combines several data mining techniques to predict whether or not a user can be classified as a Russian propagandist.

I made use of two data sets: one containing the comment activity of Russian trolls and the other containing comments from random Reddit users. I obtained the Russian troll data set from data.world and the other data set (non-trolls) from Google BigQuery, where I randomly sampled several thousand comments from 2015 to 2018.

Snippet of Russian Troll Data

Now in terms of what the data actually looks like, both data sets contain around 7000 rows of comments. In addition to the comment text, every row includes the date posted, karma obtained, the controversy of the post and the Subreddit posted in. The data sets contain many columns, but only a handful are prominent: time posted, body (text of the comment), Subreddit posted to, the controversy of a post, the post responded to and the number of upvotes that the post received.

In the articles I read, there was a fixation on time. For example, “Prosecutors said Russian operatives would work shifts to make sure their posting times matched the timezone of the area they were pretending to be based.” I decided to look at the Subreddit posted to because one paper found that people on Reddit tend to comment on and interact with popular posts (Glenski 2017). Additionally, I felt that upvotes and controversy of a post were equally important. Were Russian trolls garnering a lot of negative attention, or did they figure out a way to attract many upvotes? However, among all components, I felt text was the most important. Did Russian trolls post with bad grammar or uncommon English words? Or, to state it more generally, were Russian trolls writing in a way that was distinct from normal Reddit users?

Description of Data

Methodology

After spending the first several weeks performing exploratory data analysis on the columns I was interested in, I began the process of feature engineering. In my opinion, feature engineering is the most difficult part of machine learning. No matter how state of the art your algorithm is, without good data, the final result will not be ideal. In my case, I had to use creativity and logic to come up with the features that I felt would differentiate the trolls from normal Reddit users. This is why I spent so much time on exploratory data analysis: by getting a view of the data, I could see how features differed between the two data sets.

Flesch-Kincaid Formula- Source: https://www.researchgate.net/figure/10-FLESCH-KINCAID-GRADE-LEVEL-SCORE-FORMULA_tbl5_237768392

Essentially, my project was a text classification task. As a result, the majority of my feature engineering revolved around natural language processing, and most of it was completed with the help of spaCy (an NLP library in Python). One feature that I created was the number of named entities in every comment. Named Entity Recognition (NER) “labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names” (Stanford). Another feature I made use of was the Flesch-Kincaid score, which is essentially a way of quantifying the readability of a piece of text {Textstat}. I also looked to identify the sentiment of every comment. The sentiment score is a measurement of how positive or negative a sentence is (ranging from negative one to one). To measure sentiment I made use of TextBlob, a text processing library in Python {Textblob}.

The lexical score is a feature name that I created. Making use of the Empath library, I created several topics using “seed words.” This library uses word embeddings trained on a large corpus to find words that are related to one another (Fast 2016), with the similarity between words measured by cosine similarity {Empath}. For the purpose of this project, I employed seed words related to Russia and Trump (all categories are located in the table below), and the library then found words related to these categories. Then, for every comment, I used my categories to calculate a score of how related the comment was to the categories I had created. In addition to these methods, I made use of simpler text features like punctuation and word count.
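The simpler text features described above can be sketched in plain Python. This is a minimal sketch and not the code used in the project (which relied on libraries such as Textstat): the function names are hypothetical and the syllable counter is a rough vowel-group heuristic, so scores will differ somewhat from a dedicated library. The grade-level formula is the standard Flesch-Kincaid one: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.

```python
import re
import string


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # A trailing 'e' is often silent ("time", "score"), so discount it.
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def simple_text_features(text: str) -> dict:
    """Word count, punctuation count and readability for one comment."""
    return {
        "word_count": len(text.split()),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "fk_grade": flesch_kincaid_grade(text),
    }


features = simple_text_features("The cat sat. The dog ran!")
print(features)
```

Very short, simple sentences produce a low (even negative) grade level, while long sentences with many multi-syllable words push the score up.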

Although the majority of my features were text-related, I also made use of non-text features. I had initially looked into Subreddits (what communities were users posting in?) to see if there were any significant differences. However, I noticed that nineteen of the twenty most popular Subreddits were shared by both the trolls and non-trolls. I also looked into the controversy of a post and found no considerable difference between the two groups. However, I did detect a difference in the times that posts were submitted to Reddit: a Kruskal-Wallis test showed that the difference was statistically significant. The final feature I decided to include was “relevant date.” This was another time-related feature. PBS had compiled a spreadsheet with all of the relevant events from the Russia-Trump timeline. If a comment was posted on a date present in this spreadsheet, I marked it with a one.
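To make the posting-time comparison concrete, here is a minimal pure-Python sketch of a Kruskal-Wallis test for two groups. It is an illustration under assumptions, not the project's code: the function names are hypothetical and the posting hours in the usage example are made up. For two groups the H statistic follows a chi-square distribution with one degree of freedom, so the p-value can be computed with `math.erfc`.

```python
import math
from collections import Counter


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for the Kruskal-Wallis test on two samples."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = (12 / (n * (n + 1))
         * (rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b))
         - 3 * (n + 1))
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = max(h / correction, 0.0)
    # Chi-square survival function with one degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Hypothetical posting hours (0-23) for each group.
troll_hours = [7, 8, 8, 9, 10, 11, 12, 13]
user_hours = [15, 17, 18, 19, 20, 21, 22, 23]
h_stat, p_value = kruskal_wallis_two_groups(troll_hours, user_hours)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A rank-based test like this one does not assume the posting times are normally distributed, which makes it a reasonable fit for hour-of-day data.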

Evaluation and Results

First I calculated a baseline. A baseline is a simple calculation that reflects the performance you would expect without a real predictive model. In this case, since exactly half of my comments came from Russian trolls and the other half came from normal Reddit users, I would expect a predictive algorithm to have, at worst, a 50 percent accuracy rate (along with 50 percent for other classification metrics like recall). After combining both data sets, I labeled each row either “troll” or “non-troll.” Then I randomly shuffled the rows in this combined data frame. After this, I created a list where the first half was labeled “troll” and the second half was labeled “non-troll.” I compared the labels of this list with the labels of the randomly shuffled DataFrame and checked what percentage of labels lined up. I attained an accuracy of 50 percent and found similar scores for the other evaluation metrics I tried.
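The baseline procedure described above can be sketched in a few lines of Python. This is a minimal sketch using plain lists rather than a DataFrame; the function name and the class size of 7000 (roughly the size of each data set) are assumptions for illustration.

```python
import random


def shuffled_baseline_accuracy(n_per_class: int, seed: int = 0) -> float:
    """Compare shuffled true labels against a fixed half/half guess list."""
    rng = random.Random(seed)
    # True labels for the combined, balanced data set, in random row order.
    true_labels = ["troll"] * n_per_class + ["non-troll"] * n_per_class
    rng.shuffle(true_labels)
    # Naive guess: first half "troll", second half "non-troll".
    guesses = ["troll"] * n_per_class + ["non-troll"] * n_per_class
    matches = sum(t == g for t, g in zip(true_labels, guesses))
    return matches / len(true_labels)


accuracy = shuffled_baseline_accuracy(7000)
print(f"baseline accuracy: {accuracy:.3f}")
```

Because the guesses carry no information about the shuffled labels, the accuracy lands close to 0.5 on balanced data, which is the floor any useful model has to beat.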

Non-Optimized Random Forest

After calculating a baseline, I decided to compare it with a Random Forest model. A Random Forest is a powerful ensemble learning method. I first fit a random forest with the default parameters of the scikit-learn implementation. Although the results were solid, I knew that they could be improved, so I used a randomized search to tune the parameters. Given a distribution of possible values for each parameter, a randomized search samples a fixed number of parameter combinations, fits a model for each and keeps the best one. Bergstra et al. found that randomized searches are more efficient than normal grid searches. The classification reports from both the default random forest and the optimized model are present in the images to the right.
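The idea behind a randomized search can be illustrated without any machine learning library. The sketch below is generic and is not the scikit-learn implementation: the parameter names mirror common random forest settings, but the ranges are hypothetical, and `toy_score` is a stand-in for what would normally be a cross-validated accuracy of a fitted model.

```python
import random


def random_search(sample_params, evaluate, n_iter=20, seed=0):
    """Sample n_iter parameter sets and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def sample_forest_params(rng):
    """Draw one candidate from hypothetical parameter distributions."""
    return {
        "n_estimators": rng.randint(50, 500),
        "max_depth": rng.choice([None, 5, 10, 20, 40]),
        "min_samples_leaf": rng.randint(1, 10),
    }


def toy_score(params):
    """Stand-in objective that peaks at 300 trees and a leaf size of 3."""
    return (-abs(params["n_estimators"] - 300) / 500
            - abs(params["min_samples_leaf"] - 3) / 10)


best_params, best_score = random_search(sample_forest_params, toy_score, n_iter=50)
print(best_params, round(best_score, 3))
```

Unlike an exhaustive grid, the number of fitted models here is fixed by `n_iter` regardless of how many parameters or candidate values are involved, which is where the efficiency gain comes from.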

Optimized Random Forest
Optimized Parameters

The results show a clear improvement over the baseline, although there is still plenty of room to improve. One particularly noteworthy part of these results is the feature importance graph. Among all the variables I used, time posted was by far the most important. Flesch-Kincaid score and punctuation count were a distant second and third.

Performance for all three models

Discussion

The results I found through this project were very eye-opening. As someone who is a relative novice in the world of data mining, machine learning, and natural language processing, I thought that these results were very promising. Although a model with less than 75 percent accuracy is clearly not production level (e.g. Reddit should not use it), it is a great improvement from my baseline scoring. The recall and precision scores were fairly similar to my accuracy scores, indicating that this model is fairly consistent.
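For readers unfamiliar with the metrics mentioned above, the following sketch shows how accuracy, precision and recall are computed from true and predicted labels. The function name and the example labels are hypothetical; in practice a classification report from a library would provide the same numbers.

```python
def classification_metrics(y_true, y_pred, positive="troll"):
    """Compute accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the comments flagged as troll, how many really were?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real troll comments, how many were flagged?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = ["troll", "troll", "non-troll", "non-troll"]
y_pred = ["troll", "non-troll", "non-troll", "troll"]
print(classification_metrics(y_true, y_pred))
```

When precision and recall are both close to accuracy on a balanced data set, the model is making roughly as many false alarms as misses rather than favoring one class.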

Example Output from Decision Tree

The fairly high accuracy level indicates that there are some underlying differences in the way Russian trolls use Reddit. Specifically, there appears to be quite a significant difference in posting times. This is an odd finding, especially since the BBC found that “Russian operatives would work shifts to make sure their posting times matched the time zone of the area they were pretending to be based.” The other two significant factors, punctuation count and Flesch-Kincaid score (text complexity), indicate that these Russian operatives may be writing in a different style from normal Reddit users. The New York Times pointed out that Russian trolls used odd English on Facebook, and this trend appears to hold on Reddit. One reason I hypothesize that Russian trolls have fairly different punctuation counts and Flesch-Kincaid scores is that they do not realize that most people use Reddit as an informal medium. They are trying too hard to appear as if they are normal Reddit users.

Now, although some of my features appeared to be fairly significant, others were not. I was slightly disappointed that “relevant date” was not very important. I would gather that this was because too many dates were marked as relevant. Perhaps if the list were trimmed to the 25 most important events, this feature would become more prominent. Lexical score also seemed to be irrelevant for the purposes of my model. Maybe I included too many topics?

Most Important Features

Conclusion

Russian interference looks set to be a problem that will persistently haunt the US unless platforms take significant steps to quell the trolls. Facebook and Twitter have already taken steps in the right direction. But Reddit, a website with fewer than 100 employees, may not possess the resources to fight off these trolls. However, I am hopeful that the steps I have taken in this project will prove to be at least somewhat useful for identifying Russian trolls on the platform. Although human behavior is admittedly challenging to model, at least some of the features I added showed significant differences between trolls and non-trolls.

As I pointed out earlier, this project was an excellent starting point. However, there is absolutely room for improvement. I think there are different ways to approach the troll detection problem, especially on Reddit. I looked exclusively at comment activity, when submissions may clearly be just as useful for finding trolls. In addition, I was not examining the activity of individual trolls; instead, I was treating comments as if they were all independent. The posting and commenting patterns of an individual user may be useful for troll detection purposes. At least in the case of Twitter, this approach has been shown to achieve very good results (Fornacciari 2018).
