Text Forensics: The Blueprint for Advanced Disinformation Solutions

Text Forensics (Image by Author)

Problem

Each individual is impacted by the spread of misinformation and fake news on social media platforms and across the internet. Every piece of information consumed online shapes a small part of our memory, and although that influence may seem minor at first, its impact compounds over time. This is explored under the umbrella of the “misinformation effect”: the tendency of information encountered after an event to interfere with one’s memory of it [Cherry].

The current state of the world, and our understanding of past events, demand active measures that prevent misinformation from interfering with a truthful, realistic understanding of the world around us.

There are many solutions that address this problem, ranging from the enhancement of social media algorithms to inoculation theory. However, our goal with Text Forensics is to provide a simple solution that could serve as the basis for further development.

Solution

Text Forensics is a simple web application with a single text input field. A user enters a piece of text and clicks “perform analysis”. The web app then analyzes the text and returns its conclusions. For this iteration of the website, three categories of data are output:

1. Polarity Score & Sentiment Detection

Sentiment is the overall mental attitude present within a statement. The polarity score is a numerical measurement of that sentiment, calculated from the frequency of value-laden words (also known as loaded words) in the text. The polarity score can be used to determine the sentiment of the author, or the contextual feeling toward the topic.

To achieve this, the AFINN sentiment analysis GitHub repository is utilized [Nielsen]. The author, Finn Årup Nielsen, analyzed each word in the list and assigned it a polarity score [Nielsen]. A negative integer indicates negative connotations in terms of sentiment; a positive integer indicates positive connotations. Using the AFINN word list, the inputted text is analyzed and given a polarity score.

The total polarity score of the entire text is then used to determine the sentiment that appears within it.
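The scoring described above can be sketched in plain Python. The tiny lexicon below is a hypothetical stand-in for the real AFINN list, which assigns each word an integer polarity from -5 to +5:

```python
# A minimal sketch of AFINN-style polarity scoring.
# TINY_LEXICON is a hypothetical placeholder; the real AFINN word list
# contains thousands of entries, each scored from -5 to +5.
TINY_LEXICON = {
    "good": 3, "great": 3, "love": 3,
    "bad": -3, "terrible": -3, "hate": -3,
}

def polarity_score(text: str) -> int:
    """Sum the polarity values of every lexicon word found in the text."""
    words = text.lower().split()
    return sum(TINY_LEXICON.get(word, 0) for word in words)

print(polarity_score("the film was great but the ending was terrible"))  # → 0
```

A text with one strongly positive and one strongly negative loaded word cancels out to a neutral total, which is exactly why the aggregate score is a useful summary of tone.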

Polarity Score to Sentiment Label (Image by Author)

The purpose of this category is to inform the user of the general tone present within the text. By knowing, for example, that the loaded words within the text lead to a “very negative” or “very positive” polarity score, the reader can recognize that the mental attitude within the text may make the source unsuitable for reliable use.

The labels “Very Negative” and “Very Positive” are grouped together because, fundamentally, they have the same impact. Just as an author can speak very positively about their own argument, they can speak very negatively about someone else’s to achieve an identical goal: persuading the reader that their argument is more justifiable. This often raises awareness of one perspective while the other is kept unknown.

The tone of a passage should usually be “Negative”, “Neutral”, or “Positive”. A moderately negative, neutral, or positive polarity score indicates, with higher probability, that multiple perspectives are presented in the text. In addition, it is justifiable for the tone of a passage to be slightly “Negative” or “Positive”, since this illustrates that an opinion is presented with reasonable bias. For instance, a group of researchers could be evaluating a solution for reducing carbon footprint. The solution is effective in quality but inefficient in its use of energy, and the researchers are chiefly concerned with efficiency. Their overall tone would therefore be negative toward the solution, since it does not meet their requirements.
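The mapping from total polarity score to sentiment label can be sketched as a simple threshold function. The cut-off values below are hypothetical placeholders, not the exact thresholds Text Forensics uses:

```python
# A minimal sketch of mapping a total polarity score to a sentiment label.
# The threshold values here are hypothetical; the actual cut-offs used by
# Text Forensics may differ.
def sentiment_label(score: int) -> str:
    if score <= -10:
        return "Very Negative"
    if score < 0:
        return "Negative"
    if score == 0:
        return "Neutral"
    if score < 10:
        return "Positive"
    return "Very Positive"

print(sentiment_label(-12))  # → Very Negative
print(sentiment_label(4))    # → Positive
```

Under this scheme, only extreme totals earn the “Very” labels that, as discussed above, suggest a one-sided source.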

2. Fake News Detection (Factuality)

This category is concerned with the validity of a piece of text: whether it is misinformation and fake news, or real and truthful information. Its purpose is to alert the reader to potential misinformation in the text they are consuming and, more importantly, to encourage them to seek out better informational sources.

For this category, a fake news detection model classifies a text as “Most Likely Valid News” or “Most Likely Misinformation” through natural language processing and linguistic analysis.

Dataset

To create the model, this dataset was utilized for the following reasons [“REAL and”]:

  1. The dataset contains a balanced ratio of fake news and real news, which leads to less technical bias regarding which category the algorithm should select as a prediction.

  2. It contains 3,898 pieces of text (title and content) for each category, fake and real news, which is adequate for a fairly accurate model.

From this dataset, we only utilized the “content” and “label” columns, since this fake news detection model was aimed at the actual content of an article, not its title or headline.

Data Preprocessing

For data preprocessing, all pieces of text were tokenized, stripped of stopwords, and finally stemmed. These steps cleaned and prepared the text for the subsequent step.

Feature Extraction

For feature extraction, the bag-of-words (BoW) model was utilized. In this process, each unique word in the text is assigned an index in a one-dimensional array, and that index records the frequency of the word across the whole text.

Bag-of-Words (BoW) feature extraction model
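The BoW mechanics above can be sketched in plain Python; the tiny documents below are hypothetical examples (in practice a library such as scikit-learn’s CountVectorizer does this):

```python
# A minimal sketch of bag-of-words feature extraction: every unique word
# in the corpus gets an index, and each document becomes a vector of
# word frequencies at those indices.
def build_vocabulary(documents: list[list[str]]) -> dict[str, int]:
    vocab = {}
    for doc in documents:
        for word in doc:
            vocab.setdefault(word, len(vocab))  # assign next free index
    return vocab

def bow_vector(doc: list[str], vocab: dict[str, int]) -> list[int]:
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab[word]] += 1  # count frequency at the word's index
    return vec

docs = [["fake", "news", "spreads"], ["real", "news", "news"]]
vocab = build_vocabulary(docs)
print(bow_vector(docs[1], vocab))  # → [0, 2, 0, 1]
```

Every document, regardless of length, becomes a fixed-length frequency vector, which is the form the classifier in the next step consumes.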

Algorithm

Using the bag-of-words model, the algorithm compares the word frequencies in the one-dimensional vector of the inputted text to those of other texts, and based on whether each training text is fake or real news, it learns to classify between the two.

There are various algorithms to choose from, and for this project, logistic regression was identified through experimentation as the most effective and accurate. The fake news detection model built with logistic regression achieved the highest validation accuracy, 92.66%, meaning that out of 100 texts, the model categorizes only ~7–8 incorrectly.
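For illustration, here is a from-scratch sketch of training a logistic regression classifier on BoW vectors with plain gradient descent. The toy vectors and labels are invented for this example; the actual model was trained on the Kaggle dataset described above, and in practice a library implementation (e.g. scikit-learn’s LogisticRegression) would be used:

```python
# A minimal sketch of logistic regression trained with stochastic
# gradient descent on toy bag-of-words vectors. Labels: 1 = fake news,
# 0 = real news. The data below is purely illustrative.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)

X = [[2, 0, 1], [1, 0, 2], [0, 2, 0], [0, 1, 1]]  # toy BoW vectors
y = [1, 1, 0, 0]
w, b = train(X, y)
print([predict(x, w, b) for x in X])
```

The model learns a weight per vocabulary index, so words whose frequencies correlate with the fake-news label end up with positive weights and push the prediction toward “Most Likely Misinformation”.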

[The explanations above are short descriptions of each category. To learn more about natural language processing, please read this article.]

3. Emotion Detection

This category of analysis informs the user of the specific emotion presented within the text: “Sadness”, “Joy”, “Love”, “Anger”, or “Fear”. In a literature review of the “psychology of fake news”, Gordon Pennycook and David Rand explain that the more people experience emotions such as anger, fear, and outrage, the more likely they are to believe fake news [Pennycook]. This underscores the necessity of understanding the emotions within a piece of text.

To achieve this, a process identical to fake news detection was used, except that this dataset was utilized, because it contains examples of various emotions within text [Saravia]. The validation accuracy of this model is 86.18%.

Blueprint for Future Advancements

Blueprint for future advancements (Image by Author)

Text Forensics, as presented, utilizes three categories of data to inform users about the underlying components of a piece of text. However, it is simply a minimum viable product (MVP), illustrating that a solution like this could be an effective response to misinformation and fake news.

The current iteration can be significantly improved in two areas:

1. Inaccessible & Inefficient Process

It is not worthwhile for a user to visit this web application, enter a text, and only then see the results. There are too many steps involved in this process, and the more steps there are, the less likely the user is to actually use the web application.

For that reason, these functionalities should be integrated in more seamless ways. This could take the form of a web browser extension, or API integration within other media sources.

2. Limited Categories & Insight

The number of categories and insights presented for each text is relatively limited compared to what NLP can achieve. Considering the many emerging fields, such as political bias detection, stance detection, text extraction, and text summarization, future iterations of the website could derive many more insights from a text.

Conclusion

Text Forensics is a web application that uses several high-accuracy NLP models to share insights about a text with its users. Although the web application is not easily accessible, it provides a blueprint for future creations, including integration into various websites for the benefit of the user.


Thank you for reading this article. If you have any feedback for my writing, or have found a mistake in this article, please message me using the Contact Us page.


References

Cherry, Kendra. “How Does Misinformation Influence Our Memories of Events?” Verywell Mind, Verywell Mind, 11 May 2022, https://www.verywellmind.com/what-is-the-misinformation-effect-2795353.

Nielsen, Finn Årup. “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs.” ArXiv, ArXiv, 15 Mar. 2011, https://arxiv.org/abs/1103.2903.

Pennycook, Gordon, and David G. Rand. “The Psychology of Fake News.” Trends in Cognitive Sciences, vol. 25, no. 5, 2021, pp. 388–402., https://doi.org/10.1016/j.tics.2021.02.007.

“REAL and FAKE News Dataset.” Kaggle, 31 Oct. 2019, https://www.kaggle.com/datasets/nopdev/real-and-fake-news-dataset.

Saravia, Elvis, et al. “Carer: Contextualized Affect Representations for Emotion Recognition.” Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, https://doi.org/10.18653/v1/d18-1404.

Rdn

Contributor @ Universal Times
