Online content analysis
Online content analysis or online textual analysis refers to a collection of research techniques that social scientists use to test hypotheses about the content of discourses and Internet-based messages. Online content analysis is a form of content analysis that has been developed for, and is applied to, the study of Internet-based communication.
Content analysis
Content and textual analysis consist of labeling units of text (e.g., sentences, quasi-sentences, paragraphs, documents, or web pages) in order to build quantitative measures for scientific inference. According to Berelson (1952),[1] textual analysis is a “systematic technique for analyzing message content and message handling.”
Social scientists have used this technique to investigate research questions related to: mass media,[2] media effects,[3] and agenda setting.[4] Although there are multiple ways of labeling and analyzing texts, McMillan (2000)[1] suggests that there are five main pillars across the different variants:
- The goal of using text-analysis techniques is to help solve specific research puzzles and questions.
- The researcher selects a sample of texts that will help address the research questions.
- The researcher defines a set of categories (labels) that they will use to classify the different text units and to test their argument and hypotheses.
- One or more coders perform the labeling, and the researchers report the validity of the method (e.g., inter-coder reliability).
- Researchers proceed with the analysis and report whether the data reject the null hypotheses.
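Inter-coder reliability is often summarized with a chance-corrected agreement statistic such as Cohen's kappa. The following is a minimal Python sketch of that computation; the category names and coder labels are invented purely for illustration:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders labeling the same units."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: share of units both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two coders classifying six text units.
coder_a = ["econ", "econ", "health", "econ", "health", "other"]
coder_b = ["econ", "health", "health", "econ", "health", "other"]
print(round(cohens_kappa(coder_a, coder_b), 3))  # → 0.739
```

Values near 1 indicate strong agreement beyond chance; researchers typically report this statistic alongside the codebook so readers can judge the reliability of the labeling.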
Textual analysis and Internet research
Since the emergence of the Internet, scholars have discussed how to adapt textual analysis techniques to study the World Wide Web. In general, these scholars highlight two main differences between printed and online text. First, web pages are dynamic.[1] Whereas printed texts (for example, a printed newspaper) are static and do not change, online content (e.g., an online newspaper) changes continuously. Thus, when studying online text sources, researchers need to think about how to incorporate this dynamic component into their research designs. Second, online texts are non-linear.[5] Instead of having a beginning and an end, web pages have multiple layers, and each user can choose a different path and have a different experience. Moreover, with the increasing number of devices with Internet access (such as tablets and smartphones), the ways people consume and process information on the Internet are even more diverse.
Besides these differences between printed and online texts, textual analysis of online sources poses another major challenge: how to identify the universe the researcher wants to explain and how to select a random sample from it.[6] The Internet is immense and continuously growing. Every second, individuals and groups post new information online (social media posts, news stories, web pages, blog posts, pictures, etc.). Hence, it is an arduous task to identify how many sources of one (or multiple) types exist; and although some search engines may help researchers identify the sources of information they are looking for (e.g., Google), the sampling methods employed by these engines are probably not random, and researchers do not usually have access to the engines' algorithms. Thus, since it is very difficult to obtain random samples, social scientists studying online sources are increasingly trying to analyze the whole universe of texts they aim to explain.
Computational textual analysis: text as data
The price of computer memory is continuously decreasing and machines are becoming more computationally efficient. This allows researchers to use techniques that they could not use years ago (such as methods for Bayesian inference). As a result, an increasing number of social scientists today use automatic text-analysis techniques to examine not just samples but the whole universe they aim to study. In particular, these computational techniques have become more relevant since the emergence of multiple social media platforms (Facebook, Twitter, Instagram, LinkedIn, etc.), which allow social science researchers to study how individuals, organizations, and institutions behave by streaming and analyzing large amounts of online text.[7]
There are two main types of automatic textual-analysis techniques: supervised and unsupervised methods. On the one hand, supervised methods, such as dictionary labeling or supervised machine-learning algorithms, require some preliminary manual labeling to train the machine; the machine then does the rest of the labeling. On the other hand, unsupervised methods use a set of statistical assumptions (chosen by the researchers) to automatically sort pieces of text into distinct categories. The following are some of the main supervised and unsupervised methods:[8]
- Supervised Methods:
- Dictionary Methods: the researcher pre-selects a set of keywords (n-grams) for each category. The machine then uses these keywords to classify each text unit into its category.
- Individual Methods: the researcher pre-labels a sample of texts and trains a machine-learning algorithm (e.g., an SVM) on them. The machine then labels the rest of the observations by extrapolating from the training set.
- Ensemble Methods: instead of using only one machine-learning algorithm, the researcher trains a set of them and uses the resulting multiple labels to label the rest of the observations (see Collingwood and Wilkerson 2011 for more details about ensemble methods).[9]
- Supervised Ideological Scaling (i.e., wordscores): to place different text units on an ideological continuum, the researcher selects two texts (or two sets of texts) that represent each ideological extreme. The machine then looks for words that belong to each extreme and scales the remaining texts depending on how many words from each extreme reference they contain.[10]
- Unsupervised Methods:
- Single membership models: these models automatically cluster texts into mutually exclusive categories. As pointed out by Grimmer and Stewart (2013:16), “each algorithm has three components: (1) a definition of document similarity or distance; (2) an objective function that operationalizes an ideal clustering; and (3) an optimization algorithm.”[11]
- Mixed membership models (topic models): according to Grimmer and Stewart (2013:17), mixed membership models “improve the output of single-membership models by including additional and problem-specific structure.”[11] One of the most widely used topic-modeling techniques is latent Dirichlet allocation (LDA).
- Unsupervised Ideological Scaling (i.e., wordfish): algorithms that place text units on an ideological continuum depending on shared grammatical content. Unlike supervised scaling methods such as wordscores, methods such as wordfish[12] do not require the researcher to provide samples of ideologically extreme texts.
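The dictionary approach described above can be sketched in a few lines of Python. The categories and keyword lists here are hypothetical; a real study would derive them from theory and pilot coding:

```python
import re

# Hypothetical keyword dictionaries mapping each category to its n-grams
# (unigrams here, for simplicity).
DICTIONARY = {
    "economy": {"market", "budget", "tax", "inflation"},
    "health": {"hospital", "vaccine", "disease", "doctor"},
}

def dictionary_label(text):
    """Assign the category whose keywords appear most often in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {cat: sum(t in keywords for t in tokens)
              for cat, keywords in DICTIONARY.items()}
    best = max(counts, key=counts.get)
    # Text units matching no keyword at all are left unclassified.
    return best if counts[best] > 0 else "uncategorized"

print(dictionary_label("The budget deal will cut tax rates."))  # → economy
```

The method is transparent and fast, but its validity depends entirely on how well the keyword lists capture each category.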
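The train-then-extrapolate logic of the individual supervised methods described above can be illustrated with a toy classifier. Applied work typically uses algorithms such as SVMs; here a minimal multinomial Naive Bayes in pure Python stands in for them, and the training texts and labels are invented:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_nb(labeled_docs):
    """Fit a multinomial Naive Bayes model from hand-labeled training texts."""
    word_counts = defaultdict(Counter)  # per-class token counts
    class_counts = Counter()            # per-class document counts
    vocab = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Label an unseen text with the most probable class (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        logp = math.log(class_counts[label] / n_docs)
        for tok in tokenize(text):
            logp += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical hand-coded training sample.
train = [
    ("tax cuts and the federal budget", "economy"),
    ("inflation hits the stock market", "economy"),
    ("new vaccine trial at the hospital", "health"),
    ("doctors warn about the disease outbreak", "health"),
]
model = train_nb(train)
print(classify("the budget raises tax revenue", model))  # → economy
```

After training on the manually labeled sample, the model extrapolates to the remaining, unlabeled observations, which is the defining feature of this family of methods.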
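The supervised scaling logic described above (as in wordscores) can also be sketched. This is a stylized version rather than the exact published estimator: two reference texts are anchored at -1 and +1, each word receives a score based on how strongly it is associated with each reference, and an unseen ("virgin") text is scored by averaging its words' scores. The example texts are invented:

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def wordscores(ref_left, ref_right, virgin):
    """Scale a virgin text between reference texts anchored at -1 and +1."""
    freq = {-1.0: Counter(tokenize(ref_left)), 1.0: Counter(tokenize(ref_right))}
    # Relative frequency of each word within each reference text.
    rel = {pos: {w: c / sum(f.values()) for w, c in f.items()}
           for pos, f in freq.items()}
    scores = {}
    for w in set(rel[-1.0]) | set(rel[1.0]):
        p_left = rel[-1.0].get(w, 0.0)
        p_right = rel[1.0].get(w, 0.0)
        # Word score: anchor positions weighted by the word's association
        # with each reference text.
        scores[w] = (-1.0 * p_left + 1.0 * p_right) / (p_left + p_right)
    v_tokens = [t for t in tokenize(virgin) if t in scores]
    return sum(scores[t] for t in v_tokens) / len(v_tokens)

# Hypothetical reference texts for the two ideological extremes.
left = "regulate markets protect workers unions"
right = "free markets low tax deregulation"
print(round(wordscores(left, right, "low tax and free markets with some unions"), 2))
# → 0.4 (closer to the right-hand reference)
```

Words appearing only in one reference pull the virgin text toward that pole, while words shared equally by both references (here, "markets") are uninformative and score zero.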
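The three components Grimmer and Stewart identify for single-membership clustering can be made concrete with a small sketch. Here the similarity definition is cosine similarity between bag-of-words vectors, and a single k-means-style assignment step stands in for the full optimization; the documents and seed texts are invented:

```python
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Component (1): a definition of document similarity.
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: sum(c * c for c in v.values()) ** 0.5
    return dot / (norm(a) * norm(b))

def cluster(docs, seeds):
    """Assign each text to its most similar seed (mutually exclusive clusters).

    A full algorithm would also re-estimate cluster centers iteratively
    (component (3), optimization) to improve an objective such as total
    within-cluster similarity (component (2))."""
    vectors = [bow(d) for d in docs]
    centers = [bow(s) for s in seeds]
    return [max(range(len(centers)), key=lambda k: cosine(v, centers[k]))
            for v in vectors]

docs = ["tax and budget policy", "hospital vaccine plan",
        "market inflation report", "disease outbreak response"]
seeds = ["budget tax market inflation", "hospital vaccine disease outbreak"]
print(cluster(docs, seeds))  # → [0, 1, 0, 1]
```

Each document ends up in exactly one cluster, which is what distinguishes single-membership models from the mixed-membership topic models discussed next.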
Current and future challenges of textual analysis for Internet research
Despite the continuous evolution of text analysis in the social sciences, there are still some unsolved methodological concerns. The following is a (non-exhaustive) list of some of these concerns:
- When should researchers define their categories? Ex ante, back-and-forth, or ad hoc? Some social scientists argue that researchers should build their theory, expectations, and methods (in this case, the specific categories they will use to classify different text units) before they start collecting and studying the data,[13] whereas others argue that defining a set of categories is a back-and-forth process.[14][6]
- Validation. Although most researchers report validation measurements for their methods (e.g., inter-coder reliability, precision and recall estimates, confusion matrices, etc.), some do not. In particular, a growing number of academics are concerned about how some topic-modeling techniques can hardly be validated.[15]
- Random samples. On the one hand, it is extremely hard to know how many units of one type of text (for example, blog posts) exist on the Internet at a given time. Thus, since most of the time the universe is unknown, how can researchers select a random sample? If in some cases it is almost impossible to obtain a random sample, should researchers work with samples, or should they try to collect all the text units that they observe? On the other hand, researchers sometimes have to work with samples given to them by search engines (e.g., Google) and online companies (e.g., Twitter), without access to how those samples were generated or to whether they are random. Should researchers use such samples?
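The validation measurements mentioned above (precision, recall, and confusion matrices) compare an automatic classifier's output against a hand-coded "gold standard". A minimal Python sketch, with invented labels:

```python
from collections import Counter

def validation_report(gold, predicted, positive):
    """Precision and recall of automatic labels against hand-coded ones."""
    pairs = Counter(zip(gold, predicted))  # confusion-matrix cells
    tp = pairs[(positive, positive)]
    fp = sum(c for (g, p), c in pairs.items() if p == positive and g != positive)
    fn = sum(c for (g, p), c in pairs.items() if g == positive and p != positive)
    # Precision: of the units the machine labeled positive, how many were right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the truly positive units, how many did the machine find?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical hand-coded labels vs. an automatic classifier's output.
gold      = ["econ", "econ", "health", "econ", "health"]
predicted = ["econ", "health", "health", "econ", "econ"]
print(validation_report(gold, predicted, "econ"))
```

Reporting these figures per category, rather than overall accuracy alone, makes it easier for readers to judge where an automatic method succeeds and where it fails.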
References
- ↑ 1.0 1.1 1.2 McMillan, S. J. (2000). The microscope and the moving target: The challenge of applying content analysis to the World Wide Web. Journalism and Mass Communication Quarterly, 77(1), 80-98.
- ↑ Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.
- ↑ Riffe, D., Lacy, S., & Fico, F. (1998). Analyzing media messages: Using quantitative content analysis in research. Mahwah, NJ: Lawrence Erlbaum.
- ↑ Baumgartner, Frank, and Bryan Jones. (1993). Agendas and Instability in American Politics. Chicago: University of Chicago Press.
- ↑ Van Selm, Martine & Jankowski, Nick, (2005) "Content Analysis of Internet-Based Documents." Unpublished Manuscript.
- ↑ 6.0 6.1 Herring, Susan C. (2009). Hunsinger, Jeremy, ed. Web Content Analysis: Expanding the Paradigm (in English). Springer Netherlands. pp. 233–249. ISBN 978-1-4020-9788-1. Retrieved 2015-04-11.
- ↑ For example, see: Barberá, P., Bonneau, R., Egan, P., Jost, J. T., Nagler, J., & Tucker, J. (2014). Leaders or Followers? Measuring Political Responsiveness in the US Congress Using Social Media Data. Presented at the Annual Meeting of the American Political Science Association; Barberá, Pablo. (2015). Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis, 23(1), 76-91; Van Dijck, Jose. (2013). You have one identity: performing the self on Facebook and LinkedIn, in Media, Culture & Society 16:1138-1153.
- ↑ These are some of the main methods that Grimmer and Stewart (2013) point out: Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis.
- ↑ Collingwood, Loren, and John Wilkerson. (2011). Tradeoffs in Accuracy and Efficiency in Supervised Learning Methods, in The Journal of Information Technology and Politics, Paper 4.
- ↑ Gerber, Elisabeth, and Jeff Lewis. 2004. Beyond the median: Voter preferences, district heterogeneity, and political representation. Journal of Political Economy 112(6):1364–83.
- ↑ 11.0 11.1 Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis.
- ↑ Slapin, Jonathan, and Sven-Oliver Proksch. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52(3):705–22.
- ↑ King, Gary, Robert O. Keohane, & Sidney Verba. (1994). Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press.
- ↑ Saldaña, Johnny. (2009). The Coding Manual for Qualitative Researchers. London: SAGE Publications Ltd.
- ↑ Chuang, Jason, John D. Wilkerson, Rebecca Weiss, Dustin Tingley, Brandon M. Stewart, Margaret E. Roberts, Forough Poursabzi-Sangdeh, Justin Grimmer, Leah Findlater, Jordan Boyd-Graber, and Jeffrey Heer. (2014). Computer-Assisted Content Analysis: Topic Models for Exploring Multiple Subjective Interpretations. Paper presented at the Conference on Neural Information Processing Systems (NIPS), Workshop on Human-Propelled Machine Learning. Montreal, Canada.