Online content analysis

Online content analysis or online textual analysis refers to a collection of research techniques that social scientists use to test hypotheses about the content of discourses and Internet-based messages. Online content analysis comprises content analysis methods that have been created for, and are applied to, the analysis of Internet-based communication.

Content analysis

Main article: Content analysis

Content and textual analysis consist of labeling units of text (e.g. sentences, quasi-sentences, paragraphs, documents, web pages) in order to build quantitative measures for scientific inference. According to Berelson (1952),[1] textual analysis is a “systematic technique for analyzing message content and message handling.”

Social scientists have used this technique to investigate research questions related to mass media,[2] media effects,[3] and agenda setting.[4] Although there are multiple ways of labeling and analyzing texts, McMillan (2000)[1] suggests that there are five main pillars across the different variants:

  1. The goal of using text analysis techniques is to help solve specific research puzzles and questions.
  2. The researcher selects a sample of texts that will help address the research questions.
  3. The researcher defines a set of categories (labels) that they will use to classify the different text units and to test the argument and hypotheses.
  4. One or multiple coders perform the labeling, and the researchers report the reliability of the coding process (inter-coder reliability; see the sketch after this list).
  5. Researchers proceed with the analysis and report whether the data reject the null hypotheses.
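
Inter-coder reliability (step 4) is commonly quantified with chance-corrected agreement statistics such as Cohen's kappa. The following is a minimal sketch of that computation in Python; the coder labels are a hypothetical toy example, not real data.

    from collections import Counter

    def cohen_kappa(coder_a, coder_b):
        """Agreement between two coders, corrected for chance agreement."""
        n = len(coder_a)
        # Observed agreement: share of units both coders labeled identically.
        observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
        # Expected agreement: chance overlap given each coder's label frequencies.
        freq_a, freq_b = Counter(coder_a), Counter(coder_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Hypothetical labels two coders assigned to the same six text units.
    coder_a = ["economy", "other", "economy", "economy", "other", "other"]
    coder_b = ["economy", "other", "other", "economy", "other", "other"]
    print(round(cohen_kappa(coder_a, coder_b), 3))  # prints 0.667

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance.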

Textual analysis and Internet research

Since the emergence of the Internet, scholars have discussed how to adapt textual analysis techniques to study the World Wide Web. In general, these scholars highlight two main differences between printed and online text. First, web pages are dynamic.[1] Whereas printed texts (for example, a printed newspaper) are static and do not change, online content (e.g. online newspapers) changes continuously. Thus, when studying online text sources, researchers need to think about how to incorporate this dynamic component into their research design. Second, online texts are non-linear.[5] Instead of having a beginning and an end, web pages have multiple layers, and each user can choose a different path and have a different experience. Moreover, with the increasing number of devices with Internet access (such as tablets and smartphones), the ways people consume and process information on the Internet are even more diverse.

Besides these differences between printed and online texts, textual analysis of online sources poses another big challenge: how to identify the universe the researcher wants to explain and how to select a random sample from it.[6] The Internet is immense and continuously growing: every second, individuals and groups post new information online (social media posts, news stories, web pages, blog posts, pictures, etc.). Hence, it is an arduous task to identify how many sources of one (or multiple) types exist; and although some search engines (e.g. Google) may help researchers identify the sources of information they are looking for, the sampling implied by these engines is probably not random, and researchers do not usually have access to the engines' algorithms. Thus, since random samples are very hard to obtain, social scientists studying online sources increasingly try to analyze the whole universe of texts they aim to explain.

Computational textual analysis: text as data

See also: Text mining

The price of computer memory is continuously decreasing, and machines are becoming more computationally efficient. This allows researchers to use techniques that they could not use years ago (such as methods for Bayesian inference). As a result, an increasing number of social scientists today use automated text-analysis techniques to look at not just samples but the whole universe they aim to study. These computational techniques have become especially relevant since the emergence of multiple social media platforms (Facebook, Twitter, Instagram, LinkedIn, etc.), which allow social science researchers to study how individuals, organizations, and institutions behave by collecting and analyzing large amounts of online text.[7]

There are two main types of automatic textual-analysis techniques: supervised and unsupervised methods. On the one hand, supervised methods, such as dictionary-based labeling or supervised machine learning algorithms, require some preliminary manual labeling to train the machine, which then does the rest of the text labeling. On the other hand, unsupervised methods use a set of statistical assumptions (chosen by the researcher) to automatically sort pieces of text into distinct categories. Grimmer and Stewart (2013) review several of the main supervised and unsupervised methods;[8] illustrative sketches of both families follow below.
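
As a minimal sketch of the supervised family, the following Python example combines a dictionary-based labeler with a supervised machine learning classifier (a naive Bayes model from scikit-learn). The documents, the word list, and the labels are hypothetical toy illustrations, not a real coding scheme.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Dictionary-based labeling: count hits against a predefined word list
    # (this word list is a hypothetical stand-in for a validated dictionary).
    economy_words = {"jobs", "inflation", "wages", "taxes"}

    def dictionary_label(text):
        hits = sum(token in economy_words for token in text.lower().split())
        return "economy" if hits > 0 else "other"

    print(dictionary_label("inflation and wages are rising"))  # -> economy

    # Supervised machine learning: train on a small hand-coded sample,
    # then let the model label the remaining, uncoded texts.
    train_texts = ["new bill raises taxes on wages",
                   "senator speaks on border security",
                   "inflation weighs on the jobs report",
                   "debate over immigration policy"]
    train_labels = ["economy", "other", "economy", "other"]

    vectorizer = CountVectorizer()
    model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)
    print(model.predict(vectorizer.transform(["wages and inflation dominate the agenda"])))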
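
For the unsupervised family, a topic model is a common choice. The sketch below fits a latent Dirichlet allocation (LDA) model with scikit-learn under the assumption of two topics; the corpus is again a hypothetical toy example, and real applications require far larger text collections.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = ["taxes wages inflation jobs economy",
              "border security immigration visa policy",
              "jobs economy wages growth taxes",
              "immigration border visa asylum policy"]

    # Build the document-term matrix the model is estimated from.
    vectorizer = CountVectorizer()
    doc_term = vectorizer.fit_transform(corpus)

    # The number of topics is the statistical assumption the researcher supplies.
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(doc_term)

    # Inspect the highest-weight words characterizing each discovered topic.
    terms = vectorizer.get_feature_names_out()
    for topic_id, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:4]]
        print(f"topic {topic_id}: {', '.join(top)}")

Unlike the supervised sketch, no labels are provided here: the researcher interprets and names the categories only after the model has sorted the texts.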

Current and future challenges of textual analysis for Internet research

Despite the continuous evolution of text analysis in the social sciences, some methodological concerns remain unresolved; scholars continue to debate, for example, how to validate the output of automated methods against careful human reading.[11]

References

  1. McMillan, S. J. (2000). The microscope and the moving target: The challenge of applying content analysis to the World Wide Web. Journalism and Mass Communication Quarterly, 77(1), 80-98.
  2. Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage.
  3. Riffe, D., Lacy, S., & Fico, F. (1998). Analyzing media messages: Using quantitative content analysis in research. Mahwah, NJ: Lawrence Erlbaum.
  4. Baumgartner, F., & Jones, B. (1993). Agendas and Instability in American Politics. Chicago: University of Chicago Press.
  5. Van Selm, M., & Jankowski, N. (2005). Content Analysis of Internet-Based Documents. Unpublished manuscript.
  6. Herring, S. C. (2009). Web Content Analysis: Expanding the Paradigm. In Hunsinger, J. (Ed.). Springer Netherlands, pp. 233-249. ISBN 978-1-4020-9788-1. Retrieved 2015-04-11.
  7. For example, see: Barberá, P., Bonneau, R., Egan, P., Jost, J. T., Nagler, J., & Tucker, J. (2014). Leaders or Followers? Measuring Political Responsiveness in the US Congress Using Social Media Data. Presented at the Annual Meeting of the American Political Science Association; Barberá, P. (2015). Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis, 23(1), 76-91; Van Dijck, J. (2013). You have one identity: Performing the self on Facebook and LinkedIn. Media, Culture & Society, 16, 1138-1153.
  8. These are some of the main methods that Grimmer and Stewart (2013) point out: Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis.
  9. Collingwood, L., & Wilkerson, J. (2011). Tradeoffs in Accuracy and Efficiency in Supervised Learning Methods. The Journal of Information Technology and Politics, Paper 4.
  10. Gerber, E., & Lewis, J. (2004). Beyond the median: Voter preferences, district heterogeneity, and political representation. Journal of Political Economy, 112(6), 1364-83.
  11. Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis.
  12. Slapin, J., & Proksch, S.-O. (2008). A scaling model for estimating time-series party positions from texts. American Journal of Political Science, 52(3), 705-22.
  13. King, G., Keohane, R. O., & Verba, S. (1994). Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press.
  14. Saldaña, J. (2009). The Coding Manual for Qualitative Researchers. London: SAGE Publications Ltd.
  15. Chuang, J., Wilkerson, J. D., Weiss, R., Tingley, D., Stewart, B. M., Roberts, M. E., Poursabzi-Sangdeh, F., Grimmer, J., Findlater, L., Boyd-Graber, J., & Heer, J. (2014). Computer-Assisted Content Analysis: Topic Models for Exploring Multiple Subjective Interpretations. Paper presented at the Conference on Neural Information Processing Systems (NIPS), Workshop on Human-Propelled Machine Learning. Montreal, Canada.