Uncovering The Language Defining Each Day: A Decade Of Television News Through Ngrams
By TheWAY - September 25, 2019
Ngrams offer a powerful non-consumptive lens onto the latent linguistic patterns of the news media through which we see the world, from the topics they focus on to the emotions they express. While they have most famously been used in recent years to study historical book archives, these simple word histograms can also shed new light on television news by translating broadcast closed captioning into word frequencies that can be statistically analyzed. Using the power of the cloud, a decade of television news coverage can be analyzed in a matter of seconds to surface the defining words of each day on the twelve television news stations monitored by the Internet Archive over the past decade.
The Internet Archive’s Television News Archive has been preserving the ephemeral world of broadcast news for more than a decade. The closed captioning of this enormous archive of almost two million broadcasts was analyzed to create an ngram dataset that lists the unique words and their frequencies over the past decade for ABC, Al Jazeera, BBC News, CBS, CNN, DeutscheWelle, FOX, Fox News, NBC, PBS, Russia Today, Telemundo and Univision (though not all stations stretch back the full ten years).
Word frequency tables represent an ideal mechanism through which to understand the linguistic patterns of textual corpora in that they are non-consumptive while still allowing the statistical analysis of word usage over time.
Such ngram datasets can be used to assess the use of emotional language in the news and compare how different terms have been used over time to identify hidden linguistic correlations.
At scale, such datasets can also be used to analyze topical shifts in the news narrative, as well as to compare the topics and issues focused on by each station each day.
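To make this concrete, here is a minimal sketch of how such a frequency table might be queried to trace a single term's usage over time. The table name `tvnews.ngrams` and its columns (`date`, `station`, `word`, `freq`) are hypothetical stand-ins, not the dataset's actual schema:

```sql
-- Minimal sketch: trace how often a single term appears on each station
-- over time. Table and column names are hypothetical stand-ins for the
-- actual ngram dataset schema.
SELECT
  date,      -- broadcast day
  station,   -- e.g. 'CNN', 'BBC News'
  freq       -- number of times the word was spoken that day
FROM `tvnews.ngrams`
WHERE LOWER(word) = 'brexit'   -- assumes words are stored case-folded
ORDER BY station, date;
```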
Within information retrieval, term frequency-inverse document frequency (TF-IDF) is a simple but remarkably insightful statistical technique that compares the words used in a given set of documents during a specific time interval against the background usage of those terms across the corpus as a whole to surface statistically meaningful words. In short, if the word “CNN” shows up on CNN each day, its meaningfulness to any given day is relatively low. On the other hand, if “cathedral” rarely appears on CNN but is suddenly mentioned continuously on April 15, 2019 alongside “Notre,” “Dame” and “Paris,” it suggests those terms are highly significant to that day’s coverage.

Most importantly, TF-IDF requires only word frequencies, meaning it can be applied directly to an ngram dataset.
Using Google’s BigQuery cloud analytics platform, a single line of SQL was all it took to analyze all 1.2 billion ngram records in just 45.3 seconds, comparing each day on each station over the past decade against the baseline of the background word usage of the dozen stations over the full ten years.
In other words, for each station on each day, a histogram is computed of its word usage and then compared against a master histogram of all word usage across the twelve stations over the full ten years to find the words most meaningful to each station-day.
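The original query is not reproduced here, but the following sketch captures the same idea using the hypothetical schema above: treat each station-day as a document, compute each word's share of that document's words (term frequency), weight it by how rarely the word appears across all station-days (inverse document frequency), and keep the top-scoring words per station-day:

```sql
-- TF-IDF sketch over a hypothetical `tvnews.ngrams` table with columns
-- (station, date, word, freq). An illustration of the approach, not the
-- original production query.
WITH docs AS (
  -- each station-day is treated as one "document"
  SELECT station, date, word, freq,
         SUM(freq) OVER (PARTITION BY station, date) AS doc_total
  FROM `tvnews.ngrams`
),
doc_count AS (
  -- total number of documents (station-days) in the corpus
  SELECT COUNT(DISTINCT FORMAT('%s|%t', station, date)) AS n_docs
  FROM `tvnews.ngrams`
),
idf AS (
  -- how many documents each word appears in, turned into an IDF weight
  SELECT word,
         LN((SELECT n_docs FROM doc_count) /
            COUNT(DISTINCT FORMAT('%s|%t', station, date))) AS idf
  FROM `tvnews.ngrams`
  GROUP BY word
)
SELECT station, date, word,
       (freq / doc_total) * idf AS tfidf
FROM docs
JOIN idf USING (word)
-- keep only the ten highest-scoring words per station-day
QUALIFY ROW_NUMBER() OVER (PARTITION BY station, date
                           ORDER BY (freq / doc_total) * idf DESC) <= 10
ORDER BY station, date, tfidf DESC;
```

A full scan like this over 1.2 billion rows is precisely the kind of aggregation BigQuery parallelizes well, which is how the comparison can complete in well under a minute.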
For example, on June 4, 2019, the top words on Al Jazeera included Sudan, Khartoum, Rakhine and Sudan's transitional government, showcasing its emphasis on events in Sudan and Myanmar that day – topics US news outlets tend to pay far less attention to. BBC News unsurprisingly emphasized the NHS, Jeremy Corbyn, the Labour Party and Theresa May. CNN focused on Rikers Island, where Paul Manafort was being sent, as well as Brexit, Theresa May and “birtherism.” DeutscheWelle emphasized the UK, Ukrainian President Volodymyr Zelensky and Tiananmen Square, while Fox News focused on Brexit and Theresa May. The San Francisco affiliates of ABC, CBS, NBC and PBS emphasized James Holzhauer’s record-setting Jeopardy run, Tiananmen Square and other local and national stories. MSNBC focused on the George Nader arrest, Elliott Broidy, Mueller, Brexit and impeachment. Russia Today emphasized Canada’s government report labeling the deaths of indigenous women genocide, Canada’s closure of its Caracas embassy and concerns over media reports “outing” the author of the doctored Nancy Pelosi video.
Of course, these were not the only topics discussed on those stations that day; rather, they reflect the words that were most statistically significant when compared against total word usage across all 12 stations over the past decade.
In the end, this is only a simple example of the kinds of topical analysis that can be performed with such a massive ngram dataset, but it offers a glimpse of what becomes possible when multimedia modalities are translated into textual form, represented as non-consumptive word frequencies and analyzed using the cloud.
I’d like to thank Google for the use of Google Cloud resources, including BigQuery; Felipe Hoffa for creating the original TF-IDF query; and the Internet Archive and its Television News Archive, especially its Director Roger Macdonald.