How to translate live-spoken human words into computer "truth"

Our Knight Lab team spent three months in Winter 2018 exploring how to combine various technologies to capture, interpret, and fact-check live broadcasts from television news stations, using Amazon’s Alexa personal assistant device as a low-friction way to initiate the process. The ultimate goal was to build an Alexa skill that could be its own form of live, automated fact-checking: cross-referencing a statement from a politician or otherwise newsworthy figure against previously fact-checked statements from widely trusted nonpartisan fact-checking resources such as Politifact, FactCheck.org and the Washington Post’s fact-checker blog, and delivering a verdict to the user as to whether what they just heard was really true.

Fully automated fact-checking has been a focus of intense research within the tech-journalism community, and most experts agree it’s still several years away. But the need for improved fact-checking resources in our current age of “alternative facts” and fast-traveling Internet quips is obvious. Concurrently, there has been a tremendous rise in the use of voice-activated personal assistants, with more than 60 million Americans estimated to have used such devices at least once a month in 2017, and Amazon’s Echo dominating the space with a 70 percent market share. People are starting to keep Echo devices nearby while they watch TV, and when they hear something suspicious from a politician or the media, it seems like a natural impulse to want to ask someone (or something), “Is that true?”

We sought a way to leverage a platform with such wide adoption rates for a relevant, important civic purpose.

An early whiteboard diagram of how Watson and the Politifact API interact to match entities from our analyzed speech and determine a confidence rating.

“Hearing” and “Seeing” with Alexa

We learned pretty early on in our research that training Alexa to “hear” and analyze voices from a TV would be too daunting a task, not least because the device only wakes up when we use a vocal cue to trigger it, by which point it would be too late to fact-check whatever we just heard. Overcoming this challenge would require an always-on microphone, which poses significant privacy and user experience issues. Instead, we went looking for reliable live TV caption feeds that did not depend on live voice transcription, because the voice input would ultimately need to be transcribed into a text string anyway before it could be fed into widely available natural language processors.

We found a promising option in OpenedCaptions, which displays a real-time caption feed from C-SPAN on the web. We started our own private OpenedCaptions server on AWS Lambda, then built a simple Google Apps Script that would transfer the last minute of C-SPAN captions into a Google Document. The document continually updates with the previous minute of text, deleting anything older so as not to overwhelm the servers with unnecessary data. This way, Alexa doesn’t have to monitor the television around the clock, waiting to be cued, and we can avoid both excess use of energy and the many privacy concerns that have crept up around the device. Instead, when it is cued, the Alexa skill executes a script that simply pulls all available text from our document and analyzes that minute of text for keywords, concepts and entities (including people).
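The "last minute of captions" idea can be sketched as a small rolling buffer. This is a minimal illustration in Python, not the Google Apps Script described above; the class name, method names and the use of plain numeric timestamps are all assumptions made for the example.

```python
from collections import deque


class CaptionBuffer:
    """Keeps only the caption fragments received in the last `window_seconds`.

    Hypothetical helper for illustration; timestamps are plain seconds.
    """

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._fragments = deque()  # (timestamp, text) pairs, oldest first

    def add(self, timestamp, text):
        """Store a new fragment and drop anything older than the window."""
        self._fragments.append((timestamp, text))
        self._trim(timestamp)

    def _trim(self, now):
        """Discard fragments whose timestamp falls before the cutoff."""
        cutoff = now - self.window_seconds
        while self._fragments and self._fragments[0][0] < cutoff:
            self._fragments.popleft()

    def text(self, now):
        """Return the retained captions as one string, as of time `now`."""
        self._trim(now)
        return " ".join(fragment for _, fragment in self._fragments)
```

When the skill is cued, a single call such as `buffer.text(now)` would hand the most recent minute of captions to the language-analysis step, and nothing older is kept around.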

Live-transcription technology is improving all the time; hopefully in the near future, reliable third-party transcription will be available on vastly more popular and influential news channels like CNN, Fox News and MSNBC, in order to more easily facilitate analysis of their content.

Truth and computational linguistics

The question of “what is truth, anyway?” is a largely philosophical one, too big for the scope of our project. Very little of language is simply “true” or “false.” What we can do instead of passing that kind of judgment is find similar phrases from the past that have been marked by experts as true or false, or something in-between. Our job is not to judge what is true or false; our job is to help determine if others have already evaluated these statements for their truth or falsehoods. We also have to be clear about what this means from a programming perspective: In the language of computer programming, if a set of qualities or pre-conditions is met, the statement is “true.” So we must be careful about how we set our pre-conditions.

Voice assistants, even AI software, cannot currently fully understand referential and figurative language in speech. A human will understand “He threw in the towel” to colloquially mean that someone gave up during a strenuous trial, while retaining its literal meaning of throwing a towel into something. While humans will understand that this saying stems from boxers tossing their towel into the boxing ring as a sign of defeat, computers struggle with bridging the literal and colloquial uses of language. When we try to figure out what things mean, we enter a theoretical world of speech that relies on an ever-evolving understanding of information structures, internal networks of context and undefined human intuition.

Unfortunately, we’re not going to be able to teach a machine to learn language the way humans understand it in just three months. So instead, we utilized an API, Watson Natural Language Understanding (NLU) from IBM, to extract keywords, concepts, and entities (i.e., people, organizations, places, and objects) from our caption stream, and develop Politifact queries that relate to these people and subjects (see “Working with Politifact” below). We then take the resulting Politifact headlines and claims and query the Parallel Dots API to rank how similar each result is to the original. We are looking for a previously fact-checked claim that sufficiently matches the phrase we’ve placed under our analytic microscope. Then, instead of us delivering a definitive truth/falsehood verification, we provide the user with additional information they can use to contextualize what they’ve just heard.
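The ranking step can be illustrated with a rough sketch. The similarity function below is a simple word-overlap (Jaccard) measure used as a stand-in; it is not the Parallel Dots score, and the function names are hypothetical. The point is only the shape of the step: score each candidate claim against the original statement, then sort.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a, b):
    """Jaccard overlap of word sets, in [0, 1].

    A crude stand-in for a semantic similarity service, NOT Parallel Dots.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_claims(statement, claims):
    """Return (score, claim) pairs, best match first."""
    scored = [(similarity(statement, claim), claim) for claim in claims]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Word overlap knows nothing about meaning, which is exactly why a dedicated semantic similarity service is used in its place; the sketch only shows where such a score plugs in.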

In our example, culled from President Trump’s 2018 State of the Union address, the red text represents words and terms our NLU was able to recognize and pull out for further analysis. The green represents a word that has a match in the Politifact lexicon, which has a limited reference list of “subjects” and “people,” so we know that we can use these terms to begin our cross-referencing process. In this example, because the speaker has been previously defined, we also know that “I” refers to Donald Trump.

So much of fact-checking tends to focus on very small turns of phrase: To determine the factuality of Trump saying “We enacted the biggest tax cuts and reforms in American history,” we need our NLU to understand and extract keywords, concepts and people and come up with a relevant value judgement. In this instance, the NLU would know “tax break,” qualifying it as “biggest,” and recognize the words “American history” are involved somehow. But it wouldn’t be able to link all the terms to provide a value judgement, or provide additional context beyond that. We have to take each entity on its own and input them into the Politifact API looking for a match. And sometimes our NLU API misunderstands context: when it processes “American people,” it believes the phrase is more likely to relate to “immigration” or “the Democratic Party” than to “taxes.”

Working with Politifact

There are several dedicated fact-checking services out there, but for our purpose it was easiest to stick with Politifact because they were the only fact-checkers providing an open API. However, in addition to the site itself not having the human bandwidth to cover every falsehood and misstatement in the political universe, the Politifact API has very limited capabilities. It allows us to search for claims by a few different criteria; for our purposes, we stuck to two search methods, People and Subject, each of which came with its own hurdles. The API also returns only a very limited text summary of each ruling, requiring interested users to go to the Politifact website to learn more. Soon enough, every decision we made on a programming level had to be made with the limitations of the Politifact API in mind.

In the future, we will need to find ways to expand our search beyond the Politifact API in order to grow the pool of archived statements and improve the quality of our results.

An early, crude model of the complex pathway our program has to go through once it recognizes part of a name ("Obama").

To search by People, the Politifact API will only recognize a hyphenated input of both first and last names. The last name alone isn’t sufficient. (Political organizations, such as the NRA, are also housed under People, and for them we need to input the full name, e.g. “National Rifle Association,” to get returns, not acronyms or partial names.) So the computational challenge for us here was that we would need to instruct our program to automatically fill in the entire first name/last name sequence for instances in which our C-SPAN speaker would only verbalize a last name or a shorthand title, which is most of the time, considering the ways people typically converse. To accomplish this, we made a list of the manual inputs we were most likely to run into, in combination with support from APIs that find related, similar and common nearby words: for example, “Donald-Trump” for “Trump” or “the President,” or “Jeff-Sessions” for “the Attorney General.” This is further complicated by politicians who share a first and last name, such as George Bush (“george-p-bush” and “george-w-bush”).
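A hand-built alias table of this kind might look something like the sketch below. The table contents come from the examples in the text; the function names, the lookup rules and the decision to return a list of candidates (so that an ambiguous "Bush" yields both options) are assumptions for illustration only.

```python
# Hypothetical alias table: spoken mention -> candidate hyphenated query values.
ALIASES = {
    "trump": ["Donald-Trump"],
    "the president": ["Donald-Trump"],
    "the attorney general": ["Jeff-Sessions"],
    "bush": ["george-p-bush", "george-w-bush"],  # ambiguous on purpose
}


def to_slug(full_name):
    """Join the words of a name with hyphens: 'Jeff Sessions' -> 'Jeff-Sessions'."""
    return "-".join(full_name.split())


def resolve_person(mention):
    """Return candidate query values for a spoken mention.

    The result may be empty (unknown mention) or hold several candidates.
    """
    key = " ".join(mention.lower().split())  # normalize case and spacing
    if key in ALIASES:
        return list(ALIASES[key])
    # Assumption: any other mention of two or more words is already a full
    # name. A real system would need a stricter check than this.
    if len(key.split()) >= 2:
        return [to_slug(mention)]
    return []
```

Returning every candidate for an ambiguous surname leaves the choice to a later step, for example by querying each candidate and letting the similarity ranking decide which one fits the statement.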

Searching the API by Subject carries its own problems. Politifact uses very broad Subject categories like “taxes” or “foreign policy.” In addition, we have to manually set how many returns we want to get per query: so for example, how many of Donald Trump’s most recent fact-checked statements we want to receive. The Parallel Dots API charges us per query, so we currently don’t have the bandwidth to ping for large numbers of results; we have to be more efficient. We have found that limiting our returns to the six most recent statements tends to cover a wide timespan for most political figures: the last six returns for Jeff Sessions, for example, stretch back nearly a year and a half.
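The trade-off between coverage and cost can be made concrete with a small sketch. The record layout (a dictionary with a `date` field) and both function names are hypothetical; the default of six simply mirrors the limit described above.

```python
from datetime import date


def most_recent(statements, limit=6):
    """Return the `limit` newest statements, newest first."""
    newest_first = sorted(statements, key=lambda s: s["date"], reverse=True)
    return newest_first[:limit]


def similarity_calls(num_queries, limit=6):
    """Upper bound on paid similarity comparisons: one per returned statement."""
    return num_queries * limit
```

Under these assumptions, a block of text that produces three People or Subject queries would trigger at most `similarity_calls(3)`, or 18, paid comparisons.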

Analyzing the Statements for “Most True”

For every block of C-SPAN text we send to our Watson language-parsing software, we receive multiple potential matches from Politifact, ranked internally by their relevance score. But a high mathematical score doesn’t necessarily translate into a high relevance for the user, in terms of the statement that is being returned to us.

Because of this, we need to make a human judgment (as well as a journalistic one) about where to cut off the relevance threshold. This will help us determine how much confidence we can express to our users when we return a result. At first we had considered setting the “high confidence” threshold to 0.7 and above, but that was returning too many irrelevant results. The result with the highest (0.7) similarity score for “Obama wiretapped Trump Tower,” for example, is an article about whether Obama sent Christmas cards to the military. So we’ve moved that threshold up to 0.9 and above.

We also wanted to be able to deliver “medium confidence” and “low confidence” results, for cases where we think we’ve found something that could be relevant but that might not be the exact information the user is looking for. Even being able to match a couple of entities in a statement could deliver something of partial relevance. Through our testing, we determined the “medium” range to be 0.7-0.9, and the “low” range to be 0.2-0.7. Our “low confidence” results are for when we have matched only a person to the original statement, or only the subject matter, or when we have matched both but the ranking is still low. For these cases, we will deliver one of three results making clear to the user that we have only found a statement that involves the same person, or one that involves the same subject, or one that is unlikely to address the issue they just heard.

We knew there would also need to be a “no confidence” threshold, for when our results matched nothing at all. Delivering an irrelevant result to users with confidence may in some ways be worse for long-term UX than returning no results at all. We mark everything below a 0.2 threshold as “no confidence.”
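Taken together, the thresholds amount to a simple lookup, sketched below. The cutoffs are the ones given in the text; how a score that lands exactly on a boundary is treated is an assumption here (each boundary value is placed in the higher tier), and the function name is hypothetical.

```python
HIGH_CUTOFF = 0.9    # "high confidence" at or above this score
MEDIUM_CUTOFF = 0.7  # "medium" from here up to the high cutoff
LOW_CUTOFF = 0.2     # "low" from here up to the medium cutoff


def confidence_tier(score):
    """Map a similarity score in [0, 1] to a confidence label.

    Assumption: a score exactly on a cutoff belongs to the higher tier.
    """
    if score >= HIGH_CUTOFF:
        return "high"
    if score >= MEDIUM_CUTOFF:
        return "medium"
    if score >= LOW_CUTOFF:
        return "low"
    return "none"
```

With these cutoffs, the Christmas-card result that scored 0.7 against “Obama wiretapped Trump Tower” would be labeled "medium" rather than "high".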

Future iterations of this project would be wise to devote more time to intensive testing in this phase of the process, so that we can develop more granular definitions of relevancy for fact-checked statements.

Messaging

Although Alexa is a voice-activated tool, we reasoned early on that delivering the entire user result via voice would be too long, interrupt the flow of television-watching and result in poor information retention. So voice wouldn’t be the best mechanism for people who wanted to opt into learning more about the topic. Instead, we sought to transfer the user from a voice-delivery result to a result by text message.

A voice cue from Alexa prompts the user to “check your phone” via one of five carefully worded phrases, depending on our level of confidence about what has been checked – one phrase each for high, medium and low-confidence results that return people and subjects, plus one phrase for a low-confidence result that returns only the same person and another for a low-confidence result that returns only the same subject. Then, our product delivers a brief series of conversational texts displaying the headline, summary, claim rating and article link to the most relevant Politifact piece. This method allows the user (who is probably already using their phone as a “second screen” while they watch TV) to easily decode what the results are, and follow the link to more information if they are interested.

In cases where our results fail to score above a 0.2, the threshold for “low confidence,” we will not send a text to the user, but will instead merely instruct Alexa to say, “I’m sorry, I couldn’t find anything for you.”
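The choice among the five voice cues and the fallback can be sketched as a small decision function. The cue keys are hypothetical labels, not the actual phrases; treating a result that matched neither person nor subject as "no result" is an assumption, as is the handling of scores that fall exactly on a cutoff.

```python
def choose_cue(score, person_matched, subject_matched):
    """Pick a voice-cue key and decide whether to send the follow-up text.

    Returns (cue_key, send_text). Cue keys are hypothetical labels.
    """
    # Below the low-confidence cutoff, or nothing matched: apologize, no text.
    if score < 0.2 or not (person_matched or subject_matched):
        return ("no_result", False)
    if person_matched and subject_matched:
        if score >= 0.9:
            return ("high", True)
        if score >= 0.7:
            return ("medium", True)
        return ("low_both", True)
    # Only one of the two matched: low confidence regardless of score.
    if person_matched:
        return ("low_person_only", True)
    return ("low_subject_only", True)
```

A caller would look up the spoken phrase by cue key and send the text-message series only when the second element is true.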

A demonstration of the Alexa service.

Conclusion

When we started the project, the term “conversational” was meant to refer to using an Alexa skill to somehow “fact-check” friends or other people while in conversation with them. We now see just how pie-in-the-sky this goal was. Finding a way to live-capture spoken language proved a big enough technological hurdle on its own that we had to devise a shortcut around that challenge so we could focus on the entirely different question of whether statements and phrases, in any form, could be fact-checked in real time.

We have also learned how limiting it can be to rely on APIs from fact-checking and other journalistic Web resources for the trustworthiness of a product, so a lot of legwork and collaboration would need to be done industry-wide to get technology like this approaching peak usability.

However, we have still made great strides in our research and testing this quarter, particularly when it came to language-parsing via Watson and Parallel Dots, and finding a way to automatically sort through claims for relevancy. It’s clear that the core function of our device can work, and can deliver a satisfying user experience, under the right circumstances. What was comforting was seeing how many of our limitations were due to things beyond our control: the rigid capabilities of the Politifact API, the lack of reliable transcriptions for any network outside of C-SPAN, and our lack of proper bandwidth to perform truly thorough similarity checks.

As computational linguists and journalists in the technology sphere continue to make their own improvements in the world of automated fact-checking, we are closing the gap between the rampant spoken falsehoods of our political class and journalism’s ability to keep up with them. Our work will be valuable in that drive to increase civilian access to useful, verified information, and it has the potential to improve civic dialogue in the process.

The truth is out there.