A Text Prediction App in Collaboration with Coursera & SwiftKey

The text prediction app is the result of the Coursera Data Science Capstone project, carried out in collaboration with SwiftKey.

The objective of the capstone project was to (1) build a model that predicts the next term in a sequence of words, and (2) encapsulate the result in an appropriate user interface using Shiny. You can try out the Text Prediction App on the Shiny server.

Data Cleaning & Preparation

The prediction model is based on three different sources of text (blogs, news, tweets). For the subsequent model-building process, I drew a random sample of the text and began the data preparation.
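
For illustration, a minimal sketch in R of how such a sample can be drawn (the file names are those of the capstone dataset; the 10% sampling rate is purely illustrative, not necessarily the rate used for the app):

  # Draw a reproducible random sample from each of the three sources.
  set.seed(42)
  files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
  sample_lines <- unlist(lapply(files, function(f) {
    lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
    sample(lines, size = round(0.10 * length(lines)))  # illustrative 10% sample
  }))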

A selection of the text-cleaning steps and other transformations (a regex sketch follows the list):

  • Removal of all non-alphanumeric characters to bypass prevailing encoding issues.
  • Flagging sentence endings so that the app does not make predictions across sentence boundaries.
  • Flagging numbers for eventual removal (since the goal is to predict terms, not digits).
  • Removal of all punctuation except the apostrophe in terms like “don’t” or “I’d”.
  • Removal of any Internet-related content (hyperlinks, emails, retweets).
  • Conversion of text to lower case and removal of unnecessary whitespace.
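
A minimal sketch of this kind of regex-based cleaning in base R; the exact patterns used in the app may differ, and the <eos>/<num> tags are illustrative placeholders:

  clean_text <- function(x) {
    x <- tolower(x)                                            # lower case
    x <- gsub("(https?://\\S+)|(\\S+@\\S+)|\\brt\\b", " ", x)  # links, emails, retweets
    x <- gsub("[.?!]+", " <eos> ", x)                          # flag sentence endings
    x <- gsub("[0-9]+", " <num> ", x)                          # flag numbers for later removal
    x <- gsub("[^a-z'<> ]", " ", x)                            # keep letters, ', and the tags
    x <- gsub("\\s+", " ", x)                                  # collapse whitespace
    trimws(x)
  }

  clean_text("Don't stop! Visit http://example.com at 10 AM.")
  # "don't stop <eos> visit at <num> am <eos>"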


As a next step, I created four n-gram tables (a counting sketch follows the list):

  • Unigrams: only the top 3 unigrams
  • Bigrams: 2,443,099 unique bigrams and their frequencies
  • Trigrams: 688,916 unique trigrams and their frequencies
  • Quadgrams: 321,218 unique quadgrams and their frequencies
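
The post does not say which package was used to build these tables; a base-R sketch of the underlying idea, shown for bigrams:

  # Count adjacent token pairs to obtain a bigram frequency table.
  make_bigrams <- function(text) {
    tokens <- unlist(strsplit(text, "\\s+"))
    pairs  <- paste(head(tokens, -1), tail(tokens, -1))
    sort(table(pairs), decreasing = TRUE)
  }

  make_bigrams("today is a great day today is a good day")
  # "is a" and "today is" (count 2 each) rank above the remaining pairs (count 1)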


Text Prediction Model

Once the cleaned text sources were available in the form of n-gram tables, I began to implement and test a variety of features.

Feature List:

  • Most notably, I set up a Katz’s Backoff Model, which tries to identify appropriate terms in the largest n-gram table and, when unsuccessful, backs off to the next-smaller n-gram table and repeats the process (a simplified sketch follows this list).
  • The model recognizes sentence endings (based on ., ?, or !) in the user input and proposes an appropriate next term for the beginning of a new sentence.
  • The app also comes with “evil word” protection, censoring/hiding predicted terms that represent evil words.
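
A full Katz’s Backoff Model also discounts the observed counts to reserve probability mass for unseen terms; the sketch below shows only the backoff control flow. The table layout (a list of data frames with context/next_term/freq columns) and the three fallback unigrams are my assumptions, not the app’s actual data structures:

  # tables: list of n-gram data frames (bigram .. quadgram), each with
  # columns 'context' (preceding words), 'next_term', and 'freq'.
  predict_next <- function(input, tables, k = 3) {
    tokens <- unlist(strsplit(tolower(input), "\\s+"))
    for (n in rev(seq_along(tables))) {               # largest n-gram table first
      context <- paste(tail(tokens, n), collapse = " ")
      hits <- tables[[n]][tables[[n]]$context == context, ]
      if (nrow(hits) > 0)                             # match found: no need to back off
        return(head(hits[order(-hits$freq), "next_term"], k))
    }
    c("the", "to", "and")                             # fall back to the top unigrams
  }
  # Predicted terms can then be filtered against a profanity list
  # before they are shown to the user ("evil word" protection).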


Performance Benchmark: I used the benchmark code by Jan to test the performance of the next-term prediction app. My final model performs as follows:

  • Overall top-3 score: 17.49 %
  • Overall top-1 precision: 12.95 %
  • Overall top-3 precision: 21.39 %


Shiny Application

The app is intuitive: the user can immediately begin to enter text, see up to 3 suggested next terms, and simply click one to add it to the existing message.

The final app offers a variety of benefits to its users (a minimal Shiny sketch follows the list):

  1. It is fast and, with an overall top-3 precision of over 21%, extremely reliable.
  2. It offers its users up to 3 next-best terms.
  3. It ships with “evil word” protection.
  4. It also supports native German speakers (experimental).
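
A minimal Shiny sketch of this click-to-add interaction, assuming the predict_next() sketch above and a preloaded ngram_tables object (both assumptions; the post does not show the app’s code):

  library(shiny)

  ui <- fluidPage(
    textInput("msg", "Your message:", width = "100%"),
    uiOutput("suggestions")                        # up to 3 clickable suggestions
  )

  server <- function(input, output, session) {
    suggest <- reactive(predict_next(input$msg, ngram_tables))
    output$suggestions <- renderUI({
      lapply(seq_along(suggest()), function(i)
        actionButton(paste0("term", i), suggest()[[i]]))
    })
    # Clicking a suggestion appends it to the message.
    lapply(1:3, function(i)
      observeEvent(input[[paste0("term", i)]], {
        updateTextInput(session, "msg",
                        value = paste(input$msg, suggest()[[i]]))
      }))
  }

  shinyApp(ui, server)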


Example: for the input “Today is a great …”, the app suggests the completion “day”.
