07 Feb A Text Prediction App in Collaboration with Coursera & SwiftKey
The objective of the capstone project was to (1) build a model that predicts the next term in a sequence of words, and to (2) encapsulate the result in an appropriate user interface using Shiny. You can try out the Text Prediction App on the Shiny server.
Data Cleaning & Preparation
The prediction model is based on three different sources of text (blogs, news, tweets). For the subsequent model building process, I drew a random sample of text and began the data preparation.
A selection of the text-cleaning steps and other transformations:
- Removal of all non-alphanumeric characters to work around encoding issues.
- Flagging sentence endings so that the app does not make predictions across sentence boundaries.
- Flagging numbers so they can be removed later (since we want to predict words, not numbers).
- Removal of all punctuation except the apostrophe in contractions like “don’t” or “I’d”.
- Removal of any Internet-related content (hyperlinks, email addresses, retweets).
- Conversion of the text to lower case and removal of unnecessary whitespace.
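Cleaning steps like these can be sketched with a few regular expressions. The function below is illustrative only, not the project's actual code (which was written in R); the name `clean_text` and the `<eos>`/`<num>` flags are my own choices:

```python
import re

def clean_text(text):
    """Illustrative cleaning pipeline: a sketch, not the project's R code."""
    # strip Internet-related content first (links, email addresses, retweet markers)
    text = re.sub(r"https?://\S+|\S+@\S+|\bRT\b", " ", text)
    text = text.lower()                        # convert to lower case
    text = re.sub(r"[.?!]+", " <eos> ", text)  # flag sentence endings
    text = re.sub(r"\d+", " <num> ", text)     # flag numbers for later removal
    text = re.sub(r"[^a-z'<>\s]", " ", text)   # drop punctuation except apostrophes
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_text("Don't email me@example.com! Visit https://example.com, it's 100% great."))
# → don't email visit it's <num> great <eos>
```

Note the order matters: links and email addresses are removed before sentence-ending punctuation is flagged, so that a trailing period in a URL is not mistaken for a sentence boundary.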
As a next step, I created four n-gram tables:
- Unigrams: only the top 3 unigrams
- Bigrams: 2,443,099 unique bigrams and their frequencies
- Trigrams: 688,916 unique trigrams and their frequencies
- Quadgrams: 321,218 unique quadgrams and their frequencies
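Building such frequency tables can be sketched as follows (again illustrative Python rather than the project's R code, with a toy token list):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "a cleaned sample of text for a sample of text".split()
bigrams = ngram_counts(tokens, 2)   # maps (word1, word2) -> frequency
trigrams = ngram_counts(tokens, 3)  # maps (word1, word2, word3) -> frequency
print(bigrams[("sample", "of")])    # → 2
```

At the scale quoted above (millions of unique bigrams), the tables would of course be built from the full cleaned sample and stored in an efficient lookup structure rather than recomputed on the fly.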
Text Prediction Model
Once the cleaned text sources were available in the form of n-gram tables, I began to implement and test a variety of features.
- Most notably, I set up a Katz backoff model, which first looks for appropriate terms in the largest n-gram table and, if none are found, backs off to the next smaller n-gram table and repeats the process.
- The model recognizes sentence endings (based on ., ? or !) in the user input and proposes an appropriate next term for the beginning of a new sentence.
- The app also comes with “evil word” protection, censoring/hiding predicted terms that represent evil words.
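The backoff lookup order can be sketched as below. This is a deliberate simplification (a “stupid backoff” without Katz's probability discounting), with toy tables and my own function name:

```python
def predict_next(words, tables, k=3):
    """Return up to k candidate next words, backing off from the longest
    n-gram table to shorter ones (simplified: no Katz discounting)."""
    for n in (4, 3, 2):                       # quadgrams -> trigrams -> bigrams
        prefix = tuple(words[-(n - 1):])      # last n-1 words of the input
        if len(prefix) < n - 1:
            continue                          # input too short for this table
        matches = {gram[-1]: count
                   for gram, count in tables[n].items()
                   if gram[:-1] == prefix}
        if matches:                           # rank candidates by frequency
            return sorted(matches, key=matches.get, reverse=True)[:k]
    return ["the", "to", "and"]               # last resort: top unigrams

# illustrative toy tables, keyed by n-gram length
tables = {
    4: {},
    3: {("thank", "you", "very"): 5, ("thank", "you", "for"): 3},
    2: {("you", "very"): 2},
}
print(predict_next(["thank", "you"], tables))  # → ['very', 'for']
```

This also makes the role of the tiny unigram table clear: it is only ever consulted as the final fallback when every longer table fails to match.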
Performance Benchmark: I utilized the benchmark code by Jan to test the performance of the next term prediction app. My final model performs as follows:
- Overall top-3 score: 17.49 %
- Overall top-1 precision: 12.95 %
- Overall top-3 precision: 21.39 %
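Top-k precision of this kind measures the share of test cases in which the true next word appears among the model's top k suggestions. A minimal sketch with toy data (the function name is my own, not the benchmark's):

```python
def top_k_precision(predictions, truths, k):
    """Fraction of cases where the true next word is in the top-k suggestions."""
    hits = sum(truth in preds[:k] for preds, truth in zip(predictions, truths))
    return hits / len(truths)

preds = [["the", "a", "to"], ["you", "me", "us"], ["dog", "cat", "cow"]]
truth = ["a", "you", "bird"]
print(round(top_k_precision(preds, truth, 1), 2))  # → 0.33
print(round(top_k_precision(preds, truth, 3), 2))  # → 0.67
```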
The app is intuitive to use: the user can immediately begin to enter text, see up to 3 suggested next terms, and add one to the message with a single click.
The final app offers a variety of benefits to its users:
- It is fast and, with an overall top-3 precision of over 21 %, reliable.
- It offers its users up to 3 candidate next terms.
- It ships with “evil word” protection.
- It offers experimental support for native German speakers as well.