At Fleksy, we provide developers with the ability to build virtual keyboards. Empowering our users means improving their typing efficiency and accuracy, and we achieve this with features such as auto-correction, auto-completion, next-word prediction, etc. But how can we measure the quality of these features? This blog post shows how we test language features here at Fleksy.
Why do we need an automated benchmark?
User feedback is the most valuable way to measure a tool’s quality. However, gathering feedback from users is not always feasible given constraints such as:
- Frequent releases requiring frequent quality assessments
- 80+ languages to assess
- Quality assessments need to be repeatable
- Quality assessments should be unbiased
Because of this, a manual approach with human testers would be inappropriate or too expensive. Instead, an automated process is much more appropriate for comparing the quality of language features.
While there are some open-source projects for evaluating spellchecking, they don’t cover other tasks that are essential to us, such as auto-completion or next-word prediction. That’s why we at Fleksy built our own benchmark, focused on functions specific to virtual / mobile keyboards:
- Auto-correction
- Auto-completion
- Next-word prediction
- Swipe gesture resolution
We use this benchmark to compare these features across versions, against competitors, and so on.
Benchmarking
Here is a description of how the benchmark is performed:
The Oracle takes the clean data as input, transforms it for each task, and sends it as input to the Corrector. The Corrector is simply the model to test, and it returns its output to the Oracle. Afterward, the Oracle scores this output and computes overall metrics that are then saved in a JSON file.
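To make the flow concrete, here is a minimal sketch of such an Oracle / Corrector loop in Python. The class and method names (`auto_correct`, `introduce_typos`, ...) are illustrative, not the actual API of our SDK:

```python
import json
import random

def introduce_typos(word: str) -> str:
    """Placeholder typo generator: swap two adjacent characters (see 'Typos generation')."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

class Oracle:
    """Transforms clean data into task inputs and scores the Corrector's outputs."""

    def __init__(self, corrector, clean_sentences):
        self.corrector = corrector            # the model under test
        self.clean_sentences = clean_sentences

    def run_auto_correction(self):
        correct = total = 0
        for sentence in self.clean_sentences:
            words = sentence.split()
            for i, word in enumerate(words):
                context = " ".join(words[:i])          # clean previous words as context
                noisy = introduce_typos(word)
                candidates = self.corrector.auto_correct(noisy, context)
                correct += int(bool(candidates) and candidates[0] == word)
                total += 1
        return {"auto_correction": {"accuracy": correct / total}}

    def save(self, results, path="results.json"):
        with open(path, "w") as f:
            json.dump(results, f, indent=2)
```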
Tasks & data
Let’s quickly describe the tasks mentioned above and the various sources we use to collect data for testing.
Tasks description
- Auto-correction: Corrects the words typed by the user. E.g., if a user types I’m especialy touched, the typo should be detected and corrected to I’m especially touched.
- Auto-completion: Completes the word typed by the user. E.g., if a user types I love y, the word should be auto-completed to I love you.
- Next-word prediction: Predicts the next word to be typed. E.g., if a user types I want to eat french, a probable next word can be fries.
- Swipe gesture resolution: Predicts the word from a swipe gesture.
Data
For the benchmark, the idea is to use a list of clean text sentences, alter the text artificially to mimic a user typing on a virtual keyboard, and see how each feature would react to this input:
- For auto-correction, we introduce typos in the clean text and see how the typo would be corrected.
- For auto-completion, we input only part of a word (with or without typos) and see how the word is completed.
- For the next-word prediction, we input a partial sentence and see how it is completed.
- For swipe gesture resolution, we generate a list of swipe gesture points and see if the resolved word corresponds.
For every language, we use the following sources to generate the data over which benchmarking is performed:
- Formal: Data that comes from official, public datasets.
- Conversational: Data scraped from online content, such as movie lyrics.
- Chat: Data gathered directly from our users testing the keyboard.
We have data for each of the 80+ languages supported by our SDKs, all of which need to be tested. After sourcing the required data, we introduce typos into it.
Typos generation
Typos are introduced character by character to mimic a user typing on a keyboard. This also allows multiple typos in a single word, which is vital for seeing how robust our model is against more complex typos.
Overview
Typos are introduced following this graph:
We use teacher forcing at evaluation time: if the model incorrectly auto-corrects a word, we still pass the right context (the original previous words) when auto-correcting the next word.
Let’s see in detail how we introduce typos in the clean text.
Types of typos
Various typos are introduced:
- Character deletion: I’m → Im
- Character addition: acknowledgment → acknowledgement
- Character substitution (with a close character): love → lpve
- Character simplification
  - Accent simplification (change a letter with an accent to the equivalent letter without an accent): café → cafe
  - Case simplification (change an uppercase letter to its lowercase equivalent): I’m → i’m
- Character transposition: colleague → collaegue
- Language-specific common typos (if available)
“Character simplification” is applied before the other typos because simplifications can be nested.
For example, a character can be both uppercase and accented, and the user may type a lowercase, non-accented letter instead (two character simplifications on a single character).
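As a rough illustration, character-level typo injection could look like the sketch below. The probabilities and the nearby-key table are made-up values for the example, not the ones used in our benchmark:

```python
import random
import unicodedata

# Keys close to each other on a QWERTY layout (tiny excerpt, for illustration only)
NEARBY_KEYS = {"o": "ip0l", "e": "wrd", "a": "qsz"}

def simplify(char: str) -> str:
    """Character simplification: drop the accent and the case (applied before other typos)."""
    no_accent = unicodedata.normalize("NFD", char)[0]   # e.g. é -> e
    return no_accent.lower()                            # e.g. I -> i

def add_typos(word: str, p: float = 0.05) -> str:
    """Introduce typos character by character, so a single word can receive several typos."""
    out = []
    for char in word:
        if random.random() < p:                      # character simplification comes first
            char = simplify(char)
        r = random.random()
        if r < p:                                    # character deletion: I'm -> Im
            continue
        if r < 2 * p:                                # character addition (doubled character here)
            out.append(char)
        elif r < 3 * p and char in NEARBY_KEYS:      # substitution with a close key: love -> lpve
            char = random.choice(NEARBY_KEYS[char])
        elif r < 4 * p and out:                      # transposition: colleague -> collaegue
            out[-1], char = char, out[-1]
        out.append(char)
    return "".join(out)
```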
Fuzzy typing
After introducing some potential typos, we have a list of “intended characters”. This list needs to go through a “fuzzy typing” component, which has two roles:
- Generate a likely coordinate for the keystroke (our SDK works directly with keystroke coordinates)
- Potentially type a near-character instead of the intended character (substitution typo)
The goal is to represent the user experience on a soft keyboard (the “fat finger” effect). This can be achieved with a Gaussian probability density function, i.e., we sample a position using two Gaussian distributions (one for the x-axis, the other for the y-axis), both centered on the intended character (see the sketch after the parameter list below).
For example: When typing the letter d:
With a high probability, the generated position will be inside the key d, but it’s also possible to have a keystroke corresponding to another letter.
These Gaussian distributions are parametrized so that we can change them easily to represent different profiles:
Gaussian distributions skewed on the x-axis:
Gaussian distribution representing a right-hand typer profile (the distribution is offset on the right part of the key):
The parameters for these Gaussian distributions are:
- Offset: Distance to offset from the center of the key.
- Ratio: Ratio describing how much the distribution covers the key (see the 68-95-99 rule).
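Here is a minimal sketch of such a sampler, assuming a key is described by its center and size; the exact parametrization in our SDK may differ:

```python
import random

def sample_keystroke(key_center, key_size, offset=(0.0, 0.0), ratio=0.33):
    """Sample a keystroke position around a key using two Gaussian distributions.

    key_center: (x, y) center of the intended key
    key_size:   (width, height) of the key
    offset:     shift of the distribution from the key center (e.g. a right-hand typer)
    ratio:      fraction of the key covered by one standard deviation (68-95-99 rule)
    """
    mean_x = key_center[0] + offset[0]
    mean_y = key_center[1] + offset[1]
    sigma_x = key_size[0] * ratio
    sigma_y = key_size[1] * ratio
    return random.gauss(mean_x, sigma_x), random.gauss(mean_y, sigma_y)

# Example: a hypothetical 'd' key centered at (200, 300), 60x90 pixels
x, y = sample_keystroke((200, 300), (60, 90))
# Most samples land inside the 'd' key, but some fall on neighbouring keys,
# which is exactly how substitution typos appear.
```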
It’s important to note that fuzzy typing will generate different typos for different keyboard layouts.
Swipe gesture generation
While most tasks take text as input, swipe gesture resolution needs a swipe gesture. We can generate one from the keystrokes produced for the word (see Fuzzy typing): we link the word’s keystrokes with Bézier curves and add some randomness to obtain more natural swipe gestures.
Here are some examples of the generated swipe gestures (in red, the keystrokes generated by fuzzy typing; in blue, the points of the corresponding generated swipe gesture).
For the word gives:
For the word they:
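As an illustration, one way to link consecutive keystrokes with quadratic Bézier curves and a bit of noise is sketched below; it conveys the idea rather than our actual implementation:

```python
import random

def bezier_point(p0, p1, p2, t):
    """Quadratic Bézier curve: B(t) = (1-t)^2*p0 + 2(1-t)t*p1 + t^2*p2."""
    x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
    y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
    return x, y

def swipe_gesture(keystrokes, points_per_segment=10, noise=3.0):
    """Link consecutive keystrokes with Bézier curves and add jitter for a natural gesture."""
    points = []
    for start, end in zip(keystrokes, keystrokes[1:]):
        # A random control point near the segment midpoint bends the curve slightly
        mid = ((start[0] + end[0]) / 2 + random.uniform(-5 * noise, 5 * noise),
               (start[1] + end[1]) / 2 + random.uniform(-5 * noise, 5 * noise))
        for i in range(points_per_segment):
            t = i / points_per_segment
            x, y = bezier_point(start, mid, end, t)
            points.append((x + random.gauss(0, noise), y + random.gauss(0, noise)))
    points.append(keystrokes[-1])
    return points

# Example with keystrokes for 'they' (hypothetical key coordinates)
gesture = swipe_gesture([(310, 120), (360, 130), (155, 125), (340, 118)])
```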
Measurements
Now that we have seen how the data is transformed to generate inputs for the Corrector, let’s see how we score its predictions.
Task-specific metrics
Let’s start with the most straightforward task, i.e., next-word prediction.
Next-word prediction
For the next-word prediction task, the predicted word is either right or wrong. Therefore, we compute the accuracy as follows:
accuracy = correct/total
Where correct is the number of correct predictions, and total is the total number of predictions.
We also compute the top-3 accuracy, which is computed in the same way, except that instead of looking only at the most probable prediction, we look at the top 3 most probable candidates. If the word to predict is among these three candidates, we count the prediction as correct.
For this task, we use the top-3 accuracy as the main score. This is because in a virtual keyboard, we usually show three candidates, and the user selects the right one.
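In code, the only difference between accuracy and top-3 accuracy is the slice of candidates we look at. A small sketch (the candidate lists below are made-up examples):

```python
def top_k_accuracy(predictions, targets, k=3):
    """predictions: one ranked candidate list per example; targets: the expected words."""
    correct = sum(target in candidates[:k]
                  for candidates, target in zip(predictions, targets))
    return correct / len(targets)

preds = [["you", "your", "yesterday"], ["fries", "food", "toast"]]
truth = ["you", "toast"]
print(top_k_accuracy(preds, truth, k=1))   # plain accuracy -> 0.5
print(top_k_accuracy(preds, truth, k=3))   # top-3 accuracy -> 1.0
```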
Auto-completion
The auto-completion task is similar to next-word prediction, i.e., whether or not the predicted word is the right one. So we also have accuracy and top-3 accuracy, and for the same reason, we use top-3 accuracy as the main score for this task.
Swipe gesture resolution
The measurement for swipe gesture resolution is similar to the previous tasks, i.e., whether the predicted word is correct. Similarly, we also have accuracy and top-3 accuracy, but in this case, we use accuracy (instead of top-3 accuracy) as the main score for this task. This is because the virtual keyboard automatically recognizes the swipe gesture. The user doesn’t select anything.
Auto-correction
We can compute more metrics for auto-correction because we have the notion of true/false positive/negative. Let’s first define these notions in the context of auto-correction:
- True Negative: No typo introduced. The model doesn’t correct anything
- False Positive: No typo introduced, but the model (wrongly) corrects the word
- True Positive: A typo is introduced, and the model corrects the word into the expected word
- False Negative: A typo is introduced, but the model doesn’t correct anything
It’s easier to visualize with an example:
| | Word typed by the user | Word after being corrected by the model | Expected word |
| --- | --- | --- | --- |
| True Negative | love | love | love |
| False Positive | love | loev | love |
| True Positive | loev | love | love |
| False Negative | loev | loev | love |
From these notions, we can compute the following metrics:
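For reference, using the counts above, the usual definitions are:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F-score = 2 × precision × recall / (precision + recall)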
As the main score for auto-correction, we use the F-score.
Our benchmark computes a score for each task. Then, we merge these scores into a global score using a weighted sum. This kind of global score covering several tasks might hide some details, but it’s essential to have a single number to compare models easily.
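A sketch of such a weighted sum follows; the weights and scores below are placeholders, not the ones we actually use:

```python
# Hypothetical task weights (placeholders, not our actual weighting)
WEIGHTS = {
    "auto_correction": 0.4,      # main score: F-score
    "auto_completion": 0.25,     # main score: top-3 accuracy
    "next_word_prediction": 0.2, # main score: top-3 accuracy
    "swipe_resolution": 0.15,    # main score: accuracy
}

def global_score(task_scores: dict) -> float:
    """Weighted sum of the per-task main scores (each in [0, 1])."""
    return sum(WEIGHTS[task] * score for task, score in task_scores.items())

print(round(global_score({
    "auto_correction": 0.7,
    "auto_completion": 0.5,
    "next_word_prediction": 0.2,
    "swipe_resolution": 0.9,
}), 3))  # -> 0.58
```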
Insights from the metrics
We can compute the task metrics mentioned in the previous section either on the whole test set or on specific subsets of the data, to extract interesting insights.
For example, we mentioned earlier that we have different data sources, i.e., formal, conversational, and chat. By computing metrics for each data source, we can see whether the model handles a specific domain better, or whether a specific domain is harder than the others.
For the auto-completion task, we can also break down the metrics by the completion rate of the word (it’s presumably easier to guess whatever from whatev than from w), or see how the metrics differ when the partial word contains a typo.
Note that the overall score should be balanced according to the difficulty of the completion (if the model can’t guess whatever from w, that’s fine, but not if it can’t predict it from whatev).
Inside the benchmark, we achieve this through sampling: we generate more “almost complete” partial words than “almost empty” ones.
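One simple way to bias the sampling is to weight each possible prefix length by how much of the word it covers; this is a sketch of the idea rather than the exact scheme used in the benchmark:

```python
import random

def sample_prefix(word: str) -> str:
    """Sample a partial word, favouring longer ('almost complete') prefixes over short ones."""
    if len(word) < 2:
        return word
    lengths = list(range(1, len(word)))            # possible prefix lengths (1 .. len-1)
    weights = [l / len(word) for l in lengths]     # 'whatev' is more likely than 'w'
    chosen = random.choices(lengths, weights=weights, k=1)[0]
    return word[:chosen]

print(sample_prefix("whatever"))   # e.g. 'whatev'; 'w' is possible but much less likely
```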
For the auto-correction task, we can get the metrics for each type of typo (to see if the model struggles with a specific kind of typo) or by the number of typos introduced (one typo vs. several typos in a single word) to assess the robustness of the model.
Here, we are talking about insights from the metrics, but it’s also possible to get insights from the data. Our benchmark saves the 1000 most common mistakes so that we can analyze them and see where our model struggles.
Performance metrics
Fleksy focuses mainly on mobile keyboards, where speed and memory matter for a smooth and responsive experience. So we need to ensure that an update doesn’t degrade speed or increase memory usage. That’s why our benchmark also measures the runtime and memory usage of the Corrector, so we can quickly notice any performance regression in our models.
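As an illustration, runtime and (Python-side) peak memory around a single Corrector call can be tracked with the standard library; note that tracemalloc only sees Python allocations, so memory used natively by an SDK would need a different tool:

```python
import time
import tracemalloc

def measure(corrector, noisy_word, context):
    """Return the Corrector's candidates plus runtime (µs) and Python peak memory (KB)."""
    tracemalloc.start()
    start = time.perf_counter()
    candidates = corrector.auto_correct(noisy_word, context)   # illustrative interface
    runtime_us = (time.perf_counter() - start) * 1e6
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return candidates, runtime_us, peak / 1024
```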
Results
Initially, we developed this benchmark strictly for internal usage. Still, one of the most significant applications of this kind of benchmark is the ability to compare Fleksy’s product to its competitors on common ground.
Let’s take Typewise, for example. They make the following claims on their website:
| | Typewise | Gboard | Apple Keyboard | Grammarly |
| --- | --- | --- | --- | --- |
| Prediction accuracy | 63.7% | 48.1% | 53.9% | N/A |
| Mistakes corrected | 97% | 95.1% | 95.7% | 95.3% |
| Multi-language | ✅ | ~ | ~ | ❌ |
| Custom AI model | ✅ | ❌ | ❌ | ❌ |
| Privacy | ✅ | ❌ | ✅ | ❌ |
No sources backing these claims were available on their website at the time of writing.
So we compared Fleksy and Typewise using our benchmark:
| | Fleksy | Typewise |
| --- | --- | --- |
| Next-word prediction (top-3 accuracy) | 16% | 24% |
| Auto-completion (top-3 accuracy) | 46% | 62% |
| Auto-correction (F-score) | 0.71 | 0.70 |
| Swipe resolution (accuracy) | 89% | Not supported |
| Average runtime for auto-correcting one word | 357 μs | 1 s |
As we can see, Fleksy is slightly better at auto-correction, while Typewise is better at next-word prediction and auto-completion. Fleksy also supports swipe gesture resolution, unlike Typewise.
It’s important to note that the comparison is only partially fair, because Fleksy and Typewise solve different problems under different constraints. Fleksy’s keyboards run on mobile devices, so the models are small and fast, which is visible when comparing the average runtimes: Typewise is roughly 2800x slower than Fleksy.
Making the benchmark accessible
A benchmark is a communication tool: the information it produces should be easy to access and to share. That’s why stopping here is not enough.
Dashboard
To communicate the benchmark results across the Fleksy team, no matter how technical each individual is, we made a dashboard. This dashboard shows the latest results for each language, as well as the trend with previous results:
It also shows metrics for each task:
We also show how each metric evolves from the previous version to highlight the change and see if it’s getting better or worse.
We also display other metrics that can bring interesting insights, such as the metrics for each type of typo:
With such a dashboard, anyone at Fleksy can see how well this specific version works.
Automation
Running the benchmark manually is fine the first time, but when it becomes recurring and spans many languages, it quickly gets tiring, even if it’s a single command line. That’s why at Fleksy we consider it essential to automate it.
We configured our Continuous Deployment pipeline so that whenever a new version of our software is released, a job is started for each language, and the version is benchmarked. The results are stored in the cloud and displayed in our dashboard. No manual work is required for benchmarking our models at Fleksy.
Future work
This benchmark helps us at Fleksy understand the performance of our models, but as it’s still an internal tool, the claims that we make can’t be verified by an external party.
In the future, we would like to make this tool public and share it, to foster competition and provide common ground for comparing alternatives.
We haven’t decided yet whether to open-source it or to create a leaderboard with a submission form (Kaggle-style). If you are interested in this tool, have any questions, or want to share an opinion, please get in touch with Fleksy!