4 — Testing NLG Studio projects
How does testing of NLG Studio projects differ from testing of software in general?
What's different?
Not so much! The general principles of testing are the same — you are trying to test that the behavior of the system corresponds to the design or the requirements (i.e. it behaves as intended under all the different scenarios of input data and user actions). However, you are also testing language, and there are different ways to say the same thing! In other words, varying sentences might be used in the same place and each might be correct. So the same test might have multiple acceptable results.
Linguistic variation poses different challenges. Variation in texts that is caused by variation in data can be treated in the same way as any other testing: each is a different test case. For example, "Profits rose dramatically in the first quarter" versus "Profits rose slightly in the first quarter" are sentences produced from data for two different years or two different companies. Variation in texts that report the same result is a different matter: "Profits rose dramatically in the first quarter", "Profits dramatically increased in Q1" and "Profits climbed massively in the first three months" are all generated for the same set of input data (same company, same year).
Tests (especially automated tests) must allow for the fact that there are multiple correct answers. Variation can be tested in component tests, but it is important to test end-to-end to check that:
the variation is coming out in the end system
the variants work well in the broader context of the whole paragraph or whole document.
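One way to let an automated test accept several correct answers is to compare the output against a set of approved variants instead of a single expected string. The following is a minimal sketch in Python, not part of NLG Studio; the names ACCEPTABLE_VARIANTS, normalize, is_acceptable and find_variant are hypothetical, and the variant list is taken from the example sentences above.

```python
import re

# Hypothetical list of approved wordings for ONE set of input data
# (same company, same year). In practice this would be agreed with the
# product owner or an SME.
ACCEPTABLE_VARIANTS = [
    "Profits rose dramatically in the first quarter.",
    "Profits dramatically increased in Q1.",
    "Profits climbed massively in the first three months.",
]


def normalize(text):
    """Collapse runs of whitespace so layout differences do not fail a test."""
    return re.sub(r"\s+", " ", text).strip()


def is_acceptable(sentence, variants=ACCEPTABLE_VARIANTS):
    """Component-level check: the sentence is exactly one approved variant."""
    return normalize(sentence) in {normalize(v) for v in variants}


def find_variant(document, variants=ACCEPTABLE_VARIANTS):
    """End-to-end check: return the approved variant found in the whole
    document, or None if none of them made it into the final text."""
    doc = normalize(document)
    for variant in variants:
        if normalize(variant) in doc:
            return variant
    return None
```

The first function suits component tests; the second is intended for end-to-end runs, where the sentence is embedded in a longer narrative. Neither checks that a variant reads well in context, which still needs a human reviewer.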
Language is subjective. What might be technically correct might not be acceptable to the user; it might not read well or it might not be quite the right terminology. (This is a question of validation rather than verification.) So keep in mind that the testing of NLG systems must focus on the quality of the produced texts, in addition to correctness (i.e. correct behavior). Does the terminology sound right to the user?
Here are a couple of examples:
Heavy rain through the afternoon but there will be patchy thunderstorms.
In the above sentence, feedback from a weather forecaster told us that the conjunction should be "and", not "but".
Rain will ease to snow overnight.
Actually, this should say "Rain will turn to snow". The problem here was caused by values appearing in the narrative the other way around. When originally tested with data, the result was "Snow will ease to rain overnight", which was fine!
Does the system use domain sub-language? A system that is not public facing is likely to operate in a specialized domain, so the language itself is specialized. Here are some examples:
From 03:00 to 04:15 wellhead 3 circulated bottoms up, reaming at 70 RPM.
Feeder 12W74 opened automatically at 21:12, Sunday 02/05/2017 with status of 'flush or dump'.
Currently, the baby is on CMV in 35% O2.
Vent RR is 50 breaths per minute.
This means that when testing the system, particularly from a quality perspective, you will have to pick up some of the lingo! Advice from SMEs and/or the product owner is invaluable, as is lots of reading.
Moreover, we have found that this makes it very difficult to perform any “automated” grammar and style checking on the text, for example with a tool such as LanguageTool. Generally, such tools will flag text that is actually valid in your domain (abbreviations, specialized terms, etc.), and in particular they will almost always be unable to spot language that is correct English but incorrect for the particular system domain.
Tip
You can use Studio Runner to test NLG Studio projects. This tool allows users to publish a project and share a URL with other users (who don't need to have NLG Studio accounts). The other users can then upload their own data and receive a narrative output without having access to the project itself.
Note that the creator of the project must make the project available for Studio Runner by enabling it in the Publish view, then share the URL and API key with the other users. For more information, see the documentation.
Developing a test strategy
Testing an NLG system calls for some special considerations.
Testing is usually done by running a system on test cases. Test cases need to cover unusual edge cases as well as normal situations, because edge cases are where failure is most likely.
In order to avoid repetitive narratives, we often add random variation to them, typically using Studio’s chooseAtRandom function. Testing that good narratives are produced in all variants can be challenging, so when adding variation, don't overdo it, especially if the client doesn't need much.
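A simple way to check that every variant actually appears, and that nothing unexpected does, is to generate the same narrative many times and compare the set of outputs with the set you expect. The sketch below is Python, not Studio script: generate_sentence is a hypothetical stand-in for a project that uses chooseAtRandom, and the phrase list, run count and seed are assumptions chosen for illustration.

```python
import random

# Hypothetical phrasings that a random choice in the project picks from.
PHRASES = ["rose dramatically", "dramatically increased", "climbed massively"]


def generate_sentence(rng):
    """Stand-in for one run of the project on a fixed set of input data."""
    return f"Profits {rng.choice(PHRASES)} in the first quarter."


def collect_variants(generate, runs=200, seed=0):
    """Run the generator repeatedly and return the distinct outputs.
    A fixed seed keeps this stand-in repeatable from run to run."""
    rng = random.Random(seed)
    return {generate(rng) for _ in range(runs)}


def check_variation(observed, expected):
    """Return (missing, unexpected): variants that never appeared, and
    outputs that are not on the approved list."""
    return expected - observed, observed - expected
```

The more variants a narrative has, the more runs are needed before a missing variant can be told apart from bad luck, which is one practical reason to keep variation modest.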
Tip
Watch out for number rounding in the text. "An increase from $3.6M to $3.9M" is okay, but "An increase from $4M to $4M" is not. It’s a good idea to agree on rounding conventions early and build them into the project; otherwise you may test on £3.6M to £4.1M but subsequently get £4M to £4M.
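A rounding problem of this kind can be caught automatically by scanning generated text for "from X to Y" phrases whose two values are identical once rendered. This is a minimal sketch in Python under stated assumptions: the pattern only covers amounts written like $4M or £3.6M, and the names RANGE_PATTERN and find_collapsed_ranges are hypothetical.

```python
import re

# Matches phrases such as "from $3.6M to $3.9M". Assumes an optional
# currency symbol, a number with an optional decimal part, and an optional
# K/M/B suffix; other formats would need a wider pattern.
RANGE_PATTERN = re.compile(
    r"from\s+(?P<start>[$£€]?\d+(?:\.\d+)?[KMB]?)"
    r"\s+to\s+(?P<end>[$£€]?\d+(?:\.\d+)?[KMB]?)"
)


def find_collapsed_ranges(text):
    """Return every 'from X to Y' phrase in which X and Y are rendered
    identically, which usually means rounding has hidden the change."""
    return [
        match.group(0)
        for match in RANGE_PATTERN.finditer(text)
        if match.group("start") == match.group("end")
    ]
```

Running a check like this over the output for every test case is cheap, and it flags the cases where the underlying values differ but the chosen precision makes them look the same.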
In testing NLG systems, the following are some things to consider: