Wondering where to get started with building your Test Body? Read our tips below on types and characteristics of test data.
## Types of Test Data
| Category | Description | Recommendations | Tips/Examples |
| --- | --- | --- | --- |
| Smoke | Tiny data sources that complete in a short amount of time. Useful to "smoke out" basic issues before kicking off on a broader set of data. | All pipelines should have one (or more) smoke datasets that you can comfortably run. | If you don't have in-house smoke data, consider downsampling or subsampling existing data. Ex: PhiX sequencing data, heavily downsampled or subsampled FASTQ/BAM input |
| Representative | Datasources that are representative of the data your application is likely to see in the field. | If your application can be applied to different types or categories of data, or is expected to operate on data ranging across a spectrum of characteristics (size, complexity, etc.), be sure to include datasources representing each of the major conditions. Tag them and/or place them into datasource groups accordingly. | Consider having both Core and Extended groups of representative data, with the latter including multiple sources per category or more comprehensive spectrum coverage. |
| Small | Small (and fast-executing) versions of your realistic test data. | Useful during active development for pipelines whose execution times scale with data size. Provides greater insight than a basic smoke test. | If you don't have in-house small data, consider downsampling or subsampling representative data. Ex: smaller sequencing runs, regional/chromosomal subsampling |
| Challenge/Stress | Data that falls outside the bounds of your expected inputs or claims, but near enough that it could conceivably be faced in the field or provides a buffer around your recommended inputs. | Establish reasonable expectations and boundaries for field data and test just outside those boundaries. Allow for failures and use the results to inform your release notes. | Ex: large or computationally complex datasources |
| Targeted | Data known to contain characteristics of interest, based on known issues, field input, feature development, or high-risk/high-impact features. | Can be full-size or sub-sampled, synthetic or real. Allow for failures and use the results to inform your release notes. | Ex: synthetic data to test variant classes not covered in available real data; datasources tied to issues or customer questions that arose in the field and/or led to a bugfix or feature addition; sub-samples of large data targeting problematic or high-impact areas |
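The downsampling suggested for smoke and small datasets can be sketched in a few lines. This is a minimal illustration, not a production tool (dedicated utilities exist for this); the fraction and record handling here are assumptions, and seeding the random generator keeps the resulting smoke dataset reproducible across runs.

```python
import random

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ file handle."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def subsample_fastq(records, fraction, seed=42):
    """Keep roughly `fraction` of FASTQ records.

    A fixed seed makes the subsample deterministic, so the same smoke
    dataset is produced every time it is regenerated.
    """
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]
```

For paired-end data you would subsample both mates by read name rather than independently, so that pairs stay intact.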
## Characterization

### Truth-Based Characterization

There are three levels of truth-based characterization:

- Fully Characterized
- Partially Characterized
- Uncharacterized
There is no reason to restrict your Test Body to only fully characterized datasources. You can still derive modified accuracy statistics and comparative statistics from partially characterized sources, and incrementally add truth information to partially characterized or uncharacterized sources through curation over time.
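One way to derive modified accuracy statistics from a partially characterized source is to restrict the comparison to the subset where truth is known. The sketch below assumes variants are reduced to hashable keys (e.g. chrom/pos/ref/alt tuples); the function name and set-based representation are illustrative, not a prescribed interface.

```python
def partial_accuracy(calls, truth, characterized):
    """Precision/recall restricted to the characterized subset.

    `calls` and `truth` are sets of variant keys; `characterized` is the
    set of keys (e.g. sites within confidently curated regions) where
    truth is actually known. Calls outside the characterized subset are
    ignored rather than counted as false positives.
    """
    calls_in = calls & characterized
    truth_in = truth & characterized
    tp = len(calls_in & truth_in)
    precision = tp / len(calls_in) if calls_in else None
    recall = tp / len(truth_in) if truth_in else None
    return precision, recall
```

As curation adds truth information over time, the `characterized` set simply grows and the same statistics cover more of the datasource.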
### Result Expectation-Based Characterization

Even in the absence of known truth information, verification can still be performed on result-generating pipelines, for example against:

- Expected number of results meeting particular criteria (e.g. 0 somatic variants in a non-cancer sample, >= 1 result, > 100 results)
- Expected proportion of results meeting particular criteria (e.g. % SNVs, % of indels that are homopolymers, % of variants that are in COSMIC)
- Derived proportion statistics (e.g. transition/transversion ratio)
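A derived proportion statistic such as the transition/transversion ratio can be checked without any truth data. A minimal sketch, assuming SNVs are represented as (ref, alt) base pairs:

```python
# A -> G, G -> A, C -> T, T -> C are transitions; all other SNVs are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) SNV pairs."""
    snvs = list(snvs)
    ti = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")
```

In a verification harness, this would feed a range assertion reflecting your expectation for the sample type (for whole-genome human data, values around 2.0 are commonly cited), e.g. `assert LOW <= titv_ratio(snvs) <= HIGH` with bounds chosen per datasource.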
### Replicate-Based Characterization

- Expected % overlap between results of replicates
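Percent overlap between replicates can be defined several ways; the sketch below uses a Jaccard-style definition (shared results over the union) as one illustrative choice, again assuming results are reduced to hashable keys.

```python
def replicate_overlap(a, b):
    """Percent overlap between two replicates' result sets (Jaccard index * 100)."""
    if not a and not b:
        return 100.0  # two empty result sets agree trivially
    return 100.0 * len(a & b) / len(a | b)
```

An alternative definition divides by the size of the smaller set instead of the union, which is more forgiving when replicates differ in depth; either way, the expected value is established empirically per datasource.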
### Output Expectation-Based Characterization

- Known benchmarks
- Initial-execution benchmarks
- Output metadata ranges
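Output metadata range checks reduce to comparing each metric against an expected interval, typically seeded from an initial execution and tightened over time. A minimal sketch with hypothetical metric names:

```python
def check_ranges(metrics, expected_ranges):
    """Return the metrics that are missing or fall outside their expected (lo, hi) range.

    `metrics` maps metric name -> observed value; `expected_ranges` maps
    metric name -> (lo, hi) inclusive bounds. An empty result means all
    expectations were met.
    """
    failures = {}
    for name, (lo, hi) in expected_ranges.items():
        value = metrics.get(name)
        if value is None or not (lo <= value <= hi):
            failures[name] = value
    return failures
```

The expected ranges themselves become part of the datasource's characterization and are versioned alongside it.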
### Fully Uncharacterized

Even fully uncharacterized data that you never plan to characterize is useful for your development and verification needs. For example, you can still use it for:
- comparative assessment of output results and summary statistics between workflows, versions, and parameter sets
- consistency monitoring: ensuring determinism within versions and between parameter sets
- relational monitoring: ensuring that related parameter sets perform in expected patterns (for example, a parameter set with very tight thresholds should produce fewer variant results than one with loose thresholds)
- monitoring and comparing execution timing
- trends and relationships between input metadata/characteristics and output metadata/characteristics
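The consistency-monitoring idea above needs no knowledge of the data at all: hash the output files from repeated runs and require the digests to match. A minimal sketch (file paths and the byte-for-byte definition of "deterministic" are assumptions; outputs containing timestamps would need those fields normalized first):

```python
import hashlib

def output_digest(path, chunk_size=1 << 20):
    """SHA-256 digest of an output file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_deterministic(paths):
    """True if all of the given output files are byte-for-byte identical."""
    return len({output_digest(p) for p in paths}) == 1
```

The same digests can be stored per version to detect unexpected output drift between releases, even on data you never characterize.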