Building a Body of Test Data

Written by Gwenn Berry

Wondering where to get started with building your Test Body? Read the tips below on the types and characteristics of test data to include.

Types of Test Data

Smoke

Description: Tiny datasources that complete in a short amount of time. Useful to "smoke out" basic issues before kicking off a run on a broader set of data.

Recommendations: All pipelines should have 1 (or more) smoke datasources that you can comfortably run.

Tips/Examples: If you don't have in-house smoke data, consider downsampling or subsampling existing data (see the sketch after this table). Ex: PhiX sequencing data; heavily downsampled or subsampled FASTQ/BAM input.

Representative

Description: Datasources that are representative of the data your application is likely to see in the field.

Recommendations: If your application can be applied to different types or categories of data, or is expected to operate on data ranging across a spectrum of characteristics (size, complexity, etc.), be sure to include datasources representing each of the major conditions. Tag them and/or place them into datasource groups accordingly.

Tips/Examples: Consider having both Core and Extended groups of representative data, with the latter including multiple sources per category or more comprehensive coverage of the spectrum.

Small

Description: Small (and fast-executing) versions of your realistic test data.

Recommendations: Useful during active development for pipelines whose execution time scales with data size; provides greater insight than a basic smoke test.

Tips/Examples: If you don't have in-house small data, consider downsampling or subsampling representative data (see the sketch after this table). Ex: smaller sequencing runs; regional/chromosomal subsampling.

Challenge/Stress

Description: Data that is outside the bounds of your expected inputs or claims, but near enough that it could conceivably be encountered in the field, or that provides a buffer around your recommended inputs.

Recommendations: Establish reasonable expectations and boundaries for field data and test just outside those boundaries. Allow for failures and use the results to inform your release notes.

Tips/Examples: Large or computationally complex datasources.

Targeted

Description: Data that is known to contain characteristics of interest, based on known issues, field input, feature development, or high-risk/high-impact features.

Recommendations: Can be full-size or subsampled, synthetic or real. Allow for failures and use the results to inform your release notes.

Tips/Examples: Synthetic data to test variant classes not covered in available real data; datasources tied to field issues or customer questions that led to a bugfix or feature addition; subsamples of large data targeting problematic or high-impact areas.
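If you don't have ready-made smoke or small datasources, a short script is often enough to create one by subsampling. Below is a minimal sketch (not part of any product API) that randomly subsamples a FASTQ file at a fixed rate; the file names and sampling rate are hypothetical placeholders, and a fixed seed keeps the result reproducible.

```python
import gzip
import random

def subsample_fastq(in_path, out_path, rate=0.01, seed=42):
    """Randomly keep roughly `rate` of the reads in a (gzipped) FASTQ file.

    A fixed seed makes the subsample reproducible, which matters if the
    output is going to serve as a stable smoke or small datasource.
    """
    rng = random.Random(seed)
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # FASTQ: 4 lines/read
            if not record[0]:
                break  # end of file
            if rng.random() < rate:
                fout.writelines(record)

# Hypothetical example: keep ~1% of reads to build a smoke datasource.
subsample_fastq("representative_R1.fastq.gz", "smoke_R1.fastq", rate=0.01)
```

For paired-end data, run both mates with the same seed and rate; because the keep/drop decision depends only on the seed and record order, read pairs stay in sync. Dedicated tools such as seqtk (`seqtk sample`) or samtools (for regional BAM subsetting) cover the same ground with more options.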

Characterization

Truth-Based Characterization

There are three levels of truth-based characterization:

  • Fully Characterized

  • Partially Characterized

  • Uncharacterized

There is no reason to restrict your Test Body to only fully characterized datasources. You can still derive modified accuracy statistics and comparative statistics from partially characterized sources, and incrementally add truth information to partially characterized or uncharacterized sources through curation over time.

Result Expectation-Based Characterization

Even in the absence of known truth information, verification can still be performed on result-generating pipelines by checking expectations such as:

  • Expected number of results meeting particular criteria (e.g. 0 somatic variants in a non-cancer sample, >=1 result, >100 results)

  • Expected proportion of results meeting particular criteria (e.g. %SNV, % of indels that are homopolymers, % of variants present in COSMIC)

  • Derived proportion stats (e.g. transition/transversion (Ti/Tv) ratio; see the sketch below)
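As a concrete illustration, the following is a minimal sketch of how a couple of these checks (SNV fraction and Ti/Tv ratio) might be computed from a VCF with plain Python. It is not part of any product feature; it assumes a simple single-ALT-allele VCF (multi-allelic records would need to be split first, e.g. with `bcftools norm -m-`), and the file path and thresholds are hypothetical.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def expectation_stats(vcf_path):
    """Compute %SNV and Ti/Tv from a simple single-ALT-allele VCF."""
    snvs = indels = ti = tv = 0
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            ref, alt = fields[3], fields[4]  # REF and ALT columns
            if len(ref) == 1 and len(alt) == 1:
                snvs += 1
                if (ref.upper(), alt.upper()) in TRANSITIONS:
                    ti += 1
                else:
                    tv += 1
            else:
                indels += 1
    total = snvs + indels
    return {
        "n_results": total,
        "pct_snv": 100.0 * snvs / total if total else 0.0,
        "titv": ti / tv if tv else float("inf"),
    }

stats = expectation_stats("results.vcf")  # hypothetical output file
assert stats["n_results"] >= 1, "expected at least one result"
```

Per-datasource expectations can then be encoded as simple assertions; for human germline whole-genome data, a genome-wide Ti/Tv near 2.0-2.1 is typical, but the right bounds depend on your assay and panel.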

Replicate-Based Characterization

  • Expected % overlap between the results of replicates (see the sketch below)
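A replicate-overlap check can be as simple as comparing the sets of variant keys called in two replicates, as in the hypothetical sketch below. The file names and the 95% threshold are placeholders, and exact (CHROM, POS, REF, ALT) matching is a simplification; in practice you may want allele-normalized or window-based matching.

```python
def variant_keys(vcf_path):
    """Return the set of (CHROM, POS, REF, ALT) keys in a VCF."""
    keys = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            f = line.split("\t")
            keys.add((f[0], f[1], f[3], f[4]))
    return keys

rep1 = variant_keys("replicate1.vcf")  # hypothetical paths
rep2 = variant_keys("replicate2.vcf")

# Percent overlap, measured against the smaller call set.
denom = min(len(rep1), len(rep2))
overlap = 100.0 * len(rep1 & rep2) / denom if denom else 0.0
assert overlap >= 95.0, f"replicate overlap {overlap:.1f}% below expectation"
```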

Output Expectation-Based Characterization

  • Known benchmarks

  • Initial-execution benchmarks (see the sketch below)

  • Output metadata range
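One way to use initial-execution benchmarks is to store the metrics from a first run and assert that later runs stay within a tolerance band. The sketch below is a hypothetical illustration only; the metric names, JSON benchmark format, and ±10% tolerance are assumptions, not a description of any built-in behavior.

```python
import json

TOLERANCE = 0.10  # assumed policy: allow +/-10% drift from the benchmark

def check_against_benchmark(metrics, benchmark_path):
    """Compare current run metrics to a stored initial-execution benchmark."""
    with open(benchmark_path) as fh:
        # e.g. {"runtime_sec": 840, "n_results": 1523} from the first run
        benchmark = json.load(fh)
    failures = []
    for name, expected in benchmark.items():
        observed = metrics.get(name)
        if observed is None or abs(observed - expected) > TOLERANCE * abs(expected):
            failures.append(f"{name}: observed {observed}, benchmark {expected}")
    return failures

# Hypothetical metrics from the current run:
problems = check_against_benchmark(
    {"runtime_sec": 905, "n_results": 1519},
    "benchmarks/smoke_datasource.json",
)
assert not problems, "; ".join(problems)
```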

Fully Uncharacterized

Even fully uncharacterized data that you never plan to characterize is useful for your development and verification needs. For example, you can still use it for:

  • comparative assessment of output results and summary stats between workflows, versions, and parameter sets

  • consistency monitoring: ensure determinism within versions and between parameter sets (see the sketch after this list)

  • relational monitoring: ensure that related parameter sets perform in expected patterns (for example, a parameter set with very tight thresholds should have fewer variant results than one with loose thresholds)

  • monitoring and comparing timing

  • trends and relationships between input metadata/characteristics and output metadata/characteristics
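A minimal version of the consistency, relational, and comparative checks above might look like the sketch below, reusing the hypothetical `variant_keys` helper from the replicate example. The pipeline outputs, file names, and parameter-set labels are assumptions for illustration.

```python
# Reuses the hypothetical variant_keys() helper defined in the
# replicate-based characterization example above.

strict = variant_keys("results_strict.vcf")  # tight-threshold parameter set
loose = variant_keys("results_loose.vcf")    # loose-threshold parameter set

# Consistency monitoring: rerunning the same version and parameter set
# on the same datasource should reproduce the output exactly.
rerun = variant_keys("results_strict_rerun.vcf")
assert strict == rerun, "non-deterministic output within a version"

# Relational monitoring: a tight-threshold parameter set should not
# report more variants than a loose one on the same input.
assert len(strict) <= len(loose), "strict set returned more results than loose"

# Comparative assessment between versions: report gained/lost results.
v_old = variant_keys("results_v1.vcf")
v_new = variant_keys("results_v2.vcf")
print(f"gained: {len(v_new - v_old)}, lost: {len(v_old - v_new)}")
```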
