Data Pipeline and Formats

Risk Navigator is a static viewer. It does not call a backend after load; every screen is driven by one built JSON dataset selected from tool/manifest.json.

The pipeline therefore has one job: convert vulnerability intelligence and project dependency inventory into data/<scope>.json.

Prerequisites

The reference pipeline assumes:

  • Node.js 20 or later.
  • npm 10 or later.
  • Python 3.11 or later.
  • Git for optional public repository scans.
  • Optional cdxgen or syft for regenerating scanner-based SBOM inputs.

Published documentation and sample dataset metadata should use portable commands and repo-relative paths, not local workstation paths.

Pipeline overview

Final viewer input

The viewer consumes a compact JSON file:

data/<scope>.json

Each dataset must be listed in tool/manifest.json:

{
  "datasets": [
    {
      "label": "OSERA Demo Data (Example)",
      "url": "../data/finos-sample-platform.json",
      "description": "Small curated walkthrough dataset shaped to exercise the main Risk Navigator workflows.",
      "source_type": "synthetic-demo",
      "coverage": "Synthetic project and dependency inventory with hand-shaped remediation scenarios.",
      "limitations": "Demo data for UI walkthroughs; not derived from live FINOS repositories.",
      "docs_url": "../docs/data-pipeline#osera-demo-data"
    },
    {
      "label": "FINOS SBOM Scan Demo",
      "url": "../data/finos-sbom-demo.json",
      "description": "Compact FINOS repository SBOM demo showing the CycloneDX import path.",
      "source_type": "cyclonedx-sbom-demo",
      "coverage": "Small selected FINOS repository names with committed CycloneDX dependency graphs aligned to the sample OSV seed.",
      "limitations": "Compact demo inputs; full repo scans depend on scanner support, lockfiles, and available public repository metadata.",
      "docs_url": "../docs/data-pipeline#finos-sbom-scan-demo"
    },
    {
      "label": "FINOS Deep SBOM Demo",
      "url": "../data/finos-deep-sbom-demo.json",
      "description": "Curated multi-ecosystem FINOS demo with explicit direct/transitive SBOM graph relationships.",
      "source_type": "curated-cyclonedx-sbom",
      "coverage": "8 selected FINOS repositories; Maven, npm, PyPI, OCI base-image, and RPM child-package examples.",
      "limitations": "Demo input, not an authoritative current SBOM for those repositories.",
      "docs_url": "../docs/data-pipeline#finos-deep-sbom-demo"
    },
    {
      "label": "FINOS GitHub Org Snapshot",
      "url": "../data/finos-github-org.json",
      "description": "Broad public FINOS GitHub organization snapshot built from public repository metadata and manifest extraction.",
      "source_type": "public-github-org-manifest-snapshot",
      "coverage": "132 active non-fork public FINOS repositories; 97 repositories with extracted dependency edges in the committed snapshot.",
      "limitations": "Broad but often shallow: manifest extraction can be direct-only and does not fully resolve every build-system dependency graph.",
      "docs_url": "../docs/data-pipeline#finos-github-org-snapshot"
    }
  ]
}

Manifest entries require label and url. The viewer also reads optional description, source_type, coverage, limitations, and docs_url fields for dropdown tooltips and the in-app About Data Sources help tab. The built dataset can provide richer provenance in meta.dataset_methodology; the viewer combines both sources when describing the active dataset.
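The required and optional manifest fields described above can be checked before publishing. The following is a minimal sketch, not part of the repository: the function name `check_manifest` and the wording of the messages are hypothetical, and flagging undocumented fields is an assumption made here for illustration (the viewer may simply ignore them).

```python
REQUIRED_FIELDS = ("label", "url")
OPTIONAL_FIELDS = ("description", "source_type", "coverage", "limitations", "docs_url")


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    datasets = manifest.get("datasets")
    if not isinstance(datasets, list):
        return ["manifest must contain a 'datasets' list"]

    problems = []
    for index, entry in enumerate(datasets):
        if not isinstance(entry, dict):
            problems.append(f"datasets[{index}]: entry must be an object")
            continue
        # label and url are the only fields the documentation marks as required.
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"datasets[{index}]: missing required field '{field}'")
        # Assumption: anything outside the documented fields is worth a note.
        known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
        undocumented = sorted(set(entry) - known)
        if undocumented:
            problems.append(f"datasets[{index}]: undocumented fields {undocumented}")
    return problems
```

To use it, load tool/manifest.json with `json.load` and pass the resulting object to `check_manifest`.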

The top-level JSON shape is:

| Field | Purpose |
| --- | --- |
| meta | Scope labels, freshness, source stats, filters, optional overlays, and counts. |
| departments | Project grouping values used by filters. In enterprise usage this can be department, BU, platform, or team. |
| consumer_projects | Project/repository inventory plus vulnerable-library rollups. |
| libraries | Vulnerable package-version records, CVEs, safe-version analysis, consumers, and remediation signals. |
| amplifier_clusters | Parent dependencies that introduce vulnerable transitive libraries across projects. |
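The top-level shape can also be written down as a type. This is a sketch under assumptions: the container types (an object for meta, arrays for the rest) are taken from the minimal example on this page, and the names `Dataset` and `missing_sections` are hypothetical. SPEC.md remains the authoritative contract.

```python
from typing import Any, TypedDict


class Dataset(TypedDict):
    """Top-level sections of a built data/<scope>.json file."""

    meta: dict[str, Any]
    departments: list[dict[str, Any]]
    consumer_projects: list[dict[str, Any]]
    libraries: list[dict[str, Any]]
    amplifier_clusters: list[dict[str, Any]]


def missing_sections(dataset: dict) -> list[str]:
    """List the top-level sections that are absent from a loaded dataset."""
    return sorted(Dataset.__required_keys__ - set(dataset))
```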

Minimal final JSON example

{
  "meta": {
    "scope_type": "repository-sbom-sample",
    "scope_name": "finos-sbom-demo",
    "scope_label": "FINOS SBOM Scan Demo",
    "extracted_at": "2026-06-23T00:00:00Z",
    "filters_applied": {
      "tooling_qualifiers_excluded": ["build", "module-load", "test"],
      "library_namespaces_excluded": [],
      "cvss_min": 0.01
    },
    "external_signals": {
      "kev_loaded": true,
      "epss_loaded": true
    },
    "counts": {
      "consumer_projects": 3,
      "distinct_cve_libraries": 10,
      "distinct_amplifier_clusters": 3
    }
  },
  "departments": [
    { "name": "SBOM Scan", "project_count": 3 }
  ],
  "consumer_projects": [
    {
      "id": "github|finos|legend-studio",
      "project_ref": "github/finos/legend-studio",
      "department": "SBOM Scan",
      "vulnerable_library_ids": ["maven|org.springframework|spring-web|5.3.39"],
      "vulnerable_library_count": 1,
      "direct_vulnerable_library_count": 0,
      "transitive_vulnerable_library_count": 1
    }
  ],
  "libraries": [
    {
      "id": "maven|org.springframework|spring-web|5.3.39",
      "namespace": "maven",
      "meta": "org.springframework",
      "proj": "spring-web",
      "release": "5.3.39",
      "max_cvss": 8.1,
      "cve_count": 1,
      "cves": [],
      "is_kev_listed": false,
      "epss_max": 0.0,
      "consumer_project_ids": ["github|finos|legend-studio"],
      "direct_consumer_project_ids": [],
      "direct_consumer_count": 0,
      "transitive_consumer_count": 1,
      "total_consumer_count": 1,
      "version_chain": [],
      "nearest_safe_version": "5.3.40",
      "max_safe_patch_same_minor": "5.3.40",
      "distance_to_safe": "PATCH",
      "effort_class": "PATCH"
    }
  ],
  "amplifier_clusters": []
}

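The real library object has more fields than this example shows, and the example itself is abbreviated: meta.counts reports 3 projects and 10 libraries, while only one of each is listed. The authoritative contract is the repository SPEC.md; the executable guardrail is: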

python3 scripts/validate_dataset.py data/<scope>.json
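To give a sense of what such a guardrail can catch, the sketch below checks that projects and libraries reference each other consistently. It is not the logic of scripts/validate_dataset.py: the function name `check_cross_references` is hypothetical, and the invariants are assumptions inferred from the minimal example, not rules quoted from SPEC.md.

```python
def check_cross_references(dataset: dict) -> list[str]:
    """Report IDs that point at projects or libraries missing from the dataset."""
    projects = dataset.get("consumer_projects", [])
    libraries = dataset.get("libraries", [])
    project_ids = {project["id"] for project in projects}
    library_ids = {library["id"] for library in libraries}

    problems = []
    for project in projects:
        # Every vulnerable library a project lists should exist in "libraries".
        for library_id in project.get("vulnerable_library_ids", []):
            if library_id not in library_ids:
                problems.append(f"project {project['id']}: unknown library {library_id}")

    for library in libraries:
        consumers = set(library.get("consumer_project_ids", []))
        # Every consumer a library lists should exist in "consumer_projects".
        for project_id in sorted(consumers - project_ids):
            problems.append(f"library {library['id']}: unknown project {project_id}")
        # Assumed invariant: direct consumers are a subset of all consumers.
        direct = set(library.get("direct_consumer_project_ids", []))
        for project_id in sorted(direct - consumers):
            problems.append(f"library {library['id']}: direct consumer {project_id} not in consumer_project_ids")
    return problems
```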

Raw scope input

scripts/build_dataset.py reads normalized raw CSV files from:

data/raw/<scope>/

The extractor is responsible for producing these files:

| File | Required purpose |
| --- | --- |
| 01-consumer-projects.csv | One row per project, repo, service, app, or other consumer. |
| 02-dep-edges.csv | Project-to-library edges with direct/transitive and parent dependency information. |
| 03-cve-libs.csv | Vulnerable library rows joined to CVE severity and ownership metadata. |
| 04-version-chain.csv | Known package versions used to find the nearest safe upgrade. |
| 05-amplifiers.csv | Parent libraries that introduce vulnerable transitive libraries. |
| 06-cve-edges.csv | Library-version to CVE rows. |

Version values are normalized before joins. For PyPI and npm, Git-tag-style versions such as v0.8.8 are canonicalized to registry versions such as 0.8.8, and duplicate version-chain rows are collapsed. The detailed merge rule is defined in SPEC.md.
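The normalization described above can be sketched as follows. This is an illustration only: the lowercase namespace strings `pypi` and `npm`, the tuple layout of a version-chain row, and the function names are assumptions, and the merge rule in SPEC.md may differ in detail.

```python
import re

# Assumption: these namespace strings identify the PyPI and npm ecosystems.
TAG_STYLE_NAMESPACES = {"pypi", "npm"}
# Strip a leading "v" only when a digit follows, so names like "vendor" stay intact.
_TAG_PREFIX = re.compile(r"^[vV](?=\d)")


def canonical_version(namespace: str, version: str) -> str:
    """Turn a Git-tag-style version such as v0.8.8 into a registry version."""
    version = version.strip()
    if namespace.lower() in TAG_STYLE_NAMESPACES:
        return _TAG_PREFIX.sub("", version)
    return version


def collapse_version_chain(rows):
    """Canonicalize (namespace, meta, proj, version) rows and drop duplicates.

    The first occurrence of each canonical row is kept, in input order.
    """
    seen = set()
    collapsed = []
    for namespace, meta, proj, version in rows:
        key = (namespace, meta, proj, canonical_version(namespace, version))
        if key not in seen:
            seen.add(key)
            collapsed.append(key)
    return collapsed
```

Canonicalizing before deduplicating matters: `v1.3.0` and `1.3.0` only collapse into one row once both have been reduced to the same string.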

Most custom integrations only need to focus on project inventory and dependency edges. The included extractors and builder can fill the CVE-specific files when they can join edges against data/vulns.db.

Project inventory CSV

01-consumer-projects.csv:

| Column | Meaning |
| --- | --- |
| id | Stable project ID. GitHub examples use a pipe-delimited ID that begins with `github`, as in the example row below. |
| namespace | Project namespace, usually github for repo-backed samples. |
| meta | Project owner or organization, such as finos. |
| proj | Project or repository name. |
| release | Version, branch, tag, or inventory release marker. |
| project_ref | Human-readable source reference, such as github/finos/legend-studio. |
| eonid | Optional enterprise/application portfolio ID. Demo data uses synthetic IDs. |
| department | Project grouping used by filters. |
| tai_system | Optional system/application name. |

Example:

id,namespace,meta,proj,release,project_ref,eonid,department,tai_system
github|finos|legend-studio,github,finos,legend-studio,main,github/finos/legend-studio,SBOM-1,SBOM Scan,legend-studio
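A custom extractor can produce this file with Python's standard `csv` module. The sketch below is hypothetical: `github_project_row` and `render_inventory` are illustrative names, and the ID and project_ref patterns are inferred from the example row above.

```python
import csv
import io

COLUMNS = [
    "id", "namespace", "meta", "proj", "release",
    "project_ref", "eonid", "department", "tai_system",
]


def github_project_row(org, repo, release, department, eonid="", tai_system=""):
    """Build one inventory row for a GitHub-hosted repository."""
    return {
        "id": f"github|{org}|{repo}",  # pattern inferred from the example row
        "namespace": "github",
        "meta": org,
        "proj": repo,
        "release": release,
        "project_ref": f"github/{org}/{repo}",
        "eonid": eonid,
        "department": department,
        "tai_system": tai_system,
    }


def render_inventory(rows) -> str:
    """Serialize rows as CSV text with the documented column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the returned text to data/raw/<scope>/01-consumer-projects.csv gives the builder its project inventory input.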

Dependency edge CSV

02-dep-edges.csv:

| Column | Meaning |
| --- | --- |
| consumer_id | Project ID from 01-consumer-projects.csv. |
| library_id | Package coordinate: a pipe-delimited ID that begins with the namespace, as in the example rows below. |
| direct | 1 for direct dependency, 0 for transitive dependency. |
| qualifier | Typical values: runtime, test, build, module-load. |
| parent_id | Parent package that introduces a transitive dependency. Empty for direct dependencies. |

Example:

consumer_id,library_id,direct,qualifier,parent_id
github|finos|legend-studio,maven|org.springframework.boot|spring-boot-starter-web|2.7.18,1,runtime,
github|finos|legend-studio,maven|org.springframework|spring-web|5.3.39,0,runtime,maven|org.springframework.boot|spring-boot-starter-web|2.7.18

Direct/transitive accuracy matters. It drives direct-only filters, transitive exposure, amplifier analysis, and OpenRewrite cart eligibility.
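Because so much depends on these two columns, it can help to sanity-check an edge file before building. The following sketch reads the example rows above, flags direct rows that carry a parent, and groups transitive edges by parent as a rough stand-in for amplifier analysis. The function names are hypothetical and the real builder's amplifier logic may differ.

```python
import csv
import io
from collections import defaultdict

SAMPLE_EDGES = """consumer_id,library_id,direct,qualifier,parent_id
github|finos|legend-studio,maven|org.springframework.boot|spring-boot-starter-web|2.7.18,1,runtime,
github|finos|legend-studio,maven|org.springframework|spring-web|5.3.39,0,runtime,maven|org.springframework.boot|spring-boot-starter-web|2.7.18
"""


def load_edges(csv_text: str) -> list[dict]:
    """Parse 02-dep-edges.csv content into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def edge_problems(edges) -> list[str]:
    """Flag rows whose direct flag is invalid or contradicts parent_id."""
    problems = []
    for number, edge in enumerate(edges, start=1):
        if edge["direct"] not in {"0", "1"}:
            problems.append(f"row {number}: direct must be 0 or 1")
        elif edge["direct"] == "1" and edge["parent_id"]:
            problems.append(f"row {number}: direct dependency has a parent_id")
    return problems


def projects_by_parent(edges) -> dict[str, set[str]]:
    """Group the projects reached through each parent of a transitive edge."""
    grouped = defaultdict(set)
    for edge in edges:
        if edge["direct"] == "0" and edge["parent_id"]:
            grouped[edge["parent_id"]].add(edge["consumer_id"])
    return dict(grouped)
```

A parent that maps to many projects in `projects_by_parent` is a candidate amplifier: upgrading it once may remove a vulnerable transitive library from several consumers.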

SBOM input path

Risk Navigator can import CycloneDX JSON SBOMs:

python3 scripts/extract_org.py \
  --scope finos-sbom-demo \
  --source cyclonedx \
  --sbom-dir data/sboms/finos-sbom-demo

The importer reads:

  • metadata.component: project identity and optional VCS reference.
  • components[*].purl: package coordinates.
  • components[*].scope: runtime/test/build qualifier when present.
  • dependencies[]: direct and transitive graph relationships.

The dependency graph is used to derive:

  • direct project dependencies,
  • transitive dependencies,
  • parent dependency IDs for amplifier analysis.
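The derivation above amounts to a graph walk from the root component. The sketch below is one possible approach, not the importer in scripts/extract_org.py: it relies only on the standard CycloneDX fields `bom-ref`, `ref`, and `dependsOn`, records the first parent found for each transitive component, and uses the hypothetical name `classify_dependencies`.

```python
from collections import deque


def classify_dependencies(bom: dict) -> dict[str, dict]:
    """Map each reachable bom-ref to {'direct': bool, 'parent': str or None}."""
    root = bom["metadata"]["component"]["bom-ref"]
    depends_on = {
        entry["ref"]: entry.get("dependsOn", [])
        for entry in bom.get("dependencies", [])
    }

    result = {}
    queue = deque()
    # Children of the root component are the project's direct dependencies.
    for child in depends_on.get(root, []):
        result[child] = {"direct": True, "parent": None}
        queue.append(child)

    # Breadth-first walk: anything first reached below a direct dependency
    # is transitive, and the node it was reached from is kept as its parent.
    while queue:
        parent = queue.popleft()
        for child in depends_on.get(parent, []):
            if child != root and child not in result:
                result[child] = {"direct": False, "parent": parent}
                queue.append(child)
    return result
```

A component that is both a direct dependency and reachable transitively stays classified as direct here, because direct dependencies are recorded first. Keeping only one parent is a simplification; a production importer may need every parent path.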

CycloneDX files can come from a repo scanner:

python3 scripts/scan_repos_to_sbom.py \
  --scope finos-sbom-demo \
  --org finos \
  --default-demo-repos

By default the wrapper expects cdxgen. It also supports Syft:

python3 scripts/scan_repos_to_sbom.py \
  --scope finos-sbom-demo \
  --org finos \
  --default-demo-repos \
  --scanner syft

Scanner output is realistic but not perfect. Completeness depends on lockfiles, ecosystem support, generated build files, private registry access, and whether the scanner resolves transitive dependencies for that repository.

Run repo scanners in a sandboxed environment. Some scanners may invoke package manager or build-wrapper commands to resolve dependencies. The committed demo datasets avoid that by using deterministic CycloneDX inputs.

Built-in datasets

OSERA Demo Data

Label: OSERA Demo Data (Example)

Built file:

data/finos-sample-platform.json

Raw input:

data/raw/finos-sample-platform/

Build command:

npm run build:all

This is the curated walkthrough dataset. It is intentionally small, stable, and hand-shaped to exercise the main UI modes: patch/minor/major upgrades, backpatch candidates, KEV/EPSS signals, framework grouping, and amplifiers.

FINOS SBOM Scan Demo

Label: FINOS SBOM Scan Demo

Built file:

data/finos-sbom-demo.json

SBOM input:

data/sboms/finos-sbom-demo/

Raw extract:

data/raw/finos-sbom-demo/

Build command:

npm run build:all:finos-sbom-demo

This demonstrates the repo-scanner path. The committed SBOMs are compact CycloneDX files using FINOS public repository names and dependency graphs that align with the sample OSV seed. They are meant to show the data shape and pipeline mechanics without requiring a live GitHub clone during docs builds.

Optional regeneration command:

npm run scan:finos-sbom-demo
npm run build:all:finos-sbom-demo

FINOS Deep SBOM Demo

Label: FINOS Deep SBOM Demo

Built file:

data/finos-deep-sbom-demo.json

SBOM input:

data/sboms/finos-deep-sbom-demo/

Raw extract:

data/raw/finos-deep-sbom-demo/

Build command:

npm run build:all:finos-deep-sbom-demo

This is the higher-fidelity dependency-graph demo. It uses eight selected FINOS repository names with curated CycloneDX inputs that model a useful mix of Maven/Gradle-style Java, npm/package-lock-style frontend, PyPI, OCI base-image, and RPM child-package relationships.

The selected repositories are:

  • architecture-as-code
  • waltz
  • traderX
  • open-resource-broker
  • ipyregulartable
  • TimeBase-CE
  • FDC3
  • symphony-bdk-java

This dataset is intentionally not an all-FINOS scan. It is meant to demonstrate the direct/transitive dependency shape, amplifier analysis, namespace spread, and base-image ancestry that a production SBOM pipeline should preserve.

The committed SBOM inputs are generated by:

npm run generate:finos-deep-sbom-demo

Optional live scanner experimentation can use:

npm run scan:finos-deep-sbom-demo

Run that scanner path in a sandbox and review output before publishing it. The published deep demo is deterministic by design.

FINOS GitHub Org Snapshot

Label: FINOS GitHub Org Snapshot

Built file:

data/finos-github-org.json

Raw extract:

data/raw/finos-github-org/

Recommended refresh command:

npm run build:all:finos-org:full-osv

This is the broadest dataset in the dropdown and the one built from real public repository data. It is generated from the public FINOS GitHub organization using the GitHub REST API, filtered to active non-fork repositories, and joined with the full OSV record set restricted to the packages observed in those repositories.

The current committed snapshot includes:

  • 132 active non-fork public FINOS repositories.
  • 97 repositories with extracted dependency edges.
  • 4,040 dependency edges.
  • 72 distinct vulnerable libraries in the final viewer dataset.
  • 2 KEV-listed libraries.

Use authenticated GitHub API access for refreshes to avoid public API rate limits. The extractor reads declared manifests and does not perform full build-system resolution for every ecosystem, so dependency coverage depends on the manifests present in public repositories and the ecosystems supported by scripts/extract_org.py. This broad dataset is therefore useful for portfolio coverage, but it can be shallow: dependency edges extracted from manifests are often direct-only. Use the FINOS Deep SBOM Demo to show the richer data shape expected from lockfile, scanner, or build-system dependency graphs.
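The active non-fork filter mentioned above can be expressed over repository objects returned by the GitHub REST API, which include boolean `fork`, `archived`, and `disabled` fields. Treating "active" as "neither archived nor disabled" is an assumption made for this sketch; scripts/extract_org.py may define it differently, and the function names are hypothetical.

```python
def is_active_non_fork(repo: dict) -> bool:
    """Return True for repositories that are not forks, archived, or disabled."""
    return not (
        repo.get("fork", False)
        or repo.get("archived", False)
        or repo.get("disabled", False)
    )


def select_repository_names(repos) -> list[str]:
    """Return the sorted names of repositories that pass the filter."""
    return sorted(repo["name"] for repo in repos if is_active_non_fork(repo))
```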

For a faster smoke test, this command uses the sample OSV seed and produces lower vulnerability coverage:

npm run build:all:finos-org

Custom pipeline checklist

For an enterprise or community deployment, the minimum custom pipeline is:

  1. Choose a scope name, for example payments-platform.
  2. Produce project inventory records.
  3. Produce dependency edges with direct/transitive fidelity.
  4. Ingest or provide vulnerability mappings.
  5. Fetch KEV and EPSS enrichment.
  6. Build data/<scope>.json.
  7. Validate it with scripts/validate_dataset.py.
  8. Add it to tool/manifest.json.

Typical commands:

python3 scripts/ingest_vulns.py --write-sample
python3 scripts/fetch_external.py
python3 scripts/extract_org.py --scope <scope> --source cyclonedx --sbom-dir <sbom-dir>
python3 scripts/build_dataset.py --scope <scope> --meta-overlay <meta-overlay-json>
python3 scripts/validate_dataset.py data/<scope>.json
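For repeatable builds, the commands above can be wrapped in a small driver. This is a sketch, not a script shipped with the repository: `pipeline_commands` and `run_pipeline` are hypothetical names, the command arguments are copied from the list above, and the `runner` parameter exists only so the sequence can be exercised without executing anything.

```python
import shlex
import subprocess


def pipeline_commands(scope: str, sbom_dir: str, meta_overlay: str) -> list[list[str]]:
    """Return the documented pipeline commands for one scope, in order."""
    return [
        ["python3", "scripts/ingest_vulns.py", "--write-sample"],
        ["python3", "scripts/fetch_external.py"],
        ["python3", "scripts/extract_org.py", "--scope", scope,
         "--source", "cyclonedx", "--sbom-dir", sbom_dir],
        ["python3", "scripts/build_dataset.py", "--scope", scope,
         "--meta-overlay", meta_overlay],
        ["python3", "scripts/validate_dataset.py", f"data/{scope}.json"],
    ]


def run_pipeline(scope, sbom_dir, meta_overlay, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing step."""
    for command in pipeline_commands(scope, sbom_dir, meta_overlay):
        print("+", shlex.join(command))
        runner(command, check=True)
```

Passing argument lists instead of a shell string avoids quoting problems when a scope name or path contains spaces. Adding the built dataset to tool/manifest.json remains a separate manual step.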

For production use, keep internal extractors and secrets outside this upstream repository and preserve the final JSON contract consumed by the static viewer.