Disaster Recovery in Porch

Descriptions of scenarios where one or more Porch data stores are lost or corrupted, with or without backups.

Introduction

This document describes the impact on Porch when one or more of its data stores fail.

Porch has a relatively complex data storage model: it acts as a mediator between sets of data stored in several locations:

  • the Kubernetes cluster control plane on which Porch is installed, including Porch’s custom resources
  • the storage repositories in which package resources are stored (backing the Porch Repository objects). For more details on how Porch interacts with repositories, see the documentation on Repositories and Repository Adapters
    • Git repositories are Porch’s primary and fully-supported storage backend
  • the contents of the package revision cache, which, depending on the cache option configured at install time, are either held in the cluster’s control plane (CR cache) or stored in a separate SQL database (DB cache). The package revision cache and the different cache options are explained in more detail in the Architecture and Components section: Cache

Porch’s data storage operations are covered in significantly greater depth in the Architecture and Components section.

Each data store serves as the source of truth for different elements of Porch’s data structure:

  • Custom resource objects on the Kubernetes control plane:
    • Porch repositories (Repository objects)
    • package variants (and by extension package variant sets)
    • (if the CR cache is configured) “work-in-progress” package revisions whose lifecycle stage is “Draft”, “Proposed”, or “DeletionProposed”
  • Git repositories:
    • package revisions (Kpt package file contents and directory structures)
  • Package revision cache:
    • Kubernetes-related metadata for package revisions (e.g. labels and annotations)
    • (if the DB cache is configured) “work-in-progress” package revisions whose lifecycle stage is “Draft”, “Proposed”, or “DeletionProposed”
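
As a compact restatement of the list above, the following sketch maps a package revision's lifecycle stage and the configured cache option to the store that holds its authoritative state. The function name, the enum, and the short store labels are hypothetical and exist only for illustration. Treating "Published" revisions as owned by Git is one reading of the list, not a statement about Porch internals.

```python
from enum import Enum


class CacheType(Enum):
    """The two cache options named in this document."""
    CR = "CR"
    DB = "DB"


# Lifecycle stages this document calls "work-in-progress".
WORK_IN_PROGRESS_STAGES = ("Draft", "Proposed", "DeletionProposed")


def source_of_truth(lifecycle: str, cache_type: CacheType) -> str:
    """Return a label for the store that owns a package revision's state.

    The labels ("git", "kubernetes", "db-cache") are illustrative only.
    """
    if lifecycle == "Published":
        # Assumption: published package content lives in the Git repository.
        return "git"
    if lifecycle in WORK_IN_PROGRESS_STAGES:
        # Work in progress lives wherever the package revision cache lives.
        return "kubernetes" if cache_type is CacheType.CR else "db-cache"
    raise ValueError(f"unknown lifecycle stage: {lifecycle!r}")
```

For example, source_of_truth("Draft", CacheType.DB) returns "db-cache", which is why losing the DB cache without a backup loses draft package revisions.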

Backup strategy

Implementing a regular backup strategy for each data store will minimise data loss in the event of a disaster. Restoring from a valid point-in-time backup recovers Porch’s state as it was when the backup was made, at a minimum.

Back up Porch custom resources

Porch custom resources (Repositories, PackageVariants, PackageVariantSets, and the CR cache’s PackageRevs) are stored in the Kubernetes cluster and can be backed up like any other cluster resource, for example by exporting them as YAML manifests or by taking a snapshot of the cluster’s etcd store. If Porch is itself managed by a GitOps platform such as ConfigSync or FluxCD, that platform is expected to hold its own record of the custom resources and to save and restore them through its own reconciliation.
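
A minimal sketch of the YAML export approach, assuming kubectl is on the PATH and already pointed at the right cluster. The fully qualified resource names in PORCH_RESOURCES are assumptions and should be checked against the output of kubectl api-resources on the cluster. The function names and the output layout (one file per resource type) are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed plural resource names; verify with `kubectl api-resources`.
PORCH_RESOURCES = (
    "repositories.config.porch.kpt.dev",
    "packagevariants.config.porch.kpt.dev",
    "packagevariantsets.config.porch.kpt.dev",
    "packagerevs.config.porch.kpt.dev",
)


def export_command(resource: str) -> list:
    """Build the kubectl command that dumps one resource type as YAML."""
    return ["kubectl", "get", resource, "--all-namespaces", "--output", "yaml"]


def backup_custom_resources(out_dir: Path, run=subprocess.run) -> list:
    """Write one YAML file per resource type into out_dir and return the paths.

    `run` is injectable so the function can be exercised without a cluster.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for resource in PORCH_RESOURCES:
        result = run(export_command(resource), check=True,
                     capture_output=True, text=True)
        path = out_dir / f"{resource}.yaml"
        path.write_text(result.stdout)
        written.append(path)
    return written
```

Manifests exported this way include server-populated fields such as status, uid, and resourceVersion, which usually need to be removed before the manifests are applied to a new cluster.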

Back up Git repositories

The Git repositories contain the file content of the Kpt packages managed by Porch. Because Git is distributed, backing up a few repositories can be as simple as copying them from the primary Git server to a second, secured Git server: git push to back up, and git clone or git pull (depending on the extent of the disaster) to restore. However, large numbers of repositories, or repositories containing many large Kpt packages, may require more specialized approaches that back up the Git servers themselves.
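
The sketch below builds the Git commands for copying one repository between servers. It swaps in the --mirror variants of git clone and git push so that every branch and tag is copied; whether all refs need to be preserved is an assumption. The function name, URLs, and working directory are hypothetical, and the commands are only constructed here, not executed.

```python
def mirror_commands(from_url: str, to_url: str, workdir: str) -> list:
    """Return the git commands that copy every ref from from_url to to_url.

    Back up with (primary, backup); restore with (backup, primary).
    """
    return [
        # Bare clone containing all refs of the source repository.
        ["git", "clone", "--mirror", from_url, workdir],
        # Push all refs to the destination; refs missing locally are deleted there.
        ["git", "-C", workdir, "push", "--mirror", to_url],
    ]


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for cmd in mirror_commands("https://git.example.com/porch/blueprints.git",
                               "https://backup.example.com/porch/blueprints.git",
                               "/tmp/blueprints.git"):
        print(" ".join(cmd))
```

Each command list can be passed to subprocess.run(cmd, check=True). Because push --mirror removes destination refs that do not exist in the source, the restore direction should only be run against a destination that is meant to be overwritten.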

Back up DB cache database

If the DB cache option is selected at install time, the package revision cache is stored in an SQL database. The Porch project recommends backing up the entire contents of this database on a regular basis, as it contains all data for unpublished package revisions (in lifecycle stages “Draft”, “Proposed”, or “DeletionProposed”).
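
A minimal sketch of a full-database dump and restore, assuming the DB cache runs on PostgreSQL (the server used in the scenarios in this document) and that the pg_dump and pg_restore client tools are installed. The function names, the file naming scheme, and the connection URI are hypothetical; the commands are only constructed, not executed.

```python
from datetime import datetime, timezone


def dump_command(conn_uri: str, out_dir: str, now=None) -> list:
    """Build a pg_dump command writing a timestamped custom-format archive."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return [
        "pg_dump",
        "--format=custom",  # archive format readable by pg_restore
        f"--file={out_dir}/porch-cache-{stamp}.dump",
        f"--dbname={conn_uri}",
    ]


def restore_command(conn_uri: str, dump_file: str) -> list:
    """Build a pg_restore command that replaces existing objects from dump_file."""
    return [
        "pg_restore",
        "--clean",      # drop database objects before recreating them
        "--if-exists",  # do not fail when an object to drop is absent
        f"--dbname={conn_uri}",
        dump_file,
    ]
```

Either list can be passed to subprocess.run(cmd, check=True) from a scheduled job. Credentials are better supplied through the standard PostgreSQL mechanisms (for example a password file) than embedded in the URI.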

Disaster scenarios

This section gives details of tested scenarios with varying combinations of backup, wipe, and restore routines for different data stores.

1. Complete disaster

Kubernetes cluster is lost with all nodes; Git repositories are lost; DB cache database is lost.

Data backed up:

  • Porch custom resources
  • Git repository contents
  • DB cache database contents

Data stores lost:

  • Kubernetes control plane: entire Kubernetes cluster deleted
  • Git repositories: Git server deleted and recreated empty of data
  • DB cache database:
    • SQL script used to drop all Porch tables
    • PostgreSQL server deleted and recreated empty of data

Restoration steps:

  1. Recreate Kubernetes cluster
  2. Reinstall Porch with DB cache pointed to empty database server
  3. Restore backed-up repository contents to Git server
  4. Restore backed-up database contents to PostgreSQL server
  5. Perform GitOps reconciliation, gradually (in batches of 20) re-creating all backed-up Porch Repository objects
    1. For each batch, wait until all Repository objects have a condition with type “Ready” and status “True”
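
The batched re-creation in the final step can be sketched as follows. The two callables are placeholders for whatever the GitOps tooling provides: apply_batch would re-create the Repository objects in a batch, and all_ready would report whether every object in that batch has become Ready. All names, the polling interval, and the timeout are hypothetical.

```python
import time
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], size: int = 20) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def restore_in_batches(
    names: Sequence[str],
    apply_batch: Callable[[Sequence[str]], None],
    all_ready: Callable[[Sequence[str]], bool],
    size: int = 20,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Re-create repositories batch by batch, waiting for each batch to be Ready."""
    for batch in batches(names, size):
        apply_batch(batch)
        deadline = clock() + timeout_seconds
        # Do not start the next batch until this one has finished syncing.
        while not all_ready(batch):
            if clock() >= deadline:
                raise TimeoutError(f"batch starting at {batch[0]} never became Ready")
            sleep(poll_seconds)
```

Waiting for each batch before starting the next keeps the number of repositories syncing at once bounded, which is the point of re-creating them gradually.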

Expected data loss

None - complete recovery of state at the time the data was backed up.

With backups of all data stores, Porch recovers all data.

2. Kubernetes cluster loss

Kubernetes cluster is lost with all nodes; Git repositories and DB cache database remain safe.

Data backed up:

  • Porch custom resources

Data stores lost:

  • Kubernetes control plane: entire Kubernetes cluster deleted

Restoration steps:

  1. Recreate Kubernetes cluster
  2. Reinstall Porch with DB cache pointed to same (still-existing) PostgreSQL server
  3. Perform GitOps reconciliation, gradually (in batches of 20) re-creating all backed-up Porch Repository objects
    1. For each batch, wait until all Repository objects have a condition with type “Ready” and status “True”
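
The readiness check in the last step can be sketched as a small function over the JSON form of Repository objects, such as the output of a kubectl get with JSON output. It assumes the objects follow the usual Kubernetes layout, with a status.conditions list whose entries carry type and status fields; the function names are hypothetical.

```python
def is_ready(repository: dict) -> bool:
    """True if the object has a condition with type "Ready" and status "True"."""
    conditions = (repository.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def not_ready_names(repository_list: dict) -> list:
    """Return the names of all items in a Kubernetes List that are not yet Ready."""
    return [
        item["metadata"]["name"]
        for item in repository_list.get("items", [])
        if not is_ready(item)
    ]
```

A batch is finished once not_ready_names returns an empty list for the Repository objects in that batch. Objects with no status yet are reported as not ready.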

Expected data loss

None - complete recovery of state at time of cluster loss.

Because Git is the source of truth, one might expect Porch to automatically delete any state that exists only in the cache, such as package revisions in the “Draft” lifecycle stage. However, the connection between Porch and Git is represented by the Repository objects, so until they are recreated by the GitOps reconciliation, Porch does not know the state of Git and cannot overwrite the cached state with it.

3. Porch microservices restarted

All Porch pods (by default, all in the “porch-system” namespace) are ungracefully restarted (e.g. by forcible pod deletion with grace-period 0).

Data backed up:

  • Porch custom resources
  • Git repository contents
  • DB cache database contents

Data stores lost:

  • None
  • Porch will immediately begin to re-sync all repositories, reducing quality of service until all repositories are Ready
    • The Porch API will be unavailable for operations on package revisions
      • get or list operations can be used to monitor Porch for API availability and repository status

Restoration steps:

  1. Wait until all Porch pods return to Ready state
  2. Wait until all Repository objects have a condition with type “Ready” and status “True”
    1. GitOps reconciliation is unnecessary in this case since the Repository objects are unchanged
  3. List package revisions periodically, monitoring results until state stabilises
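
One way to decide that the state has stabilised in the last step is to poll a listing and stop once several consecutive polls return the same result. The sketch below assumes a caller-supplied snapshot function that returns a comparable value, for instance a set of (name, lifecycle) pairs parsed from a package revision listing. The function name, the number of repeats, and the polling limits are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_stable(
    snapshot: Callable[[], T],
    required_repeats: int = 3,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Poll snapshot() until it returns the same value required_repeats times in a row."""
    previous = snapshot()
    repeats = 1
    polls = 1
    while repeats < required_repeats:
        if polls >= max_polls:
            raise TimeoutError("state did not stabilise within the polling limit")
        sleep(poll_seconds)
        current = snapshot()
        polls += 1
        if current == previous:
            repeats += 1
        else:
            # The listing changed, so start counting again from this value.
            previous, repeats = current, 1
    return previous
```

The value returned is the stable listing, which can then be compared with the listing expected from before the restart.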

Expected data loss

None - no data stores were affected, only Porch’s ability to manage them, so full recovery is possible.

In a representative testing environment, recovery takes less than 5 minutes for 115 Repository objects with a 4 GiB memory limit applied to the porch-server microservice.

4. DB cache loss with backup

Kubernetes cluster and Git repositories remain safe; DB cache database is lost.

Data backed up:

  • DB cache database contents

Data stores lost:

  • DB cache database:
    • SQL script used to drop all Porch tables
    • PostgreSQL server deleted and recreated empty of data

Restoration steps:

  1. Restore backed-up database contents to PostgreSQL server
  2. Perform GitOps reconciliation, gradually (in batches of 20) re-creating all backed-up Porch Repository objects
    1. For each batch, wait until all Repository objects have a condition with type “Ready” and status “True”

Expected data loss

None - complete recovery of state at the time the DB cache database was backed up.

With a backup of the cache database, Porch recovers all data.

5. DB cache loss without backup

Kubernetes cluster and Git repositories remain safe; DB cache database is lost without a valid backup.

Data backed up:

  • None

Data stores lost:

  • DB cache database:
    • SQL script used to drop all Porch tables
    • PostgreSQL server deleted and recreated empty of data

Restoration steps:

  1. Perform GitOps reconciliation, gradually (in batches of 20) re-creating all backed-up Porch Repository objects
    1. Repository objects will fail to become ready, as crucial data is missing from the wiped cache database
  2. Restart the porch-server microservice:
    1. kubectl -n porch-system delete pod --selector app=porch-server --force --grace-period 0
    2. Porch will immediately begin to re-sync all repositories, resulting in a decrease in quality of service until all repositories are deemed Ready
  3. Wait until all Repository objects have a condition with type “Ready” and status “True”
  4. List package revisions periodically, monitoring results until state stabilises

Expected data loss

All “work in progress” on package revisions is lost:

  • package revisions in “Draft” or “Proposed” lifecycle stages - complete loss
  • package revisions in “DeletionProposed” lifecycle stage are reverted to “Published” lifecycle stage
    • a similar effect to the use of porchctl rpkg reject
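
The expected outcome above can be expressed as a small function, which may be useful when comparing a package revision listing taken before the loss with one taken after recovery. The function name and the input shape (a mapping from package revision name to lifecycle stage) are hypothetical. Treating “Published” revisions as unchanged is an assumption based on Git being the source of truth for them.

```python
from typing import Dict

LOST_STAGES = ("Draft", "Proposed")


def expected_after_cache_loss(revisions: Dict[str, str]) -> Dict[str, str]:
    """Predict lifecycle stages after a DB cache loss with no backup.

    Draft and Proposed revisions disappear; DeletionProposed revisions
    revert to Published; Published revisions are assumed unchanged.
    """
    expected = {}
    for name, lifecycle in revisions.items():
        if lifecycle in LOST_STAGES:
            continue  # complete loss
        if lifecycle in ("DeletionProposed", "Published"):
            expected[name] = "Published"
        else:
            raise ValueError(f"unknown lifecycle stage: {lifecycle!r}")
    return expected
```

Comparing the result with the actual listing after recovery highlights any package revision whose state differs from what this scenario predicts.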