# Update corrupted files in TDR
Here is a guide on how to update files published in the Terra Data Repository (TDR) that have been identified as corrupted.
## Context
Use case issue: #1396
The UCSC team performs checks on the integrity of the data. During such a check, they might identify a mismatch between the checksum of a file in TDR and the checksum recorded in its metadata.
In such cases, uncorrupted data should be re-submitted by the wranglers. One option would be to re-export the data from ingest; however, the upload API doesn’t allow us to re-upload files that have previously been deleted. A workaround is to bypass ingest and update the file directly in the staging area. For TDR to allow overwriting the previous file, we need to update the file version. Files that we don’t want to update don’t have to be in the staging area.
## Process
- identify the corrupted files in the database
- identify and download the correct file from the archives
- calculate the sha256 checksum hash
- get the json metadata/descriptor files
- amend the json files according to the dcp2 SOP
- ensure the staging area is cleaned
- populate the staging area:
  - `/data` with the correct files
  - `/metadata/*` file with the amended json
  - `/descriptors/*` file with the amended json
- fill the import form
```mermaid
flowchart TB
    subgraph a[inform]
        A(UCSC detects corrupted file)
    end
    subgraph b[get new file]
        B[1. Identify file in ingest]
        C[2. Download file from archives]
        D[3. check sha256]
    end
    subgraph c[update file]
        E[4. get json files]
        F[5. amend files]
    end
    subgraph d[upload]
        G[6. ensure staging clear]
        H[7. populate staging]
        I[8. import form]
    end
    A --> B --> C --> D
    D --> E --> F
    F --> G --> H --> I
```
### 1. Identify the corrupted files in the database
To identify the file in the ingest database, query ingest with the following curl command or python script.
```bash
curl --request POST \
  --url 'https://api.ingest.archive.data.humancellatlas.org/files/query?operator=AND' \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '[
    {
      "field": "content.file_core.file_name",
      "operator": "IS",
      "value": "<file_name>"
    }
  ]'
```
or
```python
from hca_ingest.api.ingestapi import IngestApi

api = IngestApi(url="https://api.ingest.archive.data.humancellatlas.org/")
api.set_token("Bearer <token>")
query = [{'field': 'content.file_core.file_name',
          'operator': 'IS',
          'value': '<file_name>'}]
response = api.post('https://api.ingest.archive.data.humancellatlas.org/files/query?operator=AND', json=query)
```
Note:
First, verify that the project is not Managed Access. If it is managed access, we should reach out to the contributor to provide the file again and make sure that the file is never downloaded locally. Descriptors and file metadata are fine to have locally.
From the output, identify the file uuid and any information that would help to locate the file in the archive. For example, if it’s a sequence_file we could look for insdc_run_accession, insdc_experiment_accession, read_index and lane_index, as sketched below.
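If you prefer to script this, a minimal sketch along these lines can pull out the relevant fields. It assumes the usual HAL-style ingest payload with the matching documents under `_embedded.files` (adjust if the response shape differs); `<token>` and `<file_name>` are placeholders as above.

```python
import requests

INGEST_API = "https://api.ingest.archive.data.humancellatlas.org"
query = [{"field": "content.file_core.file_name", "operator": "IS", "value": "<file_name>"}]
response = requests.post(
    f"{INGEST_API}/files/query",
    params={"operator": "AND"},
    headers={"Authorization": "Bearer <token>"},
    json=query,
)
response.raise_for_status()

# Assumption: HAL-style payload with the matching documents under "_embedded" -> "files".
for file_doc in response.json().get("_embedded", {}).get("files", []):
    content = file_doc.get("content", {})
    # keep any field that could help locate the file in the archive
    archive_fields = {key: value for key, value in content.items()
                      if "insdc" in key or key in ("read_index", "lane_index")}
    print(file_doc.get("uuid", {}).get("uuid"), content["file_core"]["file_name"], archive_fields)
```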
### 2. Identify and download the correct file from the archives
Using the information from step 1, we will try to identify the file in the archives (for example, ENA if the file is in an INSDC archive). If we know that the corrupted file was not retrieved from an archive but from a contributor, see the note above.
If the corrupted file is a fastq derived from a BAM file, take advantage of the complementary read_index to identify the library that was specified in the bamtofastq argument.
Download the file locally, unless it is managed access.
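For files that do have an INSDC run accession, something like the following can locate the download URLs and archive checksums through the public ENA filereport API. This is a sketch only; double-check the endpoint and field names against the current ENA portal API documentation, and the accession in the example is purely illustrative.

```python
import requests

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def ena_fastq_report(run_accession: str) -> list[dict]:
    """Return one record per run with ';'-separated fastq FTP paths and the matching md5s."""
    response = requests.get(
        ENA_FILEREPORT,
        params={
            "accession": run_accession,
            "result": "read_run",
            "fields": "run_accession,fastq_ftp,fastq_md5",
            "format": "json",
        },
    )
    response.raise_for_status()
    return response.json()

# Example: ena_fastq_report("SRR1111111")
```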
### 3. Calculate the sha256 checksum hash
Here we want to verify that the substitution we are attempting makes sense.
- Use the `sha256sum` command (or an equivalent such as `shasum -a 256`, or the python sketch after this list) to get the hash of the downloaded file.
- Compare the `sha256` value with:
  - the corrupted file’s `sha256` (provided by the indexing team)
  - the `sha256` value in the metadata (if not provided by the indexing team, look inside the descriptor file in the next step).
- If the downloaded file’s `sha256` value:
  - matches the corrupted file’s value, and the file can be accessed, it’s possible that the error is in the metadata value. TODO: add SOP for updating metadata only.
  - matches the value in the metadata, the file was corrupted at some point after deposition in the staging area. Re-submitting the file should fix the issue.
  - doesn’t match any value but we are sure that the TDR file is corrupted and the downloaded file is the correct one, we can proceed to the next step as well.
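If a checksum utility isn’t handy, a small python equivalent does the same job. This is a minimal sketch; the filename is illustrative.

```python
import hashlib
import os

def sha256_of(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Hash the file in chunks so large fastq files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

path = "SRR1111111_1.fastq.gz"  # illustrative filename
print(sha256_of(path), os.path.getsize(path))
```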
### 4. Get the json metadata/descriptor files
In order to do an update, we need the metadata and descriptor files to accompany the data file (links are not needed since we are not changing the file_uuid). If we can’t re-generate the metadata and descriptor files from ingest by re-exporting the project, we will have to bypass ingest and access the previously exported files.
This can be done either from the staging area (if the directory has not been cleared) or from TDR (someone with access to the tables could provide the files to us).
#### Get metadata & descriptors from the staging area
Check whether the staging area for the project is still populated with the following command:
```bash
gsutil ls gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>/metadata/<entity_type>
```
and identify the json file by the file uuid from step 1.
Download the /metadata file and the /descriptors file locally. Both have the same file_name, so make sure to download them into different directories. The staging area has the following structure (if the entity is sequence_file):
```
<project-uuid>/
├── data/
│   ├── SRR111111_1.fastq.gz
│   └── ...
├── metadata/
│   └── sequence_file/
│       ├── <file_uuid>_<timestamp>.json
│       └── ...
└── descriptors/
    └── sequence_file/
        ├── <file_uuid>_<timestamp>.json
        └── ...
```
Ideally, keep the same hierarchical folder structure by downloading the whole directory with these commands:
```bash
gsutil -m cp -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>/metadata <path/to/download/metadata>
gsutil -m cp -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>/descriptors <path/to/download/descriptors>
```
#### Get metadata & descriptors from TDR
If the staging area has been cleaned up and no such files exist there, ask the import team or indexing team in the initial slack thread to pull these json files from TDR and share them.
### 5. Amend the json files according to the dcp2 SOP
We want to edit the files so that TDR recognises them as new versions. We might also need to update the checksums.
Updating to a higher version means setting a more recent update date. It doesn’t have to be the actual date of the update, just a date later than the current one.
The changes needed for TDR to perceive this file as an update are described in the dcp2 SOP. Here are the precise changes that need to be made:
- `metadata` file:
  - set `provenance.update_date` to any more recent date in ISO_8601 format (for example: `2025-08-07T16:15:39.822Z`)
  - update the metadata file name suffix (not the `file_name` field) with the same most recent date (see the sketch after this list).
    For example: `aaaaaaaa-5258-aaaa-84f3-aaaaaaaaaaaa_2025-08-22T13:43:40.996000Z.json` should be renamed to `aaaaaaaa-5258-aaaa-84f3-aaaaaaaaaaaa_2025-09-23T14:45:40.996000Z.json`
    Note: the filename timestamp carries extra trailing digits after the decimal point compared to the ISO_8601 example.
- `descriptor` file:
  - set `file_version` to any more recent date in ISO_8601 format (for example: `2025-08-07T16:15:39.822Z`)
  - `file_name` might have the bundle uuid as a prefix to the actual file_name, like `30fb759a-546a-4cfa-a55a-195bc008621a/SRR1111111_1.fastq.gz`; this uuid, up to the `/` character, needs to be removed to leave a clean file_name
  - make sure that `size`, `sha1`, `sha256` and `crc32c` are updated as well (check the script below)
  - update the descriptor file name suffix (not the `file_name` field) with the same most recent date.
    Similarly: `aaaaaaaa-5258-aaaa-84f3-aaaaaaaaaaaa_2025-08-22T13:43:40.996000Z.json` should be renamed to `aaaaaaaa-5258-aaaa-84f3-aaaaaaaaaaaa_2025-09-23T14:45:40.996000Z.json`
    Note: the filename timestamp carries extra trailing digits after the decimal point compared to the ISO_8601 example.
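The script in the next section handles the descriptor contents and filenames. The metadata files only need `provenance.update_date` bumped and the filename suffix renamed, which a minimal sketch like the following could cover. It assumes the same `<project-uuid>/metadata/sequence_file/` layout as the staging area, and the `new_version` value is illustrative.

```python
import json
import os
import re

new_version = "2025-09-23T14:45:40.996000Z"  # any date later than the current one
metadata_dir = "<project-uuid>/metadata/sequence_file"

for name in os.listdir(metadata_dir):
    old_path = os.path.join(metadata_dir, name)
    with open(old_path) as handle:
        doc = json.load(handle)
    # bump the update date inside the document
    doc["provenance"]["update_date"] = new_version
    # swap the timestamp suffix in the filename, keeping the file uuid prefix
    new_name = re.sub(r"_[^_]+\.json$", f"_{new_version}.json", name)
    with open(os.path.join(metadata_dir, new_name), "w") as handle:
        json.dump(doc, handle)
    if new_name != name:
        os.remove(old_path)
```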
#### Script to edit descriptor contents
If the documents are organised in the same hierarchical way as in the staging area, you could use the following script (update the `today` variable with the desired update date, and change the entity type if the file is not a sequence_file):

```python
"""Replace (valid) descriptor contents with correct hash values and rename the descriptor filenames."""
import os
import re
import json
import hashlib
import google_crc32c

# every directory in the working directory is assumed to be a <project-uuid> directory
proj_uuids = [proj for proj in os.listdir() if os.path.isdir(proj)]

# collect the data filenames
filenames = []
for proj in proj_uuids:
    filenames.extend([proj + '/data/' + f for f in os.listdir(proj + '/data/')])

# compute size and hashes for every data file
hashes = {}
for filename in filenames:
    with open(filename, "rb") as file:
        content = file.read()  # note: reads the whole file into memory
    hashes[filename] = {
        'sha1': hashlib.sha1(content).hexdigest(),
        'sha256': hashlib.sha256(content).hexdigest(),
        'crc32c': f'{google_crc32c.value(content):08x}',
        'size': os.path.getsize(filename),
    }

# collect the descriptor filenames
descriptors = []
for proj in proj_uuids:
    descriptors.extend([proj + '/descriptors/sequence_file/' + f
                        for f in os.listdir(proj + '/descriptors/sequence_file/')])

today = '2025-07-16T13:33'  # desired update date, YYYY-MM-DDTHH:MM
for descriptor in descriptors:
    proj = descriptor.split('/')[0]
    with open(descriptor, 'r') as file:
        d = json.load(file)
    # bump the file_version, keeping the seconds/fraction of the old value
    d['file_version'] = today + d['file_version'][16:]
    # update size/sha1/sha256/crc32c from the freshly downloaded data file;
    # take the basename in case file_name still carries a bundle-uuid prefix
    seq_file = proj + '/data/' + d['file_name'].split("/")[-1]
    d.update(hashes[seq_file])
    # rename the descriptor json with the new date
    new_descriptor = re.sub(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}', today, descriptor)
    with open(new_descriptor, 'w') as file:
        json.dump(d, file)
    if new_descriptor != descriptor:
        os.remove(descriptor)
```

### 6. Ensure the staging area is cleaned
Make sure that the staging area has been cleared of any other files that might cause validation to fail. The following command will clear the whole project from the staging area.
CAUTION! Make sure that all contents of the directory are intended to be removed (or are not needed for the update). If ingest is not working, we will not be able to re-populate the staging area with the files deleted here!
```bash
gsutil -m rm -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>
```
### 7. Populate the staging area
The next step is to upload the files into the staging area.
- `/data`: upload the correct files into the staging area with the command:
  ```bash
  gsutil cp <path/to/file_name> gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>/data/<file_name>
  ```
- `/metadata/*` file: upload the amended metadata json files. Be careful not to upload the descriptor file that shares the same name.
  ```bash
  gsutil cp <path/to/metadata_file_name> gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>/metadata/<entity_name>/<metadata_file_name>
  ```
- `/descriptors/*` file: the same with the amended descriptor json. Again, be careful not to upload the metadata file that shares the same name.
  ```bash
  gsutil cp <path/to/descriptor_file_name> gs://broad-dsp-monster-hca-prod-ebi-storage/prod/<project-uuid>/descriptors/<entity_name>/<descriptor_file_name>
  ```
### 8. Import form
As a final step, fill in the import form so that the Import team will include this project in the next release.