Working directly from a Transkribus export
When exporting a transcription from Transkribus, the user gets a directory containing the following elements:
document_name/
- metadata.xml
- mets.xml
- {images} (opt)
- alto/
  - {transcriptions}
While metadata.xml contains metadata on the document from which the transcriptions are exported and is not a standardized XML file, mets.xml follows the METS XML specification and contains the metadata of each transcription and its associated image.
We want to use mets.xml to automatically populate //sourceImageInformation/fileName in the ALTO files.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:mets xmlns:ns2="http://www.w3.org/1999/xlink" xmlns:ns3="http://www.loc.gov/METS/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" OBJID="135216" LABEL="TRAIN_CITlab_AD_Seine_Prudhomme_1858_M1+_duplicated" PROFILE="TRP_V1" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
<ns3:metsHdr CREATEDATE="2020-08-28T09:48:33.036+02:00" LASTMODDATE="2020-08-28T09:48:33.036+02:00" RECORDSTATUS="SUBMITTED">
<ns3:agent ROLE="CREATOR" TYPE="ORGANIZATION">
<ns3:name>UIBK</ns3:name>
<ns3:note>This METS file was generated by Transkribus</ns3:note>
</ns3:agent>
</ns3:metsHdr>
<ns3:amdSec ID="SOURCE">
...
</ns3:amdSec>
<ns3:fileSec>
<ns3:fileGrp ID="MASTER">
<ns3:fileGrp ID="IMG">
<ns3:file ID="IMG_1" SEQ="1" MIMETYPE="image/jpeg" CREATED="2020-08-28T09:48:29.225+02:00">
<ns3:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" ns2:type="simple" ns2:href="PH 1858-520.jpg"/>
</ns3:file>
</ns3:fileGrp>
<ns3:fileGrp ID="ALTO">
<ns3:file ID="ALTO_1" SEQ="1" MIMETYPE="application/xml" CHECKSUM="" CHECKSUMTYPE="MD5">
<ns3:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" ns2:type="simple" ns2:href="alto/Get?id=OGSEQOMZJCSTUCDJKYXTAKDT"/>
</ns3:file>
</ns3:fileGrp>
</ns3:fileGrp>
</ns3:fileSec>
<ns3:structMap ID="TRP_STRUCTMAP" TYPE="MANUSCRIPT">
...
</ns3:structMap>
</ns3:mets>
Because //ns3:fileGrp[@ID="ALTO"]/ns3:file/ns3:FLocat/@ns2:href doesn't point to the local XML file but to the download link (which requires an authentication token), we can't simply pair this value with that of //ns3:fileGrp[@ID="IMG"]/ns3:file/ns3:FLocat/@ns2:href.
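The mismatch can be seen by reading both sets of href values. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name list_hrefs is hypothetical, the namespace URIs are copied from the sample above, and the inline SAMPLE string is a trimmed copy of that sample standing in for a real mets.xml.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample mets.xml; the "mets" prefix is local
# to this script and does not have to match the ns3 prefix used in the file.
NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def list_hrefs(mets_root, group_id):
    """Return the xlink:href of every FLocat in the fileGrp with this ID."""
    xpath = f".//mets:fileGrp[@ID='{group_id}']/mets:file/mets:FLocat"
    return [flocat.get(XLINK_HREF) for flocat in mets_root.findall(xpath, NS)]


# Trimmed copy of the sample above. For a real export, one would use
# ET.parse("mets.xml").getroot() instead (path assumed).
SAMPLE = """<ns3:mets xmlns:ns2="http://www.w3.org/1999/xlink" xmlns:ns3="http://www.loc.gov/METS/">
  <ns3:fileSec>
    <ns3:fileGrp ID="MASTER">
      <ns3:fileGrp ID="IMG">
        <ns3:file ID="IMG_1"><ns3:FLocat ns2:href="PH 1858-520.jpg"/></ns3:file>
      </ns3:fileGrp>
      <ns3:fileGrp ID="ALTO">
        <ns3:file ID="ALTO_1"><ns3:FLocat ns2:href="alto/Get?id=OGSEQOMZJCSTUCDJKYXTAKDT"/></ns3:file>
      </ns3:fileGrp>
    </ns3:fileGrp>
  </ns3:fileSec>
</ns3:mets>"""

root = ET.fromstring(SAMPLE)
image_hrefs = list_hrefs(root, "IMG")
alto_hrefs = list_hrefs(root, "ALTO")
print(image_hrefs)  # local image file names
print(alto_hrefs)   # download links, not local file names
```

Note that the element name is FLocat with a capital L; XML names are case-sensitive, so an XPath using a different capitalization would match nothing.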
Instead, we'll work around this by pairing each local ALTO XML file with the corresponding image file listed in mets.xml. Since we are working within a Transkribus export, the two basenames should be identical and only the extension should differ.
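The basename pairing could be sketched as follows, again with Python's standard library. The helper names pair_by_basename, local_name and set_source_image are hypothetical, and the small ALTO snippet is hand-written for illustration rather than taken from a real export. Elements are matched by local name so that the sketch does not depend on which ALTO namespace version the export uses.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePath


def pair_by_basename(image_names, alto_names):
    """Map each ALTO file name to the image name sharing its basename."""
    images_by_stem = {PurePath(name).stem: name for name in image_names}
    return {
        alto: images_by_stem[PurePath(alto).stem]
        for alto in alto_names
        if PurePath(alto).stem in images_by_stem
    }


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def set_source_image(alto_root, image_name):
    """Write image_name into sourceImageInformation/fileName, if present."""
    for info in alto_root.iter():
        if local_name(info.tag) != "sourceImageInformation":
            continue
        for child in info:
            if local_name(child.tag) == "fileName":
                child.text = image_name
                return True
    return False  # no fileName element found; nothing was changed


# Hand-written ALTO fragment, for illustration only.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Description>
    <sourceImageInformation>
      <fileName/>
    </sourceImageInformation>
  </Description>
</alto>"""

# Image names would come from mets.xml, ALTO names from listing alto/.
pairs = pair_by_basename(["PH 1858-520.jpg"], ["PH 1858-520.xml"])
alto_root = ET.fromstring(ALTO_SAMPLE)
updated = set_source_image(alto_root, pairs["PH 1858-520.xml"])
print(pairs, updated)
```

ALTO files without a matching image are simply left out of the mapping, so they can be reported separately instead of being given a wrong file name. When saving real files with ElementTree, namespace prefixes may be rewritten on output unless they are registered beforehand with ET.register_namespace.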