models Archived
From Transkribus to Kraken
To transfer data from Transkribus to Kraken, we do the following:
- collect the PAGE XML files and the images in a directory
- transform the PAGE XML with an XSLT scenario to produce HTML files
- process the HTML files with Ketos, Kraken's trainer module, to produce training data
- process the training data with Ketos to produce the model for Kraken
Instructions
This stylesheet converts PAGE XML to Ketos HTML.
Caveat 1: You may need to change the path configuration, because the stylesheet expects the images related to the transcriptions to be stored in the transcriptions' parent folder.
To do so, change the following function:
<xsl:function name="my:rel_uri">
<xsl:param name="uri"/>
<xsl:value-of select="concat($XSLDir, '/../', $uri)"/>
</xsl:function>
If you use the Time us export solution, the images and the transcriptions will be in the same directory. In that case, the function should be:
<xsl:function name="my:rel_uri">
<xsl:param name="uri"/>
<xsl:value-of select="concat($XSLDir, '/', $uri)"/>
</xsl:function>
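Before running the transformation, it can help to check that each image sits where the chosen variant of my:rel_uri will look for it. The following is a minimal sketch, not part of the stylesheet: the function name missing_images and its two directory arguments are hypothetical, and it assumes that each PAGE XML file names its image in the imageFilename attribute of the <Page> element.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def missing_images(xml_dir, image_dir):
    """Map each PAGE XML file name to the image it references, if that image is absent.

    xml_dir and image_dir are hypothetical parameters: pass the parent folder
    as image_dir for the '/../' variant, or xml_dir itself for the '/' variant.
    """
    xml_dir, image_dir = Path(xml_dir), Path(image_dir)
    missing = {}
    for xml_path in sorted(xml_dir.glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        for elem in root.iter():
            # Compare local names only, since the namespace varies
            # between PAGE schema versions.
            if elem.tag.rsplit("}", 1)[-1] == "Page":
                image = elem.get("imageFilename")
                if image and not (image_dir / image).is_file():
                    missing[xml_path.name] = image
                break  # one Page element per file is enough
    return missing
```

For example, missing_images("export/page", "export") would correspond to the default layout with images in the parent folder (the directory names here are placeholders). XML files without a <Page> element are simply skipped.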
Caveat 2: The stylesheet requires Saxon EE or Saxon PE.
If you don't have Saxon EE or PE, you can batch convert Transkribus ground truth to Ketos HTML with <oXygen/> Editor, via the Project View (create a new oXygen project for each batch transform process).
NB: Creating logical folders, as recommended in the above link, doesn't seem crucial here, as long as you create your oXygen project in the directory that contains your PAGE XML and images.
It is convenient to use an oXygen transformation scenario. See (and adapt if necessary) this one. See the doc for more info.
Troubleshooting
Make sure that every PAGE XML <TextLine> element contains a <Coords> element with a non-empty @points attribute. These points are the coordinates that Ketos uses, and ketos export will sadly fail without them.
<TextLine id="r2l13" custom="readingOrder {index:10;}">
<Coords points="686,825 744,826 802,826 861,829 919,830 977,832 1036,835 1094,837 1152,840 1211,842 1269,845 1327,847 1386,850 1444,852 1502,855 1561,856 1619,859 1677,861 1736,862 1794,864 1853,865 1853,817 1794,816 1736,814 1677,813 1619,811 1561,808 1502,807 1444,804 1386,802 1327,799 1269,797 1211,794 1152,792 1094,789 1036,787 977,784 919,782 861,781 802,778 744,778 686,777"/>
<Baseline points="686,809 744,810 802,810 861,813 919,814 977,816 1036,819 1094,821 1152,824 1211,826 1269,829 1327,831 1386,834 1444,836 1502,839 1561,840 1619,843 1677,845 1736,846 1794,848 1853,849"/>
<TextEquiv>
<Unicode>à titre de dédit, la somme de leurs francs ; Le Condamne </Unicode>
</TextEquiv>
</TextLine>
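To spot offending lines before the transformation, a small script can list every <TextLine> that lacks a usable <Coords>. This is a minimal sketch using only the Python standard library. The function name lines_without_coords is hypothetical, and it takes the PAGE XML content as a string.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def lines_without_coords(page_xml):
    """Return the ids of TextLine elements lacking a non-empty Coords/@points."""
    root = ET.fromstring(page_xml)
    bad = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look at direct children only: nested Word elements may carry
        # their own Coords, which do not describe the line itself.
        points = [
            child.get("points", "").strip()
            for child in elem
            if local_name(child.tag) == "Coords"
        ]
        if not any(points):
            bad.append(elem.get("id", "<no id>"))
    return bad
```

To check a file, read it first, for instance with pathlib.Path("page.xml").read_text(encoding="utf-8") (the file name is a placeholder), and pass the result to the function. An empty list means every line has coordinates.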