... | ... | @@ -41,28 +41,7 @@ A basic example: |
|
|
</Layout>
|
|
|
</alto>`
|
|
|
```
|
|
|
## Abbyy XML
|
|
|
The same principle as Alto applies, but the grammar is slightly different:
|
|
|
```xml
|
|
|
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="kraken">
|
|
|
<page width="850" height="1083" resolution="0" originalCoords="1">
|
|
|
<block blockType="Text">
|
|
|
<text>
|
|
|
<par>
|
|
|
<line baseline="785" l="160" r="180" t="771" b="799">
|
|
|
<formatting lang="">
|
|
|
<charParams l="160" r="165" t="771" b="799" wordStart="1" charConfidence="0">T</charParams>
|
|
|
<charParams l="165" r="170" t="771" b="799" wordStart="0" charConfidence="0">e</charParams>
|
|
|
<charParams l="170" r="175" t="771" b="799" wordStart="0" charConfidence="0">s</charParams>
|
|
|
<charParams l="175" r="180" t="771" b="799" wordStart="0" charConfidence="0">t</charParams>
|
|
|
</formatting>
|
|
|
</line>
|
|
|
</par>
|
|
|
</text>
|
|
|
</block>
|
|
|
</page>
|
|
|
</document>
|
|
|
```
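As a rough illustration of how this structure can be read back, the sketch below rebuilds the text of each `<line>` by concatenating its `<charParams>` characters, using only Python's standard library. The function name `read_abbyy_lines` and the embedded sample are hypothetical and exist only for this example; they are not part of any importer.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example document above.
ABBYY_NS = "http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"

# Hypothetical sample, condensed from the example above.
SAMPLE = f"""<document xmlns="{ABBYY_NS}" version="1.0" producer="kraken">
<page width="850" height="1083" resolution="0" originalCoords="1">
<block blockType="Text"><text><par>
<line baseline="785" l="160" r="180" t="771" b="799">
<formatting lang="">
<charParams l="160" r="165" t="771" b="799" wordStart="1" charConfidence="0">T</charParams>
<charParams l="165" r="170" t="771" b="799" wordStart="0" charConfidence="0">e</charParams>
<charParams l="170" r="175" t="771" b="799" wordStart="0" charConfidence="0">s</charParams>
<charParams l="175" r="180" t="771" b="799" wordStart="0" charConfidence="0">t</charParams>
</formatting>
</line>
</par></text></block>
</page>
</document>"""


def read_abbyy_lines(xml_text):
    """Return one ((left, top, right, bottom), text) pair per <line>."""
    root = ET.fromstring(xml_text)
    ns = {"a": ABBYY_NS}
    lines = []
    for line in root.iterfind(".//a:line", ns):
        # The line box is stored in the l/t/r/b attributes.
        bbox = tuple(int(line.get(key)) for key in ("l", "t", "r", "b"))
        # Each <charParams> element holds a single character.
        text = "".join(c.text or "" for c in line.iterfind(".//a:charParams", ns))
        lines.append((bbox, text))
    return lines


abbyy_lines = read_abbyy_lines(SAMPLE)
```

This assumes every `<line>` carries integer `l`, `t`, `r` and `b` attributes, as in the example; files that omit them would need extra handling.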
|
|
|
|
|
|
## Page XML
|
|
|
Page XML is developed and maintained by the PRImA Research Lab at the University of Salford, UK. Unlike Alto, a PageXML file can describe more than one page. We validate PageXML files against two versions of the schema, [2018-07-15](https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2018-07-15/pagecontent.xsd) and [2013-07-15](https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd), and we added some validations to handle files generated by Transkribus.
|
|
|
If you encounter an error message like this
|
... | ... | @@ -74,8 +53,8 @@ By defaults the segmentation for the selected images, both regions and lines, wi |
|
|
Each `TextRegion` tag has a list of coordinates declared as `x1,y1 x2,y2...xn,yn`.
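A minimal sketch of how such a `points` string can be turned into coordinate pairs is shown here. The helper names `parse_points` and `bounding_box` are hypothetical, and the code assumes integer pixel coordinates.

```python
def parse_points(points):
    """Convert 'x1,y1 x2,y2 ... xn,yn' into [(x1, y1), ..., (xn, yn)]."""
    coords = []
    # str.split() with no argument also ignores leading/trailing whitespace.
    for pair in points.split():
        x, y = pair.split(",")
        coords.append((int(x), int(y)))
    return coords


def bounding_box(coords):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


polygon = parse_points("150,64 346,60 425,81 ")
box = bounding_box(polygon)
```

Splitting on whitespace rather than on a single space keeps the parser tolerant of trailing or repeated spaces inside the attribute.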
|
|
|
`Baseline` elements are optional in PageXML.
|
|
|
The page content can be stored directly in a `TextLine`, for example:
|
|
|
|
|
|
|
|
|
|
|
|
```xml
|
|
|
<TextLine id="r2l1" custom="readingOrder {index:0;}">
|
|
|
<Coords points="150,64 346,60 425,81 "/>
|
|
|
<Baseline points="155,55 180,55 206,55 231,55 257,55 283,55 308,55"/>
|
... | ... | @@ -83,10 +62,10 @@ for the page content, it can be stored directly in `Textline` for example |
|
|
<Unicode>ܡ ܗܘܡ ܐܘ ܥܒ</Unicode>
|
|
|
</TextEquiv>
|
|
|
</TextLine>
|
|
|
```
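To illustrate the line-level layout, the following sketch collects the text and the optional `Baseline` of each `TextLine` with Python's standard library. It matches elements by local name so that it does not depend on a particular schema version. The function name `read_text_lines`, the sample document, and the namespace URI used in the sample are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace URI is an assumption for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.png" imageWidth="850" imageHeight="1083">
    <TextRegion id="r2">
      <Coords points="100,40 500,40 500,120 100,120"/>
      <TextLine id="r2l1">
        <Coords points="150,64 346,60 425,81"/>
        <Baseline points="155,55 180,55 206,55"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r2l2">
        <Coords points="150,94 346,90 425,111"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(xml_text):
    """Return a dict with id, text and baseline for every TextLine."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter():
        if local(element.tag) != "TextLine":
            continue
        text, baseline = None, None
        # Only direct children are inspected, so Word-level text is ignored.
        for child in element:
            name = local(child.tag)
            if name == "Baseline":
                baseline = child.get("points")
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text or ""
        result.append({"id": element.get("id"), "text": text, "baseline": baseline})
    return result


text_lines = read_text_lines(SAMPLE)
```

A line without a `Baseline` simply gets `None` for that field, which mirrors the fact that the element is optional.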
|
|
|
Alternatively, each `Word` can be stored separately with its own `Coords` and content, as in the example below:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
```xml
|
|
|
<TextLine id="l1">
|
|
|
<Coords points="1550,422 1555,422"/>
|
|
|
<Word id="w122" language="Hebrew" primaryScript="Hebr - Hebrew"
|
... | ... | @@ -104,6 +83,7 @@ for the page content, it can be stored directly in `Textline` for example |
|
|
</TextEquiv>
|
|
|
</Word>
|
|
|
</TextLine>
|
|
|
```
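For the word-level layout, a sketch along the same lines can gather each `Word` and rebuild the line text from them. The function name `read_words`, the sample content, and the choice to join words with a single space are assumptions for illustration, not a description of how any importer behaves.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; ids, coordinates and text are placeholders.
SAMPLE = """<TextLine id="l1">
  <Coords points="1550,422 1650,422 1650,460 1550,460"/>
  <Word id="w1">
    <Coords points="1550,422 1595,422 1595,460 1550,460"/>
    <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
  </Word>
  <Word id="w2">
    <Coords points="1600,422 1650,422 1650,460 1600,460"/>
    <TextEquiv><Unicode>world</Unicode></TextEquiv>
  </Word>
</TextLine>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_words(line_element):
    """Return (word id, text) pairs for the Word children of a TextLine."""
    words = []
    for child in line_element:
        if local(child.tag) != "Word":
            continue
        text = ""
        for equiv in child:
            if local(equiv.tag) != "TextEquiv":
                continue
            for sub in equiv:
                if local(sub.tag) == "Unicode":
                    text = sub.text or ""
        words.append((child.get("id"), text))
    return words


line = ET.fromstring(SAMPLE)
words = read_words(line)
# Assumption: words are joined with a single space, in document order.
line_text = " ".join(text for _, text in words)
```

Document order is used as the word order here; files that record a different reading order would need that taken into account.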
|
|
|
|
|
|
Example of a full PageXML file:
|
|
|
```xml
|
... | ... | |