... | @@ -41,28 +41,7 @@ A basic example: |
</Layout>
</alto>
```

## Abbyy XML

The same principle as Alto applies, but the grammar is slightly different:

```xml
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="kraken">
  <page width="850" height="1083" resolution="0" originalCoords="1">
    <block blockType="Text">
      <text>
        <par>
          <line baseline="785" l="160" r="180" t="771" b="799">
            <formatting lang="">
              <charParams l="160" r="165" t="771" b="799" wordStart="1" charConfidence="0">T</charParams>
              <charParams l="165" r="170" t="771" b="799" wordStart="0" charConfidence="0">e</charParams>
              <charParams l="170" r="175" t="771" b="799" wordStart="0" charConfidence="0">s</charParams>
              <charParams l="175" r="180" t="771" b="799" wordStart="0" charConfidence="0">t</charParams>
            </formatting>
          </line>
        </par>
      </text>
    </block>
  </page>
</document>
```

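To make the structure of the Abbyy example concrete, here is a minimal Python sketch that reads it back with the standard library. The function name `read_abbyy_lines` and the dictionary keys it returns are hypothetical, chosen only for illustration; this is not how the application itself parses these files.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example above; the "a" prefix is arbitrary.
ABBYY_NS = {"a": "http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"}

SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml" version="1.0" producer="kraken">
  <page width="850" height="1083" resolution="0" originalCoords="1">
    <block blockType="Text"><text><par>
      <line baseline="785" l="160" r="180" t="771" b="799">
        <formatting lang="">
          <charParams l="160" r="165" t="771" b="799" wordStart="1" charConfidence="0">T</charParams>
          <charParams l="165" r="170" t="771" b="799" wordStart="0" charConfidence="0">e</charParams>
          <charParams l="170" r="175" t="771" b="799" wordStart="0" charConfidence="0">s</charParams>
          <charParams l="175" r="180" t="771" b="799" wordStart="0" charConfidence="0">t</charParams>
        </formatting>
      </line>
    </par></text></block>
  </page>
</document>"""


def read_abbyy_lines(xml_text):
    """Return one dict per <line>: its text, bounding box and baseline."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//a:line", ABBYY_NS):
        # Each character is its own <charParams> element; join them in order.
        chars = line.findall(".//a:charParams", ABBYY_NS)
        text = "".join(c.text or "" for c in chars)
        # l/t/r/b are the left, top, right and bottom edges of the line box.
        box = tuple(int(line.get(k)) for k in ("l", "t", "r", "b"))
        lines.append({"text": text, "box": box, "baseline": int(line.get("baseline"))})
    return lines


if __name__ == "__main__":
    for entry in read_abbyy_lines(SAMPLE):
        print(entry)
```

Unlike the formats that store a whole line of text in one element, the text here has to be rebuilt character by character, which is why the sketch concatenates the `charParams` contents.
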
## Page XML

Page XML is developed and maintained by the PRImA Research Lab at the University of Salford, UK. Unlike Alto, a PageXML file can describe more than one page. We validate PageXML files against two versions of the schema, [2018-07-15](https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2018-07-15/pagecontent.xsd) and [2013-07-15](https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd), and we added some validations to handle files generated by Transkribus.

If you encounter an error message like this
... | @@ -74,8 +53,8 @@ By defaults the segmentation for the selected images, both regions and lines, wi |

A `TextRegion` tag has a list of coordinates declared as `x1,y1 x2,y2 ... xn,yn`.

`Baseline` elements are optional in PageXML.
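
As an illustration of the `x1,y1 x2,y2 ... xn,yn` notation, the following Python sketch turns a `points` attribute into a list of coordinate pairs. The helper names `parse_points` and `bounding_box` are hypothetical, and the sketch assumes integer coordinates, as in the examples in this document.

```python
def parse_points(points):
    """Convert 'x1,y1 x2,y2 ...' into [(x1, y1), (x2, y2), ...]."""
    coords = []
    # split() with no argument ignores repeated or trailing whitespace.
    for pair in points.split():
        x, y = pair.split(",")
        coords.append((int(x), int(y)))
    return coords


def bounding_box(coords):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) pairs."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    polygon = parse_points("150,64 346,60 425,81 ")
    print(polygon)
    print(bounding_box(polygon))
```

The same helper works for `Coords` and `Baseline` alike, since both use the same point-list syntax.
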

The page content can be stored directly in a `TextLine`, for example:

```xml
<TextLine id="r2l1" custom="readingOrder {index:0;}">
    <Coords points="150,64 346,60 425,81 "/>
    <Baseline points="155,55 180,55 206,55 231,55 257,55 283,55 308,55"/>
... | @@ -83,10 +62,10 @@ for the page content, it can be stored directly in `Textline` for example |
        <Unicode>ܡ ܗܘܡ ܐܘ ܥܒ</Unicode>
    </TextEquiv>
</TextLine>
```

Alternatively, each `Word` can be stored separately with its own `Coords` and content, as in the example below:

```xml
<TextLine id="l1">
    <Coords points="1550,422 1555,422"/>
    <Word id="w122" language="Hebrew" primaryScript="Hebr - Hebrew"
... | @@ -104,6 +83,7 @@ for the page content, it can be stored directly in `Textline` for example |
        </TextEquiv>
    </Word>
</TextLine>
```
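
A reader of PageXML has to cope with both ways of storing content described in this section: text placed directly on the line, or text split across `Word` elements. The Python sketch below shows one possible way to handle both. The function name `line_text` and the sample strings are made up for illustration, the `{*}` namespace wildcard requires Python 3.8 or later, and joining words with a single space in document order is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: one line with direct text, one split into words.
DIRECT = """<TextLine id="a">
  <Coords points="0,0 10,0 10,5"/>
  <TextEquiv><Unicode>abc def</Unicode></TextEquiv>
</TextLine>"""

BY_WORD = """<TextLine id="b">
  <Coords points="0,0 10,0 10,5"/>
  <Word id="w1"><Coords points="0,0 4,0 4,5"/><TextEquiv><Unicode>abc</Unicode></TextEquiv></Word>
  <Word id="w2"><Coords points="5,0 10,0 10,5"/><TextEquiv><Unicode>def</Unicode></TextEquiv></Word>
</TextLine>"""


def line_text(line):
    """Return the text of a TextLine element, whichever way it is stored."""
    # "{*}" matches the tag in any namespace (or none), so the same code
    # works for both schema versions.
    direct = line.find("{*}TextEquiv/{*}Unicode")
    if direct is not None and direct.text:
        return direct.text
    # Fall back to the per-word content, kept in document order.
    words = [w.findtext("{*}TextEquiv/{*}Unicode", default="") for w in line.findall("{*}Word")]
    return " ".join(w for w in words if w)


if __name__ == "__main__":
    print(line_text(ET.fromstring(DIRECT)))
    print(line_text(ET.fromstring(BY_WORD)))
```

Preferring the line-level text when both are present is a design choice made for this sketch; a real importer may decide differently.
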

Example of a full PageXML file:

```xml
... | | ... | |