|
|
# Imports
|
|
|
|
|
|
In the 'Images' tab of a Document you can find an 'Import' button which allows you to feed data from different sources to eScriptorium.
|
|
|
Note that you can **NOT** yet import both the images and the corresponding transcription at the same time; you need to do it in two steps.
|
|
|
|
|
|
## IIIF
|
|
|
Input a valid IIIF manifest URI to import all its images in full resolution along with its metadata.
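
To make "all its images in full resolution" concrete, here is a minimal sketch of how image URLs can be derived from a manifest. It assumes a IIIF Presentation API 2 manifest (`sequences` / `canvases` / `images`) whose image `service` is a single object; the function name and the sample manifest are hypothetical, and this is not necessarily how eScriptorium itself reads manifests.

```python
def full_resolution_urls(manifest):
    """Collect one full-size image URL per image of a IIIF Presentation 2 manifest."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image["resource"]
                service = resource.get("service")
                if service:
                    # IIIF Image API 2 pattern: {id}/{region}/{size}/{rotation}/{quality}.{format}
                    urls.append(service["@id"].rstrip("/") + "/full/full/0/default.jpg")
                else:
                    # No image service declared: fall back to the static resource URL.
                    urls.append(resource["@id"])
    return urls


# Hypothetical manifest, reduced to the fields used above.
manifest = {
    "label": "Example manuscript",
    "sequences": [{"canvases": [
        {"images": [{"resource": {"@id": "https://example.org/a.jpg",
                                  "service": {"@id": "https://example.org/iiif/a"}}}]},
        {"images": [{"resource": {"@id": "https://example.org/b.jpg"}}]},
    ]}],
}
print(full_resolution_urls(manifest))
```
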
|
|
|
|
|
|
## Alto XML
|
|
|
Upload a valid [ALTO XML](https://en.wikipedia.org/wiki/ALTO_(XML)) file for segmentation and transcriptions.
|
|
|
The file is strictly validated against [a future version of ALTO v4](https://gitlab.inria.fr/scripta/escriptorium/blob/develop/app/escriptorium/static/alto-4-1-baselines.xsd); if it is not valid, an error message will (hopefully) help you fix the issue.
|
|
|
|
|
|
|
|
|
## XML
|
|
|
The 'name' field is the name of the transcription in which the text content will be stored (you can select it above the transcription panel). It is possible to import content from different files in the same transcription this way.
|
|
|
By default, the segmentation for the selected images, both regions and lines, will be deleted. You can disable this behavior by unchecking 'Override existing segmentation', in which case the system will try to match the lines and regions by their `ID` attribute. The old content of matching lines is then stored in their history, and new lines/regions are created when no matching existing element is found.
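
The matching behavior described above can be illustrated with a small sketch. The data model here (plain dictionaries keyed by line `ID`) and the function name are assumptions made for illustration only; they are not eScriptorium's actual implementation.

```python
def merge_lines(existing, imported):
    """Merge imported lines into existing ones, matching on the line ID.

    Both arguments map a line ID to its text content.
    Returns (lines, history), where history keeps the replaced content.
    """
    lines = dict(existing)
    history = {}
    for line_id, text in imported.items():
        if line_id in lines:
            # Matching line: keep the old content in its history before replacing it.
            history.setdefault(line_id, []).append(lines[line_id])
        # Matching or not, the imported content becomes the current content.
        lines[line_id] = text
    return lines, history


existing = {"l1": "old text", "l2": "untouched"}
imported = {"l1": "new text", "l3": "brand new line"}
lines, history = merge_lines(existing, imported)
print(lines)
print(history)
```
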
|
|
|
|
|
|
### Alto
|
|
|
Upload a valid [ALTO 4 XML](https://en.wikipedia.org/wiki/ALTO_(XML)) file for segmentation and transcriptions.
|
|
|
|
|
|
A basic example:
|
|
|
```xml
|
... | ... | @@ -40,30 +39,25 @@ A basic example: |
|
|
</Page>
|
|
|
</Layout>
|
|
|
</alto>
|
|
|
```
|
|
|
|
|
|
|
|
|
## Page XML
|
|
|
Page XML is developed and maintained by the PRImA Research Lab at the University of Salford, UK. Unlike Alto, a PageXML file can describe more than one page. We validate PageXML files against two versions of the schema, [2018-07-15](https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2018-07-15/pagecontent.xsd) and [2013-07-15](https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd), and we added some validations to handle files generated by Transkribus.
|
|
|
If you encounter an error message like this:
|
|
|
> {"upload_file": ["Couldn't parse the given file or its validation failed: Document didn't validate. Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2016-07-15}PcGts': No matching global declaration available for the validation root., line 2"], "__all__": ["Choose one type of import."]}.
|
|
|
### PageXML
|
|
|
Page XML is developed and maintained by the PRImA Research Lab at the University of Salford, UK. Unlike Alto, a PageXML file can describe more than one page. Supported schemas range from 2013 to 2019.
|
|
|
|
|
|
Update the `xmlns` and `schemaLocation` attributes of `<PcGts>` to a supported version, as described above.
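
One way to apply this fix to many files is a small script. The sketch below rewrites the schema date in the PAGE namespace URIs as plain text; the function name is hypothetical and the default target version is an assumption (pick whichever supported version your files actually conform to). Changing the declared version does not make the content valid by itself, so the import can still fail if the file uses elements unknown to the target schema.

```python
import re

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/"


def retarget_schema(xml_text, version="2013-07-15"):
    """Replace every PAGE schema date found in the namespace URIs with `version`."""
    pattern = re.escape(PAGE_NS) + r"\d{4}-\d{2}-\d{2}"
    return re.sub(pattern, PAGE_NS + version, xml_text)


# Opening tag declaring the unsupported 2016-07-15 schema from the error above.
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2016-07-15" '
    'xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2016-07-15 '
    'http://schema.primaresearch.org/PAGE/gts/pagecontent/2016-07-15/pagecontent.xsd">'
)
fixed = retarget_schema(sample)
print(fixed)
```
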
|
|
|
By default, the segmentation for the selected images, both regions and lines, will be overridden. You can disable this behavior by unchecking 'Override existing segmentation', in which case the system will try to match the lines and regions by their `ID` attribute. The old content of matching lines is then stored in their history, and new lines/regions are created when no matching existing element is found.
|
|
|
A `TextRegion` tag has a list of coordinates declared as `x1,y1 x2,y2 ... xn,yn`.
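
As an illustration of this format, the following sketch turns a `points` string into a list of coordinate pairs. It assumes integer pixel coordinates, as in the examples on this page; the function name is hypothetical.

```python
def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into [(x1, y1), (x2, y2), ...]."""
    # str.split() without arguments also tolerates leading or trailing spaces.
    return [tuple(int(value) for value in pair.split(",")) for pair in points.split()]


region = parse_points("113,29 113,1021 697,1021 697,29")
print(region)
```
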
|
|
|
|
|
|
`Baseline` is optional in PageXML.
|
|
|
The page content can be stored directly in a `TextLine`, for example:
|
|
|
|
|
|
```xml
|
|
|
|
|
|
<TextLine id="r2l1">
|
|
|
<Coords points="150,64 346,60 425,81"/>
|
|
|
<Baseline points="155,55 180,55 206,55 231,55 257,55 283,55 308,55"/>
|
|
|
<TextEquiv>
|
|
|
<Unicode>ܡ ܗܘܡ ܐܘ ܥܒ</Unicode>
|
|
|
</TextEquiv>
|
|
|
</TextLine>
|
|
|
```
|
|
|
|
|
|
Alternatively, each `Word` can be stored separately, with its own `Coords` and content:
|
|
|
|
|
|
```xml
|
|
|
<TextLine id="l1">
|
... | ... | @@ -77,7 +71,7 @@ or each `Word` is separated with its `Coords` and content like the example below |
|
|
</Word>
|
|
|
<Word id="w45" language="Hebrew" primaryScript="Hebr - Hebrew"
|
|
|
readingDirection="right-to-left">
|
|
|
|
|
|
<Coords points="531,464 687,464"/>
|
|
|
<TextEquiv>
|
|
|
<Unicode>הוט</Unicode>
|
|
|
</TextEquiv>
|
... | ... | @@ -85,7 +79,7 @@ or each `Word` is separated with its `Coords` and content like the example below |
|
|
</TextLine>
|
|
|
```
|
|
|
|
|
|
|
|
|
Example of a full PageXML file:
|
|
|
```xml
|
|
|
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
|
|
|
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
|
... | ... | @@ -94,8 +88,6 @@ example of full PageXML file : |
|
|
</Creator>
|
|
|
<Created>2019-09-27T23:23:02.181+02:00</Created>
|
|
|
<LastChange>2019-09-30T19:47:30.802+02:00</LastChange>
|
|
|
<TranskribusMetadata docId="218635" pageId="9377652" pageNr="1" tsid="16181013"
|
|
|
status="IN_PROGRESS" userId="198" imgUrl="https://files.transkribus.eu/Get?id=IGCWFROGSHEQTNTPACSBRMTW" xmlUrl="https://files.transkribus.eu/Get?id=ZDOBVJAWMVGPUHOZVTSSWCSG" imageId="6101684"/>
|
|
|
</Metadata>
|
|
|
<Page imageFilename="test3.png" imageWidth="738" imageHeight="1139">
|
|
|
<ReadingOrder>
|
... | ... | @@ -104,23 +96,23 @@ example of full PageXML file : |
|
|
<RegionRefIndexed index="1" regionRef="r2"/>
|
|
|
</OrderedGroup>
|
|
|
</ReadingOrder>
|
|
|
|
|
|
<TextRegion orientation="0.0" id="r2">
|
|
|
<Coords points="113,29 113,1021 697,1021 697,29"/>
|
|
|
|
|
|
<TextLine id="r2l1">
|
|
|
<Coords points="150,64 346,60 425,81 460,60 616,64 621,5 396,2 328,3 304,21 271,4 232,23 150,17"/>
|
|
|
<Baseline points="155,55 180,55 206,55 231,55 257,55 283,55 308,55 334,56 359,56 385,56 411,56 436,56 462,56 487,56 513,55 539,55 564,55 590,54 616,53"/>
|
|
|
<TextEquiv>
|
|
|
<Unicode>ܡ ܗܘܡ ܐܘ ܥܒ</Unicode>
|
|
|
</TextEquiv>
|
|
|
</TextLine>
|
|
|
|
|
|
<TextLine id="r2l7">
|
|
|
<Coords points="125,382 441,391 538,375 571,397 676,391 677,344 655,339 572,359 538,341 514,360 445,336 336,353 237,348 222,332 193,354 152,332 126,348"/>
|
|
|
<Baseline points="130,372 157,373 184,374 211,375 238,376 265,377 292,377 319,378 346,378 373,378 400,379 427,379 454,379 481,379 508,379 535,380 562,380 589,380 616,381 643,382 671,382"/>
|
|
|
<TextEquiv>
|
|
|
<Unicode>ܕܐܣܝܪ. ܥܠ ܗ̇ܘ ܕܠܐ ܐܢܫ</Unicode>
|
|
|
</TextEquiv>
|
|
|
</TextLine>
|
|
|
|
|
|
<TextLine id="r2l20">
|
|
|
<Coords points="129,1006 181,997 236,1017 261,998 338,997 499,1025 567,1010 659,1026 702,1010 703,985 576,971 531,984 500,959 469,980 399,958 324,987 297,964 246,972 211,954 193,968 175,952 130,975"/>
|
|
|
<Baseline points="134,992 162,992 190,993 218,994 246,995 274,996 302,997 331,998 359,999 387,1000 415,1001 443,1002 471,1003 499,1003 528,1004 556,1005 584,1006 612,1007 640,1008 668,1008 697,1009"/>
|
|
|
<TextEquiv>
|
... | ... | @@ -140,6 +132,6 @@ example of full PageXML file : |
|
|
|
|
|
```
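
Before uploading, it can help to check which lines and texts a file actually contains. The sketch below lists the `id` and text of every `TextLine` using Python's standard library. It assumes the 2013-07-15 namespace used in the example above; the function name is hypothetical and the sample document is a shortened stand-in with placeholder text, not a real export.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema; adjust it if your files declare another version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def read_lines(xml_text):
    """Return (line id, text) pairs for every TextLine in a PageXML document."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iterfind(".//pc:TextLine", NS):
        # Only the TextEquiv directly under the line, so Word-level text is ignored here.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        result.append((line.get("id"), text))
    return result


sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="test3.png" imageWidth="738" imageHeight="1139">
    <TextRegion id="r2">
      <Coords points="113,29 113,1021 697,1021 697,29"/>
      <TextLine id="r2l1">
        <Coords points="150,64 346,60 425,81"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r2l7">
        <Coords points="125,382 441,391 538,375"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""
print(read_lines(sample))
```
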
|
|
|
|
|
|
|
|
|
### Zip
|
|
|
If you need to import more than one file, you can compress them into a zip file; it needs to be flat (all the XML files at the root of the zip). Images will also be extracted, but since the files are processed in the order in which they appear in the archive, each image needs to be present in the zip **before** its respective XML file in order for the transcription to be bound to it.
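
A zip that satisfies both constraints (flat layout, images before their XML) can be built with a short script. The sketch below is only an illustration: the function name, the list of image extensions and the file names are assumptions. It writes all images first, which is stricter than required but guarantees that each image precedes its XML file.

```python
import io
import zipfile

# Assumed set of image extensions; extend it to match your own files.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def build_flat_zip(files):
    """Build a flat zip from a dict mapping a path to its bytes; returns the zip as bytes."""
    def order(path):
        name = path.rsplit("/", 1)[-1]
        # Images sort first (0), everything else after (1), then alphabetically.
        return (0 if name.lower().endswith(IMAGE_EXTENSIONS) else 1, name)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(files, key=order):
            # Keep only the base name so that every entry sits at the root of the zip.
            archive.writestr(path.rsplit("/", 1)[-1], files[path])
    return buffer.getvalue()


data = build_flat_zip({
    "export/page1.xml": b"<alto/>",
    "scans/page1.png": b"fake image bytes",
})
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
print(names)
```

Note that two files with the same base name in different folders would collide once flattened, so rename them beforehand.
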
|
|
|
|