Page MenuHomeSoftware Heritage

Fix revision metadata redundancy in deposit metadata
Closed, ResolvedPublic

Description

In the IPOL deposit and all deposits since the 10th of December 2019 we have a property called extrinsic containing the copy of metadata that should be sent to origin_metadata table.

  1. delete this property in the revision metadata
  2. check that the metadata is still sent to the origin_metadata table

Here is the example:

}
    "@xmlns": "http://www.w3.org/2005/Atom",
    "@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
    "author": "Jose-Luis Lisani",
    "codemeta:applicationCategory": "Image Processing On Line (IPOL)",
    "codemeta:dateCreated": "2018-11-14",
    "codemeta:datePublished": "2018-12-07",
    "codemeta:description": "Implementation of Shape Preserving Local Histogram Modification Algorithm",
    "codemeta:identifier": "https://doi.org/10.5201/ipol.2018.236",
    "codemeta:keywords": [
        "contrast enhancement",
        "histogram modification",
        "local processing"
    ],
    "codemeta:license": {
        "codemeta:name": "LGPL-3.0-or-later",
        "codemeta:url": "https://spdx.org/licenses/LGPL-3.0-or-later.html"
    },
    "codemeta:operatingSystem": "Linux",
    "codemeta:programmingLanguage": "C++",
    "codemeta:referencePublication": {
        "codemeta:identifier": "236",
        "codemeta:name": "An Analysis and Implementation of the Shape Preserving\nLocal Histogram Modification Algorithm"
    },
    "codemeta:releaseNotes": "On-line demo available at IPOL webpage",
    "codemeta:url": "https://www.ipol.im/pub/art/2018/236/",
    "external_identifier": "236",
    "extrinsic": {
        "provider": "https://deposit.softwareheritage.org/1/private/551/meta/",
        "raw": {
            "branch_name": "master",
            "origin": {
                "type": "deposit",
                "url": "https://www.ipol.im/87afca27-ebbb-408a-845d-5277b6a92f49"
            },
            "origin_metadata": {
                "metadata": {
                    "@xmlns": "http://www.w3.org/2005/Atom",
                    "@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
                    "author": "Jose-Luis Lisani",
                    "codemeta:applicationCategory": "Image Processing On Line (IPOL)",
                    "codemeta:dateCreated": "2018-11-14",
                    "codemeta:datePublished": "2018-12-07",
                    "codemeta:description": "Implementation of Shape Preserving Local Histogram Modification Algorithm",
                    "codemeta:identifier": "https://doi.org/10.5201/ipol.2018.236",
                    "codemeta:keywords": [
                        "contrast enhancement",
                        "histogram modification",
                        "local processing"
                    ],
                    "codemeta:license": {
                        "codemeta:name": "LGPL-3.0-or-later",
                        "codemeta:url": "https://spdx.org/licenses/LGPL-3.0-or-later.html"
                    },
                    "codemeta:operatingSystem": "Linux",
                    "codemeta:programmingLanguage": "C++",
                    "codemeta:referencePublication": {
                        "codemeta:identifier": "236",
                        "codemeta:name": "An Analysis and Implementation of the Shape Preserving\nLocal Histogram Modification Algorithm"
                    },
                    "codemeta:releaseNotes": "On-line demo available at IPOL webpage",
                    "codemeta:url": "https://www.ipol.im/pub/art/2018/236/",
                    "external_identifier": "236",
                    "title": "mlheIPOL"
                },
                "provider": {
                    "metadata": {},
                    "provider_name": "",
                    "provider_type": "deposit_client",
                    "provider_url": "https://www.ipol.im/"
                },
                "tool": {
                    "configuration": {
                        "sword_version": 2
                    },
                    "name": "swh-deposit",
                    "version": "0.0.1"
                }
            }
        },
        "when": "2020-04-20T23:16:11.436881+00:00"
    },
    "original_artifact": [
        {
            "checksums": {
                "sha1": "cdce42697925513b15824f69bd461ec6f3bc9571",
                "sha256": "2a261e6b673f09f60185f6fbf874a828d3b5aeee946d38dd2f2bc4d899db0b1a"
            },
            "filename": "archive.zip",
            "length": 661595
        }
    ],
    "title": "mlheIPOL"
}

Event Timeline

moranegg triaged this task as High priority.Apr 22 2020, 3:36 PM
moranegg created this task.
ardumont added a comment.EditedApr 22 2020, 4:07 PM

I would not say bug, i would say change.

This was most probably introduced during the package loader refactoring [1]
(completion dates roughly match, landing date at around 29th of november 2019).

We structurally unified the revision metadata for the package loaders to follow
a common pattern. The discussion took place prior to the effective change [2].

[1] T2024

[2] T2017

moranegg added a comment.EditedApr 22 2020, 4:09 PM

I don't mind calling it change :-)

It looks later then October..
Here is the last correct revision: https://archive.softwareheritage.org/browse/revision/86d194239d23027db374e0f8b977289a7e13f9ca/

moranegg renamed this task from Fix revision metadata bug in deposit metadata to Fix revision metadata redundancy in deposit metadata.Apr 22 2020, 4:14 PM

Yes, seen thanks for the links.

Please, check my initial comment (refresh the page), i adapted it prior to you
replying (sorry, it missed details).

Can you please update the task description to also display what you actually
expect as revision metadata field?

You can let the actual revision which is the current one with redundancy, and
add a new entry with what you actually expect.

Thanks

ardumont added a comment.EditedApr 22 2020, 4:25 PM

For your description's bullet point 2., the code that does the metadata manipulation is within the following code [2]

[2] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/deposit/loader.py$123-143

The task description says that the property extrinsic should be deleted in the revision metadata.

The information under the extrinsic property were to be sent to the origin_metadata table and ended up in the revision.

So when deleting extrinsic, we need to make sure we don't delete the metadata sent to origin_metadata.

I see with your [2] it should still be in place the origin_metadata functionality.

Did you find the code that prepares the metadata? (I found the revision[3], but I can't find extrinsic..)

[3] https://forge.softwareheritage.org/source/swh-deposit/browse/master/swh/deposit/api/private/deposit_read.py

Did you find the code that prepares the metadata? (I found the revision[3], but I can't find extrinsic..)

That part is done loader side [1]

As a side note, that separation of code between the deposit_read api and the
deposit loader code is becoming quite annoying. It will eventually go away.
That is only building the revision loader side. While the deposit_read api
continues to send the necessary data for the loader to build the revision.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/deposit/loader.py$83-96

You might want to use [1] [2] links instead of the browse url to make explicit
the fields you want to show.

That avoids to click around to find the metadata fields ;)

[1] https://archive.softwareheritage.org/api/1/revision/c33910a29d53f4e137c225b21a8d59e43327cbf9/?fields=metadata

[2] https://archive.softwareheritage.org/api/1/revision/86d194239d23027db374e0f8b977289a7e13f9ca/?fields=metadata

If I understand correctly this is the concatenation of two different models:

  • one created for the deposit
  • one created for PyPi loader

I'm trying to figure out if it's logical to just delete extrinsic with the new model, or delete the rest :-)

ardumont added a comment.EditedApr 22 2020, 4:47 PM

Back to the subject at end, if we do as you propose, we are diverging from the other package loaders (T2017).

I might be more inclined to keep the extrinsic field and remove the other duplicated fields at the top of the revision.

That is the revision metadata field becomes:

"extrinsic": {
    "provider": "https://deposit.softwareheritage.org/1/private/551/meta/",
    "raw": {
        "branch_name": "master",
        "origin": {
            "type": "deposit",
            "url": "https://www.ipol.im/87afca27-ebbb-408a-845d-5277b6a92f49"
        },
        "origin_metadata": {
            "metadata": {
                "@xmlns": "http://www.w3.org/2005/Atom",
                "@xmlns:codemeta": "https://doi.org/10.5063/SCHEMA/CODEMETA-2.0",
                "author": "Jose-Luis Lisani",
                "codemeta:applicationCategory": "Image Processing On Line (IPOL)",
                "codemeta:dateCreated": "2018-11-14",
                "codemeta:datePublished": "2018-12-07",
                "codemeta:description": "Implementation of Shape Preserving Local Histogram Modification Algorithm",
                "codemeta:identifier": "https://doi.org/10.5201/ipol.2018.236",
                "codemeta:keywords": [
                    "contrast enhancement",
                    "histogram modification",
                    "local processing"
                ],
                "codemeta:license": {
                    "codemeta:name": "LGPL-3.0-or-later",
                    "codemeta:url": "https://spdx.org/licenses/LGPL-3.0-or-later.html"
                },
                "codemeta:operatingSystem": "Linux",
                "codemeta:programmingLanguage": "C++",
                "codemeta:referencePublication": {
                    "codemeta:identifier": "236",
                    "codemeta:name": "An Analysis and Implementation of the Shape Preserving\nLocal Histogram Modification Algorithm"
                },
                "codemeta:releaseNotes": "On-line demo available at IPOL webpage",
                "codemeta:url": "https://www.ipol.im/pub/art/2018/236/",
                "external_identifier": "236",
                "title": "mlheIPOL"
            },
            "provider": {
                "metadata": {},
                "provider_name": "",
                "provider_type": "deposit_client",
                "provider_url": "https://www.ipol.im/"
            },
            "tool": {
                "configuration": {
                    "sword_version": 2
                },
                "name": "swh-deposit",
                "version": "0.0.1"
            }
        }
    },
    "when": "2020-04-20T23:16:11.436881+00:00"
},
"original_artifact": [
    {
        "checksums": {
            "sha1": "cdce42697925513b15824f69bd461ec6f3bc9571",
            "sha256": "2a261e6b673f09f60185f6fbf874a828d3b5aeee946d38dd2f2bc4d899db0b1a"
        },
        "filename": "archive.zip",
        "length": 661595
    }
],

Good !
seems we are converging on this.

I'm trying to figure out if it's logical to just delete extrinsic with the new model, or delete the rest :-)

I might be more inclined to keep the extrinsic field and remove the other duplicated fields at the top of the revision.

Let's follow your proposition, if this doesn't causes other problems down the line with dates and other metadata usage.

@moranegg D3045 for your reviewing pleasure ;D

:-D
I'm on it..

As a side note, that separation of code between the deposit_read api and the
deposit loader code is becoming quite annoying. It will eventually go away.
That is only building the revision loader side. While the deposit_read api
continues to send the necessary data for the loader to build the revision.

For this, to remove some more redundancy and unnecessary indirections, see [1] [2]

[1] D3047 (deposit)

[2] D3048 (loader-deposit)

loader-core v0.0.92 tag with the landed change.

ardumont closed this task as Resolved.Apr 23 2020, 4:31 PM
ardumont claimed this task.

Deployed.