Page MenuHomeSoftware Heritage

Add missing fields to the Content object
ClosedPublic

Authored by jayeshv on Aug 12 2022, 12:24 PM.

Details

Summary

Get a content object either by SWHID or by hashes
Return file data as new node, now supports raw data as a binary text
Add dummy fields for file type, language and license

Related to T4310

Diff Detail

Repository
rDGQL GraphQL API
Branch
content-fixes
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30827
Build 48200: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 48199: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8239 (id=29718)

Rebasing onto 3a1be9a212...

First, rewinding head to replay your work on top of it...
Applying: Add missing fields to the Content object
Changes applied before test
commit b03263b1ff5bd855cef1dd6d8353de86d729d241
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Return file data as new node, now supports raw data as a binary text
    Add fields for file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/94/ for more details.

vlorentz retitled this revision from WIP Add missing fields to the Content object to [WIP] Add missing fields to the Content object.Aug 12 2022, 12:50 PM

Add language and license fields

Build is green

Patch application report for D8239 (id=29720)

Rebasing onto 3a1be9a212...

First, rewinding head to replay your work on top of it...
Applying: Add missing fields to the Content object
Changes applied before test
commit 3032b139f20a438465f100c566f4392c657bd000
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Return file data as new node, now supports raw data as a binary text
    Add fields for file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/95/ for more details.

Get content by hash_type:hash_value string

Build is green

Patch application report for D8239 (id=29723)

Rebasing onto 3a1be9a212...

First, rewinding head to replay your work on top of it...
Applying: Add missing fields to the Content object
Changes applied before test
commit cbf2016a065b3b1a069adf1cf36862b6427dceb9
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Return file data as new node, now supports raw data as a binary text
    Add fields for file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/96/ for more details.

Build has FAILED

Patch application report for D8239 (id=29769)

Rebasing onto 84484e27c1...

Current branch diff-target is up to date.
Changes applied before test
commit 6622176076f7efd1961056f725cce01170a23e19
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

Link to build: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/100/
See console output for more information: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/100/console

Build is green

Patch application report for D8239 (id=29770)

Rebasing onto 84484e27c1...

Current branch diff-target is up to date.
Changes applied before test
commit 1c9d80be14bdabb2e1567632d9f7ec65a3bee922
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/101/ for more details.

Build is green

Patch application report for D8239 (id=29775)

Rebasing onto 84484e27c1...

Current branch diff-target is up to date.
Changes applied before test
commit bd93ef7c83e001d1c2d600d064cb80d49a8ea6f6
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/102/ for more details.

jayeshv retitled this revision from [WIP] Add missing fields to the Content object to Add missing fields to the Content object.Aug 17 2022, 11:21 AM
jayeshv edited the summary of this revision. (Show Details)

Add missing property decorator

Build is green

Patch application report for D8239 (id=29776)

Rebasing onto 84484e27c1...

Current branch diff-target is up to date.
Changes applied before test
commit 2dfc5430f99273232c70b8e6a627c3cdeb43fc16
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/103/ for more details.

Build is green

Patch application report for D8239 (id=29783)

Rebasing onto a9334a0e57...

Current branch diff-target is up to date.
Changes applied before test
commit f1a110703955a7624f588d4030b27f2493b52b3c
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/104/ for more details.

anlambert added inline comments.
swh/graphql/resolvers/content.py
34–39

I think it will be safer to reference content data by an URL here instead of raw bytes, you could use this Web API endpoint for instance.

Fetching content data could put some pressure on our object storage depending on the input GraphQL query and responses could also be very large depending on content sizes,
so better offering links here.

42

s/Conetent/Content/

swh/graphql/resolvers/content.py
34–39

I agree.
Do you think adding a link to https://archive.softwareheritage.org/api/1/content/<swhid>/raw/ will do?
This will require me to add a config entry for the archive URL.

Or is there a way to have a static URL instead, like an S3 one?

swh/graphql/resolvers/content.py
34–39

Do you think adding a link to https://archive.softwareheritage.org/api/1/content/<swhid>/raw/ will do?

I think so, this endpoint does not use SWHID by the way as they were not yet in place when it was implemented.
One advantage of using it is that there is rate limit in place to avoid anonymous users to have unlimited access
to raw archive contents.

This will require me to add a config entry for the archive URL.

For the record, you can define default config values using Python code (see swh-web config.py).
It enables to avoid re-declaring redundant config values in every config files.

Or is there a way to have a static URL instead, like an S3 one?

Not that I know.

swh/graphql/resolvers/content.py
34–39

Do you think adding a link to https://archive.softwareheritage.org/api/1/content/<swhid>/raw/ will do?

I think so, this endpoint does not use SWHID by the way as they were not yet in place when it was implemented.
One advantage of using it is that there is rate limit in place to avoid anonymous users to have unlimited access
to raw archive contents.

Ok. Here also data is served by an application server (Django in this case). So, too many requests could bring that too down.

About rate limiting, a basic rate limiter is in place for GraphQL and I'm planning to improve that soon.
Once that is fully implemented, it will be possible to have usage credits at the field level. (say accessing data field could cost a user 10 credits, where as accessing a string field could cost 1).

Add download URL for a content

Build is green

Patch application report for D8239 (id=29794)

Rebasing onto a9334a0e57...

Current branch diff-target is up to date.
Changes applied before test
commit f0e40e3d5b56ebe84b29237c225d6a65dfc8f51e
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/105/ for more details.

swh/graphql/resolvers/content.py
34–39

@anlambert For the time being I have added both the raw binary data and a URL to get file data.
It will cost a lot of credits to request 'raw' field and the per query max limit will restrict users from retrieving too much data in one request.
User level rate limiting across sessions will be in place after we make this system stateful.

I have added the API URL in the code instead of a config now. I will move that to a config in another diff that improves the config.

Build is green

Patch application report for D8239 (id=29797)

Rebasing onto a9334a0e57...

Current branch diff-target is up to date.
Changes applied before test
commit 4d824dc3636307a6bf1bd01aa49397ef914e4890
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now supports raw data as a binary text
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/106/ for more details.

I could accept the diff but I think another team member should also review it and gives his opinion about the raw content data retrieval and integration in GraphQL responses.

swh/graphql/resolvers/content.py
34–39

I am still thinking that raw content data should not be included in GraphQL responses as it remains a flaw to easily DDoS us with a single request
(if a user requests a lot of huge contents for instance).

In Web API responses (being REST or GraphQL) we should only return metadata about the archived objects, getting raw content data should be done
separately with a dedicated HTTP request.

Anyway, do as you want but my guess is that we will not keep that feature in the future.

It will cost a lot of credits to request 'raw' field and the per query max limit will restrict users from retrieving too much data in one request.

Ideally the credits for getting a raw content should be proportional to its size in bytes.

51

s/Conetent/Content/

56

s/Conetent/Content/

I could accept the diff but I think another team member should also review it and gives his opinion about the raw content data retrieval and integration in GraphQL responses.

Ok. I think this needs further discussion.
Github let clients get file content as strings. eg query (https://forge.softwareheritage.org/P1430), but this is behind user authentication.

One possible solution is to make a raw content cost 1/2 of the possible max query cost. This will limit the client access to at most one raw string per request.
Calculating query cost dynamically could be a bit hard to do, but it is possible to return 'null' (along with an exception) if the requested content is too big.

What if we simply don't include this field for now, and add it later if needed?

This was we can take time to design it later when we have a use case; instead of designing it now in a way that cannot be changed without breaking backward compatibility

What if we simply don't include this field for now, and add it later if needed?

This was we can take time to design it later when we have a use case; instead of designing it now in a way that cannot be changed without breaking backward compatibility

Sounds like a plan, I will remove the field for now. Will add that after properly implementing the query cost calculator.

Remove raw field from content

Build is green

Patch application report for D8239 (id=29805)

Rebasing onto a9334a0e57...

Current branch diff-target is up to date.
Changes applied before test
commit 7d2c51b04d35388679e2eae49454b5ca76a17ada
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now only returns a URL to download data
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/107/ for more details.

Looks good to me.

swh/graphql/tests/functional/test_content.py
13

s/contnet/content/

This revision is now accepted and ready to land.Aug 18 2022, 2:10 PM

Build is green

Patch application report for D8239 (id=29806)

Rebasing onto a9334a0e57...

Current branch diff-target is up to date.
Changes applied before test
commit e68700535510c899d0012f623d3d2af657dcb686
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Wed Aug 3 10:35:28 2022 +0200

    Add missing fields to the Content object
    
    Content can be requested either with a SWHID or with a list of checksums.
    Return file data as new node, now only returns a URL to download data
    Add dummy fields for content file type, language and license
    
    Related to T4310

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/108/ for more details.

This revision was automatically updated to reflect the committed changes.