Page MenuHomeSoftware Heritage
Paste P154

example: filtering origin on latest top file name matching pom.xml
ActivePublic

Authored by ardumont on Apr 18 2017, 3:39 PM.
select o.url, convert_from(def.name, 'utf-8')
from origin o
inner join occurrence occ on (occ.origin=o.id and occ.branch='refs/heads/master')
inner join revision rev on (occ.target_type='revision' and occ.target=rev.id)
inner join directory dir on rev.directory=dir.id
inner join directory_entry_file def on def.id = any(dir.file_entries)
where o.type='git' and def.name='pom.xml';

Event Timeline

Using ardumont's query with:

select o.url, convert_from(def.name, 'utf-8') , def.sha1

I want to link an origin to the file containing metadata,
should I use

def.data_url or def.sha1?

I want to link an origin to the file containing metadata, should I use def.data_url or def.sha1?

directory_entry_file only presents the sha1_git.
You want to use the sha1 to retrieve the raw content from the objstorage, so you'd need another join on content on the sha1_git column.

Adapting the initial query as such:

select o.url, convert_from(def.name, 'utf-8') , c.sha1
  from origin o 
    inner join occurrence occ on (occ.origin=o.id and occ.branch='refs/heads/master')
    inner join revision rev on (occ.target_type='revision' and occ.target=rev.id)
    inner join directory dir on rev.directory=dir.id
    inner join directory_entry_file def on def.id = any(dir.file_entries) 
    inner join content c on c.sha1_git=def.target
  where o.type='git' and def.name='pom.xml' 
  limit 10;

(limit is just for testing purpose)

Adapting to add the filename parameter to a function, that would look like:

create type origin_content_signature as (
    url       text,
    name   text,
    sha1    sha1
);

create or replace function swh_filter_content_pattern(filename bytea)
    returns setof origin_content_signature
    language sql
as $$
  select o.url, convert_from(def.name, 'utf-8') , c.sha1
  from origin o 
    inner join occurrence occ on (occ.origin=o.id and occ.branch='refs/heads/master')
    inner join revision rev on (occ.target_type='revision' and occ.target=rev.id)
    inner join directory dir on rev.directory=dir.id
    inner join directory_entry_file def on def.id = any(dir.file_entries) 
    inner join content c on c.sha1_git=def.target
  where o.type='git' and def.name=filename;
$$;
WHERE def.name IN (list of filenames matching metadata pattern)

List of filenames:

contextfilename
java- Mavenpom.xml
OctaveDESCRIPTION
ruby gems.gemspec or Rakefile
Javascript npmpackage.json, [ AUTHORS, README, CHANGES, LICENSE & NOTICE] files
Perl CPAN::METAMETA.json, META.yml, .spec
Dartpubspec.yaml
Debian packagedebian/upstream/metadata
puppetmetadata.json
PyPIsetup.py
Scientific softwareCITATION
CodeMetaCODE, code.json

other files to add to list

Java gradle'gradle.properties'
Jekyll'_config.yml'