diff --git a/.gitignore b/.gitignore deleted file mode 100644 index 33f4c0b..0000000 --- a/.gitignore +++ /dev/null @@ -1,5 +0,0 @@ -__pycache__ -/.coverage -*egg-info/ -version.txt -analysis/ diff --git a/AUTHORS b/AUTHORS deleted file mode 100644 index 2d0a34a..0000000 --- a/AUTHORS +++ /dev/null @@ -1,3 +0,0 @@ -Copyright (C) 2015 The Software Heritage developers - -See http://www.softwareheritage.org/ for more information. diff --git a/LICENSE b/LICENSE deleted file mode 100644 index 94a9ed0..0000000 --- a/LICENSE +++ /dev/null @@ -1,674 +0,0 @@ - GNU GENERAL PUBLIC LICENSE - Version 3, 29 June 2007 - - Copyright (C) 2007 Free Software Foundation, Inc. - Everyone is permitted to copy and distribute verbatim copies - of this license document, but changing it is not allowed. - - Preamble - - The GNU General Public License is a free, copyleft license for -software and other kinds of works. - - The licenses for most software and other practical works are designed -to take away your freedom to share and change the works. By contrast, -the GNU General Public License is intended to guarantee your freedom to -share and change all versions of a program--to make sure it remains free -software for all its users. We, the Free Software Foundation, use the -GNU General Public License for most of our software; it applies also to -any other work released this way by its authors. You can apply it to -your programs, too. - - When we speak of free software, we are referring to freedom, not -price. Our General Public Licenses are designed to make sure that you -have the freedom to distribute copies of free software (and charge for -them if you wish), that you receive source code or can get it if you -want it, that you can change the software or use pieces of it in new -free programs, and that you know you can do these things. - - To protect your rights, we need to prevent others from denying you -these rights or asking you to surrender the rights. 
Therefore, you have -certain responsibilities if you distribute copies of the software, or if -you modify it: responsibilities to respect the freedom of others. - - For example, if you distribute copies of such a program, whether -gratis or for a fee, you must pass on to the recipients the same -freedoms that you received. You must make sure that they, too, receive -or can get the source code. And you must show them these terms so they -know their rights. - - Developers that use the GNU GPL protect your rights with two steps: -(1) assert copyright on the software, and (2) offer you this License -giving you legal permission to copy, distribute and/or modify it. - - For the developers' and authors' protection, the GPL clearly explains -that there is no warranty for this free software. For both users' and -authors' sake, the GPL requires that modified versions be marked as -changed, so that their problems will not be attributed erroneously to -authors of previous versions. - - Some devices are designed to deny users access to install or run -modified versions of the software inside them, although the manufacturer -can do so. This is fundamentally incompatible with the aim of -protecting users' freedom to change the software. The systematic -pattern of such abuse occurs in the area of products for individuals to -use, which is precisely where it is most unacceptable. Therefore, we -have designed this version of the GPL to prohibit the practice for those -products. If such problems arise substantially in other domains, we -stand ready to extend this provision to those domains in future versions -of the GPL, as needed to protect the freedom of users. - - Finally, every program is threatened constantly by software patents. -States should not allow patents to restrict development and use of -software on general-purpose computers, but in those that do, we wish to -avoid the special danger that patents applied to a free program could -make it effectively proprietary. 
To prevent this, the GPL assures that -patents cannot be used to render the program non-free. - - The precise terms and conditions for copying, distribution and -modification follow. - - TERMS AND CONDITIONS - - 0. Definitions. - - "This License" refers to version 3 of the GNU General Public License. - - "Copyright" also means copyright-like laws that apply to other kinds of -works, such as semiconductor masks. - - "The Program" refers to any copyrightable work licensed under this -License. Each licensee is addressed as "you". "Licensees" and -"recipients" may be individuals or organizations. - - To "modify" a work means to copy from or adapt all or part of the work -in a fashion requiring copyright permission, other than the making of an -exact copy. The resulting work is called a "modified version" of the -earlier work or a work "based on" the earlier work. - - A "covered work" means either the unmodified Program or a work based -on the Program. - - To "propagate" a work means to do anything with it that, without -permission, would make you directly or secondarily liable for -infringement under applicable copyright law, except executing it on a -computer or modifying a private copy. Propagation includes copying, -distribution (with or without modification), making available to the -public, and in some countries other activities as well. - - To "convey" a work means any kind of propagation that enables other -parties to make or receive copies. Mere interaction with a user through -a computer network, with no transfer of a copy, is not conveying. - - An interactive user interface displays "Appropriate Legal Notices" -to the extent that it includes a convenient and prominently visible -feature that (1) displays an appropriate copyright notice, and (2) -tells the user that there is no warranty for the work (except to the -extent that warranties are provided), that licensees may convey the -work under this License, and how to view a copy of this License. 
If -the interface presents a list of user commands or options, such as a -menu, a prominent item in the list meets this criterion. - - 1. Source Code. - - The "source code" for a work means the preferred form of the work -for making modifications to it. "Object code" means any non-source -form of a work. - - A "Standard Interface" means an interface that either is an official -standard defined by a recognized standards body, or, in the case of -interfaces specified for a particular programming language, one that -is widely used among developers working in that language. - - The "System Libraries" of an executable work include anything, other -than the work as a whole, that (a) is included in the normal form of -packaging a Major Component, but which is not part of that Major -Component, and (b) serves only to enable use of the work with that -Major Component, or to implement a Standard Interface for which an -implementation is available to the public in source code form. A -"Major Component", in this context, means a major essential component -(kernel, window system, and so on) of the specific operating system -(if any) on which the executable work runs, or a compiler used to -produce the work, or an object code interpreter used to run it. - - The "Corresponding Source" for a work in object code form means all -the source code needed to generate, install, and (for an executable -work) run the object code and to modify the work, including scripts to -control those activities. However, it does not include the work's -System Libraries, or general-purpose tools or generally available free -programs which are used unmodified in performing those activities but -which are not part of the work. 
For example, Corresponding Source -includes interface definition files associated with source files for -the work, and the source code for shared libraries and dynamically -linked subprograms that the work is specifically designed to require, -such as by intimate data communication or control flow between those -subprograms and other parts of the work. - - The Corresponding Source need not include anything that users -can regenerate automatically from other parts of the Corresponding -Source. - - The Corresponding Source for a work in source code form is that -same work. - - 2. Basic Permissions. - - All rights granted under this License are granted for the term of -copyright on the Program, and are irrevocable provided the stated -conditions are met. This License explicitly affirms your unlimited -permission to run the unmodified Program. The output from running a -covered work is covered by this License only if the output, given its -content, constitutes a covered work. This License acknowledges your -rights of fair use or other equivalent, as provided by copyright law. - - You may make, run and propagate covered works that you do not -convey, without conditions so long as your license otherwise remains -in force. You may convey covered works to others for the sole purpose -of having them make modifications exclusively for you, or provide you -with facilities for running those works, provided that you comply with -the terms of this License in conveying all material for which you do -not control copyright. Those thus making or running the covered works -for you must do so exclusively on your behalf, under your direction -and control, on terms that prohibit them from making any copies of -your copyrighted material outside their relationship with you. - - Conveying under any other circumstances is permitted solely under -the conditions stated below. Sublicensing is not allowed; section 10 -makes it unnecessary. - - 3. 
Protecting Users' Legal Rights From Anti-Circumvention Law. - - No covered work shall be deemed part of an effective technological -measure under any applicable law fulfilling obligations under article -11 of the WIPO copyright treaty adopted on 20 December 1996, or -similar laws prohibiting or restricting circumvention of such -measures. - - When you convey a covered work, you waive any legal power to forbid -circumvention of technological measures to the extent such circumvention -is effected by exercising rights under this License with respect to -the covered work, and you disclaim any intention to limit operation or -modification of the work as a means of enforcing, against the work's -users, your or third parties' legal rights to forbid circumvention of -technological measures. - - 4. Conveying Verbatim Copies. - - You may convey verbatim copies of the Program's source code as you -receive it, in any medium, provided that you conspicuously and -appropriately publish on each copy an appropriate copyright notice; -keep intact all notices stating that this License and any -non-permissive terms added in accord with section 7 apply to the code; -keep intact all notices of the absence of any warranty; and give all -recipients a copy of this License along with the Program. - - You may charge any price or no price for each copy that you convey, -and you may offer support or warranty protection for a fee. - - 5. Conveying Modified Source Versions. - - You may convey a work based on the Program, or the modifications to -produce it from the Program, in the form of source code under the -terms of section 4, provided that you also meet all of these conditions: - - a) The work must carry prominent notices stating that you modified - it, and giving a relevant date. - - b) The work must carry prominent notices stating that it is - released under this License and any conditions added under section - 7. 
This requirement modifies the requirement in section 4 to - "keep intact all notices". - - c) You must license the entire work, as a whole, under this - License to anyone who comes into possession of a copy. This - License will therefore apply, along with any applicable section 7 - additional terms, to the whole of the work, and all its parts, - regardless of how they are packaged. This License gives no - permission to license the work in any other way, but it does not - invalidate such permission if you have separately received it. - - d) If the work has interactive user interfaces, each must display - Appropriate Legal Notices; however, if the Program has interactive - interfaces that do not display Appropriate Legal Notices, your - work need not make them do so. - - A compilation of a covered work with other separate and independent -works, which are not by their nature extensions of the covered work, -and which are not combined with it such as to form a larger program, -in or on a volume of a storage or distribution medium, is called an -"aggregate" if the compilation and its resulting copyright are not -used to limit the access or legal rights of the compilation's users -beyond what the individual works permit. Inclusion of a covered work -in an aggregate does not cause this License to apply to the other -parts of the aggregate. - - 6. Conveying Non-Source Forms. - - You may convey a covered work in object code form under the terms -of sections 4 and 5, provided that you also convey the -machine-readable Corresponding Source under the terms of this License, -in one of these ways: - - a) Convey the object code in, or embodied in, a physical product - (including a physical distribution medium), accompanied by the - Corresponding Source fixed on a durable physical medium - customarily used for software interchange. 
- - b) Convey the object code in, or embodied in, a physical product - (including a physical distribution medium), accompanied by a - written offer, valid for at least three years and valid for as - long as you offer spare parts or customer support for that product - model, to give anyone who possesses the object code either (1) a - copy of the Corresponding Source for all the software in the - product that is covered by this License, on a durable physical - medium customarily used for software interchange, for a price no - more than your reasonable cost of physically performing this - conveying of source, or (2) access to copy the - Corresponding Source from a network server at no charge. - - c) Convey individual copies of the object code with a copy of the - written offer to provide the Corresponding Source. This - alternative is allowed only occasionally and noncommercially, and - only if you received the object code with such an offer, in accord - with subsection 6b. - - d) Convey the object code by offering access from a designated - place (gratis or for a charge), and offer equivalent access to the - Corresponding Source in the same way through the same place at no - further charge. You need not require recipients to copy the - Corresponding Source along with the object code. If the place to - copy the object code is a network server, the Corresponding Source - may be on a different server (operated by you or a third party) - that supports equivalent copying facilities, provided you maintain - clear directions next to the object code saying where to find the - Corresponding Source. Regardless of what server hosts the - Corresponding Source, you remain obligated to ensure that it is - available for as long as needed to satisfy these requirements. 
- - e) Convey the object code using peer-to-peer transmission, provided - you inform other peers where the object code and Corresponding - Source of the work are being offered to the general public at no - charge under subsection 6d. - - A separable portion of the object code, whose source code is excluded -from the Corresponding Source as a System Library, need not be -included in conveying the object code work. - - A "User Product" is either (1) a "consumer product", which means any -tangible personal property which is normally used for personal, family, -or household purposes, or (2) anything designed or sold for incorporation -into a dwelling. In determining whether a product is a consumer product, -doubtful cases shall be resolved in favor of coverage. For a particular -product received by a particular user, "normally used" refers to a -typical or common use of that class of product, regardless of the status -of the particular user or of the way in which the particular user -actually uses, or expects or is expected to use, the product. A product -is a consumer product regardless of whether the product has substantial -commercial, industrial or non-consumer uses, unless such uses represent -the only significant mode of use of the product. - - "Installation Information" for a User Product means any methods, -procedures, authorization keys, or other information required to install -and execute modified versions of a covered work in that User Product from -a modified version of its Corresponding Source. The information must -suffice to ensure that the continued functioning of the modified object -code is in no case prevented or interfered with solely because -modification has been made. 
- - If you convey an object code work under this section in, or with, or -specifically for use in, a User Product, and the conveying occurs as -part of a transaction in which the right of possession and use of the -User Product is transferred to the recipient in perpetuity or for a -fixed term (regardless of how the transaction is characterized), the -Corresponding Source conveyed under this section must be accompanied -by the Installation Information. But this requirement does not apply -if neither you nor any third party retains the ability to install -modified object code on the User Product (for example, the work has -been installed in ROM). - - The requirement to provide Installation Information does not include a -requirement to continue to provide support service, warranty, or updates -for a work that has been modified or installed by the recipient, or for -the User Product in which it has been modified or installed. Access to a -network may be denied when the modification itself materially and -adversely affects the operation of the network or violates the rules and -protocols for communication across the network. - - Corresponding Source conveyed, and Installation Information provided, -in accord with this section must be in a format that is publicly -documented (and with an implementation available to the public in -source code form), and must require no special password or key for -unpacking, reading or copying. - - 7. Additional Terms. - - "Additional permissions" are terms that supplement the terms of this -License by making exceptions from one or more of its conditions. -Additional permissions that are applicable to the entire Program shall -be treated as though they were included in this License, to the extent -that they are valid under applicable law. 
If additional permissions -apply only to part of the Program, that part may be used separately -under those permissions, but the entire Program remains governed by -this License without regard to the additional permissions. - - When you convey a copy of a covered work, you may at your option -remove any additional permissions from that copy, or from any part of -it. (Additional permissions may be written to require their own -removal in certain cases when you modify the work.) You may place -additional permissions on material, added by you to a covered work, -for which you have or can give appropriate copyright permission. - - Notwithstanding any other provision of this License, for material you -add to a covered work, you may (if authorized by the copyright holders of -that material) supplement the terms of this License with terms: - - a) Disclaiming warranty or limiting liability differently from the - terms of sections 15 and 16 of this License; or - - b) Requiring preservation of specified reasonable legal notices or - author attributions in that material or in the Appropriate Legal - Notices displayed by works containing it; or - - c) Prohibiting misrepresentation of the origin of that material, or - requiring that modified versions of such material be marked in - reasonable ways as different from the original version; or - - d) Limiting the use for publicity purposes of names of licensors or - authors of the material; or - - e) Declining to grant rights under trademark law for use of some - trade names, trademarks, or service marks; or - - f) Requiring indemnification of licensors and authors of that - material by anyone who conveys the material (or modified versions of - it) with contractual assumptions of liability to the recipient, for - any liability that these contractual assumptions directly impose on - those licensors and authors. - - All other non-permissive additional terms are considered "further -restrictions" within the meaning of section 10. 
If the Program as you -received it, or any part of it, contains a notice stating that it is -governed by this License along with a term that is a further -restriction, you may remove that term. If a license document contains -a further restriction but permits relicensing or conveying under this -License, you may add to a covered work material governed by the terms -of that license document, provided that the further restriction does -not survive such relicensing or conveying. - - If you add terms to a covered work in accord with this section, you -must place, in the relevant source files, a statement of the -additional terms that apply to those files, or a notice indicating -where to find the applicable terms. - - Additional terms, permissive or non-permissive, may be stated in the -form of a separately written license, or stated as exceptions; -the above requirements apply either way. - - 8. Termination. - - You may not propagate or modify a covered work except as expressly -provided under this License. Any attempt otherwise to propagate or -modify it is void, and will automatically terminate your rights under -this License (including any patent licenses granted under the third -paragraph of section 11). - - However, if you cease all violation of this License, then your -license from a particular copyright holder is reinstated (a) -provisionally, unless and until the copyright holder explicitly and -finally terminates your license, and (b) permanently, if the copyright -holder fails to notify you of the violation by some reasonable means -prior to 60 days after the cessation. - - Moreover, your license from a particular copyright holder is -reinstated permanently if the copyright holder notifies you of the -violation by some reasonable means, this is the first time you have -received notice of violation of this License (for any work) from that -copyright holder, and you cure the violation prior to 30 days after -your receipt of the notice. 
- - Termination of your rights under this section does not terminate the -licenses of parties who have received copies or rights from you under -this License. If your rights have been terminated and not permanently -reinstated, you do not qualify to receive new licenses for the same -material under section 10. - - 9. Acceptance Not Required for Having Copies. - - You are not required to accept this License in order to receive or -run a copy of the Program. Ancillary propagation of a covered work -occurring solely as a consequence of using peer-to-peer transmission -to receive a copy likewise does not require acceptance. However, -nothing other than this License grants you permission to propagate or -modify any covered work. These actions infringe copyright if you do -not accept this License. Therefore, by modifying or propagating a -covered work, you indicate your acceptance of this License to do so. - - 10. Automatic Licensing of Downstream Recipients. - - Each time you convey a covered work, the recipient automatically -receives a license from the original licensors, to run, modify and -propagate that work, subject to this License. You are not responsible -for enforcing compliance by third parties with this License. - - An "entity transaction" is a transaction transferring control of an -organization, or substantially all assets of one, or subdividing an -organization, or merging organizations. If propagation of a covered -work results from an entity transaction, each party to that -transaction who receives a copy of the work also receives whatever -licenses to the work the party's predecessor in interest had or could -give under the previous paragraph, plus a right to possession of the -Corresponding Source of the work from the predecessor in interest, if -the predecessor has it or can get it with reasonable efforts. - - You may not impose any further restrictions on the exercise of the -rights granted or affirmed under this License. 
For example, you may -not impose a license fee, royalty, or other charge for exercise of -rights granted under this License, and you may not initiate litigation -(including a cross-claim or counterclaim in a lawsuit) alleging that -any patent claim is infringed by making, using, selling, offering for -sale, or importing the Program or any portion of it. - - 11. Patents. - - A "contributor" is a copyright holder who authorizes use under this -License of the Program or a work on which the Program is based. The -work thus licensed is called the contributor's "contributor version". - - A contributor's "essential patent claims" are all patent claims -owned or controlled by the contributor, whether already acquired or -hereafter acquired, that would be infringed by some manner, permitted -by this License, of making, using, or selling its contributor version, -but do not include claims that would be infringed only as a -consequence of further modification of the contributor version. For -purposes of this definition, "control" includes the right to grant -patent sublicenses in a manner consistent with the requirements of -this License. - - Each contributor grants you a non-exclusive, worldwide, royalty-free -patent license under the contributor's essential patent claims, to -make, use, sell, offer for sale, import and otherwise run, modify and -propagate the contents of its contributor version. - - In the following three paragraphs, a "patent license" is any express -agreement or commitment, however denominated, not to enforce a patent -(such as an express permission to practice a patent or covenant not to -sue for patent infringement). To "grant" such a patent license to a -party means to make such an agreement or commitment not to enforce a -patent against the party. 
- - If you convey a covered work, knowingly relying on a patent license, -and the Corresponding Source of the work is not available for anyone -to copy, free of charge and under the terms of this License, through a -publicly available network server or other readily accessible means, -then you must either (1) cause the Corresponding Source to be so -available, or (2) arrange to deprive yourself of the benefit of the -patent license for this particular work, or (3) arrange, in a manner -consistent with the requirements of this License, to extend the patent -license to downstream recipients. "Knowingly relying" means you have -actual knowledge that, but for the patent license, your conveying the -covered work in a country, or your recipient's use of the covered work -in a country, would infringe one or more identifiable patents in that -country that you have reason to believe are valid. - - If, pursuant to or in connection with a single transaction or -arrangement, you convey, or propagate by procuring conveyance of, a -covered work, and grant a patent license to some of the parties -receiving the covered work authorizing them to use, propagate, modify -or convey a specific copy of the covered work, then the patent license -you grant is automatically extended to all recipients of the covered -work and works based on it. - - A patent license is "discriminatory" if it does not include within -the scope of its coverage, prohibits the exercise of, or is -conditioned on the non-exercise of one or more of the rights that are -specifically granted under this License. 
You may not convey a covered -work if you are a party to an arrangement with a third party that is -in the business of distributing software, under which you make payment -to the third party based on the extent of your activity of conveying -the work, and under which the third party grants, to any of the -parties who would receive the covered work from you, a discriminatory -patent license (a) in connection with copies of the covered work -conveyed by you (or copies made from those copies), or (b) primarily -for and in connection with specific products or compilations that -contain the covered work, unless you entered into that arrangement, -or that patent license was granted, prior to 28 March 2007. - - Nothing in this License shall be construed as excluding or limiting -any implied license or other defenses to infringement that may -otherwise be available to you under applicable patent law. - - 12. No Surrender of Others' Freedom. - - If conditions are imposed on you (whether by court order, agreement or -otherwise) that contradict the conditions of this License, they do not -excuse you from the conditions of this License. If you cannot convey a -covered work so as to satisfy simultaneously your obligations under this -License and any other pertinent obligations, then as a consequence you may -not convey it at all. For example, if you agree to terms that obligate you -to collect a royalty for further conveying from those to whom you convey -the Program, the only way you could satisfy both those terms and this -License would be to refrain entirely from conveying the Program. - - 13. Use with the GNU Affero General Public License. - - Notwithstanding any other provision of this License, you have -permission to link or combine any covered work with a work licensed -under version 3 of the GNU Affero General Public License into a single -combined work, and to convey the resulting work. 
The terms of this -License will continue to apply to the part which is the covered work, -but the special requirements of the GNU Affero General Public License, -section 13, concerning interaction through a network will apply to the -combination as such. - - 14. Revised Versions of this License. - - The Free Software Foundation may publish revised and/or new versions of -the GNU General Public License from time to time. Such new versions will -be similar in spirit to the present version, but may differ in detail to -address new problems or concerns. - - Each version is given a distinguishing version number. If the -Program specifies that a certain numbered version of the GNU General -Public License "or any later version" applies to it, you have the -option of following the terms and conditions either of that numbered -version or of any later version published by the Free Software -Foundation. If the Program does not specify a version number of the -GNU General Public License, you may choose any version ever published -by the Free Software Foundation. - - If the Program specifies that a proxy can decide which future -versions of the GNU General Public License can be used, that proxy's -public statement of acceptance of a version permanently authorizes you -to choose that version for the Program. - - Later license versions may give you additional or different -permissions. However, no additional obligations are imposed on any -author or copyright holder as a result of your choosing to follow a -later version. - - 15. Disclaimer of Warranty. - - THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY -APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT -HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY -OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, -THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -PURPOSE. 
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM -IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF -ALL NECESSARY SERVICING, REPAIR OR CORRECTION. - - 16. Limitation of Liability. - - IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING -WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS -THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY -GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE -USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF -DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD -PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), -EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF -SUCH DAMAGES. - - 17. Interpretation of Sections 15 and 16. - - If the disclaimer of warranty and limitation of liability provided -above cannot be given local legal effect according to their terms, -reviewing courts shall apply local law that most closely approximates -an absolute waiver of all civil liability in connection with the -Program, unless a warranty or assumption of liability accompanies a -copy of the Program in return for a fee. - - END OF TERMS AND CONDITIONS - - How to Apply These Terms to Your New Programs - - If you develop a new program, and you want it to be of the greatest -possible use to the public, the best way to achieve this is to make it -free software which everyone can redistribute and change under these terms. - - To do so, attach the following notices to the program. It is safest -to attach them to the start of each source file to most effectively -state the exclusion of warranty; and each file should have at least -the "copyright" line and a pointer to where the full notice is found. 
- - - Copyright (C) - - This program is free software: you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation, either version 3 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program. If not, see . - -Also add information on how to contact you by electronic and paper mail. - - If the program does terminal interaction, make it output a short -notice like this when it starts in an interactive mode: - - Copyright (C) - This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. - This is free software, and you are welcome to redistribute it - under certain conditions; type `show c' for details. - -The hypothetical commands `show w' and `show c' should show the appropriate -parts of the General Public License. Of course, your program's commands -might be different; for a GUI interface, you would use an "about box". - - You should also get your employer (if you work as a programmer) or school, -if any, to sign a "copyright disclaimer" for the program, if necessary. -For more information on this, and how to apply and follow the GNU GPL, see -. - - The GNU General Public License does not permit incorporating your program -into proprietary programs. If your program is a subroutine library, you -may consider it more useful to permit linking proprietary applications with -the library. If this is what you want to do, use the GNU Lesser General -Public License instead of this License. But first, please read -. 
diff --git a/MANIFEST.in b/MANIFEST.in index e7c46fc..4fe81f9 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -1,4 +1,5 @@ include Makefile include requirements.txt include requirements-swh.txt include version.txt +recursive-include swh/loader/tar/tests/resources * diff --git a/PKG-INFO b/PKG-INFO index 423a1c7..5baaf18 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,10 +1,79 @@ -Metadata-Version: 1.0 +Metadata-Version: 2.1 Name: swh.loader.tar -Version: 0.0.35 +Version: 0.0.38 Summary: Software Heritage Tarball Loader Home-page: https://forge.softwareheritage.org/diffusion/DLDTAR Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN -Description: UNKNOWN +Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest +Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-tar +Description: # SWH Tarball Loader + + The Software Heritage Tarball Loader is in charge of ingesting the + directory representation of the tarball into the Software Heritage + archive. + + ## Configuration + + This is the loader's (or task's) configuration file. 
+ + *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*: + + ```YAML + working_dir: /home/storage/tmp/ + storage: + cls: remote + args: + url: http://localhost:5002/ + ``` + + ## API + + ### local + + Load local tarball directly from code or python3's toplevel: + + ``` Python + # Fill in those + repo = '8sync.tar.gz' + tarpath = '/home/storage/tar/%s' % repo + origin = {'url': 'file://%s' % repo, 'type': 'tar'} + visit_date = 'Tue, 3 May 2017 17:16:32 +0200' + last_modified = 'Tue, 10 May 2016 16:16:32 +0200' + import logging + logging.basicConfig(level=logging.DEBUG) + + from swh.loader.tar.tasks import LoadTarRepository + l = LoadTarRepository() + l.run_task(origin=origin, visit_date=visit_date, + last_modified=last_modified) + ``` + + ### remote + + Load remote tarball is the same sample + + ```Python + url = 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz' + origin = {'url': url, 'type': 'tar'} + visit_date = 'Tue, 3 May 2017 17:16:32 +0200' + last_modified = '2016-04-22 16:35' + import logging + logging.basicConfig(level=logging.DEBUG) + + from swh.loader.tar.tasks import LoadTarRepository + l = LoadTarRepository() + l.run_task(origin=origin, visit_date=visit_date, + last_modified=last_modified) + ``` + Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: Intended Audience :: Developers +Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) +Classifier: Operating System :: OS Independent +Classifier: Development Status :: 5 - Production/Stable +Description-Content-Type: text/markdown +Provides-Extra: testing diff --git a/README b/README deleted file mode 100644 index 4ef40c7..0000000 --- a/README +++ /dev/null @@ -1,141 +0,0 @@ -# SWH Tarball Loader - -The Software Heritage Tarball Loader is a tool and a library to -uncompress a local tarball and inject into the SWH dataset its tree -representation. - -## Configuration - -This is the loader's (or task's) configuration file. 
- -*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*: - -```YAML -extraction_dir: /home/storage/tmp/ -storage: - cls: local - args: - db: service=swh-dev - objstorage: - cls: pathslicing - args: - root: /home/storage/swh-storage - slicing: 0:2/2:4/4:6 - -send_contents: True -send_directories: True -send_revisions: True -send_releases: True -send_occurrences: True - -content_packet_size: 10000 -content_packet_block_size_bytes: 104857600 -content_packet_size_bytes: 1073741824 -directory_packet_size: 25000 -revision_packet_size: 100000 -release_packet_size: 100000 -occurrence_packet_size: 100000 -``` - -## API - -Load tarball directly from code or python3's toplevel: - -``` Python - from swh.loader.tar.tasks import LoadTarRepository - - # Fill in those - tarpath = '/some/path/to/blah-7.8.3.tgz' - origin = {'url': 'some-origin', 'type': 'dir'} - visit_date = 'Tue, 3 May 2017 17:16:32 +0200' - revision = {} - occurrence = {} - - # Send message to the task queue - LoadTarRepository().run((tarpath, origin, visit_date, revision, [occurrence])) -``` - -## Celery - -Load tarball using celery. - -Providing you have a properly configured celery up and running, the -celery worker configuration file needs to be updated: - -*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/worker.yml*: - -``` YAML -task_modules: - - swh.loader.tar.tasks -task_queues: - - swh_loader_tar -``` - -cf. https://forge.softwareheritage.org/diffusion/DCORE/browse/master/README.md -for more details - - -## Tar Producer - -Its job is to compulse from a file or a folder a list of existing -tarballs. From this list, compute the corresponding messages to send -to the broker. 
- -### Configuration - -Message producer's configuration file (`tar.yml`): - -``` YAML -# Mirror's root directory holding tarballs to load into swh -mirror_root_directory: /srv/storage/space/mirrors/gnu.org/gnu/ -# Url scheme prefix used to create the origin url -url_scheme: http://ftp.gnu.org/gnu/ -type: ftp - -# File containing a subset list tarballs from mirror_root_directory to load. -# The file's format is one absolute path name to a tarball per line. -# NOTE: -# - This file must contain data consistent with the mirror_root_directory -# - if this option is not provided, the mirror_root_directory is scanned -# completely as usual -# mirror_subset_archives: /home/storage/missing-archives - -# Randomize blocks of messages and send for consumption -block_messages: 250 -``` - -### Run - -Trigger the message computations: - -```Shell -python3 -m swh.loader.tar.producer --config ~/.swh/producer/tar.yml -``` - -This will walk the `mirror_root_directory` folder and send encountered -tarball messages for the swh-loader-tar to uncompress (through -celery). - -If the `mirror_subset_archives` is provided, the tarball messages will -be computed from such file (the `mirror_root_directory` is still used -so please be consistent). - -If problem arises during tarball message computation, a message will -be outputed with the tarball that present a problem. - -It will displayed the number of tarball messages sent at the end. - -### Dry run - -``` Shell -python3 -m swh.loader.tar.producer --config-file ~/.swh/producer/tar.yml --dry-run -``` - -This will do the same as previously described but only display the -number of potential tarball messages computed. 
- -### Help - -``` Shell -python3 -m swh.loader.tar.producer --help -``` diff --git a/README.md b/README.md new file mode 100644 index 0000000..f43db04 --- /dev/null +++ b/README.md @@ -0,0 +1,59 @@ +# SWH Tarball Loader + +The Software Heritage Tarball Loader is in charge of ingesting the +directory representation of the tarball into the Software Heritage +archive. + +## Configuration + +This is the loader's (or task's) configuration file. + +*`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*: + +```YAML +working_dir: /home/storage/tmp/ +storage: + cls: remote + args: + url: http://localhost:5002/ +``` + +## API + +### local + +Load local tarball directly from code or python3's toplevel: + +``` Python +# Fill in those +repo = '8sync.tar.gz' +tarpath = '/home/storage/tar/%s' % repo +origin = {'url': 'file://%s' % repo, 'type': 'tar'} +visit_date = 'Tue, 3 May 2017 17:16:32 +0200' +last_modified = 'Tue, 10 May 2016 16:16:32 +0200' +import logging +logging.basicConfig(level=logging.DEBUG) + +from swh.loader.tar.tasks import LoadTarRepository +l = LoadTarRepository() +l.run_task(origin=origin, visit_date=visit_date, + last_modified=last_modified) +``` + +### remote + +Load remote tarball is the same sample + +```Python +url = 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz' +origin = {'url': url, 'type': 'tar'} +visit_date = 'Tue, 3 May 2017 17:16:32 +0200' +last_modified = '2016-04-22 16:35' +import logging +logging.basicConfig(level=logging.DEBUG) + +from swh.loader.tar.tasks import LoadTarRepository +l = LoadTarRepository() +l.run_task(origin=origin, visit_date=visit_date, + last_modified=last_modified) +``` diff --git a/docs/.gitignore b/docs/.gitignore deleted file mode 100644 index f6b5c55..0000000 --- a/docs/.gitignore +++ /dev/null @@ -1,4 +0,0 @@ -_build/ -apidoc/ -*-stamp -README.md diff --git a/docs/Makefile b/docs/Makefile deleted file mode 100644 index c491218..0000000 --- a/docs/Makefile +++ /dev/null @@ -1,6 +0,0 @@ -include 
../../swh-docs/Makefile.sphinx - -html: copy_md - -copy_md: - cp ../README README.md diff --git a/docs/_templates/.placeholder b/docs/_templates/.placeholder deleted file mode 100644 index e69de29..0000000 diff --git a/docs/conf.py b/docs/conf.py deleted file mode 100644 index 190deb7..0000000 --- a/docs/conf.py +++ /dev/null @@ -1 +0,0 @@ -from swh.docs.sphinx.conf import * # NoQA diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index d634664..0000000 --- a/docs/index.rst +++ /dev/null @@ -1,17 +0,0 @@ -.. _swh-loader-tar: - -Software Heritage - Development Documentation -============================================= - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - README.md - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/requirements-swh.txt b/requirements-swh.txt index 26ad0c3..65e0788 100644 --- a/requirements-swh.txt +++ b/requirements-swh.txt @@ -1,5 +1,6 @@ -swh.core >= 0.0.36 -swh.model >= 0.0.15 -swh.scheduler >= 0.0.14 +swh.core >= 0.0.46 +swh.model >= 0.0.27 +swh.scheduler >= 0.0.39 swh.storage >= 0.0.83 -swh.loader.dir >= 0.0.32 +swh.loader.core >= 0.0.35 +swh.loader.dir >= 0.0.33 diff --git a/requirements.txt b/requirements.txt index 8df856d..b1331a5 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,8 @@ # Add here external Python modules dependencies, one per line. Module names # should match https://pypi.python.org/pypi names. 
For the full spec or # dependency lines, see https://pip.readthedocs.org/en/1.1/requirements.html +arrow vcversioner -retrying +requests click python-dateutil diff --git a/resources/producer/tar-gnu.yml b/resources/producer/tar-gnu.yml deleted file mode 100644 index 8ac4b49..0000000 --- a/resources/producer/tar-gnu.yml +++ /dev/null @@ -1,22 +0,0 @@ -# Mirror's root directory holding tarballs to load into swh -mirror_root_directory: /srv/softwareheritage/space/mirrors/gnu.org/gnu/ - -# Origin setup's possible scheme url -url_scheme: rsync://ftp.gnu.org/gnu/ - -# Origin type used for tarballs -type: ftp - -# File containing a subset list tarballs from mirror_root_directory to load. -# The file's format is one absolute path name to a tarball per line. -# NOTE: -# - This file must contain data consistent with the mirror_root_directory -# - if this option is not provided, the mirror_root_directory is scanned -# completely as usual -# mirror_subset_archives: /home/storage/missing-archives - -# Retrieval date information (rsync, etc...) -date: Fri, 28 Aug 2015 13:13:26 +0200 - -# Randomize blocks of messages and send for consumption -block_messages: 250 diff --git a/resources/producer/tar-old-gnu.yml b/resources/producer/tar-old-gnu.yml deleted file mode 100644 index bf4e5fe..0000000 --- a/resources/producer/tar-old-gnu.yml +++ /dev/null @@ -1,22 +0,0 @@ -# Mirror's root directory holding tarballs to load into swh -mirror_root_directory: /home/storage/space/mirrors/gnu.org/old-gnu/ - -# Origin setup's possible scheme url -url_scheme: rsync://ftp.gnu.org/old-gnu/ - -# Origin type used for tarballs -type: ftp - -# File containing a subset list tarballs from mirror_root_directory to load. -# The file's format is one absolute path name to a tarball per line. 
-# NOTE: -# - This file must contain data consistent with the mirror_root_directory -# - if this option is not provided, the mirror_root_directory is scanned -# completely as usual -# mirror_subset_archives: /home/tony/work/inria/repo/swh-environment/swh-loader-tar/old-gnu-missing - -# Retrieval date information (rsync, etc...) -date: Fri, 28 Aug 2015 13:13:26 +0200 - -# Randomize blocks of messages and send for consumption -block_messages: 100 diff --git a/setup.py b/setup.py old mode 100644 new mode 100755 index 22c71aa..8d2a7df --- a/setup.py +++ b/setup.py @@ -1,28 +1,65 @@ +#!/usr/bin/env python3 +# Copyright (C) 2015-2018 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + from setuptools import setup, find_packages +from os import path +from io import open + +here = path.abspath(path.dirname(__file__)) + +# Get the long description from the README file +with open(path.join(here, 'README.md'), encoding='utf-8') as f: + long_description = f.read() + + +def parse_requirements(name=None): + if name: + reqf = 'requirements-%s.txt' % name + else: + reqf = 'requirements.txt' -def parse_requirements(): requirements = [] - for reqf in ('requirements.txt', 'requirements-swh.txt'): - with open(reqf) as f: - for line in f.readlines(): - line = line.strip() - if not line or line.startswith('#'): - continue - requirements.append(line) + if not path.exists(reqf): + return requirements + + with open(reqf) as f: + for line in f.readlines(): + line = line.strip() + if not line or line.startswith('#'): + continue + requirements.append(line) return requirements setup( name='swh.loader.tar', description='Software Heritage Tarball Loader', + long_description=long_description, + long_description_content_type='text/markdown', author='Software Heritage developers', author_email='swh-devel@inria.fr', 
url='https://forge.softwareheritage.org/diffusion/DLDTAR', packages=find_packages(), scripts=[], - install_requires=parse_requirements(), + install_requires=parse_requirements() + parse_requirements('swh'), setup_requires=['vcversioner'], - vcversioner={}, + extras_require={'testing': parse_requirements('test')}, + vcversioner={'version_module_paths': ['swh/loader/tar/_version.py']}, include_package_data=True, + classifiers=[ + "Programming Language :: Python :: 3", + "Intended Audience :: Developers", + "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", + "Operating System :: OS Independent", + "Development Status :: 5 - Production/Stable", + ], + project_urls={ + 'Bug Reports': 'https://forge.softwareheritage.org/maniphest', + 'Funding': 'https://www.softwareheritage.org/donate', + 'Source': 'https://forge.softwareheritage.org/source/swh-loader-tar', + }, ) diff --git a/swh.loader.tar.egg-info/PKG-INFO b/swh.loader.tar.egg-info/PKG-INFO index 423a1c7..5baaf18 100644 --- a/swh.loader.tar.egg-info/PKG-INFO +++ b/swh.loader.tar.egg-info/PKG-INFO @@ -1,10 +1,79 @@ -Metadata-Version: 1.0 +Metadata-Version: 2.1 Name: swh.loader.tar -Version: 0.0.35 +Version: 0.0.38 Summary: Software Heritage Tarball Loader Home-page: https://forge.softwareheritage.org/diffusion/DLDTAR Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN -Description: UNKNOWN +Project-URL: Funding, https://www.softwareheritage.org/donate +Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest +Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-tar +Description: # SWH Tarball Loader + + The Software Heritage Tarball Loader is in charge of ingesting the + directory representation of the tarball into the Software Heritage + archive. + + ## Configuration + + This is the loader's (or task's) configuration file. 
+ + *`{/etc/softwareheritage | ~/.config/swh | ~/.swh}`/loader/tar.yml*: + + ```YAML + working_dir: /home/storage/tmp/ + storage: + cls: remote + args: + url: http://localhost:5002/ + ``` + + ## API + + ### local + + Load local tarball directly from code or python3's toplevel: + + ``` Python + # Fill in those + repo = '8sync.tar.gz' + tarpath = '/home/storage/tar/%s' % repo + origin = {'url': 'file://%s' % repo, 'type': 'tar'} + visit_date = 'Tue, 3 May 2017 17:16:32 +0200' + last_modified = 'Tue, 10 May 2016 16:16:32 +0200' + import logging + logging.basicConfig(level=logging.DEBUG) + + from swh.loader.tar.tasks import LoadTarRepository + l = LoadTarRepository() + l.run_task(origin=origin, visit_date=visit_date, + last_modified=last_modified) + ``` + + ### remote + + Load remote tarball is the same sample + + ```Python + url = 'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz' + origin = {'url': url, 'type': 'tar'} + visit_date = 'Tue, 3 May 2017 17:16:32 +0200' + last_modified = '2016-04-22 16:35' + import logging + logging.basicConfig(level=logging.DEBUG) + + from swh.loader.tar.tasks import LoadTarRepository + l = LoadTarRepository() + l.run_task(origin=origin, visit_date=visit_date, + last_modified=last_modified) + ``` + Platform: UNKNOWN +Classifier: Programming Language :: Python :: 3 +Classifier: Intended Audience :: Developers +Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) +Classifier: Operating System :: OS Independent +Classifier: Development Status :: 5 - Production/Stable +Description-Content-Type: text/markdown +Provides-Extra: testing diff --git a/swh.loader.tar.egg-info/SOURCES.txt b/swh.loader.tar.egg-info/SOURCES.txt index 4c71250..0a086ea 100644 --- a/swh.loader.tar.egg-info/SOURCES.txt +++ b/swh.loader.tar.egg-info/SOURCES.txt @@ -1,42 +1,27 @@ -.gitignore -AUTHORS -LICENSE MANIFEST.in Makefile -README +README.md requirements-swh.txt requirements.txt setup.py version.txt -debian/changelog -debian/compat 
-debian/control -debian/copyright -debian/rules -debian/source/format -docs/.gitignore -docs/Makefile -docs/conf.py -docs/index.rst -docs/_static/.placeholder -docs/_templates/.placeholder -resources/producer/tar-gnu.yml -resources/producer/tar-old-gnu.yml swh/__init__.py swh.loader.tar.egg-info/PKG-INFO swh.loader.tar.egg-info/SOURCES.txt swh.loader.tar.egg-info/dependency_links.txt swh.loader.tar.egg-info/requires.txt swh.loader.tar.egg-info/top_level.txt swh/loader/__init__.py swh/loader/tar/__init__.py +swh/loader/tar/_version.py swh/loader/tar/build.py -swh/loader/tar/db.py -swh/loader/tar/file.py swh/loader/tar/loader.py -swh/loader/tar/producer.py swh/loader/tar/tasks.py swh/loader/tar/utils.py +swh/loader/tar/tests/__init__.py +swh/loader/tar/tests/conftest.py swh/loader/tar/tests/test_build.py swh/loader/tar/tests/test_loader.py -swh/loader/tar/tests/test_utils.py \ No newline at end of file +swh/loader/tar/tests/test_tasks.py +swh/loader/tar/tests/test_utils.py +swh/loader/tar/tests/resources/sample-folder.tgz \ No newline at end of file diff --git a/swh.loader.tar.egg-info/requires.txt b/swh.loader.tar.egg-info/requires.txt index d761eca..2e69cf7 100644 --- a/swh.loader.tar.egg-info/requires.txt +++ b/swh.loader.tar.egg-info/requires.txt @@ -1,9 +1,16 @@ +arrow +vcversioner +requests click python-dateutil -retrying -swh.core>=0.0.36 -swh.loader.dir>=0.0.32 -swh.model>=0.0.15 -swh.scheduler>=0.0.14 +swh.core>=0.0.46 +swh.model>=0.0.27 +swh.scheduler>=0.0.39 swh.storage>=0.0.83 -vcversioner +swh.loader.core>=0.0.35 +swh.loader.dir>=0.0.33 + +[testing] +pytest<4 +swh-scheduler[testing] +requests-mock diff --git a/swh/loader/tar/_version.py b/swh/loader/tar/_version.py new file mode 100644 index 0000000..346dbad --- /dev/null +++ b/swh/loader/tar/_version.py @@ -0,0 +1,5 @@ + +# This file is automatically generated by setup.py. 
+__version__ = '0.0.38' +__sha__ = 'g3988b48' +__revision__ = 'g3988b48' diff --git a/swh/loader/tar/build.py b/swh/loader/tar/build.py index 02fd167..92d4090 100755 --- a/swh/loader/tar/build.py +++ b/swh/loader/tar/build.py @@ -1,101 +1,76 @@ # Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information -import os - -from swh.core import utils +import arrow # Static setup EPOCH = 0 UTC_OFFSET = 0 SWH_PERSON = { 'name': 'Software Heritage', 'fullname': 'Software Heritage', 'email': 'robot@softwareheritage.org' } -REVISION_MESSAGE = 'synthetic revision message' +REVISION_MESSAGE = 'swh-loader-tar: synthetic revision message' REVISION_TYPE = 'tar' -def compute_origin(url_scheme, url_type, root_dirpath, tarpath): - """Compute the origin. - - Args: - - url_scheme: scheme to build the origin's url - - url_type: origin's type - - root_dirpath: the top level root directory path - - tarpath: file's absolute path - - Returns: - Dictionary origin with keys: - - url: origin's url - - type: origin's type - - """ - relative_path = utils.commonname(root_dirpath, tarpath) - return { - 'url': ''.join([url_scheme, - os.path.dirname(relative_path)]), - 'type': url_type, - } - - -def _time_from_path(tarpath): +def _time_from_last_modified(last_modified): """Compute the modification time from the tarpath. Args: - tarpath (str|bytes): Full path to the archive to extract the - date from. + last_modified (str): Last modification time Returns: - dict representing a timestamp with keys seconds and microseconds keys. 
+ dict representing a timestamp with keys {seconds, microseconds} """ - mtime = os.lstat(tarpath).st_mtime - if isinstance(mtime, float): - normalized_time = list(map(int, str(mtime).split('.'))) - else: # assuming int - normalized_time = [mtime, 0] - + last_modified = arrow.get(last_modified) + mtime = last_modified.float_timestamp + normalized_time = [int(mtime // 1), last_modified.microsecond] return { 'seconds': normalized_time[0], 'microseconds': normalized_time[1] } -def compute_revision(tarpath): +def compute_revision(tarpath, last_modified): """Compute a revision. Args: - tarpath: absolute path to the tarball + tarpath (str): absolute path to the tarball + last_modified (str): Time of last modification read from the + source remote (most probably by the lister) Returns: Revision as dict: - date (dict): the modification timestamp as returned by _time_from_path function - committer_date: the modification timestamp as returned by _time_from_path function - author: cf. SWH_PERSON - committer: cf. SWH_PERSON - type: cf. REVISION_TYPE - message: cf. REVISION_MESSAGE """ - ts = _time_from_path(tarpath) + ts = _time_from_last_modified(last_modified) + return { 'date': { 'timestamp': ts, 'offset': UTC_OFFSET, }, 'committer_date': { 'timestamp': ts, 'offset': UTC_OFFSET, }, 'author': SWH_PERSON, 'committer': SWH_PERSON, 'type': REVISION_TYPE, 'message': REVISION_MESSAGE, + 'synthetic': True, } diff --git a/swh/loader/tar/db.py b/swh/loader/tar/db.py deleted file mode 100644 index 961af03..0000000 --- a/swh/loader/tar/db.py +++ /dev/null @@ -1,52 +0,0 @@ -# Copyright (C) 2015 The Software Heritage developers -# See the AUTHORS file at the top-level directory of this distribution -# License: GNU General Public License version 3, or any later version -# See top-level LICENSE file for more information - -import psycopg2 - - -def connect(db_url): - """Open db connection. - """ - return psycopg2.connect(db_url) - - -def execute(cur, query_params): - """Execute the query_params.
- query_params is expected to be either: - - a sql query (string) - - a tuple (sql query, params) - """ - if isinstance(query_params, str): - cur.execute(query_params) - else: - cur.execute(*query_params) - - -def entry_to_bytes(entry): - """Convert an entry coming from the database to bytes""" - if isinstance(entry, memoryview): - return entry.tobytes() - return entry - - -def line_to_bytes(line): - """Convert a line coming from the database to bytes""" - return line.__class__(entry_to_bytes(entry) for entry in line) - - -def cursor_to_bytes(cursor): - """Yield all the data from a cursor as bytes""" - yield from (line_to_bytes(line) for line in cursor) - - -def query_fetch(db_conn, query_params): - """Execute sql query which returns results. - query_params is expected to be either: - - a sql query (string) - - a tuple (sql query, params) - """ - with db_conn.cursor() as cur: - execute(cur, query_params) - yield from cursor_to_bytes(cur) diff --git a/swh/loader/tar/file.py b/swh/loader/tar/file.py deleted file mode 100644 index 57fd6b5..0000000 --- a/swh/loader/tar/file.py +++ /dev/null @@ -1,90 +0,0 @@ -# Copyright (C) 2015-2017 The Software Heritage developers -# See the AUTHORS file at the top-level directory of this distribution -# License: GNU General Public License version 3, or any later version -# See top-level LICENSE file for more information - -import itertools -import os - -from swh.core import tarball -from swh.loader.tar import utils - - -def archives_from_dir(path): - """Given a path to a directory, walk such directory and yield tuple of - tarpath, fname. - - Args: - path: top level directory - - Returns: - Generator of tuple tarpath, filename with tarpath a tarball. 
- - """ - for dirpath, dirnames, filenames in os.walk(path): - for fname in filenames: - tarpath = os.path.join(dirpath, fname) - if not os.path.exists(tarpath): - continue - - if tarball.is_tarball(tarpath): - yield tarpath, fname - - -def archives_from_file(mirror_file): - """Given a path to a file containing one tarball per line, yield a tuple of - tarpath, fname. - - Args: - mirror_file: path to the file containing list of tarpath. - - Returns: - Generator of tuple tarpath, filename with tarpath a tarball. - - """ - with open(mirror_file, 'r') as f: - for tarpath in f.readlines(): - tarpath = tarpath.strip() - if not os.path.exists(tarpath): - print('WARN: %s does not exist. Skipped.' % tarpath) - continue - - if tarball.is_tarball(tarpath): - yield tarpath, os.path.basename(tarpath) - - -def archives_from(path): - """From path, list tuple of tarpath, fname. - - Args: - path: top directory to list archives from or custom file format. - - Returns: - Generator of tuple tarpath, filename with tarpath a tarball. - - """ - if os.path.isfile(path): - yield from archives_from_file(path) - elif os.path.isdir(path): - yield from archives_from_dir(path) - else: - raise ValueError( - 'Input incorrect, %s must be a file or a directory.' % path) - - -def random_archives_from(path, block, limit=None): - """Randomize by size block the archives. - - Returns: - Generator of randomized tuple tarpath, filename with tarpath a tarball. 
- - """ - random_archives = utils.random_blocks(archives_from(path), - block, - fillvalue=(None, None)) - - if limit: - random_archives = itertools.islice(random_archives, limit) - - for tarpath, fname in ((t, f) for t, f in random_archives if t and f): - yield tarpath, fname diff --git a/swh/loader/tar/loader.py b/swh/loader/tar/loader.py index 7d4981f..095f6bb 100644 --- a/swh/loader/tar/loader.py +++ b/swh/loader/tar/loader.py @@ -1,154 +1,346 @@ # Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os import tempfile +import requests import shutil +from urllib.parse import urlparse + +from tempfile import mkdtemp from swh.core import tarball -from swh.loader.core.loader import SWHLoader -from swh.loader.dir import loader -from swh.loader.tar import utils -from swh.model import hashutil +from swh.loader.core.loader import BufferedLoader +from swh.loader.dir.loader import revision_from, snapshot_from +from swh.model.hashutil import MultiHash, HASH_BLOCK_SIZE +from swh.model.from_disk import Directory +from .build import compute_revision -class TarLoader(loader.DirLoader): - """Tarball loader implementation. +try: + from ._version import __version__ +except ImportError: + __version__ = 'devel' - This is a subclass of the :class:DirLoader as the main goal of - this class is to first uncompress a tarball, then provide the - uncompressed directory/tree to be loaded by the DirLoader. - This will: +TEMPORARY_DIR_PREFIX_PATTERN = 'swh.loader.tar.'
+DEBUG_MODE = '** DEBUG MODE **' + + +class LocalResponse: + """Local Response class with iter_content api + + """ + def __init__(self, path): + self.path = path + + def iter_content(self, chunk_size=None): + with open(self.path, 'rb') as f: + for chunk in f: + yield chunk + + +class ArchiveFetcher: + """Http/Local client in charge of downloading archives from a + remote/local server. + + Args: + temp_directory (str): Path to the temporary disk location used + for downloading the release artifacts + + """ + def __init__(self, temp_directory=None): + self.temp_directory = temp_directory + self.session = requests.session() + self.params = { + 'headers': { + 'User-Agent': 'Software Heritage Tar Loader (%s)' % ( + __version__ + ) + } + } + + def download(self, url): + """Download the remote tarball url locally. + + Args: + url (str): Url (file or http*) + + Raises: + ValueError in case of failing to query + + Returns: + Tuple of local (filepath, hashes of filepath) + + """ + url_parsed = urlparse(url) + if url_parsed.scheme == 'file': + path = url_parsed.path + response = LocalResponse(path) + length = os.path.getsize(path) + else: + response = self.session.get(url, **self.params, stream=True) + if response.status_code != 200: + raise ValueError("Fail to query '%s'. Reason: %s" % ( + url, response.status_code)) + length = int(response.headers['content-length']) + + filepath = os.path.join(self.temp_directory, os.path.basename(url)) + + h = MultiHash(length=length) + with open(filepath, 'wb') as f: + for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE): + h.update(chunk) + f.write(chunk) + + actual_length = os.path.getsize(filepath) + if length != actual_length: + raise ValueError('Error when checking size: %s != %s' % ( + length, actual_length)) + + hashes = { + 'length': length, + **h.hexdigest() + } + return filepath, hashes + + +class BaseTarLoader(BufferedLoader): + """Base Tarball Loader class. 
+ + This factorizes multiple loader implementations: + + - :class:`RemoteTarLoader`: New implementation able to deal with + remote archives. + + - :class:`TarLoader`: Old implementation which dealt with only + local archive. It also was only passing along objects to + persist (revision, etc...) - - creates an origin (if it does not exist) - - creates a fetch_history entry - - creates an origin_visit - - uncompress locally the tarball in a temporary location - - process the content of the tarballs to persist on swh storage - - clean up the temporary location - - write an entry in fetch_history to mark the loading tarball end (success - or failure) """ CONFIG_BASE_FILENAME = 'loader/tar' ADDITIONAL_CONFIG = { - 'extraction_dir': ('string', '/tmp') + 'working_dir': ('string', '/tmp'), + 'debug': ('bool', False), # NOT FOR PRODUCTION } def __init__(self, logging_class='swh.loader.tar.TarLoader', config=None): super().__init__(logging_class=logging_class, config=config) + self.local_cache = None self.dir_path = None + working_dir = self.config['working_dir'] + os.makedirs(working_dir, exist_ok=True) + self.temp_directory = mkdtemp( + suffix='-%s' % os.getpid(), + prefix=TEMPORARY_DIR_PREFIX_PATTERN, + dir=working_dir) + self.client = ArchiveFetcher(temp_directory=self.temp_directory) + os.makedirs(working_dir, 0o755, exist_ok=True) + self.dir_path = tempfile.mkdtemp(prefix='swh.loader.tar-', + dir=self.temp_directory) + self.debug = self.config['debug'] - def load(self, *, tar_path, origin, visit_date, revision, - branch_name=None): - """Load a tarball in `tarpath` in the Software Heritage Archive. 
- - Args: - tar_path: tarball to import - origin (dict): an origin dictionary as returned by - :func:`swh.storage.storage.Storage.origin_get_one` - visit_date (str): the date the origin was visited (as an - isoformatted string) - revision (dict): a revision as passed to - :func:`swh.storage.storage.Storage.revision_add`, excluding the - `id` and `directory` keys (computed from the directory) - branch_name (str): the optional branch_name to use for snapshot + def cleanup(self): + """Clean up temporary disk folders used. """ - # Shortcut super() as we use different arguments than the DirLoader. - return SWHLoader.load(self, tar_path=tar_path, origin=origin, - visit_date=visit_date, revision=revision, - branch_name=branch_name) + if self.debug: + self.log.warn('%s Will not clean up temp dir %s' % ( + DEBUG_MODE, self.temp_directory + )) + return + if os.path.exists(self.temp_directory): + self.log.debug('Clean up %s' % self.temp_directory) + shutil.rmtree(self.temp_directory) def prepare_origin_visit(self, *, origin, visit_date=None, **kwargs): + """Prepare the origin visit information. + + Args: + origin (dict): Dict with keys {url, type} + visit_date (str): Date representing the date of the + visit. None by default will make it the current time + during the loading process. + + """ self.origin = origin if 'type' not in self.origin: # let the type flow if present self.origin['type'] = 'tar' self.visit_date = visit_date - def prepare(self, *, tar_path, origin, revision, visit_date=None, - branch_name=None): - """1. Uncompress the tarball in a temporary directory. - 2. Compute some metadata to update the revision. 
+ def get_tarball_url_to_retrieve(self): + """Compute the tarball url to allow retrieval """ - # Prepare the extraction path - extraction_dir = self.config['extraction_dir'] - os.makedirs(extraction_dir, 0o755, exist_ok=True) - self.dir_path = tempfile.mkdtemp(prefix='swh.loader.tar-', - dir=extraction_dir) + raise NotImplementedError() - # add checksums in revision + def fetch_data(self): + """Retrieve, uncompress archive and fetch objects from the tarball. + The actual ingestion takes place in the :meth:`store_data` + implementation below. - self.log.info('Uncompress %s to %s' % (tar_path, self.dir_path)) - nature = tarball.uncompress(tar_path, self.dir_path) + """ + url = self.get_tarball_url_to_retrieve() + filepath, hashes = self.client.download(url) + nature = tarball.uncompress(filepath, self.dir_path) - if 'metadata' not in revision: - artifact = utils.convert_to_hex(hashutil.hash_path(tar_path)) - artifact['name'] = os.path.basename(tar_path) - artifact['archive_type'] = nature - artifact['length'] = os.path.getsize(tar_path) - revision['metadata'] = { - 'original_artifact': [artifact], - } + dir_path = self.dir_path.encode('utf-8') + directory = Directory.from_disk(path=dir_path, save_path=True) + objects = directory.collect() + if 'content' not in objects: + objects['content'] = {} + if 'directory' not in objects: + objects['directory'] = {} + + # compute the full revision (with ids) + revision = self.build_revision(filepath, nature, hashes) + revision = revision_from(directory.hash, revision) + objects['revision'] = { + revision['id']: revision, + } - branch = branch_name if branch_name else os.path.basename(tar_path) + snapshot = self.build_snapshot(revision) + objects['snapshot'] = { + snapshot['id']: snapshot + } + self.objects = objects - super().prepare(dir_path=self.dir_path, - origin=origin, - visit_date=visit_date, - revision=revision, - release=None, - branch_name=branch) + def store_data(self): + """Store the objects in the swh archive. 
- def cleanup(self): - """Clean up temporary directory where we uncompress the tarball. + """ + objects = self.objects + self.maybe_load_contents(objects['content'].values()) + self.maybe_load_directories(objects['directory'].values()) + self.maybe_load_revisions(objects['revision'].values()) + snapshot = list(objects['snapshot'].values())[0] + self.maybe_load_snapshot(snapshot) + + +class RemoteTarLoader(BaseTarLoader): + """This is able to load from remote/local archive into the swh + archive. + + This will: + + - create an origin (if it does not exist) and a visit + - fetch the tarball in a temporary location + - uncompress it locally in a temporary location + - process the content of the tarball to persist on swh storage + - clean up the temporary location + + """ + def prepare(self, *, last_modified, **kwargs): + """last_modified is the time of last modification of the tarball. + + E.g https://ftp.gnu.org/gnu/8sync/: + [ ] 8sync-0.1.0.tar.gz 2016-04-22 16:35 217K + [ ] 8sync-0.1.0.tar.gz.sig 2016-04-22 16:35 543 + [ ] ... + + Args: + origin (dict): Dict with keys {url, type} + last_modified (str): The date of last modification of the + archive to ingest. + visit_date (str): Date representing the date of the + visit. None by default will make it the current time + during the loading process. 
""" - if self.dir_path and os.path.exists(self.dir_path): - shutil.rmtree(self.dir_path) - - -if __name__ == '__main__': - import click - import logging - logging.basicConfig( - level=logging.DEBUG, - format='%(asctime)s %(process)d %(message)s' - ) - - @click.command() - @click.option('--archive-path', required=1, help='Archive path to load') - @click.option('--origin-url', required=1, help='Origin url to associate') - @click.option('--visit-date', default=None, - help='Visit date time override') - def main(archive_path, origin_url, visit_date): - """Loading archive tryout.""" - import datetime - origin = {'url': origin_url, 'type': 'tar'} - commit_time = int(datetime.datetime.now( - tz=datetime.timezone.utc).timestamp()) - swh_person = { - 'name': 'Software Heritage', - 'fullname': 'Software Heritage', - 'email': 'robot@softwareheritage.org' + self.last_modified = last_modified + + def get_tarball_url_to_retrieve(self): + return self.origin['url'] + + def build_revision(self, filepath, nature, hashes): + """Build the revision with identifier + + We use the `last_modified` date provided by the caller to + build the revision. + + """ + return { + **compute_revision(filepath, self.last_modified), + 'metadata': { + 'original_artifact': [{ + 'name': os.path.basename(filepath), + 'archive_type': nature, + **hashes, + }], + } } - revision = { - 'date': {'timestamp': commit_time, 'offset': 0}, - 'committer_date': {'timestamp': commit_time, 'offset': 0}, - 'author': swh_person, - 'committer': swh_person, - 'type': 'tar', - 'message': 'swh-loader-tar: synthetic revision message', - 'metadata': {}, - 'synthetic': True, + + def build_snapshot(self, revision): + """Build the snapshot targeting the revision. + + """ + branch_name = os.path.basename(self.dir_path) + return snapshot_from(revision['id'], branch_name) + + +class LegacyLocalTarLoader(BaseTarLoader): + """This loads local tarball into the swh archive. 
It's using the + revision and branch provided by the caller as scaffolding to + create the full revision and snapshot (with identifiers). + + This is what's: + - been used to ingest our 2015 rsync copy of gnu.org + - still used by the loader deposit + + This will: + + - create an origin (if it does not exist) and a visit + - uncompress a tarball in a local and temporary location + - process the content of the tarball to persist on swh storage + - associate it to a passed revision and snapshot + - clean up the temporary location + + """ + def prepare(self, *, tar_path, revision, branch_name, **kwargs): + """Prepare the data prior to ingest it in SWH archive. + + Args: + tar_path (str): Path to the archive to ingest + revision (dict): The synthetic revision to associate the + archive to (no identifiers within) + branch_name (str): The branch name to use for the + snapshot. + + """ + self.tar_path = tar_path + self.revision = revision + self.branch_name = branch_name + + def get_tarball_url_to_retrieve(self): + return 'file://%s' % self.tar_path + + def build_revision(self, filepath, nature, hashes): + """Build the revision with identifier + + We use the revision provided by the caller as a scaffolding + revision. + + """ + return { + **self.revision, + 'metadata': { + 'original_artifact': [{ + 'name': os.path.basename(filepath), + 'archive_type': nature, + **hashes, + }], + } } - TarLoader().load(tar_path=archive_path, origin=origin, - visit_date=visit_date, revision=revision, - branch_name='master') - main() + def build_snapshot(self, revision): + """Build the snapshot targeting the revision. + + We use the branch_name provided by the caller as a scaffolding + as well. 
+ + """ + return snapshot_from(revision['id'], self.branch_name) diff --git a/swh/loader/tar/producer.py b/swh/loader/tar/producer.py deleted file mode 100755 index 21db54f..0000000 --- a/swh/loader/tar/producer.py +++ /dev/null @@ -1,102 +0,0 @@ -# Copyright (C) 2015-2018 The Software Heritage developers -# See the AUTHORS file at the top-level directory of this distribution -# License: GNU General Public License version 3, or any later version -# See top-level LICENSE file for more information - -import click -import dateutil.parser - -from swh.scheduler.utils import get_task - -from swh.core import config -from swh.loader.tar import build, file - - -TASK_QUEUE = 'swh.loader.tar.tasks.LoadTarRepository' - - -def produce_archive_messages_from( - conf, root_dir, visit_date, mirror_file=None, dry_run=False): - """From root_dir, produce archive tarball messages to celery. - - Will print error message when some computation arise on archive - and continue. - - Args: - conf: dictionary holding static metadata - root_dir: top directory to list archives from. 
- visit_date: override origin's visit date of information - mirror_file: a filtering file of tarballs to load - dry_run: will compute but not send messages - - Returns: - Number of messages generated - - """ - - limit = conf.get('limit') - block = int(conf['block_messages']) - count = 0 - - path_source_tarballs = mirror_file if mirror_file else root_dir - - visit_date = dateutil.parser.parse(visit_date) - if not dry_run: - task = get_task(TASK_QUEUE) - - for tarpath, _ in file.random_archives_from( - path_source_tarballs, block, limit): - try: - origin = build.compute_origin( - conf['url_scheme'], conf['type'], root_dir, tarpath) - revision = build.compute_revision(tarpath) - - if not dry_run: - task.delay(tar_path=tarpath, origin=origin, - visit_date=visit_date, - revision=revision) - - count += 1 - except ValueError: - print('Problem with the following archive: %s' % tarpath) - - return count - - -@click.command() -@click.option('--config-file', required=1, - help='Configuration file path') -@click.option('--dry-run/--no-dry-run', default=False, - help='Dry run (print repo only)') -@click.option('--limit', default=None, - help='Number of origins limit to send') -def main(config_file, dry_run, limit): - """Tarball producer of local fs tarballs. - - """ - conf = config.read(config_file) - url_scheme = conf['url_scheme'] - mirror_dir = conf['mirror_root_directory'] - - # remove trailing / in configuration (to ease ulterior computation) - if url_scheme[-1] == '/': - conf['url_scheme'] = url_scheme[0:-1] - - if mirror_dir[-1] == '/': - conf['mirror_root_directory'] = mirror_dir[0:-1] - - if limit: - conf['limit'] = int(limit) - - nb_tarballs = produce_archive_messages_from( - conf=conf, - root_dir=conf['mirror_root_directory'], - visit_date=conf['date'], - mirror_file=conf.get('mirror_subset_archives'), - dry_run=dry_run) - - print('%s tarball(s) sent to worker.' 
% nb_tarballs) - - -if __name__ == '__main__': - main() diff --git a/swh/loader/tar/tasks.py b/swh/loader/tar/tasks.py index a4cd70d..dc3edce 100644 --- a/swh/loader/tar/tasks.py +++ b/swh/loader/tar/tasks.py @@ -1,27 +1,17 @@ # Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information -from swh.scheduler.task import Task +from celery import current_app as app -from swh.loader.tar.loader import TarLoader +from swh.loader.tar.loader import RemoteTarLoader -class LoadTarRepository(Task): - """Import a directory to Software Heritage - +@app.task(name=__name__ + '.LoadTarRepository') +def load_tar(origin, visit_date, last_modified): + """Import a remote or local archive to Software Heritage """ - task_queue = 'swh_loader_tar' - - def run_task(self, *, tar_path, origin, visit_date, revision, - branch_name=None): - """Import a tarball into swh. - - Args: see :func:`TarLoader.load`. 
- """ - loader = TarLoader() - loader.log = self.log - return loader.load(tar_path=tar_path, origin=origin, - visit_date=visit_date, revision=revision, - branch_name=branch_name) + loader = RemoteTarLoader() + return loader.load( + origin=origin, visit_date=visit_date, last_modified=last_modified) diff --git a/docs/_static/.placeholder b/swh/loader/tar/tests/__init__.py similarity index 100% rename from docs/_static/.placeholder rename to swh/loader/tar/tests/__init__.py diff --git a/swh/loader/tar/tests/conftest.py b/swh/loader/tar/tests/conftest.py new file mode 100644 index 0000000..972dd2f --- /dev/null +++ b/swh/loader/tar/tests/conftest.py @@ -0,0 +1,10 @@ +import pytest + +from swh.scheduler.tests.conftest import * # noqa + + +@pytest.fixture(scope='session') +def celery_includes(): + return [ + 'swh.loader.tar.tasks', + ] diff --git a/swh/loader/tar/tests/resources/sample-folder.tgz b/swh/loader/tar/tests/resources/sample-folder.tgz new file mode 100644 index 0000000..cc84894 Binary files /dev/null and b/swh/loader/tar/tests/resources/sample-folder.tgz differ diff --git a/swh/loader/tar/tests/test_build.py b/swh/loader/tar/tests/test_build.py index 592f792..61c7379 100644 --- a/swh/loader/tar/tests/test_build.py +++ b/swh/loader/tar/tests/test_build.py @@ -1,92 +1,58 @@ # Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import unittest - -from nose.tools import istest from unittest.mock import patch from swh.loader.tar import build class TestBuildUtils(unittest.TestCase): - @istest - def compute_origin(self): - # given - expected_origin = { - 'url': 'rsync://some/url/package-foo', - 'type': 'rsync', - } - - # when - actual_origin = build.compute_origin( - 'rsync://some/url/', - 'rsync', - '/some/root/path/', - '/some/root/path/package-foo/package-foo-1.2.3.tgz') 
- - # then - self.assertEquals(actual_origin, expected_origin) - - @patch('swh.loader.tar.build._time_from_path') - @istest - def compute_revision(self, mock_time_from_path): - mock_time_from_path.return_value = 'some-other-time' + @patch('swh.loader.tar.build._time_from_last_modified') + def test_compute_revision(self, mock_time_from_last_modified): + mock_time_from_last_modified.return_value = 'some-other-time' # when - actual_revision = build.compute_revision('/some/path') + actual_revision = build.compute_revision('/some/path', 'last-modified') expected_revision = { 'date': { 'timestamp': 'some-other-time', 'offset': build.UTC_OFFSET, }, 'committer_date': { 'timestamp': 'some-other-time', 'offset': build.UTC_OFFSET, }, 'author': build.SWH_PERSON, 'committer': build.SWH_PERSON, 'type': build.REVISION_TYPE, 'message': build.REVISION_MESSAGE, + 'synthetic': True, } # then - self.assertEquals(actual_revision, expected_revision) - - mock_time_from_path.assert_called_once_with('/some/path') + self.assertEqual(actual_revision, expected_revision) - @patch('swh.loader.tar.build.os') - @istest - def time_from_path_with_float(self, mock_os): - class MockStat: - st_mtime = 1445348286.8308342 - mock_os.lstat.return_value = MockStat() + mock_time_from_last_modified.assert_called_once_with( + 'last-modified') - actual_time = build._time_from_path('some/path') + def test_time_from_last_modified_with_float(self): + actual_time = build._time_from_last_modified( + '2015-10-20T13:38:06.830834+00:00') - self.assertEquals(actual_time, { + self.assertEqual(actual_time, { 'seconds': 1445348286, - 'microseconds': 8308342 + 'microseconds': 830834 }) - mock_os.lstat.assert_called_once_with('some/path') - - @patch('swh.loader.tar.build.os') - @istest - def time_from_path_with_int(self, mock_os): - class MockStat: - st_mtime = 1445348286 + def test_time_from_last_modified_with_int(self): + actual_time = build._time_from_last_modified( + '2015-10-20T13:38:06+00:00') - 
mock_os.lstat.return_value = MockStat() - - actual_time = build._time_from_path('some/path') - - self.assertEquals(actual_time, { + self.assertEqual(actual_time, { 'seconds': 1445348286, 'microseconds': 0 }) - - mock_os.lstat.assert_called_once_with('some/path') diff --git a/swh/loader/tar/tests/test_loader.py b/swh/loader/tar/tests/test_loader.py index d64d7d5..30a2e70 100644 --- a/swh/loader/tar/tests/test_loader.py +++ b/swh/loader/tar/tests/test_loader.py @@ -1,208 +1,239 @@ # Copyright (C) 2017-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import os -from unittest import TestCase +import pytest +import requests_mock -from nose.plugins.attrib import attr -from nose.tools import istest +from swh.model import hashutil -from swh.loader.tar.loader import TarLoader +from swh.loader.core.tests import BaseLoaderTest +from swh.loader.tar.build import SWH_PERSON +from swh.loader.tar.loader import RemoteTarLoader, LegacyLocalTarLoader -class LoaderNoStorageForTest: - """Mixin class to inhibit the persistence and keep in memory the data - sent for storage. +TEST_CONFIG = { + 'working_dir': '/tmp/tests/loader-tar/', # where to extract the tarball + 'debug': False, + 'storage': { # we instantiate it but we don't use it in test context + 'cls': 'memory', + 'args': { + } + }, + 'send_contents': True, + 'send_directories': True, + 'send_revisions': True, + 'send_releases': True, + 'send_snapshot': True, + 'content_packet_size': 100, + 'content_packet_block_size_bytes': 104857600, + 'content_packet_size_bytes': 1073741824, + 'directory_packet_size': 250, + 'revision_packet_size': 100, + 'release_packet_size': 100, + 'content_size_limit': 1000000000 +} + + +class RemoteTarLoaderForTest(RemoteTarLoader): + def parse_config_file(self, *args, **kwargs): + return TEST_CONFIG - cf. 
SWHTarLoaderNoStorage + +@pytest.mark.fs +class PrepareDataForTestLoader(BaseLoaderTest): + """Prepare the archive to load (test fixture). """ - def __init__(self): - super().__init__() - # Init the state - self.all_contents = [] - self.all_directories = [] - self.all_revisions = [] - self.all_releases = [] - self.all_snapshots = [] - - def send_origin(self, origin): - self.origin = origin - - def send_origin_visit(self, origin_id, ts): - self.origin_visit = { - 'origin': origin_id, - 'ts': ts, - 'visit': 1, - } - return self.origin_visit + def setUp(self): + super().setUp('sample-folder.tgz', + start_path=os.path.dirname(__file__), + uncompress_archive=False) + self.tarpath = self.destination_path - def update_origin_visit(self, origin_id, visit, status): - self.status = status - self.origin_visit = visit + def assert_data_ok(self): + # then + self.assertCountContents(8, "3 files + 5 links") + self.assertCountDirectories(6, "4 subdirs + 1 empty + 1 main dir") + self.assertCountRevisions(1, "synthetic revision") - def maybe_load_contents(self, all_contents): - self.all_contents.extend(all_contents) + rev_id = hashutil.hash_to_bytes( + '67a7d7dda748f9a86b56a13d9218d16f5cc9ab3d') + actual_revision = next(self.storage.revision_get([rev_id])) + self.assertTrue(actual_revision['synthetic']) + self.assertEqual(actual_revision['parents'], []) + self.assertEqual(actual_revision['type'], 'tar') + self.assertEqual(actual_revision['message'], + b'swh-loader-tar: synthetic revision message') + self.assertEqual(actual_revision['directory'], + b'\xa7A\xfcM\x96\x8c{\x8e<\x94\xff\x86\xe7\x04\x80\xc5\xc7\xe5r\xa9') # noqa + + self.assertEqual( + actual_revision['metadata']['original_artifact'][0], + { + 'sha1_git': 'cc848944a0d3e71d287027347e25467e61b07428', + 'archive_type': 'tar', + 'blake2s256': '5d70923443ad36377cd58e993aff0e3c1b9ef14f796c69569105d3a99c64f075', # noqa + 'name': 'sample-folder.tgz', + 'sha1': '3ca0d0a5c6833113bd532dc5c99d9648d618f65a', + 'length': 555, + 
'sha256': '307ebda0071ca5975f618e192c8417161e19b6c8bf581a26061b76dc8e85321d' # noqa + }) - def maybe_load_directories(self, all_directories): - self.all_directories.extend(all_directories) + self.assertCountReleases(0) + self.assertCountSnapshots(1) - def maybe_load_revisions(self, all_revisions): - self.all_revisions.extend(all_revisions) - def maybe_load_releases(self, releases): - self.all_releases.extend(releases) +class TestRemoteTarLoader(PrepareDataForTestLoader): + """Test the remote loader scenario (local/remote) - def maybe_load_snapshot(self, snapshot): - self.all_snapshots.append(snapshot) + """ + def setUp(self): + super().setUp() + self.loader = RemoteTarLoaderForTest() + self.storage = self.loader.storage - def open_fetch_history(self): - return 1 + def test_load_local(self): + """Load a local tarball should result in persisted swh data - def close_fetch_history_success(self, fetch_history_id): - pass + """ + # given + origin = { + 'url': self.repo_url, + 'type': 'tar' + } + visit_date = 'Tue, 3 May 2016 17:16:32 +0200' + last_modified = '2018-12-05T12:35:23+00:00' - def close_fetch_history_failure(self, fetch_history_id): - pass + # when + self.loader.load( + origin=origin, visit_date=visit_date, last_modified=last_modified) + # then + self.assert_data_ok() -TEST_CONFIG = { - 'extraction_dir': '/tmp/tests/loader-tar/', # where to extract the tarball - 'storage': { # we instantiate it but we don't use it in test context - 'cls': 'remote', - 'args': { - 'url': 'http://127.0.0.1:9999', # somewhere that does not exist + @requests_mock.Mocker() + def test_load_remote(self, mock_requests): + """Load a remote tarball should result in persisted swh data + + """ + # setup the mock to stream the content of the tarball + local_url = self.repo_url.replace('file:///', '/') + url = 'https://nowhere.org/%s' % local_url + with open(local_url, 'rb') as f: + data = f.read() + mock_requests.get(url, content=data, headers={ + 'content-length': str(len(data)) + }) + + # 
given + origin = { + 'url': url, + 'type': 'tar' } - }, - 'send_contents': False, - 'send_directories': False, - 'send_revisions': False, - 'send_releases': False, - 'send_snapshot': False, - 'content_packet_size': 100, - 'content_packet_block_size_bytes': 104857600, - 'content_packet_size_bytes': 1073741824, - 'directory_packet_size': 250, - 'revision_packet_size': 100, - 'release_packet_size': 100, -} + visit_date = 'Tue, 3 May 2016 17:16:32 +0200' + last_modified = '2018-12-05T12:35:23+00:00' + # when + self.loader.load( + origin=origin, visit_date=visit_date, last_modified=last_modified) -def parse_config_file(base_filename=None, config_filename=None, - additional_configs=None, global_config=True): - return TEST_CONFIG + self.assert_data_ok() + @requests_mock.Mocker() + def test_load_remote_download_failure(self, mock_requests): + """Load a remote tarball with download failure should result in no data -# Inhibit side-effect loading configuration from disk -TarLoader.parse_config_file = parse_config_file + """ + # setup the mock to stream the content of the tarball + local_url = self.repo_url.replace('file:///', '/') + url = 'https://nowhere.org/%s' % local_url + with open(local_url, 'rb') as f: + data = f.read() + wrong_length = len(data) - 10 + mock_requests.get(url, content=data, headers={ + 'content-length': str(wrong_length) + }) + # given + origin = { + 'url': url, + 'type': 'tar' + } + visit_date = 'Tue, 3 May 2016 17:16:32 +0200' + last_modified = '2018-12-05T12:35:23+00:00' -class SWHTarLoaderNoStorage(LoaderNoStorageForTest, TarLoader): - """A TarLoader with no persistence. 
+ # when + r = self.loader.load( + origin=origin, visit_date=visit_date, + last_modified=last_modified) - Context: - Load a tarball with a persistent-less tarball loader + self.assertEqual(r, {'status': 'failed'}) + self.assertCountContents(0) + self.assertCountDirectories(0) + self.assertCountRevisions(0) + self.assertCountSnapshots(0) - """ - pass +class TarLoaderForTest(LegacyLocalTarLoader): + def parse_config_file(self, *args, **kwargs): + return TEST_CONFIG -PATH_TO_DATA = '../../../../..' +class TestTarLoader(PrepareDataForTestLoader): + """Test the legacy tar loader + + """ -class SWHTarLoaderITTest(TestCase): def setUp(self): super().setUp() + self.loader = TarLoaderForTest() + self.storage = self.loader.storage - self.loader = SWHTarLoaderNoStorage() - - @attr('fs') - @istest - def load(self): - """Process a new tarball should be ok + def test_load(self): + """Load a local tarball should result in persisted swh data """ # given - start_path = os.path.dirname(__file__) - tarpath = os.path.join( - start_path, PATH_TO_DATA, - 'swh-storage-testdata/dir-folders/sample-folder.tgz') - origin = { - 'url': 'file:///tmp/sample-folder', - 'type': 'dir' + 'url': self.repo_url, + 'type': 'tar' } visit_date = 'Tue, 3 May 2016 17:16:32 +0200' import datetime - commit_time = int(datetime.datetime.now( - tz=datetime.timezone.utc).timestamp() - ) - - swh_person = { - 'name': 'Software Heritage', - 'fullname': 'Software Heritage', - 'email': 'robot@softwareheritage.org' - } + commit_time = int(datetime.datetime( + 2018, 12, 5, 13, 35, 23, 0, + tzinfo=datetime.timezone(datetime.timedelta(hours=1)) + ).timestamp()) revision_message = 'swh-loader-tar: synthetic revision message' revision_type = 'tar' revision = { 'date': { 'timestamp': commit_time, 'offset': 0, }, 'committer_date': { 'timestamp': commit_time, 'offset': 0, }, - 'author': swh_person, - 'committer': swh_person, + 'author': SWH_PERSON, + 'committer': SWH_PERSON, 'type': revision_type, 'message': revision_message, 
'synthetic': True, } - branch_name = os.path.basename(tarpath) + branch_name = os.path.basename(self.tarpath) # when - self.loader.load(tar_path=tarpath, origin=origin, + self.loader.load(tar_path=self.tarpath, origin=origin, visit_date=visit_date, revision=revision, branch_name=branch_name) # then - self.assertEquals(len(self.loader.all_contents), 8, - "8 contents: 3 files + 5 links") - self.assertEquals(len(self.loader.all_directories), 6, - "6 directories: 4 subdirs + 1 empty + 1 main dir") - self.assertEquals(len(self.loader.all_revisions), 1, - "synthetic revision") - - actual_revision = self.loader.all_revisions[0] - self.assertTrue(actual_revision['synthetic']) - self.assertEquals(actual_revision['parents'], - []) - self.assertEquals(actual_revision['type'], - 'tar') - self.assertEquals(actual_revision['message'], - b'swh-loader-tar: synthetic revision message') - self.assertEquals(actual_revision['directory'], - b'\xa7A\xfcM\x96\x8c{\x8e<\x94\xff\x86\xe7\x04\x80\xc5\xc7\xe5r\xa9') # noqa - - self.assertEquals( - actual_revision['metadata']['original_artifact'][0], - { - 'sha1_git': 'cc848944a0d3e71d287027347e25467e61b07428', - 'archive_type': 'tar', - 'blake2s256': '5d70923443ad36377cd58e993aff0e3c1b9ef14f796c69569105d3a99c64f075', # noqa - 'name': 'sample-folder.tgz', - 'sha1': '3ca0d0a5c6833113bd532dc5c99d9648d618f65a', - 'length': 555, - 'sha256': '307ebda0071ca5975f618e192c8417161e19b6c8bf581a26061b76dc8e85321d' # noqa - }) - - self.assertEquals(len(self.loader.all_releases), 0) - self.assertEquals(len(self.loader.all_snapshots), 1) + self.assert_data_ok() diff --git a/swh/loader/tar/tests/test_tasks.py b/swh/loader/tar/tests/test_tasks.py new file mode 100644 index 0000000..3e5daac --- /dev/null +++ b/swh/loader/tar/tests/test_tasks.py @@ -0,0 +1,27 @@ +# Copyright (C) 2015-2018 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version 
+# See top-level LICENSE file for more information + +from unittest.mock import patch + + +@patch('swh.loader.tar.loader.RemoteTarLoader.load') +def test_tar_loader_task(mock_loader, swh_app, celery_session_worker): + mock_loader.return_value = {'status': 'eventful'} + + res = swh_app.send_task( + 'swh.loader.tar.tasks.LoadTarRepository', + ('origin', 'visit_date', 'last_modified')) + assert res + res.wait() + assert res.successful() + + # given + actual_result = res.result + + assert actual_result == {'status': 'eventful'} + + mock_loader.assert_called_once_with( + origin='origin', visit_date='visit_date', + last_modified='last_modified') diff --git a/swh/loader/tar/tests/test_utils.py b/swh/loader/tar/tests/test_utils.py index 05b43fc..2b965e9 100644 --- a/swh/loader/tar/tests/test_utils.py +++ b/swh/loader/tar/tests/test_utils.py @@ -1,45 +1,43 @@ -# Copyright (C) 2015-2017 The Software Heritage developers +# Copyright (C) 2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information +import random import unittest -from nose.tools import istest - from swh.loader.tar import utils -class TestUtils(unittest.TestCase): - @istest - def convert_to_hex(self): +class UtilsLib(unittest.TestCase): + + def assert_ok(self, actual_data, expected_data): + """Check that actual_data and expected_data matched. + + Actual data is a random block of data. We want to check its + contents match exactly but not the order within. 
+ + """ + out = [] + random.shuffle(expected_data) + for d in actual_data: + self.assertIn(d, expected_data) + out.append(d) + self.assertEqual(len(out), len(expected_data)) + + def test_random_block(self): + _input = list(range(0, 9)) + # given + actual_data = utils.random_blocks(_input, 2) + self.assert_ok(actual_data, expected_data=_input) + + def test_random_block2(self): + _input = list(range(9, 0, -1)) # given - input_dict = { - 'sha1_git': b'\xf6\xb7 \x8b+\xcd \x9fq5E\xe6\x03\xffg\x87\xd7\xb9D\xa1', # noqa - 'sha1': b'\xf4O\xf0\xd4\xc0\xb0\xae\xca\xe4C\xab%\x10\xf7\x12h\x1e\x9f\xac\xeb', # noqa - 'sha256': b'\xa8\xf9=\xf3\xfek\xa2$\xee\xc7\x1b\xc2\x83\xca\x96\xae8\xaf&\xab\x08\xfa\xb1\x13\xec(.s]\xf6Yb', # noqa - 'length': 10, - } # noqa - - expected_dict = {'sha1_git': 'f6b7208b2bcd209f713545e603ff6' - '787d7b944a1', - 'sha1': 'f44ff0d4c0b0aecae443ab2510f712681e' - '9faceb', - 'sha256': 'a8f93df3fe6ba224eec71bc283ca96ae3' - '8af26ab08fab113ec282e735df65962', - 'length': 10} - - # when - actual_dict = utils.convert_to_hex(input_dict) - - # then - self.assertDictEqual(actual_dict, expected_dict) - - @istest - def convert_to_hex_edge_cases(self): - # when - actual_dict = utils.convert_to_hex({}) - # then - self.assertDictEqual(actual_dict, {}) - - self.assertIsNone(utils.convert_to_hex(None)) + actual_data = utils.random_blocks(_input, 4) + self.assert_ok(actual_data, expected_data=_input) + + def test_random_block_with_fillvalue(self): + _input = [(i, i+1) for i in range(0, 9)] + actual_data = utils.random_blocks(_input, 2) + self.assert_ok(actual_data, expected_data=_input) diff --git a/swh/loader/tar/utils.py b/swh/loader/tar/utils.py index b728b0a..2b46989 100644 --- a/swh/loader/tar/utils.py +++ b/swh/loader/tar/utils.py @@ -1,74 +1,35 @@ -# Copyright (C) 2015-2017 The Software Heritage developers +# Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public 
License version 3, or any later version # See top-level LICENSE file for more information -import itertools import random -from swh.model import hashutil +from swh.core.utils import grouper -def convert_to_hex(d): - """Convert a flat dictionary with bytes in values to the same dictionary - with hex as values. - - Args: - dict: flat dictionary with sha bytes in their values. - - Returns: - Mirror dictionary with values as string hex. - - """ - if not d: - return d - - checksums = {} - for key, h in d.items(): - if isinstance(h, bytes): - checksums[key] = hashutil.hash_to_hex(h) - else: - checksums[key] = h - - return checksums - - -def grouper(iterable, n, fillvalue=None): - """Collect data into fixed-length chunks or blocks. - - Args: - iterable: an iterable - n: size of block - fillvalue: value to use for the last block - - Returns: - fixed-length chunks of blocks as iterables - - """ - args = [iter(iterable)] * n - return itertools.zip_longest(*args, fillvalue=fillvalue) +def random_blocks(iterable, block=100): + """Randomize iterable per block of size block. + Given an iterable: -def random_blocks(iterable, block=100, fillvalue=None): - """Given an iterable: - slice the iterable in data set of block-sized elements - - randomized the data set - - yield each element + - randomized the block-sized elements + - yield each element of that randomized block-sized + - continue onto the next block-sized block Args: - iterable: iterable of data - block: number of elements per block - fillvalue: a fillvalue for the last block if not enough values in - last block + iterable (Iterable): an iterable + block (int): number of elements per block - Returns: - An iterable of randomized per block-size elements. 
+    Yields:
+        random element of the iterable
 
     """
     count = 0
-    for iterable in grouper(iterable, block, fillvalue=fillvalue):
+    for iter_ in grouper(iterable, block):
         count += 1
-        lst = list(iterable)
+        lst = list(iter_)
         random.shuffle(lst)
         for e in lst:
             yield e
diff --git a/version.txt b/version.txt
index 6da7cfa..975b284 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.35-0-gd4bd5e1
\ No newline at end of file
+v0.0.38-0-g3988b48
\ No newline at end of file
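Note on the `random_blocks` change in `swh/loader/tar/utils.py`: the rewritten helper shuffles elements only within each block of `block` items, so every element is still yielded exactly once and blocks keep their relative order. The sketch below illustrates that behaviour in isolation. It is not the code from this patch: the local `grouper` is a hypothetical stand-in for `swh.core.utils.grouper` and is assumed to yield unpadded chunks, which the diff itself does not show.

```python
import random
from itertools import islice


def grouper(iterable, n):
    """Yield successive lists of at most n items.

    Hypothetical stand-in for swh.core.utils.grouper; the real helper's
    exact behaviour is an assumption here.
    """
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def random_blocks(iterable, block=100):
    """Yield every element of iterable, shuffled within each block."""
    for chunk in grouper(iterable, block):
        random.shuffle(chunk)  # randomize order inside this block only
        yield from chunk


# Nine elements in blocks of two: the last block holds a single element.
result = list(random_blocks(range(9), 2))
print(sorted(result) == list(range(9)))
```

Because shuffling is confined to a block, the first two yielded values are always some ordering of `0` and `1`, which is the property the new `assert_ok` test helper checks only loosely (membership and total length).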